Cleansing Data for Mining and Warehousing

Mong Li LEE; Hongjun LU; Tok Wang LING; Yee Teng KO

dc.contributor.author	Mong Li LEE	en_US
dc.contributor.author	Hongjun LU	en_US
dc.contributor.author	Tok Wang LING	en_US
dc.contributor.author	Yee Teng KO	en_US
dc.date.accessioned	2004-10-21T14:28:52Z	en_US
dc.date.accessioned	2017-01-23T06:59:51Z
dc.date.available	2004-10-21T14:28:52Z	en_US
dc.date.available	2017-01-23T06:59:51Z
dc.date.issued	1999-06-01T00:00:00Z	en_US
dc.description.abstract	Given the rapid growth of data, it is increasingly important to extract, mine and discover useful information from databases and data warehouses. The process of data cleansing or data scrubbing is crucial because of the "garbage in, garbage out" principle. However, "dirty" data files are prevalent as a result of incorrect or missing data values due to data entry errors or typographical errors, inconsistent value naming conventions due to different field entry formats and the use of abbreviations, and incomplete information due to unavailability of data. Hence, we may have multiple records referring to the same real-world entity. This not only contributes to the problem of handling an ever-increasing amount of data, but also leads to the mining of inconsistent or inaccurate information, which is undesirable. In this paper, we examine the problem of detecting and removing duplicate records. We present several efficient techniques to pre-process the records before sorting them, so that potentially matching records are brought into a close neighbourhood. These techniques include scrubbing data fields using external source files to remove typographical errors and abbreviations, tokenizing data fields, and then sorting the tokens in the data fields. We also propose a method to determine the similarity between two records. Based on these techniques, we implement a data cleansing system which is able to detect and remove more duplicate records than existing methods.	en_US
dc.format.extent	251580 bytes	en_US
dc.format.extent	1569380 bytes	en_US
dc.format.mimetype	application/pdf	en_US
dc.format.mimetype	application/postscript	en_US
dc.identifier.uri	https://dl.comp.nus.edu.sg/xmlui/handle/1900.100/1399	en_US
dc.language.iso	en	en_US
dc.relation.ispartofseries	TRA6/99	en_US
dc.title	Cleansing Data for Mining and Warehousing	en_US
dc.type	Technical Report	en_US
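
The pipeline outlined in the abstract (scrub fields, tokenize them, sort the tokens, sort the records, then compare close neighbours) can be illustrated with a minimal sketch. This is not the authors' implementation: the abbreviation table, the function names, the window size, the threshold, and the use of Python's `difflib.SequenceMatcher` as the similarity measure are all assumptions made for illustration, and the report's own similarity method may differ.

```python
from difflib import SequenceMatcher

# Hypothetical abbreviation table standing in for an "external source file".
ABBREVIATIONS = {"st": "street", "rd": "road", "dr": "doctor"}


def normalise_field(value):
    """Lower-case a field, drop punctuation, expand abbreviations, sort tokens."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in value.lower())
    tokens = [ABBREVIATIONS.get(tok, tok) for tok in cleaned.split()]
    return " ".join(sorted(tokens))


def record_key(record):
    """Build a sort key from the normalised fields of one record."""
    return " ".join(normalise_field(field) for field in record)


def similarity(a, b):
    """Assumed similarity measure: character-level ratio of the two keys."""
    return SequenceMatcher(None, record_key(a), record_key(b)).ratio()


def find_duplicates(records, window=3, threshold=0.85):
    """Sort records by key, then compare each one with its next window-1 neighbours."""
    order = sorted(range(len(records)), key=lambda i: record_key(records[i]))
    pairs = []
    for pos, i in enumerate(order):
        for j in order[pos + 1 : pos + window]:
            if similarity(records[i], records[j]) >= threshold:
                pairs.append((min(i, j), max(i, j)))
    return sorted(pairs)
```

Sorting the tokens inside each field means that "Smith, John" and "John Smith" produce the same key, so the two records end up adjacent after sorting and fall inside the same comparison window.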
