Cleansing Data for Mining and Warehousing

dc.contributor.author: Mong Li LEE [en_US]
dc.contributor.author: Hongjun LU [en_US]
dc.contributor.author: Tok Wang LING [en_US]
dc.contributor.author: Yee Teng KO [en_US]
dc.date.accessioned: 2004-10-21T14:28:52Z [en_US]
dc.date.accessioned: 2017-01-23T06:59:51Z
dc.date.available: 2004-10-21T14:28:52Z [en_US]
dc.date.available: 2017-01-23T06:59:51Z
dc.date.issued: 1999-06-01T00:00:00Z [en_US]
dc.description.abstract: Given the rapid growth of data, it is increasingly important to extract, mine and discover useful information from databases and data warehouses. The process of data cleansing or data scrubbing is crucial because of the "garbage in, garbage out" principle. However, "dirty" data files are prevalent as a result of incorrect or missing data values due to data entry errors or typographical errors, inconsistent value naming conventions due to different field entry formats and the use of abbreviations, and incomplete information due to unavailability of data. Hence, we may have multiple records referring to the same real-world entity. This not only contributes to the problem of handling ever-increasing amounts of data, but also leads to the mining of inconsistent or inaccurate information, which is undesirable. In this paper, we examine the problem of detecting and removing duplicate records. We present several efficient techniques to pre-process the records before sorting them, so that potentially matching records end up in a close neighbourhood after sorting. These techniques include scrubbing data fields using external source files to remove typographical errors and abbreviations, tokenizing data fields, and then sorting the tokens within the data fields. We also propose a method to determine the similarity between two records. Based on these techniques, we implement a data cleansing system which is able to detect and remove more duplicate records than existing methods. [en_US]
dc.format.extent: 251580 bytes [en_US]
dc.format.extent: 1569380 bytes [en_US]
dc.format.mimetype: application/pdf [en_US]
dc.format.mimetype: application/postscript [en_US]
dc.identifier.uri: https://dl.comp.nus.edu.sg/xmlui/handle/1900.100/1399 [en_US]
dc.language.iso: en [en_US]
dc.relation.ispartofseries: TRA6/99 [en_US]
dc.title: Cleansing Data for Mining and Warehousing [en_US]
dc.type: Technical Report [en_US]
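
The pre-processing steps named in the abstract (scrubbing against an external lookup, tokenizing fields, sorting the tokens, then comparing records that land near each other after sorting) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the report: the `ABBREVIATIONS` table stands in for the external source files, Jaccard token overlap stands in for the report's own similarity method (which the abstract does not specify), and the `window` and `threshold` values are arbitrary.

```python
import re

# Hypothetical lookup table standing in for the "external source files".
ABBREVIATIONS = {"st": "street", "rd": "road", "ave": "avenue"}


def scrub(field):
    """Lowercase a field, drop punctuation, and expand known abbreviations."""
    tokens = re.findall(r"[a-z0-9]+", field.lower())
    return [ABBREVIATIONS.get(t, t) for t in tokens]


def sort_key(record):
    """Tokenize each field and sort the tokens within it, so that
    'Lee Mong Li' and 'Mong Li Lee' yield the same key."""
    return tuple(" ".join(sorted(scrub(f))) for f in record)


def similarity(a, b):
    """Jaccard overlap of scrubbed tokens over all fields.
    This measure is an assumption, not the report's method."""
    ta = {t for f in a for t in scrub(f)}
    tb = {t for f in b for t in scrub(f)}
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def find_duplicates(records, window=3, threshold=0.8):
    """Sort records by their pre-processed key, then compare each record
    only with the next (window - 1) records in sorted order."""
    ordered = sorted(records, key=sort_key)
    pairs = []
    for i, rec in enumerate(ordered):
        for other in ordered[i + 1 : i + window]:
            if similarity(rec, other) >= threshold:
                pairs.append((rec, other))
    return pairs
```

Calling `find_duplicates` on a list of tuples such as `("Lee Mong Li", "12 Kent Ridge Rd.")` and `("Mong Li Lee", "12 Kent Ridge Road")` pairs the two, because scrubbing and token sorting give both records the same sort key and the same token set.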
Files
Original bundle
report.ps (1.5 MB, Postscript Files)
report.pdf (245.68 KB, Adobe Portable Document Format)
License bundle
license.txt (1.52 KB, Plain Text)