Sampling and Its Application in Data Mining: A Survey

Baohua GU; Feifang HU; Huan LIU

Files

report.ps (1.07 MB)

report.pdf (221.29 KB)

Date

2000-06-01T00:00:00Z

Authors

Baohua GU

Feifang HU

Huan LIU

Abstract

Large data sets are becoming an obstacle to efficient data mining. Sampling, a well-established technique in statistics, is expected to help overcome this obstacle. The statistics community has provided many sampling strategies that are generally believed to be applicable in data mining as well. However, because the data mining community has different starting points and requirements from the statistics community, some of these strategies may need to be reexamined when applied to data mining, and it is also desirable to devise novel strategies for specific data mining tasks on specific data. This paper summarizes the basic ideas and general considerations of sampling and categorizes the sampling strategies existing in statistics, so as to identify those potentially useful for data mining. It then reviews the state-of-the-art ways of applying sampling in data mining. By analyzing the strategies used in different data mining tasks and relating them to their precedents in statistics, we show how traditional strategies are directly or indirectly applied. We discuss general considerations and research issues of sampling in data mining, and show that these issues are either usually not considered in statistics or not yet well studied, yet are essential to data mining. We believe that extensive studies of sampling will contribute more to data mining, especially when dealing with large data sets.

URI

https://dl.comp.nus.edu.sg/xmlui/handle/1900.100/1408

Collections

Technical Reports
