Sampling and Its Application in Data Mining: A Survey

Date
2000-06-01T00:00:00Z
Abstract
Large data sets are becoming an obstacle to efficient data mining. Sampling, a well-established technique in statistics, is expected to help overcome this obstacle. The statistics community has provided many sampling strategies that are generally believed to be applicable to data mining as well. However, because the data mining community has different starting points and requirements from the statistics community, some of these strategies may need to be reexamined when applied to data mining, and it is also desirable to devise novel strategies for specific data mining tasks on specific data. This paper summarizes the basic ideas and general considerations of sampling and categorizes the sampling strategies that exist in statistics, in order to identify those potentially useful for data mining. It then reviews the state-of-the-art ways of applying sampling in data mining. By analyzing the strategies used in different data mining tasks and relating them to their precedents in statistics, we show how traditional strategies are applied, directly or indirectly. We discuss general considerations and research issues of sampling in data mining, and show that these issues are essential to data mining but are either usually not considered in statistics or not yet well studied. We believe that extensive studies of sampling will contribute more to data mining, especially when dealing with large data sets.