Finding Statistically Optimal Sample Size By Measuring Sample Quality
Date
2000-11-01T00:00:00Z
Abstract
Sampling is a useful means of handling large data sets in data mining. Given a data set and a learning algorithm, an Optimal Sample Size (OSS) can be found approximately by progressively increasing the sample size until learning accuracy no longer increases. However, if the starting sample size is too far from the OSS, the extra computation done before arriving at the OSS can be very expensive, because the learning algorithm is run repeatedly on a sequence of samples. Starting directly from a size close to the OSS therefore greatly reduces computational cost. In this paper, we attempt to find such a size without a learning algorithm, using a statistical approach instead. We name this size the Statistically Optimal Sample Size (SOSS), in the sense that a sample of this size statistically sufficiently "resembles" its mother data. We define an information-based measure of sample quality that can be calculated efficiently in only one scan of a data set. We show that learning on a sample of SOSS produces accuracy very close to that obtained on a sample of OSS. We present an efficient algorithm that calculates a "quality curve" of sample quality with respect to sample size, from which we determine a SOSS by detecting convergence. Experiments on artificial data sets and UCI data sets show that the learning accuracy obtained on a sample of SOSS is fairly close to that obtained on a sample of OSS, as well as to that obtained on the whole data set. We conclude that SOSS is an effective means of determining sample size, in the sense that it can find the OSS at much lower computational cost.
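The quality-curve idea described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact quality measure or convergence test, so everything specific here is an assumption made for illustration: quality is taken to be exp(-D), where D is the mean Kullback-Leibler divergence between the sample's and the full data set's per-attribute value distributions; convergence is declared when the quality changes by less than `tol` for `patience` consecutive steps; and the names `find_soss`, `quality`, `start`, `step`, `tol`, and `patience` are hypothetical. The sketch handles categorical attributes only.

```python
import math
import random
from collections import Counter


def kl_divergence(sample_counts, full_counts, n_sample, n_full):
    """KL(sample || full) for one categorical attribute.

    Every value seen in the sample also occurs in the full data,
    so the full-data probability q is never zero.
    """
    total = 0.0
    for value, count in sample_counts.items():
        p = count / n_sample
        q = full_counts[value] / n_full
        total += p * math.log(p / q)
    return max(0.0, total)  # guard against tiny negative rounding error


def quality(sample_counts, full_counts, n_sample, n_full):
    """Assumed quality measure: exp(-mean per-attribute KL), in (0, 1]."""
    divs = [
        kl_divergence(sc, fc, n_sample, n_full)
        for sc, fc in zip(sample_counts, full_counts)
    ]
    return math.exp(-sum(divs) / len(divs))


def find_soss(data, start=50, step=50, tol=0.001, patience=3, seed=0):
    """Grow a random sample until the quality curve flattens.

    Returns (sample_size, curve), where curve is a list of
    (size, quality) points. The convergence rule is an assumption.
    """
    n_full = len(data)
    n_attrs = len(data[0])
    full_counts = [Counter(row[j] for row in data) for j in range(n_attrs)]

    # Prefixes of one shuffled copy act as nested random samples.
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)

    sample_counts = [Counter() for _ in range(n_attrs)]
    curve, stable, prev_size, size = [], 0, 0, start
    while size <= n_full:
        # Update counts incrementally with only the newly added rows.
        for row in shuffled[prev_size:size]:
            for j in range(n_attrs):
                sample_counts[j][row[j]] += 1
        q = quality(sample_counts, full_counts, size, n_full)
        if curve and abs(q - curve[-1][1]) < tol:
            stable += 1
        else:
            stable = 0
        curve.append((size, q))
        if stable >= patience:
            return size, curve
        prev_size, size = size, size + step
    return n_full, curve  # never converged: fall back to the whole data set


if __name__ == "__main__":
    rng = random.Random(1)
    rows = [(rng.choice("abc"), rng.randint(0, 4)) for _ in range(5000)]
    soss, points = find_soss(rows)
    print("estimated SOSS:", soss, "points on curve:", len(points))
```

Because the per-attribute counts are updated incrementally, each row is visited once as the sample grows, and no learning algorithm is run at any point. The thresholds would need tuning for real data.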