Finding Statistically Optimal Sample Size By Measuring Sample Quality

Baohua GU; Feifang HU; Huan LIU

Files

report.ps (866.25 KB)

report.pdf (16.71 KB)

Date

2000-11-01T00:00:00Z

Authors

Baohua GU

Feifang HU

Huan LIU

Abstract

Sampling is a useful means of handling large data sets in data mining. Given a data set and a learning algorithm, an Optimal Sample Size (OSS) can be found approximately by progressively increasing the sample size until the learning accuracy no longer increases. However, if the starting sample size is too far from the OSS, the extra computation done before arriving at the OSS can be very expensive, because the learning algorithm is run repeatedly on a sequence of samples. Starting directly from a size close to the OSS would therefore greatly reduce the computational cost. In this paper, we attempt to find such a size without a learning algorithm, using a statistical approach instead. We name this size the Statistically Optimal Sample Size (SOSS), in the sense that a sample of this size statistically "resembles" its mother data sufficiently well. We define an information-based measure of sample quality that can be calculated efficiently in only one scan of a data set. We show that learning on a sample of size SOSS produces accuracy very close to that obtained on a sample of size OSS. We present an efficient algorithm that calculates a "quality curve" of sample quality with respect to sample size, from which we determine an SOSS by detecting convergence. Experiments on artificial data sets and UCI data sets show that the learning accuracy obtained on a sample of size SOSS is fairly close to that obtained on a sample of size OSS, as well as to that obtained on the whole data set. We conclude that SOSS is an effective means of determining sample size, in the sense that it can find the OSS at much lower computational cost.
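The quality-curve idea in the abstract can be illustrated with a minimal Python sketch. This is not the report's algorithm: the abstract does not specify the quality measure, so the sketch assumes a stand-in, exp(-mean KL divergence) between per-attribute value frequencies of the sample and of the full data, and it assumes categorical attributes. It also draws an independent random sample per size instead of working in one scan. The names value_distributions, sample_quality, quality_curve, and find_soss, and the parameters tol and window, are hypothetical.

```python
import math
import random
from collections import Counter


def value_distributions(rows):
    """Relative frequency of each value, per attribute (categorical data assumed)."""
    n = len(rows)
    return [
        {v: c / n for v, c in Counter(row[j] for row in rows).items()}
        for j in range(len(rows[0]))
    ]


def sample_quality(sample, full_dists):
    """Assumed stand-in measure: exp(-mean KL(sample || full)) over attributes.

    Every value seen in the sample also occurs in the full data, so the
    full-data frequency in the denominator is never zero.
    """
    total = 0.0
    for p, q in zip(value_distributions(sample), full_dists):
        total += sum(pv * math.log(pv / q[v]) for v, pv in p.items())
    return math.exp(-total / len(full_dists))


def quality_curve(rows, sizes, seed=0):
    """Quality of one random sample per size (not the one-scan method of the report)."""
    rng = random.Random(seed)
    full_dists = value_distributions(rows)
    return [(n, sample_quality(rng.sample(rows, n), full_dists)) for n in sizes]


def find_soss(curve, tol=0.01, window=3):
    """Smallest size whose quality stays within tol over the next `window` sizes.

    Falls back to the largest size if the curve never flattens.
    """
    for i in range(len(curve) - window):
        segment = [q for _, q in curve[i:i + window + 1]]
        if max(segment) - min(segment) < tol:
            return curve[i][0]
    return curve[-1][0]


if __name__ == "__main__":
    # Synthetic data set: 5000 rows with three categorical attributes.
    rng = random.Random(42)
    data = [(rng.choice("abcd"), rng.choice("xyz"), rng.choice("01"))
            for _ in range(5000)]
    curve = quality_curve(data, sizes=range(50, 2001, 50))
    print("estimated SOSS:", find_soss(curve))
```

The convergence test here is deliberately simple: it accepts the first size after which the curve is flat within tol. A sample of the returned size would then be the starting point for progressive sampling toward the OSS, in place of an arbitrary small starting size.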

URI

https://dl.comp.nus.edu.sg/xmlui/handle/1900.100/1412

Collections

Technical Reports