Finding Statistically Optimal Sample Size By Measuring Sample Quality

dc.contributor.author: Baohua GU [en_US]
dc.contributor.author: Feifang HU [en_US]
dc.contributor.author: Huan LIU [en_US]
dc.date.accessioned: 2004-10-21T14:28:52Z [en_US]
dc.date.accessioned: 2017-01-23T06:59:41Z
dc.date.available: 2004-10-21T14:28:52Z [en_US]
dc.date.available: 2017-01-23T06:59:41Z
dc.date.issued: 2000-11-01T00:00:00Z [en_US]
dc.description.abstract: Sampling is a useful means of handling large data sets in data mining. Given a data set and a learning algorithm, an Optimal Sample Size (OSS) can be approximately found by progressively increasing the sample size until the learning accuracy no longer increases. However, if the starting sample size is too far from the OSS, the extra computation done before arriving at the OSS can be very expensive, because the learning algorithm is run repeatedly on a sequence of samples. Starting directly from a size close to the OSS therefore greatly reduces computational cost. In this paper, we attempt to find such a size without a learning algorithm, via a statistical approach. We name this size the Statistically Optimal Sample Size (SOSS), in the sense that a sample of this size statistically sufficiently "resembles" its mother data. We define an information-based measure of sample quality that can be efficiently calculated in only one scan of a data set. We show that learning on a sample of SOSS produces accuracy very close to that obtained on a sample of OSS. We present an efficient algorithm that calculates a "quality curve" of sample quality with respect to sample size, based on which we determine a SOSS by detecting convergence. Experiments on artificial data sets and UCI data sets show that the learning accuracy obtained on a sample of SOSS is fairly close to that on a sample of OSS, as well as to that on the whole data set. We conclude that SOSS is an effective measure for determining sample size, in the sense that it can find the OSS at much lower computational cost. [en_US]
dc.format.extent: 17113 bytes [en_US]
dc.format.extent: 887039 bytes [en_US]
dc.format.mimetype: application/pdf [en_US]
dc.format.mimetype: application/postscript [en_US]
dc.identifier.uri: https://dl.comp.nus.edu.sg/xmlui/handle/1900.100/1412 [en_US]
dc.language.iso: en [en_US]
dc.relation.ispartofseries: TR11/00 [en_US]
dc.title: Finding Statistically Optimal Sample Size By Measuring Sample Quality [en_US]
dc.type: Technical Report [en_US]
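
The abstract describes computing a "quality curve" of sample quality against sample size and picking the SOSS where that curve converges. The record does not give the report's actual measure or algorithm, so the following is only a minimal sketch under stated assumptions: the quality score (the exponential of the negative mean Kullback-Leibler divergence between per-attribute value frequencies of the sample and of the full data), the convergence rule, and all function names and parameters (`sample_quality`, `quality_curve`, `detect_soss`, `tol`, `window`) are hypothetical stand-ins, not the authors' definitions.

```python
import math
import random
from collections import Counter


def value_distributions(rows):
    """Per-attribute relative frequencies of values, computed in one pass."""
    counts = [Counter() for _ in range(len(rows[0]))]
    for row in rows:
        for j, value in enumerate(row):
            counts[j][value] += 1
    total = len(rows)
    return [{v: c / total for v, c in cnt.items()} for cnt in counts]


def sample_quality(sample, full_dists):
    """Assumed score in (0, 1]: exp(-mean KL(sample || full)) over attributes.

    Every value seen in the sample also occurs in the full data, so the
    divergence is always finite.
    """
    kl_total = 0.0
    for s_dist, f_dist in zip(value_distributions(sample), full_dists):
        kl_total += sum(p * math.log(p / f_dist[v]) for v, p in s_dist.items())
    return math.exp(-kl_total / len(full_dists))


def quality_curve(rows, sizes, seed=0):
    """Quality of one random sample (without replacement) per candidate size."""
    rng = random.Random(seed)
    full_dists = value_distributions(rows)
    return [(n, sample_quality(rng.sample(rows, n), full_dists)) for n in sizes]


def detect_soss(curve, tol=0.01, window=3):
    """Assumed convergence rule: first size after which quality changes by
    less than `tol` for `window` consecutive steps; falls back to the
    largest size tried if the curve never settles."""
    for i in range(len(curve) - window):
        steps = range(i, i + window)
        if all(abs(curve[k + 1][1] - curve[k][1]) < tol for k in steps):
            return curve[i][0]
    return curve[-1][0]


if __name__ == "__main__":
    # Synthetic categorical data set with three attributes (illustrative only).
    data_rng = random.Random(42)
    data = [
        (data_rng.choice("abcd"), data_rng.choice("xyz"), data_rng.randint(0, 9))
        for _ in range(5000)
    ]
    curve = quality_curve(data, sizes=range(100, 2001, 100))
    print("estimated SOSS:", detect_soss(curve))
```

No learning algorithm is run at any point in this sketch; each point on the curve needs only frequency counts, which reflects the cost argument made in the abstract. The tolerance and window would need tuning for real data.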
Files
Original bundle
Name: report.ps
Size: 866.25 KB
Format: Postscript Files

Name: report.pdf
Size: 16.71 KB
Format: Adobe Portable Document Format

License bundle
Name: license.txt
Size: 1.52 KB
Format: Plain Text