
Browsing by Author "Baohua GU"

An Empirical Study of Fitting Learning Curves
(2001-04-01) Baohua GU; Feifang HU; Huan LIU
It is well known that many learning algorithms show diminishing returns as training data size increases. This paper empirically studies the fitting of learning curves on large data sets in search of a principled stopping criterion, which is particularly useful when the data size is huge, as in most data mining applications. Learning curves are obtained by running the decision tree algorithm C4.5 and the logistic discrimination algorithm LOG on eight large UCI data sets, then fitted with six competing models. The models are compared and ranked on how well they fit full-size learning curves and on how well curves fitted to early portions (small data sizes) predict the late portion. Overall, the three-parameter power law is found to be close to the best in fitting and the best in predicting. Although the fit ranking of the models is almost consistent across all eight data sets for both algorithms, the prediction ranking varies more for LOG than for C4.5, both across the data sets and across the amount of data used in fitting. These findings can be applied to effective data mining with large data.
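
The abstract does not give the paper's exact parameterization, so the sketch below assumes the common three-parameter power-law form acc(n) = a - b * n^(-c) and uses scipy.optimize.curve_fit to fit the early portion of a hypothetical learning curve and predict the late portion, mirroring the evaluation described above.

```python
# A minimal sketch, assuming the common three-parameter power-law form
# acc(n) = a - b * n**(-c); the abstract does not state the paper's exact
# parameterization, and the data points below are hypothetical.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b, c):
    """Learning-curve model: accuracy approaches asymptote a as n grows."""
    return a - b * np.power(n, -c)

# Hypothetical learning-curve points: (training size, accuracy).
sizes = np.array([100, 200, 400, 800, 1600, 3200, 6400], dtype=float)
acc = np.array([0.71, 0.76, 0.80, 0.83, 0.85, 0.86, 0.865])

# Fit on the early portion, then predict the late portion.
early = slice(0, 5)
params, _ = curve_fit(power_law, sizes[early], acc[early],
                      p0=(0.9, 1.0, 0.5), maxfev=10000)
a, b, c = params
print(f"fitted: a={a:.3f}, b={b:.3f}, c={c:.3f}")
print("predicted late accuracies:", power_law(sizes[5:], *params))
# Since a estimates the asymptotic accuracy, a stopping criterion can halt
# data collection once power_law(n) is within a chosen margin of a.
```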

Finding Statistically Optimal Sample Size By Measuring Sample Quality
(2000-11-01) Baohua GU; Feifang HU; Huan LIU
Sampling is a useful means of handling large data sets in data mining. Given a data set and a learning algorithm, an Optimal Sample Size (OSS) can be found approximately by progressively increasing the sample size until learning accuracy no longer improves. However, if the starting sample size is far from the OSS, the extra computation done before reaching it can be very expensive, because the learning algorithm must be run repeatedly on a sequence of samples. Starting directly from a size close to the OSS therefore greatly reduces computational cost. In this paper, we attempt to find such a size without running a learning algorithm, via a statistical approach. We call this size the Statistically Optimal Sample Size (SOSS), in the sense that a sample of this size statistically "resembles" its mother data sufficiently well. We define an information-based measure of sample quality that can be calculated efficiently in a single scan of a data set, and we show that learning on a sample of SOSS produces accuracy very close to that obtained on a sample of OSS. We present an efficient algorithm that computes a "quality curve" of sample quality with respect to sample size, from which a SOSS is determined by detecting convergence. Experiments on artificial and UCI data sets show that the learning accuracy obtained on a sample of SOSS is close to that on a sample of OSS, as well as to that on the whole data set. We conclude that SOSS is an effective measure for determining sample size: it can find the OSS at much lower computational cost.
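
The abstract does not specify the paper's information-based quality measure, so the sketch below substitutes an illustrative one: the average negative Jensen-Shannon divergence between a sample's per-attribute value distributions and those of the full data. The convergence test, tolerance, and all names here are assumptions, not the paper's algorithm.

```python
# A minimal sketch of a quality curve and SOSS detection, using an
# ILLUSTRATIVE quality measure (average negative Jensen-Shannon divergence
# between sample and full-data attribute distributions); the paper's actual
# information-based measure is not given in this abstract.
import numpy as np

def value_distribution(column, values):
    counts = np.array([(column == v).sum() for v in values], dtype=float)
    return counts / counts.sum()

def js_divergence(p, q, eps=1e-12):
    p, q = p + eps, q + eps
    m = 0.5 * (p + q)
    kl = lambda x, y: np.sum(x * np.log(x / y))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def quality(sample, data, values_per_attr):
    """Higher (closer to 0) means the sample better resembles the data."""
    return -np.mean([
        js_divergence(value_distribution(sample[:, j], values_per_attr[j]),
                      value_distribution(data[:, j], values_per_attr[j]))
        for j in range(data.shape[1])
    ])

def find_soss(data, values_per_attr, sizes, tol=1e-4, seed=0):
    """Trace the quality curve over growing sample sizes; declare SOSS
    where the gain in quality falls below tol (a crude convergence test;
    a real detector would smooth the curve first)."""
    rng = np.random.default_rng(seed)
    prev_q = -np.inf
    for n in sizes:
        sample = data[rng.choice(len(data), size=n, replace=False)]
        q = quality(sample, data, values_per_attr)
        if q - prev_q < tol:
            return n, q
        prev_q = q
    return sizes[-1], prev_q

# Hypothetical categorical data: 10,000 rows, 4 attributes, 3 values each.
data = np.random.default_rng(1).integers(0, 3, size=(10_000, 4))
values = [np.arange(3)] * 4
n, q = find_soss(data, values, sizes=[250, 500, 1000, 2000, 4000, 8000])
print(f"SOSS ~ {n} rows (quality {q:.5f})")
```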

Sampling and Its Application in Data Mining: A Survey
(2000-06-01) Baohua GU; Feifang HU; Huan LIU
Large data sets are becoming obstacles to efficient data mining. Sampling, a well-established technique in statistics, can play a role in overcoming these obstacles. The statistics community has provided plenty of sampling strategies that are generally believed to be applicable in data mining as well. However, since the data mining community has different starting points and requirements from the statistics community, some of these strategies may need to be re-examined when applied to data mining, and it is also desirable to invent novel strategies for specific data mining tasks on specific data. This paper summarizes the basic ideas and general considerations of sampling and categorizes the sampling strategies that exist in statistics, so as to identify potentially useful strategies for data mining. The state-of-the-art ways of applying sampling in data mining are then reviewed. By analyzing the strategies used in different data mining tasks and relating them to their precedents in statistics, we show how traditional strategies are applied, directly or indirectly. We discuss general considerations and research issues of sampling in data mining, and we show that these issues are either not usually considered in statistics or not yet well studied, but are essential to data mining. We believe extensive study of sampling will contribute more to data mining, especially when dealing with large data sets.
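
To make the categorization concrete, here is a minimal sketch, with assumed data and names, of two classical strategies from the statistics literature that a survey like this covers: simple random sampling and stratified sampling with proportional allocation.

```python
# A minimal sketch of two classical sampling strategies; the data, names,
# and class proportions below are illustrative assumptions only.
import numpy as np

def simple_random_sample(data, n, rng):
    """Every subset of size n is equally likely to be drawn."""
    return data[rng.choice(len(data), size=n, replace=False)]

def stratified_sample(data, labels, n, rng):
    """Draw from each class (stratum) in proportion to its share."""
    parts = []
    for label in np.unique(labels):
        idx = np.flatnonzero(labels == label)
        k = max(1, round(n * len(idx) / len(data)))
        parts.append(data[rng.choice(idx, size=k, replace=False)])
    return np.concatenate(parts)

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))                    # hypothetical features
y = rng.choice([0, 1], size=10_000, p=[0.9, 0.1])   # imbalanced labels
srs = simple_random_sample(X, 500, rng)
strat = stratified_sample(X, y, 500, rng)
# Stratification guarantees the rare class keeps its share of the sample,
# which simple random sampling achieves only in expectation.
```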
