Browsing by Author "Huan LIU"
Now showing 1 - 5 of 5
- Discretization: An Enabling Technique (1999-06-01)
  Farhad HUSSAIN; Huan LIU; Chew Lim TAN; Manoranjan DASH
  Discrete values play important roles in data mining and knowledge discovery. They represent intervals of numbers, which are more concise to represent and specify, and easier to use and comprehend, as they are closer to a knowledge-level representation than continuous values. Many studies show that induction tasks can benefit from discretization: rules with discrete values are normally shorter and more understandable, and discretization can lead to improved predictive accuracy. Furthermore, many induction algorithms found in the literature require discrete features. All of this prompts researchers and practitioners to discretize continuous features before or during a machine learning or data mining task. Numerous discretization methods are available in the literature. It is time to examine these seemingly different methods and find out how different they really are, what the key components of a discretization process are, and how we can improve the current level of research, both for new development and for the use of existing methods. This paper aims at a systematic study of discretization methods, covering their history of development, their effect on classification, and the trade-off between speed and accuracy. The contributions of this paper are: an abstract description summarizing existing discretization methods; a hierarchical framework that categorizes the existing methods and paves the way for further development; concise discussions of representative discretization methods; extensive experiments and their analysis; and guidelines on how to choose a discretization method under various circumstances. We also identify open issues and directions for future research on discretization.
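  As context for the abstract above, here is a minimal sketch of two classic unsupervised discretization methods, equal-width and equal-frequency binning; the function names and the toy data are illustrative, not from the paper.

  ```python
  # Illustrative sketch of two basic unsupervised discretization methods.

  def equal_width_bins(values, k):
      """Split the value range into k intervals of equal width; return cut points."""
      lo, hi = min(values), max(values)
      width = (hi - lo) / k
      return [lo + i * width for i in range(1, k)]

  def equal_freq_bins(values, k):
      """Choose cut points so each interval holds roughly the same number of values."""
      ordered = sorted(values)
      n = len(ordered)
      return [ordered[i * n // k] for i in range(1, k)]

  def discretize(value, cuts):
      """Map a continuous value to the index of the interval it falls into."""
      return sum(value >= c for c in cuts)

  # Example: discretize a small list of ages into 4 equal-width intervals.
  ages = [18, 21, 23, 30, 34, 41, 47, 55, 62, 70]
  cuts = equal_width_bins(ages, 4)              # three cut points over [18, 70]
  labels = [discretize(a, cuts) for a in ages]  # interval index per age
  ```

  Supervised methods discussed in the literature (e.g. entropy-based splitting) choose cut points using class labels instead of value positions alone; the two functions above are the simplest unsupervised baselines.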
- An Empirical Study of Fitting Learning Curves (2001-04-01)
  Baohua GU; Feifang HU; Huan LIU
  It is well known that many learning algorithms show diminishing returns as training data size increases. This paper empirically studies the fitting of learning curves of large data sets in search of a principled stopping criterion. Such a criterion is particularly useful when the data size is huge, as in most data mining applications. Learning curves are obtained by running the decision tree algorithm C4.5 and the logistic discrimination algorithm LOG on eight large UCI data sets, and are then fitted with six competing models. These models are compared and ranked by their performance both in fitting full-size learning curves and in predicting the late portion of a curve from a fit to its early portion (i.e., from small data sizes). In our experiments, the three-parameter power law is overall close to the best in fitting and the best in predicting. We also find that although the fit ranking of these models is almost consistent across all eight data sets and both algorithms, the prediction ranking varies more for LOG than for C4.5, both over the eight data sets and over the amount of data used in fitting. These findings can be applied to effective data mining with large data.
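  A common parameterization of a three-parameter power law for a learning curve is acc(n) = a - b * n^(-c); assuming that form (the paper's exact parameterization may differ), here is a small fitting sketch. Since the model is linear in a and b once c is fixed, we grid-search c and solve a, b by closed-form least squares.

  ```python
  # Fit acc(n) = a - b * n**(-c) to (size, accuracy) pairs.
  # Illustrative sketch; synthetic data below is generated, not from the paper.

  def fit_power_law(sizes, accs, c_grid=None):
      if c_grid is None:
          c_grid = [i / 100 for i in range(1, 201)]  # c in (0, 2]
      best = None
      for c in c_grid:
          xs = [n ** (-c) for n in sizes]
          mx = sum(xs) / len(xs)
          my = sum(accs) / len(accs)
          sxx = sum((x - mx) ** 2 for x in xs)
          sxy = sum((x - mx) * (y - my) for x, y in zip(xs, accs))
          slope = sxy / sxx            # the model is y = a + slope * x, slope = -b
          a = my - slope * mx
          b = -slope
          sse = sum((a - b * x - y) ** 2 for x, y in zip(xs, accs))
          if best is None or sse < best[0]:
              best = (sse, a, b, c)
      return best[1], best[2], best[3]

  # Synthetic noiseless curve generated from a = 0.9, b = 0.8, c = 0.5;
  # the fit should recover these parameters.
  sizes = [100, 200, 400, 800, 1600, 3200]
  accs = [0.9 - 0.8 * n ** -0.5 for n in sizes]
  a, b, c = fit_power_law(sizes, accs)
  ```

  A stopping criterion can then be read off the fitted curve: stop adding data once the predicted accuracy gain per additional example falls below a chosen threshold.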
- Finding Statistically Optimal Sample Size By Measuring Sample Quality (2000-11-01)
  Baohua GU; Feifang HU; Huan LIU
  Sampling is a useful means of handling large data sets in data mining. Given a data set and a learning algorithm, an Optimal Sample Size (OSS) can be found approximately by progressively increasing the sample size until the learning accuracy no longer increases. However, if the starting sample size is too far from the OSS, the extra computation performed before arriving at the OSS can be very expensive, because the learning algorithm is run repeatedly on a sequence of samples. Starting directly from a size close to the OSS would therefore greatly reduce the computational cost. In this paper, we attempt to find such a size not with a learning algorithm but via a statistical approach. We name this size the Statistically Optimal Sample Size (SOSS), in the sense that a sample of this size statistically ``resembles'' its mother data sufficiently well. We define an information-based measure of sample quality that can be calculated efficiently in a single scan of the data set. We show that learning on a sample of SOSS produces accuracy very close to that obtained on a sample of OSS. We present an efficient algorithm that calculates a ``quality curve'' of sample quality with respect to sample size, from which a SOSS is determined by detecting convergence. Experiments on artificial and UCI data sets show that the learning accuracy obtained on a sample of SOSS is fairly close to that on a sample of OSS, as well as to that on the whole data set. We conclude that SOSS is an effective measure for determining sample size, in the sense that it can find the OSS at much lower computational cost.
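  The following sketch illustrates the overall scheme: measure how well a sample's value distribution resembles the full data's, and grow the sample until that quality curve flattens. Total variation distance is used here as an illustrative stand-in for the paper's information-based measure, which may be defined differently; all names are ours.

  ```python
  import random

  # Sample quality as 1 - total variation distance between the sample's
  # categorical value distribution and the full data's (illustrative measure).

  def histogram(values):
      counts = {}
      for v in values:
          counts[v] = counts.get(v, 0) + 1
      total = len(values)
      return {v: c / total for v, c in counts.items()}

  def quality(sample, full_hist):
      """1.0 means the sample's distribution matches the full data exactly."""
      samp_hist = histogram(sample)
      keys = set(full_hist) | set(samp_hist)
      tvd = 0.5 * sum(abs(full_hist.get(k, 0) - samp_hist.get(k, 0)) for k in keys)
      return 1.0 - tvd

  def find_soss(data, sizes, eps=0.005):
      """Grow the sample size until the quality curve converges (change < eps)."""
      full_hist = histogram(data)  # one scan of the full data
      prev_q = 0.0
      for n in sizes:
          q = quality(random.sample(data, n), full_hist)
          if abs(q - prev_q) < eps:
              return n, q
          prev_q = q
      return sizes[-1], prev_q

  random.seed(0)
  data = [random.choice("abcde") for _ in range(20000)]
  soss, q = find_soss(data, sizes=[100, 200, 400, 800, 1600, 3200, 6400])
  ```

  The key property carried over from the paper is that the quality measure needs no learning algorithm: it is computed from counts gathered in a single scan, so probing many candidate sizes is cheap compared to repeated training runs.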
- From Incremental Learning to Model Independent Instance Selection - A Support Vector Machine Approach (1999-09-01)
  Nadeem Ahmed SYED; Huan LIU; Kah Kay SUNG
  With large amounts of data available to the machine learning community, the need to design techniques that scale well is more critical than ever. As some data may be collected over long periods, there is also a continuous need to incorporate new data into a previously learned concept. Incremental learning techniques can satisfy both needs: scalability and incremental update. In this paper, we categorize incremental techniques into two broad categories: block-by-block vs. instance-by-instance. We suggest three criteria for evaluating the robustness and reliability of incremental learning methods. We then propose an incremental learning method for Support Vector Machines and use the suggested criteria to evaluate the effectiveness of the proposed training method. Motivated by positive results in these experiments, we investigate the possibility of using SVMs in another approach to handling very large data sets: a study of whether Support Vector Machine (SVM) training can be used to select a small subset of examples from the training set in a model-independent way. We compare the results of SVM selection with the IB2 selection method and with random sampling, analyze the experimental results, and discuss their implications. All results are illustrated on standard machine learning benchmark data sets.
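  The block-by-block idea can be sketched as follows: train on a block, keep only the margin-defining examples (those with y * f(x) <= 1), and carry them into the next block. This is a loose sketch of the scheme, not the paper's algorithm; the linear classifier here is trained by Pegasos-style subgradient descent as a stand-in for a full SVM solver, and the toy data is ours.

  ```python
  import random

  def train_linear_svm(points, lam=0.01, epochs=200, seed=0):
      """Tiny linear SVM via subgradient descent on the hinge loss (2-D inputs)."""
      rng = random.Random(seed)
      w, b, t = [0.0, 0.0], 0.0, 0
      for _ in range(epochs):
          shuffled = points[:]
          rng.shuffle(shuffled)
          for (x1, x2), y in shuffled:
              t += 1
              eta = 1.0 / (lam * t)
              margin = y * (w[0] * x1 + w[1] * x2 + b)
              w[0] *= (1 - eta * lam)          # weight decay (regularization)
              w[1] *= (1 - eta * lam)
              if margin < 1:                   # hinge-loss violation: push toward y
                  w[0] += eta * y * x1
                  w[1] += eta * y * x2
                  b += eta * y
      return w, b

  def support_set(points, w, b):
      """Keep examples on or inside the margin; these summarize the block."""
      return [((x1, x2), y) for (x1, x2), y in points
              if y * (w[0] * x1 + w[1] * x2 + b) <= 1.0]

  def incremental_train(blocks):
      """Block-by-block: retrain on (carried support set + new block) each step."""
      carried, w, b = [], [0.0, 0.0], 0.0
      for block in blocks:
          data = carried + block
          w, b = train_linear_svm(data)
          carried = support_set(data, w, b)
      return w, b

  # Linearly separable toy data: class +1 near (2, 2), class -1 near (-2, -2).
  rng = random.Random(1)
  pos = [((2 + rng.random(), 2 + rng.random()), 1) for _ in range(40)]
  neg = [((-2 - rng.random(), -2 - rng.random()), -1) for _ in range(40)]
  data = pos + neg
  rng.shuffle(data)
  w, b = incremental_train([data[:40], data[40:]])
  acc = sum(1 for (x1, x2), y in data
            if y * (w[0] * x1 + w[1] * x2 + b) > 0) / len(data)
  ```

  The instance-selection use mentioned in the abstract follows the same intuition: the support set is a small subset that summarizes the decision-relevant part of the data, and can be handed to other learners.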
- Sampling and Its Application in Data Mining: A Survey (2000-06-01)
  Baohua GU; Feifang HU; Huan LIU
  Large data sets are becoming an obstacle to efficient data mining. Sampling, a well-established technique in statistics, can play a role in overcoming this obstacle. The statistics community has provided plenty of sampling strategies that are generally believed to be applicable to data mining as well. However, since the data mining community has different starting points and requirements from the statistics community, some of these strategies may need to be re-examined when applied to data mining, and it is also desirable to invent novel strategies for specific data mining tasks on specific data. This paper summarizes the basic ideas and general considerations of sampling and categorizes the sampling strategies existing in statistics, so as to identify potentially useful sampling strategies for data mining. The state-of-the-art ways of applying sampling in data mining are then reviewed. By analyzing the strategies used in different data mining tasks and relating them to their precedents in statistics, we show how traditional strategies are directly or indirectly applied. We discuss general considerations and research issues of sampling in data mining, and show that these issues are either not usually considered in statistics or not yet well studied, yet are essential to data mining. We believe extensive studies of sampling will contribute greatly to data mining, especially when dealing with large data sets.
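  Two of the basic strategies from statistics that carry over directly to data mining can be sketched as follows: simple random sampling and proportional stratified sampling by class label. The data and names are illustrative; stratification matters in data mining because skewed class distributions are common, and plain random sampling can under-represent a minority class.

  ```python
  import random

  def simple_random_sample(records, n, rng):
      """Draw n records uniformly at random without replacement."""
      return rng.sample(records, n)

  def stratified_sample(records, n, key, rng):
      """Draw from each stratum in proportion to its share of the data."""
      strata = {}
      for r in records:
          strata.setdefault(key(r), []).append(r)
      total = len(records)
      sample = []
      for members in strata.values():
          quota = round(n * len(members) / total)
          sample.extend(rng.sample(members, min(quota, len(members))))
      return sample

  rng = random.Random(42)
  # Skewed data: 90% of records labeled "common", 10% labeled "rare".
  records = [("common", i) for i in range(900)] + [("rare", i) for i in range(100)]
  strat = stratified_sample(records, 50, key=lambda r: r[0], rng=rng)
  rare_count = sum(1 for label, _ in strat if label == "rare")  # quota: 5 of 50
  ```

  With proportional quotas the rare class is guaranteed its share of the sample, whereas with simple random sampling its count fluctuates and can be zero.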