Browsing by Author "ZHANG, Zhenjie"
Now showing 1 - 3 of 3
Results Per Page
Sort Options
- ItemEfficient and Effective Similarity Search over Probabilistic Data based on Earth Mover's Distance(2010-05-03) XU, Jia; ZHANG, Zhenjie; TUNG, Anthony K.H.; YU, GeProbabilistic data is coming as a new deluge along with the technical advances on geographical tracking, multimedia processing, sensor network and RFID. While similarity search is an important functionality supporting the manipulation of probabilistic data, it raises new challenges to traditional relational database. The problem stems from the limited e®ectiveness of the distance metric supported by the existing database system. On the other hand, some complicated distance operators have been proved their values for better distinguishing ability in probabilistic domain. In this paper, we discuss the similarity search problem with the Earth Mover's Distance, which is the most successful distance metric on probabilistic histograms and an expensive operator with cubic complexity. We present a new database approach to answer range queries and k-nearest neighbor queries on probabilistic data, on the basis of Earth Mover's Distance. Our solution utilizes the primal-dual theory in linear programming and deploys B+ tree index structures for e®ective candidate pruning. Extensive experiments show that our proposal dramatically improves the scalability of probabilistic databases.
- ItemMinimizing the Communication Cost for Continuous Skyline Maintenance(2008-05-29) ZHANG, Zhenjie; CHENG, Reynold; PAPADIAS, Dimitris; TUNG, Anthony K.H.Numerous algorithms in the recent database literature deal with variants of skyline queries in different problem settings. However, the existing work focuses on optimizing the processing cost. This paper aims at minimization of the communication overhead in client-server architectures, where a server continuously maintains the skyline of dynamic objects. Our first contribution is a Filter method that avoids transmission of updates from objects that cannot influence the skyline. Specifically, each object is assigned a filter so that it needs to issue an update only if it violates its filter. The Filter method achieves significant savings over the naive approach of transmitting all updates. Going one step further, we introduce the concept of frequent skyline query over a sliding window (FSQW). The motivation is that snapshot skylines are not very useful in streaming environments because they keep changing over time. Instead, FSQW reports the objects that appear in the skylines of at least ? of the s most recent times- tamps. The Filter method can be easily adapted to FSQW processing, however, with potentially high overhead for large and frequently updated datasets. To further reduce the communication cost, we propose a Sampling method, which returns approximate FSQW results without computing each snapshot skyline. Finally, we integrate the Filter and Sampling methods in a Hybrid approach that combines their individual advantages. We evaluate our techniques with extensive experiments.
- ItemWhat is Unequal among the Equals? Ranking Equivalent Rules from Gene Expression Data(2009-06-01T08:08:33Z) CAI, Ruichu; TUNG, Anthony K. H.; ZHANG, Zhenjie; HAO, ZhifengIn previous studies, association rules have been proven to be useful in classification problems over high dimensional gene expression data. However, due to the nature of such datasets, it is often the case that millions of rules can be derived such that many of them are covered by exactly the same set of training tuples and thus have exactly the same support and confidence. Ranking and selecting useful rules from such equivalent rule groups remain an interesting and unexplored problem. In this paper, we look at two interestingness measures for ranking the interestingness of rules within equivalent rule group: Max-Subrule-Conf and Min-Subrule-Conf. Based on these interestingness measures, an incremental Apriori-like algorithm is designed to select more interesting rules from the lower bound rules of the group. Moreover, we present an improved classification model to fully exploit the potentials of the selected rules. Our empirical studies on our proposed methods over five gene expression datasets show that our proposals improve both the efficiency and effectiveness of the rule extraction and classifier construction over gene expression datasets.