Browsing by Author "Mong Li LEE"
- Cleansing Data for Mining and Warehousing (1999-06-01)
  Mong Li LEE; Hongjun LU; Tok Wang LING; Yee Teng KO
  Given the rapid growth of data, it is increasingly important to extract, mine and discover useful information from databases and data warehouses. The process of data cleansing or data scrubbing is crucial because of the "garbage in, garbage out" principle. However, "dirty" data files are prevalent as a result of incorrect or missing data values due to data entry errors or typographical errors, inconsistent value naming conventions due to different field entry formats and use of abbreviations, and incomplete information due to unavailability of data. Hence, we may have multiple records referring to the same real-world entity. This not only contributes to the problem of handling ever-increasing amounts of data, but also leads to the mining of inconsistent or inaccurate information, which is obviously undesirable. In this paper, we examine the problem of detecting and removing duplicate records. We present several efficient techniques to pre-process the records before sorting them so that potentially matching records will subsequently be brought into a close neighbourhood. These techniques include scrubbing data fields using external source files to remove typographical errors and abbreviations, and tokenizing data fields and then sorting the tokens within them. We also propose a method to determine the similarity between two records. Based on these techniques, we implement a data cleansing system which is able to detect and remove more duplicate records than existing methods.
- ORA-SS: An Object-Relationship-Attribute Model for Semi-Structured Data (2000-12-01)
  Gillian DOBBIE; Xiaoying WU; Tok Wang LING; Mong Li LEE
  Semi-structured data is becoming increasingly important with the introduction of XML and related languages and technologies. The recent shift from DTDs (document type definitions) to XML-Schema for XML data highlights the importance of a schema definition for semi-structured data applications. At the same time, there is a move to extend semi-structured data models to express richer semantics. In this paper we propose a semantically rich data model for semi-structured data, ORA-SS (Object-Relationship-Attribute model for Semi-Structured data). ORA-SS not only reflects the nested structure of semi-structured data, but also distinguishes between objects, relationships and attributes. It is possible to specify the degree of n-ary relationships and to indicate whether an attribute belongs to a relationship or to an object. Such information is lacking in existing semi-structured data models, and is essential for designing an efficient and non-redundant storage organization for semi-structured data.
- Supporting Frequent Updates in R-Trees: A Bottom-Up Approach (2004-04-01)
  Mong Li LEE; Wynne HSU; Christian S. JENSEN; Bin CUI; Keng Lik TEO
  Advances in hardware-related technologies promise to enable new data management applications that monitor continuous processes. In these applications, enormous amounts of state samples are obtained via sensors and are streamed to a database. Further, updates are very frequent and may exhibit locality. While the R-tree is the index of choice for multi-dimensional data with low dimensionality, and is thus relevant to these applications, R-tree updates are also relatively inefficient. We present a bottom-up update strategy for R-trees that generalizes existing update techniques and aims to improve update performance. It has different levels of reorganization (ranging from global to local) during updates, avoiding expensive top-down updates. A compact main-memory summary structure that allows direct access to the R-tree index nodes is used together with efficient bottom-up algorithms. Empirical studies indicate that the bottom-up strategy outperforms the traditional top-down technique, leads to indices with better query performance, achieves higher throughput, and is scalable.
- Towards Cleaning XML Databases: Experience and Performance Evaluation (2003-01-01)
  Wai Lup LOW; Wee Hyong TOK; Mong Li LEE; Tok Wang LING
  With the increasing popularity of data-centric XML, data warehousing and mining applications are being developed for the rapidly burgeoning XML data repositories. Data quality will no doubt be a critical factor for the success of such applications. Data cleaning, which refers to the processes used to improve data quality, has been well researched in the context of traditional databases. In this work, we present a novel attempt to clean XML databases. We discuss the new challenges that arise in XML data cleaning and propose solutions to overcome these problems. Our experimental dataset is the DBLP database, a popular online XML bibliography database used by many researchers. The DBLP database is a large collection of small XML documents. Our study shows the performance gains, flexibility and maintainability that can be achieved by using a relational database management system to clean XML data. We also investigate the conventional practice of using XML parsers when the structure of the XML data is simple and static, and compare their performance against string matching approaches.
- XOO7: Applying OO7 Benchmark to XML Query Processing Tools (2001-06-01)
  Stephane BRESSAN; Gillian DOBBIE; Zoe LACROIX; Mong Li LEE; Ying Guang LI; Ullas NAMBIAR; Bimlesh WADHWA
  If XML is to play the critical role of the lingua franca for Internet data interchange that many predict, it is necessary to start designing and adopting benchmarks allowing the comparative performance analysis of the tools being developed and proposed. The effectiveness of existing XML query languages has been studied by many who focused on the comparison of linguistic features, implicitly reflecting the fact that most XML tools exist only on paper. In this paper, with a focus on efficiency and concreteness, we propose a pragmatic first step toward the systematic benchmarking of XML query processing platforms with an initial focus on the data (versus document) point of view. We propose XOO7, an XML version of the OO7 benchmark. We discuss the applicability of XOO7, its strengths, limitations and the extensions we are considering. We illustrate its use by presenting and discussing the performance comparison against XOO7 of three different query processing platforms for XML.
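
The pre-processing idea in the abstract of "Cleansing Data for Mining and Warehousing" (tokenize each field, sort the tokens, then sort the records so that likely duplicates end up close together) can be illustrated with a small sketch. This is not the authors' system: the function names `token_key` and `find_duplicates`, the sliding-window comparison, the window size, the threshold, and the use of Jaccard similarity as a stand-in for the paper's record similarity method are all assumptions made for illustration.

```python
import re


def token_key(field: str) -> str:
    """Lowercase a field, split it into alphanumeric tokens, and sort them.

    Sorting the tokens makes "Lee, Mong Li" and "Mong Li Lee" produce the
    same key, so they sort next to each other.
    """
    tokens = re.findall(r"[a-z0-9]+", field.lower())
    return " ".join(sorted(tokens))


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity (an assumed stand-in similarity measure)."""
    sa, sb = set(a.split()), set(b.split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def find_duplicates(records, window=3, threshold=0.8):
    """Return index pairs of likely duplicates.

    Records are sorted by their token key, and each record is compared
    only with the next (window - 1) records in sorted order.
    """
    keyed = sorted((token_key(r), i) for i, r in enumerate(records))
    pairs = []
    for pos, (key, idx) in enumerate(keyed):
        for other_key, other_idx in keyed[pos + 1 : pos + window]:
            if jaccard(key, other_key) >= threshold:
                pairs.append(tuple(sorted((idx, other_idx))))
    return sorted(pairs)


records = [
    "Lee, Mong Li",
    "Mong Li Lee",
    "Ling, Tok Wang",
    "Tok Wang Ling",
    "Hongjun Lu",
]
print(find_duplicates(records))  # [(0, 1), (2, 3)]
```

Comparing only neighbours within a small window avoids the quadratic cost of comparing every pair of records, at the risk of missing duplicates whose keys sort far apart; the scrubbing step described in the abstract (normalizing typos and abbreviations against external source files) is what would reduce that risk, and it is omitted here.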