Browsing by Author "Tok Wang LING"
- Adaptive Pre-computed Partition Top Method for Range Top-k Queries in OLAP Data Cubes (2001-07-01). Zheng Xuan LOH; Tok Wang LING; Chuan Heng ANG; Sin Yeung LEE; Hua-Gang LI. A range top-k query finds the top-k maximum values over all selected cells of an On-Line Analytical Processing (OLAP) data cube, where the selection is specified by providing ranges of contiguous values for each dimension. The naive method answers a range query by accessing every individual cell in the data cube; it is therefore very costly, with a cost that grows exponentially in the number of dimensions of the data cube. Here, we propose a new method for answering range top-k queries efficiently, termed the Adaptive Pre-computed Partition Top (APPT) method. This method partitions the given data cube and stores r pre-computed maximum values, with their locations, for each partitioned sub-block. Furthermore, an area termed the overflow array stores additional maximum values for sub-blocks that require more than r maximum values for some range queries. By pre-storing these additional maximum values in the overflow array, the response time of subsequent queries with similar ranges is significantly reduced. Experimental results show a vast improvement in the response time of our APPT method compared to the naive method. Our method promises an average update cost of (1+t) total disk accesses, where t is the average number of IO blocks in the overflow array for each sub-block. The method, which is equipped with intelligence and a self-learning capability, is able to keep the value of t below 0.5. Moreover, the APPT method can be applied efficiently to uniform and non-uniform data distributions in both dense and sparse OLAP data cubes.
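  A minimal one-dimensional sketch of the pre-computed-partition idea (the class name, block size and 1-D layout are our own illustrative assumptions; APPT itself works on multi-dimensional cubes and adds the overflow array and self-learning maintenance described above). Each sub-block stores its top-r values, and a range query merges the stored tops of fully covered blocks with a scan of the boundary cells; the sketch is exact only for k <= r, which is precisely the gap the overflow array fills:

  ```python
  import heapq

  class PartitionTop:
      """Store the top-r values of each fixed-size block of a 1-D cube."""
      def __init__(self, cells, block_size=64, r=4):
          self.cells, self.block_size = cells, block_size
          self.block_tops = [heapq.nlargest(r, cells[i:i + block_size])
                             for i in range(0, len(cells), block_size)]

      def range_top_k(self, lo, hi, k):
          """Top-k over cells[lo:hi]; exact for k <= r."""
          first = -(-lo // self.block_size)      # first fully covered block
          last = hi // self.block_size           # one past the last one
          if first >= last:                      # range lies inside one block
              return heapq.nlargest(k, self.cells[lo:hi])
          candidates = []
          for b in range(first, last):           # reuse pre-computed tops
              candidates.extend(self.block_tops[b])
          candidates.extend(self.cells[lo:first * self.block_size])  # left edge
          candidates.extend(self.cells[last * self.block_size:hi])   # right edge
          return heapq.nlargest(k, candidates)
  ```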
- Cleansing Data for Mining and Warehousing (1999-06-01). Mong Li LEE; Hongjun LU; Tok Wang LING; Yee Teng KO. Given the rapid growth of data, it is increasingly important to extract, mine and discover useful information from databases and data warehouses. The process of data cleansing, or data scrubbing, is crucial because of the "garbage in, garbage out" principle. However, "dirty" data files are prevalent as a result of incorrect or missing data values due to data entry or typographical errors, inconsistent value naming conventions due to differing field entry formats and the use of abbreviations, and incomplete information due to the unavailability of data. Hence, we may have multiple records referring to the same real-world entity. This not only contributes to the problem of handling ever-increasing amounts of data, but also leads to the mining of inconsistent or inaccurate information, which is obviously undesirable. In this paper, we examine the problem of detecting and removing duplicate records. We present several efficient techniques to pre-process the records before sorting them, so that potentially matching records are subsequently brought into a close neighbourhood. These techniques include scrubbing data fields using external source files to remove typographical errors and standardize the use of abbreviations, and tokenizing data fields and then sorting the tokens within them. We also propose a method to determine the similarity between two records. Based on these techniques, we implement a data cleansing system which is able to detect and remove more duplicate records than existing methods.
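  The pre-processing pipeline the abstract describes (scrub fields, tokenize and sort tokens, sort the records, then compare neighbours) can be sketched as follows; the abbreviation table and the token-overlap similarity are our own illustrative stand-ins for the paper's external source files and similarity measure:

  ```python
  import re

  # Tiny stand-in for the external source files used to standardize abbreviations.
  ABBREVIATIONS = {"dept": "department", "univ": "university"}

  def scrub(value):
      """Lower-case a field, expand known abbreviations, tokenize, sort the tokens."""
      tokens = [ABBREVIATIONS.get(t, t) for t in re.split(r"\W+", value.lower()) if t]
      return " ".join(sorted(tokens))

  def similarity(rec_a, rec_b):
      """Token-overlap similarity across all fields (an assumed measure)."""
      ta = set(" ".join(map(scrub, rec_a)).split())
      tb = set(" ".join(map(scrub, rec_b)).split())
      return len(ta & tb) / max(1, len(ta | tb))

  def near_duplicates(records, key_field=0, window=3, threshold=0.8):
      """Sort on a scrubbed key field so potential matches land close together,
      then compare each record only with its neighbours in a sliding window."""
      ordered = sorted(records, key=lambda r: scrub(r[key_field]))
      for i, rec in enumerate(ordered):
          for other in ordered[i + 1:i + window]:
              if similarity(rec, other) >= threshold:
                  yield rec, other
  ```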
- Efficient Processing of XML Pattern Matching: A String Matching-based Approach (2004-02-01). Jiaheng LU; Tok Wang LING. In this paper, we propose a new approach to indexing XML documents and processing twig patterns in an XML database. Every XML document in the database is labeled with a variation of the Dewey ID labeling scheme, namely the Extended Dewey ID. The unique feature of this labeling scheme is that, from the label of an element alone, we can use finite state transducers (FSTs) to derive the names of the elements along the path from the root to this element. This feature enables us to reduce XML path pattern matching directly to string matching. We then develop a holistic twig join algorithm, called TwigComPath. The algorithm differs from previous strategies in that it solves the XML pattern matching problem by string matching and comparison instead of binary relationship matching and stitching. Furthermore, our algorithm only needs to visit the labels of elements that satisfy the leaf node predicates in a twig (or path) pattern; hence, it has performance advantages over methods that need to visit the labels of all nodes in the pattern. Finally, we provide experimental results to demonstrate the performance benefits of our proposed approaches.
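  A toy decoder for the key property claimed above: from a label alone, a small transducer derived from the schema recovers the tag path. Here each tag's ordered list of possible child tags stands in for the FST, and each label component x selects child tag number x mod n; the schema map and tag names are assumptions, and the real Extended Dewey construction is more involved:

  ```python
  # Assumed miniature schema: each tag's ordered list of possible child tags.
  CHILD_TAGS = {
      "dblp":    ["article", "inproceedings"],
      "article": ["author", "title", "year"],
  }

  def decode_path(label, root="dblp"):
      """Recover the tag path from an extended-Dewey-like label: component x
      at a node with tag t selects child tag CHILD_TAGS[t][x % n]."""
      path, tag = [root], root
      for component in label:
          tags = CHILD_TAGS[tag]
          tag = tags[component % len(tags)]
          path.append(tag)
      return path

  # decode_path([0, 2]) -> ['dblp', 'article', 'year']
  ```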
- Efficient View Maintenance Using Version Numbers (2000-02-01). Eng Koon SZE; Tok Wang LING. Maintaining a materialized view in an environment of multiple, distributed, autonomous data sources is a challenging issue. Given that the view site does not control the transactions at the data sources, the results of incremental computation are affected by interfering updates, and a compensation process is required to resolve this problem. The compensation algorithm in our previous work handles interfering updates without assuming first-sent-first-received or loss-free delivery of messages over the communication network. In this paper, we improve the incremental computation process by cutting down unnecessary maintenance queries, and thus their corresponding query results, using knowledge of the referential integrity constraints and functional dependencies of the base relations. We reduce the number of sub-queries sent to a data source with multiple base relations, and avoid executing Cartesian products, by using the join graph of the view to determine the access path for querying these base relations. We also provide a compensation algorithm that makes it possible to access multiple base relations from the same data source within a single sub-query. Overall performance is improved: unnecessary view maintenance queries and results are no longer sent, result sizes are reduced drastically because Cartesian products are avoided, and fewer sub-queries are issued to each data source.
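  The abstract does not spell out the version-number mechanics, so the following is only an assumed illustration of the general idea behind the title: if every update and query answer from a source carries a monotonically increasing version number, the view site can tell when a maintenance query saw interfering updates and must be compensated:

  ```python
  class DataSource:
      """Assumed model of a source that stamps its messages with versions."""
      def __init__(self):
          self.version = 0

      def update(self, delta):
          self.version += 1                     # every update bumps the version
          return {"delta": delta, "version": self.version}

      def answer(self, result):
          return {"result": result, "version": self.version}

  def interfered(update_msg, answer_msg):
      """An answer computed at a later version than the update it serves
      reflects interfering updates and needs compensation."""
      return answer_msg["version"] > update_msg["version"]
  ```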
- Extending and Inferring Functional Dependencies in Schema Transformations: Extended Version (2004-03-01). Qi HE; Tok Wang LING. Functional dependency (FD) plays an important role in individual databases. In this paper, we study FDs in the context of multi-database interoperability. A major challenge in the integration of heterogeneous database schemas is that of schematic discrepancy, which arises when the data (values) of one database correspond to metadata (schema labels) of another. We first study schematic discrepant transformations, i.e., transformations between schematically discrepant schemas. We then define the "restricted FD", an extension of the conventional FD, to formalize a class of constraints in schematically discrepant databases, and give a complete set of inference rules for restricted FDs. We then study the propagation of restricted FDs through schematic discrepant transformations. Algorithms are proposed to derive all the restricted FDs in transformed schemas from the restricted FDs in the original schemas. Finally, we show some applications of restricted FDs in the context of multi-database interoperability: (1) using FDs to verify whether a SchemaSQL view is well-defined, (2) using FDs to normalize transformed (integrated) schemas, and so on.
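  Restricted FDs and their inference rules are defined in the paper itself; as a point of reference, the standard attribute-closure computation for conventional FDs, which such inference generalizes, looks like this (a textbook sketch, not the paper's algorithm):

  ```python
  def attribute_closure(attrs, fds):
      """Closure of an attribute set under conventional FDs.
      fds is a list of (lhs, rhs) pairs of attribute sets."""
      closure = set(attrs)
      changed = True
      while changed:
          changed = False
          for lhs, rhs in fds:
              if lhs <= closure and not rhs <= closure:
                  closure |= rhs               # lhs held, so rhs is implied
                  changed = True
      return closure

  # attribute_closure({"A"}, [({"A"}, {"B"}), ({"B"}, {"C"})]) -> {'A', 'B', 'C'}
  ```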
- Inferring and Applying Functional Dependencies in Schematic Discrepant Transformations (2003-06-01). Tok Wang LING; Qi HE. In the relational model, schematic discrepancy occurs when the same information is modeled differently, as attribute values, attribute names, relation names or database names, in different schemas. Although the problem was originally raised in schema integration, many other applications of it have been identified recently. This paper focuses on the inference and use of functional dependency (FD) constraints in transformations among schematically discrepant schemas. We first study the restructuring operators used to implement schematic discrepant transformations; specifically, we study the reconstructibility and commutativity of restructuring operators, which can be used to simplify a transformation. Then, to infer FDs in transformed relations, we propose restricted FDs to represent integrity constraints in the original relations. We also study how restricted FDs change under each kind of restructuring operator. We then give an algorithm to infer FDs in schematic discrepant transformations; our algorithm computes all FDs in the transformed relations that can be inferred from the restricted FDs in the original relations. Finally, we identify the use of FDs in three scenarios: (1) using FDs to judge the correctness of SchemaSQL views; (2) using FDs to normalize a transformed relation; (3) using FD constraints to verify the integrity of data from different sources.
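  Scenario (3), verifying data against an FD, admits a simple concrete check; this sketch tests a conventional FD over a table of rows and is our own illustration, not the paper's procedure:

  ```python
  def fd_holds(rows, lhs, rhs):
      """Check whether the FD lhs -> rhs holds in a concrete table.
      rows are dicts; lhs and rhs are tuples of column names."""
      seen = {}
      for row in rows:
          key = tuple(row[a] for a in lhs)
          val = tuple(row[a] for a in rhs)
          if seen.setdefault(key, val) != val:  # same key, different value
              return False
      return True

  # fd_holds([{"city": "SG", "zip": 1}, {"city": "SG", "zip": 2}],
  #          ("city",), ("zip",)) -> False
  ```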
- A Model for Evaluating and Comparing Materialized View Maintenance Algorithms (1999-11-01). Eng Koon SZE; Tok Wang LING. Providing integrated access to information from multiple, distributed, autonomous data sources has received much attention from both the research community and industry in recent years. A materialized view is a straightforward solution for providing fast access to such information, but maintaining the view to reflect changes at the data sources is not an easy task. Numerous algorithms have been proposed to handle materialized view maintenance. The problems faced in view maintenance include the presence of interfering updates and the misordering or loss of messages. In this paper, we provide a framework for comparing the merits of these approaches. We classify our criteria under four main categories: the environment the algorithms function in, their correctness, their efficiency, and the application requirements they satisfy. The environment criteria include the number of data sources an algorithm supports and the nature of the compensation process. The correctness criteria examine how accurately interfering updates are identified, and whether the view maintenance process works properly when messages are misordered or lost during transmission through the network. The efficiency criteria include the number of base relations accessed per sub-query of the incremental computation, the sequential or parallel handling of incremental computation both within the same update and between different updates, the use of partial self-maintenance in the incremental computation, and how modifications are handled. The application requirements involve the flexibility of the view definition, the degree of consistency provided, and whether a quiescent state of the system is necessary before the view can be refreshed. Based on these criteria, we evaluate and compare the existing algorithms in detail.
- ORA-SS: An Object-Relationship-Attribute Model for Semi-Structured Data (2000-12-01). Gillian DOBBIE; Xiaoying WU; Tok Wang LING; Mong Li LEE. Semi-structured data is becoming increasingly important with the introduction of XML and related languages and technologies. The recent shift from DTDs (document type definitions) to XML Schema for XML data highlights the importance of a schema definition for semi-structured data applications. At the same time, there is a move to extend semi-structured data models to express richer semantics. In this paper we propose a semantically rich data model for semi-structured data, ORA-SS (Object-Relationship-Attribute model for Semi-Structured data). ORA-SS not only reflects the nested structure of semi-structured data, but also distinguishes between objects, relationships and attributes. It is possible to specify the degree of n-ary relationships and to indicate whether an attribute belongs to a relationship or to an object. Such information is lacking in existing semi-structured data models, yet is essential for designing an efficient and non-redundant storage organization for semi-structured data.
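  The object/relationship/attribute distinctions can be rendered as a small data-structure sketch (the names and the course-student example are illustrative, not the paper's notation): relationship types record their degree, and an attribute belongs explicitly either to an object class or to a relationship type:

  ```python
  from dataclasses import dataclass, field

  @dataclass
  class ObjectClass:
      name: str
      attributes: list = field(default_factory=list)

  @dataclass
  class RelationshipType:
      name: str
      participants: list                                 # degree = len(participants)
      attributes: list = field(default_factory=list)     # attributes of the relationship

  course = ObjectClass("course", ["code", "title"])
  student = ObjectClass("student", ["number", "name"])
  enrol = RelationshipType("cs", [course, student], attributes=["grade"])
  # "grade" is an attribute of the course-student relationship, not of either
  # object, and len(enrol.participants) == 2 records that "cs" is binary.
  ```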
- Practical approach to selecting data warehouse views using data dependencies (2000-07-01). Gillian DOBBIE; Tok Wang LING. Data warehouses integrate information from heterogeneous sources and enable efficient analysis of that information. The two main characteristics of data warehouses are the huge volumes of data they store and the requirement for fast access to that data. Because of the huge data volumes, simple search techniques are not sufficient. Materialized views in data warehouses are typically complicated, based on many tables and often containing summarized information, but they are very important for improving access to the data. Because data warehouses are expected to contain current information, it is also important that the data warehouse, and the views, can be updated easily and periodically. The selection of materialized views changes over time, with new materialized views created and old ones dropped, so the selection of materialized views is crucial. Most research to date has treated the selection of materialized views as an optimization problem with respect to the cost of view maintenance and/or the cost of queries. In this paper, we consider practical aspects of data warehousing. We identify problems with the star and snowflake schemas and suggest solutions, which we call the enhanced star schema and enhanced snowflake schema. We also identify practical problems that may arise during view design and suggest heuristics, based on data dependencies, that can be used to measure whether one set of views is better than another, or to improve a set of views with respect to speed of access.
- Resolving Schematic Discrepancy in the Integration of ER Schemas (2003-06-01). Qi HE; Tok Wang LING. In this work, we address the schematic discrepancy problem in the integration of ER schemas. Schematic discrepancy occurs when the same information is modeled as attribute values, object type (entity type or relationship set) names, or attribute names in different ER schemas of databases. We first define the concepts of local/global attributes and identifiers of entity types and relationship sets, respectively, in an ER schema, which provide the essential semantics for resolving schematic discrepancy. Based on these concepts, we then give algorithms to resolve schematic discrepancy among the ER schemas to be integrated.
- A semantic approach to XML schema integration (2004-09-01). Qi HE; Tok Wang LING. In this paper, we adopt a semantically rich model, the Object-Relationship-Attribute model for Semi-Structured data (ORA-SS), to represent XML schemas. A challenge in XML schema integration comes from the hierarchical structure of XML. For example, two sets of XML elements from two sources may constitute the same relationship type, but in different hierarchies; in the integrated schema, we then need to decide on a "good" hierarchy for these elements. In general, we require that an integrated schema preserve the semantics of the source schemas, have minimum redundancy, and lead to low-cost data transformation. Guided by these criteria, we develop algorithms to merge equivalent elements and equivalent relationship types among elements from the source schemas, and propose a top-down approach to integrating ORA-SS schemas, meeting the challenges caused by the hierarchical structure of XML.
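  A deliberately simplified sketch of merging two source hierarchies (here schema trees are nested dicts and elements with the same tag are taken as equivalent, whereas the paper resolves equivalence and the "good" hierarchy semantically, not by name alone):

  ```python
  def merge(tree_a, tree_b):
      """Top-down merge of two schema trees given as nested {tag: subtree} dicts."""
      merged = dict(tree_a)
      for tag, subtree in tree_b.items():
          merged[tag] = merge(merged.get(tag, {}), subtree)
      return merged

  a = {"department": {"course": {"title": {}}}}
  b = {"department": {"course": {"lecturer": {}}}}
  # merge(a, b) -> {'department': {'course': {'title': {}, 'lecturer': {}}}}
  ```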
- Towards Cleaning XML Databases: Experience and Performance Evaluation (2003-01-01). Wai Lup LOW; Wee Hyong TOK; Mong Li LEE; Tok Wang LING. With the increasing popularity of data-centric XML, data warehousing and mining applications are being developed for the rapidly burgeoning XML data repositories. Data quality will no doubt be a critical factor for the success of such applications. Data cleaning, which refers to the processes used to improve data quality, has been well researched in the context of traditional databases. In this work, we present a novel attempt to clean XML databases. We discuss the new challenges that arise in XML data cleaning and propose solutions to overcome them. Our experimental dataset is the DBLP database, a popular online XML bibliography used by many researchers; it is a large collection of small XML documents. Our study shows the performance, flexibility and maintainability benefits that can be achieved by leveraging a relational database management system to clean XML data. We also examine the conventional practice of using XML parsers when the structure of the XML data is simple and static, and compare their performance against string matching approaches.
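  The parser-versus-string-matching comparison can be made concrete with a toy DBLP-like record; both extractors below agree when the structure is simple and static, which is the regime the paper investigates (the record and field names are illustrative):

  ```python
  import re
  import xml.etree.ElementTree as ET

  record = "<article><author>Tok Wang Ling</author><title>XTree</title></article>"

  def authors_via_parser(xml_text):
      """Conventional approach: a full XML parse."""
      return [e.text for e in ET.fromstring(xml_text).iter("author")]

  AUTHOR = re.compile(r"<author>(.*?)</author>")

  def authors_via_regex(xml_text):
      """Plain string matching over the raw text; viable when the
      structure is simple and static."""
      return AUTHOR.findall(xml_text)

  assert authors_via_parser(record) == authors_via_regex(record)
  ```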
- XTree: A Declarative Query Language for XML Documents (2005-01-11). Zhuo CHEN; Tok Wang LING; Mengchi LIU; Gillian DOBBIE. XML is becoming prevalent for data presentation and data exchange on the internet. One important issue in the XML research community is how to query XML documents to extract and restructure information. Currently, XQuery, which is based on XPath, is the most promising standard. In this paper, we discuss limitations of XPath and XQuery, and propose a generalization of XPath called XTree that overcomes these limitations. Using XTree, multiple variable bindings can be instantiated in one expression, and XTree expressions, which represent a tree rather than a path, can be used in both the querying part and the result construction part of a query. Based on XTree, we develop an XTree query language, which is more compact and convenient to use than XQuery, and which supports common query operations such as join, negation, grouping and recursion in a direct way. We describe an algorithm that converts XTree query scripts to XQuery scripts. This algorithm not only provides a means of executing queries written in the XTree query language, but also highlights the differences between the two query languages.
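  The XTree syntax itself is defined in the paper; purely to illustrate the tree-versus-path idea, here is a hypothetical nested-dict pattern (our notation, not XTree's) in which one tree-shaped expression binds several variables at once, where a path language would need one path per variable:

  ```python
  doc = {"book": {"title": "XML Basics", "author": "Tok Wang Ling"}}
  pattern = {"book": {"title": "$t", "author": "$a"}}   # hypothetical notation

  def match(data, pattern, bindings=None):
      bindings = {} if bindings is None else bindings
      if isinstance(pattern, str):
          if pattern.startswith("$"):
              bindings[pattern[1:]] = data               # variable: bind the text
              return bindings
          return bindings if data == pattern else None   # literal: must be equal
      for tag, subpattern in pattern.items():            # descend on every branch
          if not isinstance(data, dict) or tag not in data:
              return None
          if match(data[tag], subpattern, bindings) is None:
              return None
      return bindings

  # match(doc, pattern) -> {'t': 'XML Basics', 'a': 'Tok Wang Ling'}
  ```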