Towards Cleaning XML Databases: Experience and Performance Evaluation
Date
2003-01-01T00:00:00Z
Abstract
With the increasing popularity of data-centric XML, data warehousing and mining applications are being developed for the rapidly burgeoning XML data repositories. Data quality will no doubt be a critical factor for the success of such applications. Data cleaning, which refers to the processes used to improve data quality, has been well researched in the context of traditional databases.
In this work, we present a novel attempt to clean XML databases. We discuss the new challenges that arise in XML data cleaning and propose solutions to address them. Our experimental dataset is the DBLP database, a popular online XML bibliography database used by many researchers. The DBLP database is a large collection of small XML documents.
Our study shows that using a relational database management system to clean XML data yields gains in performance, flexibility and maintainability. We also examine the conventional practice of using XML parsers when the structure of the XML data is simple and static, and compare the performance of these parsers against string matching approaches.