Edit Distance between XML and Probabilistic XML Documents

TANG, Ruiming; WU, Huayu; NOBARI, Sadegh; BRESSAN, Stephane

Files

TRB6-11.pdf (358.12 KB)

Date

2011-06-03

Abstract

We propose an efficient algorithm for computing the edit distance between an XML document and a probabilistic XML document. Probabilistic XML is a hierarchical data model that captures uncertainty in both value and structure. It is suited to many modern applications, such as information extraction, scientific data management, and data integration. Computing similarity is an essential building block for the comparison, alignment, clustering, and classification of data in these applications. Several algorithms exist for measuring the structural similarity between XML documents, or between XML documents and XML document type definitions and schemas. The new challenge in efficiently computing the similarity between an XML document and a probabilistic XML document is the multiplicity of possible worlds that a probabilistic XML document represents. In this paper, we devise and discuss algorithms for computing this similarity, and we evaluate their performance empirically and comparatively. In the absence of established corpora and benchmarks for probabilistic XML, we also propose and use random probabilistic XML models together with the associated random generation algorithms.

URI

https://dl.comp.nus.edu.sg/xmlui/handle/1900.100/3472

Collections

Technical Reports
