Phrase Matching in XML

 

Introduction     

We present PIX, a system for flexible and efficient phrase matching in XML documents. Because XML allows structured and unstructured information to be interleaved in the same document, phrase matching in XML raises new challenges. PIX supports phrase matching in XML documents that contain "mixed content". A key feature of PIX is that users can specify which markup and annotations to ignore when matching a phrase. PIX uses inverted indices and an efficient evaluation algorithm to compute the set of matches, and it returns answers in which the matched phrases, ignored markup, and ignored annotations are highlighted. Query answers are sorted using a ranking function. PIX is implemented as an extension of GALAX, a full-fledged XQuery engine. Its functionality is fully integrated into XQuery and permits a natural combination of XPath-based structure matching with phrase matching.
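To make the idea of ignoring markup and annotations concrete, here is a minimal Python sketch. It is not the PIX implementation: it does a linear scan over one small document instead of using inverted indices, and it does no ranking or highlighting. The function names (`tokens`, `contains_phrase`), the parameter names, and the sample tags `b` and `fn` are hypothetical. The sketch also assumes one particular reading of the two options: an ignored markup tag is transparent (its text still takes part in matching), an ignored annotation is skipped together with its content, and any other tag acts as a boundary that a phrase may not cross.

```python
import re
import xml.etree.ElementTree as ET


def tokens(elem, ignore_markup, ignore_annotations):
    """Yield the lowercase words under elem in document order.

    Assumed semantics (illustrative, not taken from PIX):
    - tags in ignore_markup are transparent, so their text is kept;
    - tags in ignore_annotations are dropped together with their content;
    - any other child tag is a phrase boundary, marked by None.
    """
    yield from re.findall(r"\w+", (elem.text or "").lower())
    for child in elem:
        if child.tag in ignore_annotations:
            pass  # skip the annotation and everything inside it
        elif child.tag in ignore_markup:
            yield from tokens(child, ignore_markup, ignore_annotations)
        else:
            yield None  # boundary before the non-ignored element
            yield from tokens(child, ignore_markup, ignore_annotations)
            yield None  # boundary after the non-ignored element
        # Text that follows the child still belongs to the parent.
        yield from re.findall(r"\w+", (child.tail or "").lower())


def contains_phrase(xml_text, phrase, ignore_markup=(), ignore_annotations=()):
    """Return True if the words of phrase occur contiguously in the document."""
    root = ET.fromstring(xml_text)
    words = list(tokens(root, set(ignore_markup), set(ignore_annotations)))
    target = phrase.lower().split()
    n = len(target)
    return any(words[i:i + n] == target for i in range(len(words) - n + 1))


doc = "<p>phrase <b>matching</b> in <fn>see note</fn> XML documents</p>"

# With <b> treated as ignorable markup and <fn> as an ignorable annotation,
# the phrase is found even though tags interrupt it in the source.
found = contains_phrase(doc, "phrase matching in XML",
                        ignore_markup={"b"}, ignore_annotations={"fn"})

# With nothing ignored, the same phrase is broken up by the tags.
found_strict = contains_phrase(doc, "phrase matching in XML")
```

In this sketch `found` is true and `found_strict` is false, which illustrates why letting the user choose what to ignore changes the set of matches in mixed content.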

Joint work with Sihem Amer-Yahia, Mary Fernandez, and Divesh Srivastava, AT&T Labs Research.

Demo 

Full description (ICDE 2003 demo, PDF)

Back to Yu Xu's homepage.