XSM paper published at VLDB 2002 and slides of the talk given at VLDB.

The ever-evolving XSM demo. Current online version operates on XSM interpreter. Compiled XQueries soon to be available.

We thank the NPACI DICE project for its support

The Goals and Quick Description of the XSM Project

XML and its Data Manipulation Siblings. XML has emerged as the de facto data model of information exchange. The needs of applications that manipulate the XML data by filtering and transforming them have given raise to the XPath, XSLT, and XQuery standards of W3C and the development of corresponding XML processors. The commercial interest in developing efficient and flexible XML processors, which will support the next generation of eBusiness applications, increases dramatically, as XML's adoption widens.

The functionality and efficiency challenges faced by such XML processors have naturally attracted the interest of the database community, whose principled approach to data manipulation can lead the way to efficient implementations of the above.

High Efficiency with Stream Processing. The first generation of XML query processors, including the MIX processor of UCSDís DB lab, focused on the efficient processing of XML queries issued on XML databases or XML views of information sources, such as relational databases, files, etc. The recently started XSM (Xml Stream Machine) project of UCSD narrows its focus to XML streams and expects order-of-magnitude performance improvements using a novel architecture and algorithms that reduce XQuery statements into an automaton-like structure, also called an XSM. (This work carries to XPath and XSLT.) We aim to design algorithms which enable the XSM to have minimal memory footprints and make maximum use of the available schema information (in the form of DTDs or XML Schema).

1st Generation of XML processors

The database lab of UCSD and its SDSC affiliates have been working on issues related to XML querying and transformation for multiple years. The MIX project founded its XML query processing on an adaptation of proven (in the SQL world) algebraic query processing techniques to queries on XML data. (See "Navigation-Driven Evaluation of Virtual Mediated Views" for details on the algebra and the query processing.) The query power ranges from regular path queries in the style of XPath, to full-fledged queries in the style of XQuery. Techniques for pipelining and on-demand evaluation were developed in the context of the algebra, as well as optimizations of the expressions used to evaluate the queries. This approach is followed by many other recent XML processors. It is very general in the class of queries and transformations it covers but it misses key opportunities, which we will exploit in the Xml Stream Machine project.

2nd Generation: Efficient XML Stream Machines

In the XSM approach, the XSM compiler inputs an XQuery statement and (optionally) the XML schema or DTD of the stream it will operate on. It outputs an XSM, which is an automaton, potentially augmented with an input event stack and a query size-dependent number of data buffers. The XSM reads the input stream, which roughly consists of open tag events (e.g., <name>), close tag events (e.g., </name>) and content events (e.g., "Victor"). The XSMís finite control responds to each event by making a state transition, which is potentially influenced by the top symbol of the stack. Upon the transition the XSM may produce an output stream event, may remove and/or place a symbol on the stack and may perform one of a limited set of operations that it can do on the data buffers. Overall, the XSM has a very limited instruction set, which gives raise to efficient implementations.

Consider an XML DTD orderdb holding online orders, grouped by customer. The above XQuery fragment creates a summary list of orderInfo elements, each of which pairs the customerís name with the values of his orders.

<orderdb>

<customer>

<name>Bob</name>

<address> ... </address>

<order>

<orderid>0815</orderid>

<value>$12.50</value>

</order>

<order>

<orderid>4711</orderid>

<value>$87.33</value>

</order>

</customer>

<customer> ...

...

</orderdb>

 

First consider how the individual subexpressions can be compiled into individual XSMs: XSM(c) is an extremely simple and efficient machine that listens on its input stream for <customer> elements and only writes those to its output stream (this requires constant memory only). XSM(o) is exactly like XSM(c) but listens for <orders> elements. XSM(p) assumes two input streams, corresponding to its free variables C and O. On XSM(p)ís output, an <orderInfo> element is written for each pair (customer name, order value). From these individual XSMs, a combined XSM(r) can be constructed and optimized. The figure below shows such an optimized XSM(r) that assumes a DTD structure dictating that the customer's orders trail the customer name in the input. Observe that XSM(r) uses a limited buffer for the customer name since it has to delay the decision whether the customer is output until after an <order> event has been received. Further observe that a less efficient machine XSM(rí) would be required if no DTD had been provided. In that case, an unbounded data buffer for holding all orders preceding the customer name would be needed.