|
|
|
|
|
Yannis
Papakonstantinou |
|
Asst Professor, Computer Science &
Engineering, |
|
University of California, San Diego |
|
|
|
Based on work and experience at the |
|
UCSD DataBase lab & |
|
Enosys Software |
|
|
|
|
|
|
|
|
Data Integration: High-level need and challenges |
|
Approaches and Architectures |
|
Workflow Vs Query-based integration |
|
Materialized Vs Virtual Views |
|
Relational Vs XML data model |
|
Query Processing: What’s New, What’s Not |
|
Algebras and Iterator models for XML |
|
Mixing Browsing & Navigation |
|
Navigation Vs Streaming |
|
Facing the limited capabilities of the sources |
|
Exporting Data on the Web |
|
Efficient Top-N preference queries on Views |
|
Web Interfaces to XML & Semistructured Data |
|
Schema Issues |
|
|
|
|
It starts with …
”What if we could provide our customers, and partners with Application X”,
where X may be in Business Intelligence, Customer Support, Supply Chain
Management,… |
|
Then the pain begins…
“How do I use necessary information assets widely distributed
across my enterprise?” |
|
If only….
“Give to the client application a single place to go to access all the
information required to implement our eBusiness strategy. Our strategy is
evolving so make sure this solution accommodates dynamic business
requirements.” |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Address the heterogeneity and distribution of
content |
|
heterogenous formats, heterogenous interfaces,
different rates of change (static versus dynamic), autonomous sources |
|
Complicated logic and semantics for putting
together needed data |
|
Provide distribution and heterogeneity
transparency |
|
Decouple application logic from integration |
|
easily add/change sources |
|
Customize the delivery of content |
|
|
|
|
|
|
Data Integration: High-level need and challenges |
|
Approaches and Architectures |
|
Workflow Vs Query-based integration |
|
Materialized Vs Virtual Views |
|
Relational Vs XML data model |
|
Query Processing: What’s New, What’s Not |
|
Algebras and Iterator models for XML |
|
Mixing Browsing & Navigation |
|
Navigation Vs Streaming |
|
Facing the limited capabilities of the sources |
|
Exporting Data on the Web |
|
Efficient Top-N preference queries on Views |
|
Web Interfaces to XML & Semistructured Data |
|
Schema Issues |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Extracting information from heterogeneous data
sources is difficult (multiple interfaces & formats) |
|
What call should I make to the data source? |
|
How do I structure the source results into the
objects my application needs? |
|
Integration logic is hidden in application
programs |
|
How do objects of DB 1 match (which?) objects of
data source 2? |
|
How are source data mapped into the objects
required by the application’s front-end? |
|
Result: Hard to abstract and maintain in a Java
program |
|
Every new source causes a new project |
|
|
|
|
|
Integration logic is repeated in application
programs |
|
Even for same sources and similar information
needs |
|
Cost increased by manually coding the
integration steps |
|
Developer has to specify which sources have to
be accessed, how, in what order (incl. parallel, sequential), and how each
data piece is assembled into the object the application needs |
|
|
|
|
|
|
|
Source access and extraction is facilitated with
XML adaptors |
|
Less trouble in managing the source results
(everything is XML now) |
|
Message transformers |
|
Integration logic is hidden in application
programs |
|
Integration logic is repeated in application
programs |
|
Cost increased by manually coding the
integration steps |
|
EAI Vendors: TIBCO, Vitria, WebMethods,
SeeBeyond |
|
EAI with Web Services’ Spin: Bowstreet,
Microsoft (.Net) |
|
|
|
|
|
|
|
|
|
|
|
|
|
Integration logic is abstracted away from
application development |
|
Leveraged across applications |
|
Specified with high-level, useful metadata (XML
schema, integrated view) |
|
Integration platforms optimize how to get the
data |
|
Automatically create plan to access sources
(order, etc) and combine the data pieces |
|
|
|
|
|
Result: Much improved application development,
low maintenance cost |
|
Easy to meet data needs for new applications |
|
Easy to add and remove data sources |
|
Result: Business logic of integration is
abstracted in view and becomes a valuable asset |
|
Primary Vendors: Callixa, Cohera, Enosys
Software, Nimble, Metamatrix, XPeranto |
|
Related: Datajoiner by IBM, Informatica |
|
|
|
|
|
Information integrated in advance |
|
Copied and stored in warehouse for direct
querying and analysis |
|
Warehouse is updated periodically to maintain
consistency |
|
Reconstruct or update incrementally |
|
Online analytical processing (OLAP), data mining |
|
|
|
|
|
Information accessed, fetched, and integrated
on-demand |
|
Mediators answer queries but do not store data
persistently |
|
Virtual database |
|
|
|
|
|
|
|
|
|
|
|
Best when source data not very dynamic or
(slightly) out-of-date data is ok |
|
High query performance |
|
Doesn’t interfere with local processing at
sources |
|
Many corporations warehouse their transactional
sales data for analysis |
|
Time consuming decision-support/data mining
queries |
|
|
|
|
|
Important when up-to-date information is needed |
|
Good for clients with unpredictable needs |
|
Support hierarchies of abstraction where one
mediator becomes an information source for others |
|
Emerging technology |
|
Recently resolved technical challenges |
|
Major challenge: optimization of ad-hoc queries |
|
|
|
|
|
|
|
Relational Data Model |
|
Set of rows and columns |
|
Fixed set of simple data types |
|
Data cube |
|
Specialized warehouse management system |
|
Uses a single, multi-dimensional relation as
model |
|
Object-Oriented Model |
|
Semistructured data |
|
Flexible, self-describing, object-like |
|
Permits missing components |
|
Late bindings of data types |
|
E.g., XML and related technologies |
|
|
|
|
|
Very good for representing semistructured data |
|
Nesting, support of missing/duplicate elements,
flexible typing |
|
Lots of research in the area resulted
independently in similar models |
|
Very good for representing heterogeneous data |
|
Flexibility to accommodate different source data
models, and all sorts of heterogeneities |
|
Very good for fitting data to application needs |
|
Excellent fit to Web/HTML applications |
|
|
|
|
CustomerOrder(cname,balance,orderid) |
|
OrderDetail(orderid, oname, odate) |
|
|
|
<db> |
|
<CustomerOrder> |
|
<cname>Yannis</cname> |
|
<balance>1000</balance> |
|
<orderid>23</orderid> |
|
</CustomerOrder> |
|
<CustomerOrder>…</CustomerOrder> |
|
<OrderDetail> |
|
<orderid>23</orderid> |
|
<oname>potatos for Yannis</oname> |
|
<odate>2/30/02</odate> |
|
</OrderDetail> |
|
… |
|
</db> |
|
|
|
|
|
|
|
XQuery can be the key component of integrated
data access, manipulation |
|
High-level, declarative language |
|
Full power for data access and manipulation |
|
Closes the gap between data management,
transformation, presentation |
|
XQuery standard does NOT elaborate on the
physical aspects of the XML sources |
|
|
|
|
|
|
Data Integration: High-level need and challenges |
|
Approaches and Architectures |
|
Workflow Vs Query-based integration |
|
Materialized Vs Virtual Views |
|
Relational Vs XML data model |
|
Query Processing: What’s New, What’s Not |
|
Algebras and Iterator models for XML |
|
Mixing Browsing & Navigation |
|
Navigation Vs Streaming |
|
Facing the limited capabilities of the sources |
|
Exporting Data on the Web |
|
Efficient Top-N preference queries on Views |
|
Web Interfaces to XML & Semistructured Data |
|
Schema Issues |
|
|
|
|
|
|
Only give sources requests they can answer |
|
How do we describe source abilities? |
|
How (and when) can we answer a query given the
limited abilities of the sources? |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Data Integration: High-level need and challenges |
|
Approaches and Architectures |
|
Workflow Vs Query-based integration |
|
Materialized Vs Virtual Views |
|
Relational Vs XML data model |
|
Query Processing: What’s New, What’s Not |
|
Algebras and Iterator models for XML |
|
Mixing Browsing & Navigation |
|
Navigation Vs Streaming |
|
Facing the limited capabilities of the sources |
|
Exporting Data on the Web |
|
Efficient Top-N preference queries on Views |
|
Web Interfaces to XML & Semistructured Data |
|
Schema Issues |
|
|
|
|
|
|
|
|
Data intensive Web site generators |
|
Strudel |
|
Forms as functions on edges/links |
|
Araneus |
|
Autoweb |
|
|
|
|
|
|
|
Handle semistructureness |
|
Powerful query forms and reports |
|
Be declarative |
|
Separate logic from presentation |
|
Encode compactly a large number of queries |
|
Compared to a set of query templates |
|
Visual interface for the developer |
|
Programming should NOT be a requirement |
|
|
|
|
|
|
|
XML Schema-driven |
|
Declarative! |
|
Separation of content & presentation |
|
Editor |
|
Visual actions to declarative specifications |
|
Automatic construction of report pages |
|
Query Set Specification (QSS) |
|
Large set of parameterized queries |
|
Compact representation |
|
Engine |
|
Automatic query formulation |
|
Direct result construction |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Data Integration: High-level need and challenges |
|
Approaches and Architectures |
|
Workflow Vs Query-based integration |
|
Materialized Vs Virtual Views |
|
Relational Vs XML data model |
|
Query Processing: What’s New, What’s Not |
|
Algebras and Iterator models for XML (EDBT00,
DKE02) |
|
Mixing Browsing & Navigation (ICDE02) |
|
Navigation Vs Streaming (VLDB02) |
|
Facing the limited capabilities of the sources
(PDIS96, DAPD98, VLDB97, JLP00) |
|
Exporting Data on the Web |
|
Efficient Top-N preference queries on Views
(SIGMOD01) |
|
Web Interfaces to XML & Semistructured Data
(WWW10, SIGMOD02) |
|
Schema Issues (ICDE00, PODS00) |
|