Notes
Outline
Data Integration:
The Need, the Challenges, and the Approaches
 Yannis Papakonstantinou
Asst Professor, Computer Science & Engineering,
University of California, San Diego
Based on work and experience at the
UCSD DataBase lab &
Enosys Software
Overview
Data Integration: High-level need and challenges
Approaches and Architectures
Workflow Vs Query-based integration
Materialized Vs Virtual Views
Relational Vs XML data model
Query Processing: What’s New, What’s Not
Algebras and Iterator models for XML
Mixing Browsing & Navigation
Navigation Vs Streaming
Facing the limited capabilities of the sources
Exporting Data on the Web
Efficient Top-N preference queries on Views
Web Interfaces to XML & Semistructured Data
Schema Issues
Data Integration Requirements in eBusiness
It starts with …
”What if we could provide our customers, and partners with Application X”, where X may be in Business Intelligence, Customer Support, Supply Chain Management,…
Then the pain begins…
“How do I use necessary information assets widely distributed across my enterprise?”
If only….
“Give to the client application a single place to go to access all the information required to implement our eBusiness strategy. Our strategy is evolving so make sure this solution accommodates dynamic business requirements.”
eCommerce Application: Content from Multiple Sources
Customer Support
Business Intelligence
Data Integration Challenges in
Homeland Security
Data Integration Challenges in Science and Environment
Data Integration Challenges & Desiderata (high level)
Address the heterogeneity and distribution of content
heterogenous formats, heterogenous interfaces, different rates of change (static versus dynamic), autonomous sources
Complicated logic and semantics for putting together needed data
Provide distribution and heterogeneity transparency
Decouple application logic from integration
easily add/change sources
Customize the delivery of content
Overview
Data Integration: High-level need and challenges
Approaches and Architectures
Workflow Vs Query-based integration
Materialized Vs Virtual Views
Relational Vs XML data model
Query Processing: What’s New, What’s Not
Algebras and Iterator models for XML
Mixing Browsing & Navigation
Navigation Vs Streaming
Facing the limited capabilities of the sources
Exporting Data on the Web
Efficient Top-N preference queries on Views
Web Interfaces to XML & Semistructured Data
Schema Issues
Most-Generic Integration System Architecture
Approaches towards Client-Driven (Pull) Data Integration
Case Study: Customer Support
Slide 14
Challenges and Inefficiencies of the Brute-Force Approach (I)
Extracting information from heterogeneous data sources is difficult (multiple interfaces & formats)
What call should I make to the data source?
How do I structure the source results into the objects my  application needs?
Integration logic is hidden in application programs
How do objects of DB 1 match (which?) objects of data source 2?
How are source data mapped into the objects required by the application’s front-end?
Result: Hard to abstract and maintain in a Java program
Every new source causes a new project
Challenges and Inneficiencies of the Brute-Force Approach (II)
Integration logic is repeated in application programs
Even for same sources and similar information needs
Cost increased by manually coding the integration steps
Developer has to specify which sources have to be accessed, how, in what order (incl. parallel, sequential), and how each data piece is assembled into the object the application needs
Slide 17
Workflow/EAI approach: data integration still a big pain
Source access and extraction is facilitated with XML adaptors
Less trouble in managing the source results (everything is XML now)
Message transformers
Integration logic is hidden in application programs
Integration logic is repeated in application programs
Cost increased by manually coding the integration steps
EAI Vendors: TIBCO, Vitria, WebMethods, SeeBeyond
EAI with Web Services’ Spin: Bowstreet, Microsoft (.Net)
The Query/View-Based Approach to Building Applications
Slide 20
The power of queries and views (I)
Integration logic is abstracted away from application development
Leveraged across applications
Specified with high-level, useful metadata (XML schema, integrated view)
Integration platforms optimize how to get the data
Automatically create plan to access sources (order, etc) and combine the data pieces
The power of queries and views (II)
Result: Much improved application development, low maintenance cost
Easy to meet data needs for new applications
Easy to add and remove data sources
Result: Business logic of integration is abstracted in view and becomes a valuable asset
Primary Vendors: Callixa, Cohera, Enosys Software, Nimble, Metamatrix, XPeranto
Related: Datajoiner by IBM, Informatica
Eager or On-Demand?
The Data Warehousing Approach
Information integrated in advance
Copied and stored in warehouse for direct querying and analysis
Warehouse is updated periodically to maintain consistency
Reconstruct or update incrementally
Online analytical processing (OLAP), data mining
Eager or On-Demand:
The Mediation Approach
Information accessed, fetched, and integrated on-demand
Mediators answer queries but do not store data persistently
Virtual database
On-Demand Mediator Operation
On-Demand Mediator
Operation
Data Warehouse Applications
Best when source data not very dynamic or (slightly) out-of-date data is ok
High query performance
Doesn’t interfere with local processing at sources
Many corporations warehouse their transactional sales data for analysis
Time consuming decision-support/data mining queries
Mediator Applications
Important when up-to-date information is needed
Good for clients with unpredictable needs
Support hierarchies of abstraction where one mediator becomes an information source for others
Emerging technology
Recently resolved technical challenges
Major challenge: optimization of ad-hoc queries
Wrappers & Mediators from
High-Level Specifications
How Do We Model Data in the Integration System?
Relational Data Model
Set of rows and columns
Fixed set of simple data types
Data cube
Specialized warehouse management system
Uses a single, multi-dimensional relation as model
Object-Oriented Model
Semistructured data
Flexible, self-describing, object-like
Permits missing components
Late bindings of data types
E.g., XML and related technologies
Why XML?
Very good for representing semistructured data
Nesting, support of missing/duplicate elements, flexible typing
Lots of research in the area resulted independently in similar models
Very good for representing heterogeneous data
Flexibility to accommodate different source data models, and all sorts of heterogeneities
Very good for fitting data to application needs
Excellent fit to Web/HTML applications
“Standard” relational to XML Schema translation
CustomerOrder(cname,balance,orderid)
OrderDetail(orderid, oname, odate)
<db>
 <CustomerOrder>
   <cname>Yannis</cname>
   <balance>1000</balance>
   <orderid>23</orderid>
 </CustomerOrder>
 <CustomerOrder>…</CustomerOrder>
 <OrderDetail>
   <orderid>23</orderid>
   <oname>potatos for Yannis</oname>
   <odate>2/30/02</odate>
 </OrderDetail>
 …
</db>
How do XML Standards come in?
XQuery as Query and View Definition Language in Data Integration
XQuery can be the key component of integrated data access, manipulation
High-level, declarative language
Full power for data access and manipulation
Closes the gap between data management, transformation, presentation
XQuery standard does NOT elaborate on the physical aspects of the XML sources
Overview
Data Integration: High-level need and challenges
Approaches and Architectures
Workflow Vs Query-based integration
Materialized Vs Virtual Views
Relational Vs XML data model
Query Processing: What’s New, What’s Not
Algebras and Iterator models for XML
Mixing Browsing & Navigation
Navigation Vs Streaming
Facing the limited capabilities of the sources
Exporting Data on the Web
Efficient Top-N preference queries on Views
Web Interfaces to XML & Semistructured Data
Schema Issues
What Stays the Same in Query Processing
Limited and Different Source Capabilities
Only give sources requests they can answer
How do we describe source abilities?
How (and when) can we answer a query given the limited abilities of the sources?
Garlic Query Processing
Algebraic Approach to Capabilities Problem
Algebraic Approach to Capabilities Problem (Cont’d)
Slide 41
Slide 42
Slide 43
Slide 44
Slide 45
Slide 46
Slide 47
Overview
Data Integration: High-level need and challenges
Approaches and Architectures
Workflow Vs Query-based integration
Materialized Vs Virtual Views
Relational Vs XML data model
Query Processing: What’s New, What’s Not
Algebras and Iterator models for XML
Mixing Browsing & Navigation
Navigation Vs Streaming
Facing the limited capabilities of the sources
Exporting Data on the Web
Efficient Top-N preference queries on Views
Web Interfaces to XML & Semistructured Data
Schema Issues
Exporting DBMSs on the Web
XQuery behind the scenes
Web and Databases Effort
Data intensive Web site generators
Strudel
Forms as functions on edges/links
Araneus
Autoweb
Query Forms and Reports
Handle semistructureness
Powerful query forms and reports
Be declarative
Separate logic from presentation
Encode compactly a large number of queries
Compared to a set of query templates
Visual interface for the developer
Programming should NOT be a requirement
Query Forms and Reports
Query Forms and Reports
XML Schema-driven
Declarative!
Separation of content & presentation
Editor
Visual actions to declarative specifications
Automatic construction of report pages
Query Set Specification (QSS)
Large set of parameterized queries
Compact representation
Engine
Automatic query formulation
Direct result construction
Query Forms and Reports
Developing Query Forms from the XML Schema
Developing Reports from the XML Schema
QURSED System Architecture
Slide 58
Summary and Links
Data Integration: High-level need and challenges
Approaches and Architectures
Workflow Vs Query-based integration
Materialized Vs Virtual Views
Relational Vs XML data model
Query Processing: What’s New, What’s Not
Algebras and Iterator models for XML (EDBT00, DKE02)
Mixing Browsing & Navigation (ICDE02)
Navigation Vs Streaming (VLDB02)
Facing the limited capabilities of the sources (PDIS96, DAPD98, VLDB97, JLP00)
Exporting Data on the Web
Efficient Top-N preference queries on Views (SIGMOD01)
Web Interfaces to XML & Semistructured Data (WWW10, SIGMOD02)
Schema Issues (ICDE00, PODS00)