2007
Presenter: Hyun Jin Moon
Title: Semantic Adaptation of Schema Mappings when Schemas Evolve
Authors: Cong Yu (Dept. of EECS, Univ. of Michigan) 
         and Lucian Popa (IBM Almaden Research Center)
Published: VLDB 2005
Date: April 6, 2007
Time: 12:30-1:15pm
Room: BH 4549
Abstract
Schemas evolve over time to accommodate the changes in the information they
represent. Such evolution causes invalidation of various artifacts depending
on the schemas, such as schema mappings. In a heterogeneous environment,
where cooperation among data sources depends essentially upon them, schema
mappings must be adapted to reflect schema evolution. In this study, we
explore the mapping composition approach for addressing this mapping 
adaptation problem. We study the semantics of mapping composition in the 
context of mapping adaptation and compare our approach with the incremental 
approach of Velegrakis et al. [21]. We show that our method is superior in
terms of capturing the semantics of both the original mappings and the 
evolution. We design and implement a mapping adaptation system based
on mapping composition as well as additional mapping pruning techniques that
significantly speed up the adaptation. We conduct comprehensive experimental
analysis and show that the composition approach is practical in various
evolution scenarios. The mapping language that we consider is a nested
relational extension of the second-order dependencies of Fagin et al. [7].
Our work can also be seen as an implementation of the mapping composition 
operator of the model management framework.

Speaker: Uri Schonfeld
Title: Do Not Crawl in the DUST: Different URLs with Similar Text
To be presented at: WWW'07
Date: April 13, 2007
Time: 12:30-1:15pm
Room: BH 4549
Abstract
We consider the problem of DUST: Different URLs with Similar Text.
Such duplicate URLs are prevalent in web sites, as web server
software often uses aliases and redirections, and dynamically 
generates the same page from various different URL requests.  
We present a novel algorithm, DustBuster, for uncovering DUST;
that is, for discovering rules that transform a given URL to 
others that are likely to have similar content. DustBuster mines 
DUST effectively from previous crawl logs or web server logs, 
without examining page contents. Verifying these rules via sampling
requires fetching only a few actual web pages. Search engines can benefit
from information about DUST to increase the effectiveness of crawling, 
reduce indexing overhead, and improve the quality of popularity 
statistics such as PageRank.
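
As a rough illustration of what "rules that transform a given URL to
others" could look like in practice, the sketch below applies a few
substring-substitution rules to collapse duplicate URLs before crawling.
The rule format, the example rules, and the function names are assumptions
made for this illustration; the abstract does not specify them, and this
is not the DustBuster algorithm itself, which discovers such rules from logs.

```python
# Hypothetical DUST rules: (substring, replacement) pairs.
# These example rules are assumptions, not taken from the talk.
RULES = [
    ("http://www.", "http://"),
    ("/index.html", "/"),
]


def canonicalize(url, rules=RULES):
    """Apply each substitution rule in order to obtain a canonical URL."""
    for old, new in rules:
        url = url.replace(old, new)
    return url


def group_by_canonical(urls, rules=RULES):
    """Group URLs that map to the same canonical form (likely duplicates)."""
    groups = {}
    for url in urls:
        groups.setdefault(canonicalize(url, rules), []).append(url)
    return groups
```

A crawler could fetch one URL per group instead of every alias, which is
the kind of saving the abstract describes.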

Speaker: Hetal Thakkar
Title: Supporting Knowledge Discovery in Data Stream Management Systems
Date: April 20, 2007
Time: 12:30-1:15pm
Room: BH 4549
Abstract
Data mining represents an exciting and vibrant area of research. In 
particular, on-line mining has gained significant momentum in recent 
years. The changing data characteristics and real-time response 
constraints of streaming data preclude the use of existing mining 
algorithms that were designed for stored datasets. Therefore, 
researchers are proposing new fast and light algorithms for on-line 
mining tasks, such as classification, clustering, frequent itemsets, 
pattern matching, and many others. Beyond the interesting research 
problems posed by the design of individual algorithms for different 
mining methods, there remains the issue that these must be integrated 
into an Inductive Data Stream Mining System that supports (i) libraries 
of interoperable mining methods, (ii) all essential functions of data 
stream management systems, such as continuous query optimization, load 
shedding, synoptic constructs, and non-stop computing, and (iii) ease of use
and extensibility. The design of such an Inductive Data Stream Mining
System has received little attention in the past, and thus offers a
pristine opportunity for research contributions in my thesis work.
Furthermore, there will be an opportunity for advancing the state of 
the art in mining algorithms, particularly in terms of extensibility, 
interoperability, and genericity of data representation. In this 
prospectus, I briefly discuss data stream management systems, data
mining methods and systems, and other areas closely related to the 
topic of this thesis. Then, I present recently obtained preliminary 
results, including data representations for achieving generic 
implementations for many on-line mining algorithms. Furthermore, I 
discuss advanced techniques such as ensemble-based bagging and 
boosting, for which generic implementations were also devised. 
Finally, preliminary experiments are presented to verify the 
efficiency of the proposed approach. Future work will focus on more 
experiments and fine tuning of mining algorithms. Furthermore, support 
for high-level mining languages also represents a promising topic for 
future research.

Speaker: Yijian Bai
Title: Data Stream Processing and Query Optimization Techniques
Date: May 25, 2007
Time: 12:30-1:15pm
Room: BH 4549
Abstract
Many modern applications require continuous processing of massive
data streams, which presents research challenges on multiple fronts.
One of the critical challenges is to build Data Stream Management
Systems (DSMSs) that provide users with a system-sponsored
querying interface and services. A DSMS must efficiently support (i)
multiple continuous queries that reside in the system, with (ii)
complex query plans, on (iii) massive and bursty data streams, and
produce (iv) fast, potentially near-real-time, responses to the user
upon each data tuple arrival. Existing query processing/optimization
techniques for DataBase Management Systems (DBMSs) are not
sufficient for such requirements. In this work we present the query
scheduling framework and timestamp-based optimization mechanism used
in the Stream Mill DSMS. Our flexible query-execution model enables
on-demand timestamp generation and propagation, which
efficiently solves the idle-waiting problem to improve query
response time. Furthermore, our weight-based scheduling framework
allows us to treat latency/memory optimization uniformly, and
explicitly control the latency/memory tradeoff. We also extend
existing scheduling algorithms to handle
complex topologies of the query graph.

Query languages for data streams have to support stream-specific
constructs, such as sliding windows. We analyze a noise-elimination
problem for RFID (Radio Frequency
IDentification) data, and present an efficient, time-order
preserving solution based on sliding windows, which requires
incremental state updates upon each tuple arrival and expiration. Then
we present the integration of sliding windows with arbitrary User
Defined Aggregates (UDAs) in the Expressive Stream Language (ESL).
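
To make the idea of incremental state updates on tuple arrival and
expiration concrete, here is a minimal sketch of a time-based sliding
window that keeps a per-tag reading count. The smoothing rule (accept a
tag only after it has been read at least min_count times within the
window), the class name, and the parameters are assumptions chosen for
illustration; they are not the Stream Mill or ESL implementation.

```python
from collections import Counter, deque


class SlidingWindowFilter:
    """Hypothetical RFID noise filter over a time-based sliding window."""

    def __init__(self, window, min_count):
        self.window = window          # window length, in timestamp units
        self.min_count = min_count    # readings needed to accept a tag
        self.readings = deque()       # (timestamp, tag_id) in arrival order
        self.counts = Counter()       # per-tag count inside the window

    def add(self, timestamp, tag_id):
        """Process one reading; return True if the tag is accepted."""
        # Expiration: drop readings that fell out of the window and
        # decrement their counts incrementally.
        while self.readings and self.readings[0][0] <= timestamp - self.window:
            _, old_tag = self.readings.popleft()
            self.counts[old_tag] -= 1
            if self.counts[old_tag] == 0:
                del self.counts[old_tag]
        # Arrival: record the new reading and update its count.
        self.readings.append((timestamp, tag_id))
        self.counts[tag_id] += 1
        return self.counts[tag_id] >= self.min_count
```

Because readings are stored and expired in arrival order, the state is
updated in constant amortized time per tuple rather than by rescanning
the whole window.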
