Presenter: Hyun Jin Moon
Title: Semantic Adaptation of Schema Mappings when Schemas Evolve
Authors: Cong Yu (Dept. of EECS, Univ. of Michigan) and Lucian Popa (IBM Almaden Research Center)
Published: VLDB 2005
Date: April 6, 2007
Time: 12:30-1:15pm
Room: BH 4549
Abstract
Schemas evolve over time to accommodate the changes in the information they represent. Such evolution causes invalidation of various artifacts depending on the schemas, such as schema mappings. In a heterogeneous environment, where cooperation among data sources depends essentially upon them, schema mappings must be adapted to reflect schema evolution. In this study, we explore the mapping composition approach for addressing this mapping adaptation problem. We study the semantics of mapping composition in the context of mapping adaptation and compare our approach with the incremental approach of Velegrakis et al. [21]. We show that our method is superior in terms of capturing the semantics of both the original mappings and the evolution. We design and implement a mapping adaptation system based on mapping composition as well as additional mapping pruning techniques that significantly speed up the adaptation. We conduct comprehensive experimental analysis and show that the composition approach is practical in various evolution scenarios. The mapping language that we consider is a nested relational extension of the second-order dependencies of Fagin et al. [7]. Our work can also be seen as an implementation of the mapping composition operator of the model management framework.

Speaker: Uri Schonfeld
Title: Do Not Crawl in the DUST: Different URLs with Similar Text
To be presented at: WWW'07
Date: April 13, 2007
Time: 12:30-1:15pm
Room: BH 4549
Abstract
We consider the problem of DUST: Different URLs with Similar Text. Such duplicate URLs are prevalent in web sites, as web server software often uses aliases and redirections, and dynamically generates the same page from various different URL requests. We present a novel algorithm, DustBuster, for uncovering DUST; that is, for discovering rules that transform a given URL to others that are likely to have similar content. DustBuster mines DUST effectively from previous crawl logs or web server logs, without examining page contents. Verifying these rules via sampling requires fetching only a few actual web pages. Search engines can benefit from information about DUST to increase the effectiveness of crawling, reduce indexing overhead, and improve the quality of popularity statistics such as PageRank.

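As a rough illustration of how a crawler might use DUST rules once they are known, the following minimal sketch applies substring-substitution rules to map duplicate URLs to one canonical form. It does not reproduce DustBuster's rule mining; the rules in `RULES`, the function names, and the example URLs are all hypothetical and chosen only for illustration.

```python
# Hypothetical substitution rules (old substring -> new substring).
# These are illustrative assumptions, not rules reported by the authors.
RULES = [
    ("/index.html", "/"),
    ("http://www.", "http://"),
]


def canonicalize(url, rules=RULES):
    """Apply each substring-substitution rule once, in order."""
    for old, new in rules:
        url = url.replace(old, new, 1)
    return url


def group_duplicates(urls, rules=RULES):
    """Group URLs that map to the same canonical form under the rules."""
    groups = {}
    for url in urls:
        groups.setdefault(canonicalize(url, rules), []).append(url)
    return groups


if __name__ == "__main__":
    urls = [
        "http://www.example.com/index.html",
        "http://example.com/",
        "http://example.com/about",
    ]
    for canonical, members in group_duplicates(urls).items():
        print(canonical, members)
```

Under these assumed rules, a crawler would fetch one URL per group rather than every alias; in practice each rule would still need to be validated by sampling pages, as the abstract notes.
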
Speaker: Hetal Thakkar
Title: Supporting Knowledge Discovery in Data Stream Management Systems
Date: April 20, 2007
Time: 12:30-1:15pm
Room: BH 4549
Abstract
Data mining represents an exciting and vibrant area of research. In particular, on-line mining has gained significant momentum in recent years. The changing data characteristics and real-time response constraints of streaming data preclude the use of existing mining algorithms that were designed for stored datasets. Therefore, researchers are proposing new fast and light algorithms for on-line mining tasks, such as classification, clustering, frequent itemsets, pattern matching, and many others. Beyond the interesting research problems posed by the design of individual algorithms for different mining methods, there remains the issue that these must be integrated into an Inductive Data Stream Mining System that supports (i) libraries of interoperable mining methods, (ii) all essential functions of data stream management systems, such as continuous query optimization, load shedding, synoptic constructs, and non-stop computing, and (iii) ease-of-use and extensibility. The issue of an Inductive Data Stream Mining System has received little attention in the past, and thus offers a pristine opportunity for research contributions in my thesis work. Furthermore, there will be an opportunity for advancing the state of the art in mining algorithms, particularly in terms of extensibility, interoperability, and genericity of data representation. In this prospectus, I briefly discuss data stream management systems, data mining methods and systems, and other areas closely related to the topic of this thesis. Then, I present recently obtained preliminary results, including data representations for achieving generic implementations for many on-line mining algorithms. Furthermore, I discuss advanced techniques such as ensemble-based bagging and boosting, for which generic implementations were also devised. Finally, preliminary experiments are presented to verify the efficiency of the proposed approach. Future work will focus on more experiments and fine tuning of mining algorithms. Furthermore, support for high-level mining languages also represents a promising topic for future research.

Speaker: Yijian Bai
Title: Data Stream Processing and Query Optimization Techniques
Date: May 25, 2007
Time: 12:30-1:15pm
Room: BH 4549
Abstract
Many modern applications require continuous processing of massive data streams, which presents research challenges on multiple fronts. One of the critical challenges is to build Data Stream Management Systems (DSMSs) that provide users with a system-sponsored querying interface and services. A DSMS must efficiently support (i) multiple continuous queries that reside in the system, with (ii) complex query plans, on (iii) massive and bursty data streams, and produce (iv) fast, potentially near-realtime, responses to the user upon each data tuple arrival. Existing query processing/optimization techniques for DataBase Management Systems (DBMSs) are not sufficient for such requirements. In this work we present the query scheduling framework and timestamp-based optimization mechanism used in the Stream Mill DSMS. Our flexible query-execution model enables on-demand timestamp generation and propagation, which efficiently solves the idle-waiting problem to improve query response time. Furthermore, our weight-based scheduling framework allows us to treat latency/memory optimization uniformly, and explicitly control the latency/memory tradeoff. We also extend existing scheduling algorithms to handle complex topologies of the query graph. Query languages for data streams have to support stream-specific constructs, such as sliding windows. We analyze a noise-elimination problem for RFID (Radio Frequency IDentification) data, and present an efficient, time-order preserving solution based on sliding windows, which requires incremental state-update upon each tuple arrival and expiration. Then we present the integration of sliding windows with arbitrary User Defined Aggregates (UDAs) in the Expressive Stream Language (ESL).
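
To make the idea of incremental state update on tuple arrival and expiration concrete, here is a minimal sketch of a time-based sliding-window count, used to filter sporadic RFID readings. It is not the speaker's algorithm or ESL syntax: the class name, the `min_count` threshold rule, and the sample readings are assumptions made purely for illustration.

```python
from collections import Counter, deque


class SlidingWindowCounter:
    """Per-tag reading counts over the last `width` time units."""

    def __init__(self, width):
        self.width = width
        self.events = deque()    # (timestamp, tag) in arrival order
        self.counts = Counter()  # tag -> readings currently in the window

    def add(self, timestamp, tag):
        # Expiration: drop readings that fell out of the window and
        # decrement their counts instead of recomputing from scratch.
        while self.events and self.events[0][0] <= timestamp - self.width:
            _, old_tag = self.events.popleft()
            self.counts[old_tag] -= 1
            if self.counts[old_tag] == 0:
                del self.counts[old_tag]
        # Arrival: record the new reading and increment its count.
        self.events.append((timestamp, tag))
        self.counts[tag] += 1
        return self.counts[tag]


def smooth(readings, width, min_count):
    """Keep a reading only if its tag was seen at least `min_count` times
    within the window (an assumed noise rule); output stays in time order."""
    window = SlidingWindowCounter(width)
    return [(ts, tag) for ts, tag in readings if window.add(ts, tag) >= min_count]


if __name__ == "__main__":
    readings = [(0, "A"), (1, "A"), (2, "B"), (10, "A"), (11, "A")]
    print(smooth(readings, width=5, min_count=2))
```

Each reading is appended once and removed once, so the per-tuple cost is amortized constant, and readings are emitted in the same order they arrive.
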