CS240B, Spring 2013. Instructor: Carlo Zaniolo


Data Stream Management Systems: Supporting Data Stream Mining Applications

In the age of the Internet, massive amounts of information are continuously exchanged as data streams that are then processed by on-line applications of increasing complexity. For such advanced applications, a store-now, process-later approach cannot be used because of real-time (or quasi-real-time) requirements and excessive data rates. Therefore, current research seeks to develop a new generation of information management systems, called Data Stream Management Systems (DSMS), that can support complex applications on massive data streams with Quality of Service (QoS) guarantees. This work has produced novel techniques, research prototypes, startup companies, and the successful deployment of DSMS in many applications, including network traffic analysis, transaction log analysis, intrusion detection, credit-card fraud detection, click stream analysis, and algorithmic trading.

Since many such applications involve both streaming data and stored data, the approach taken by most DSMS is to express continuous queries on data streams using extensions of SQL. But significant changes in the language and its implementation are needed, since DSMS must support persistent queries on ordered streams of transient tuples, instead of the transient queries on unordered sets of persistent tuples supported by relational DBMS. In particular, only monotonic queries and non-blocking operators can be used. Also, the unbounded streams must be represented by synopses, such as windows containing the most recent tuples in the streams. Thus the semantics of basic operators such as joins and aggregates must be revised for windows. At the implementation level, new query optimization techniques seek to minimize response time and memory utilization. Load shedding techniques based on samples and sketches are used to achieve QoS under overload conditions. The first part of the course will cover these techniques and the architectures of the main DSMS.
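
To make the notion of a window synopsis concrete, here is a minimal Python sketch of a non-blocking aggregate over a count-based sliding window. It is only an illustration and is not taken from any particular DSMS: the class name `SlidingWindowAverage`, its `push` method, and the sample input values are all hypothetical. The point is that each arriving tuple produces a result immediately, and memory use is bounded by the window size rather than by the length of the stream.

```python
from collections import deque


class SlidingWindowAverage:
    """Hypothetical example: average over the most recent `size` values."""

    def __init__(self, size):
        if size <= 0:
            raise ValueError("window size must be positive")
        self.size = size
        self.window = deque()  # holds at most `size` recent values
        self.total = 0.0       # running sum of the values in the window

    def push(self, value):
        """Add one tuple and return the current windowed average.

        The result is emitted as soon as the tuple arrives (non-blocking);
        the oldest value is evicted once the window is full.
        """
        self.window.append(value)
        self.total += value
        if len(self.window) > self.size:
            self.total -= self.window.popleft()
        return self.total / len(self.window)


# Usage: feed a short stream of values through a window of 3 tuples.
avg = SlidingWindowAverage(size=3)
results = [avg.push(v) for v in [10, 20, 30, 40]]
print(results)
```

A time-based window (for example, the last 10 minutes) would evict by timestamp rather than by count, but the same incremental pattern of adding the new tuple and expiring old ones applies.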

The second part of the course will focus on the data stream mining problem, which is a vibrant area of new research. Past work concentrated on devising data mining algorithms that (i) are fast and light enough for on-line applications, and (ii) can cope with the concept shifts and drifts that are often present in data streams. However, integrating mining primitives into an SQL-based environment is a very difficult problem in its own right, as demonstrated by the very slow progress made by DBMS on this issue. Thus, we first discuss efficient algorithms proposed for the mining tasks of classification, association, clustering, and sequential pattern detection on data streams; then, we explore alternative approaches for integrating them into DSMS.

Tentative Schedule

Grade Basis