CS240B Spring 2013 - Instructor: Carlo Zaniolo

Data Stream Management Systems: Supporting Data Stream Mining Applications

In the age of the Internet, massive amounts of information are continuously exchanged as data streams that are then processed by on-line applications of increasing complexity. For such advanced applications, a store-now, process-later approach cannot be used because of real-time (or quasi-real-time) requirements and excessive data rates. Therefore, current research seeks to develop a new generation of information management systems, called Data Stream Management Systems (DSMS), that can support complex applications on massive data streams with Quality of Service (QoS) guarantees. This work has produced novel techniques, research prototypes, startup companies, and the successful deployment of DSMS in many applications, including network traffic analysis, transaction log analysis, intrusion detection, credit-card fraud detection, click stream analysis, and algorithmic trading.

Since many such applications involve both streaming data and stored data, the approach taken by most DSMS is to express continuous queries on data streams using extensions of SQL. But significant changes in the language and its implementation are needed, since DSMS must support persistent queries on ordered streams of transient tuples, instead of the transient queries on unordered sets of persistent tuples of relational DBMS. In particular, only monotonic queries and non-blocking operators can be used, because a blocking operator cannot return any result until it has seen its whole input, and a stream never ends. Also, the unbounded streams must be represented by synopses, such as windows containing the most recent tuples in the streams. Thus the semantics of basic operators such as joins and aggregates must be revised for windows. At the implementation level, new query optimization techniques seek to minimize response time and memory utilization. Load shedding techniques based on samples and sketches are used to achieve QoS under overload conditions. The first part of the course will cover these techniques and the architectures of the main DSMS.
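As an illustration of how an aggregate can be redefined over a window so that it becomes non-blocking, here is a minimal Python sketch of a count-based sliding-window average. It is not taken from any particular DSMS; the function name sliding_average and the parameter window_size are hypothetical, chosen only for this example.

```python
from collections import deque
from typing import Iterable, Iterator


def sliding_average(stream: Iterable[float], window_size: int) -> Iterator[float]:
    """Yield the average of the most recent `window_size` values after each arrival.

    Hypothetical helper for illustration: the window is the synopsis that
    stands in for the unbounded stream.
    """
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    window = deque()  # holds at most `window_size` recent values
    total = 0.0       # running sum of the values currently in the window
    for value in stream:
        window.append(value)
        total += value
        if len(window) > window_size:
            # Expire the oldest tuple so memory stays bounded.
            total -= window.popleft()
        # Emit a result immediately: the operator never waits for end of input.
        yield total / len(window)


if __name__ == "__main__":
    for avg in sliding_average([1, 2, 3, 4], window_size=2):
        print(avg)
```

Because a result is produced on every arrival and only window_size values are retained, the operator is non-blocking and uses bounded memory, in contrast to an average over the entire stream, which could only be returned once the stream had ended.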

The second part of the course will focus on the data stream mining problem, which represents a vibrant area of new research. Past work concentrated on devising data mining algorithms that (i) are fast and light enough for on-line applications, and (ii) can cope with the concept shifts and drifts that are often present in data streams. However, integrating mining primitives into an SQL-based environment represents a very difficult problem on its own, as demonstrated by the very slow progress made by DBMS on this issue. Thus, we first discuss efficient algorithms proposed for the mining tasks of classification, association, clustering and sequential pattern detection on data streams; then, we explore alternative approaches to integrate them into DSMS.
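To make the notion of concept drift concrete, the following Python sketch flags a possible drift when a classifier's error rate over a recent window rises well above its error rate over the whole stream so far. This is a deliberately simple heuristic written for illustration, not one of the published algorithms covered in the course; the class name WindowedDriftMonitor and its parameters window_size and threshold are assumptions.

```python
from collections import deque


class WindowedDriftMonitor:
    """Hypothetical drift heuristic: compare recent and overall error rates."""

    def __init__(self, window_size: int = 100, threshold: float = 0.2):
        self.recent = deque(maxlen=window_size)  # last `window_size` outcomes (1 = error)
        self.errors = 0                          # errors seen over the whole stream
        self.seen = 0                            # predictions seen over the whole stream
        self.threshold = threshold               # allowed gap between the two rates

    def update(self, was_error: bool) -> bool:
        """Record one prediction outcome; return True if drift is suspected."""
        outcome = 1 if was_error else 0
        self.recent.append(outcome)  # deque drops the oldest outcome when full
        self.errors += outcome
        self.seen += 1
        if len(self.recent) < self.recent.maxlen:
            return False  # not enough recent history yet
        recent_rate = sum(self.recent) / len(self.recent)
        overall_rate = self.errors / self.seen
        return recent_rate - overall_rate > self.threshold


if __name__ == "__main__":
    monitor = WindowedDriftMonitor(window_size=4, threshold=0.25)
    outcomes = [False] * 8 + [True] * 4  # classifier starts failing after 8 tuples
    print([monitor.update(o) for o in outcomes])
```

The monitor processes each outcome in constant memory and a single pass, which is the kind of constraint point (i) above imposes on stream mining algorithms.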

Tentative Schedule

Week 1: Introduction to DSMS

Continuous query languages
Aggregates and blocking operators
Timestamp management
Language design: windows, slides, and tumbles
New OLAP functions and Windows in SQL:2003
Support for Sequence Queries

Week 2: DSMS Architecture and Implementation

Load shedding
Execution Models
Optimization
Supporting XML Streams

Week 3: Systems for Mining Data Bases and Data Streams

Weka and other DM platforms
Toward a Data Stream Mining Workbench: SMM

Week 4: Mining Data Streams and Data Bases

Classification and Concept Shift
Classifier Ensembles: Boosting versus bagging
Clustering and Outlier Detection

Week 5: Mining Data Streams and Data Bases

Association Rules detection and verification
Time Series Analysis
New Methods and Applications

Week 6: Inductive DBMS and DSMS

Data Mining Query Languages and Inductive Databases
Support for DM in Commercial DBMS
Dedicated DM systems
Inductive DSMS

Weeks 8, 9, 10: Research Paper Presentations by Students

Grade Basis

2 Homeworks: 4% each
2 Projects: 8% each
Midterm: 30%
Presentation: 16%
Final Project and Report: 30%