|
CS240B
Winter 2015
Syllabus
Notes
Home
|
----------------------------------------------------------------------------------------------------------------
MW 16:00--17:50, Winter 2015
CS240B:ADVANCED DATA BASES and KNOWLEDGE BASES
Data
Stream Management Systems:
Supporting Data Stream Mining Applications
Instructor:
Carlo
Zaniolo
Office Hours: Tuesday, 15:00--17:50
First
Assignement (Due on Wednesday January 21)
- Study the following papers:
- New SQL OLAP Functions for
everyone, by K.M. Guion.
- B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom:
Models
and Issues in Data Stream Systems. PODS 2002.
- SQL-MR
.
- Yan-Nei Law, Haixun Wang, Carlo Zaniolo: Relational
Languages and Data Models for Continuous Queries on
Sequences and Data Streams, ACM Transactions on
Database Systems, Vol. 36, Issue 2, 2011.
- Yijian Bai and Carlo Zaniolo: Minimizing
Latency and Memory in DSMS: A Unified Approach to
Quasi-Optimal Scheduling . The Second International
Workshop on Scalable Stream Processing Systems, March 29,
2008, Nantes, France
- Complete the following
Tasks:
- Task 1.1: Write a four page
(about) report on SQL-MR focusing on the things that you can
do with SQL-MR that cannot be done in SQL-TS, and the main
changes between the original 2007 report and the 2010
revised version.
- Task 1.2: Assume a stream of
temporal tuples emp(Eno,
Sal, Dept, From, To) that arrive ordered by their
timestamp To
and you project out Dept.
Now, you must perform a temporal temporal coalescing on
Sal. Write a UDA to perform the coalescing and return
only maximal tuples. Is this aggregate blocking or
non-blocking or something in between? (Use the ESL syntax from
the notes.)
- Task 1.3: Assume that you have the temporal
relation emp(Eno,
Sal, Dept, From, To) and you project out Dept.
Now you must perform a temporal temporal coalescing on
Sal. Can you express that using SQL OLAP functions?
(this is difficult, you might have to search the literature
for a solution).
-------------------------------------------------------------------------------------------------
Second Assignement (Due on
Monday, February 2)
- Study the following
papers:
- Yijian Bai and Carlo Zaniolo: Minimizing
Latency and Memory in DSMS: A Unified Approach to
Quasi-Optimal Scheduling . The Second International
Workshop on Scalable Stream Processing Systems, March 29,
2008, Nantes, France.
- Y. Bai, H. Thakkar, H. Wang, and C. Zaniolo. Time-stamp
Management and QueryExecution in Data Stream Management
Systems. IEEE Internet Computing, 12(6): 2008.
- P. Flajolet and N.G. Martin. Probabilistic counting
algorithms for data base applications, Journal of Computer
and System Sciences 31(2), pp 182-209, 1985. http://algo.inria.fr/flajolet/Publications/FlMa85.pdf
- Complete the following
Tasks: See Solutions
- Task 2.1: We
showed that binary operators, such as union and joins, are
subject to idle-wating problem. But this problem can also
occur in some unary operators. Say for instance that an
aggregate is called on a time-stamped tumble window (or on a
timestamped window with slides). Discuss this situation, its
idle-waiting problem, and how it can be solved (using
concepts and notation from the paper by Bai and Zaniolo
above ).
- Task 2.2: Propose a simple
generalization of the chain algorithm to minimize memory for
the following situations:
(A) Two simple paths without forks,
(B) a simple tree where a path forks out into a pair of paths.
(You should use concepts and notation from paper 1 above.)
- Task 2.3: Using a syntax based on that of notes
and reference 3 above, express a user-defined aggregate
d_count to perform the exact count of distinct values in a
window on a data stream. Your window aggregate could, e.g.,
be called as follows:
SELECT col_name1,
d_count(col_name2) OVER (ROWS 99999 PRECEDING) FROM
my_stream
-------------------------------------------------------------------------------------------------
Third Assignement (Due
on Monday, February 9)
- Study the following
papers:
- Jure Leskovec, Anand Rajaraman, Jeff Ullman, Mining
Data Streams - The Stanford University InfoLab. You
only need to study sections 4.3 and 4.4. But
the rest is also interesting.
http://infolab.stanford.edu/~ullman/mmds/ch4.pdf
- Study: Chapter 3 (A Survey of Classification Methods in
Data Streams) from DATA STREAMS: MODELS AND ALGORITHMS
DATA STREAMS. CHARU C. AGGARWAL editor, Kluwer Academic
Publishers.
Chapter 3 is at pages 39-60 of
the following pdf document.
- Complete the following
Tasks:
- Task 3.1: Express the
Flajolet-Martin's distinct_count sketch as a user-defined
aggregate mamed dcount_sketch, to be called in the same way
as d_count. You can assume that you have available a
function LmostbitH(X) that return K, where the K position
contains a 1, whereas all the position to its right are
zeros, for the value returned by a randomizing hash function
H(X).
- Task 3.2: Write the window
UDA to compute the standard deviation on a data stream.
Explain the math formulas you use in the expire of the
window UDA---you might want to consult textbooks or the
web to find the best formula.
- Task 3.3: Assume
that you have a stream of temperature readings
temperature(Celsius Integer)
that start everyday at time 00:01 and end at time
23:59. At the end of each day, we want to have
10,000 temperature samples stored into a table
tenKsamples(Rowno
integer, Celsius Integer). We do not know how
many temperature readings are going to arrive every day,
except that their number is significantly larger than
10,000. Please write a UDA that uses the reservoir
algorithm to populate tenKsamples(Rowno
, Celsius) with 10,000 random samples taken from
temperature(Celsius
Integer), which is then processed and reset to
empty at midnight.
You can assume that
the system support a function random(K),
which given a positive integer K
returns a random integer between
1 and K.
------------------------------------------------------------------------------------------------------------------
Presentations: suggested papers on
Data Stream Management Systems and
on Mining Data Streams.
-------------------------------------------------------------------------------------------------------------------
Project: Implementing a Data
Stream Mining Algorithm on a DSMS of your choosing.
---------------------------------------------------------------------------------------------------------------------
Deadline: you must deliver one of the two above by March
2, and the other by March 9.
The order is up to you.
|