|
CS240B
Fall 2018
Syllabus
Notes
Home
Talks
|
----------------------------------------------------------------------------------------------------------------
Tuesday and Thursday 14:00--15:50
Fall 2018
CS240B:ADVANCED DATA BASES and KNOWLEDGE BASES
Data
Stream Management Systems:
Supporting Data Stream Mining Applications
Instructor:
Carlo
Zaniolo
Office Hours: Tuesdays and Thursdays,
16:00--17:00.
First
Assignement (Due on Tuesday Oct. 9) Some answers for
2.1, 2.2
- Study the following papers:
- Complete the following tasks and turn them in via
ccle:
- Task 1.1: Write a four page
(about) report on SQL-MR focusing on the things that you can
do with SQL-MR that cannot be done in SQL-TS,
- Task 1.2: Some of the new
OLAP functions support windows others do not. List those in
the two groups, and for those in the second group, suggest a
general implementation for (i) unlimited preceding windows,
and (ii) physical windows.
-------------------------------------------------------------------------------------------------
Second
Assignement (Due on Tuesday Oct. 16) Answers
for 2.1,
and 2.2.
- Study the following papers:
- Y.
Bai, H. Thakkar, C. Luo, H. Wang and CZaniolo: A
Data Stream Language and System Designed for Power and
Extensibility. (CIKM'06),
Arlington, Virginia, USA,
- H.
Thakkar, N. Laptev, H. Mousavi, B. Mozafari, V. Russo
and C. Zaniolo: SMM:
A data stream management system for Knowledge Discovery. ICDE
2011.
- Complete the following tasks and turn them in via
ccle:
- Task 2.1: Using
a syntax based on that of notes and the two reference above,
express a user-defined aggregate d_count to perform the exact
count of distinct values in a window on a data stream. Your
window aggregate could, e.g., be called as follows:
SELECT col_name1, d_count(col_name2) OVER (ROWS 99999
PRECEDING) FROM my_stream
- Task 2.2: Using the same
syntax, write a UDA that implements the RANK function (without
window or with a unlimited-preceding window). Then extend
this definition to support windows.
------------------------------------------------------------------------------------------------
Third Assignement (Due
on Tuesday, Oct. 23) Answers
for 3.1, 3.2,
and 3.3.
- Study the following
papers:
- Yijian Bai and Carlo Zaniolo: Minimizing
Latency and Memory in DSMS: A Unified Approach to
Quasi-Optimal Scheduling . The Second International
Workshop on Scalable Stream Processing Systems, March 29,
2008, Nantes, France.
- Y. Bai, H. Thakkar, H. Wang, and C. Zaniolo. Time-stamp
Management and QueryExecution in Data Stream Management
Systems. IEEE Internet Computing, 12(6): 2008.
- Jure Leskovec, Anand Rajaraman, Jeff Ullman, Mining Data
Streams - The Stanford University InfoLab. You
only need to study sections 4.3 and 4.4. But the
rest is also interesting.
http://infolab.stanford.edu/~ullman/mmds/ch4.pdf
- Complete the following
Tasks:
- Task 3.1: We
showed that binary operators, such as union and joins, are
subject to idle-wating problem. But this problem can also
occur in some unary operators. Say for instance that an
aggregate is called on a time-stamped tumble window (or on a
timestamped window with slides). Discuss this situation, its
idle-waiting problem, and how it can be solved (using
concepts and notation from the paper by Bai and Zaniolo
above ).
- Task 3.2:
Propose a simple generalization of the chain algorithm to
minimize memory for the following situations:
(A) Two or more simple paths without forks,
- Task 3.3:
Write the window UDA to compute the standard deviation on a
data stream. Explain the math formulas you use in the expire
of the window UDA---you might want to consult textbooks or
the web to find the best formula.
-------------------------------------------------------------- -----------------------------------
Fourth Assignement (Due on Tuesday Oct. 30) Answers
for
4.1,
and
4.2.
- Study the following
papers:
- Complete the following
Tasks:
- Task 4.1: Express
the Flajolet-Martin's distinct_count sketch as a user-defined
aggregate mamed dcount_sketch, to be called in the same way as
d_count. You can assume that you have available a function
LmostbitH(X) that return K, where the K position contains a 1,
whereas all the position to its right are zeros, for the value
returned by a randomizing hash function H(X).
- Task 4.2: Assume that you
have a stream of temperature readings temperature(Celsius
Integer) that
start everyday at time 00:01 and end at time 23:59. At the
end of each day, we want to have 10,000 temperature samples
stored into a table tenKsamples(Rowno
integer, Celsius Integer). We
do not know how many temperature readings are going to
arrive every day, except that their number is significantly
larger than 10,000. Please write a UDA that uses the reservoiralgorithm
to populate tenKsamples(Rowno
, Celsius) with
10,000 random samples taken fromtemperature(Celsius
Integer), which is then processed and reset to
empty at midnight. You
can assume that the system support a function random(K), which
given a positive integer K returns
a random integer between 1 and K.
-----------------------------------------------------------------------------------------------------
Project
Assigned on October 30--Due on November 20 (strict
deadline).
-----------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------
Presentations
(individual or by pairs). November 15--December 6.
Possible topics.
For additional references see here
and here.
Current
schedule.
-----------------------------------------------------------------------------------------------------
Take home final
project. Research Report due on or before December 17.
It should be done individually (rather than in
pairs.)
-----------------------------------------------------------------------------------------------------
Holidays
|