|
CS240B
Winter 2017
Syllabus
Notes
Home
Talks |
----------------------------------------------------------------------------------------------------------------
TuTh: 14:00--15:50
Fall 2018
CS240B:ADVANCED DATA BASES and KNOWLEDGE BASES
In Data
Stream Management Systems:
Supporting Data Stream Mining Applications
Instructor:
Carlo
Zaniolo
Office Hours: Tuesday, 16:00--18:00
Midterm
on TBA
First
Assignement (Due on ?)
- Study the following papers:
- Complete the following tasks and turn them in via
ccle:
- Task 1.1: Write a four page
(about) report on SQL-MR focusing on the things that you can
do with SQL-MR that cannot be done in SQL-TS,
- Task 1.2: Some of the new
OLAP functions support windows others do not. List those in
the two groups, and for those in the second group outline the
various problems that will have to be solved to support
(i) unlimited preceding windows, and (ii) physical windows.
Whenever simple implementation exist express them by the
SQL-defined UDAs described
in the notes.
-------------------------------------------------------------------------------------------------
Second
Assignement (Due on ?)
- Study the following papers:
- Y.
Bai, H. Thakkar, C. Luo, H. Wang and CZaniolo: A
Data Stream Language and System Designed for Power and
Extensibility.
(CIKM'06),
Arlington, Virginia, USA,
- H.
Thakkar, N. Laptev, H. Mousavi, B. Mozafari, V. Russo
and C. Zaniolo: SMM:
A data stream management system for Knowledge Discovery. ICDE
2011.
- Complete the following tasks and turn them in via
ccle: Answers
for 2.1,
and 2.2.
- Task 2.1: Using a syntax
based on that of notes and the two reference above, express a
user-defined aggregate d_count to perform the exact count of
distinct values in a window on a data stream. Your window
aggregate could, e.g., be called as follows:
SELECT col_name1, d_count(col_name2) OVER (ROWS 99999
PRECEDING) FROM my_stream
- Task 2.2: Using the same
syntax, write a UDA that implements the RANK function (without
window or with a unlimited-preceding window). Then extend
this definition to support windows.
-----------------------------------------------------------------------------------------------------
Third Assignement (Due
on ?)
- Study the following
papers:
- Yijian Bai and Carlo Zaniolo: Minimizing
Latency and Memory in DSMS: A Unified Approach to
Quasi-Optimal Scheduling . The Second International
Workshop on Scalable Stream Processing Systems, March 29,
2008, Nantes, France.
- Y. Bai, H. Thakkar, H. Wang, and C. Zaniolo. Time-stamp
Management and QueryExecution in Data Stream Management
Systems. IEEE Internet Computing, 12(6): 2008.
- Jure Leskovec, Anand Rajaraman, Jeff Ullman, Mining Data
Streams - The Stanford University InfoLab. You
only need to study sections 4.3 and 4.4. But the
rest is also interesting.
http://infolab.stanford.edu/~ullman/mmds/ch4.pdf
- Complete the following
Tasks: Answers for
3.1, 3.2,
and 3.3.
- Task 3.1: We
showed that binary operators, such as union and joins, are
subject to idle-wating problem. But this problem can also
occur in some unary operators. Say for instance that an
aggregate is called on a time-stamped tumble window (or on a
timestamped window with slides). Discuss this situation, its
idle-waiting problem, and how it can be solved (using
concepts and notation from the paper by Bai and Zaniolo
above ).
- Task 3.2:
Propose a simple generalization of the chain algorithm to
minimize memory for the following situations:
(A) Two simple paths without forks,
(B) a simple tree where a path forks out into a pair of paths.
(You should use concepts and notation from the papers above.)
- Task 3.3:
Write the window UDA to compute the standard deviation on a
data stream. Explain the math formulas you use in the expire
of the window UDA---you might want to consult textbooks or
the web to find the best formula.
Fourth Assignement (due
on?)
- Study the following
papers:
- Complete the following
Tasks:
- Task 4.1: Express the
Flajolet-Martin's distinct_count sketch as a user-defined
aggregate mamed dcount_sketch, to be called in the same way as
d_count. You can assume that you have available a function
LmostbitH(X) that return K, where the K position contains a 1,
whereas all the position to its right are zeros, for the value
returned by a randomizing hash function H(X).
- Task 4.2: Assume that you
have a stream of temperature readings
temperature(Celsius Integer) that start everyday at
time 00:01 and end at time 23:59. At the end of each day,
we want to have 10,000 temperature samples stored into a
table tenKsamples(Rowno
integer, Celsius Integer). We do not know how
many temperature readings are going to arrive every day,
except that their number is significantly larger than
10,000. Please write a UDA that uses the reservoir
algorithm to populate tenKsamples(Rowno
, Celsius) with 10,000 random samples taken from
temperature(Celsius
Integer), which is then processed and reset to
empty at midnight. You
can assume that the system support a function random(K),
which given a positive integer K
returns a random integer between
1 and K.
-----------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------
|