CS240B Syllabus

First Assignement (Due on Wednesday January 21)

Study the following papers:
- New SQL OLAP Functions for everyone, by K.M. Guion.
- B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom: Models and Issues in Data Stream Systems. PODS 2002.
- SQL-MR .
- Yan-Nei Law, Haixun Wang, Carlo Zaniolo: Relational Languages and Data Models for Continuous Queries on Sequences and Data Streams, ACM Transactions on Database Systems, Vol. 36, Issue 2, 2011.
- Yijian Bai and Carlo Zaniolo: Minimizing Latency and Memory in DSMS: A Unified Approach to Quasi-Optimal Scheduling . The Second International Workshop on Scalable Stream Processing Systems, March 29, 2008, Nantes, France

Complete the following Tasks:

Task 1.1: Write a four page (about) report on SQL-MR focusing on the things that you can do with SQL-MR that cannot be done in SQL-TS, and the main changes between the original 2007 report and the 2010 revised version.
Task 1.2: Assume a stream of temporal tuples emp(Eno, Sal, Dept, From, To) that arrive ordered by their timestamp To and you project out Dept. Now, you must perform a temporal temporal coalescing on Sal. Write a UDA to perform the coalescing and return only maximal tuples. Is this aggregate blocking or non-blocking or something in between? (Use the ESL syntax from the notes.)

Task 1.3: Assume that you have the temporal relation emp(Eno, Sal, Dept, From, To) and you project out Dept. Now you must perform a temporal temporal coalescing on Sal. Can you express that using SQL OLAP functions? (this is difficult, you might have to search the literature for a solution).

-------------------------------------------------------------------------------------------------

Second Assignement (Due on Monday, February 2)

Study the following papers:

Yijian Bai and Carlo Zaniolo: Minimizing Latency and Memory in DSMS: A Unified Approach to Quasi-Optimal Scheduling . The Second International Workshop on Scalable Stream Processing Systems, March 29, 2008, Nantes, France.
Y. Bai, H. Thakkar, H. Wang, and C. Zaniolo. Time-stamp Management and QueryExecution in Data Stream Management Systems. IEEE Internet Computing, 12(6): 2008.
P. Flajolet and N.G. Martin. Probabilistic counting algorithms for data base applications, Journal of Computer and System Sciences 31(2), pp 182-209, 1985. http://algo.inria.fr/flajolet/Publications/FlMa85.pdf

Complete the following Tasks: See Solutions

Task 2.1: We showed that binary operators, such as union and joins, are subject to idle-wating problem. But this problem can also occur in some unary operators. Say for instance that an aggregate is called on a time-stamped tumble window (or on a timestamped window with slides). Discuss this situation, its idle-waiting problem, and how it can be solved (using concepts and notation from the paper by Bai and Zaniolo above ).
Task 2.2: Propose a simple generalization of the chain algorithm to minimize memory for the following situations:
(A) Two simple paths without forks,
(B) a simple tree where a path forks out into a pair of paths. (You should use concepts and notation from paper 1 above.)

Task 2.3: Using a syntax based on that of notes and reference 3 above, express a user-defined aggregate d_count to perform the exact count of distinct values in a window on a data stream. Your window aggregate could, e.g., be called as follows:
SELECT col_name1, d_count(col_name2) OVER (ROWS 99999 PRECEDING) FROM my_stream

-------------------------------------------------------------------------------------------------

Third Assignement (Due on Monday, February 9)

Study the following papers:

Jure Leskovec, Anand Rajaraman, Jeff Ullman, Mining Data Streams - The Stanford University InfoLab. You only need to study sections 4.3 and 4.4. But the rest is also interesting. http://infolab.stanford.edu/~ullman/mmds/ch4.pdf
Study: Chapter 3 (A Survey of Classification Methods in Data Streams) from DATA STREAMS: MODELS AND ALGORITHMS DATA STREAMS. CHARU C. AGGARWAL editor, Kluwer Academic Publishers.
Chapter 3 is at pages 39-60 of the following pdf document.

Complete the following Tasks:

Task 3.1: Express the Flajolet-Martin's distinct_count sketch as a user-defined aggregate mamed dcount_sketch, to be called in the same way as d_count. You can assume that you have available a function LmostbitH(X) that return K, where the K position contains a 1, whereas all the position to its right are zeros, for the value returned by a randomizing hash function H(X).
Task 3.2: Write the window UDA to compute the standard deviation on a data stream. Explain the math formulas you use in the expire of the window UDA---you might want to consult textbooks or the web to find the best formula.

Task 3.3: Assume that you have a stream of temperature readings temperature(Celsius Integer)
that start everyday at time 00:01 and end at time 23:59. At the end of each day, we want to have
10,000 temperature samples stored into a table tenKsamples(Rowno integer, Celsius Integer). We do not know how many temperature readings are going to arrive every day, except that their number is significantly larger than 10,000. Please write a UDA that uses the reservoir algorithm to populate tenKsamples(Rowno , Celsius) with 10,000 random samples taken from
temperature(Celsius Integer), which is then processed and reset to empty at midnight.

You can assume that the system support a function random(K), which given a positive integer K returns a random integer between 1 and K.

------------------------------------------------------------------------------------------------------------------

Presentations: suggested papers on Data Stream Management Systems and
on Mining Data Streams.

-------------------------------------------------------------------------------------------------------------------

Project: Implementing a Data Stream Mining Algorithm on a DSMS of your choosing.

---------------------------------------------------------------------------------------------------------------------
Deadline: you must deliver one of the two above by March 2, and the other by March 9.
The order is up to you.

----------------------------------------------------------------------------------------------------------------

MW 16:00--17:50, Winter 2015

CS240B:ADVANCED DATA BASES and KNOWLEDGE BASES

First Assignement (Due on Wednesday January 21)

Second Assignement (Due on Monday, February 2)

Third Assignement (Due on Monday, February 9)