CS240B Syllabus

Study the following papers:
- B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom: Models and Issues in Data Stream Systems. PODS 2002.
- SQL-MR.
- New SQL OLAP Functions for everyone, by K.M. Guion.
Complete the following tasks and turn them in via ccle:

Task 1.1: Write a four page (about) report on SQL-MR focusing on the things that you can do with SQL-MR that cannot be done in SQL-TS,
Task 1.2: Some of the new OLAP functions support windows others do not. List those in the two groups, and for those in the second group outline the various problems that will have to be solved to support (i) unlimited preceding windows, and (ii) physical windows. Whenever simple implementation exist express them by the SQL-defined UDAs described in the notes.

-------------------------------------------------------------------------------------------------

Study the following papers:
- Y. Bai, H. Thakkar, C. Luo, H. Wang and CZaniolo: A Data Stream Language and System Designed for Power and Extensibility. (CIKM'06), Arlington, Virginia, USA,
- H. Thakkar, N. Laptev, H. Mousavi, B. Mozafari, V. Russo and C. Zaniolo: SMM: A data stream management system for Knowledge Discovery. ICDE 2011.
Complete the following tasks and turn them in via ccle: Answers for 2.1, and 2.2.

Task 2.1: Using a syntax based on that of notes and the two reference above, express a user-defined aggregate d_count to perform the exact count of distinct values in a window on a data stream. Your window aggregate could, e.g., be called as follows:
SELECT col_name1, d_count(col_name2) OVER (ROWS 99999 PRECEDING) FROM my_stream
Task 2.2: Using the same syntax, write a UDA that implements the RANK function (without window or with a unlimited-preceding window). Then extend this definition to support windows.

-----------------------------------------------------------------------------------------------------

Yijian Bai and Carlo Zaniolo: Minimizing Latency and Memory in DSMS: A Unified Approach to Quasi-Optimal Scheduling . The Second International Workshop on Scalable Stream Processing Systems, March 29, 2008, Nantes, France.
Y. Bai, H. Thakkar, H. Wang, and C. Zaniolo. Time-stamp Management and QueryExecution in Data Stream Management Systems. IEEE Internet Computing, 12(6): 2008.
Jure Leskovec, Anand Rajaraman, Jeff Ullman, Mining Data Streams - The Stanford University InfoLab. You only need to study sections 4.3 and 4.4. But the rest is also interesting. http://infolab.stanford.edu/~ullman/mmds/ch4.pdf

Task 3.1: We showed that binary operators, such as union and joins, are subject to idle-wating problem. But this problem can also occur in some unary operators. Say for instance that an aggregate is called on a time-stamped tumble window (or on a timestamped window with slides). Discuss this situation, its idle-waiting problem, and how it can be solved (using concepts and notation from the paper by Bai and Zaniolo above ).
Task 3.2: Propose a simple generalization of the chain algorithm to minimize memory for the following situations:
(A) Two simple paths without forks,
(B) a simple tree where a path forks out into a pair of paths. (You should use concepts and notation from the papers above.)
Task 3.3: Write the window UDA to compute the standard deviation on a data stream. Explain the math formulas you use in the expire of the window UDA---you might want to consult textbooks or the web to find the best formula.

Study: Chapter 3: A Survey of Classification Methods in Data Streams) from DATA STREAMS: MODELS AND ALGORITHMS DATA STREAMS. CHARU C. AGGARWAL editor, Kluwer Academic Publishers.
Chapter 3 is at pages 39-60 of the following pdf document.
N. Laptev et al. : Extendin Relational query Languages for Data Streams. pp. 361--386 of
Data StreamManagement--Processing High-Speed Data Streams. Springer, 2017.

Task 4.1: Express the Flajolet-Martin's distinct_count sketch as a user-defined aggregate mamed dcount_sketch, to be called in the same way as d_count. You can assume that you have available a function LmostbitH(X) that return K, where the K position contains a 1, whereas all the position to its right are zeros, for the value returned by a randomizing hash function H(X).
Task 4.2: Assume that you have a stream of temperature readings temperature(Celsius Integer) that start everyday at time 00:01 and end at time 23:59. At the end of each day, we want to have 10,000 temperature samples stored into a table tenKsamples(Rowno integer, Celsius Integer). We do not know how many temperature readings are going to arrive every day, except that their number is significantly larger than 10,000. Please write a UDA that uses the reservoir algorithm to populate tenKsamples(Rowno , Celsius) with 10,000 random samples taken from temperature(Celsius Integer), which is then processed and reset to empty at midnight. You can assume that the system support a function random(K), which given a positive integer K returns a random integer between 1 and K.

-----------------------------------------------------------------------------------------------------

-----------------------------------------------------------------------------------------------------

-----------------------------------------------------------------------------------------------------

-----------------------------------------------------------------------------------------------------

Midterm on TBA