Winter 2015





MW 16:00--17:50, Winter 2015


Data Stream Management Systems:
Supporting Data Stream Mining Applications

Instructor: Carlo Zaniolo

Office Hours: Tuesday,  15:00--17:50

First Assignement (Due on Wednesday  January 21)

  • Complete the  following Tasks:
  • Task 1.1: Write a four page (about) report on SQL-MR focusing on the things that you can do with SQL-MR that cannot be done in  SQL-TS, and the main changes between the original 2007 report and the 2010  revised version.
  • Task 1.2: Assume a stream of temporal tuples emp(Eno, Sal, Dept, From, To) that arrive ordered by their timestamp To and you project out Dept. Now, you must perform a temporal temporal coalescing on Sal. Write a UDA to perform the coalescing and return only maximal tuples. Is this aggregate blocking or non-blocking or something in between? (Use the ESL syntax from the notes.) 
  • Task 1.3:  Assume that you have the temporal relation emp(Eno, Sal, Dept, From, To) and you project out Dept.  Now you must  perform a temporal temporal coalescing  on Sal. Can you express that using SQL OLAP functions? (this is difficult, you might have to search the literature for a solution).


  Second Assignement (Due on Monday, February 2)

  •  Study the following papers:
  • Task 2.1: We showed that binary operators, such as union and joins, are subject to idle-wating problem. But this problem can also occur in some unary operators. Say for instance that an aggregate is called on a time-stamped tumble window (or on a timestamped window with slides). Discuss this situation, its idle-waiting problem, and how it can be solved (using concepts and notation from the paper by Bai and Zaniolo above ).

  • Task 2.2: Propose a  simple generalization of the chain algorithm to minimize memory for the following situations:
    (A) Two simple paths without forks,
    (B) a simple tree where a path forks out into a pair of paths. (You should use concepts and notation from paper 1 above.)
  • Task 2.3:   Using a syntax based on that of notes and reference 3 above, express a user-defined aggregate d_count to perform the exact count of distinct values in a window on a data stream. Your window aggregate could, e.g., be called as follows:
    SELECT col_name1, d_count(col_name2) OVER (ROWS 99999 PRECEDING) FROM my_stream



  Third Assignement (Due on Monday, February 9)

  •  Study the following papers:
    • Jure Leskovec, Anand Rajaraman, Jeff Ullman, Mining Data Streams - The Stanford University InfoLab.  You only need to study sections 4.3 and 4.4. But the rest is also interesting.  http://infolab.stanford.edu/~ullman/mmds/ch4.pdf

    • Study: Chapter 3 (A Survey of Classification Methods in Data Streams) from  DATA STREAMS: MODELS AND ALGORITHMS DATA STREAMS. CHARU C. AGGARWAL editor, Kluwer Academic Publishers. 
      Chapter 3  is at pages 39-60 of the following pdf document.

  • Complete the  following Tasks:
  • Task 3.1: Express the Flajolet-Martin's distinct_count sketch as a user-defined aggregate mamed dcount_sketch, to be called in the same way as d_count. You can assume that you have available a function LmostbitH(X) that return K, where the K position contains a 1, whereas all the position to its right are zeros, for the value returned by a randomizing hash function H(X).

  • Task 3.2: Write the window UDA to compute the standard deviation on a data stream. Explain the math formulas you use in the expire of the window UDA---you might want to consult textbooks or the  web  to find  the best formula.
  • Task 3.3:  Assume that  you have a stream of temperature readings temperature(Celsius Integer)
     that start everyday at time 00:01  and end at  time 23:59.  At the end of each day, we want to have
    10,000 temperature samples stored into a table   tenKsamples
    (Rowno integer, Celsius Integer). We do not know how many temperature readings are going to arrive every day, except that their number is significantly larger than 10,000. Please write a UDA that uses the reservoir algorithm to populate tenKsamples(Rowno , Celsius) with 10,000 random samples taken from
    temperature(Celsius Integer), which is then processed and reset to empty at midnight
  •  You can assume that the system support a function random(K), which given a  positive integer returns a random integer between 1 and K.


    Presentations: suggested papers on Data Stream Management Systems and
    on  Mining Data Streams.


    Project: Implementing a Data Stream Mining Algorithm on a DSMS of your  choosing.

    Deadline: you must deliver one of the two above by March 2, and the other by March 9.
    The order is up to you.