CS240B
Spring 2007


CS240B

ADVANCED DATA BASES and KNOWLEDGE BASES

Instructor: Carlo Zaniolo

Office Hours: Thursday 3:00--5:00pm



June 8: Presentations

June 6: Presentations

June 4: Presentations

Assignment No 5

Read the following reports

Research Issues in Data Stream Association Rule Mining, by
Nan Jiang and Le Gruenwald, SIGMOD Record, Vol. 35, No. 1, Mar. 2006.

Frequent Itemsets Mining Examples by Osmar Zaïane's, www.cs.ualberta.ca/~zaiane/courses/cau/slides/w2w3-ex-sol.pdf

A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu. By KDD 1996: 226-231

OPTICS: Ordering Points To Identify the Clustering Structure. by Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel, Jörg Sander: SIGMOD Conference 1999: 49-60

Answer the following Questions (Due on May 30)

Task 5.1: Given the following measurement for the variable age:

18, 22, 25, 42, 28, 43, 33, 35, 56, 28,

standardize the variable by the following:

(a) compute the mean absolute deviation of age,
(b) compute the z-score for the first four measurement.

Task 5.2. Present situations for which density-based clustering is more
suitable than partitioning-based clustering.

Task 5.3. Show how the ESL code for DBSCAN can be modified to implement
OPTICS.

Task 5.4. Association rule mining using frequent itemsets often generate a large number of rules. Explain how the notion of correlation can be used to eliminate many rules while preservintg those that are semantically interesting.

Task 5.5. Explain the benefits offered by boosting ensembles when mining data bases and data streams.

Assignment No 4

Study the following papers:
Unifying the Processing of XML Streams and Relational Data Streams. Xin Zhou, Hetal Thakkar and Carlo Zaniolo. ICDE 2006.

Mining Databases and Data Streamswith Query Languages and Rules. Invited Talk. Carlo Zaniolo. Fourth International Workshop on Knowledge Discovery in Inductive Databases, KDID 2005.

Building data mining solutions with OLE db for DM and XML for analysis. Z. Tang, J. Maclennan, and P.P. Kim. SIGMOD Record, 34(2): 80--85, 2005.

Do the following project (cs240A final revised for WEKA--Due May 16):

Assignment No 3 (Due on Mo, April 30 WD, May 2)

Read The following papers:

A heartbeat mechanism and its application in gigascope. by Theodore Johnson et al.: In VLDB 2005, pages 1079-1088.

Optimizing Timestamp Management in Data Stream Management Systems. by Yijian Bai et al.: In ICDE 2007.

Task 3.1. EXAMPLE 13--MAX with Windows--of the CIKM 2006 listed below has a bug insofar as it does not consider the case where inwindow is empty:

-- discuss the cases where inwindow is empty,
--fix the bug a show a correct definition for this UDA.

Task 3.2. We showed how to fix the idle-wating problem for union and joins. However, aggregates on logical windows with slides can also experience a similar problem [answer]

    • explain how the problem occurs (you might want to start from the case of tumbles)

    • give the execution of aggregates on a logical window with slides (in the style of the foil with header tau be the content of the Time Stamp Memory (TSM) Register).

    Task 3.3. Aleri and Progress (apama) represent two recently annnounced DSMS products. Find out what you can about these two. Compare them with Coral8 and Stream Base in a short report (one page).


Week 2 Assignment (Due on Monday, April 23).

Read the following papers on DSMS:

Models and Issues in Data Stream Systems
. by B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Invited paper in Proc. of PODS 2002, June 2002.

Query Languages and Data Models for Database Sequences and Data Streams. By Yan-Nei Law, Haixun Wang, Carlo Zaniolo. VLDB 2004.

A Data Stream Language and System Designed for Power and Extensibility. By Yijian Bai, Hetal Thakkar, Chang Luo, Haixun Wang, Carlo Zaniolo. CIKM 2006.
  • Deliverables:

Task 2.1 Write a short (one page or two powerpoint slides ) report on Coral8 (vs. Streambase)

Task 2.2 Write a short (one page or two powerpoint slides ) report on Streambase (vs. Coral8)

Taks 2.3 Show some blocking and non-blocking queries in SQL-MR and, for each, explain why it is or it is not blocking.

Task 2.4. Assume that you have a table describing Ebay bids for the day.

Bids ( itemID /* the item being bid for*/,
price, /* bid price*/
bidderID, /* id of the bidder*/
timestamp /* time when bid was registered*/)


Write two ESL UDAs to compute the SUM and the SUM-DISTICNT over a window, and use them to return, for each itemID, the sum and sum-distinct of the prices over the last 100 bids.

 


 

Week 1 Assignment (Due on Monday, April 9).

* Study K. Guion's overview on OLAP Functions.

* Study the proposed Match-Recognize ANSI Standards for SQL (SQL-MR for short)

Turn in your answer to the following assignment:

Task 1.1. Assume that you have the following temporal table:

emp(Eno, Project, Tstart, Tend)

Denoting periods of time during which an employee has worked on a project. The closed intervals denoting these periods could overlap, and thus you need to coalesced them into maximal periods. Suggestion, sort all events in a sequence, and then use SQL-MR to do the actual coalescing, and reconstruct the original table with the intervals coalesced.

Task 1.2. Sensors have detected locations of objects at certain time. So

items( itemNo, SensorNo, Time)

Write an SQL-MR query to detect objects that are going around in a cycle, i.e., they have returned to the same location withing one day. Many objects do not move fast, so the sensor might produce consecutive readings of the same object even if this is not in a cycle.

Task 1.3. Express in SQL-TS the example query of Section 2.1 of the SQL-MR document.

Task 1.4. Write a short (one page or less) report on SQL-MR explaining the parts that, in your view, have problems or require clarification (e.g., my view is that outputs are not clear in the presence of alternative patterns).