User-Defined Aggregates for Advanced Database Applications

Carlo Zaniolo
Computer Science Department
University of California, Los Angeles, CA 90096

Contact Information

Carlo Zaniolo
Computer Science Department
University of California
Los Angeles, CA 90096
Phone: (310) 825-8137
Fax : (310) 794-5056
Email: zaniolo@cs.ucla.edu

WWW Page:  

http://www.cs.ucla.edu/~zaniolo/atlas.html

Project Award Information
Award Number: NSF-IIS 0070135
Duration: September 2000 -- August 2003.
Title: User-Defined Aggregates for Advanced Database Applications

Keywords: Database Languages, Database Extensions, Database-centric datamining, SQL aggregates.

Project Summary

An explosive growth in the scale and complexity of information services is stretching database technology beyond its limits.  The state-of-the-art data base management systems (DBMSs) provide extensibility through external functions written in procedural languages but remain ineffective in critical applications areas, such  as data mining [5]. Therefore, the project's goal is to develop a native extensibility mechanism for DBMSs, and to evaluate its effectiveness in  advanced application domains. Therefore, we  propose an extension mechanism based on  User-Defined Aggregates (UDAs) coded in SQL, rather than in external procedural languages. This approach ensures ease of use and compatibility, and overcomes many other problems besetting  today's advanced data-intensive applications.

Goals, Objectives, and Targeted Activities

Indication of Success

We have implemented a prototype , which we have named ATLaS (for Aggregate & Table Language and System), and evaluated it in critical application areas, such as database-centric data mining. This  area has already been  established as one that is beyond the capabilities of  current object-relational databases [5]. Therefore, our investigation concentrates on expressing in  ATLaS various data mining functions, such as classifiers and Apriori algorithms. The objective is to match the performance of  procedural data mining algorithms that rely on a cache-mining approach [4]. For instance, we have expressed categorical classifiers with only a 20% loss of performance, but a larger performance penalty was initially experienced for the Apriori algorithm. We have devoted significant efforts to overcome this limitation, and implemented a revised and improved system during this second project year. Our experiments show a dramatic increase of performance, whereby the Apriori algorithm and other algorithms requiring specialized inmemory data structure now perform within the 20% of their implementations in C/C++. This dramatic improvement is due to a the ability of declare and manipulate inmemory table in ATLaS: the support of tuple IDs and path expressions on these tables allows the efficient implementation of special data structures in ATLaS.

A second application area studied is spatio-temporal databases where we have used ATLaS to implement the extensible spatio-temporal data model and query language SQLST[7,8].  In addition, to providing these spatio-temporal primitives as built-ins, our SQLST system  allows the user to introduce additional extensions to the spatio-temporal query language and data model. This approach minimizes the need for new constructs that must be added to SQL to support  spatio-temporal queries, assures  the orthogonality of spatial and temporal constructs, and allows the customization of spatio-temporal extensions to meet the different needs of different applications. During this second year, we were able to achieve significant speedups on these applications thanks to the introduction of R*-tree indices in ATLaS. The addition of this new index structure also demonstrates the ease of extensibility of our basic system.

Continuous queries and streams represent an area of growing interest for database applications. We have shown that because of ATLaS hash-based implementation of aggregates, users only need to follow simple syntactic rules to produce the non-blocking aggregates required in stream-oriented computations [6]. We have also investigated the related problem of supporting time series.  The stream-oriented processing model used to define new aggregates is conducive to efficient processing of  sequences and time series. However, considerations of ease of use and efficiency  prompted us to introduce specialized constructs, and  to devise special optimization techniques for time series.  Therefore, we have defined the SQL-TS language to express time series queries, via simple extensions of SQL [1].  Moreover, we have developed new query optimization techniques for this language.  Our techniques generalize  the optimal text search algorithm of Knuth, Morris and Pratt [9], to handle complex queries  on sequences. We are  now implementing a prototype of  SQL-TS as an extension of  ATLaS.

Project Impact

GPRA Outcome Goals Project  References
  1. Reza Sadri, Carlo Zaniolo, Amir M. Zarkesh, Jafar Adibi: Optimization of Sequence Queries in Database Systems. PODS 2001: 92-103.
  2. Sergio Greco, Domenico Saccà, Carlo Zaniolo: Extending stratified datalog to capture complexity classes ranging from P to QH. Acta Informatica, 37(10): 699-725 (2001)
  3. Fosca Giannotti, Dino Pedreschi, Carlo Zaniolo: Semantics and Expressive Power of Nondeterministic Constructs in Deductive Databases. JCSS 62(1): 15-42 (2001).
  4. Reza Sadri, Carlo Zaniolo, Amir M. Zarkesh, Jafar Adibi: A Sequential Pattern Query Language for Supporting Instant Data Mining for e-Services. VLDB 2001: 653-656.
  5. Haixun Wang and Carlo Zaniolo: ATLaS: a Powerful Database Language and System Based on Simple Extensions of SQ (Extended Abstract). Proc. 18th International Conference on Data Engineering , ICDE 2002.
  6. Haixun Wang, Carlo Zaniolo: Extending SQL for Decision Support Applications. (Keynote Address). The 4th Intl. Workshop on Design and Management of Data Warehouses (DMDW 2002), Toronto, Canada, May 27, 2002.
  7. ATLaS Project documentation and download site: http://wis.cs.ucla.edu/atlas/

Area Background

Driven by an incessant stream of new applications, database systems have striven to overcome the crippling handicap of the limited  expressive power of their query languages, such as SQL. The new generation of Oject-Relational systems support major extensions, including recursive queries, new ROLAP constructs, and  procedural UDF extensions; however, they remain ineffective in many advanced application areas, such as datamining [5,8]. Results from our previous research [10,11,12,13]  suggest that a native extension mechanism for SQL, based on user-defined aggregates and table functions, can solve  these problems, can solve these problems without incurring in the  many incompatibility problems encountered with procedural  extensions

Area References

  1. Rosa Meo, Giuseppe Psaila, Stefano Ceri: A Tightly-Coupled Architecture for Data Mining. ICDE 1998: 316-323.
  2. Rakesh Agrawal, Kyuseok Shim: Developing Tightly-Coupled Data Mining Applications on a Relational Database System. KDD 1996: 287-29
  3. Sunita Sarawagi, Shiby Thomas, Rakesh Agrawal: Integrating Mining with Relational Database Systems: Alternatives and Implications. SIGMOD Conference 1998: 343-354.
  4. Cindy X. Chen, Carlo Zaniolo: SQLST: A Spatio-Temporal Data Model and Query Language, ER 2000: 96-111.
  5. C. X. Chen and C. Zaniolo,Universal Temporal Extensions for Database Languages, 15th International Conference on Data Engineering, ICDE'99, Sydney, Australia, March 23-26, 1999.
  6. Fosca Giannotti, Giuseppe Manco, Dino Pedreschi, Franco Turini: Experiences with a Logic-based knowledge discovery Support Environment. 1999 ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery 1999.
  7. D. E. Knuth, J. H. Morris, and V. R. Pratt: Fast pattern matching in strings. SIAM Journal of Computing, 6(2):323--350, June 1977.
  8. Haixun Wang, Carlo Zaniolo: Using SQL to Build New Aggregates and Extenders for Object- Relational Systems. VLDB 2000.
  9. Haixun Wang, Carlo Zaniolo: User Defined Aggregates in Object-Relational Systems. ICDE 2000: 135-144.
  10. Haixun Wang, Carlo Zaniolo: User-Defined Aggregates for Datamining. 1999 ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery 1999.