NSF-IIS 0070135

User-Defined Aggregates for Advanced Database Applications

Carlo Zaniolo
Computer Science Department
University of California, Los Angeles, CA 90096

Contact Information

Carlo Zaniolo
Computer Science Department
University of California
Los Angeles, CA 90096
Phone: (310) 825-8137
Fax : (310) 794-5056
Email: zaniolo@cs.ucla.edu

WWW Page:

http://www.cs.ucla.edu/~zaniolo/atlas.html

Project Award Information
Award Number: NSF-IIS 0070135
Duration: September 2000 -- August 2003.
Title: User-Defined Aggregates for Advanced Database Applications

Keywords: Database Languages, Database Extensions, Database-centric datamining, SQL aggregates.

Project Summary

An explosive growth in the scale and complexity of information services is stretching database technology beyond its limits. The state-of-the-art data base management systems (DBMSs) provide extensibility through external functions written in procedural languages but remain ineffective in critical applications areas, such as data mining [5]. Therefore, the project's goal is to develop a native extensibility mechanism for DBMSs, and to evaluate its effectiveness in advanced application domains. Therefore, we propose an extension mechanism based on User-Defined Aggregates (UDAs) coded in SQL, rather than in external procedural languages. This approach ensures ease of use and compatibility, and overcomes many other problems besetting today's advanced data-intensive applications.

Goals, Objectives, and Targeted Activities

Design and implementation of an SQL-based language for UDAs, and of a system for supporting their efficient execution. Compilation and optimization techniques that ensure UDAs' performance and scalability on large data sets must therefore be developed.
Development of advanced application testbeds to demonstrate the power of our UDA system, and to evaluate and tune its performance. In most applications, such as data mining functions, spatio-temporal queries, and time-series data blades, the system will be used as a DBMS extender; but in other applications, such as analysis of computational data or WEB data, the system will also be used as a stand-alone query processor.
Delivery of the software to the public, with a dedicated website providing documentation, tutorials, application testbeds, and online demos.

Indication of Success

We have implemented a prototype , which we have named ATLaS (for Aggregate & Table Language and System), and evaluated it in critical application areas, such as database-centric data mining. This area has already been established as one that is beyond the capabilities of current object-relational databases [5]. Therefore, our investigation concentrates on expressing in ATLaS various data mining functions, such as classifiers and Apriori algorithms. The objective is to match the performance of procedural data mining algorithms that rely on a cache-mining approach [4]. For instance, we have expressed categorical classifiers with only a 20% loss of performance, but a larger performance penalty was initially experienced for the Apriori algorithm. We have devoted significant efforts to overcome this limitation, and implemented a revised and improved system during this second project year. Our experiments show a dramatic increase of performance, whereby the Apriori algorithm and other algorithms requiring specialized inmemory data structure now perform within the 20% of their implementations in C/C++. This dramatic improvement is due to a the ability of declare and manipulate inmemory table in ATLaS: the support of tuple IDs and path expressions on these tables allows the efficient implementation of special data structures in ATLaS.

A second application area studied is spatio-temporal databases where we have used ATLaS to implement the extensible spatio-temporal data model and query language SQL^ST[7,8]. In addition, to providing these spatio-temporal primitives as built-ins, our SQL^STsystem allows the user to introduce additional extensions to the spatio-temporal query language and data model. This approach minimizes the need for new constructs that must be added to SQL to support spatio-temporal queries, assures the orthogonality of spatial and temporal constructs, and allows the customization of spatio-temporal extensions to meet the different needs of different applications. During this second year, we were able to achieve significant speedups on these applications thanks to the introduction of R*-tree indices in ATLaS. The addition of this new index structure also demonstrates the ease of extensibility of our basic system.

Continuous queries and streams represent an area of growing interest for database applications. We have shown that because of ATLaS hash-based implementation of aggregates, users only need to follow simple syntactic rules to produce the non-blocking aggregates required in stream-oriented computations [6]. We have also investigated the related problem of supporting time series. The stream-oriented processing model used to define new aggregates is conducive to efficient processing of sequences and time series. However, considerations of ease of use and efficiency prompted us to introduce specialized constructs, and to devise special optimization techniques for time series. Therefore, we have defined the SQL-TS language to express time series queries, via simple extensions of SQL [1]. Moreover, we have developed new query optimization techniques for this language. Our techniques generalize the optimal text search algorithm of Knuth, Morris and Pratt [9], to handle complex queries on sequences. We are now implementing a prototype of SQL-TS as an extension of ATLaS.

Project Impact

Four Ph.D. students and one Master student have been working on various aspects of the project (from implementation to user interfaces). Most of these students are minority (Chinese), and several are female. One female student (Cindy X. Chen) completed her Ph.D. in March 2001
The ATLaS prototype was used in the graduate course CS240B--Advanced Databases and Knowledge Bases Systems--that offered for the first time in the Spring Quarter 2002. Various student projects can be found in the ATLaS web site [7]
To facilitate technology transfer, a new ATLaS website [7] was started with documentation and executable downloads for Linux and MS Windows.

GPRA Outcome Goals

Discoveries at and across the frontier of science and engineering. A vast array of applications in business, government and science need to analyze large amounts of data. Examples include earth science data and medical data. This project will facilitate the elaboration of information and the extraction of knowledge from the large amount of data collected in such applications.
Connections between discoveries and their use in service to society. In fields, such as earth science or medicine, better tools for knowledge discovery and decision support can accelerate the scientific discovery process and the applications of such discoveries. This project will extend the benefits of high-level query languages, which are now only available in the database framework, to other data-intensive applications.

Project References

Reza Sadri, Carlo Zaniolo, Amir M. Zarkesh, Jafar Adibi: Optimization of Sequence Queries in Database Systems. PODS 2001: 92-103.
Sergio Greco, Domenico Saccà, Carlo Zaniolo: Extending stratified datalog to capture complexity classes ranging from P to QH. Acta Informatica, 37(10): 699-725 (2001)
Fosca Giannotti, Dino Pedreschi, Carlo Zaniolo: Semantics and Expressive Power of Nondeterministic Constructs in Deductive Databases. JCSS 62(1): 15-42 (2001).
Reza Sadri, Carlo Zaniolo, Amir M. Zarkesh, Jafar Adibi: A Sequential Pattern Query Language for Supporting Instant Data Mining for e-Services. VLDB 2001: 653-656.
Haixun Wang and Carlo Zaniolo: ATLaS: a Powerful Database Language and System Based on Simple Extensions of SQ (Extended Abstract). Proc. 18th International Conference on Data Engineering , ICDE 2002.
Haixun Wang, Carlo Zaniolo: Extending SQL for Decision Support Applications. (Keynote Address). The 4th Intl. Workshop on Design and Management of Data Warehouses (DMDW 2002), Toronto, Canada, May 27, 2002.
ATLaS Project documentation and download site: http://wis.cs.ucla.edu/atlas/

Area Background

Driven by an incessant stream of new applications, database systems have striven to overcome the crippling handicap of the limited expressive power of their query languages, such as SQL. The new generation of Oject-Relational systems support major extensions, including recursive queries, new ROLAP constructs, and procedural UDF extensions; however, they remain ineffective in many advanced application areas, such as datamining [5,8]. Results from our previous research [10,11,12,13] suggest that a native extension mechanism for SQL, based on user-defined aggregates and table functions, can solve these problems, can solve these problems without incurring in the many incompatibility problems encountered with procedural extensions

Area References

Rosa Meo, Giuseppe Psaila, Stefano Ceri: A Tightly-Coupled Architecture for Data Mining. ICDE 1998: 316-323.
Rakesh Agrawal, Kyuseok Shim: Developing Tightly-Coupled Data Mining Applications on a Relational Database System. KDD 1996: 287-29
Sunita Sarawagi, Shiby Thomas, Rakesh Agrawal: Integrating Mining with Relational Database Systems: Alternatives and Implications. SIGMOD Conference 1998: 343-354.
Cindy X. Chen, Carlo Zaniolo: SQL^ST: A Spatio-Temporal Data Model and Query Language, ER 2000: 96-111.
C. X. Chen and C. Zaniolo,Universal Temporal Extensions for Database Languages, 15th International Conference on Data Engineering, ICDE'99, Sydney, Australia, March 23-26, 1999.
Fosca Giannotti, Giuseppe Manco, Dino Pedreschi, Franco Turini: Experiences with a Logic-based knowledge discovery Support Environment. 1999 ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery 1999.
D. E. Knuth, J. H. Morris, and V. R. Pratt: Fast pattern matching in strings. SIAM Journal of Computing, 6(2):323--350, June 1977.
Haixun Wang, Carlo Zaniolo: Using SQL to Build New Aggregates and Extenders for Object- Relational Systems. VLDB 2000.
Haixun Wang, Carlo Zaniolo: User Defined Aggregates in Object-Relational Systems. ICDE 2000: 135-144.
Haixun Wang, Carlo Zaniolo: User-Defined Aggregates for Datamining. 1999 ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery 1999.