Contact Information
Carlo Zaniolo
Computer Science Department
University of California
Los Angeles, CA 90096
Phone: (310) 825-8137
Fax : (310) 794-5056
Email: zaniolo@cs.ucla.edu
WWW Page:
http://www.cs.ucla.edu/~zaniolo/atlas.html
Project Award Information
Award Number: NSF-IIS 0070135
Duration: September 2000 -- August 2003.
Title: User-Defined Aggregates for Advanced Database Applications
Keywords: Database Languages, Database Extensions, Database-centric datamining, SQL aggregates.
Project Summary
An explosive growth in the scale and complexity of information services is stretching database technology beyond its limits. The state-of-the-art data base management systems (DBMSs) provide extensibility through external functions written in procedural languages but remain ineffective in critical applications areas, such as data mining [5]. Therefore, the project's goal is to develop a native extensibility mechanism for DBMSs, and to evaluate its effectiveness in advanced application domains. Therefore, we propose an extension mechanism based on User-Defined Aggregates (UDAs) coded in SQL, rather than in external procedural languages. This approach ensures ease of use and compatibility, and overcomes many other problems besetting today's advanced data-intensive applications.
Goals, Objectives, and Targeted Activities
We have implemented a prototype , which we have named ATLaS (for Aggregate & Table Language and System), and evaluated it in critical application areas, such as database-centric data mining. This area has already been established as one that is beyond the capabilities of current object-relational databases [5]. Therefore, our investigation concentrates on expressing in ATLaS various data mining functions, such as classifiers and Apriori algorithms. The objective is to match the performance of procedural data mining algorithms that rely on a cache-mining approach [4]. For instance, we have expressed categorical classifiers with only a 20% loss of performance, but a larger performance penalty was initially experienced for the Apriori algorithm. We have devoted significant efforts to overcome this limitation, and implemented a revised and improved system during this second project year. Our experiments show a dramatic increase of performance, whereby the Apriori algorithm and other algorithms requiring specialized inmemory data structure now perform within the 20% of their implementations in C/C++. This dramatic improvement is due to a the ability of declare and manipulate inmemory table in ATLaS: the support of tuple IDs and path expressions on these tables allows the efficient implementation of special data structures in ATLaS.
A second application area studied is spatio-temporal databases where we have used ATLaS to implement the extensible spatio-temporal data model and query language SQLST[7,8]. In addition, to providing these spatio-temporal primitives as built-ins, our SQLST system allows the user to introduce additional extensions to the spatio-temporal query language and data model. This approach minimizes the need for new constructs that must be added to SQL to support spatio-temporal queries, assures the orthogonality of spatial and temporal constructs, and allows the customization of spatio-temporal extensions to meet the different needs of different applications. During this second year, we were able to achieve significant speedups on these applications thanks to the introduction of R*-tree indices in ATLaS. The addition of this new index structure also demonstrates the ease of extensibility of our basic system.
Continuous queries and streams represent an area of growing interest for database applications. We have shown that because of ATLaS hash-based implementation of aggregates, users only need to follow simple syntactic rules to produce the non-blocking aggregates required in stream-oriented computations [6]. We have also investigated the related problem of supporting time series. The stream-oriented processing model used to define new aggregates is conducive to efficient processing of sequences and time series. However, considerations of ease of use and efficiency prompted us to introduce specialized constructs, and to devise special optimization techniques for time series. Therefore, we have defined the SQL-TS language to express time series queries, via simple extensions of SQL [1]. Moreover, we have developed new query optimization techniques for this language. Our techniques generalize the optimal text search algorithm of Knuth, Morris and Pratt [9], to handle complex queries on sequences. We are now implementing a prototype of SQL-TS as an extension of ATLaS.
Project Impact
Area Background
Driven by an incessant stream of new applications, database systems have striven to overcome the crippling handicap of the limited expressive power of their query languages, such as SQL. The new generation of Oject-Relational systems support major extensions, including recursive queries, new ROLAP constructs, and procedural UDF extensions; however, they remain ineffective in many advanced application areas, such as datamining [5,8]. Results from our previous research [10,11,12,13] suggest that a native extension mechanism for SQL, based on user-defined aggregates and table functions, can solve these problems, can solve these problems without incurring in the many incompatibility problems encountered with procedural extensions
Area References