To subscribe to the db-ucla mailing list for seminar announcements, please visit this page
DB-UCLA Seminar
Time: 12:00n-1:00pm Fridays; Room: 4549 Boelter Hall
*To invite a guest speaker or to schedule a talk, contact Young Cha (cookie58z at gmail dot com)
From Regular Expressions to Nested Words: Unifying Languages and Query Execution for Relational and XML Sequences
Speaker:
Barzan Mozafari
Advisor:
Professor Zaniolo
Abstract:
There is growing interest in query language extensions for pattern
matching over event streams and stored database sequences, due
to the many important applications that such extensions make possible.
The push for such extensions has led DBMS vendors and
DSMS venture companies to propose Kleene-closure extensions of
SQL standards, building on seminal research that demonstrated the
effectiveness and amenability to efficient implementation of such
constructs. These extensions, however powerful, suffer from limitations
that severely impair their effectiveness in many real-world
applications. To overcome these problems, we have designed the
K*SQL language and system, based on our investigation of nested
words, a recent model that generalizes both words
and trees.
K*SQL extends the existing relational sequence languages, and
also enables applications from other domains such as genomics,
software analysis, and XML processing. At the same time, K*SQL
remains extremely efficient, using our powerful optimizations for
pattern search over nested words. Furthermore, we show that other
sequence languages and XPath can be automatically translated into
K*SQL, allowing for K*SQL to be also used as a high-performance
query execution back-end for those languages. Therefore, K*SQL
is a unifying SQL-based engine for sequence and XML queries,
which provides novel optimization techniques for both.
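The Kleene-closure pattern queries mentioned above can be illustrated with a small sketch (not K*SQL itself; the function and the "V-shape" example are just illustrative): matching a pattern of the form "one event A, then one or more falling steps, then one or more rising steps" over a price sequence, the kind of query these SQL extensions express declaratively.

```python
# Illustrative sketch of Kleene-closure pattern matching over a sequence:
# find "V" patterns (a run of falling prices followed by a run of rising
# prices), analogous to an A B+ C+ pattern in SQL sequence extensions.

def find_v_patterns(prices):
    """Return (start, bottom, end) index triples for V-shaped runs."""
    matches = []
    i, n = 0, len(prices)
    while i < n - 1:
        # B+ part: one or more falling steps
        j = i
        while j + 1 < n and prices[j + 1] < prices[j]:
            j += 1
        if j > i:  # saw at least one fall
            # C+ part: one or more rising steps
            k = j
            while k + 1 < n and prices[k + 1] > prices[k]:
                k += 1
            if k > j:
                matches.append((i, j, k))
            i = k if k > j else j
        else:
            i += 1
    return matches

print(find_v_patterns([5, 4, 3, 4, 6, 2, 1, 3]))  # -> [(0, 2, 4), (4, 6, 7)]
```

A dedicated engine such as K*SQL compiles such patterns into automata and applies optimizations rather than scanning naively as above.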
Mining and querying Tree-based Association Rules from XML documents
Speaker:
Mirjana Mazuran
Affiliation:
Politecnico di Milano
Abstract:
Extracting information from semistructured documents is becoming more and more critical as the amount of digital information available on the internet grows. Documents are often so large that even the dataset returned as the answer to a query may be too big to convey interpretable knowledge. Tree-based Association Rules (TARs) are a new way to face the challenges of this scenario. They are structure-preserving association rules mined from XML documents and provide approximate, intensional information on both their structure and contents. This mined knowledge is the gist of the original document and provides support for (i) query formulation on datasets whose structure is unknown and (ii) intensional query answering, that is, allowing users to query TARs rather than the original document in order to obtain quick, approximate answers.
IRMA: An Image Registration Meta-Algorithm ---
Evaluating Alternative Algorithms with Multiple Metrics
Speaker:
Kelvin T. Leung
Advisor:
Professor Parker
Abstract:
A common problem in scientific computing is the need to choose among a
number of different metrics or scoring criteria. To minimize error,
for example, one might choose among metrics like L1 error, L2 error,
etc. These choices can be difficult to make, because each criterion is
an objective that can be useful in its own right. Because they can
differ significantly from one another, ad hoc choices can have
significant consequences.
This problem is important in scientific imaging, as many different
measures of distance, similarity, and quality are useful for
images. In the image registration problem, one is given a pair of
images -- an input image and a reference image -- and a set of
registration algorithms for aligning them. The result of registration
is a transformation of the input image that resembles the reference
image, and the quality of this result is often the value of a metric
measuring the distance or similarity between these
images. Registration is a central problem in image analysis, and new
solutions provide new ways of retrieving images from databases.
In this talk, we describe a new meta-algorithm called Image
Registration Meta-Algorithm (IRMA) that evaluates results of multiple
metrics. Selection is done by a decision engine that automatically
stores metric computation results in a database, then mines these
results to evaluate which registration algorithm gives results of the
highest quality. To facilitate this goal, we have built a
scientific model-base management system using Postgres on the LONI/CCB
grid computing facility and the LONI pipeline workflow infrastructure,
which allows scientists to catalog results, such as provenance
information about the parameter settings used in the workflow,
and to perform "shotgun" experiments
to explore and further understand the space of alternatives in a more
organized fashion.
* This work was supported by NIH grant 1U54RR021813 and the Center for
Computational Biology, an NIH National Center of Biomedical
Computing.
Fast and Space-Efficient Computation of Equi-Depth Histograms for Data Streams
Speaker:
Hamid Mousavi
Advisor:
Professor Zaniolo
Abstract:
Equi-depth histograms represent a fundamental synopsis
widely used in both database and data stream applications, as they
provide the cornerstone of many techniques such as query optimization,
approximate query answering, distribution fitting, and parallel database
partitioning. Equi-depth histograms partition a sequence of data
such that every part has the same number of data items.
In this paper, we present a new algorithm to estimate equi-depth
histograms for high speed data streams over sliding windows. While
many previous methods were based on quantile computations, we propose
a new method called BAr Splitting Histogram (BASH) that provides an
expected ϵ-approximate solution to compute the equi-depth histogram.
Extensive experiments show that BASH is at least four times faster than
one of the best existing approaches, while achieving the same accuracy
and using less memory. The experimental results also indicate that BASH
is more stable on data affected by frequent concept shifts.
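The synopsis that BASH approximates can be sketched exactly in the offline case (this sketch is illustrative and is not the BASH algorithm, which works incrementally over streaming sliding windows): sort the data and cut it at every n/k-th item.

```python
# Minimal offline sketch of an equi-depth histogram: partition a sequence
# into k buckets holding (nearly) equal numbers of items. This is the exact
# synopsis that streaming algorithms like BASH must approximate under
# memory and time constraints.

def equi_depth_boundaries(data, k):
    """Return the k-1 bucket boundary values of an equi-depth histogram."""
    s = sorted(data)
    n = len(s)
    # boundary i sits at rank i*n/k in the sorted order
    return [s[(i * n) // k] for i in range(1, k)]

print(equi_depth_boundaries(range(100), 4))  # -> [25, 50, 75]
```

The streaming setting rules out sorting the whole window, which is why quantile sketches, and bar-splitting schemes like BASH, are needed.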
Extending Teradata UDFs to support SQL and Map Reduce programs
Speaker:
Nikolay Laptev
Advisor:
Professor Zaniolo
Abstract:
Teradata is a massively parallel processing system running a shared nothing architecture.
The Teradata DBMS is linearly and predictably scalable in all dimensions of a database
system workload (data volume, breadth, number of users, complexity of queries). One of
the many features of the Teradata Database includes the support for User Defined Functions (UDFs).
A Teradata Database UDF is a program that operates on data stored in relational tables.
UDFs allow users to add their own extensions to the Teradata SQL language.
Teradata UDFs must be written in Java or C. In this talk, a high-level overview is given
of extending Teradata UDFs to support SQL, in order to maximize programmer productivity and to decrease debugging
time. Furthermore, the conversion of Hadoop MapReduce programs and of Java UDFs into native
Teradata UDFs is also discussed. The work presented was part of a summer project at Teradata.
PolicyReplay: Misconfiguration-Response Queries for Data Breach Reporting
Speaker:
Daniel Fabbri
Affiliation:
U of Michigan
Abstract:
Recent legislation has increased the requirements of organizations to report data breaches, or unauthorized access to data. While access control policies are used to restrict access to a database, these policies are complex and difficult to configure. As a result, misconfigurations sometimes allow users access to unauthorized data.
In this paper, we consider the problem of reporting data breaches after such a misconfiguration is detected.
To locate past SQL queries that may have revealed unauthorized information, we introduce the novel idea of a misconfiguration response (MR) query. The MR-query cleanly addresses the challenges of information propagation within the database by replaying the log of operations and returning all logged queries for which the result has changed due to the misconfiguration. A strawman implementation of the MR-query would go back in time and replay all the operations that occurred in the interim, with the correct policy. However, re-executing all operations is inefficient. Instead, we develop techniques to improve reporting efficiency by reducing the number of operations that must be re-executed and reducing the cost of replaying the operations. An extensive evaluation shows that our method can reduce the total runtime by up to an order of magnitude.
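The strawman version of the MR-query described above can be sketched as follows (a toy illustration, not the paper's optimized method; the table, policies, and function names are invented for the example): re-run each logged query under both the misconfigured and the corrected policy, and report those whose answers differ.

```python
# Toy sketch of the strawman MR-query: replay the query log under the
# bad (misconfigured) and good (corrected) policies and report queries
# whose results changed. A policy here is simply the set of visible row ids.

def mr_query(log, run_query, bad_policy, good_policy):
    """Return the logged queries whose results differ between policies."""
    breached = []
    for q in log:
        if run_query(q, bad_policy) != run_query(q, good_policy):
            breached.append(q)
    return breached

table = {1: "alice", 2: "bob", 3: "carol"}

def run(q, policy):  # q is a predicate selecting rows by value
    return sorted(v for k, v in table.items() if k in policy and q(v))

bad = {1, 2, 3}   # misconfigured: row 3 should not have been visible
good = {1, 2}
log = [lambda v: v.startswith("c"), lambda v: v.startswith("a")]
print(len(mr_query(log, run, bad, good)))  # -> 1 (only the "c" query leaked)
```

The paper's contribution is avoiding this full re-execution by pruning operations that cannot be affected and cheapening the replay of the rest.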
The NoSQL Movement: Overview and discussion
Speaker:
Uri Schonfeld
Abstract:
In this seminar we will introduce the NoSQL movement and some of the recent
alternatives to the classic ACID-supporting SQL database. We will discuss
several of the servers available, some highly publicized recent adoptions by
high traffic websites (Twitter, Digg, Reddit) and have an open discussion on
where things are headed.
K*SQL: Searching for Complex Event Patterns in Linear Sequences and Nested Structures
Speaker:
Kai Zeng
Abstract:
A strong interest is emerging in SQL extensions for sequence patterns using
Kleene-closure expressions. This burst of interest from both the research community and
the commercial world is due to the many database and data stream applications made possible
by these extensions, including financial services, RFID-based inventory management, and electronic
health systems. The K*SQL system represents a major step forward in this area. K*SQL supports a more
expressive language that allows for generalized Kleene-closure queries and also achieves
the expressive power of the nested word model, which greatly expands the application domain to
include XML queries, software trace analysis, and genomics.
PRIMA Architracker
Speaker:
Shi Gao
Abstract:
The function of archiving history information is recognized as an important demand on new
database management systems. Besides content updates, we also need to pay attention to
schema changes. As of today, schema evolution remains an error-prone and time-consuming
undertaking. PRIMA Architracker addresses the problem by: (i) providing a general method to
archive the history information of a database automatically and efficiently, and (ii) making use of
HMM to record schema evolution.
Using Cluster Analysis to Extract Information from Educational Video Game Log Files
Speaker:
Deirdre Kerr
Abstract:
I am going to give a brief overview of the Math game we designed and the traditional
statistical methods of analyzing educational interventions such as this in the social
sciences. Then I will explain how the use of fuzzy feature cluster analysis, utilizing
the fanny algorithm in R, might help us understand the intervention at a deeper level.
Finally, I will go over additional questions
about such interventions that may be answerable through data mining techniques and ask for feedback.
Experimental Comparison of Document Clustering Techniques
Speaker:
Michael Shindler
Abstract:
We perform an experimental comparison of k-means and spectral
(minimum-conductance) clustering techniques over large datasets for document
retrieval applications, using f-measure as our basis for comparison. We have
three primary findings: one, that minimum-conductance does not appear to be a
useful objective to optimize; two, that the clustering assumed by the dataset
does not optimize the k-means objective well; and three, somewhat surprisingly,
a higher f-measure value can sometimes be obtained by using a larger number of
final clusters than is assumed by the dataset.
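The f-measure used as the basis for such comparisons can be sketched as follows (this is the standard class-to-cluster formulation; the talk's exact variant may differ): each true class is matched to its best cluster by F1, and the per-class scores are weighted by class size.

```python
# Sketch of the clustering f-measure: for each ground-truth class, take the
# best F1 score over all produced clusters, then average weighted by class
# size. classes and clusters are lists of sets of item ids.

def f_measure(classes, clusters):
    """Return the overall weighted f-measure of a clustering."""
    n = sum(len(c) for c in classes)
    total = 0.0
    for cls in classes:
        best = 0.0
        for clu in clusters:
            tp = len(cls & clu)          # items of this class in this cluster
            if tp == 0:
                continue
            p, r = tp / len(clu), tp / len(cls)
            best = max(best, 2 * p * r / (p + r))
        total += (len(cls) / n) * best
    return total

print(f_measure([{1, 2}, {3, 4}], [{1, 2}, {3, 4}]))  # -> 1.0 (perfect match)
```

Note that because each class picks its best-matching cluster independently, adding more clusters can only help or leave the score unchanged, which is consistent with the third finding above.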
Search by offering examples
Speaker:
Chu-Cheng Hsieh
Abstract:
Search engines have changed how people think about the data on the WWW. Since the data is massive, one of the most important goals today is to provide a service with which people can search for the information they want efficiently.
Nevertheless, in some scenarios it is unrealistic to expect a user to devise a set of keywords for getting the information from the web. For example, it is impossible to figure out keywords you do not yet know. If you want to
search for good basketball players in the NBA, it is difficult to define search keywords without knowing the names of players or having knowledge of basketball games. We aim to overcome these drawbacks by giving users the ability
to search by examples. The goal of this project is to provide an approach for web users to search for similar objects on the web, instead of using keywords to define the concept they want (traditional search). We will present an
efficient algorithm to identify entities under the assumption that tag information is given. A live demo will also be given during the presentation.
A Unifying Engine for Sequence Patterns and XML
Speaker:
Barzan Mozafari
Abstract:
Content hidden due to confidentiality.
Optimal Load Shedding with Aggregates and Mining Queries
Speaker:
Barzan Mozafari
Abstract:
To cope with bursty arrivals of high-volume data, a DSMS has to shed load while minimizing the degradation of Quality of Service (QoS). In this paper, we show that this problem can be formalized as a classical optimization task
from operations research, in ways that accommodate different requirements for multiple users, different query sensitivities to load shedding, and different penalty functions. Standard non-linear programming algorithms are adequate
for non-critical situations, but for severe overloads, we propose a more efficient algorithm that runs in linear time, without compromising optimality. Our approach is applicable to a large class of queries including traditional
SQL aggregates, statistical aggregates (e.g., quantiles), and data mining functions, such as k-means, naïve Bayesian classifiers, decision trees, and frequent pattern discovery (where we can even specify a different error bound for
each pattern). In fact, we show that these aggregate queries are special instances of a broader class of functions that we call reciprocal-error aggregates, for which the proposed methods apply with full generality.
Finally, we propose a novel architecture for supporting load shedding in an extensible system, where users can write arbitrary User Defined Aggregates (UDA), and thus confirm our analytical findings with several experiments
executed on an actual DSMS.
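A small worked example (illustrative only; this is not the paper's algorithm) shows why reciprocal-error penalties admit fast optimal shedding: minimizing a weighted sum of reciprocal errors sum_i w_i / r_i over sampling rates r_i subject to a total capacity sum_i r_i = C has, by Lagrange multipliers, the closed form r_i = C * sqrt(w_i) / sum_j sqrt(w_j), computable in linear time.

```python
# Illustrative closed-form load shedding for reciprocal-error penalties:
# minimize sum_i w_i / r_i subject to sum_i r_i = C. Setting the gradient
# to zero gives r_i proportional to sqrt(w_i), normalized to capacity C.
import math

def shed_rates(weights, capacity):
    """Optimal sampling rates for penalties of the form sum_i w_i / r_i."""
    denom = sum(math.sqrt(w) for w in weights)
    return [capacity * math.sqrt(w) / denom for w in weights]

print(shed_rates([1.0, 4.0], 3.0))  # -> [1.0, 2.0]
```

The query with four times the weight gets only twice the rate, reflecting the square-root dependence; a general-purpose non-linear solver would reach the same allocation far more slowly.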
University of California, Los Angeles, Computer Science Department. 2010.