To subscribe to the db-ucla mailing list for seminar announcements, please visit this page
DB-UCLA Seminar
Time: 12:00n-1:00pm Fridays; Room: 4549 Boelter Hall
*To invite a guest speaker or to schedule a talk, contact Young Cha (cookie58z at gmail dot com)
From Regular Expressions to Nested Words: Unifying Languages and Query Execution for Relational and XML Sequences
Speaker:
Barzan Mozafari
Advisor:
Professor Zaniolo
Abstract:
There is growing interest in query language extensions for pattern
matching over event streams and stored database sequences, due
to the many important applications that such extensions make possible.
The push for such extensions has led DBMS vendors and
DSMS venture companies to propose Kleene-closure extensions of
SQL standards, building on seminal research that demonstrated the
effectiveness and amenability to efficient implementation of such
constructs. These extensions, however powerful, suffer from limitations
that severely impair their effectiveness in many real-world
applications. To overcome these problems, we have designed the
K*SQL language and system, based on our investigation of nested
words, a recent model that generalizes both words
and trees.
K*SQL extends the existing relational sequence languages, and
also enables applications from other domains such as genomics,
software analysis, and XML processing. At the same time, K*SQL
remains extremely efficient, using our powerful optimizations for
pattern search over nested words. Furthermore, we show that other
sequence languages and XPath can be automatically translated into
K*SQL, allowing for K*SQL to be also used as a high-performance
query execution back-end for those languages. Therefore, K*SQL
is a unifying SQL-based engine for sequence and XML queries,
which provides novel optimization techniques for both.
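The Kleene-closure pattern queries mentioned above can be illustrated with a small sketch (not K*SQL itself; the function and the "V-shape" example are just illustrative): matching a pattern of the form "one event A, then one or more falling steps, then one or more rising steps" over a price sequence, the kind of query these SQL extensions express declaratively.

```python
# Illustrative sketch of Kleene-closure pattern matching over a sequence:
# find "V" patterns (a run of falling prices followed by a run of rising
# prices), analogous to an A B+ C+ pattern in SQL sequence extensions.

def find_v_patterns(prices):
    """Return (start, bottom, end) index triples for V-shaped runs."""
    matches = []
    i, n = 0, len(prices)
    while i < n - 1:
        # B+ part: one or more falling steps
        j = i
        while j + 1 < n and prices[j + 1] < prices[j]:
            j += 1
        if j > i:  # saw at least one fall
            # C+ part: one or more rising steps
            k = j
            while k + 1 < n and prices[k + 1] > prices[k]:
                k += 1
            if k > j:
                matches.append((i, j, k))
            i = k if k > j else j
        else:
            i += 1
    return matches

print(find_v_patterns([5, 4, 3, 4, 6, 2, 1, 3]))  # -> [(0, 2, 4), (4, 6, 7)]
```

A dedicated engine such as K*SQL compiles such patterns into automata and applies optimizations rather than scanning naively as above.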
Mining and querying Tree-based Association Rules from XML documents
Speaker:
Mirjana Mazuran
Affiliation:
Politecnico di Milano
Abstract:
Extracting information from semistructured documents is becoming more and more critical as the amount of digital information available on the internet grows. Documents are often so large that even the dataset returned as the answer to a query may be too big to convey interpretable knowledge. Tree-based Association Rules (TARs) are a new way to face the challenges of this scenario. They are structure-preserving association rules mined from XML documents and provide approximate, intensional information on both their structure and contents. This mined knowledge is the gist of the original document and provides support for (i) query formulation on datasets whose structure is unknown and (ii) intensional query answering, that is, allowing users to query TARs rather than the original document in order to obtain quick, approximate answers.
IRMA: An Image Registration Meta-Algorithm ---
Evaluating Alternative Algorithms with Multiple Metrics
Speaker:
Kelvin T. Leung
Advisor:
Professor Parker
Abstract:
A common problem in scientific computing is the need to choose among a
number of different metrics or scoring criteria. To minimize error,
for example, one might choose among metrics like L1 error, L2 error,
etc. These choices can be difficult to make, because each criterion is
an objective that can be useful in its own right. Because they can
differ significantly from one another, ad hoc choices can have
significant consequences.
This problem is important in scientific imaging, as many different
measures of distance, similarity, and quality are useful for
images. In the image registration problem, one is given a pair of
images -- an input image and a reference image -- and a set of
registration algorithms for aligning them. The result of registration
is a transformation of the input image that resembles the reference
image, and the quality of this result is often the value of a metric
measuring the distance or similarity between these
images. Registration is a central problem in image analysis, and new
solutions provide new ways of retrieving images from databases.
In this talk, we describe a new meta-algorithm called Image
Registration Meta-Algorithm (IRMA) that evaluates results of multiple
metrics. Selection is done by a decision engine that automatically
stores metric computation results in a database, then mines these
results to evaluate which registration algorithm gives results of the
highest quality. To facilitate this goal, we have built a
scientific model-base management system using Postgres on the LONI/CCB
grid computing facility and the LONI pipeline workflow infrastructure,
which allows scientists to catalog results, such as provenance
information about the parameter settings used in the workflow,
and to perform "shotgun" experiments
to explore and further understand the space of alternatives in a more
organized fashion.
* This work was supported by NIH grant 1U54RR021813 and the Center for
Computational Biology, an NIH National Center of Biomedical
Computing.
Fast and Space-Efficient Computation of Equi-Depth Histograms for Data Streams
Speaker:
Hamid Mousavi
Advisor:
Professor Zaniolo
Abstract:
Equi-depth histograms represent a fundamental synopsis
widely used in both database and data stream applications, as they
provide the cornerstone of many techniques such as query optimization,
approximate query answering, distribution fitting, and parallel database
partitioning. Equi-depth histograms partition a sequence of data
such that every part has the same number of data items.
In this paper, we present a new algorithm to estimate equi-depth
histograms for high speed data streams over sliding windows. While
many previous methods were based on quantile computations, we propose
a new method called BAr Splitting Histogram (BASH) that provides an
expected ϵ-approximate solution to compute the equi-depth histogram.
Extensive experiments show that BASH is at least four times faster than
one of the best existing approaches, while achieving the same accuracy
and using less memory. The experimental results also indicate that BASH
is more stable on data affected by frequent concept shifts.
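The synopsis that BASH approximates can be sketched exactly in the offline case (this sketch is illustrative and is not the BASH algorithm, which works incrementally over streaming sliding windows): sort the data and cut it at every n/k-th item.

```python
# Minimal offline sketch of an equi-depth histogram: partition a sequence
# into k buckets holding (nearly) equal numbers of items. This is the exact
# synopsis that streaming algorithms like BASH must approximate under
# memory and time constraints.

def equi_depth_boundaries(data, k):
    """Return the k-1 bucket boundary values of an equi-depth histogram."""
    s = sorted(data)
    n = len(s)
    # boundary i sits at rank i*n/k in the sorted order
    return [s[(i * n) // k] for i in range(1, k)]

print(equi_depth_boundaries(range(100), 4))  # -> [25, 50, 75]
```

The streaming setting rules out sorting the whole window, which is why quantile sketches, and bar-splitting schemes like BASH, are needed.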
Extending Teradata UDFs to support SQL and Map Reduce programs
Speaker:
Nikolay Laptev
Advisor:
Professor Zaniolo
Abstract:
Teradata is a massively parallel processing system running a shared nothing architecture.
The Teradata DBMS is linearly and predictably scalable in all dimensions of a database
system workload (data volume, breadth, number of users, complexity of queries). One of
the many features of the Teradata Database includes the support for User Defined Functions (UDFs).
A Teradata Database UDF is a program that operates on data stored in relational tables.
UDFs allow users to add their own extensions to the Teradata SQL language.
Teradata UDFs must be written in Java or C. In this talk, a high-level overview is given
of extending Teradata UDFs to support SQL, in order to maximize programmer productivity and to decrease debugging
time. Furthermore, the conversion of Hadoop MapReduce programs and of Java UDFs into native
Teradata UDFs is also discussed. The work presented was part of a summer project at Teradata.
PolicyReplay: Misconfiguration-Response Queries for Data Breach Reporting
Speaker:
Daniel Fabbri
Affiliation:
U of Michigan
Abstract:
Recent legislation has increased the requirements of organizations to report data breaches, or unauthorized access to data. While access control policies are used to restrict access to a database, these policies are complex and difficult to configure. As a result, misconfigurations sometimes allow users access to unauthorized data.
In this paper, we consider the problem of reporting data breaches after such a misconfiguration is detected.
To locate past SQL queries that may have revealed unauthorized information, we introduce the novel idea of a misconfiguration response (MR) query. The MR-query cleanly addresses the challenges of information propagation within the database by replaying the log of operations and returning all logged queries for which the result has changed due to the misconfiguration. A strawman implementation of the MR-query would go back in time and replay all the operations that occurred in the interim, with the correct policy. However, re-executing all operations is inefficient. Instead, we develop techniques to improve reporting efficiency by reducing the number of operations that must be re-executed and reducing the cost of replaying the operations. An extensive evaluation shows that our method can reduce the total runtime by up to an order of magnitude.
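The strawman version of the MR-query described above can be sketched as follows (a toy illustration, not the paper's optimized method; the table, policies, and function names are invented for the example): re-run each logged query under both the misconfigured and the corrected policy, and report those whose answers differ.

```python
# Toy sketch of the strawman MR-query: replay the query log under the
# bad (misconfigured) and good (corrected) policies and report queries
# whose results changed. A policy here is simply the set of visible row ids.

def mr_query(log, run_query, bad_policy, good_policy):
    """Return the logged queries whose results differ between policies."""
    breached = []
    for q in log:
        if run_query(q, bad_policy) != run_query(q, good_policy):
            breached.append(q)
    return breached

table = {1: "alice", 2: "bob", 3: "carol"}

def run(q, policy):  # q is a predicate selecting rows by value
    return sorted(v for k, v in table.items() if k in policy and q(v))

bad = {1, 2, 3}   # misconfigured: row 3 should not have been visible
good = {1, 2}
log = [lambda v: v.startswith("c"), lambda v: v.startswith("a")]
print(len(mr_query(log, run, bad, good)))  # -> 1 (only the "c" query leaked)
```

The paper's contribution is avoiding this full re-execution by pruning operations that cannot be affected and cheapening the replay of the rest.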
The NoSQL Movement: Overview and discussion
Speaker:
Uri Schonfeld
Abstract:
In this seminar we will introduce the NoSQL movement and some of the recent
alternatives to the classic ACID-supporting SQL database. We will discuss
several of the servers available, some highly publicized recent adoptions by
high traffic websites (Twitter, Digg, Reddit) and have an open discussion on
where things are headed.
K*SQL: Searching for Complex Event Patterns in Linear Sequences and Nested Structures
Speaker:
Kai Zeng
Abstract:
A strong interest is emerging in SQL extensions for sequence patterns using
Kleene-closure expressions. This burst of interest from both the research community and
the commercial world is due to the many database and data stream applications made possible
by these extensions, including financial services, RFID-based inventory management, and electronic
health systems. The K*SQL system represents a major step forward in this area. K*SQL supports a more
expressive language that allows for generalized Kleene-closure queries and also achieves
the expressive power of the nested word model, which greatly expands the application domain to
include XML queries, software trace analysis, and genomics.
PRIMA Architracker
Speaker:
Shi Gao
Abstract:
The function of archiving history information is recognized as an important demand on new
database management systems. Besides content updates, we also need to pay attention to
schema changes. As of today, schema evolution remains an error-prone and time-consuming
undertaking. PRIMA Architracker addresses the problem by: (i) providing a general method to
archive the history information of a database automatically and efficiently, and (ii) making use of
HMM to record schema evolution.
Using Cluster Analysis to Extract Information from Educational Video Game Log Files
Speaker:
Deirdre Kerr
Abstract:
I am going to give a brief overview of the Math game we designed and the traditional
statistical methods of analyzing educational interventions such as this in the social
sciences. Then I will explain how the use of fuzzy feature cluster analysis, utilizing
the fanny algorithm in R, might help us understand the intervention at a deeper level.
Finally, I will go over additional questions
about such interventions that may be answerable through data mining techniques and ask for feedback.
Experimental Comparison of Document Clustering Techniques
Speaker:
Michael Shindler
Abstract:
We perform an experimental comparison of k-means and spectral
(minimum-conductance) clustering techniques over large datasets for document
retrieval applications, using f-measure as our basis for comparison. We have
three primary findings: one, that minimum-conductance does not appear to be a
useful objective to optimize; two, that the clustering assumed by the dataset
does not optimize the k-means objective well; and three, somewhat surprisingly,
a higher f-measure value can sometimes be obtained by using a larger number of
final clusters than is assumed by the dataset.
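The f-measure used as the basis for such comparisons can be sketched as follows (this is the standard class-to-cluster formulation; the talk's exact variant may differ): each true class is matched to its best cluster by F1, and the per-class scores are weighted by class size.

```python
# Sketch of the clustering f-measure: for each ground-truth class, take the
# best F1 score over all produced clusters, then average weighted by class
# size. classes and clusters are lists of sets of item ids.

def f_measure(classes, clusters):
    """Return the overall weighted f-measure of a clustering."""
    n = sum(len(c) for c in classes)
    total = 0.0
    for cls in classes:
        best = 0.0
        for clu in clusters:
            tp = len(cls & clu)          # items of this class in this cluster
            if tp == 0:
                continue
            p, r = tp / len(clu), tp / len(cls)
            best = max(best, 2 * p * r / (p + r))
        total += (len(cls) / n) * best
    return total

print(f_measure([{1, 2}, {3, 4}], [{1, 2}, {3, 4}]))  # -> 1.0 (perfect match)
```

Note that because each class picks its best-matching cluster independently, adding more clusters can only help or leave the score unchanged, which is consistent with the third finding above.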
Search by offering examples
Speaker:
Chu-Cheng Hsieh
Abstract:
Search engines have changed how people think about the data on the WWW. Since the data is massive, one of the most important goals today is to provide a service with which people can search for the information they want efficiently.
Nevertheless, in some scenarios it is unrealistic to expect a user to devise a set of keywords for getting the information from the web. For example, it is impossible to figure out keywords you do not yet know. If you want to
search for good basketball players in the NBA, it is difficult to define search keywords without knowing the names of players or having knowledge of basketball games. We aim to overcome these drawbacks by giving users the ability
to search by examples. The goal of this project is to provide an approach for web users to search for similar objects on the web, instead of using keywords to define the concept they want (traditional search). We will present an
efficient algorithm to identify entities under the assumption that tag information is given. A live demo will also be given during the presentation.
A Unifying Engine for Sequence Patterns and XML
Speaker:
Barzan Mozafari
Abstract:
Content hidden due to confidentiality.
Optimal Load Shedding with Aggregates and Mining Queries
Speaker:
Barzan Mozafari
Abstract:
To cope with bursty arrivals of high-volume data, a DSMS has to shed load while minimizing the degradation of Quality of Service (QoS). In this paper, we show that this problem can be formalized as a classical optimization task
from operations research, in ways that accommodate different requirements for multiple users, different query sensitivities to load shedding, and different penalty functions. Standard non-linear programming algorithms are adequate
for non-critical situations, but for severe overloads, we propose a more efficient algorithm that runs in linear time, without compromising optimality. Our approach is applicable to a large class of queries including traditional
SQL aggregates, statistical aggregates (e.g., quantiles), and data mining functions, such as k-means, naïve Bayesian classifiers, decision trees, and frequent pattern discovery (where we can even specify a different error bound for
each pattern). In fact, we show that these aggregate queries are special instances of a broader class of functions that we call reciprocal-error aggregates, for which the proposed methods apply with full generality.
Finally, we propose a novel architecture for supporting load shedding in an extensible system, where users can write arbitrary User Defined Aggregates (UDA), and thus confirm our analytical findings with several experiments
executed on an actual DSMS.
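A small worked example (illustrative only; this is not the paper's algorithm) shows why reciprocal-error penalties admit fast optimal shedding: minimizing a weighted sum of reciprocal errors sum_i w_i / r_i over sampling rates r_i subject to a total capacity sum_i r_i = C has, by Lagrange multipliers, the closed form r_i = C * sqrt(w_i) / sum_j sqrt(w_j), computable in linear time.

```python
# Illustrative closed-form load shedding for reciprocal-error penalties:
# minimize sum_i w_i / r_i subject to sum_i r_i = C. Setting the gradient
# to zero gives r_i proportional to sqrt(w_i), normalized to capacity C.
import math

def shed_rates(weights, capacity):
    """Optimal sampling rates for penalties of the form sum_i w_i / r_i."""
    denom = sum(math.sqrt(w) for w in weights)
    return [capacity * math.sqrt(w) / denom for w in weights]

print(shed_rates([1.0, 4.0], 3.0))  # -> [1.0, 2.0]
```

The query with four times the weight gets only twice the rate, reflecting the square-root dependence; a general-purpose non-linear solver would reach the same allocation far more slowly.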
University of California, Los Angeles, Computer Science Department. 2010.