To subscribe to the db-ucla mailing list for seminar announcements, please visit this page

DB-UCLA Seminar

Time: 12:00 noon - 1:00 pm, Fridays; Room: 4549 Boelter Hall

*To invite a guest speaker or to schedule a talk, contact Young Cha (cookie58z at gmail dot com)

Fall 2011
Date Speaker Title
09/30 Nikolay Laptev Scaling Google Realtime Analytics under Extreme Load
10/07 Dr. Carlo Zaniolo Recent Advances in Logic-Based Languages
10/14 Youngchul Cha Social Network Analysis Using Topic Model
10/21 Hamid Mousavi Computation of Biased Histograms over Fast Data Streams
10/28 Dr. Carlo Zaniolo Toward More Powerful Query Languages & Systems for Continuous Analytics and Data Mining
11/04 No seminar
11/11 Veteran's Day
11/18 Tyson Condie Scal(a)ing up Machine Learning and Graph-based Analytics
11/25 Thanksgiving
12/02 Mohan Yang Optimizing Content Freshness of Relations Extracted From the Web Using Keyword Search AND Querying Uncertain Data with Aggregate Constraints

Spring 2011
Date Speaker Title
04/01
04/08 Dr. Carlo Zaniolo [dbucla] SMM: a Data Stream Management System for Knowledge Discovery
04/15 Shanu Sushmita [snorg] Aggregated Search: Result Presentation and User Behavior.
04/22
04/29
05/06 Kai Zeng [dbucla] Uncertainty Propagation in Complex Query Networks on Data Streams: A New Paradigm for Load Shedding
05/13 Jer-Ming Chen [dbucla] Burrows-Wheeler Transform Database
05/20 Ryan R. Rosario [snorg] Towards a Crawling Schedule for Web Forum Content
05/27 Michael Shindler [dbucla] Clustering Large Datasets
05/27 Shi Gao [dbucla] Trajectory Pattern Mining
06/03 Nikhilesh Tadiparthi & Priya Meenakshi [dbucla] Processing of streaming XML Documents

Winter 2011
Date Speaker Title
01/07
01/14 Dr. Junghoo Cho [snorg] How to be a good student
01/21
01/28
02/04 Admission meeting
02/11 Dr. Carlo Zaniolo [dbucla] Bring Order to Data Models and Query Languages for Data Streams
02/18 Vladimir Braverman [dbucla] Space-efficient algorithms for data streams
02/25 Barzan Mozafari [dbucla] High-performance Pattern Detection and Discovery for Databases and Data Streams
03/04 Prospective student visit day
03/11

Fall 2010
Date Speaker Title
10/01 Barzan Mozafari [dbucla] From Regular Expressions to Nested Words: Unifying Languages and Query Execution for Relational and XML Sequences
10/08 Mirjana Mazuran [dbucla] Mining and querying Tree-based Association Rules from XML documents
10/15 Amruta Joshi [snorg] Leveraging Keyword Context to Improve Retrieval Relevance
10/22 Kelvin T. Leung [dbucla] IRMA: An Image Registration Meta-Algorithm --- Evaluating Alternative Algorithms with Multiple Metrics
10/29 Hamid Mousavi [dbucla] Fast and Space-Efficient Computation of Equi-Depth Histograms for Data Streams
11/05 Nikolay Laptev [dbucla] Extending Teradata UDFs to support SQL and Map Reduce programs
11/12 Veteran's day week
11/19 Daniel Fabbri [dbucla] PolicyReplay: Misconfiguration-Response Queries for Data Breach Reporting
11/26 Happy Thanksgiving!
12/03 Finals week


Scaling Google Realtime Analytics under Extreme Load Conditions: My Internship at Google

Speaker:
Nikolay Laptev

Abstract:
Supporting realtime data analytics is important for making timely decisions, but the extreme system load it entails makes it difficult to do at scale. In this talk I will describe my internship with the Google Realtime Analytics team, where we worked to address this problem. The talk covers the load generation tool we developed, the realtime monitoring tool, the dynamic RPC adjustment, and finally anomaly detection. The load generation framework was used to purposefully overload the system and find its bottlenecks. The realtime monitoring tool was developed to see where most packets were being dropped, and helped in further analysis of the system's weak points. The dynamic RPC adjustment (patent pending) was developed to address our scalability problem, and the anomaly detection implementation was supplied to alert both us and the user of any unexpected behavior in website traffic (due to system overload or other anomalous activity). With the help of the above, Google Analytics Realtime was launched on 09-28-2011. This talk will also briefly discuss future plans for our lab and the Stream Mill Miner system that we are developing.


Recent Advances in Logic-Based Languages

Speaker:
Dr. Carlo Zaniolo

Abstract:

Logic-based query languages represented a cornerstone of the relational database model introduced by E.F. Codd, and were then significantly extended with the development of Datalog and other deductive database languages that support rules and recursive queries. This line of research produced elegant semantics and implementation techniques, and delivered the enabling technology for the efficient implementation of recursive queries with stratified negation, supported by commercial DBMSs and specified by SQL standards. After a lull of several years, the spread of web-age information systems and applications has generated a vigorous resurgence of interest, leading to a "Springtime for Datalog" according to Joseph M. Hellerstein [3]. It was indeed in his lab at UCB that the idea of logic-based declarative specification and design of Internet protocols and services first emerged [4]; more recent work at UCB seeks to extend this approach and develop Datalog-based foundations for parallel and distributed programming languages [3]. On the Semantic Web front, a novelty of great interest is the introduction of Linear Datalog± for expressing and efficiently supporting subsets of description logic and reasoning for ontological queries [2]. Two other simple extensions of Datalog of significant practical interest have been proposed very recently: one is intended for complex graph queries, including page-rank queries [5], while the other supports more powerful continuous queries on data streams [6]. Finally, novel challenges and opportunities are emerging at the implementation level, making it possible to support recursive queries in a Map-Reduce environment [1].

This DBUCLA presentation will first summarize the tutorial I recently presented at the WAIM 2011/MSRA Summer School, and then expand on [5]. My slides from the Summer School can be downloaded from http://www.cs.ucla.edu/~zaniolo/talks11/.
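The recursive queries at the heart of this line of work are easy to illustrate. The following sketch (illustrative Python, not code from the talk) shows semi-naive evaluation of the classic transitive-closure Datalog program:

```python
# Semi-naive evaluation of the classic recursive Datalog program
#   tc(X, Y) :- edge(X, Y).
#   tc(X, Y) :- tc(X, Z), edge(Z, Y).
# Illustrative sketch only; the edge facts are invented.

def transitive_closure(edges):
    """Compute the transitive closure of a set of (x, y) edge facts."""
    tc = set(edges)          # rule 1: every edge fact is in tc
    delta = set(edges)       # facts newly derived in the last round
    while delta:
        # rule 2: join only the *new* facts against edge (semi-naive)
        new = {(x, w) for (x, y) in delta for (z, w) in edges if y == z}
        delta = new - tc
        tc |= delta
    return tc

edges = {("a", "b"), ("b", "c"), ("c", "d")}
print(sorted(transitive_closure(edges)))
```

Semi-naive evaluation joins only the facts derived in the previous round against the base relation, avoiding rederiving the same tuples in every iteration.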


References
[1] F. N. Afrati, V. R. Borkar, M. J. Carey, N. Polyzotis, and J. D. Ullman. Map-Reduce Extensions and Recursive Queries. In EDBT 2011: Invited Paper, pages 1–8, 2011.
[2] G. Gottlob, G. Orsi, and A. Pieris. Ontological Queries: Rewriting and Optimization. In ICDE 2011: Keynote Paper, pages 2–13, 2011.
[3] J. M. Hellerstein. Datalog Redux: Experience and Conjecture. In PODS 2010: Keynote Paper, pages 1–2, 2010.
[4] B. T. Loo, T. Condie, J. M. Hellerstein, P. Maniatis, T. Roscoe, and I. Stoica. Implementing Declarative Overlays. In SOSP, pages 75–90, 2005.
[5] M. Mazuran, E. Serra, and C. Zaniolo. WEB Queries in Datalog with Frequency Support. Submitted for publication.
[6] C. Zaniolo. The Logic of Query Languages for Data Streams. In Logic in DataBases 2011. EDBT Workshop, 2011.


Social Network Analysis Using Topic Model

Speaker:
Young Cha

Title: Social Network Analysis Using Topic Model
Abstract: In this paper, we discuss how we can extend probabilistic topic models to analyze the relationship graph of popular social-network data, so that we can label or group the edges and the nodes in the graph based on their topic similarity. In particular, we first apply the well-known Latent Dirichlet Allocation (LDA) model and its existing variants to the graph-labeling task and argue that the existing models do not handle popular nodes (the nodes with many incoming edges) in the graph very well. We then propose possible extensions to this model to deal with popular nodes. Our experiments show that the proposed extensions are very effective in labeling popular nodes, showing significant improvements over the existing methods. Our proposed methods can be used for providing, for instance, more relevant recommendations for people to follow within a social network.
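As a rough illustration of the modeling step described above (with invented user names and edges), the relationship graph can be recast as a corpus for a topic model such as LDA by treating each follower as a "document" whose "words" are the users they follow:

```python
# Sketch of turning a follow graph into topic-model input. The key idea
# from the abstract: one "document" per follower, whose "words" are the
# accounts they follow. Users and edges here are invented.
from collections import defaultdict

follow_edges = [
    ("alice", "nba"), ("alice", "espn"), ("alice", "nyt"),
    ("bob",   "nba"), ("bob",   "espn"),
    ("carol", "nyt"), ("carol", "bbc"),
]

def edges_to_documents(edges):
    """Group edges by follower; each followee list becomes a document."""
    docs = defaultdict(list)
    for follower, followee in edges:
        docs[follower].append(followee)
    return dict(docs)

docs = edges_to_documents(follow_edges)
print(docs["alice"])  # ['nba', 'espn', 'nyt']
```

The resulting documents can then be fed to a standard LDA implementation; the popular nodes the abstract discusses are the followees that occur in many documents.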

Computation of Biased Histograms over Fast Data Streams

Speaker:
Hamid Mousavi

Title: Computation of Biased Histograms over Fast Data Streams
Abstract: In this talk, I will present our current work on designing biased histograms over fast data streams with sliding windows. In biased histograms, we need to have the size of bars/buckets exponentially decreasing (or increasing) toward some points in the distribution of data. The biased points are usually one or both ends of the data distribution. Our algorithm, which uses a new sampling technique, is based on our previous work on designing equi-depth histograms over fast data streams. Both approaches take advantage of "Exponential Histograms" and try to split and merge histogram's bars in order to keep the histograms up-to-date.

Toward More Powerful Query Languages & Systems for Continuous Analytics and Data Mining.

Speaker:
Dr. Carlo Zaniolo

Title: Toward More Powerful Query Languages & Systems for Continuous Analytics and Data Mining.
Abstract: Database-oriented query languages for continuous queries offer many benefits, including scalability and parallelizability. However, they are impaired by severe limitations in their ability to express complex applications, such as data stream mining. The UCLA Stream Mill system addresses and overcomes these limitations by: 1) extensibility constructs that achieve Turing completeness for SQL-based languages; friendly environments for advanced applications, such as data stream mining, are thus achieved via these constructs and a multi-layer architecture; and 2) a powerful query language for searching for complex patterns recognizable by nested-word automata. Current research focuses on reclaiming blocking constructs, such as negation. Toward this goal, we use Datalog as a powerful modelling tool for the analysis of continuous queries on data streams.

Scal(a)ing up Machine Learning and Graph-based Analytics

Speaker:
Tyson Condie

Title: Scal(a)ing up Machine Learning and Graph-based Analytics
Abstract: Machine learning practitioners are increasingly interested in applying their algorithms to Big Data. Unfortunately, current high-level languages for data analytics (e.g., Hive, Pig, Sawzall, Scope) do not fully cover this domain. One key missing ingredient is the means to efficiently support iteration over the data. Zaharia et al. were the first to answer this call from a systems perspective with Spark. Spark adds the notion of a working set to data-parallel workflows and has published speed-ups of 30x over Hadoop MapReduce for many machine learning and graph algorithms. Unfortunately, Spark does not cover the whole pipeline of Big Data analytics; at Yahoo!, it is common to compose Pig, MPI and direct MapReduce program modules into workflows. This fragmentation of individual processing steps can be a major pain, e.g., for optimization, debugging, and code readability. Our prescription for this dilemma is a new DSL for data analytics called ScalOps. Like Pig, ScalOps combines the declarative style of SQL and the low-level procedural style of MapReduce. Like Spark, ScalOps can optimize its runtime—the Hyracks parallel-database engine—for repeated access to data collections. ScalOps is part of a broader research agenda to explore new abstractions for machine learning and graph-based analytics. In this talk, I will present example workflows from the machine learning domain expressed in ScalOps and their translation to Hyracks recursive query plans.


Optimizing Content Freshness of Relations Extracted From the Web Using Keyword Search AND Querying Uncertain Data with Aggregate Constraints

Speaker:
Mohan Yang

Title: Optimizing Content Freshness of Relations Extracted From the Web Using Keyword Search
Abstract: An increasing number of applications operate on data obtained from the Web. These applications typically maintain local copies of the web data to avoid network latency in data accesses. As the data on the Web evolves, it is critical that the local copy be kept up-to-date. Data freshness is one of the most important data quality issues, and has been extensively studied for various applications including web crawling. However, web crawling is focused on obtaining as many raw web pages as possible. Our applications, on the other hand, are interested in specific content from specific data sources. Knowing the content or the semantics of the data enables us to differentiate data items based on their importance and volatility, which are key factors that impact the design of the data synchronization strategy. In this work, we formulate the concept of content freshness, and present a novel approach that maintains content freshness with the least amount of web communication. Specifically, we assume data is accessible through a general keyword search interface, and we form keyword queries based on their selectivity, as well as their contribution to the content freshness of the local copy. Experiments show the effectiveness of our approach compared with several naive methods for keeping data fresh.

Title: Querying Uncertain Data with Aggregate Constraints
Abstract: Data uncertainty arises in many situations. A common approach to querying uncertain data is to sample many "possible worlds" from the uncertain data and to run queries against the possible worlds. However, sampling is not a trivial task, as a randomly sampled possible world may not satisfy known constraints imposed on the data. In this paper, we focus on an important category of constraints, the aggregate constraints. An aggregate constraint is placed on a set of records instead of on a single record, and a real-life system usually has a large number of aggregate constraints. It is a challenging task to find qualified possible worlds in this scenario, since tuple-by-tuple sampling is extremely inefficient because it rarely leads to a qualified possible world. In this paper, we introduce two approaches for querying uncertain data with aggregate constraints: constraint-aware sampling and MCMC sampling. Our experiments show that the new approaches lead to high-quality query results at reasonable cost.
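A toy sketch of the possible-worlds setting (tuples, probabilities, and the constraint are invented for illustration) shows why naive sampling is inefficient: most randomly sampled worlds violate the aggregate constraint and must be rejected, which is exactly what constraint-aware and MCMC sampling are designed to avoid:

```python
# Each uncertain tuple is present with some probability; the aggregate
# constraint here requires the SUM of present values to equal a known
# total. Naive tuple-by-tuple sampling must reject every world that
# violates the constraint. All data below is invented.
import random

uncertain_tuples = [(10, 0.5), (20, 0.5), (30, 0.5)]  # (value, presence prob)
KNOWN_TOTAL = 30                                      # aggregate constraint

def sample_qualified_world(tuples, total, rng, max_tries=10_000):
    """Rejection-sample worlds until one satisfies SUM(values) == total."""
    for _ in range(max_tries):
        world = [v for v, p in tuples if rng.random() < p]
        if sum(world) == total:
            return world
    return None  # no qualified world found within the budget

world = sample_qualified_world(uncertain_tuples, KNOWN_TOTAL, random.Random(0))
print(world)
```

With many tuples and many constraints, the rejection rate of this naive loop explodes, motivating the smarter samplers the abstract proposes.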


SMM: a Data Stream Management System for Knowledge Discovery

Speaker:
Professor Zaniolo

Abstract:
The problem of supporting data mining applications proved to be difficult for database management systems, and it is now proving to be very challenging for data stream management systems (DSMSs), where the limitations of SQL are made even more severe by the requirements of continuous queries. The major technical advances achieved separately on DSMSs and on data stream mining algorithms have failed to converge and produce powerful data stream mining systems. Such systems, however, are essential, since the traditional pull-based approach of cache mining is no longer applicable, and the push-based computing mode of data streams and their bursty traffic complicate application development. For instance, to write mining applications with quality of service (QoS) levels approaching those of DSMSs, a mining analyst would have to contend with many arduous tasks, such as support for data buffering, complex storage and retrieval methods, scheduling, fault-tolerance, synopsis management, load shedding, and query optimization.

Our Stream Mill Miner (SMM) system solves these problems by providing a data stream mining workbench that combines the ease of specifying high-level mining tasks, as in Weka, with the performance and QoS guarantees of a DSMS. This is accomplished in three main steps. The first is an open and extensible DSMS architecture where KDD queries can be easily expressed as user-defined aggregates (UDAs)—our system combines that with the efficiency of synoptic data structures and mining-aware load shedding and optimizations. The second key component of SMM is its integrated library of fast mining algorithms that are light enough to be effective on data streams. The third advanced feature of SMM is a Mining Model Definition Language (MMDL) that allows users to define the flow of mining tasks, integrated with a simple box&arrow GUI, to shield the mining analyst from the complexities of lower-level queries. SMM is the first DSMS capable of online mining and this paper describes its architecture, design, and performance on mining queries.
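The UDA abstraction that SMM builds on can be sketched in a few lines: a user-defined aggregate is specified by INITIALIZE, ITERATE, and TERMINATE steps that the engine drives as tuples stream in. The toy streaming average below is illustrative only, not code from SMM:

```python
# Minimal sketch of the user-defined aggregate (UDA) pattern: the engine
# calls initialize once, iterate per stream tuple, and terminate to emit
# the result. This toy "streaming average" is illustrative only.
class AvgUDA:
    def initialize(self):
        """INITIALIZE: set up the aggregate's state."""
        self.total, self.count = 0.0, 0

    def iterate(self, value):
        """ITERATE: fold one stream tuple into the state."""
        self.total += value
        self.count += 1

    def terminate(self):
        """TERMINATE: emit the final result."""
        return self.total / self.count if self.count else None

uda = AvgUDA()
uda.initialize()
for v in [2.0, 4.0, 9.0]:
    uda.iterate(v)
print(uda.terminate())  # 5.0
```

Mining algorithms fit the same mold: the state becomes a synopsis (e.g., a model under construction) and TERMINATE, or a windowed variant of it, emits the mined result.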


Aggregated Search: Result Presentation and User Behavior.

Speaker:
Shanu Sushmita

Affiliation:
University of Glasgow

Abstract:
In order to facilitate information access, search engines now provide access to diverse data in a unified manner, called aggregated search. An aggregated search interface is designed to aggregate retrieval results from different information sources (image, video, maps, blogs, etc.) into a single result page. In other words, aggregated search provides a richer media experience to users from only a small number of keywords or a phrase: without the user specifying a type of media, the results aim to cover everything relevant. This work explores aggregated search from the users' perspective, focusing on understanding and describing the phenomena related to the users' search process in the context of aggregated search. In building this understanding, the work presented here focuses on click behavior, information need, source relevance, and the dynamics of search intents. The understanding comes partly from conducting user studies and partly from analyzing search engine log data.


Uncertainty Propagation in Complex Query Networks on Data Streams: A New Paradigm for Load Shedding

Speaker:
Kai Zeng

Advisor:
Professor Zaniolo

Abstract:
Data Stream Management Systems (DSMSs) are subject to bursty data arrivals. When overloaded, a DSMS needs to shed some of the load while minimizing the loss of accuracy. Previous work on load shedding that focuses on optimizing query accuracy has significant limitations in its generality and applicability to complex queries. We propose and study a new load-shedding paradigm based on modeling the distribution and propagation of uncertainty. This work solves the problem of optimal load shedding for the most general case of query networks and operators. Our analytical conclusions are validated through extensive experiments.


Burrows-Wheeler Transform Database

Speaker:
Jer-Ming Chen

Advisor:
Professor Zaniolo

Abstract:
Imagine you have potentially 1 million documents to index. Each document is roughly 3000 characters long, so you have roughly 3 billion characters to index. Documents arrive at different times, just like your email. How would you index them so that the index is both high-performance and updatable?
This is a SQL database implementation of the BWT using SQLite on the C# .NET platform. SQLite can store hundreds of TB, so storage size is not a limiting factor.
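For reference, the Burrows-Wheeler Transform itself can be sketched with the naive rotation-sorting construction (illustrative Python, not the speaker's SQLite/C# implementation):

```python
# Naive Burrows-Wheeler Transform: append a terminator, sort all
# rotations of the string, and read off the last column. Real indexes
# build this via suffix arrays, but the output is the same.
def bwt(text, terminator="$"):
    """Return the BWT of `text` by sorting all rotations."""
    s = text + terminator
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

print(bwt("banana"))  # annb$aa
```

The transform groups equal characters together, which is what makes BWT-based indexes compressible and searchable via backward search.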


Towards a Crawling Schedule for Web Forum Content

Speaker:
Ryan R. Rosario

Abstract:
There is much literature about crawling web pages to minimize staleness and maximize recency. Web forum threads are similar to web pages, but pose more challenges because their content is feverishly created by an unpredictable set of users. Forum threads and posts are created at whim by users, and die off as the community loses interest in their topic, as moderators close/lock the thread, and as past threads are flushed from databases. The life cycle of a thread is different from the life cycle of a web page. In this talk, I will discuss a recent crawling effort on offtopic.com, and discuss my ideas on scheduling crawls for forums like offtopic.com to maximize recency and minimize staleness.


Clustering Large Datasets

Speaker:
Michael Shindler

Abstract:
Clustering is a common problem in many fields, and the k-means objective is a very popular formulation of it. We consider the case when the data to be clustered (via k-means) is too large to fit into main memory, and thus must be read and processed sequentially, using as little memory as possible. We show that, if we can read the data twice, we can achieve the same approximation bound that can be guaranteed for the normal (main memory) cases previously studied. If we are constrained to one read, we can achieve almost this bound -- only an additive epsilon factor worse. We demonstrate a considerable speed-up to the algorithm. Finally, we show that in addition to the theoretical guarantees, the algorithm achieves a great improvement in practice as well.
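To illustrate the one-pass, low-memory setting, here is the textbook online k-means update in Python. This is not the approximation algorithm from the talk, only a sketch of the sequential-access regime it operates in:

```python
# One-pass online k-means: the first k points seed the centers, and each
# later point pulls its nearest center toward it by a decreasing step.
# Memory use is O(k), independent of the stream length.
import math

def online_kmeans(points, k):
    """Single sequential pass over `points`, returning k centers."""
    centers = [list(p) for p in points[:k]]
    counts = [1] * k
    for p in points[k:]:
        j = min(range(k), key=lambda i: math.dist(centers[i], p))
        counts[j] += 1
        step = 1.0 / counts[j]            # decreasing learning rate
        centers[j] = [c + step * (x - c) for c, x in zip(centers[j], p)]
    return centers

pts = [(0, 0), (10, 10), (0, 1), (1, 0), (10, 11), (11, 10)]
print(online_kmeans(pts, 2))
```

Unlike the algorithms in the talk, this simple update carries no approximation guarantee; its virtue is that each point is read exactly once.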


Space-efficient algorithms for data streams

Speaker:
Vladimir Braverman

Advisor:
Professor Rafail Ostrovsky

Abstract:
The study of streaming algorithms is an important area of theoretical computer science with many practical applications. We will define the basic model of data streams and explain some fundamental algorithms and methods that have shaped the area. We will also survey some of our recent results and discuss current challenges and open problems. In particular, we will present the paper "Zero-One Frequency Laws" (STOC 2010), in which we investigate frequency-based functions and answer the main open question of Alon, Matias and Szegedy (STOC 1996).
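As an example of the kind of fundamental space-efficient streaming algorithm such a survey covers (not one of the speaker's own results), the Misra-Gries summary finds every item occurring more than n/k times in a stream of n items using only k-1 counters:

```python
# Misra-Gries frequent-items summary: keep at most k-1 counters; when a
# new item arrives and no counter is free, decrement all counters and
# drop those that reach zero. Every item with frequency > n/k survives.
def misra_gries(stream, k):
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

stream = ["a", "b", "a", "c", "a", "b", "a", "d", "a"]
print(misra_gries(stream, 2))  # only "a" can occur more than 9/2 times
```

The summary may contain false positives, but never misses a true heavy hitter, and uses space independent of the stream length.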


High-performance Pattern Detection and Discovery for Databases and Data Streams

Speaker:
Barzan Mozafari

Advisor:
Professor Zaniolo

Abstract:
Many important applications such as decision support, fraud detection, stock market analysis, weather forecasting and online recommendation systems involve the detection and discovery of patterns, including spatio-temporal patterns, sequential patterns, frequent itemset patterns and trajectory patterns. The efficient search for such patterns and the discovery of characteristic patterns of interest pose difficult research challenges for both stored data and data streams. My dissertation has focused on these challenges, seeking to achieve the following objectives: (1) minimal extensions to current query languages for patterns, (2) optimization techniques for pattern query languages across diverse computing environments (databases, data streams, single and parallel processing), (3) efficient algorithms for discovering interesting patterns in data streams, and (4) an integrated system that enables the definition, mining, evaluation and deployment of different types of patterns from data streams.

Toward these objectives, I have sought integrated solutions to these research problems at three different levels: (i) the algorithmic level, (ii) the language level and (iii) the system level. This talk overviews the progress that I have made on these problems, and on closely related ones such as privacy-preserving data mining. I will then describe a promising approach based on Kleene-closure expressions that we are currently pursuing for language extension, query optimization, and support for advanced spatio-temporal applications such as trajectory analysis.


From Regular Expressions to Nested Words: Unifying Languages and Query Execution for Relational and XML Sequences

Speaker:
Barzan Mozafari

Advisor:
Professor Zaniolo

Abstract:
There is growing interest in query language extensions for pattern matching over event streams and stored database sequences, due to the many important applications that such extensions make possible. The push for such extensions has led DBMS vendors and DSMS venture companies to propose Kleene-closure extensions of SQL standards, building on seminal research that demonstrated the effectiveness of such constructs and their amenability to efficient implementation. These extensions, however powerful, suffer from limitations that severely impair their effectiveness in many real-world applications. To overcome these problems, we have designed the K*SQL language and system, based on our investigation of nested words, a recent model that generalizes both words and trees. K*SQL extends existing relational sequence languages, and also enables applications from other domains such as genomics, software analysis, and XML processing. At the same time, K*SQL remains extremely efficient, using our powerful optimizations for pattern search over nested words. Furthermore, we show that other sequence languages and XPath can be automatically translated into K*SQL, allowing K*SQL to also be used as a high-performance query execution back-end for those languages. Therefore, K*SQL is a unifying SQL-based engine for sequence and XML queries, which provides novel optimization techniques for both.


Mining and querying Tree-based Association Rules from XML documents

Speaker:
Mirjana Mazuran

Affiliation:
Politecnico di Milano

Abstract:
Extracting information from semistructured documents is becoming more and more critical as the amount of digital information available on the internet grows. Documents are often so large that even the dataset returned as the answer to a query may be too big to convey interpretable knowledge. Tree-based Association Rules (TARs) are a new way to face the challenges of such a scenario. They are structure-preserving association rules mined from XML documents, and provide approximate, intensional information on both their structure and their contents. This mined knowledge is the gist of the original document and provides support for (i) query formulation on datasets whose structure is unknown and (ii) intensional query answering, that is, querying the TARs rather than the original document in order to provide quick, approximate answers.


IRMA: An Image Registration Meta-Algorithm ---
Evaluating Alternative Algorithms with Multiple Metrics

Speaker:
Kelvin T. Leung

Advisor:
Professor Parker

Abstract:
A common problem in scientific computing is the need to choose among a number of different metrics or scoring criteria. To minimize error, for example, one might choose among metrics like L1 error, L2 error, etc. These choices can be difficult to make, because each criterion is an objective that can be useful in its own right. Because they can differ significantly from one another, ad hoc choices can have significant consequences.

This problem is important in scientific imaging, as many different measures of distance, similarity, and quality are useful for images. In the image registration problem, one is given a pair of images -- an input image and a reference image -- and a set of registration algorithms for aligning them. The result of registration is a transformation of the input image that resembles the reference image, and the quality of this result is often the value of a metric measuring the distance or similarity between these images. Registration is a central problem in image analysis, and new solutions provide new ways of retrieving images from databases.

In this talk, we describe a new meta-algorithm called the Image Registration Meta-Algorithm (IRMA) that evaluates the results of multiple metrics. Selection is done by a decision engine that automatically stores metric computation results in a database, then mines these results to evaluate which registration algorithm gives results of the highest quality. To facilitate this goal, we have built a scientific model-base management system using Postgres on the LONI/CCB grid computing facility and the LONI pipeline workflow infrastructure, which allows scientists to catalog results, such as provenance information about the parameter settings used in a workflow, and to perform "shot-gun" experiments to explore and further understand the space of alternatives in a more organized fashion.

* This work supported by NIH grant 1U54RR021813 and the Center for Computational Biology, an NIH National Center of Biomedical Computing.


Fast and Space-Efficient Computation of Equi-Depth Histograms for Data Streams

Speaker:
Hamid Mousavi

Advisor:
Professor Zaniolo

Abstract:
Equi-depth histograms represent a fundamental synopsis widely used in both database and data stream applications, as they provide the cornerstone of many techniques such as query optimization, approximate query answering, distribution fitting, and parallel database partitioning. Equi-depth histograms try to partition a sequence of data in a way that every part has the same number of data items.

In this paper, we present a new algorithm to estimate equi-depth histograms for high speed data streams over sliding windows. While many previous methods were based on quantile computations, we propose a new method called BAr Splitting Histogram (BASH) that provides an expected ϵ-approximate solution to compute the equi-depth histogram. Extensive experiments show that BASH is at least four times faster than one of the best existing approaches, while achieving the same accuracy and using less memory. The experimental results also indicate that BASH is more stable on data affected by frequent concept shifts.
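The equi-depth objective itself is easy to state: partition the data so every bucket holds about the same number of items. The sketch below computes exact boundaries for a single window of data; BASH's contribution is maintaining an approximation of such boundaries incrementally over fast sliding windows:

```python
# Exact equi-depth histogram boundaries for one window of data: sort the
# window and pick every (n/num_buckets)-th value. A streaming algorithm
# like BASH approximates this without sorting the whole window.
def equi_depth_boundaries(values, num_buckets):
    """Return num_buckets-1 boundary values splitting `values` into
    buckets of (approximately) equal count."""
    s = sorted(values)
    n = len(s)
    return [s[(i * n) // num_buckets] for i in range(1, num_buckets)]

window = [5, 1, 9, 3, 7, 2, 8, 4, 6, 0]   # e.g., one sliding window
print(equi_depth_boundaries(window, 5))    # boundaries every 2 items
```

Sorting costs O(n log n) per window, which is exactly what a streaming method must avoid; the split-and-merge scheme in BASH keeps the boundaries approximately correct as items enter and expire.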


Extending Teradata UDFs to support SQL and Map Reduce programs

Speaker:
Nikolay Laptev

Advisor:
Professor Zaniolo

Abstract:
Teradata is a massively parallel processing system running a shared-nothing architecture. The Teradata DBMS is linearly and predictably scalable in all dimensions of a database system workload (data volume, breadth, number of users, complexity of queries). One of the many features of the Teradata Database is support for User Defined Functions (UDFs). A Teradata Database UDF is a program that operates on data stored in relational tables. UDFs allow users to add their own extensions to the Teradata SQL language.

Teradata UDFs must be written in Java or C. This talk gives a high-level overview of extending Teradata UDFs to support SQL, in order to maximize programmer productivity and decrease debugging time. Furthermore, the conversion of Hadoop MapReduce programs and of Java UDFs into native Teradata UDFs is also discussed. The work presented was part of a summer project at Teradata.


PolicyReplay: Misconfiguration-Response Queries for Data Breach Reporting

Speaker:
Daniel Fabbri

Affiliation:
U of Michigan

Abstract:
Recent legislation has increased the requirements of organizations to report data breaches, or unauthorized access to data. While access control policies are used to restrict access to a database, these policies are complex and difficult to configure. As a result, misconfigurations sometimes allow users access to unauthorized data. In this paper, we consider the problem of reporting data breaches after such a misconfiguration is detected. To locate past SQL queries that may have revealed unauthorized information, we introduce the novel idea of a misconfiguration response (MR) query. The MR-query cleanly addresses the challenges of information propagation within the database by replaying the log of operations and returning all logged queries for which the result has changed due to the misconfiguration. A strawman implementation of the MR-query would go back in time and replay all the operations that occurred in the interim, with the correct policy. However, re-executing all operations is inefficient. Instead, we develop techniques to improve reporting efficiency by reducing the number of operations that must be re-executed and reducing the cost of replaying the operations. An extensive evaluation shows that our method can reduce the total runtime by up to an order of magnitude.
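The strawman version of the MR-query described above can be sketched directly (the table, policies, and query log here are invented for illustration): replay each logged query under both the misconfigured and the corrected policy, and report every query whose result changed. The paper's contribution is avoiding exactly this full re-execution:

```python
# Strawman MR-query: re-run the whole query log under the bad and the
# corrected access-control policy and diff the results. All rows,
# policies, and queries below are invented for illustration.
rows = [
    {"patient": "p1", "diagnosis": "flu", "ward": "A"},
    {"patient": "p2", "diagnosis": "HIV", "ward": "B"},
]

def run(query, policy):
    """Evaluate a query over only the rows the policy allows."""
    visible = [r for r in rows if policy(r)]
    return sorted(query(visible))

bad_policy  = lambda r: True               # misconfiguration: all wards visible
good_policy = lambda r: r["ward"] == "A"   # intended policy: ward A only

query_log = {
    "q1": lambda rs: [r["patient"] for r in rs],
    "q2": lambda rs: [r["ward"] for r in rs if r["ward"] == "A"],
}

# A query is a potential breach iff its result differs across policies.
breaches = [name for name, q in query_log.items()
            if run(q, bad_policy) != run(q, good_policy)]
print(breaches)  # q1's result changed; q2's did not
```

Re-executing every logged operation like this is what makes the strawman expensive; the techniques in the talk prune which operations must actually be replayed.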


University of California, Los Angeles, Computer Science Department. 2010.
