To subscribe to the db-ucla mailing list for seminar announcements, please visit this page
DB-UCLA Seminar
Time: 12:30-1:30pm Fridays; Room: 4549 Boelter Hall
*To invite a guest speaker or to schedule a talk, contact Mike Welch (mjwelch@cs)
Verifying and Mining Frequent Patterns from Large Windows over Data Streams
Hetal Thakkar
Mining frequent itemsets from data streams has
proved to be very difficult because of computational complexity
and the need for real-time response. In this paper, we introduce
a novel verification algorithm which we then use to improve
the performance of monitoring and mining tasks for association
rules. Thus, we propose a frequent itemset mining method
for sliding windows that is faster than state-of-the-art
methods; in fact, its running time is nearly constant with
respect to the window size, which enables the mining of much
larger windows than was previously possible. The performance of other
frequent itemset mining methods (including those on static data)
can be improved likewise, by replacing their counting methods
(e.g., those using hash trees) with our verification algorithm.
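The abstract does not spell out the verification step, but the underlying task (checking which candidate itemsets meet a support threshold in a window of transactions) can be illustrated with a minimal sketch; the function name and data representation here are illustrative, not the paper's:

```python
def verify(window, candidates, min_support):
    """Return the candidate itemsets whose support in the window
    meets min_support, where support is the fraction of
    transactions that contain the itemset."""
    n = len(window)
    return [itemset for itemset in candidates
            if sum(1 for txn in window if itemset <= txn) / n >= min_support]

# A sliding window of four transactions, each a set of items.
window = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b", "c"}]
candidates = [frozenset("a"), frozenset("ab"), frozenset("abc")]
print(verify(window, candidates, min_support=0.5))
```

A naive scan like this is linear in the window size; the paper's contribution is making the verification fast enough that its cost is nearly independent of how large the window grows.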
The Hypothesis Web Project
Dr. D.S. Parker
The Hypothesis Web project is a new research project
directed by Stott Parker and Wes Chu, part of the
Consortium for Neuropsychiatric Phenomics (CNP), a new
NIH Interdisciplinary Research Center at UCLA initiated last Fall
(www.phenomics.ucla.edu).
The Hypothesis Web starts from the observation that the scale and
complexity of science have increased to the point where
a major obstacle to interdisciplinary research is the formulation of
hypotheses that span multiple fields of expertise. This is particularly
true of neuroscience research, which can involve aspects of very
diverse disciplines at many physical scales, where combined efforts across
multiple disciplines can be necessary to make any substantial progress.
This project will investigate the development of the Hypothesis Web,
a software platform that permits collaborative formulation of complex
scientific hypotheses. It represents a novel approach to the way
these hypotheses are conceived and tested, as an extension of existing
software systems for web-based collaboration, data management,
modeling, and access to online scientific literature. It will be used for
collaborative development of hypotheses within the other CNP projects.
Schema Evolution in Wikipedia: toward a Web Information System Benchmark - to appear in ICEIS 2008
Carlo Curino
Abstract:
Evolving the database that is at the core of an Information System represents a difficult maintenance problem
that has only been studied in the framework of traditional information systems. However, the problem is likely
to be even more severe in web information systems, where open-source software is often developed through
the contributions and collaboration of many groups and individuals. Therefore, in this paper, we present an in-
depth analysis of the evolution history of the Wikipedia database and its schema; Wikipedia is the best-known
example of a large family of web information systems built using the open-source software MediaWiki. Our
study is based on: (i) a set of Schema Modification Operators that provide a simple conceptual representation
for complex schema changes, and (ii) simple software tools to automate the analysis. This framework allowed
us to dissect and analyze the 4.5 years of Wikipedia history, which was short in time, but intense in terms of
growth and evolution. Beyond confirming the initial hunch about the severity of the problem, our analysis
suggests the need for developing better methods and tools to support graceful schema evolution. Therefore,
we briefly discuss documentation and automation support systems for database evolution, and suggest that the
Wikipedia case study can provide the kernel of a benchmark for testing and improving such systems.
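To make the idea of Schema Modification Operators concrete, the following sketch replays a hypothetical evolution history against a schema modeled as a dictionary; the operator names and the example history are illustrative, not the paper's actual SMO set or Wikipedia's real history:

```python
def apply_smo(schema, smo):
    """Apply one Schema Modification Operator to a schema,
    modeled here as {table: [column, ...]}."""
    op = smo[0]
    if op == "CREATE TABLE":
        _, table, columns = smo
        schema[table] = list(columns)
    elif op == "DROP TABLE":
        _, table = smo
        del schema[table]
    elif op == "ADD COLUMN":
        _, table, column = smo
        schema[table].append(column)
    elif op == "RENAME COLUMN":
        _, table, old, new = smo
        cols = schema[table]
        cols[cols.index(old)] = new
    return schema

schema = {}
history = [
    ("CREATE TABLE", "page", ["id", "title"]),
    ("ADD COLUMN", "page", "touched"),
    ("RENAME COLUMN", "page", "touched", "page_touched"),
]
for smo in history:
    apply_smo(schema, smo)
print(schema)  # {'page': ['id', 'title', 'page_touched']}
```

Representing each change as an explicit operator, rather than a raw SQL diff, is what allows a long evolution history to be dissected and analyzed automatically.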
Loopy Bayesian Inference Propagation
Arthur Choi
Abstract:
Loopy belief propagation is an approximation algorithm for reasoning in Bayesian networks that has been critical for enabling a variety of applications whose associated models are too difficult for exact algorithms. More recently, we have developed a perspective on loopy belief propagation that characterizes it as an algorithm performing exact inference in a Bayesian network that has been simplified by deleting edges from the original model. This perspective allows a simple bottom-up approach to improving approximations: start with a coarse approximation (loopy BP), then reason in the base approximation to find deleted edges to recover. We find empirically that recovering only a small number of edges can have a significant impact on approximation quality with only a modest amount of computational effort. In this talk, we present our edge deletion perspective on belief propagation, and highlight some of the more recent results that this perspective has allowed.
Handling Data Skew in Parallel Joins in Shared-Nothing Systems (SIGMOD 2008 - to appear)
Guest Speaker: Pekka Kostamaa, Xin Zhou, Teradata
Talk 1:
Title: Teradata Architecture - Technology and Differentiation
Presenter: Pekka Kostamaa, Director, Advanced Development and Enterprise
Architecture, Teradata R&D
Talk 2:
Title: Handling Data Skew in Parallel Joins in Shared-Nothing Systems
Presenter: Xin Zhou, Software Engineer, Teradata
Abstract:
Parallel processing continues to be important in large data warehouses.
The processing requirements continue to expand in multiple dimensions.
These include greater volumes, increasing number of concurrent users,
more complex queries, and more applications which define complex
logical, semantic, and physical data models. Shared nothing parallel
database management systems can scale up "horizontally" by adding more
nodes. Most parallel algorithms, however, do not take into account data
skew. Data skew occurs naturally in many applications. A query
processing skewed data not only suffers a slower response time, but
also generates hot nodes, which become a bottleneck that throttles
overall system performance. Motivated by real business problems, we propose a
new join geography called PRPD (Partial Redistribution & Partial
Duplication) to improve the performance and scalability of parallel
joins in the presence of data skew in a shared-nothing system. Our
experimental results show that PRPD significantly speeds up query
elapsed time in the presence of data skew. Our experience shows that
eliminating system bottlenecks caused by data skew improves the
throughput of the whole system which is important in parallel data
warehouses that often run high concurrency workloads.
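The abstract does not detail PRPD's mechanics; a common way to picture the idea, under the assumption that rows of the skewed relation with a hot join key stay local while the matching rows of the other relation are duplicated to every node, is the following routing sketch (the names are ours, not Teradata's):

```python
def route(key, table, skewed_keys, n_nodes, local_node):
    """Destination node(s) for one row in a PRPD-style parallel join.
    Rows of the skewed relation with a skewed key are kept local
    (no redistribution); rows of the other relation with that key
    are duplicated to all nodes, so every potential match is still
    co-located; everything else is hash-redistributed as usual."""
    if key in skewed_keys:
        if table == "skewed":
            return {local_node}           # partial (non-)redistribution
        return set(range(n_nodes))        # partial duplication
    return {hash(key) % n_nodes}          # ordinary hash redistribution

print(route("hot", "skewed", {"hot"}, n_nodes=4, local_node=2))   # stays on node 2
print(route("hot", "other", {"hot"}, n_nodes=4, local_node=2))    # goes to all nodes
```

The benefit is that no single node receives all the rows for a hot key, which is exactly the bottleneck the abstract describes.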
TaskMaster:
A Scalable, Reliable Queuing Infrastructure for Building Distributed
Systems
Speaker: Josh Hyman, Google, Santa Monica
TaskMaster is a system for managing priority-ordered queues that is
designed to scale to 1 billion tasks across 100 thousand queues per
node. A reliable queuing system, such as TaskMaster, provides a
mechanism for distributing units of inherently serial work (tasks) to
workers. TaskMaster provides the ability to atomically transition a
task between queues, facilitating the decomposition of processing into
a sequence of tasks. Individual tasks are exclusively leased to
workers for the duration of processing to minimize work duplication
system wide. TaskMaster helps system designers detect bottlenecks and
direct system optimization by providing queue statistics such as
enqueue and dequeue rates, as well as queue lengths. TaskMaster allows
application designers to focus on the individual stages of data
processing instead of task distribution.
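As a rough illustration of the lease semantics described above, here is a toy in-process queue with priorities and exclusive leases; the API is hypothetical and does not reflect TaskMaster's actual interface or scale:

```python
import heapq
import itertools
import time

class LeaseQueue:
    """Toy priority queue with exclusive, time-limited leases."""
    def __init__(self):
        self._heap = []                 # (priority, seq, task)
        self._seq = itertools.count()   # tie-breaker for equal priorities
        self._leases = {}               # task -> lease expiry (monotonic time)

    def enqueue(self, task, priority=0):
        heapq.heappush(self._heap, (priority, next(self._seq), task))

    def lease(self, duration=30.0):
        """Exclusively hand the highest-priority task to a worker
        until the lease expires or the worker acknowledges it."""
        if not self._heap:
            return None
        _, _, task = heapq.heappop(self._heap)
        self._leases[task] = time.monotonic() + duration
        return task

    def ack(self, task):
        """Worker finished processing: release the lease for good."""
        self._leases.pop(task, None)

q = LeaseQueue()
q.enqueue("resize-image", priority=1)
q.enqueue("send-email", priority=0)
task = q.lease()
print(task)   # lower number = higher priority, so "send-email" comes first
q.ack(task)
```

An atomic transition between queues would amount to an ack and an enqueue performed as one step, which is what lets processing be decomposed into a sequence of tasks without losing or duplicating work.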
Automatically Identifying Localizable Queries - SIGIR 2008
Mike Welch
Personalization of web search results as a technique for improving
user satisfaction has received notable attention in
the research community over the past decade. Much of this
work focuses on modeling and establishing a profile for each
user to aid in personalization. Our work takes a more query-centric
approach. In this paper, we present a method for
efficient, automatic identification of a class of queries we define
as localizable from a web search engine query log. We
determine a set of relevant features and use conventional
machine learning techniques to classify queries. Our experiments
find that our technique is able to identify localizable
queries with 94% accuracy.
I know what you did last summer -- Query logs and user privacy
Invited Speaker - Dr. Rosie Jones, Yahoo! Research
Abstract:
We study the subtle cues to user identity that may be exploited in attacks on the privacy of web search query logs. We focus on two families of attacks. In the first, an attacker studies a particular sequence of queries from a single user, and attempts to map the sequence to a real-world individual (trace attack). In the second, an attacker identifies a target real-world individual, and attempts to isolate that person's queries in a large query log (person attack). In this context, we show the following. First, reasonably accurate classifiers can be built to infer the user's gender, age, and location from their query logs. Second, these classifiers can in turn be used to breach privacy under both trace attack and person attack models. Third, the accuracies of these classifiers remain somewhat reasonable even after removing names/numbers or limiting the size of the query log.
Speaker Bio:
Rosie Jones is a Senior Research Scientist in Information Retrieval at Yahoo! Research. She is an active participant in the IR community, serving as Senior PC member for SIGIR in 2007 and 2008. Her research interests include information retrieval, web mining and natural language processing.
Efficient Computation of Personal Aggregate Queries on Blogs
Richard Sia, UCLA
Abstract:
There is an exploding amount of user-generated content on the Web
due to the emergence of ``Web 2.0'' services, such as Blogger,
MySpace, Flickr, and del.icio.us. The participation of a large
number of users in sharing their opinion on the Web has inspired
researchers to build an effective ``information filter'' by
aggregating these independent opinions. However, given the diverse
groups of users on the Web nowadays, the global aggregation of the
information may not be of much interest to different groups of
users. In this paper, we explore the possibility of computing
personalized aggregation over the opinions expressed on the Web
based on a user's indication of trust over the information sources.
The hope is that by employing such ``personalized'' aggregation, we
can make the recommendation more likely to be interesting to the
users. We address the challenging scalability issues by proposing
an efficient method that utilizes two core techniques, Non-Negative
Matrix Factorization and the Threshold Algorithm, to compute
personalized aggregations when there are potentially millions of
users and millions of sources within a system. Through experiments
on a real-life dataset, we show that our personalized aggregation
approach indeed makes a significant difference in the items that are
recommended, and that it reduces the query computation cost
significantly, often by more than 75%, while keeping the results of
the personalized aggregation sufficiently accurate.
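The Threshold Algorithm mentioned above is presumably Fagin's TA for top-k aggregation; a standard sketch of it (not the paper's adaptation) over per-source score lists sorted in descending order:

```python
import heapq

def threshold_algorithm(sorted_lists, k):
    """Fagin's Threshold Algorithm: find the top-k items by summed
    score across sources without reading every list to the end.
    Each list holds (item, score) pairs sorted by descending score."""
    lookup = [dict(lst) for lst in sorted_lists]   # random-access scores
    scores = {}                                    # item -> aggregate score
    for depth in range(max(len(lst) for lst in sorted_lists)):
        threshold = 0.0                            # best possible unseen score
        for lst in sorted_lists:
            if depth >= len(lst):
                continue
            item, score = lst[depth]
            threshold += score
            if item not in scores:
                # Random access: fetch this item's score in every list.
                scores[item] = sum(t.get(item, 0.0) for t in lookup)
        top = heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])
        # Stop once k items score at least the threshold.
        if len(top) == k and top[-1][1] >= threshold:
            return top
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])

lists = [
    [("x", 0.9), ("y", 0.6), ("z", 0.1)],   # source 1
    [("y", 0.8), ("x", 0.7), ("z", 0.2)],   # source 2
]
print(threshold_algorithm(lists, k=1))
```

In the personalized setting each source's scores would additionally be weighted by the user's trust; the early termination at the threshold is what avoids scanning millions of sources per query.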
Managing Past Information in Present and Future DBMS
Dr. Carlo Zaniolo, UCLA
Preserving information for future use constitutes a critical
responsibility for our `Information Age'. Unfortunately, current
information systems are failing this requirement at their very
database core. Indeed, past attempts by database researchers to
provide temporal DBMS extensions failed to make an impact on
commercial systems. Fortunately time conquers all, and DBMS are
progressing to a point where many objectives of temporal database
research can now be realized by a more sophisticated exploitation
of present and emerging standards. For instance, XML supports a
temporally grouped state-based representation whereby the
histories of XML documents and relational databases can be
conveniently queried using XQuery. Even better news comes from the
SQL front, where OLAP functions and the recently proposed Kleene
closure constructs make it possible to express both event-based
and state-based temporal queries. In my seminar, I will propose a
relational representation of temporal information that makes this
possible; then I'll discuss various schema evolution problems that
emerge in this context, along with techniques and systems we
developed for supporting (i) the logical schema evolution process,
(ii) historical information schemas, and (iii) historical queries
in the presence of schema evolution. These techniques were
initially designed for transaction-time databases, but many
generalize to bitemporal databases.
Lessons Learnt from the UCSD Data Mining Contest
Richard Sia, UCLA
In this talk, I will give a brief overview of the UCSD data mining contest and share
my experience participating in this event. In particular, I will discuss
the data mining tools used, the merits of the different data mining algorithms,
and the techniques used to improve classification accuracy.
Supporting Knowledge Discovery in Data Stream Management Systems
Hetal Thakkar, UCLA
A growing number of applications, including network traffic monitoring and highway congestion analysis, continuously generate massive data streams. Management of these streams introduces many new research challenges, which include Quality of Service (QoS) guarantees, window synopses, and I/O buffering. Therefore, many research projects have focused on building Data Stream Management Systems (DSMSs) to solve these challenges. However, all of these systems are limited to simple continuous queries over data streams, i.e., they do not support advanced applications, such as data stream mining, even though such applications are critical for data stream processing. In particular, data stream mining is required in many real-world applications, such as web click-stream analysis, market basket data mining, and credit card fraud detection. The importance of data stream mining is further illustrated by the research projects focusing on devising fast & light algorithms for online mining. However, we note that the deployment of such methods presents further challenges beyond devising fast & light algorithms. In particular, data stream mining methods must be deployed alongside all the DSMS essentials, including QoS, load shedding, and synopses, that are already provided for simpler applications. Thus, in this dissertation I extend a DSMS, namely Stream Mill, into an online data mining workbench with the help of three novel techniques, as follows.
(1) Enhancing the expressive power of DSMSs to support more advanced queries, such as online mining, sequence queries, etc., by extending their query language (namely SQL),
(2) Integrating a suite of online mining algorithms in the DSMS, including advanced techniques, such as ensemble based methods, and
(3) Supporting work-flows, whereby the analyst can specify the complete mining process. This promotes ease of use, since all users can simply invoke the predefined work-flows instead of redefining them from scratch. Of course, the system allows advanced users to integrate new mining algorithms as mining models and flows.
Finally, we show that the resulting data stream mining workbench achieves performance and extensibility, which are unmatched even by static mining workbenches.
Scientific Approach on AI and the Application on Document Processing and Searches
Qin Zhang, Founder, Booloon Inc.
The Artificial Intelligence field as we know it is not exactly a scientific field. The fundamental questions critical to this field remain
unanswered, and there is much confusion about what problems it should
solve. In this talk, I will provide an overview analysis of these
problems and present my solutions.
My approach starts at the philosophical level. Through analyzing
and dividing the problems, I believe the solutions are surprisingly simple.
My current focus is on language understanding. I will introduce my method of
language and knowledge structuring and its applications to document
processing and search methods.
Speaker Bio:
Qin Zhang is an author, inventor, and founder of the startup company Booloon, Inc. She is working on a book analyzing the human belief and knowledge system, currently titled "Age of Dismay? - Illustration of Modern Lives". She has four pending patent applications in the AI and language processing field:
Thinking System and Method; System and Method for Information Processing and
Motor Control; Search Method and System Using Thinking System; Content
Summarizing and Search Method and System. She also has a pending patent
application in Data Compression: Method and System of Compressing and
Decompressing Data. She is developing search methods using her inventions in
the AI field, and is currently working on document content summarization and
meaning-based keyword identification.
She received her JD from Southwestern University School of Law, MSME from University of Miami, and BSME from Tongji University (Shanghai, China). She is a licensed Patent Attorney and Professional Engineer.
Named Entity-Centred Information Processing
Dr. Tru H. Cao, Associate Professor, Ho Chi Minh City University of Technology,
Vietnam
Information on the Web is pervaded with references to real-world individuals, i.e., named entities. The content of a text is mainly defined by the keywords and named entities occurring in it.
Moreover, named entities have ontological features, namely, their aliases, types, and
identifiers, which are hidden behind their textual appearance but may be of concern to users. First,
this talk presents a hybrid statistical and rule-based incremental approach to recognize
ambiguous named entities. Second, it introduces a multi-vector space model to combine
keywords and named entity features for document searching and clustering. Third, it presents
a robust named entity-based method for translating natural language queries to conceptual
graphs.
Speaker Bio:
Tru Cao received his B.Eng. in Computer Science and Engineering from Ho Chi Minh City
University of Technology in 1990, M.Eng. in Computer Science from Asian Institute of
Technology in 1995, and Ph.D. in Computer Science from University of Queensland in 1999.
He then did postdoctoral research in the Artificial Intelligence Group at University of Bristol,
and the Berkeley Initiative in Soft Computing Group at UC Berkeley.
He served as Vice Dean (2002-2007) and Head of the Department of Computer Science
(2002-2006) in the Faculty of Computer Science and Engineering, HCMUT. He has received
HCMUT's annual Excellent Teaching Award every year since 2002.
His research interests include semantic web, soft computing, and conceptual structures. He
has published about 70 papers in international journals and conference proceedings in those
areas, and has been the chief investigator of the national key project on Vietnamese Semantic
Web. He is a member of the editorial board of the International Journal of Intelligent
Information and Database Systems, the steering committee of the Pacific Rim International
Conference on Artificial Intelligence, and the program committees of several international
conferences.
University of California, Los Angeles, Computer Science Department. 2008.