To subscribe to the db-ucla mailing list for seminar announcements, please visit this page
DB-UCLA Seminar
Time: 12:30-1:30pm Fridays; Room: 4549 Boelter Hall
*To invite a guest speaker or to schedule a talk, contact Mike Welch (mjwelch@cs)
Verifying and Mining Frequent Patterns from Large Windows over Data Streams
Hetal Thakkar
Mining frequent itemsets from data streams has
proved to be very difficult because of computational complexity
and the need for real-time response. In this paper, we introduce
a novel verification algorithm which we then use to improve
the performance of monitoring and mining tasks for association
rules. Thus, we propose a frequent itemset mining method
for sliding windows that is faster than state-of-the-art
methods; in fact, its running time is nearly constant with
respect to the window size, which enables the mining of much
larger windows than was previously possible. The performance of other
frequent itemset mining methods (including those on static data)
can be improved likewise, by replacing their counting methods
(e.g., those using hash trees) with our verification algorithm.
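The abstract does not spell out the verification step, but the underlying task (checking which candidate itemsets meet a support threshold in a window of transactions) can be illustrated with a minimal sketch; the function name and data representation here are illustrative, not the paper's:

```python
def verify(window, candidates, min_support):
    """Return the candidate itemsets whose support in the window
    meets min_support, where support is the fraction of
    transactions that contain the itemset."""
    n = len(window)
    return [itemset for itemset in candidates
            if sum(1 for txn in window if itemset <= txn) / n >= min_support]

# A sliding window of four transactions, each a set of items.
window = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b", "c"}]
candidates = [frozenset("a"), frozenset("ab"), frozenset("abc")]
print(verify(window, candidates, min_support=0.5))
```

A naive scan like this is linear in the window size; the paper's contribution is making the verification fast enough that its cost is nearly independent of how large the window grows.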
The Hypothesis Web Project
Dr. D.S. Parker
The Hypothesis Web project is a new research project
directed by Stott Parker and Wes Chu, part of the
Consortium for Neuropsychiatric Phenomics (CNP), a new
NIH Interdisciplinary Research Center at UCLA initiated last Fall
(www.phenomics.ucla.edu).
The Hypothesis Web starts from the observation that the scale and
complexity of science have increased to the point where
a major obstacle to interdisciplinary research is the formulation of
hypotheses that span multiple fields of expertise. This is particularly
true of neuroscience research, which can involve aspects of very
diverse disciplines at many physical scales, where combined efforts across
multiple disciplines can be necessary to make any substantial progress.
This project will investigate the development of the Hypothesis Web,
a software platform that permits collaborative formulation of complex
scientific hypotheses. It represents a novel approach to the way
these hypotheses are conceived and tested, as an extension of existing
software systems for web-based collaboration, data management,
modeling, and access to online scientific literature. It will be used for
collaborative development of hypotheses within the other CNP projects.
Schema Evolution in Wikipedia: toward a Web Information System Benchmark - to appear in ICEIS 2008
Carlo Curino
Abstract:
Evolving the database that is at the core of an Information System represents a difficult maintenance problem
that has only been studied in the framework of traditional information systems. However, the problem is likely
to be even more severe in web information systems, where open-source software is often developed through
the contributions and collaboration of many groups and individuals. Therefore, in this paper, we present an in-
depth analysis of the evolution history of the Wikipedia database and its schema; Wikipedia is the best-known
example of a large family of web information systems built using the open-source software MediaWiki. Our
study is based on: (i) a set of Schema Modification Operators that provide a simple conceptual representation
for complex schema changes, and (ii) simple software tools to automate the analysis. This framework allowed
us to dissect and analyze the 4.5 years of Wikipedia history, which was short in time, but intense in terms of
growth and evolution. Beyond confirming the initial hunch about the severity of the problem, our analysis
suggests the need for developing better methods and tools to support graceful schema evolution. Therefore,
we briefly discuss documentation and automation support systems for database evolution, and suggest that the
Wikipedia case study can provide the kernel of a benchmark for testing and improving such systems.
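To make the idea of Schema Modification Operators concrete, the following sketch replays a hypothetical evolution history against a schema modeled as a dictionary; the operator names and the example history are illustrative, not the paper's actual SMO set or Wikipedia's real history:

```python
def apply_smo(schema, smo):
    """Apply one Schema Modification Operator to a schema,
    modeled here as {table: [column, ...]}."""
    op = smo[0]
    if op == "CREATE TABLE":
        _, table, columns = smo
        schema[table] = list(columns)
    elif op == "DROP TABLE":
        _, table = smo
        del schema[table]
    elif op == "ADD COLUMN":
        _, table, column = smo
        schema[table].append(column)
    elif op == "RENAME COLUMN":
        _, table, old, new = smo
        cols = schema[table]
        cols[cols.index(old)] = new
    return schema

schema = {}
history = [
    ("CREATE TABLE", "page", ["id", "title"]),
    ("ADD COLUMN", "page", "touched"),
    ("RENAME COLUMN", "page", "touched", "page_touched"),
]
for smo in history:
    apply_smo(schema, smo)
print(schema)  # {'page': ['id', 'title', 'page_touched']}
```

Representing each change as an explicit operator, rather than a raw SQL diff, is what allows a long evolution history to be dissected and analyzed automatically.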
Loopy Bayesian Inference Propagation
Arthur Choi
Abstract:
Loopy belief propagation is an approximation algorithm for reasoning in Bayesian networks that has been critical for enabling a variety of applications whose associated models are too difficult for exact algorithms. More recently, we have developed a perspective on loopy belief propagation that characterizes it as an algorithm performing exact inference in a Bayesian network that has been simplified by deleting edges from the original model. This perspective allows a simple bottom-up approach to improving approximations: start with a coarse approximation (loopy BP), then reason in the base approximation to find deleted edges to recover. We find empirically that recovering only a small number of edges can have a significant impact on approximation quality with only a modest amount of computational effort. In this talk, we present our edge deletion perspective on belief propagation, and highlight some of the more recent results that this perspective has allowed.
Handling Data Skew in Parallel Joins in Shared-Nothing Systems (SIGMOD 2008 - to appear)
Guest Speaker: Pekka Kostamaa, Xin Zhou, Teradata
Talk 1:
Title: Teradata Architecture - Technology and Differentiation
Presenter: Pekka Kostamaa, Director, Advanced Development and Enterprise
Architecture, Teradata R&D
Talk 2:
Title: Handling Data Skew in Parallel Joins in Shared-Nothing Systems
Presenter: Xin Zhou, Software Engineer, Teradata
Abstract:
Parallel processing continues to be important in large data warehouses.
The processing requirements continue to expand in multiple dimensions.
These include greater volumes, increasing number of concurrent users,
more complex queries, and more applications which define complex
logical, semantic, and physical data models. Shared nothing parallel
database management systems can scale up "horizontally" by adding more
nodes. Most parallel algorithms, however, do not take into account data
skew. Data skew occurs naturally in many applications. A query
processing skewed data not only suffers a slower response time, but
also generates hot nodes, which become a bottleneck that throttles
overall system performance. Motivated by real business problems, we propose a
new join geography called PRPD (Partial Redistribution & Partial
Duplication) to improve the performance and scalability of parallel
joins in the presence of data skew in a shared-nothing system. Our
experimental results show that PRPD significantly speeds up query
elapsed time in the presence of data skew. Our experience shows that
eliminating system bottlenecks caused by data skew improves the
throughput of the whole system which is important in parallel data
warehouses that often run high concurrency workloads.
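The abstract does not detail PRPD's mechanics; a common way to picture the idea, under the assumption that rows of the skewed relation with a hot join key stay local while the matching rows of the other relation are duplicated to every node, is the following routing sketch (the names are ours, not Teradata's):

```python
def route(key, table, skewed_keys, n_nodes, local_node):
    """Destination node(s) for one row in a PRPD-style parallel join.
    Rows of the skewed relation with a skewed key are kept local
    (no redistribution); rows of the other relation with that key
    are duplicated to all nodes, so every potential match is still
    co-located; everything else is hash-redistributed as usual."""
    if key in skewed_keys:
        if table == "skewed":
            return {local_node}           # partial (non-)redistribution
        return set(range(n_nodes))        # partial duplication
    return {hash(key) % n_nodes}          # ordinary hash redistribution

print(route("hot", "skewed", {"hot"}, n_nodes=4, local_node=2))   # stays on node 2
print(route("hot", "other", {"hot"}, n_nodes=4, local_node=2))    # goes to all nodes
```

The benefit is that no single node receives all the rows for a hot key, which is exactly the bottleneck the abstract describes.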
TaskMaster:
A Scalable, Reliable Queuing Infrastructure for Building Distributed
Systems
Speaker: Josh Hyman, Google, Santa Monica
TaskMaster is a system for managing priority-ordered queues that is
designed to scale to 1 billion tasks across 100 thousand queues per
node. A reliable queuing system, such as TaskMaster, provides a
mechanism for distributing units of inherently serial work (tasks) to
workers. TaskMaster provides the ability to atomically transition a
task between queues, facilitating the decomposition of processing into
a sequence of tasks. Individual tasks are exclusively leased to
workers for the duration of processing to minimize work duplication
system wide. TaskMaster helps system designers detect bottlenecks and
direct system optimization by providing queue statistics such as
enqueue and dequeue rates, as well as queue lengths. TaskMaster allows
application designers to focus on the individual stages of data
processing instead of task distribution.
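As a rough illustration of the lease semantics described above, here is a toy in-process queue with priorities and exclusive leases; the API is hypothetical and does not reflect TaskMaster's actual interface or scale:

```python
import heapq
import itertools
import time

class LeaseQueue:
    """Toy priority queue with exclusive, time-limited leases."""
    def __init__(self):
        self._heap = []                 # (priority, seq, task)
        self._seq = itertools.count()   # tie-breaker for equal priorities
        self._leases = {}               # task -> lease expiry (monotonic time)

    def enqueue(self, task, priority=0):
        heapq.heappush(self._heap, (priority, next(self._seq), task))

    def lease(self, duration=30.0):
        """Exclusively hand the highest-priority task to a worker
        until the lease expires or the worker acknowledges it."""
        if not self._heap:
            return None
        _, _, task = heapq.heappop(self._heap)
        self._leases[task] = time.monotonic() + duration
        return task

    def ack(self, task):
        """Worker finished processing: release the lease for good."""
        self._leases.pop(task, None)

q = LeaseQueue()
q.enqueue("resize-image", priority=1)
q.enqueue("send-email", priority=0)
task = q.lease()
print(task)   # lower number = higher priority, so "send-email" comes first
q.ack(task)
```

An atomic transition between queues would amount to an ack and an enqueue performed as one step, which is what lets processing be decomposed into a sequence of tasks without losing or duplicating work.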
Automatically Identifying Localizable Queries - SIGIR 2008
Mike Welch
Personalization of web search results as a technique for improving
user satisfaction has received notable attention in
the research community over the past decade. Much of this
work focuses on modeling and establishing a profile for each
user to aid in personalization. Our work takes a more query-centric
approach. In this paper, we present a method for
efficient, automatic identification of a class of queries we define
as localizable from a web search engine query log. We
determine a set of relevant features and use conventional
machine learning techniques to classify queries. Our experiments
find that our technique is able to identify localizable
queries with 94% accuracy.
I know what you did last summer -- Query logs and user privacy
Invited Speaker - Dr. Rosie Jones, Yahoo! Research
Abstract:
We study the subtle cues to user identity that may be exploited in attacks on the privacy of web search query logs. We focus on two families of attacks. In the first, an attacker studies a particular sequence of queries from a single user, and attempts to map the sequence to a real-world individual (trace attack). In the second, an attacker identifies a target real-world individual, and attempts to isolate that person's queries in a large query log (person attack). In this context, we show the following. First, reasonably accurate classifiers can be built to infer the user's gender, age, and location from their query logs. Second, these classifiers can in turn be used to breach privacy under both trace attack and person attack models. Third, the accuracies of these classifiers remain somewhat reasonable even after removing names/numbers or limiting the size of the query log.
Speaker Bio:
Rosie Jones is a Senior Research Scientist in Information Retrieval at Yahoo! Research. She is an active participant in the IR community, serving as Senior PC member for SIGIR in 2007 and 2008. Her research interests include information retrieval, web mining and natural language processing.
Efficient Computation of Personal Aggregate Queries on Blogs
Richard Sia, UCLA
Abstract:
There is an exploding amount of user-generated content on the Web
due to the emergence of ``Web 2.0'' services, such as Blogger,
MySpace, Flickr, and del.icio.us. The participation of a large
number of users in sharing their opinion on the Web has inspired
researchers to build an effective ``information filter'' by
aggregating these independent opinions. However, given the diverse
groups of users on the Web nowadays, the global aggregation of the
information may not be of much interest to different groups of
users. In this paper, we explore the possibility of computing
personalized aggregation over the opinions expressed on the Web
based on a user's indication of trust over the information sources.
The hope is that by employing such ``personalized'' aggregation, we
can make the recommendation more likely to be interesting to the
users. We address the challenging scalability issues by proposing
an efficient method that utilizes two core techniques, Non-Negative
Matrix Factorization and the Threshold Algorithm, to compute
personalized aggregations when there are potentially millions of
users and millions of sources within a system. Through experiments
on a real-life dataset, we show that our personalized aggregation
approach indeed makes a significant difference in the items that are
recommended, and that it reduces the query computation cost
significantly, often by more than 75%, while keeping the results of
the personalized aggregation sufficiently accurate.
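The Threshold Algorithm mentioned above is presumably Fagin's TA for top-k aggregation; a standard sketch of it (not the paper's adaptation) over per-source score lists sorted in descending order:

```python
import heapq

def threshold_algorithm(sorted_lists, k):
    """Fagin's Threshold Algorithm: find the top-k items by summed
    score across sources without reading every list to the end.
    Each list holds (item, score) pairs sorted by descending score."""
    lookup = [dict(lst) for lst in sorted_lists]   # random-access scores
    scores = {}                                    # item -> aggregate score
    for depth in range(max(len(lst) for lst in sorted_lists)):
        threshold = 0.0                            # best possible unseen score
        for lst in sorted_lists:
            if depth >= len(lst):
                continue
            item, score = lst[depth]
            threshold += score
            if item not in scores:
                # Random access: fetch this item's score in every list.
                scores[item] = sum(t.get(item, 0.0) for t in lookup)
        top = heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])
        # Stop once k items score at least the threshold.
        if len(top) == k and top[-1][1] >= threshold:
            return top
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])

lists = [
    [("x", 0.9), ("y", 0.6), ("z", 0.1)],   # source 1
    [("y", 0.8), ("x", 0.7), ("z", 0.2)],   # source 2
]
print(threshold_algorithm(lists, k=1))
```

In the personalized setting each source's scores would additionally be weighted by the user's trust; the early termination at the threshold is what avoids scanning millions of sources per query.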
Managing Past Information in Present and Future DBMS
Dr. Carlo Zaniolo, UCLA
Preserving information for future use constitutes a critical
responsibility for our `Information Age'. Unfortunately, current
information systems are failing this requirement at their very
database core. Indeed, past attempts by database researchers to
provide temporal DBMS extensions failed to make an impact on
commercial systems. Fortunately time conquers all, and DBMS are
progressing to a point where many objectives of temporal database
research can now be realized by a more sophisticated exploitation
of present and emerging standards. For instance, XML supports a
temporally grouped state-based representation whereby the
histories of XML documents and relational databases can be
conveniently queried using XQuery. Even better news comes from the
SQL front, where OLAP functions and the recently proposed Kleene
closure constructs make it possible to express both event-based
and state-based temporal queries. In my seminar, I will propose a
relational representation of temporal information that makes this
possible; then I'll discuss various schema evolution problems that
emerge in this context, along with techniques and systems we
developed for supporting (i) the logical schema evolution process,
(ii) historical information schemas, and (iii) historical queries
in the presence of schema evolution. These techniques were
initially designed for transaction-time databases, but many
generalize to bitemporal databases.
Lessons Learnt from the UCSD Data Mining Contest
Richard Sia, UCLA
In this talk, I will give a brief overview of the UCSD data mining contest and share
my experience participating in this event. In particular, I will discuss
the data mining tools used, the merits of the different data mining algorithms,
and the techniques used to improve classification accuracy.
Supporting Knowledge Discovery in Data Stream Management Systems
Hetal Thakkar, UCLA
A growing number of applications, including network traffic monitoring and highway congestion analysis, continuously generate massive data streams. Management of these streams introduces many new research challenges, which include Quality of Service (QoS) guarantees, window synopses, and I/O buffering. Therefore, many research projects have focused on building Data Stream Management Systems (DSMSs) to solve these challenges. However, all of these systems are limited to simple continuous queries over data streams, i.e., they do not support advanced applications, such as data stream mining, even though such applications are critical for data stream processing. In particular, data stream mining is required in many real-world applications, such as web click-stream analysis, market basket data mining, and credit card fraud detection. The importance of data stream mining is further illustrated by the research projects focusing on devising fast & light algorithms for online mining. However, we note that the deployment of such methods presents further challenges beyond devising fast & light algorithms. In particular, data stream mining methods must be deployed alongside all the DSMS essentials, including QoS, load shedding, and synopses, that are already provided for simpler applications. Thus, in this dissertation I extend a DSMS, namely Stream Mill, into an online data mining workbench with the help of three novel techniques, as follows.
(1) Enhancing the expressive power of DSMSs to support more advanced queries, such as online mining, sequence queries, etc., by extending their query language (namely SQL),
(2) Integrating a suite of online mining algorithms in the DSMS, including advanced techniques, such as ensemble based methods, and
(3) Supporting work-flows, whereby the analyst can specify the complete mining process. This promotes ease of use, since all users can simply invoke the predefined work-flows instead of redefining them from scratch. Of course, the system allows advanced users to integrate new mining algorithms as mining models and flows.
Finally, we show that the resulting data stream mining workbench achieves performance and extensibility, which are unmatched even by static mining workbenches.
Scientific Approach on AI and the Application on Document Processing and Searches
Qin Zhang, Founder, Booloon Inc.
The Artificial Intelligence field as we know it is not exactly a scientific field. The fundamental questions critical to this field remain
unanswered, and there is much confusion about what problems it should
solve. In this talk, I will provide an overview analysis of these
problems and present my solutions.
My approach starts at the philosophical level. Through analyzing
and dividing the problems, I believe the solutions are surprisingly simple.
My current focus is on language understanding. I will introduce my method of
language and knowledge structuring and its applications to document
processing and search methods.
Speaker Bio:
Qin Zhang is an author, inventor, and founder of the startup company Booloon, Inc. She is working on a book analyzing the human belief and knowledge system, currently titled "Age of Dismay? - Illustration of Modern Lives". She has four pending patent applications in the AI and language processing field:
Thinking System and Method; System and Method for Information Processing and
Motor Control; Search Method and System Using Thinking System; Content
Summarizing and Search Method and System. She also has a pending patent
application in Data Compression: Method and System of Compressing and
Decompressing Data. She is developing search methods using her inventions in
the AI field, and is currently working on document content summarization and
meaning-based keyword identification.
She received her JD from Southwestern University School of Law, MSME from University of Miami, and BSME from Tongji University (Shanghai, China). She is a licensed Patent Attorney and Professional Engineer.
Named Entity-Centred Information Processing
Dr. Tru H. Cao, Associate Professor, Ho Chi Minh City University of Technology,
Vietnam
Information on the Web is pervaded with references to real-world individuals, i.e., named entities. The content of a text is mainly defined by the keywords and named entities occurring in it.
Moreover, named entities have ontological features, namely, their aliases, types, and
identifiers, which are hidden behind their textual appearance but may be of concern to users. First,
this talk presents a hybrid statistical and rule-based incremental approach to recognize
ambiguous named entities. Second, it introduces a multi-vector space model to combine
keywords and named entity features for document searching and clustering. Third, it presents
a robust named entity-based method for translating natural language queries to conceptual
graphs.
Speaker Bio:
Tru Cao received his B.Eng. in Computer Science and Engineering from Ho Chi Minh City
University of Technology in 1990, M.Eng. in Computer Science from Asian Institute of
Technology in 1995, and Ph.D. in Computer Science from University of Queensland in 1999.
He then did postdoctoral research in the Artificial Intelligence Group at University of Bristol,
and the Berkeley Initiative in Soft Computing Group at UC Berkeley.
He served as Vice Dean (2002-2007) and Head of the Department of Computer Science
(2002-2006) in the Faculty of Computer Science and Engineering, HCMUT. He has received
HCMUT's annual Excellent Teaching Award every year since 2002.
His research interests include semantic web, soft computing, and conceptual structures. He
has published about 70 papers in international journals and conference proceedings in those
areas, and has been the chief investigator of the national key project on Vietnamese Semantic
Web. He is a member of the editorial board of the International Journal of Intelligent
Information and Database Systems, the steering committee of the Pacific Rim International
Conference on Artificial Intelligence, and the program committees of several international
conferences.
University of California, Los Angeles, Computer Science Department. 2008.