DB-UCLA Seminar

To subscribe to the db-ucla mailing list for seminar announcement, please visit this page

Time: 12:00n-1:00pm Fridays; Room: 4549 Boelter Hall

*To invite a guest speaker or to schedule a talk, contact Barzan Mozafari (barzan@cs)

Fall 2009
Date	Speaker	Title
09/25	Hamid Mousavi	Implementing Histograms over Data Streams using Stream Mill System
10/02	Chu-Cheng Hsieh	Extracting the Musician from the Wikipedia
10/09
10/16
10/23
10/30
11/06	Uri Schonfeld	Finding what You Needed, didn't Know Existed and Couldn't See
11/13	Michael Welch	Multimodal Video Indexing and Search
11/20
11/27

Spring 2009
Date	Speaker	Title
04/03	--	--
04/10	--	--
04/17	Uri Schonfeld	Sitemaps: Above and Beyond the Crawl of Duty
04/24	--	--
05/01	Kambiz Aghili and Xin Zhou	Teradata Query Optimizer
05/08	Amruta Joshi	Overview of Language Modeling Approaches
05/15	--	--
05/22	Wei Wang	New clustering approaches for mining salient patterns in high dimensional data
05/29	Antonio Magnaghi, Jennie Cochran-Chin, Jeremy Daw	Adconion
06/05	Michael Welch	Addressing Web Search Challenges Arising From Imprecise Queries and Content

Winter 2009
Date	Speaker	Title
01/09	Richard Sia	Challenges and Opportunities in Building Personalized Online Content Aggregators
01/16	Uri Schonfeld	Sitemaps and Crawling + experiences as an intern at Google
01/23	Carlo A. Curino	Panta Rhei: Database Evolution
01/30	Hamid Mousavi	A New Focused Crawler Based on Context Graphs
02/06	Chu-Cheng Hsieh	Identifying the product name and brand name through the auction website
02/13	Amruta Joshi	Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions
02/20	Cancelled	--
02/27	Barzan Mozafari	Tired of words? How about nest words?
03/06	Nickolay Laptev and Vincenzo Russo	StreamMill overview and demo
03/13	Chu-Cheng Hsieh, Peter Peterson, Uri Schonfeld, and Michael Welch	WSDM2009 roundup

Challenges and Opportunities in Building Personalized Online Content Aggregators

Abstract:
The emergence of "Web 2.0" services has attracted a large number of users to publish content on the Web as blogs, social bookmarks, customer reviews, and wiki articles. A recent study shows that due to the explosion of this "user-generated" content, the amount of new data on the Web is growing rapidly, at a rate estimated to be four to five times higher than what was believed before. To help users better keep up with the continuous stream of new content, many Web 2.0 sites provide RSS feeds that return recently updated materials based on a user's request. Since many users often "subscribe" to a large number of RSS often in the order of thousands, there is a need for a new online service that helps users manage their growing subscription lists and the constant stream of new content. In this dissertation, we study the challenges and opportunities in building a large-scale personalized online content aggregator. In particular, we address the following three challenges in building such a system:

Panta Rhei: Database Evolution

Abstract:
Panta Rhei---everything is in a state of flux. Evolution impacts Information Systems (IS) at many different levels, we focus on the problem of evolving the data management core of an IS. We present an ongoing activity of collecting and analyzing a large number of evolution histories of real-life IS. The results of this analysis confirm the initial hunch about the severity of the problem, and provide the foundations of a benchmark for tools supporting schema evolution.

This work served as a motivation and invaluable test-bed for the design and development of the Panta Rhei Framework, which provides support for: (i) assisted and predictable evolution design, (ii) data migration scripts generation, (iii) automatic query (and update) rewriting, (iv) transaction-time archiving and querying of databases under schema evolution, and (v) automatic documentation of metadata histories.

The presentation will focus on two components of the Panta Rhei Framework: PRISM and PRIMA. PRISM achieves (i)-(iii) by harnessing recent theoretical advances in schema mapping and query rewriting in a concrete design that provides practitioners with the needed support for graceful schema evolution. PRIMA supports complex temporal queries on transaction time databases under schema evolution. By decoupling the logical and physical layers PRIMA provides the users with a rich temporal query interface (based on an XML temporal data model) while building on a reliable relational storage engine. The complexity introduced by schema evolution is completely masked to the user exploiting the rewriting technology we developed in PRISM. Temporal-specific query optimizations and a performance-oriented architecture complete the system.

Speaker Bio:
Carlo A. Curino received a Bachelor in Computer Science at Politecnico di Milano. He participated to a joint project between University of Illinois at Chicago (UIC) and Politecnico di Milano, obtaining a Master Degree in Computer Science in UIC (GPA 4/4) and the Laurea Specialistica (110/100 cum laude) in Politecnico di Milano. During the PhD at Politecnico di Milano (GPA 4/4), he spent over an year as a visiting researcher at University of California, Los Angeles (UCLA). His recent research interests includes: schema evolution, temporal databases, ontology-based data integration, and context-aware data filtering.

A New Focused Crawler Based on Context Graphs

Abstract:
Huge amount of the data across the World Wide Web has made the Web Search Engines (WSEs) one of the most important facilities on the Web. Most of the WSEs are crawler-based in which some Web pages are downloaded by crawlers with a specific policy. These pages are then indexed and used in the query answering process. A much newer topic in this area is to gather pages related to a predefined topic. This kinds of crawlers are called Focused Crawlers [1][2].

The easiest way to collect pages in a focused crawler is to start from some known URLs (called seed) and in a breadth first style download the pages and check which one is related to the focusing topic. For each downloaded page its URLs should be added to the URL repository so the approach can continue on. However, this would be an exhaustive task since most of the visiting pages are usually unrelated. [2]

A better idea is proposed in [1]. Informally, it tries to follow only links that may eventually lead us to some related pages. To this end, they construct a structure called context graph. A context graph tries to model relation of existing contexts in the Web based on their linkage structure. For example, consider that the pages of the context ‘CS department homepage’ may have links to those for the context ‘Networking Labs’ with probability 0.05. Also, consider that we are going to find related pages to the topic ‘Networking Labs’. Thus if at any stage of crawling we have found a page of the former context it will be a good choice to follow links of that page since with probability 0.05 it would lead us to a related page.

The biggest problem of their approach is that to learn the context graph, they use citation (in-links) information of the Google search engine and construct a chain of context with out considering the context of the pages. The learning stage is in a way that unrelated pages may be placed in the same category. To overcome this problem we started a work to learn a real context graph using both Google’s citation information and a changed Bayesian learning approach.

The key point in this new approach is the way it learns. Briefly, it manually initiates the first category which is the category that the related pages should match with. After that, for each newly found page, the approach tries to put it in one of the category that it best matches and refines the edges weight based on links of the page and also the links to that page. Finally, when enough number of pages is placed in categories the crawling stage will be started. In crawling stage we choose a URL from that category which first is not empty and second leads us to a related page with higher probability.

Main References:
[1] M. Diligenti, F. Coetzee, S. Lawrence, C. L. Giles, and M. Gori. “Focused Crawling using Context Graphs.” In VLDB-00, 2000.
[2] S. Chakrabarti, M. van den Berg, and B. Dom. Focused crawling: A new approach to topic-specific web resource discovery. Computer Networks, 31(11-16):1623–1640, 1999.

Identifying the product name and brand name through the auction website

Abstract:
In this presentation, I would like to introduce and summarize my research currently. My goal in this project is to extract the product identify information from the internet. Our observation indicates that the model names of certain types of product are unique, especially in electronic devices. There are two challenges here. First, we would like to identify the product model name instantly, which means that our approach should "grab" new product name once it come to the market. Secondly, the model should be easily generalized to any product with unique identifier, usually model name. We choose auction website, eBay, as our data source, and preliminary research shows a promising results

Speaker Bio:
Chu-Cheng Hsieh is currently a PhD Student working with Professor John Cho in University of California, Los Angeles.

Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions (paper link)

Abstract:
This paper presents an overview of the field of recommender systems and describes the current generation of recommendation methods that are usually classified into the following three main categories: content-based, collaborative, and hybrid recommendation approaches. This paper also describes various limitations of current recommendation methods and discusses possible extensions that can improve recommendation capabilities and make recommender systems applicable to an even broader range of applications. These extensions include, among others, an improvement of understanding of users and items, incorporation of the contextual information into the recommendation process, support for multcriteria ratings, and a provision of more flexible and less intrusive types of recommendations. Index Terms—Recommender systems, collaborative filtering, rating estimation methods, extensions to recommender systems.

Tired of words? How about nest words?

Abstract (from the PODS'07 paper Marrying Words and Trees by Rajeev Alur):
Traditionally, data that has both linear and hierarchical structure, such as annotated linguistic data, is modeled using ordered trees and queried using tree automata. In this paper, we argue that nested words and automata over nested words offer a better way to capture and process the dual structure. Nested words generalize both words and ordered trees, and allow both word and tree operations. We study various classes of automata over nested words, and show that while they enjoy expressiveness and succinctness benefits over word and tree automata, their analysis complexity and closure properties are analogous to the corresponding word and tree special cases. In particular, we show that finite-state nested word automata can be exponentially more succinct than tree automata, and pushdown nested word automata include the two incomparable classes of context-free word languages and context-free tree languages.

StreamMill overview and demo

Abstract:
A growing number of applications, including network traffic monitoring and highway congestion analysis, continuously generate massive data streams. Management of these streams introduces many new research challenges, which include Quality of Service (QoS) guarantees, window synopses, and I/O buffering. Therefore, many research projects have focused on building Data Stream Management Systems (DSMSs) to solve these challenges. However, all of these systems are limited to simple continuous queries over data streams, i.e., they do not support advanced applications, such as data stream mining. However, such advanced applications are critical for data stream processing. In particular, data stream mining is required in many real-world applications, such as web click-stream analysis, market basket data mining, and credit card fraud detection. The importance of data stream mining is further illustrated by the research projects focusing on devising fast & light algorithms for online mining. However, we note that deployment of such methods represents yet other challenges, in addition to devising fast & light algorithms. In particular data stream mining methods must be deployed with all DSMS essentials, including QoS, load shedding and synopses, already being provided for simpler applications. Thus, in this talk we give an overview of Stream Mill DSMS and show how it was extended into an online data mining workbench with the help of three novel techniques as follows:

(1) Enhancing the expressive power of DSMSs to support more advanced queries, such as online mining, sequence queries, etc., by extending their query language (namely SQL),

(2) Integrating a suite of online mining algorithms in the DSMS, including advanced techniques, such as ensemble based methods, and

(3) Supporting work-flows, whereby the analyst can specify the complete mining process. This stimulates ease-of-use, since all users can now simply invoke the predefined work-flows, as opposed to redefining from scratch. Of course the system allows advanced users to integrate new mining algorithms as mining models and flows. We also show that the resulting data stream mining workbench achieves performance and extensibility, which are unmatched even by static mining workbenches.

Finally we preset the latest demo of this workbench which now supports "drag and drop" interface to build workflows.

Sitemaps: Above and Beyond the Crawl of Duty

Abstract:
Comprehensive coverage of the public web is crucial to web search engines. Search engines use crawlers to retrieve pages and then discover new ones by extracting the pages' outgoing links. However, the set of pages reachable from the publicly linked web is estimated to be significantly smaller than the invisible web, the set of documents that have no incoming links and can only be retrieved through web applications and web forms. The Sitemaps protocol is a fast-growing web protocol supported jointly by major search engines to help content creators and search engines unlock this hidden data by making it available to search engines. In this paper, we perform a detailed study of how "classic" discovery crawling compares with Sitemaps, in key measures such as coverage and freshness over key representative websites as well as over billions of URLs seen at Google. We observe that Sitemaps and discovery crawling complement each other very well, and offer different tradeoffs.

Teradata Query Optimizer

Speakers:
Kambiz Aghili, Engineering Manager, Teradata Query Optimizer Group
Xin Zhou, Software Research Engineer, Teradata Advanced Development Group

Abstract:
In this talk, we will first briefly review Teradata Corporation, and introduce Teradata's Advanced Development and Enterprise Architecture Group. Afterwards, Kambiz Aghili will focus on Teradata's Query Optimizer, how we analyze query, make use of statistics we collect from base tables, and do cost-based query optimization.

Speaker Bios:
Kambiz Aghili: Kambiz got his PHD degree from CS Dept, UCSD, and joined Teradata Corporation in 2005. He is now the engineering manager of Teradata Query Optimizer group.

Xin Zhou: Xin got his PHD degree from CS Dept, UCLA, and joined Teradata Corporation in 2006. He is now a Software Research Engineer in Teradata Advanced Development and Enterprise Architecture group.

New clustering approaches for mining salient patterns in high dimensional data

Abstract:
The advances of new technologies have made data collection easier and faster, resulting in large and complex datasets consisting of hundreds of thousands of objects with hundreds of dimensions. Scalable and efficient unsupervised clustering methods have been the most popular approaches in analyzing these large datasets. Traditional clustering approaches typically partition objects into disjoint groups based on distances in full dimensional space. However, more often than not, some dimensions of high dimensional data may be irrelevant to a cluster and can mask the cluster’s existence. This phenomenon prevents salient structures from being discovered by traditional clustering approaches. We developed unsupervised clustering approaches to capture pattern-preserving clusters in the subspaces of high dimensional space. The proposed subspace clustering algorithms tackle the curse of dimensionality by localizing the search of clusters in the subspaces of the original high dimensional data. They go beyond the existing distance-based clustering criteria by revealing consistent patterns that can be far apart in distance.

Adconion

Abstract:
Adconion is a global top 10 independent ad network with offices in 7 countries and traffic from every corner of the world. The company business is to aggregate supply and demand to create a value added marketplace. Adconion’s value add consists of two major components. One is business oriented, which is creating novel aggregate demand on each side of the equation, thus creating a scale that drives interest from the participants. This involves both technology and business support. The other value add is that Adconion algorithms can provide a better performing marketplaces for the participants. The latter is the focus of this presentation. At 400M+ impressions per day there is over 1TB of uncompressed data per day to sift through. What we are focused on is reducing this set of data to that which is driving performance. The constraints of a smaller company we find to be a positive factor, we can’t ‘store everything’ we have to learn to mine for what is important and store answers. The optimization team processes this data and decomposes it into various forms to produce ad performance predictions that are used for runtime ad ranking decision making. The team also produces performance predictions that are used for advanced targeting and to guide the sales efforts. Pricing prediction models to guide our marketplace activity and yield management algorithms are our next targets. Market trends include realtime bidding, advanced data sharing that drives large scale data mining creating temporal niche markets (disturbances in the Force) and optimization techniques at scale. We will give a quick business overview on these topics to give context, then dive in on one of our stream processing and data mining algorithm approach, findings and directions.

Speaker Bios:
Antonio Magnaghi has his PhD from the University of Tokyo (Japan) and comes from Yahoo via the Search, Content Match and YRL teams. Prior to Y, he held a position as researcher at Fujitsu labs.

Jennie Cochran-Chin earned her BS and Masters in Comp Sci at MIT and her PhD in MechE from UCSD. Dr Cochran had worked on a wide range of systems including real time sound and image processing, human computer interactions, drag profile trajectory tracking in fluid flows and non model based extremum seeking for real time optimization.

Jeremy Daw is the CTO of Adconion and comes with a wealth of experience from Goto/Overture/Yahoo where he was Director of Engineering (Search, Content Match, Advertising Systems) for 7 years. His teams built the first search and advertising marketplace systems on the web meeting every scaling challenge along the way.

Addressing Web Search Challenges Arising From Imprecise Queries and Content

Abstract:
The World Wide Web contains information on a scale far beyond the capacity of manual organization methods. Web search engines, which help users find information based on keywords, have an immeasurable impact on the daily lives and productivity for millions of people throughout the world. Sifting through all of the pages on the Web to find the most relevant ones is an enormous task, often exacerbated by under-specification and ambiguity in both the queries and the data. Users frequently omit relevant context or submit multifaceted queries, and authors rarely provide categorizations for their content, expecting the search engine to tackle these issues automatically. Uncertainty leads to inherent difficulty for search systems to find the right information for a particular user and query.

We investigate these problems and propose techniques to effectively address a user's needs when faced with such uncertainty. The main challenges consist of (a) learning which queries may benefit from contextualization, (b) given a multifaceted query, determining the most likely user requirements for each subtopic and then selecting a diverse set of pages to satisfy the greatest number of users, and (c) disambiguating imprecise content to build indexes capable of matching queries at a higher semantic level.

Experiments show our techniques for the identifying queries that would benefit from contextualization achieve 94\% accuracy, and preliminary experiments suggest our approach to subtopic diversification performs well under established metrics.

Implementing Histograms over Data Streams using Stream Mill System

Abstract:
Stream Mill is a Data Stream Management System (DSMS), supporting Expressive Stream Languages (ESL) as a powerful query languages for data stream processing and also data stream mining. The language extends SQL into a Turing-Complete languages using User Defined Aggregates (UDAs). UDAs are good means to implement algorithms trying to process data streams. One of such algorithms is the one computes Histogram of the values in a data streams. Histograms are being used in a variety of applications such as query optimization, approximate query answering, Curve estimation, etc.

In this talk, I will start with a simple example of data stream processing to show that things are not so easy when you have restricted amount of resources (time and space). Then I will talk about implementing Histograms using UDAs in Stream Mill system. I cover first two types of Histograms out of 3 well-known type which are 1) Regular Histograms in which the intervals associated to bars of the histogram must have the same size, 2) Even Bar Histograms in which we are interested in finding an specific number of boundaries in a way that the number of items in each two consecutive boundaries is almost the same, and 3) Step-Wise Histogram which tries to estimate a curve with an step-wise function with a maximum number of steps (bars).

Extracting the musician from the Wikipedia

Web of object concept is a recently promoted in many conference, and the goal of the web of object is to provide a better WWW for machine but not human. Person entity identification is one of the major challenges in web of object. So far, many projects are running and growing, e.g. dbpedia, freebase, etc. In this talk, I will start with introducing some web of object projects, and then shift to talk about my work: extracting the person name from the Wikipedia, and describe how to apply this dataset to solve a problem inside Yahoo!, which is "to distinguish whether an entity is a musician or a music group". The first part of my talk will explain how to use snowball algorithm to identify the wiki entry, and the second part of my talk will explain the heuristic method we found useful to determine individual or group. By meta data provided on corresponding wikipage, I use an algorithm similar to snowball to generate the high quality candidate dataset. More than 3 millions of person name are extracted in first stage. Later, I use support vector machine to filter out the noise. The highest precision I got so far is around 86% while extract the musician, and the features are generated from many seed. . And finally we successfully provide a high-quality musicians and music groups which will be used on the production of Yahoo Audio.

Finding what You Needed, didn't Know Existed and Couldn't See

Search engines employ huge clusters in an attempt to find the documents they need to best answer their users' queries. Unfortunately, crawling and evaluating the "entire web" is unfeasible. Search engines can identify queries that can likely be improved but do not know how to efficiently and reliably find the pages that would best improve the results. Web sites, on the other hand, have cheap access to their content but do now know what the search engine's needs are. The solution we present in this paper is a new channel of two-way communication channel between the search engine and the web sites. This new channel makes discovering relevant pages more efficient than classic crawling methods.

Multimodal Video Indexing and Search

Video indexing presents a unique set of challenges to search engines. No perfect techniques exist to extract features from video, often resulting in incomplete or imprecise data. State of the art speaker-independent speech-to-text technology for even controlled vocabulary domains are still hampered by high word error rates. Audio classification and image recognition systems are probabilistic and limited by the expressiveness of low level features, typically requiring supervised classifiers trained to a specific domain. The limited and imprecise nature of video feature extraction leads to inherent difficulty matching keyword queries to content. In this talk we explore some of the opportunities for enhancing indexing and search over video data by approaching the problem from the content producer's side. Our system leverages features extracted from multiple content channels of a video, such as production scripts, speech-to-text (STT) transcriptions, and image analysis systems. By automatically associating metadata with videos at a finer granularity and without the common shortcomings intrinsic to manual tagging approaches, we enable the search engine to match queries with only the relevant segments of a video.