To subscribe to the db-ucla mailing list for seminar announcement, please visit this page
Time: 12:00n-1:00pm Fridays; Room: 4549 Boelter Hall
*To invite a guest speaker or to schedule a talk, contact Barzan Mozafari (barzan@cs)
Challenges and Opportunities in Building Personalized Online Content Aggregators
The emergence of "Web 2.0" services has attracted a large number of users to publish content on the Web as blogs, social bookmarks, customer reviews, and wiki articles. A recent study shows that due to the explosion of this "user-generated" content, the amount of new data on the Web is growing rapidly, at a rate estimated to be four to five times higher than what was believed before. To help users better keep up with the continuous stream of new content, many Web 2.0 sites provide RSS feeds that return recently updated materials based on a user's request. Since many users often "subscribe" to a large number of RSS often in the order of thousands, there is a need for a new online service that helps users manage their growing subscription lists and the constant stream of new content.
In this dissertation, we study the challenges and opportunities in building a large-scale personalized online content aggregator. In particular, we address the following three challenges in building such a system:
- A significant portion of user-generated content is updated frequently, often several times a day, and is related to current world events whose significance deteriorates rapidly over time. We propose an effective RSS-feed retrieval algorithm based on the updating pattern of the feeds and the access pattern of the users. Our algorithm helps the system deliver fresh content to the users in a timely manner even in a resource-constrained setting.
- In order to help users navigate the continuously updated new content, it is important to provide a service that can prioritize and recommend what the user is most likely to be interested in based on the user's personal interest. We propose a learning framework that efficiently captures a user's interest through an interactive process. We also develop an efficient approach to computing such "personalized recommendations" that can scale well to a large number of users without putting an unreasonable strain on computing resources.
- The data collected from such a system over time, often in the form of social annotations, involves lots of human effort in matching the best descriptive keywords with the corresponding Web resources. As an example of how this rich body of data can be mined to help users, we analyze its evolution over time and propose methods that discover the behavioral properties of evolving vocabulary usage. We illustrate how these properties can be used to help users select the best keywords in the online advertising context.
Panta Rhei: Database Evolution
Carlo A. Curino
Panta Rhei---everything is in a state of flux. Evolution impacts Information Systems (IS) at many different levels, we focus on the problem of evolving the data management core of an IS. We present an ongoing activity of collecting and analyzing a large number of evolution histories of real-life IS. The results of this analysis confirm the initial hunch about the severity of the problem, and provide the foundations of a benchmark for tools supporting schema evolution.
This work served as a motivation and invaluable test-bed for the design and development of the Panta Rhei Framework, which provides support for: (i) assisted and predictable evolution design, (ii) data migration scripts generation, (iii) automatic query (and update) rewriting, (iv) transaction-time archiving and querying of databases under schema evolution, and (v) automatic documentation of metadata histories.
The presentation will focus on two components of the Panta Rhei Framework: PRISM and PRIMA. PRISM achieves (i)-(iii) by harnessing recent theoretical advances in schema mapping and query rewriting in a concrete design that provides practitioners with the needed support for graceful schema evolution. PRIMA supports complex temporal queries on transaction time databases under schema evolution. By decoupling the logical and physical layers PRIMA provides the users with a rich temporal query interface (based on an XML temporal data model) while building on a reliable relational storage engine. The complexity introduced by schema evolution is completely masked to the user exploiting the rewriting technology we developed in PRISM. Temporal-specific query optimizations and a performance-oriented architecture complete the system.
Carlo A. Curino received a Bachelor in Computer Science at Politecnico di Milano. He participated to a joint project between University of Illinois at Chicago (UIC) and Politecnico di Milano, obtaining a Master Degree in Computer Science in UIC (GPA 4/4) and the Laurea Specialistica (110/100 cum laude) in Politecnico di Milano. During the PhD at Politecnico di Milano (GPA 4/4), he spent over an year as a visiting researcher at University of California, Los Angeles (UCLA). His recent research interests includes: schema evolution, temporal databases, ontology-based data integration, and context-aware data filtering.
A New Focused Crawler Based on Context Graphs
Huge amount of the data across the World Wide Web has made the Web Search Engines (WSEs) one of the most important facilities on the Web. Most of the WSEs are crawler-based in which some Web pages are downloaded by crawlers with a specific policy. These pages are then indexed and used in the query answering process. A much newer topic in this area is to gather pages related to a predefined topic. This kinds of crawlers are called Focused Crawlers .
The easiest way to collect pages in a focused crawler is to start from some known URLs (called seed) and in a breadth first style download the pages and check which one is related to the focusing topic. For each downloaded page its URLs should be added to the URL repository so the approach can continue on. However, this would be an exhaustive task since most of the visiting pages are usually unrelated. 
A better idea is proposed in . Informally, it tries to follow only links that may eventually lead us to some related pages. To this end, they construct a structure called context graph. A context graph tries to model relation of existing contexts in the Web based on their linkage structure. For example, consider that the pages of the context ‘CS department homepage’ may have links to those for the context ‘Networking Labs’ with probability 0.05. Also, consider that we are going to find related pages to the topic ‘Networking Labs’. Thus if at any stage of crawling we have found a page of the former context it will be a good choice to follow links of that page since with probability 0.05 it would lead us to a related page.
The biggest problem of their approach is that to learn the context graph, they use citation (in-links) information of the Google search engine and construct a chain of context with out considering the context of the pages. The learning stage is in a way that unrelated pages may be placed in the same category. To overcome this problem we started a work to learn a real context graph using both Google’s citation information and a changed Bayesian learning approach.
The key point in this new approach is the way it learns. Briefly, it manually initiates the first category which is the category that the related pages should match with. After that, for each newly found page, the approach tries to put it in one of the category that it best matches and refines the edges weight based on links of the page and also the links to that page. Finally, when enough number of pages is placed in categories the crawling stage will be started. In crawling stage we choose a URL from that category which first is not empty and second leads us to a related page with higher probability.
 M. Diligenti, F. Coetzee, S. Lawrence, C. L. Giles, and M. Gori. “Focused Crawling using Context Graphs.” In VLDB-00, 2000.
 S. Chakrabarti, M. van den Berg, and B. Dom. Focused crawling: A new approach to topic-specific web resource discovery. Computer Networks, 31(11-16):1623–1640, 1999.
Identifying the product name and brand name through the auction website
In this presentation, I would like to introduce and summarize my research
currently. My goal in this project is to extract the product identify
information from the internet. Our observation indicates that the model
names of certain types of product are unique, especially in electronic
devices. There are two challenges here. First, we would like to identify the
product model name instantly, which means that our approach should "grab"
new product name once it come to the market. Secondly, the model should be
easily generalized to any product with unique identifier, usually model
name. We choose auction website, eBay, as our data source, and preliminary
research shows a promising results
Chu-Cheng Hsieh is currently a PhD Student working with Professor John Cho in University of California, Los Angeles.
Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions (paper link)
This paper presents an overview of the field of recommender systems
and describes the current generation of recommendation methods that are
usually classified into the following three main categories: content-based,
collaborative, and hybrid recommendation approaches. This paper also
describes various limitations of current recommendation methods and
discusses possible extensions that can improve recommendation capabilities
and make recommender systems applicable to an even broader range of
applications. These extensions include, among others, an improvement of
understanding of users and items, incorporation of the contextual
information into the recommendation process, support for multcriteria
ratings, and a provision of more flexible and less intrusive types of
recommendations. Index Terms—Recommender systems, collaborative filtering,
rating estimation methods, extensions to recommender systems.
Tired of words? How about nest words?
Abstract (from the PODS'07 paper Marrying Words and Trees by Rajeev Alur):
Traditionally, data that has both linear and hierarchical
structure, such as annotated linguistic data, is modeled using
ordered trees and queried using tree automata. In this
paper, we argue that nested words and automata over nested
words offer a better way to capture and process the dual
structure. Nested words generalize both words and ordered
trees, and allow both word and tree operations. We study
various classes of automata over nested words, and show
that while they enjoy expressiveness and succinctness benefits
over word and tree automata, their analysis complexity
and closure properties are analogous to the corresponding
word and tree special cases. In particular, we show that
finite-state nested word automata can be exponentially more
succinct than tree automata, and pushdown nested word automata
include the two incomparable classes of context-free
word languages and context-free tree languages.
StreamMill overview and demo
Nickolay Laptev and Vincenzo Russo
A growing number of applications, including network traffic monitoring and highway congestion analysis, continuously generate massive data streams. Management of these streams introduces many new research challenges, which include Quality of Service (QoS) guarantees, window synopses, and I/O buffering. Therefore, many research projects have focused on building Data Stream Management Systems (DSMSs) to solve these challenges. However, all of these systems are limited to simple continuous queries over data streams, i.e., they do not support advanced applications, such as data stream mining. However, such advanced applications are critical for data stream processing. In particular, data stream mining is required in many real-world applications, such as web click-stream analysis, market basket data mining, and credit card fraud detection. The importance of data stream mining is further illustrated by the research projects focusing on devising fast & light algorithms for online mining. However, we note that deployment of such methods represents yet other challenges, in addition to devising fast & light algorithms. In particular data stream mining methods must be deployed with all DSMS essentials, including QoS, load shedding and synopses, already being provided for simpler applications. Thus, in this talk we give an overview of Stream Mill DSMS and show how it was extended into an online data mining workbench with the help of three novel techniques as follows:
(1) Enhancing the expressive power of DSMSs to support more advanced queries, such as online mining, sequence queries, etc., by extending their query language (namely SQL),
(2) Integrating a suite of online mining algorithms in the DSMS, including advanced techniques, such as ensemble based methods, and
(3) Supporting work-flows, whereby the analyst can specify the complete mining process. This stimulates ease-of-use, since all users can now simply invoke the predefined work-flows, as opposed to redefining from scratch. Of course the system allows advanced users to integrate new mining algorithms as mining models and flows.
We also show that the resulting data stream mining workbench achieves performance and extensibility, which are unmatched even by static mining workbenches.
Finally we preset the latest demo of this workbench which now supports "drag and drop" interface to build workflows.
Sitemaps: Above and Beyond the Crawl of Duty
Uri Schonfeld (co-authored with Narayanan Shivakumar)
Comprehensive coverage of the public web is crucial to web search
engines. Search engines use crawlers to retrieve pages and then discover new
ones by extracting the pages' outgoing links. However, the set of pages
reachable from the publicly linked web is estimated to be significantly
smaller than the invisible web, the set of documents that have no incoming
links and can only be retrieved through web applications and web forms. The
Sitemaps protocol is a fast-growing web protocol supported jointly by major
search engines to help content creators and search engines unlock this hidden
data by making it available to search engines. In this paper, we perform a
detailed study of how "classic" discovery crawling compares with Sitemaps, in
key measures such as coverage and freshness over key representative websites
as well as over billions of URLs seen at Google. We observe that Sitemaps and
discovery crawling complement each other very well, and offer different
Teradata Query Optimizer
Kambiz Aghili, Engineering Manager, Teradata Query Optimizer Group
Xin Zhou, Software Research Engineer, Teradata Advanced Development Group
In this talk, we will first briefly review Teradata
Corporation, and introduce Teradata's Advanced Development and
Enterprise Architecture Group. Afterwards, Kambiz Aghili will focus on
Teradata's Query Optimizer, how we analyze query, make use of statistics
we collect from base tables, and do cost-based query optimization.
Kambiz Aghili: Kambiz got his PHD degree from CS Dept, UCSD, and joined
Teradata Corporation in 2005. He is now the engineering manager of
Teradata Query Optimizer group.
Xin Zhou: Xin got his PHD degree from CS Dept, UCLA, and joined Teradata
Corporation in 2006. He is now a Software Research Engineer in Teradata
Advanced Development and Enterprise Architecture group.
New clustering approaches for mining salient patterns in high dimensional data
Wei Wang, University of North Carolina, Chapel Hill
The advances of new technologies have made data collection easier and faster, resulting in large and complex datasets consisting of hundreds of thousands of objects with hundreds of dimensions. Scalable and efficient unsupervised clustering methods have been the most popular approaches in analyzing these large datasets. Traditional clustering approaches typically partition objects into disjoint groups based on distances in full dimensional space. However, more often than not, some dimensions of high dimensional data may be irrelevant to a cluster and can mask the cluster’s existence. This phenomenon prevents salient structures from being discovered by traditional clustering approaches. We developed unsupervised clustering approaches to capture pattern-preserving clusters in the subspaces of high dimensional space. The proposed subspace clustering algorithms tackle the curse of dimensionality by localizing the search of clusters in the subspaces of the original high dimensional data. They go beyond the existing distance-based clustering criteria by revealing consistent patterns that can be far apart in distance.
Antonio Magnaghi, Jennie Cochran-Chin, Jeremy Daw
Adconion is a global top 10 independent ad network with offices in 7
countries and traffic from every corner of the world. The company business
is to aggregate supply and demand to create a value added marketplace.
Adconion’s value add consists of two major components. One is business
oriented, which is creating novel aggregate demand on each side of the
equation, thus creating a scale that drives interest from the participants.
This involves both technology and business support. The other value add is
that Adconion algorithms can provide a better performing marketplaces for
the participants. The latter is the focus of this presentation.
At 400M+ impressions per day there is over 1TB of uncompressed data per day
to sift through. What we are focused on is reducing this set of data to that
which is driving performance. The constraints of a smaller company we find
to be a positive factor, we can’t ‘store everything’ we have to learn to
mine for what is important and store answers.
The optimization team processes this data and decomposes it into various
forms to produce ad performance predictions that are used for runtime ad
ranking decision making. The team also produces performance predictions that
are used for advanced targeting and to guide the sales efforts. Pricing
prediction models to guide our marketplace activity and yield management
algorithms are our next targets.
Market trends include realtime bidding, advanced data sharing that drives
large scale data mining creating temporal niche markets (disturbances in the
Force) and optimization techniques at scale.
We will give a quick business overview on these topics to give context, then
dive in on one of our stream processing and data mining algorithm approach,
findings and directions.
Antonio Magnaghi has his PhD from the University of Tokyo (Japan) and comes
from Yahoo via the Search, Content Match and YRL teams. Prior to Y, he held
a position as researcher at Fujitsu labs.
Jennie Cochran-Chin earned her BS and Masters in Comp Sci at MIT and her PhD
in MechE from UCSD. Dr Cochran had worked on a wide range of systems
including real time sound and image processing, human computer interactions,
drag profile trajectory tracking in fluid flows and non model based extremum
seeking for real time optimization.
Jeremy Daw is the CTO of Adconion and comes with a wealth of experience from
Goto/Overture/Yahoo where he was Director of Engineering (Search, Content
Match, Advertising Systems) for 7 years. His teams built the first search
and advertising marketplace systems on the web meeting every scaling
challenge along the way.
Addressing Web Search Challenges Arising From Imprecise Queries and Content
The World Wide Web contains information on a scale far beyond the capacity of manual organization methods. Web search engines, which help users find information based on keywords, have an immeasurable impact on the daily lives and productivity for millions of people throughout the world. Sifting through all of the pages on the Web to find the most relevant ones is an enormous task, often exacerbated by under-specification and ambiguity in both the queries and the data. Users frequently omit relevant context or submit multifaceted queries, and authors rarely provide categorizations for their content, expecting the search engine to tackle these issues automatically. Uncertainty leads to inherent difficulty for search systems to find the right information for a particular user and query.
We investigate these problems and propose techniques to effectively address a user's needs when faced with such uncertainty. The main challenges consist of (a) learning which queries may benefit from contextualization, (b) given a multifaceted query, determining the most likely user requirements for each subtopic and then selecting a diverse set of pages to satisfy the greatest number of users, and (c) disambiguating imprecise content to build indexes capable of matching queries at a higher semantic level.
Experiments show our techniques for the identifying queries that would benefit from contextualization achieve 94\% accuracy, and preliminary experiments suggest our approach to subtopic diversification performs well under established metrics.
Implementing Histograms over Data Streams using Stream Mill
Stream Mill is a Data Stream Management System (DSMS), supporting
Expressive Stream Languages (ESL) as a powerful query languages for
data stream processing and also data stream mining. The language
extends SQL into a Turing-Complete languages using User Defined
Aggregates (UDAs). UDAs are good means to implement algorithms trying
to process data streams. One of such algorithms is the one computes
Histogram of the values in a data streams. Histograms are being used
in a variety of applications such as query optimization, approximate
query answering, Curve estimation, etc.
In this talk, I will start with a simple example of data stream
processing to show that things are not so easy when you have
restricted amount of resources (time and space). Then I will talk
about implementing Histograms using UDAs in Stream Mill system. I
cover first two types of Histograms out of 3 well-known type which are
1) Regular Histograms in which the intervals associated to bars of the
histogram must have the same size, 2) Even Bar Histograms in which we
are interested in finding an specific number of boundaries in a way
that the number of items in each two consecutive boundaries is almost
the same, and 3) Step-Wise Histogram which tries to estimate a curve
with an step-wise function with a maximum number of steps (bars).
Extracting the musician from the Wikipedia
Web of object concept is a recently promoted in many conference, and the goal of the web of object is to provide a better WWW for
machine but not human. Person entity identification is one of the major challenges in web of object. So far, many projects are running
and growing, e.g. dbpedia, freebase, etc. In this talk, I will start with introducing some web of object projects, and then shift to
talk about my work: extracting the person name from the Wikipedia, and describe how to apply this dataset to solve a problem inside
Yahoo!, which is "to distinguish whether an entity is a musician or a music group". The first part of my talk will explain how to use
snowball algorithm to identify the wiki entry, and the second part of my talk will explain the heuristic method we found useful to
determine individual or group. By meta data provided on corresponding wikipage, I use an algorithm similar to snowball to generate the
high quality candidate dataset. More than 3 millions of person name are extracted in first stage. Later, I use support vector machine
to filter out the noise. The highest precision I got so far is around 86% while extract the musician, and the features are generated
from many seed. . And finally we successfully provide a high-quality musicians and music groups which will be used on the production of
Finding what You Needed, didn't Know Existed and Couldn't See
Search engines employ huge clusters in an attempt to find
the documents they need to best answer their users' queries.
Unfortunately, crawling and evaluating the "entire web" is
unfeasible. Search engines can identify queries that can
likely be improved but do not know how to efficiently and
reliably find the pages that would best improve the results.
Web sites, on the other hand, have cheap access to their
content but do now know what the search engine's needs
are. The solution we present in this paper is a new channel
of two-way communication channel between the search
engine and the web sites. This new channel makes discovering
relevant pages more efficient than classic crawling methods.
Multimodal Video Indexing and Search
Video indexing presents a unique set of challenges to search engines. No perfect techniques exist to extract features from video, often resulting in incomplete or imprecise data. State of the art speaker-independent speech-to-text technology for even controlled vocabulary domains are still hampered by high word error rates. Audio classification and image recognition systems are probabilistic and limited by the expressiveness of low level features, typically requiring supervised classifiers trained to a specific domain.
The limited and imprecise nature of video feature extraction leads to inherent difficulty matching keyword queries to content. In this talk we explore some of the opportunities for enhancing indexing and search over video data by approaching the problem from the content producer's side. Our system leverages features extracted from multiple content channels of a video, such as production scripts, speech-to-text (STT) transcriptions, and image analysis systems. By automatically associating metadata with videos at a finer granularity and without the common shortcomings intrinsic to manual tagging approaches, we enable the search engine to match queries with only the relevant segments of a video.
University of California, Los Angeles, Computer Science Department. 2009.