To subscribe to the db-ucla mailing list for seminar announcements, please visit this page

DB-UCLA Seminar

Time: 12:00pm-1:00pm Fridays; Room: 4549 Boelter Hall

*To invite a guest speaker or to schedule a talk, contact Alex Shkapsky (shkapsky at cs dot ucla dot edu)

Fall 2012
Date Speaker Title
09/28 Chu-Cheng Hsieh Buyer intention detection based on query transition graph analysis
10/05 Prof. Maurizio Atzori SWIPE: Searching Wikipedia by Example
10/12 Prof. Carlo Zaniolo
10/19 Mohan Yang
10/26
11/02
11/09
11/16
11/23
11/30 Yingyi Bu (UC Irvine) Pregelix: Think Like a Vertex, Scale like Spandex
12/07

Spring 2012
Date Speaker Title
04/06
04/13
04/20
04/27 Dr. Hady Lauw Synonyms and Sessions: Mining User-Generated Data for Improving Web Search
05/04 David Jurgens Analyzing Editor Activity on Wikipedia through Network Motif Analysis
05/11 Kai Zeng
05/18 Prof. Carlo Zaniolo The Design of Streamlog: a Logic-based Language for Data Streams
05/25 Prof. Murali Mani Algebraic Manipulation of Encrypted Databases
06/01 Chu-Cheng Hsieh Experts vs. The Crowd: Examining News Prediction and User Behavior on Twitter
06/08
06/15

Winter 2012
Date Speaker Title
01/06
01/13
01/20
01/27
02/03
02/10 Yuchen Liu Unsupervised Transactional Query Classification Based on Webpage Form Understanding
02/17 Hamid Mousavi A Graph-based Framework for Text Mining
02/24 Young Cha Summary of Collaborative Filtering Methods
03/02
03/09
03/16

Pregelix: Think Like a Vertex, Scale like Spandex

Speaker:
Yingyi Bu

Abstract:
Demand for analyzing Big Graph Data keeps growing. For example, the World Wide Web has expanded to billions of web pages and hyperlinks, major social network sites such as Facebook, LinkedIn, and Twitter each maintain a rapidly growing, gigantic social graph, and biologists assemble genomes from huge de Bruijn graphs. Analyzing such Big Graphs requires a system that can not only scale out to hundreds or thousands of machines, but also perform the computation efficiently. In this talk, I will introduce the Pregelix system, which supports easy programming and scales to large commodity machine clusters. I will first illustrate the programming model: application programmers need zero knowledge of the parallel/distributed system; they just "think like a vertex" and write a couple of functions that encapsulate the logic for what one graph vertex does. Then, I will walk through a few examples built on top of Pregelix, such as PageRank and connected components. After that, I will detail the internals of Pregelix, including the system architecture, the scalable dataflow runtime, the execution strategies, the caching, and the out-of-core support. Finally, I will present our performance numbers and conclude the talk. (Truth in lending disclosure: the programming model and API were shamelessly borrowed from Google's Pregel graph analytics platform, hence the name :-))
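
As a rough illustration of the "think like a vertex" model, the sketch below runs PageRank under a synchronous superstep loop in plain Python. The real Pregelix API is Java-based; the function signature, message format, and driver loop here are illustrative assumptions, not Pregelix code.

    # Vertex-centric PageRank: the application writes only the compute
    # function; a toy framework loop handles supersteps and messages.

    DAMPING = 0.85
    NUM_SUPERSTEPS = 30

    def pagerank_compute(value, incoming, out_edges, superstep, num_vertices):
        """One vertex's logic for one superstep: fold incoming messages
        into a new rank, then send rank shares along out-edges."""
        if superstep > 0:
            value = (1 - DAMPING) / num_vertices + DAMPING * sum(incoming)
        share = value / len(out_edges) if out_edges else 0.0
        return value, [(dst, share) for dst in out_edges]

    def run(graph):
        """Synchronous superstep driver the framework would provide."""
        n = len(graph)
        values = {v: 1.0 / n for v in graph}
        messages = {v: [] for v in graph}
        for step in range(NUM_SUPERSTEPS):
            outbox = {v: [] for v in graph}
            for v, edges in graph.items():
                values[v], sent = pagerank_compute(values[v], messages[v],
                                                   edges, step, n)
                for dst, msg in sent:
                    outbox[dst].append(msg)
            messages = outbox   # barrier: messages visible next superstep
        return values

    if __name__ == "__main__":
        g = {"a": ["b"], "b": ["c"], "c": ["a", "b"]}   # tiny adjacency lists
        print(run(g))

The point to notice is the division of labor: the application supplies only pagerank_compute, while the framework owns partitioning, message delivery, and the superstep barrier.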


SWIPE: Searching Wikipedia by Example

Speaker:
Prof. Maurizio Atzori

Abstract:
A novel method is demonstrated that allows semantic, well-structured knowledge bases (such as DBpedia) to be queried directly from Wikipedia's pages. Using Swipe, naive users with no knowledge of RDF triples or SPARQL can easily pose powerful questions to DBpedia, such as "Who are the U.S. presidents who took office when they were 55 years old or younger, during the last 60 years?" or "Find the towns in California with fewer than 10 thousand people". This is accomplished by a novel Search by Example (SBE) approach, where a user enters query conditions directly on the Infobox of a Wikipedia page. Swipe activates the various fields of the Infobox to let users enter conditions, then uses those conditions to generate equivalent SPARQL queries and executes them on DBpedia. Finally, Swipe returns the query results in a form that is conducive to query refinement and further exploration. Swipe's SBE approach makes semi-structured documents queryable in an intuitive and user-friendly way and, through Wikipedia, delivers the benefits of querying and exploring large knowledge bases to all Web users. After an introduction to the Search by Example paradigm, the Swipe system will be shown through a live demo, with a discussion of the current state of our implementation and the ongoing efforts to extend it.
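
A minimal sketch of the by-example idea, assuming a simplified condition format: conditions entered on infobox fields are compiled into a single SPARQL query string. The property names and operators below are hypothetical stand-ins, not Swipe's actual mapping.

    # Compile (property, operator, value) conditions from infobox fields
    # into one SPARQL SELECT over the entity being described.

    def conditions_to_sparql(conditions):
        filters, patterns = [], []
        for i, (prop, op, value) in enumerate(conditions):
            var = f"?v{i}"
            patterns.append(f"?entity {prop} {var} .")
            filters.append(f"{var} {op} {value}")
        return ("SELECT DISTINCT ?entity WHERE {\n  "
                + "\n  ".join(patterns)
                + "\n  FILTER (" + " && ".join(filters) + ")\n}")

    if __name__ == "__main__":
        # "Towns in California with fewer than 10 thousand people",
        # using made-up DBpedia property names for illustration.
        print(conditions_to_sparql([
            ("dbo:populationTotal", "<", "10000"),
            ("dbo:isPartOf", "=", "dbr:California"),
        ]))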


Unsupervised Transactional Query Classification Based on Webpage Form Understanding

Speaker:
Yuchen Liu

Abstract:
Query type classification aims to classify search queries into categories such as navigational, informational, and transactional, according to the type of information need behind the queries. Although this problem has drawn much research attention, previous methods usually require editors to label queries as training data, or need domain knowledge to craft rules for predicting query type. Moreover, existing work has mainly focused on classifying informational and navigational queries; transactional query classification has not been well addressed. In this work, we propose an unsupervised approach to transactional query classification. The method is based on the observation that, after issuing a transactional query to a search engine, many users click through to result pages and then interact with Web forms on those pages. The interactions, e.g., typing in a text box, making selections from a dropdown list, or clicking a button to execute an action, are used to specify the details of the transaction. By mining toolbar search log data, which records the associations between queries and the Web forms users clicked, we obtain a set of good-quality transactional queries without any manual labeling effort. By matching these automatically acquired transactional queries against the contents of their associated Web forms, we generalize the queries into patterns, which can then classify queries not covered by the search log. Our experiments indicate that the transactional queries produced by this method have good quality, and the pattern-based classifier achieves an F1 score of 83%. This is very effective considering that we use no labeling effort to train the classifier.
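
The pattern step can be pictured with a minimal sketch: known transactional queries are generalized by replacing tokens that match a form's field values with slot names, and a new query is classified by whether it generalizes to a known pattern. The sample fields, slot names, and exact-match test below are illustrative assumptions.

    # Generalize queries against Web-form field values, then classify.

    def generalize(query, form_fields):
        """form_fields: dict mapping slot name -> set of known field values."""
        tokens = []
        for tok in query.lower().split():
            slot = next((name for name, vals in form_fields.items()
                         if tok in vals), None)
            tokens.append(f"<{slot}>" if slot else tok)
        return " ".join(tokens)

    def is_transactional(query, patterns, form_fields):
        return generalize(query, form_fields) in patterns

    if __name__ == "__main__":
        fields = {"city": {"boston", "chicago"}, "airline": {"delta", "united"}}
        mined = ["book delta flight to boston", "book united flight to chicago"]
        patterns = {generalize(q, fields) for q in mined}
        print(is_transactional("book united flight to boston", patterns, fields))

Here both mined queries collapse to the single pattern "book <airline> flight to <city>", which then covers combinations the search log never saw.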


A Graph-based Framework for Text Mining

Speaker:
Hamid Mousavi

Abstract:
I will introduce an NLP-based text mining framework that utilizes more than one parse tree per sentence. Depending on the application, the framework pushes the most time-consuming processes back to the Preprocessing Phase so that the Information Extraction Phase can be performed more quickly. The framework, called LQL, includes the following steps (a minimal pipeline skeleton follows the list).

  1. Partitioning the text into sentences
  2. Converting each sentence into several parse trees using a probabilistic parser
  3. Enriching the parse trees with mainPart information
  4. Generating textGraphs by applying a set of LQL rules/patterns to the mainPart-annotated parse trees
  5. Completing the textGraphs using patterns written in a language similar to SPARQL
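
A minimal skeleton of the five steps, with every function body a stand-in assumption; the real framework uses a probabilistic parser and its own LQL rule language.

    # Stand-in pipeline mirroring the five LQL steps above.

    def split_sentences(text):
        return [s.strip() for s in text.split(".") if s.strip()]  # step 1 (naive)

    def parse(sentence, k=3):
        return [f"(S {sentence})"] * k           # step 2: k parse-tree stand-ins

    def enrich_with_main_parts(tree):
        return {"tree": tree, "mainPart": tree}  # step 3: attach mainPart info

    def apply_lql_rules(annotated):
        return [("subj", "rel", "obj")]          # step 4: emit textGraph edges

    def complete_graph(edges):
        return set(edges)                        # step 5: close over patterns

    def run_pipeline(text):
        graph = set()
        for sent in split_sentences(text):                       # step 1
            for tree in parse(sent):                             # step 2
                annotated = enrich_with_main_parts(tree)         # step 3
                graph |= complete_graph(apply_lql_rules(annotated))  # steps 4-5
        return graph

    if __name__ == "__main__":
        print(run_pipeline("Rome is the capital of Italy. It hosts the Vatican."))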

Summary of Collaborative Filtering Methods

Speaker:
Young Cha

Abstract:
If you have shopped at Amazon, you may have encountered recommendations like 'People who bought this also bought...'. Netflix likewise suggests movies you may be interested in based on your viewing history. Collaborative filtering is the technique these services use to suggest interesting items to their users. Since the items can be anything from products to URLs to friends, its applications are limitless. Many innovative collaborative filtering methods were introduced during the Netflix Prize competition. In this talk, I'll first briefly introduce what collaborative filtering is and explain how it works. Then I'll categorize the various collaborative filtering methods into three groups and discuss how each group has evolved.
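
As a concrete taste of one family of methods, the sketch below implements a user-based neighborhood predictor: estimate a user's rating of an item as a similarity-weighted average of other users' ratings. The tiny ratings matrix is made up, and real systems add normalization, shrinkage, and better similarity estimates.

    # User-based collaborative filtering with cosine similarity.

    from math import sqrt

    def cosine(a, b):
        """Cosine similarity over the items both users rated."""
        common = set(a) & set(b)
        if not common:
            return 0.0
        dot = sum(a[i] * b[i] for i in common)
        na = sqrt(sum(a[i] ** 2 for i in common))
        nb = sqrt(sum(b[i] ** 2 for i in common))
        return dot / (na * nb)

    def predict(ratings, user, item):
        """Similarity-weighted average of neighbors' ratings for `item`."""
        num = den = 0.0
        for other, r in ratings.items():
            if other != user and item in r:
                s = cosine(ratings[user], r)
                num += s * r[item]
                den += abs(s)
        return num / den if den else None

    if __name__ == "__main__":
        ratings = {
            "ann": {"matrix": 5, "up": 3},
            "bob": {"matrix": 4, "up": 2, "heat": 5},
            "cat": {"matrix": 5, "heat": 4},
        }
        print(predict(ratings, "ann", "heat"))  # estimate Ann's rating of Heat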


Synonyms and Sessions: Mining User-Generated Data for Improving Web Search

Speaker:
Dr. Hady Lauw

Abstract:
Search today has evolved beyond keyword queries and the ten blue links to the most relevant documents. For one thing, users now expect more direct answers, even if they do not always come from Web pages. For another, users now conduct long-running search tasks across multiple sessions. These new expectations have changed the face of Web search. As a result, search pages now also retrieve answers from structured data sources and allow users to track and organize their personal search histories. While the core retrieval challenge remains the same, namely how to associate queries with information items (e.g., pages, database records), the new approaches are very different. Instead of relying on keyword co-occurrences in documents, the primary associations are now mined from user-generated data, in the form of search logs.

In this seminar, we will explore two problems where user-generated data are used intensively to support new search features. In the first part of the talk, we look at the problem of matching informal user queries to formal representations in structured databases. To meet real-time retrieval requirements, the solution involves generating synonyms for entities found in those databases to be matched against user queries. In the second part of the talk, we investigate the problem of organizing a user's search history into a set of coherent sessions, each containing queries and clicks pertaining to a particular task, even when the queries do not share textual or temporal similarity.
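
For the second problem, here is a minimal sketch of session grouping, assuming token overlap as the (admittedly crude) task signal; the actual work links queries even when they share no textual or temporal similarity.

    # Greedy assignment of queries to task sessions by Jaccard token overlap.

    def token_overlap(q1, q2):
        a, b = set(q1.lower().split()), set(q2.lower().split())
        return len(a & b) / len(a | b)

    def sessionize(queries, threshold=0.2):
        """Assign each query to the existing session it best matches,
        or start a new session if nothing is similar enough."""
        sessions = []
        for q in queries:
            best, best_sim = None, threshold
            for s in sessions:
                sim = max(token_overlap(q, prev) for prev in s)
                if sim >= best_sim:
                    best, best_sim = s, sim
            if best is None:
                sessions.append([q])
            else:
                best.append(q)
        return sessions

    if __name__ == "__main__":
        history = ["cheap flights paris", "paris hotels",
                   "python list sort", "sort dictionary python",
                   "flights paris december"]
        for s in sessionize(history):
            print(s)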


Analyzing Editor Activity on Wikipedia through Network Motif Analysis

Speaker:
David Jurgens

Abstract:
Wikipedia is a collaborative environment where editors often cooperate, or even battle, over the creation of new content. Most studies of user behavior in Wikipedia have focused on high-level trends in macro-variables, such as editing frequency, inter-user reverting, or page growth. We propose a new method for investigating editor behavior using micro-features that represent the interactions editors have with each other. To support this analysis, we construct a novel representation of Wikipedia's revision history as a temporal, bipartite network with multiple node and edge types for users and revisions. From this representation we identify significant author interactions as network motifs and show how the motif types capture important, diverse editing behaviors. In addition, I will present the results of two further experiments that use motif-based interactions. First, I will outline a page classification task for detecting combative behavior on pages and show that motifs provide a significant performance improvement. Second, I will show how motifs can serve as a basis for analyzing trends in the dynamics of editor behavior in order to explain Wikipedia's content growth.
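
A minimal sketch of the motif idea, assuming a simplified event schema: revisions become a time-ordered sequence of (editor, page, reverted-editor) events, and tiny interaction patterns such as mutual reverts are counted per page. Both the schema and the two motif names are illustrative assumptions.

    # Count simple revert motifs ("single" vs. tit-for-tat "mutual") from
    # a time-ordered revision log.

    from collections import Counter

    def count_revert_motifs(events):
        """events: time-ordered (editor, page, reverted_editor_or_None)."""
        motifs = Counter()
        last_revert = {}   # page -> (reverter, reverted)
        for editor, page, reverted in events:
            if reverted is None:
                continue
            prev = last_revert.get(page)
            if prev == (reverted, editor):
                motifs["mutual-revert"] += 1   # B reverted A, now A reverts B
            else:
                motifs["single-revert"] += 1
            last_revert[page] = (editor, reverted)
        return motifs

    if __name__ == "__main__":
        log = [("A", "p1", None), ("B", "p1", "A"),
               ("A", "p1", "B"), ("C", "p2", "A")]
        print(count_revert_motifs(log))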


The Design of Streamlog: a Logic-based Language for Data Streams

Speaker:
Professor Carlo Zaniolo

Abstract:
Data Stream Management Systems (DSMS) have attracted much interest from researchers, who have proposed extensions of relational database languages to query data streams. However, while relational query languages rely on solid logical foundations, a logic-based theory of DSMS languages and their unique computational model are long overdue. In this paper, we show that continuous queries can be analyzed using the familiar concepts of closed-world assumption and local stratification. This approach leads to the design of a language called Streamlog, which provides the features and constructs needed for writing queries and applications on data streams, while requiring only minimal changes to Datalog. Thus, Streamlog takes the query and application languages of DSMS to new levels of expressive power and removes unnecessary limitations that severely impair current commercial systems and research prototypes. Efficient implementations of Streamlog are based on combined techniques from Datalog and Prolog.
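
As a loose analogy, not Streamlog syntax, the sketch below expresses one non-blocking continuous query as a Python generator: emit a reading as soon as its value exceeds every earlier one, so answers flow without waiting for the stream to end. The "new maxima" rule itself is an invented example.

    # A non-blocking continuous query over an unbounded stream: results are
    # produced incrementally, never requiring the whole stream.

    def new_maxima(stream):
        """Continuously yield tuples whose value beats every earlier one."""
        best = float("-inf")
        for timestamp, value in stream:
            if value > best:
                best = value
                yield (timestamp, value)   # answer emitted immediately

    if __name__ == "__main__":
        readings = [(1, 10), (2, 7), (3, 12), (4, 12), (5, 20)]
        for t, v in new_maxima(iter(readings)):
            print(t, v)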

University of California, Los Angeles, Computer Science Department. 2012.
