Topic discovery and story segmentation for broadcast news by Swendsen-Wang Cuts

Hang Qi, Weixin Li, and Song-Chun Zhu


Abstract

Topic discovery and story segmentation provide fundamental methods for automatically organizing, analyzing, searching, and visualizing the vast amount of news video available online. In this project, we present a topic discovery and story segmentation framework based on Swendsen-Wang Cuts. The framework divides news videos into stories and generates a topic hierarchy to organize these stories; the hierarchy can further be used for story topic inference, topic retrieval, etc.

In topic discovery, the hierarchy we designed contains two main levels: the event level and the category level. At the event level, stories are clustered mainly according to whether they describe the same event, e.g. "campaign", "hurricane", "bus crash", etc. At the next level up, the category level, these events are further clustered into categories, e.g. "politics", "disaster", "accident", etc. At each level, the optimal clustering is obtained using Swendsen-Wang Cuts, a reversible Markov Chain Monte Carlo algorithm for graph partitioning and labeling. In addition, each node in the topic hierarchy corresponds to a set of stories sharing one specific topic, which can be represented by a short summary extracted from these stories.

For new stories, inference is done using the obtained hierarchy, and the stories are then added to the hierarchy according to the inference result. The hierarchy is updated if no topic corresponding to a new story can be found in the current one.

Story segmentation can be treated as a one-dimensional clustering problem, since each sentence is connected only to its two neighboring sentences. Thus, in the proposed framework, we also solve the segmentation problem with Swendsen-Wang Cuts. Finally, it is worth mentioning that in our framework sentences are represented by triplets, with the goal of extracting the key words in these sentences. In topic discovery, Optical Character Recognition (OCR) is applied to the news videos to obtain the news captions, so that news topics are found using video and text jointly.
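To make the sampling step concrete, here is a minimal Python sketch of Swendsen-Wang Cuts with simulated annealing on a small graph. It is an illustration under stated assumptions, not the model used in this project. The Potts-style energy, the fixed number of labels with a uniform label proposal, the geometric temperature schedule, the similarity values, and the names `energy`, `components`, and `sw_cuts` are all hypothetical. The example graph is a chain, which matches the one-dimensional segmentation setting; passing a general edge list gives the clustering setting.

```python
import math
import random
from collections import defaultdict


def energy(labels, edges, sims):
    """Assumed Potts-style energy: cutting a similar pair costs s,
    joining a dissimilar pair costs 1 - s."""
    return sum((1.0 - s) if labels[i] == labels[j] else s
               for (i, j), s in zip(edges, sims))


def components(n, on_edges):
    """Connected components of the graph formed by the 'on' edges (union-find)."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in on_edges:
        parent[find(i)] = find(j)
    groups = defaultdict(list)
    for v in range(n):
        groups[find(v)].append(v)
    return list(groups.values())


def sw_cuts(n, edges, sims, num_labels, steps=3000,
            t_start=1.0, t_end=0.05, seed=0):
    """Return the lowest-energy labeling visited and its energy."""
    rng = random.Random(seed)
    # Edge "turn-on" probabilities, clipped so log(1 - q) stays finite.
    probs = [min(max(s, 0.01), 0.99) for s in sims]
    labels = [rng.randrange(num_labels) for _ in range(n)]
    cur_e = energy(labels, edges, sims)
    best, best_e = labels[:], cur_e
    for step in range(steps):
        # Geometric annealing schedule (an assumption of this sketch).
        temp = t_start * (t_end / t_start) ** (step / max(1, steps - 1))
        # 1. Turn on each same-label edge with probability q_e.
        on = [e for e, q in zip(edges, probs)
              if labels[e[0]] == labels[e[1]] and rng.random() < q]
        # 2. Pick one connected component uniformly at random.
        comp = set(rng.choice(components(n, on)))
        old = labels[next(iter(comp))]
        # 3. Propose a new label uniformly (symmetric, so it cancels).
        new = rng.randrange(num_labels)
        if new == old:
            continue
        # 4. Cut terms: edges from the component to the target cluster
        #    (numerator) and to the rest of its old cluster (denominator).
        log_ratio = 0.0
        for (i, j), q in zip(edges, probs):
            if (i in comp) == (j in comp):
                continue
            outside = j if i in comp else i
            if labels[outside] == new:
                log_ratio += math.log(1.0 - q)
            elif labels[outside] == old:
                log_ratio -= math.log(1.0 - q)
        proposal = [new if v in comp else lab for v, lab in enumerate(labels)]
        new_e = energy(proposal, edges, sims)
        log_ratio += (cur_e - new_e) / temp  # tempered posterior ratio
        # 5. Metropolis-Hastings acceptance.
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            labels, cur_e = proposal, new_e
            if cur_e < best_e:
                best, best_e = labels[:], cur_e
    return best, best_e


# Toy chain of 8 "sentences": edge k links sentence k and sentence k + 1.
sims = [0.9, 0.9, 0.1, 0.9, 0.9, 0.1, 0.9]  # made-up similarities
edges = [(k, k + 1) for k in range(len(sims))]
labels, e = sw_cuts(len(sims) + 1, edges, sims, num_labels=3)
boundaries = [k + 1 for k in range(len(sims)) if labels[k] != labels[k + 1]]
print(labels, boundaries, round(e, 3))
```

In this sketch a story boundary is any position where adjacent sentences receive different labels. Because a whole connected component is relabeled in one move, a run of strongly connected sentences can switch segments together instead of one sentence at a time.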

Figure 1: Proposed framework for topic discovery and story segmentation.

Results

Figure 2: A sample topic hierarchy. The node at the center is the root. From inside to outside, the nodes on the three circles around the root form the category level, the event level, and the story level (leaf nodes), respectively. The tags of the leaf nodes are the ground-truth labels of the stories. The tags of the nodes at the category and event levels are the topics we generate.

Figure 3: A sample story segmentation result. The upper-left part shows the energy curve during the sampling process. The upper-right part shows the change of temperature during this process. The lower part shows the resulting segmentation. The red dots mark the true story boundaries for one news video; the blue dots mark the boundaries we obtain.

Demo

In this demo, we show the topic hierarchy for 77 news stories broadcast between 2013-01-02 and 2013-01-24.

Publication

Coming soon...

Acknowledgments

This work is supported by the National Science Foundation CNS 1028381 (under the Cyber-enabled Discovery Initiative).