A Restricted Visual Turing Test for Deep Scene and Event Understanding

Hang Qi1, Tianfu Wu1, Mun Wai Lee2, Song-Chun Zhu1

1Center for Vision, Cognition, Learning, and Autonomy, UCLA. 2Intelligent Automation, Inc.

Introduction

This project features a restricted visual Turing test (VTT) that evaluates a computer vision system's understanding of scenes and events in videos through storyline-based queries. We collected a long-duration, multi-camera video dataset. To perform the test, we built an integrated system consisting of a carefully designed architecture, various vision modules, a knowledge base, and a query engine.

Dataset

The total video length is about 93.5 hours. The number and placement of cameras were chosen carefully to capture events and activities from a wide range of angles and distances, including side views and bird's-eye views (cameras mounted on top of buildings). The camera views overlap moderately, as in typical surveillance settings. The videos capture common scenarios of daily life, incorporating rich spatial, temporal, and causal information.

Execution System

Our VTT framework will be released to facilitate broader research. New videos can be uploaded into our benchmark. To perform future VTT tests, researchers can plug their own modules into our framework and test with new configurations instead of reinventing the wheel.
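To illustrate the plug-in idea, here is a minimal sketch of how vision modules could feed a shared knowledge base that a query engine then answers from. All class and function names here (`KnowledgeBase`, `VisionModule`, `run_pipeline`, etc.) are hypothetical, not the released API.

```python
# Hypothetical sketch of a pluggable VTT pipeline: vision modules
# write facts into a shared knowledge base, which can then be queried.
# Names are illustrative assumptions, not the actual framework API.

class KnowledgeBase:
    """Stores facts produced by vision modules as simple tuples."""
    def __init__(self):
        self.facts = []

    def assert_fact(self, predicate, *args):
        self.facts.append((predicate, *args))

    def query(self, predicate):
        # Return the arguments of every fact matching the predicate.
        return [f[1:] for f in self.facts if f[0] == predicate]


class VisionModule:
    """Base class: plug in a new module by overriding `process`."""
    def process(self, video, kb):
        raise NotImplementedError


class PersonDetector(VisionModule):
    def process(self, video, kb):
        # A real module would run a detector on frames;
        # here we emit a stub fact for illustration.
        kb.assert_fact("person", "p1", video["camera"], video["time"])


def run_pipeline(modules, video):
    kb = KnowledgeBase()
    for m in modules:
        m.process(video, kb)
    return kb


kb = run_pipeline([PersonDetector()], {"camera": "cam3", "time": 12.0})
print(kb.query("person"))  # -> [('p1', 'cam3', 12.0)]
```

Swapping in a different configuration would amount to passing a different list of modules to `run_pipeline`, which is the spirit of the plug-and-test design described above.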

Paper

Hang Qi, Tianfu Wu, Mun Wai Lee, Song-Chun Zhu. A Restricted Visual Turing Test for Deep Scene and Event Understanding.

Download PDF    View on arXiv.org

@article{QiWuVTT2015,
    title         = {A Restricted Visual Turing Test for Deep Scene and Event Understanding},
    author        = {Hang Qi and Tianfu Wu and Mun Wai Lee and Song-Chun Zhu},
    year          = {2015},
    archivePrefix = {arXiv},
    eprint        = {1512.01715}
}

Resources


Dataset

A collection of videos with overlapping field-of-views.

 Coming Soon


Execution System

For experimenting with different configurations and modules.

 Coming Soon


Query Toolkit

For collecting queries and annotations on video data.

 Coming Soon

Acknowledgement

This work is supported by the DARPA MSEE and DARPA SIMPLEX projects.

We would like to thank Josh Walters and his colleagues at BAE Systems, the third-party collaborator who administered the test, and Alexander Grushin and his colleagues at I-A-I for their effort in testing the system. We also thank members of the VCLA Lab at UCLA who contributed perception algorithms from their published work to the baseline test.