Siyuan Qi

PhD Candidate @ UCLA CS

About Me

I am a second-year Ph.D. candidate in the Computer Science Department at the University of California, Los Angeles. I am currently doing computer vision research in the Center for Vision, Cognition, Learning, and Autonomy, advised by Professor Song-Chun Zhu.

My research interests include Computer Vision, Machine Learning, and Cognitive Science.

We who cut mere stones must always be envisioning cathedrals.

Quarry worker's creed

News

2017
2017 Jul

ICCV 2017

Predicting Human Activities Using Stochastic Grammar.

[paper]

Conference  Computer Vision 

2017 Jun

IROS 2017

[Oral] Feeling the Force: Integrating Force and Pose for Fluent Discovery through Imitation Learning to Open Medicine Bottles.

[paper]

Oral  Conference  Robotics 

2017 Apr

VRLA Expo 2017

[Invited talk] I presented our work on "Examining Human Physical Judgments Across Virtual Gravity Fields" at VRLA Expo 2017.

Invited Talk  Virtual Reality  Cognitive Science 

2016
2016 Nov

IEEE Virtual Reality 2017

[Oral] The Martian: Examining Human Physical Judgments Across Virtual Gravity Fields.

Accepted to TVCG

[paper] [project] [demo]

Oral  Journal  Virtual Reality  Cognitive Science 

Projects


  • Human Activity Prediction

    This project aims to predict future human activities from partially observed RGB-D videos. Human activity prediction is generally difficult due to its non-Markovian property and the rich context between humans and their environments. We use a stochastic grammar model to capture the compositional/hierarchical structure of events, integrating human actions, objects, and their affordances (a toy grammar-prediction sketch follows this project list).

    Computer Vision  Robotics 
  • Indoor Scene Synthesis by Stochastic Grammar

    This project studies how to realistically synthesize indoor scene layouts using stochastic grammar. We present a novel human-centric method to sample 3D room layouts and synthesize photo-realistic images using physics-based rendering. We use object affordance and human activity planning to model indoor scenes, which contain functional grouping relations and supporting relations between furniture and objects. An attributed spatial And-Or graph (S-AOG) is proposed to model indoor scenes. The S-AOG is a stochastic context-sensitive grammar in which the terminal nodes are object entities, including the room, furniture, and supported objects (a toy And-Or graph sampling sketch also follows this list).

    Computer Vision  Computer Graphics 
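
To make the grammar-based prediction above concrete, here is a minimal, hypothetical Python sketch. It stands in for the paper's ST-AOG and Earley parser with a toy stochastic grammar over invented sub-activity labels, and approximates prefix-conditioned prediction by Monte Carlo sampling of complete derivations; all rules and probabilities are illustrative, not the learned model.

import random
from collections import Counter

# Toy stochastic grammar over sub-activities (labels and probabilities are
# hypothetical). Each nonterminal maps to (right-hand side, probability) pairs.
GRAMMAR = {
    "Event":   [(["Prepare", "Consume", "Cleanup"], 0.7),
                (["Prepare", "Cleanup"], 0.3)],
    "Prepare": [(["reach", "grasp", "move"], 0.6),
                (["reach", "grasp"], 0.4)],
    "Consume": [(["eat"], 0.5), (["drink"], 0.5)],
    "Cleanup": [(["place", "null"], 1.0)],
}

def sample(symbol):
    """Expand a symbol into a terminal sub-activity sequence by sampling rules."""
    if symbol not in GRAMMAR:                      # terminal sub-activity
        return [symbol]
    rules, probs = zip(*GRAMMAR[symbol])
    rhs = random.choices(rules, weights=probs)[0]
    return [t for s in rhs for t in sample(s)]

def predict_next(observed, n_samples=10000):
    """Estimate P(next sub-activity | observed prefix) by rejection sampling."""
    counts = Counter()
    for _ in range(n_samples):
        seq = sample("Event")
        if seq[:len(observed)] == observed and len(seq) > len(observed):
            counts[seq[len(observed)]] += 1
    total = sum(counts.values())
    return {act: c / total for act, c in counts.items()} if total else {}

print(predict_next(["reach", "grasp"]))  # roughly {'move': 0.6, 'eat': 0.14, 'drink': 0.14, 'place': 0.12}

The rejection-sampling step is only for readability; an Earley parser, as used in the paper, computes the same prefix-conditioned probabilities exactly and far more efficiently.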
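
Likewise, a minimal sketch of the And-Or graph sampling idea behind the scene-synthesis project: an And node keeps all of its children, an Or node picks one child by probability, and each sampled terminal object receives a simple attribute. The node names, branching probabilities, and uniform placement prior are invented for illustration; the actual S-AOG is learned from data, and layouts are sampled under human-centric costs rather than uniformly.

import random

# Toy And-Or graph ("and": keep all children; "or": choose one child by weight).
AOG = {
    "Bedroom":       ("and", ["SleepArea", "StorageArea"]),
    "SleepArea":     ("and", ["bed", "NightstandOpt"]),
    "NightstandOpt": ("or", [("nightstand", 0.8), ("none", 0.2)]),
    "StorageArea":   ("or", [("wardrobe", 0.6), ("dresser", 0.4)]),
}

def sample_objects(node):
    """Recursively expand an AOG node into a list of terminal object labels."""
    if node not in AOG:
        return [] if node == "none" else [node]
    kind, children = AOG[node]
    if kind == "and":
        return [obj for child in children for obj in sample_objects(child)]
    labels, probs = zip(*children)
    return sample_objects(random.choices(labels, weights=probs)[0])

def sample_layout(room_size=(4.0, 5.0)):
    """Attach a toy attribute (a uniformly random 2D position) to each object."""
    width, depth = room_size
    return {obj: (round(random.uniform(0, width), 2), round(random.uniform(0, depth), 2))
            for obj in sample_objects("Bedroom")}

print(sample_layout())  # e.g. {'bed': (1.3, 4.2), 'nightstand': (0.4, 2.1), 'wardrobe': (3.7, 0.9)}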

Publications


  • Predicting Human Activities Using Stochastic Grammar

    Siyuan Qi, Siyuan Huang, Ping Wei, Song-Chun Zhu.

    ICCV 2017, Venice, Italy

    Conference  Computer Vision 

  • Abstract
    This paper presents a novel method to predict future human activities from partially observed RGB-D videos. Human activity prediction is generally difficult due to its non-Markovian property and the rich context between humans and environments. We use a stochastic grammar model to capture the compositional structure of events, integrating human actions, objects, and their affordances. We represent the event by a spatial-temporal And-Or graph (ST-AOG). The ST-AOG is composed of a temporal stochastic grammar defined on sub-activities, and spatial graphs representing sub-activities that consist of human actions, objects, and their affordances. Future sub-activities are predicted using the temporal grammar and Earley parsing algorithm. The corresponding action, object, and affordance labels are then inferred accordingly. Extensive experiments are conducted to show the effectiveness of our model on both semantic event parsing and future activity prediction.
  • BibTeX
@inproceedings{qi2017predicting,
    title={Predicting Human Activities Using Stochastic Grammar},
    author={Qi, Siyuan and Huang, Siyuan and Wei, Ping and Zhu, Song-Chun},
    booktitle={International Conference on Computer Vision (ICCV)},
    year={2017}
}

  • Feeling the Force: Integrating Force and Pose for Fluent Discovery through Imitation Learning to Open Medicine Bottles

    Mark Edmonds*, Feng Gao*, Xu Xie, Hangxin Liu, Siyuan Qi, Yixin Zhu, Brandon Rothrock, Song-Chun Zhu.
    * equal contributors

    IROS 2017, Vancouver, Canada

    Conference  Robotics 

  • Abstract
    Learning complex robot manipulation policies for real-world objects is challenging, often requiring significant tuning within controlled environments. In this paper, we learn a manipulation model to execute tasks with multiple stages and variable structure, which typically are not suitable for most robot manipulation approaches. The model is learned from human demonstration using a tactile glove that measures both hand pose and contact forces. The tactile glove enables observation of visually latent changes in the scene, specifically the forces imposed to unlock the child-safety mechanisms of medicine bottles. From these observations, we learn an action planner through both a top-down stochastic grammar model (And-Or graph) to represent the compositional nature of the task sequence and a bottom-up discriminative model from the observed poses and forces. These two terms are combined during planning to select the next optimal action. We present a method for transferring this human-specific knowledge onto a robot platform and demonstrate that the robot can perform successful manipulations of unseen objects with similar task structure.
  • BibTeX
@inproceedings{edmonds2017feeling,
    title={Feeling the Force: Integrating Force and Pose for Fluent Discovery through Imitation Learning to Open Medicine Bottles},
    author={Edmonds, Mark and Gao, Feng and Xie, Xu and Liu, Hangxin and Qi, Siyuan and Zhu, Yixin and Rothrock, Brandon and Zhu, Song-Chun},
    booktitle={International Conference on Intelligent Robots and Systems (IROS)},
    year={2017}
}
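
The IROS abstract above combines a top-down grammar prior with a bottom-up discriminative model when selecting the next manipulation action. A minimal, hypothetical sketch of that combination follows; the transition probabilities, feature vector, and linear scorer are all invented stand-ins for the terms the paper learns from tactile-glove demonstrations.

import math

# Toy grammar prior P(next action | previous action); numbers are hypothetical.
GRAMMAR_PRIOR = {
    "grasp": {"push": 0.2, "twist": 0.7, "pull": 0.1},
    "push":  {"twist": 0.8, "pull": 0.2},
    "twist": {"pull": 0.9, "twist": 0.1},
}

def discriminative_score(action, features):
    """Stand-in for a learned bottom-up model: a fixed linear score squashed to (0, 1)."""
    weights = {"push": [0.2, 0.9], "twist": [0.8, 0.4], "pull": [0.5, 0.1]}
    z = sum(w * x for w, x in zip(weights[action], features))
    return 1.0 / (1.0 + math.exp(-z))

def next_action(prev_action, features):
    """Pick the action maximizing grammar prior times discriminative likelihood."""
    candidates = GRAMMAR_PRIOR[prev_action]
    return max(candidates, key=lambda a: candidates[a] * discriminative_score(a, features))

# Hypothetical observation, e.g. [normalized grip force, wrist rotation]:
print(next_action("grasp", [0.9, 0.3]))  # -> 'twist'
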
  • Configurable, Photorealistic Image Rendering and Ground Truth Synthesis by Sampling Stochastic Grammars Representing Indoor Scenes

    Chenfanfu Jiang*, Yixin Zhu*, Siyuan Qi*, Siyuan Huang*, Jenny Lin, Xingwen Guo, Lap-Fai Yu, Demetri Terzopoulos, Song-Chun Zhu.
    * equal contributors

    arXiv

    Computer Vision  Computer Graphics 

  • Abstract
    We propose the configurable rendering of massive quantities of photorealistic images with ground truth for the purposes of training, benchmarking, and diagnosing computer vision models. In contrast to the conventional (crowd-sourced) manual labeling of ground truth for a relatively modest number of RGB-D images captured by Kinect-like sensors, we devise a non-trivial configurable pipeline of algorithms capable of generating a potentially infinite variety of indoor scenes using a stochastic grammar, specifically, one represented by an attributed spatial And-Or graph. We employ physics-based rendering to synthesize photorealistic RGB images while automatically synthesizing detailed, per-pixel ground truth data, including visible surface depth and normal, object identity and material information, as well as illumination. Our pipeline is configurable inasmuch as it enables the precise customization and control of important attributes of the generated scenes. We demonstrate that our generated scenes achieve a performance similar to the NYU v2 Dataset on pre-trained deep learning models. By modifying pipeline components in a controllable manner, we furthermore provide diagnostics on common scene understanding tasks; e.g., depth and surface normal prediction, semantic segmentation, etc.
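
The pipeline above exports per-pixel depth, normals, and labels directly from the renderer. As a small, hypothetical companion, here is one common way to sanity-check such ground truth: recover normals from the depth channel by back-projecting pixels with the camera intrinsics and taking finite differences. The intrinsics and the flat test depth map below are made up.

import numpy as np

def depth_to_normals(depth, fx=500.0, fy=500.0, cx=320.0, cy=240.0):
    """Back-project an (H, W) depth map and estimate unit surface normals per pixel."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx          # 3D points in camera coordinates
    y = (v - cy) * depth / fy
    points = np.dstack([x, y, depth])
    du = np.gradient(points, axis=1)   # tangent vectors from image-space differences
    dv = np.gradient(points, axis=0)
    normals = np.cross(du, dv)
    norm = np.linalg.norm(normals, axis=2, keepdims=True)
    return normals / np.clip(norm, 1e-8, None)

fake_depth = np.full((480, 640), 2.0)          # a flat wall 2 m from the camera
normals = depth_to_normals(fake_depth)
print(normals.shape, normals[240, 320])        # (480, 640, 3), roughly (0, 0, 1)
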
  • [Oral] The Martian: Examining Human Physical Judgments Across Virtual Gravity Fields.

    Tian Ye*, Siyuan Qi*, James Kubricht, Yixin Zhu, Hongjing Lu, Song-Chun Zhu.
    * equal contributors

    IEEE VR 2017, Los Angeles, California
    Accepted to TVCG

    Oral  Journal  Virtual Reality  Cognitive Science 

  • Abstract
    This paper examines how humans adapt to novel physical situations with unknown gravitational acceleration in immersive virtual environments. We designed four virtual reality experiments with different tasks for participants to complete: strike a ball to hit a target, trigger a ball to hit a target, predict the landing location of a projectile, and estimate the flight duration of a projectile. The first two experiments compared human behavior in the virtual environment with real-world performance reported in the literature. The last two experiments aimed to test the human ability to adapt to novel gravity fields by measuring their performance in trajectory prediction and time estimation tasks. The experiment results show that: 1) based on brief observation of a projectile's initial trajectory, humans are accurate at predicting the landing location even under novel gravity fields, and 2) humans' time estimation in a familiar earth environment fluctuates around the ground truth flight duration, although the time estimation in unknown gravity fields indicates a bias toward earth's gravity.
  • BibTeX
@article{ye2017martian,
    title={The Martian: Examining Human Physical Judgments across Virtual Gravity Fields},
    author={Ye, Tian and Qi, Siyuan and Kubricht, James and Zhu, Yixin and Lu, Hongjing and Zhu, Song-Chun},
    journal={IEEE Transactions on Visualization and Computer Graphics},
    volume={23},
    number={4},
    pages={1399--1408},
    year={2017},
    publisher={IEEE}
}
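
As a companion to the abstract above, the quantity participants implicitly judge reduces to projectile motion under a gravitational acceleration g that need not be earth's. A minimal sketch of the underlying calculation (illustrative launch values; the experiments measure human judgments, not this formula):

import math

def landing(x0, y0, vx, vy, g):
    """Return (landing x, flight time) for a projectile launched from height y0."""
    # Solve y0 + vy*t - 0.5*g*t^2 = 0 for the positive root.
    t = (vy + math.sqrt(vy * vy + 2.0 * g * y0)) / g
    return x0 + vx * t, t

earth = landing(x0=0.0, y0=1.0, vx=3.0, vy=2.0, g=9.81)
mars = landing(x0=0.0, y0=1.0, vx=3.0, vy=2.0, g=3.71)  # same launch, Martian gravity
print(earth, mars)  # lower g gives a longer flight and a farther landing point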

More


  • First Class Honors, Faculty of Engineering, University of Hong Kong
    2013
  • Undergraduate Research Fellowship, University of Hong Kong
    2012
  • Kingboard Scholarship, University of Hong Kong
    2010 & 2011 & 2012
  • Dean's Honors List, University of Hong Kong
    2010 & 2011
  • AI Challenge (Sponsored by Google), 2nd place among Chinese contestants, 74th worldwide
    2011
  • Student Ambassador, University of Hong Kong
    2010
  • University Entrance Scholarship, University of Hong Kong
    2010