Siyuan Qi

PhD Candidate @ UCLA CS

About Me

I am a third-year Ph.D. candidate in the Computer Science Department at the University of California, Los Angeles. I do computer vision research in the Center for Vision, Cognition, Learning, and Autonomy, advised by Professor Song-Chun Zhu.

My research interests include Computer Vision, Machine Learning, and Cognitive Science.

We who cut mere stones must always be envisioning cathedrals.

Quarry worker's creed

News

2018
2018 Feb

One paper accepted at CVPR 2018

Human-centric Indoor Scene Synthesis Using Stochastic Grammar.

[paper] [supplementary] [code] [project]

Conference  Computer Vision  Computer Graphics 

2018 Jan

Two papers accepted at ICRA 2018

Intent-aware Multi-agent Reinforcement Learning.

[paper] [demo] [code]

Unsupervised Learning of Hierarchical Models for Hand-Object Interactions using Tactile Glove.

[paper] [demo] [project]

Conference  Robotics 

2017
2017 Jul

One paper accepted at ICCV 2017

Predicting Human Activities Using Stochastic Grammar.

[paper] [demo] [code]

Conference  Computer Vision 

2017 Jun

One paper accepted at IROS 2017

[Oral] Feeling the Force: Integrating Force and Pose for Fluent Discovery through Imitation Learning to Open Medicine Bottles.

[paper] [demo] [code] [project]

Oral  Conference  Robotics 

2017 Apr

Invited talk at VRLA Expo 2017

[Invited talk] I presented our work on "Examining Human Physical Judgments Across Virtual Gravity Fields" at VRLA Expo 2017.

Invited Talk  Virtual Reality  Cognitive Science 

2016
2016 Nov

One paper accepted at IEEE Virtual Reality 2017

[Oral] The Martian: Examining Human Physical Judgments Across Virtual Gravity Fields.

Accepted to TVCG

[paper] [demo] [project]

Oral  Journal  Virtual Reality  Cognitive Science 

Projects


  • Human Activity Prediction

    This project aims to predict future human activities from partially observed RGB-D videos. Human activity prediction is generally difficult due to its non-Markovian property and the rich context between humans and environments. We use a stochastic grammar model to capture the compositional/hierarchical structure of events, integrating human actions, objects, and their affordances.

    Computer Vision  Robotics 
  • Indoor Scene Synthesis by Stochastic Grammar

    This project studies how to realistically synthesize indoor scene layouts using a stochastic grammar. We present a novel human-centric method to sample 3D room layouts and synthesize photo-realistic images using physics-based rendering. We use object affordance and human activity planning to model indoor scenes, which contain functional grouping relations and supporting relations between furniture and objects. An attributed spatial And-Or graph (S-AOG) is proposed to model indoor scenes. The S-AOG is a stochastic context-sensitive grammar, in which the terminal nodes are object entities including rooms, furniture, and supported objects. A toy sampling sketch for this project appears right after this list.

    Computer Vision  Computer Graphics 
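
    As mentioned above, here is a toy sampling sketch for the scene synthesis project. It is hypothetical and heavily simplified: it runs Metropolis-Hastings over 2D furniture positions with a hand-made energy (a wall-affinity term plus a pairwise overlap penalty), whereas the actual S-AOG energy combines affordance, functional grouping, and support relations.

import math
import random

ROOM = 10.0       # toy assumption: square room, 10 m per side
N_OBJECTS = 4     # number of furniture pieces to place


def energy(layout):
    """Toy energy: reward placement near walls, penalize object overlap."""
    e = 0.0
    for i, (xi, yi) in enumerate(layout):
        # wall-affinity term: distance to the nearest wall
        e += min(xi, yi, ROOM - xi, ROOM - yi)
        for xj, yj in layout[i + 1:]:
            # proximity penalty between object pairs
            d = math.hypot(xi - xj, yi - yj)
            e += max(0.0, 2.0 - d) * 5.0
    return e


def metropolis_hastings(steps=20000, temperature=1.0):
    """Sample layouts from exp(-energy/T) with random-walk proposals."""
    layout = [(random.uniform(0, ROOM), random.uniform(0, ROOM))
              for _ in range(N_OBJECTS)]
    e = energy(layout)
    for _ in range(steps):
        k = random.randrange(N_OBJECTS)
        x, y = layout[k]
        proposal = (min(ROOM, max(0.0, x + random.gauss(0, 0.5))),
                    min(ROOM, max(0.0, y + random.gauss(0, 0.5))))
        candidate = layout[:k] + [proposal] + layout[k + 1:]
        e_new = energy(candidate)
        # accept with probability min(1, exp(-(E_new - E_old) / T))
        if random.random() < math.exp(-(e_new - e) / temperature):
            layout, e = candidate, e_new
    return layout


if __name__ == "__main__":
    print(metropolis_hastings())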

Publications


  • Human-centric Indoor Scene Synthesis Using Stochastic Grammar

    Siyuan Qi, Yixin Zhu, Siyuan Huang, Chenfanfu Jiang, Song-Chun Zhu.

    CVPR 2018, Salt Lake City, USA

    Conference  Computer Vision  Computer Graphics 

  • Abstract
    We present a human-centric method to sample and synthesize 3D room layouts and 2D images thereof, for the purpose of obtaining large-scale 2D/3D image data with perfect per-pixel ground truth. An attributed spatial And-Or graph (S-AOG) is proposed to represent indoor scenes. The S-AOG is a probabilistic grammar model, in which the terminal nodes are object entities including room, furniture, and supported objects. Human contexts as contextual relations are encoded by Markov Random Fields (MRF) on the terminal nodes. We learn the distributions from an indoor scene dataset and sample new layouts using Markov chain Monte Carlo. Experiments demonstrate that the proposed method can robustly sample a large variety of realistic room layouts based on three criteria: (i) visual realism compared to a state-of-the-art room arrangement method, (ii) accuracy of the affordance maps with respect to ground truth, and (iii) the functionality and naturalness of synthesized rooms as evaluated by human subjects.
@inproceedings{qi2018human,
    title={Human-centric Indoor Scene Synthesis Using Stochastic Grammar},
    author={Qi, Siyuan and Zhu, Yixin and Huang, Siyuan and Jiang, Chenfanfu and Zhu, Song-Chun},
    booktitle={Conference on Computer Vision and Pattern Recognition (CVPR)},
    year={2018}
}
  • Intent-aware Multi-agent Reinforcement Learning

    Siyuan Qi, Song-Chun Zhu.

    ICRA 2018, Brisbane, Australia

    Conference  Reinforcement Learning  Robotics 

  • Abstract
    This paper proposes an intent-aware multi-agent planning framework as well as a learning algorithm. Under this framework, an agent plans in the goal space to maximize the expected utility. The planning process takes beliefs about other agents' intents into consideration. Instead of formulating the learning problem as a partially observable Markov decision process (POMDP), we propose a simple but effective linear function approximation of the utility function. It is based on the observation that, for humans, other people's intents influence our utility for a goal. The proposed framework has several major advantages: (i) it is computationally feasible and guaranteed to converge; (ii) it can easily integrate existing intent prediction and low-level planning algorithms; (iii) it does not suffer from sparse feedback in the action space. We evaluate our algorithm on a real-world problem that is non-episodic and in which the number of agents and goals can vary over time. Our algorithm is trained in a scene in which aerial robots and humans interact, and tested in a novel scene with a different environment. Experimental results show that our algorithm achieves the best performance and that human-like behaviors emerge during the dynamic process.
@inproceedings{qi2018intent,
    title={Intent-aware Multi-agent Reinforcement Learning},
    author={Qi, Siyuan and Zhu, Song-Chun},
    booktitle={International Conference on Robotics and Automation (ICRA)},
    year={2018}
}
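
    To make the linear utility approximation above concrete, here is a minimal, hypothetical sketch (the feature design and weights are illustrative, not the paper's): each candidate goal is scored by a weight vector applied to features that include the agent's belief about other agents' intents, and the highest-utility goal is selected.

import numpy as np

def goal_features(goal, beliefs_about_others):
    """Hypothetical features: the goal's base reward plus, for each other
    agent, the believed probability that it pursues the same goal."""
    return np.array([goal["reward"]] +
                    [b.get(goal["name"], 0.0) for b in beliefs_about_others])

def utility(weights, goal, beliefs_about_others):
    # linear approximation: U(g) = w . phi(g, beliefs about others' intents)
    return float(weights @ goal_features(goal, beliefs_about_others))

def select_goal(weights, goals, beliefs_about_others):
    return max(goals, key=lambda g: utility(weights, g, beliefs_about_others))

# toy usage: two goals, one other agent believed to head for the "door"
goals = [{"name": "door", "reward": 1.0}, {"name": "window", "reward": 0.8}]
beliefs = [{"door": 0.9, "window": 0.1}]
w = np.array([1.0, -0.5])  # negative weight: avoid goals others already pursue
print(select_goal(w, goals, beliefs)["name"])  # -> "window"
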
  • Unsupervised Learning of Hierarchical Models for Hand-Object Interactions

    Xu Xie, Hangxin Liu, Mark Edmonds, Feng Gao, Siyuan Qi, Yixin Zhu, Brandon Rothrock, Song-Chun Zhu.

    ICRA 2018, Brisbane, Australia

    Conference  Robotics 

  • Abstract
    Contact forces of the hand are visually unobservable, but play a crucial role in understanding hand-object interactions. In this paper, we propose an unsupervised learning approach for manipulation event segmentation and manipulation event parsing. The proposed framework incorporates hand pose kinematics and contact forces using a low-cost easy-to-replicate tactile glove. We use a temporal grammar model to capture the hierarchical structure of events, integrating extracted force vectors from the raw sensory input of poses and forces. The temporal grammar is represented as a temporal And-Or graph (T-AOG), which can be induced in an unsupervised manner. We obtain the event labeling sequences by measuring the similarity between segments using the Dynamic Time Alignment Kernel (DTAK). Experimental results show that our method achieves high accuracy in manipulation event segmentation, recognition and parsing by utilizing both pose and force data.
@inproceedings{xu2018unsupervised,
    title={Unsupervised Learning of Hierarchical Models for Hand-Object Interactions},
    author={Xie, Xu and Liu, Hangxin and Edmonds, Mark and Gao, Feng and Qi, Siyuan and Zhu, Yixin and Rothrock, Brandon and Zhu, Song-Chun},
    booktitle={International Conference on Robotics and Automation (ICRA)},
    year={2018}
}
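
    The segment-similarity step can be sketched as follows. This is a hypothetical implementation of a commonly used DTAK recursion (a Gaussian frame kernel accumulated along an alignment path and normalized by the segment lengths); the exact variant and features used in the paper may differ.

import numpy as np

def dtak(X, Y, sigma=1.0):
    """Dynamic Time Alignment Kernel between two segments X (n x d) and
    Y (m x d); larger values indicate more similar segments."""
    n, m = len(X), len(Y)
    # frame-wise Gaussian kernel between all pairs of frames
    K = np.exp(-np.array([[np.sum((x - y) ** 2) for y in Y] for x in X])
               / (2.0 * sigma ** 2))
    U = np.zeros((n, m))
    U[0, 0] = 2.0 * K[0, 0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            candidates = []
            if i > 0:
                candidates.append(U[i - 1, j] + K[i, j])
            if j > 0:
                candidates.append(U[i, j - 1] + K[i, j])
            if i > 0 and j > 0:
                candidates.append(U[i - 1, j - 1] + 2.0 * K[i, j])
            U[i, j] = max(candidates)
    return U[n - 1, m - 1] / (n + m)

# toy usage: two short 1-D "force" segments with a similar shape
A = np.array([[0.0], [0.5], [1.0], [0.5]])
B = np.array([[0.0], [0.4], [1.1], [0.6], [0.5]])
print(dtak(A, B))
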
  • Predicting Human Activities Using Stochastic Grammar

    Siyuan Qi, Siyuan Huang, Ping Wei, Song-Chun Zhu.

    ICCV 2017, Venice, Italy

    Conference  Computer Vision 

  • Abstract
    This paper presents a novel method to predict future human activities from partially observed RGB-D videos. Human activity prediction is generally difficult due to its non-Markovian property and the rich context between humans and environments. We use a stochastic grammar model to capture the compositional structure of events, integrating human actions, objects, and their affordances. We represent the event by a spatial-temporal And-Or graph (ST-AOG). The ST-AOG is composed of a temporal stochastic grammar defined on sub-activities, and spatial graphs representing sub-activities that consist of human actions, objects, and their affordances. Future sub-activities are predicted using the temporal grammar and the Earley parsing algorithm. The corresponding action, object, and affordance labels are then inferred accordingly. Extensive experiments are conducted to show the effectiveness of our model on both semantic event parsing and future activity prediction.
@inproceedings{qi2017predicting,
    title={Predicting Human Activities Using Stochastic Grammar},
    author={Qi, Siyuan and Huang, Siyuan and Wei, Ping and Zhu, Song-Chun},
    booktitle={International Conference on Computer Vision (ICCV)},
    year={2017}
}
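
    The prediction step can be illustrated with a deliberately tiny example. The paper uses a temporal grammar with the Earley parser; the sketch below substitutes a brute-force enumeration over a hypothetical flat grammar of sub-activities, purely to show how an observed prefix yields a distribution over the next sub-activity.

from collections import defaultdict

# Hypothetical stochastic grammar: each production expands directly to a
# terminal sub-activity sequence with a probability (the real model is a
# hierarchical ST-AOG, not a flat list).
GRAMMAR = [
    (["reach", "grasp", "drink", "place"], 0.5),
    (["reach", "grasp", "pour", "place"], 0.3),
    (["reach", "move", "place"], 0.2),
]

def predict_next(observed):
    """Distribution over the next sub-activity given an observed prefix,
    by summing probabilities of all grammar strings consistent with it."""
    scores = defaultdict(float)
    for string, prob in GRAMMAR:
        if string[:len(observed)] == observed and len(string) > len(observed):
            scores[string[len(observed)]] += prob
    total = sum(scores.values())
    return {sym: p / total for sym, p in scores.items()} if total else {}

print(predict_next(["reach", "grasp"]))  # -> {'drink': 0.625, 'pour': 0.375}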

  • Feeling the Force: Integrating Force and Pose for Fluent Discovery through Imitation Learning to Open Medicine Bottles

    Mark Edmonds*, Feng Gao*, Xu Xie, Hangxin Liu, Siyuan Qi, Yixin Zhu, Brandon Rothrock, Song-Chun Zhu.
    * equal contributors

    IROS 2017, Vancouver, Canada

    Oral  Conference  Robotics 

  • Abstract
    Learning complex robot manipulation policies for real-world objects is challenging, often requiring significant tuning within controlled environments. In this paper, we learn a manipulation model to execute tasks with multiple stages and variable structure, which typically are not suitable for most robot manipulation approaches. The model is learned from human demonstration using a tactile glove that measures both hand pose and contact forces. The tactile glove enables observation of visually latent changes in the scene, specifically the forces imposed to unlock the child-safety mechanisms of medicine bottles. From these observations, we learn an action planner through both a top-down stochastic grammar model (And-Or graph) to represent the compositional nature of the task sequence and a bottom-up discriminative model from the observed poses and forces. These two terms are combined during planning to select the next optimal action. We present a method for transferring this human-specific knowledge onto a robot platform and demonstrate that the robot can perform successful manipulations of unseen objects with similar task structure.
@inproceedings{edmonds2017feeling,
    title={Feeling the Force: Integrating Force and Pose for Fluent Discovery through Imitation Learning to Open Medicine Bottles},
    author={Edmonds, Mark and Gao, Feng and Xie, Xu and Liu, Hangxin and Qi, Siyuan and Zhu, Yixin and Rothrock, Brandon and Zhu, Song-Chun},
    booktitle={International Conference on Intelligent Robots and Systems (IROS)},
    year={2017}
}
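
    The way the two models combine during planning can be sketched as a log-linear scoring rule. The code below is a hypothetical simplification: grammar_prob and discriminative_prob are placeholder callables standing in for the grammar (And-Or graph) prior and the learned pose/force classifier.

import math

def select_next_action(history, observation, actions,
                       grammar_prob, discriminative_prob, alpha=1.0):
    """Pick the action maximizing the combined top-down (grammar) and
    bottom-up (pose/force classifier) log-probabilities."""
    def score(a):
        return (math.log(grammar_prob(a, history) + 1e-9)
                + alpha * math.log(discriminative_prob(a, observation) + 1e-9))
    return max(actions, key=score)

# toy usage with hand-made probability tables
actions = ["push", "twist", "pull"]
grammar = lambda a, h: {"push": 0.2, "twist": 0.7, "pull": 0.1}[a]
classifier = lambda a, o: {"push": 0.3, "twist": 0.4, "pull": 0.3}[a]
print(select_next_action(["grasp"], None, actions, grammar, classifier))  # -> "twist"
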
  • Configurable, Photorealistic Image Rendering and Ground Truth Synthesis by Sampling Stochastic Grammars Representing Indoor Scenes

    Chenfanfu Jiang*, Siyuan Qi*, Yixin Zhu*, Siyuan Huang*, Jenny Lin, Lap-Fai Yu, Demetri Terzopoulos, Song-Chun Zhu.
    * equal contributors

    Under review for IJCV

    Journal  Computer Vision  Computer Graphics 

  • Abstract
    We propose a systematic learning-based approach to the generation of massive quantities of synthetic 3D scenes and numerous photorealistic 2D images thereof, with associated ground truth information, for the purposes of training, benchmarking, and diagnosing learning-based computer vision and robotics algorithms. In particular, we devise a learning-based pipeline of algorithms capable of automatically generating and rendering a potentially infinite variety of indoor scenes by using a stochastic grammar, represented as an attributed Spatial And-Or Graph, in conjunction with state-of-the-art physics-based rendering. Our pipeline is capable of synthesizing scene layouts with high diversity, and it is configurable in that it enables the precise customization and control of important attributes of the generated scenes. It renders photorealistic RGB images of the generated scenes while automatically synthesizing detailed, per-pixel ground truth data, including visible surface depth and normal, object identity, and material information (detailed to object parts), as well as environments (e.g., illumination and camera viewpoints). We demonstrate the value of our dataset by improving performance in certain machine-learning-based scene understanding tasks (e.g., depth and surface normal prediction, semantic segmentation, and reconstruction) and by providing benchmarks for and diagnostics of trained models by modifying object attributes and scene properties in a controllable manner.
  • [Oral] The Martian: Examining Human Physical Judgments Across Virtual Gravity Fields

    Tian Ye*, Siyuan Qi*, James Kubricht, Yixin Zhu, Hongjing Lu, Song-Chun Zhu.
    * equal contributors

    IEEE VR 2017, Los Angeles, California, USA
    Accepted to TVCG

    Oral  Journal  Virtual Reality  Cognitive Science 

  • Abstract
    This paper examines how humans adapt to novel physical situations with unknown gravitational acceleration in immersive virtual environments. We designed four virtual reality experiments with different tasks for participants to complete: strike a ball to hit a target, trigger a ball to hit a target, predict the landing location of a projectile, and estimate the flight duration of a projectile. The first two experiments compared human behavior in the virtual environment with real-world performance reported in the literature. The last two experiments aimed to test the human ability to adapt to novel gravity fields by measuring their performance in trajectory prediction and time estimation tasks. The experiment results show that: 1) based on brief observation of a projectile's initial trajectory, humans are accurate at predicting the landing location even under novel gravity fields, and 2) humans' time estimation in a familiar earth environment fluctuates around the ground truth flight duration, although the time estimation in unknown gravity fields indicates a bias toward earth's gravity.
@article{ye2017martian,
    title={The Martian: Examining Human Physical Judgments across Virtual Gravity Fields},
    author={Ye, Tian and Qi, Siyuan and Kubricht, James and Zhu, Yixin and Lu, Hongjing and Zhu, Song-Chun},
    journal={IEEE Transactions on Visualization and Computer Graphics},
    volume={23},
    number={4},
    pages={1399--1408},
    year={2017},
    publisher={IEEE}
}

More


  • First Class Honors, Faculty of Engineering, University of Hong Kong
    2013
  • Undergraduate Research Fellowship, University of Hong Kong
    2012
  • Kingboard Scholarship, University of Hong Kong
    2010 & 2011 & 2012
  • Dean's Honors List, University of Hong Kong
    2010 & 2011
  • AI Challenge (Sponsored by Google), 2nd place among Chinese contestants, 74th worldwide
    2011
  • Student Ambassador, University of Hong Kong
    2010
  • University Entrance Scholarship, University of Hong Kong
    2010