Zilong Zheng

Ph.D. Candidate in Computer Science at UCLA

Engineering VI 491
404 Westwood Plaza
University of California, Los Angeles
Los Angeles, CA, 90095
Email: z.zheng [at] ucla [dot] edu

I am a fourth-year Ph.D. candidate in the Department of Computer Science at UCLA. I am currently doing research on machine learning at the Center for Vision, Cognition, Learning, and Autonomy (VCLA), under the supervision of Prof. Song-Chun Zhu. Before that, I obtained a bachelor's degree in Computer Science from the University of Minnesota. I also received a B.E. degree in Micro-electronic Technology from the University of Electronic Science and Technology of China (UESTC). My research interests lie in Machine Learning and Cognitive Science.


Publications

Journal Articles

  • Cooperative Training of Fast Thinking Initializer and Slow Thinking Solver for Multi-Modal Conditional Learning TPAMI

    Jianwen Xie*, Zilong Zheng*, Xiaolin Fang, Song-Chun Zhu, Ying Nian Wu (* equal contributions)
    IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) (major revision)

    PDF
    This paper studies the supervised learning of the conditional distribution of a high-dimensional output given an input, where the output and input may belong to two different modalities, e.g., the output is a photo image and the input is a sketch image. We solve this problem by cooperative training of a fast thinking initializer and a slow thinking solver. The initializer generates the output directly by a non-linear transformation of the input as well as a noise vector that accounts for latent variability in the output. The slow thinking solver learns an objective function in the form of a conditional energy function, so that the output can be generated by optimizing the objective function, or more rigorously, by sampling from the conditional energy-based model. We propose to learn the two models jointly, where the fast thinking initializer serves to initialize the sampling of the slow thinking solver, and the solver refines the initial output by an iterative algorithm. The solver learns from the difference between the refined output and the observed output, while the initializer learns from how the solver refines its initial output. We demonstrate the effectiveness of the proposed method on various multi-modal conditional learning tasks, e.g., class-to-image generation, image-to-image translation, and image recovery.
    @article{xie_cooperative_tpami,
        title={Cooperative Training of Fast Thinking Initializer and Slow Thinking Solver for Multi-Modal Conditional Learning},
        author={Xie, Jianwen and Zheng, Zilong and Fang, Xiaolin and Zhu, Song-Chun and Wu, Ying Nian},
        journal={IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)},
        note={major revision}
    }
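
The cooperative learning scheme described in the abstract above can be summarized in a short sketch. This is a toy illustration only, assuming PyTorch: the small fully-connected networks, step sizes, and synthetic data below are hypothetical placeholders rather than the paper's released architecture, but the loop mirrors the described roles of the fast thinking initializer and the slow thinking solver.

# Toy sketch of the cooperative training loop (assumed PyTorch); all sizes,
# step sizes, and data are hypothetical placeholders.
import torch
import torch.nn as nn

x_dim, y_dim, z_dim = 8, 8, 4  # input, output, and latent-noise dimensions (toy)

initializer = nn.Sequential(nn.Linear(x_dim + z_dim, 32), nn.ReLU(), nn.Linear(32, y_dim))
energy_net  = nn.Sequential(nn.Linear(x_dim + y_dim, 32), nn.ReLU(), nn.Linear(32, 1))
opt_init   = torch.optim.Adam(initializer.parameters(), lr=1e-3)
opt_energy = torch.optim.Adam(energy_net.parameters(), lr=1e-3)

def langevin_refine(x, y0, steps=15, step_size=0.05):
    """Slow thinking solver: refine the initializer's proposal by Langevin
    dynamics on the conditional energy E(y | x)."""
    y = y0.detach().clone().requires_grad_(True)
    for _ in range(steps):
        energy = energy_net(torch.cat([x, y], dim=-1)).sum()
        grad, = torch.autograd.grad(energy, y)
        y = (y - 0.5 * step_size ** 2 * grad
               + step_size * torch.randn_like(y)).detach().requires_grad_(True)
    return y.detach()

for _ in range(100):                         # toy training iterations
    x = torch.randn(16, x_dim)               # observed inputs (placeholder data)
    y_obs = torch.tanh(x[:, :y_dim])         # observed outputs (placeholder data)

    # Fast thinking initializer: direct proposal from the input plus noise.
    z = torch.randn(16, z_dim)
    y_init = initializer(torch.cat([x, z], dim=-1))

    # Slow thinking solver refines the initial proposal.
    y_refined = langevin_refine(x, y_init)

    # Solver learns from the difference between observed and refined outputs.
    ebm_loss = (energy_net(torch.cat([x, y_obs], dim=-1)).mean()
                - energy_net(torch.cat([x, y_refined], dim=-1)).mean())
    opt_energy.zero_grad(); ebm_loss.backward(); opt_energy.step()

    # Initializer learns from how the solver refined its own proposal.
    init_loss = ((y_init - y_refined) ** 2).mean()
    opt_init.zero_grad(); init_loss.backward(); opt_init.step()
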
  • Learning Energy-Based 3D Descriptor Networks for Volumetric Shape Synthesis and Analysis TPAMI

    Jianwen Xie*, Zilong Zheng*, Ruiqi Gao, Wenguan Wang, Song-Chun Zhu, Ying Nian Wu (* equal contributions)
    IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) (major revision)

    PDF
    3D data that contains rich geometry information of objects and scenes is a valuable asset for understanding the 3D physical world. With the recent emergence of large-scale 3D datasets, it becomes increasingly crucial to have a powerful 3D generative model for 3D shape synthesis and analysis. This paper proposes a 3D shape descriptor network, which is a deep 3D convolutional energy-based model, for representing volumetric shape patterns. The maximum likelihood training of the model follows an “analysis by synthesis” scheme. The benefits of the proposed model are five-fold: first, unlike GANs and VAEs, the training of the model does not rely on any auxiliary models; second, the model can synthesize realistic 3D shapes by sampling from the probability distribution via MCMC, such as Langevin dynamics; third, the conditional version of the model can be applied to 3D object recovery and super-resolution; fourth, the model can be used to train a 3D generator network via MCMC teaching; fifth, the unsupervisedly trained model provides a powerful feature extractor for 3D data, which can be useful for 3D object classification. Experiments demonstrate that the proposed model can generate high-quality 3D shape patterns and can be useful for a wide variety of 3D shape analysis tasks.
    @article{xie_3ddescriptor_tpami,
        title={Learning Energy-Based 3D Descriptor Networks for Volumetric Shape Synthesis and Analysis},
        author={Xie, Jianwen and Zheng, Zilong and Gao, Ruiqi and Wang, Wenguan and Zhu, Song-Chun and Wu, Ying Nian},
        journal={IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)},
        note={major revision}
    }
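
As a companion to the abstract above, here is a minimal sketch of the synthesis step: Langevin dynamics on a 3D convolutional energy function defined over voxel grids. It assumes PyTorch; the architecture, grid size, and step sizes are hypothetical placeholders, not the paper's configuration. During maximum likelihood training, the model parameters would then be updated to lower the energy of observed shapes relative to these synthesized ones.

# Toy sketch (assumed PyTorch): Langevin synthesis from a 3D convolutional
# energy-based model over voxel grids. All settings are placeholders.
import torch
import torch.nn as nn

class VoxelEnergyNet(nn.Module):
    """Map a voxel grid to a scalar energy; lower energy = more plausible shape."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(16, 32, kernel_size=4, stride=2, padding=1), nn.ReLU(),
        )
        self.head = nn.Linear(32 * 8 * 8 * 8, 1)

    def forward(self, v):                       # v: (batch, 1, 32, 32, 32) voxels
        return self.head(self.features(v).flatten(1))

energy_net = VoxelEnergyNet()

def synthesize(batch=4, steps=30, step_size=0.1):
    """Synthesize voxel shapes by Langevin dynamics on the energy landscape."""
    v = torch.randn(batch, 1, 32, 32, 32, requires_grad=True)
    for _ in range(steps):
        energy = energy_net(v).sum()
        grad, = torch.autograd.grad(energy, v)
        v = (v - 0.5 * step_size ** 2 * grad
               + step_size * torch.randn_like(v)).detach().requires_grad_(True)
    return v.detach()

print(synthesize().shape)                       # torch.Size([4, 1, 32, 32, 32])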

Conference Papers

  • Unsupervised Cross-Domain Translation via Alternating MCMC Teaching

    Jianwen Xie*, Zilong Zheng*, Xiaolin Fang, Song-Chun Zhu, Ying Nian Wu (* equal contributions)
    Under Review

  • Generative PointNet: Deep Energy-Based Learning on Unordered Point Sets for 3D Generation, Reconstruction and Classification

    Jianwen Xie, Yifei Xu, Zilong Zheng, Song-Chun Zhu, Ying Nian Wu
    Under Review

  • Joint Inference of States, Robot Knowledge, and Human (False-)Beliefs ICRA'20

    Tao Yuan, Hangxin Liu, Lifeng Fan, Zilong Zheng, Tao Gao, Yixin Zhu, Song-Chun Zhu
    In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA) 2020

    PDF Video
    Aiming to understand how human (false-)belief—a core socio-cognitive ability—would affect human interactions with robots, this paper proposes to adopt a graphical model to unify the representation of object states, robot knowledge, and human (false-)beliefs. Specifically, a parse graph (PG) is learned from a single-view spatiotemporal parsing by aggregating various object states over time; such a learned representation is accumulated as the robot’s knowledge. An inference algorithm is derived to fuse the individual PGs from all robots across multiple views into a joint PG, which affords more effective reasoning and inference capability to overcome the errors originating from a single view. In the experiments, through the joint inference over PGs, the system correctly recognizes human (false-)belief in various settings and achieves better cross-view accuracy on a challenging small-object tracking dataset.
    @inproceedings{yuan2020joint,
        title={Joint Inference of States, Robot Knowledge, and Human (False-)Beliefs},
        author={Yuan, Tao and Liu, Hangxin and Fan, Lifeng and Zheng, Zilong and Gao, Tao and Zhu, Yixin and Zhu, Song-Chun},
        booktitle={ICRA},
        year={2020}
    }
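
The benefit of fusing per-view parse graphs can be illustrated with a toy product-of-experts style combination of per-view beliefs. This is only an illustration of why joint inference across views can overcome single-view errors; it is not the inference algorithm derived in the paper, and the numbers below are made up.

# Toy illustration: fusing per-view beliefs about one object's state.
import torch

# Hypothetical categorical beliefs over 3 object states from 3 robot views;
# views 0 and 1 are weakly correct, view 2 is wrong on its own (e.g., occlusion).
per_view = torch.tensor([[0.6, 0.2, 0.2],
                         [0.7, 0.2, 0.1],
                         [0.2, 0.6, 0.2]])

# Fuse the log-beliefs and renormalize (product-of-experts style combination).
joint = per_view.log().sum(dim=0).softmax(dim=0)

print(per_view.argmax(dim=1))   # per-view guesses: tensor([0, 0, 1])
print(joint.argmax())           # joint guess: tensor(0) -- the single-view error is corrected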
    
  • Motion-Based Generator Model: Unsupervised Disentanglement of Appearance, Trackable and Intrackable Motions in Dynamic Patterns AAAI'20

    Jianwen Xie*, Ruiqi Gao*, Zilong Zheng, Song-Chun Zhu, Ying Nian Wu (* equal contributions)
    The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI) 2020 (Oral)

    PDF Website
    Dynamic patterns are characterized by complex spatial and motion patterns. Understanding dynamic patterns requires a disentangled representational model that separates the factorial components. A commonly used model for dynamic patterns is the state space model, where the state evolves over time according to a transition model and the state generates the observed image frames according to an emission model. To model the motions explicitly, it is natural for the model to be based on the motions or the displacement fields of the pixels. Thus, in the emission model, we let the hidden state generate a displacement field, which warps the trackable component in the previous image frame to generate the next frame, while adding a simultaneously emitted residual image to account for the change that cannot be explained by the deformation. The warping of the previous image captures the trackable part of the change between frames, while the residual image captures the intrackable part. We use a maximum likelihood algorithm that iterates between inferring the latent noise vectors that drive the transition model and updating the model parameters given the inferred latent vectors. Meanwhile, we adopt a regularization term that penalizes the norms of the residual images to encourage the model to explain the change of image frames by trackable motion. Unlike existing methods on dynamic patterns, we learn our model in an unsupervised setting without ground-truth displacement fields or optical flows. In addition, our model defines a notion of intrackability by the separation of the warped component and the residual component in each image frame. We show that our method can synthesize realistic dynamic patterns and disentangle appearance, trackable motions, and intrackable motions. The learned model is useful for motion transfer, and it is natural to adopt it to define and measure the intrackability of a dynamic pattern.
    @inproceedings{xie2020motion,
        title={Motion-Based Generator Model: Unsupervised Disentanglement of Appearance, Trackable and Intrackable Motions in Dynamic Patterns},
        author={Xie, Jianwen and Gao, Ruiqi and Zheng, Zilong and Zhu, Song-Chun and Wu, Ying Nian},
        booktitle={The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI)},
        year={2020}
    } 
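
A minimal sketch of the emission step described above, assuming PyTorch: the displacement field warps the previous frame (the trackable component), and an emitted residual image accounts for the intrackable change. Tensor sizes and magnitudes are hypothetical placeholders; in the model these quantities are generated from the latent state rather than sampled at random as here.

# Toy sketch (assumed PyTorch) of warping the previous frame by a displacement
# field and adding a residual image. Sizes and values are placeholders.
import torch
import torch.nn.functional as F

B, C, H, W = 2, 3, 16, 16
prev_frame   = torch.rand(B, C, H, W)          # previous image frame (toy data)
displacement = 0.1 * torch.randn(B, 2, H, W)   # emitted pixel displacements, in
                                               # normalized [-1, 1] coordinates
residual     = 0.05 * torch.randn(B, C, H, W)  # emitted residual (intrackable part)

# Identity sampling grid in normalized coordinates, shape (B, H, W, 2).
theta = torch.tensor([[1., 0., 0.], [0., 1., 0.]]).repeat(B, 1, 1)
base_grid = F.affine_grid(theta, size=(B, C, H, W), align_corners=False)

# Warp the previous frame along the displacement field, then add the residual.
grid = base_grid + displacement.permute(0, 2, 3, 1)
warped = F.grid_sample(prev_frame, grid, align_corners=False)
next_frame = warped + residual                 # trackable motion + intrackable change
print(next_frame.shape)                        # torch.Size([2, 3, 16, 16])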
    
  • Reasoning Visual Dialogs with Structural and Partial Observations CVPR'19

    Zilong Zheng*, Wenguan Wang*, Siyuan Qi*, Song-Chun Zhu (* equal contributions)
    IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2019 (Oral)

    PDF Code
    We propose a novel model to address the task of Visual Dialog, which exhibits complex dialog structures. To obtain a reasonable answer based on the current question and the dialog history, the underlying semantic dependencies between dialog entities are essential. In this paper, we explicitly formalize this task as inference in a graphical model with partially observed nodes and unknown graph structures (relations in dialog). The given dialog entities are viewed as the observed nodes. The answer to a given question is represented by a node with a missing value. We first introduce an Expectation Maximization algorithm to infer both the underlying dialog structures and the missing node values (desired answers). Based on this, we propose a differentiable graph neural network (GNN) solution that approximates this process. Experimental results on the VisDial and VisDial-Q datasets show that our model outperforms competing methods. It is also observed that our method can infer the underlying dialog structure for better dialog reasoning.
    @inproceedings{zheng2019reasoning,
        title={Reasoning Visual Dialogs with Structural and Partial Observations},
        author={Zheng, Zilong and Wang, Wenguan and Qi, Siyuan and Zhu, Song-Chun},
        booktitle={Computer Vision and Pattern Recognition (CVPR), 2019 IEEE Conference on},
        year={2019}
    } 
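
A toy sketch, assuming PyTorch, of the kind of computation the GNN solution performs: edge weights (the unknown dialog structure) are inferred as soft attention between entity embeddings, and messages are passed to fill in the unobserved answer node. The modules, sizes, and iteration counts below are hypothetical placeholders, not the released model.

# Toy sketch (assumed PyTorch): message passing with inferred soft edges and
# one unobserved (answer) node. All components are placeholders.
import torch
import torch.nn as nn

D = 16                                   # embedding size (placeholder)
n_observed = 5                           # caption / question-answer entities
nodes = torch.randn(n_observed + 1, D)   # last node = answer node, value missing
nodes[-1] = 0.0                          # unobserved node initialized to zeros

edge_scorer = nn.Bilinear(D, D, 1)       # scores a candidate edge between two nodes
message_fn  = nn.Linear(D, D)
update_fn   = nn.GRUCell(D, D)

for _ in range(3):                       # a few message-passing iterations
    # Structure inference (E-step-like): soft adjacency from pairwise scores.
    n = len(nodes)
    scores = edge_scorer(nodes.unsqueeze(1).expand(-1, n, -1).reshape(-1, D),
                         nodes.unsqueeze(0).expand(n, -1, -1).reshape(-1, D))
    adj = scores.view(n, n).softmax(dim=-1)

    # Value update (M-step-like): aggregate messages, refresh node states,
    # in particular the missing answer node.
    messages = adj @ message_fn(nodes)
    nodes = update_fn(messages, nodes)

answer_embedding = nodes[-1]             # used to rank candidate answers
print(answer_embedding.shape)            # torch.Size([16])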
    
  • Learning Dynamic Generator Model by Alternating Back-Propagation Through Time AAAI'19

    Jianwen Xie*, Ruiqi Gao*, Zilong Zheng, Song-Chun Zhu, Ying Nian Wu (* equal contributions)
    The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI) 2019 (Spotlight)

    PDF Code Website
    This paper studies the dynamic generator model for spatial-temporal processes such as dynamic textures and action sequences in video data. In this model, each time frame of the video sequence is generated by a generator model, which is a non-linear transformation of a latent state vector, where the non-linear transformation is parametrized by a top-down neural network. The sequence of latent state vectors follows a non-linear auto-regressive model, where the state vector of the next frame is a non-linear transformation of the state vector of the current frame as well as an independent noise vector that provides randomness in the transition. The non-linear transformation of this transition model can be parametrized by a feedforward neural network. We show that this model can be learned by an alternating back-propagation through time algorithm that iteratively samples the noise vectors and updates the parameters in the transition model and the generator model. We show that our training method can learn realistic models for dynamic textures and action patterns.
    @article{xie2019DG,
        title = {Learning Dynamic Generator Model by Alternating Back-Propagation Through Time},
        author = {Xie, Jianwen and Gao, Ruiqi and Zheng, Zilong and Zhu, Song-Chun and Wu, Ying Nian},
        journal={The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI)},
        year = {2019}
    }                                        
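
A minimal sketch of alternating back-propagation through time as described above, assuming PyTorch: gradient-based inference of the latent noise vectors that drive the transition model (a simplified, noise-free stand-in for the Langevin-style inference in the paper), alternating with parameter updates for the transition and emission networks. All sizes, step counts, and the synthetic sequence are hypothetical placeholders.

# Toy sketch (assumed PyTorch) of alternating back-propagation through time.
import torch
import torch.nn as nn

T, z_dim, s_dim, x_dim = 10, 4, 8, 12          # frames, noise, state, frame dims (toy)
transition = nn.Sequential(nn.Linear(s_dim + z_dim, 32), nn.Tanh(), nn.Linear(32, s_dim))
emission   = nn.Sequential(nn.Linear(s_dim, 32), nn.Tanh(), nn.Linear(32, x_dim))
params_opt = torch.optim.Adam(
    list(transition.parameters()) + list(emission.parameters()), lr=1e-3)

video = torch.randn(T, x_dim)                  # observed sequence (placeholder data)

def unroll(z_seq):
    """Run the transition model through time and emit one frame per step."""
    s = torch.zeros(s_dim)
    frames = []
    for t in range(T):
        s = transition(torch.cat([s, z_seq[t]]))
        frames.append(emission(s))
    return torch.stack(frames)

z_seq = torch.zeros(T, z_dim, requires_grad=True)
for _ in range(50):                            # alternating learning iterations
    # Inference step: back-propagate through time into the latent noise vectors.
    for _ in range(10):
        recon = ((unroll(z_seq) - video) ** 2).sum() + 0.5 * (z_seq ** 2).sum()
        grad, = torch.autograd.grad(recon, z_seq)
        z_seq = (z_seq - 0.01 * grad).detach().requires_grad_(True)

    # Learning step: update transition/emission parameters given inferred latents.
    loss = ((unroll(z_seq) - video) ** 2).sum()
    params_opt.zero_grad(); loss.backward(); params_opt.step()
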
  • Learning Descriptor Networks for 3D Shape Synthesis and Analysis CVPR'18

    Jianwen Xie*, Zilong Zheng*, Ruiqi Gao, Wenguan Wang, Song-Chun Zhu, Ying Nian Wu (* equal contributions)
    IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2018 (Oral)

    PDF Code Website
    This paper proposes a 3D shape descriptor network, which is a deep convolutional energy-based model, for modeling volumetric shape patterns. The maximum likelihood training of the model follows an “analysis by synthesis” scheme and can be interpreted as a mode seeking and mode shifting process. The model can synthesize 3D shape patterns by sampling from the probability distribution via MCMC such as Langevin dynamics. The model can be used to train a 3D generator network via MCMC teaching. The conditional version of the 3D shape descriptor net can be used for 3D object recovery and 3D object super-resolution. Experiments demonstrate that the proposed model can generate realistic 3D shape patterns and can be useful for 3D shape analysis.
    @inproceedings{xie20183DDesNet,
        title={Learning Descriptor Networks for 3D Shape Synthesis and Analysis},
        author={Xie, Jianwen and Zheng, Zilong and Gao, Ruiqi and Wang, Wenguan and Zhu, Song-Chun and Wu, Ying Nian},
        booktitle={The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
        year={2018}
    }                                        

Patents

  • Noninvasive brain blood oxygen parameter measuring method

    Patent Publication No.: CN104382604A, Priority date: 2014-12-02. PDF

  • Near infrared noninvasive detection probe for tissue blood oxygen saturation

    Patent Publication No.: CN204394526U, Priority date: 2014-12-02. PDF

  • Brain blood oxygen saturation degree noninvasive monitor

    Patent Publication No.: CN204394527U, Priority date: 2014-12-02. PDF


Professional Services

  • Conference reviewer for ECCV 2020; CVPR 2019, 2020; AAAI 2020; ICCV 2019
  • Journal reviewer for Pattern Recognition, Neurocomputing