Guangyu Zhou
Ph.D. in computer science

Boelter Hall 3551
420 Westwood Plaza
University of California, Los Angeles
Los Angeles, CA 90095

Email: zgy_ucla_cs [at] cs.ucla.edu


I am a Ph.D. candidate in Computer Science at the University of California, Los Angeles, advised by Prof. Wei Wang. My research interests include data mining and machine learning across various fields. Prior to UCLA, I received my B.Eng. in Computer Science from the University of Illinois at Urbana-Champaign, under the supervision of Prof. Jiawei Han.

You can find my CV here. My Google Scholar profile.


Conference Papers

C6

Multifaceted Protein-Protein Interaction Prediction Based on Siamese Residual RCNN ISMB '19

Muhao Chen, Chelsea Ju, Guangyu Zhou, Xuelu Chen, Tianran Zhang, Kai-Wei Chang, Carlo Zaniolo, and Wei Wang
Intelligent Systems for Molecular Biology (ISMB '19), acceptance rate: NA

Motivation: Sequence-based protein-protein interaction (PPI) prediction represents a fundamental computational biology problem. To address this problem, extensive research efforts have been made to extract predefined features from the sequences. Based on these features, statistical algorithms are learned to classify the PPIs. However, such explicit features are usually costly to extract, and typically have limited coverage on the PPI information. Hence, we present an end-to-end framework, PIPR, for PPI predictions using only the primary sequences. PIPR incorporates a deep residual recurrent convolutional neural network in the Siamese architecture, which leverages both robust local features and contextualized information, both of which are significant for capturing the mutual influence of protein sequences. Results: Our framework relieves the data pre-processing efforts that are required by other systems, and generalizes well to different application scenarios. Experimental evaluations show that PIPR outperforms various state-of-the-art systems on the binary PPI prediction problem. Moreover, it shows a promising performance on more challenging problems of interaction type prediction and binding affinity estimation, where existing approaches fall short. Availability: The implementation is available at https://github.com/muhaochen/seq_ppi.git Contact: muhaochen@ucla.edu
@article{chen2019pipr, title={Multifaceted Protein-Protein Interaction Prediction Based on Siamese Residual RCNN}, author={Chen, Muhao and Ju, Chelsea and Zhou, Guangyu and Chen, Xuelu and Zhang, Tianran and Chang, Kai-Wei and Zaniolo, Carlo and Wang, Wei}, journal={Bioinformatics (Accepted)}, year={2019}, publisher={Oxford University Press} }
C5

Inferring Microbial Communities for City Scale Metagenomics Using Neural Networks BIBM '18

Guangyu Zhou, Jyun-Yu Jiang, Chelsea J.-T. Ju, and Wei Wang
IEEE International Conference on Bioinformatics and Biomedicine (BIBM '18), acceptance rate: 105/534 = 19.6%

Microbes play a critical role in human health and disease, especially in cities with high population densities. Understanding the microbial ecosystem in an urban environment is essential for monitoring the transmission of infectious diseases and identifying potentially urgent threats. To achieve this goal, researchers have started to collect and analyze metagenomic samples from subway stations in major cities. However, it is too costly and time-consuming to achieve city-wide sampling with fine-grained geo-spatial resolution. In this paper, we present MetaMLAnn, a neural network based approach to infer microbial communities at unmeasured locations, based upon information from various data sources in an urban environment, including subway line information, sampling material, and microbial compositions. MetaMLAnn exploits these heterogeneous features to capture the latent dependencies between microbial compositions and the urban environment, thereby precisely inferring microbial communities at unsampled locations. Moreover, we propose a regularization framework to incorporate the species relatedness as prior knowledge. We evaluate our approach using the public metagenomics dataset collected from multiple subway stations in New York and Boston. The experimental results show that MetaMLAnn consistently outperforms five conventional classifiers across several evaluation metrics. The code, features and labels are available at https://github.com/zgy921028/MetaMLAnn
@inproceedings{8621409, author={G. {Zhou} and J. {Jiang} and C. J.-T. {Ju} and W. {Wang}}, booktitle={2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)}, title={Inferring Microbial Communities for City Scale Metagenomics Using Neural Networks}, year={2018}, volume={}, number={}, pages={603-608}, keywords={Urban areas;Neural networks;Public transportation;Atmospheric modeling;Computational modeling;Predictive models;Feature extraction;Urban metagenomics;Multi-label classification;Neural network}, doi={10.1109/BIBM.2018.8621409}, ISSN={}, month={Dec},}
C4

GeoBurst: Real-Time Local Event Detection in Geo-Tagged Tweet Streams SIGIR '16

Chao Zhang, Guangyu Zhou, Quan Yuan, Honglei Zhuang, Yu Zheng, Lance Kaplan, Shaowen Wang, and Jiawei Han
ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '16), acceptance rate: 62/341 = 18.18%

The real-time discovery of local events (e.g., protests, crimes, disasters) is of great importance to various applications, such as crime monitoring, disaster alarming, and activity recommendation. While this task was nearly impossible years ago due to the lack of timely and reliable data sources, the recent explosive growth in geo-tagged tweet data brings new opportunities to it. That said, how to extract quality local events from geo-tagged tweet streams in real time remains largely unsolved so far. We propose GeoBurst, a method that enables effective and real-time local event detection from geo-tagged tweet streams. With a novel authority measure that captures the geo-topic correlations among tweets, GeoBurst first identifies several pivots in the query window. Such pivots serve as representative tweets for potential local events and naturally attract similar tweets to form candidate events. To select truly interesting local events from the candidate list, GeoBurst further summarizes continuous tweet streams and compares the candidates against historical activities to obtain spatiotemporally bursty ones. Finally, GeoBurst also features an updating module that finds new pivots with little time cost when the query window shifts. As such, GeoBurst is capable of monitoring continuous streams in real time. We used crowdsourcing to evaluate GeoBurst on two real-life data sets that contain millions of geo-tagged tweets. The results demonstrate that GeoBurst significantly outperforms state-of-the-art methods in precision, and is orders of magnitude faster.
@inproceedings{zhang2016geoburst, title={Geoburst: Real-time local event detection in geo-tagged tweet streams}, author={Zhang, Chao and Zhou, Guangyu and Yuan, Quan and Zhuang, Honglei and Zheng, Yu and Kaplan, Lance and Wang, Shaowen and Han, Jiawei}, booktitle={Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval}, pages={513--522}, year={2016}, organization={ACM} }
C3

Linguistic Understanding of Complaints and Praises in User Reviews WASSA '16

Ganesan, Kavita and Zhou, Guangyu
Proceedings of the 7th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis. 2016. (WASSA '16)

Traditional sentiment analysis has focused on predicting the polarity of texts as positive or negative at different granularities. This broad categorization does not account for the informativeness of the underlying text. For many real-world applications such as social listening, brand monitoring and e-commerce platforms, the opinions that really matter are the informative opinions describing why something is good or bad. In this paper, we try to understand the properties of complaints and praises, which are informative subsets of the negative and positive categories. Our analysis in the context of user reviews shows that complaints and praises have distinct properties that differentiate them from positive-only or negative-only sentences.
@inproceedings {ComplaintsPraises, title = {Linguistic Understanding of Complaints and Praises in User Reviews}, booktitle = {7th Workshop on Computational Approaches to Subjectivity, Sentiment \& Social Media Analysis (NAACL-WASSA)}, year = {2016}, publisher = {NAACL}, organization = {NAACL}, address = {San Diego}, author = {Kavita Ganesan and Guangyu Zhou}}
C1

Complexes Detection in Biological Networks via Diversified Dense Subgraphs Mining RECOMB '16

Xiuli Ma, Guangyu Zhou, Jingjing Wang, Jian Peng, and Jiawei Han
Research in Computational Molecular Biology. 2016. (RECOMB '16)

Protein-protein interaction (PPI) networks, providing a comprehensive landscape of protein interacting patterns, enable us to explore biological processes and cellular components at multiple resolutions. For a biological process, a number of proteins need to work together to perform the job. Proteins densely interact with each other, forming large molecular machines or cellular building blocks. Identification of such densely interconnected clusters or protein complexes from PPI networks enables us to obtain a better understanding of the hierarchy and organization of biological processes and cellular components. Most existing methods apply efficient graph clustering algorithms on PPI networks, often failing to detect possible densely connected subgraphs and overlapped subgraphs. Besides clustering-based methods, dense subgraph enumeration methods have also been used, which aim to find all densely connected protein sets. However, such methods are not practically tractable even on a small yeast PPI network, due to high computational complexity. In this paper, we introduce a novel approximate algorithm to efficiently enumerate putative protein complexes from biological networks. The key insight of our algorithm is that we do not need to enumerate all dense subgraphs. Instead we only need to find a small subset of subgraphs that cover as many proteins as possible. The problem is formulated as finding a diverse set of dense subgraphs, where we develop highly effective pruning techniques to guarantee efficiency. To handle large networks, we take a divide-and-conquer approach to speed up the algorithm in a distributed manner. By comparing with existing clustering and dense subgraph-based algorithms on several human and yeast PPI networks, we demonstrate that our method can detect more putative protein complexes and achieves better prediction accuracy.
@inproceedings{DBLP:conf/recomb/MaZW0016, author = {Xiuli Ma and Guangyu Zhou and Jingjing Wang and Jian Peng and Jiawei Han}, title = {Complexes Detection in Biological Networks via Diversified Dense Subgraphs Mining}, booktitle = {Research in Computational Molecular Biology - 20th Annual Conference, {RECOMB} 2016, Santa Monica, CA, USA, April 17-21, 2016, Proceedings}, pages = {270--272}, year = {2016}, crossref = {DBLP:conf/recomb/2016}, url = {https://link.springer.com/content/pdf/bbm\%3A978-3-319-31957-5\%2F1.pdf}, timestamp = {Tue, 22 Jan 2019 19:17:14 +0100}, biburl = {https://dblp.org/rec/bib/conf/recomb/MaZW0016}, bibsource = {dblp computer science bibliography, https://dblp.org} }

Journal Articles

J3

Prediction of microbial communities for urban metagenomics using neural network approach BMC

Guangyu Zhou, Jyun-Yu Jiang, Chelsea J.-T. Ju, and Wei Wang
Hum Genomics (Oct 2019)

Microbes are closely associated with human health and disease, especially in densely populated cities. It is essential to understand the microbial ecosystem in an urban environment for cities to monitor the transmission of infectious diseases and detect potentially urgent threats. To achieve this goal, DNA sample collection and analysis have been conducted at subway stations in major cities. However, city-scale sampling with fine-grained geo-spatial resolution is expensive and laborious. In this paper, we introduce MetaMLAnn, a neural network based approach to infer microbial communities at unsampled locations given information reflecting different factors, including subway line networks, sampling material types, and microbial composition patterns.
@article{zhou2019prediction, title={Prediction of microbial communities for urban metagenomics using neural network approach}, author={Zhou, Guangyu and Jiang, Jyun-Yu and Ju, Chelsea J-T and Wang, Wei}, journal={Human genomics}, volume={13}, number={1}, pages={47}, year={2019}, publisher={Springer} }
J2

MetaPheno: A critical evaluation of deep learning and machine learning in metagenome-based disease prediction Methods

Nathan LaPierre, Chelsea J.-T. Ju, Guangyu Zhou, and Wei Wang
Methods (March 2019)

The human microbiome plays a number of critical roles, impacting almost every aspect of human health and well-being. Conditions in the microbiome have been linked to a number of significant diseases. Additionally, revolutions in sequencing technology have led to a rapid increase in publicly-available sequencing data. Consequently, there have been growing efforts to predict disease status from metagenomic sequencing data, with a proliferation of new approaches in the last few years. Some of these efforts have explored utilizing a powerful form of machine learning called deep learning, which has been applied successfully in several biological domains. Here, we review some of these methods and the algorithms that they are based on, with a particular focus on deep learning methods. We also perform a deeper analysis of Type 2 Diabetes and obesity datasets that have eluded improved results, using a variety of machine learning and feature extraction methods. We conclude by offering perspectives on study design considerations that may impact results and future directions the field can take to improve results and offer more valuable conclusions. The scripts and extracted features for the analyses conducted in this paper are available via GitHub: https://github.com/nlapier2/metapheno.
@article{lapierre2019metapheno, title={MetaPheno: A Critical Evaluation of Deep Learning and Machine Learning in Metagenome-Based Disease Prediction}, author={LaPierre, Nathan and Ju, Chelsea J-T and Zhou, Guangyu and Wang, Wei}, journal={Methods}, year={2019}, publisher={Elsevier} }
J1

Detection of Complexes in Biological Networks Through Diversified Dense Subgraph Mining JCB

Xiuli Ma, Guangyu Zhou, Jingbo Shang, Jingjing Wang, Jian Peng, and Jiawei Han
Journal of Computational Biology (JCB), Volume 24 Issue 9 (September 2017): 923--941.

Protein-protein interaction (PPI) networks, providing a comprehensive landscape of protein interaction patterns, enable us to explore biological processes and cellular components at multiple resolutions. For a biological process, a number of proteins need to work together to perform a job. Proteins densely interact with each other, forming large molecular machines or cellular building blocks. Identification of such densely interconnected clusters or protein complexes from PPI networks enables us to obtain a better understanding of the hierarchy and organization of biological processes and cellular components. However, most existing graph clustering algorithms on PPI networks often cannot effectively detect densely connected subgraphs and overlapped subgraphs. In this article, we formulate the problem of complex detection as diversified dense subgraph mining and introduce a novel approximation algorithm to efficiently enumerate putative protein complexes from biological networks. The key insight of our algorithm is that instead of enumerating all dense subgraphs, we only need to find a small diverse subset of subgraphs that cover as many proteins as possible. The problem is modeled as finding a diverse set of maximal dense subgraphs where we develop highly effective pruning techniques to guarantee efficiency. To scale up to large networks, we devise a divide-and-conquer approach to speed up the algorithm in a distributed manner. By comparing with existing clustering and dense subgraph-based algorithms on several yeast and human PPI networks, we demonstrate that our method can detect more putative protein complexes and achieves better prediction accuracy.
@article{ma2017detection, title={Detection of complexes in biological networks through diversified dense subgraph mining}, author={Ma, Xiuli and Zhou, Guangyu and Shang, Jingbo and Wang, Jingjing and Peng, Jian and Han, Jiawei}, journal={Journal of Computational Biology}, volume={24}, number={9}, pages={923--941}, year={2017}, publisher={Mary Ann Liebert, Inc. 140 Huguenot Street, 3rd Floor New Rochelle, NY 10801 USA} }

Theses

T2

Inferring Microbial Community for City-Scale Metagenomics

Guangyu Zhou
Written Qualifying Exam/M.S. thesis, UCLA Computer Science, 2017.

Microbial communities in our environment influence human health and disease, especially in cities with high population densities. Profiling the communities with metagenomics at city scale is beneficial for long-term disease surveillance and health management. To achieve this, recent works have made great efforts to collect DNA samples from subway stations in large cities. However, obtaining city-scale DNA samples with fine-grained geospatial resolution is costly and time-consuming. In this paper, we present MetaMLAnn, a neural network based approach to infer microbial communities at unsampled locations based upon sparsely collected samples in certain locations. Our model can capture the latent dependency between species by using a shared representation among heterogeneous features. We also propose a regularization framework to incorporate species similarity as prior knowledge. We evaluate our approach on the metagenomics dataset collected across the New York subway system. The experimental results show that MetaMLAnn consistently outperforms five conventional classifiers across several evaluation metrics. In addition, the promising results demonstrate that integrating domain knowledge and our proposed features is beneficial for inferring microbial communities.
T1

Extract and Match in Citation Extraction: A Comparative Study on Constraint Usage

Guangyu Zhou
Bachelor's thesis, University of Illinois at Urbana-Champaign, 2016.

Recent years have witnessed an unprecedented proliferation of data, with the largest portion encoded in natural language and therefore unstructured. One sub-field of information extraction is Named Entity Recognition (NER). Unlike traditional classification, NER encodes a hidden-state sequential structure in which the label sequence of the words is latent and both words and tags depend on one another. Conditional Random Fields (CRFs) are the models most often applied to such NER problems. A CRF allows both discriminative training and the bidirectional flow of probabilistic information across the sequence, which makes it a state-of-the-art tool. However, research indicates that even state-of-the-art NER systems (CRFs) are brittle, meaning that NER systems developed for one domain do not typically perform well on other domains. Even within one specific domain of NER, performance cannot be guaranteed across sources with inconsistent formats or writing rules. Much related work extends the CRF model by injecting domain knowledge in the form of constraint enforcement. Most non-local constraints are hard for the CRF model to capture directly because of the Markov property. To let users add general (local and non-local) constraints over the output space in a natural and systematic fashion, these approaches provide different approximate inference procedures such as Gibbs sampling or beam search. Although such approaches can be extremely fast because they reduce the search space, they may end in local optima and sacrifice effectiveness. We specifically focus on the citation extraction problem: given unstructured publication data from personal webpages, we would like to use NER tools to structure them into predefined fields (e.g., Author, Title, Venue, Year) so that the structured data can be better indexed. The opportunity in our problem lies in the consistent patterns (i.e., Match) of the publication sources.
To leverage such “Matching” knowledge, we use constraint enforcement. However, the challenge of this approach is how to select useful constraints from the instantiated constraint candidates at an affordable cost. To address the challenge, in this work we approach constraint enforcement from a different perspective: we adapt an extended version of (segment-based) Viterbi that can incorporate most types of constraints users need. More specifically, we conduct a comparative analysis between local and non-local constraints. Our objective is to achieve similar performance on the citation extraction problem without incurring the cost of enforcing expensive non-local constraints. We formalize our problem into two sub-directions: 1) use cheaper local constraints to express expensive non-local constraints; 2) enforce expensive non-local constraints based on a cheaper model (Viterbi). While the first direction lacks a systematic argument, the experimental results give us guidance on the second direction.

Teaching

CS 35L
Software Construction Laboratory: Lab 3, Winter 2019; Lab 3, Fall 2018; Lab 4, Spring 2018; Lab 3, Winter 2018. Professor: Paul Eggert
CS 249
Big Data Analysis, Fall 2017, Professor: Wei Wang

Misc