
A Transformer-based Approach for Source Code Summarization

Wasi Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang, in ACL (short), 2020.

Slides Code

Download the full text


Abstract

Generating a readable summary that describes the functionality of a program is known as source code summarization. In this task, learning code representations by modeling the pairwise relationships between code tokens to capture their long-range dependencies is crucial. To learn code representations for summarization, we explore the Transformer model, which uses a self-attention mechanism and has been shown to be effective at capturing long-range dependencies. In this work, we show that despite its simplicity, this approach outperforms the state-of-the-art techniques by a significant margin. We perform extensive analysis and ablation studies that reveal several important findings, e.g., that absolute encoding of source code token positions hinders summarization performance, while relative encoding significantly improves it. We have made our code publicly available to facilitate future research.
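The ablation finding above, that absolute position encoding hurts while relative encoding helps, is easy to make concrete. Below is a minimal PyTorch sketch of a single self-attention head with learned relative-distance embeddings added to the attention logits (in the style of Shaw et al., 2018); the class and parameter names are hypothetical, and this illustrates the mechanism rather than the authors' exact implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class RelativeSelfAttention(nn.Module):
    """One self-attention head with learned relative position embeddings
    added to the attention logits (Shaw et al., 2018 style). A sketch of
    the mechanism, not the authors' exact implementation."""

    def __init__(self, d_model, max_rel_dist=32):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.max_rel_dist = max_rel_dist
        # one embedding per clipped relative distance in [-max, +max]
        self.rel_emb = nn.Embedding(2 * max_rel_dist + 1, d_model)

    def forward(self, x):                       # x: (batch, seq, d_model)
        seq = x.size(1)
        q, k, v = self.q(x), self.k(x), self.v(x)
        logits = q @ k.transpose(-2, -1)        # content-to-content scores
        pos = torch.arange(seq, device=x.device)
        # clipped pairwise distances, shifted into embedding-index range
        dist = (pos[None, :] - pos[:, None]).clamp(
            -self.max_rel_dist, self.max_rel_dist) + self.max_rel_dist
        r = self.rel_emb(dist)                  # (seq, seq, d_model)
        # content-to-position scores: query i attends to distance (j - i)
        logits = logits + torch.einsum('bid,ijd->bij', q, r)
        attn = F.softmax(logits / x.size(-1) ** 0.5, dim=-1)
        return attn @ v

Note that no absolute position is ever embedded: the model only sees clipped pairwise offsets, which is the property the abstract credits with the improvement on code.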



Bib Entry

@inproceedings{ahmad2020transformer,
  author = {Ahmad, Wasi and Chakraborty, Saikat and Ray, Baishakhi and Chang, Kai-Wei},
  title = {A Transformer-based Approach for Source Code Summarization},
  booktitle = {ACL (short)},
  year = {2020},
  presentation_id = {https://virtual.acl2020.org/paper_main.449.html}
}

Related Publications

  1. Context Attentive Document Ranking and Query Suggestion

    Wasi Ahmad, Kai-Wei Chang, and Hongning Wang, in SIGIR, 2019.
    Full Text Slides Code
    We present a context-aware neural ranking model to exploit users’ on-task search activities and enhance retrieval performance. In particular, a two-level hierarchical recurrent neural network is introduced to learn search context representations of individual queries, search tasks, and the corresponding dependency structure by jointly optimizing two companion retrieval tasks: document ranking and query suggestion. To identify the variable dependency structure between search context and users’ ongoing search activities, attention is introduced at both levels of recurrent states. Extensive experimental comparisons against a rich set of baseline methods and an in-depth ablation analysis confirm the value of our proposed approach for modeling the search context buried in search tasks.
    @inproceedings{ahmad2019context,
      author = {Ahmad, Wasi and Chang, Kai-Wei and Wang, Hongning},
      title = {Context Attentive Document Ranking and Query Suggestion},
      booktitle = {SIGIR},
      year = {2019}
    }
    
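    The two-level architecture described in the abstract can be sketched compactly. The snippet below (PyTorch, hypothetical names and sizes) shows a query-level GRU feeding a session-level GRU, with attention over the session states producing the search-context vector that the ranking and suggestion heads would share; it is a sketch of the general idea, not the paper's exact model.

    import torch
    import torch.nn as nn

    class HierarchicalSearchContext(nn.Module):
        """Query-level GRU over each query's tokens, session-level GRU over
        the resulting query vectors, attention over session states.
        Hypothetical names and sizes; not the paper's exact model."""

        def __init__(self, vocab_size, d=128):
            super().__init__()
            self.emb = nn.Embedding(vocab_size, d)
            self.query_rnn = nn.GRU(d, d, batch_first=True)
            self.session_rnn = nn.GRU(d, d, batch_first=True)
            self.attn = nn.Linear(d, 1)

        def forward(self, session):                 # (batch, n_queries, n_tokens)
            b, nq, nt = session.shape
            tokens = self.emb(session.view(b * nq, nt))
            _, q_vec = self.query_rnn(tokens)       # final hidden state per query
            q_vec = q_vec.squeeze(0).view(b, nq, -1)
            states, _ = self.session_rnn(q_vec)     # (batch, n_queries, d)
            weights = torch.softmax(self.attn(states), dim=1)
            return (weights * states).sum(dim=1)    # search-context vector

    Both the document-ranking head and the query-suggestion decoder would consume this context vector, which is how the two companion tasks share supervision.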
  2. Multifaceted Protein-Protein Interaction Prediction Based on Siamese Residual RCNN

    Muhao Chen, Chelsea J.-T. Ju, Guangyu Zhou, Xuelu Chen, Tianran Zhang, Kai-Wei Chang, Carlo Zaniolo, and Wei Wang, in ISMB, 2019.
    Full Text Code
    Sequence-based protein-protein interaction (PPI) prediction represents a fundamental computational biology problem. To address this problem, extensive research efforts have been made to extract predefined features from the sequences. Based on these features, statistical algorithms are learned to classify the PPIs. However, such explicit features are usually costly to extract, and typically have limited coverage on the PPI information. Hence, we present an end-to-end framework, Lasagna, for PPI predictions using only the primary sequences of a protein pair. Lasagna incorporates a deep residual recurrent convolutional neural network in the Siamese learning architecture, which leverages both robust local features and contextualized information that are significant for capturing the mutual influence of protein sequences. Our framework relieves the data pre-processing efforts that are required by other systems, and generalizes well to different application scenarios. Experimental evaluations show that Lasagna outperforms various state-of-the-art systems on the binary PPI prediction problem. Moreover, it shows a promising performance on more challenging problems of interaction type prediction and binding affinity estimation, where existing approaches fall short.
    @inproceedings{chen2019multifaceted,
      author = {Chen, Muhao and Ju, Chelsea J.-T. and Zhou, Guangyu and Chen, Xuelu and Zhang, Tianran and Chang, Kai-Wei and Zaniolo, Carlo and Wang, Wei},
      title = {Multifaceted Protein-Protein Interaction Prediction Based on Siamese Residual RCNN},
      booktitle = {ISMB},
      year = {2019}
    }
    
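    As a rough illustration of the Siamese residual RCNN idea, the sketch below (PyTorch, hypothetical sizes) shares one convolutional-plus-recurrent encoder with a residual connection across a protein pair and combines the two sequence embeddings for classification; it is not the exact Lasagna architecture.

    import torch
    import torch.nn as nn

    class SiameseRCNN(nn.Module):
        """One shared residual conv+GRU encoder applied to both amino-acid
        sequences; the pair embedding is their element-wise product.
        Hypothetical sizes; a sketch, not the paper's exact architecture."""

        def __init__(self, n_symbols=26, d=64):
            super().__init__()
            self.emb = nn.Embedding(n_symbols, d)
            self.conv = nn.Conv1d(d, d, kernel_size=3, padding=1)
            self.rnn = nn.GRU(d, d, batch_first=True, bidirectional=True)
            self.clf = nn.Linear(2 * d, 2)          # interacts / does not interact

        def encode(self, seq):                      # seq: (batch, length) token ids
            x = self.emb(seq)
            c = torch.relu(self.conv(x.transpose(1, 2))).transpose(1, 2)
            h, _ = self.rnn(x + c)                  # residual connection
            return h.mean(dim=1)                    # (batch, 2 * d)

        def forward(self, seq_a, seq_b):
            za, zb = self.encode(seq_a), self.encode(seq_b)
            return self.clf(za * zb)                # symmetric pair combination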
  3. Multi-Task Learning for Document Ranking and Query Suggestion

    Wasi Ahmad, Kai-Wei Chang, and Hongning Wang, in ICLR, 2018.
    Full Text Code
    We propose a multi-task learning framework to jointly learn document ranking and query suggestion for web search. It consists of two major components, a document ranker and a query recommender. The document ranker combines the current query and session information and compares the combined representation with document representations to rank the documents. The query recommender tracks users’ query reformulation sequence, considering all previous in-session queries, using a sequence-to-sequence approach. As both tasks are driven by the users’ underlying search intent, we perform joint learning of these two components through session recurrence, which encodes search context and intent. Extensive comparisons against state-of-the-art document ranking and query suggestion algorithms are performed on the public AOL search log, and the promising results endorse the effectiveness of the joint learning framework.
    @inproceedings{ahmad2018multitask,
      author = {Ahmad, Wasi and Chang, Kai-Wei and Wang, Hongning},
      title = {Multi-Task Learning for Document Ranking and Query Suggestion},
      booktitle = {ICLR},
      year = {2018}
    }
    
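    The joint learning described above boils down to optimizing a shared encoder with two supervised signals. A minimal sketch of such a combined objective (hypothetical tensor shapes and mixing weight, not the paper's exact loss):

    import torch.nn.functional as F

    def joint_loss(rank_scores, rank_labels, sugg_logits, sugg_targets, alpha=0.5):
        # rank_scores: (batch, n_candidates) relevance scores from the ranker;
        # rank_labels: (batch,) index of the clicked document.
        ranking = F.cross_entropy(rank_scores, rank_labels)
        # sugg_logits: (batch, seq_len, vocab) from the seq2seq recommender;
        # sugg_targets: (batch, seq_len) next-query token ids.
        suggestion = F.cross_entropy(sugg_logits.flatten(0, 1), sugg_targets.flatten())
        # both losses backpropagate into the shared session encoder
        return alpha * ranking + (1 - alpha) * suggestion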
  4. Intent-aware Query Obfuscation for Privacy Protection in Personalized Web Search

    Wasi Ahmad, Kai-Wei Chang, and Hongning Wang, in SIGIR, 2018.
    Full Text Code
    Modern web search engines exploit users’ search history to personalize search results, with a goal of improving their service utility on a per-user basis. But it is this very dimension that leads to the risk of privacy infringement and raises serious public concerns. In this work, we propose a client-centered intent-aware query obfuscation solution for protecting user privacy in a personalized web search scenario. In our solution, each user query is submitted with l additional cover queries and corresponding clicks, which act as decoys to mask users’ genuine search intent from a search engine. The cover queries are sequentially sampled from a set of hierarchically organized language models to ensure the coherency of fake search intents in a cover search task. Our approach emphasizes the plausibility of generated cover queries, not only to the current genuine query but also to previous queries in the same task, to increase the complexity for a search engine to identify a user’s true intent. We also develop two new metrics from an information theoretic perspective to evaluate the effectiveness of provided privacy protection. Comprehensive experiment comparisons with state-of-the-art query obfuscation techniques are performed on the public AOL search log, and the propitious results substantiate the effectiveness of our solution.
    @inproceedings{ahmad2018intent,
      author = {Ahmad, Wasi and Chang, Kai-Wei and Wang, Hongning},
      title = {Intent-aware Query Obfuscation for Privacy Protection in Personalized Web Search},
      booktitle = {SIGIR},
      year = {2018}
    }
    
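    The submission protocol in the abstract, one genuine query masked by l coherent cover queries, might look like the following sketch; cover_tasks and its sample_next method are hypothetical stand-ins for the paper's hierarchically organized language models.

    import random

    def submit_with_cover(real_query, cover_tasks, l=3):
        """Send the genuine query hidden among l cover queries, each drawn
        from an ongoing fake search task so every fake intent stays coherent
        over time. The task objects (.history, .sample_next) are hypothetical."""
        batch = [real_query]
        for task in random.sample(cover_tasks, l):
            cover = task.sample_next(task.history)  # condition on the task's own past
            task.history.append(cover)              # keep the fake intent coherent
            batch.append(cover)
        random.shuffle(batch)                       # engine cannot tell which is real
        return batch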
  5. Counterexamples for Robotic Planning Explained in Structured Language

    Lu Feng, Mahsa Ghasemi, Kai-Wei Chang, and Ufuk Topcu, in ICRA, 2018.
    Full Text
    Automated techniques such as model checking have been used to verify models of robotic mission plans based on Markov decision processes (MDPs) and generate counterexamples that may help diagnose requirement violations. However, such artifacts may be too complex for humans to understand, because existing representations of counterexamples typically include a large number of paths or a complex automaton. To help improve the interpretability of counterexamples, we define a notion of explainable counterexample, which includes a set of structured natural language sentences to describe the robotic behavior that leads to a requirement violation in an MDP model of a robotic mission plan. We propose an approach based on mixed-integer linear programming for generating explainable counterexamples that are minimal, sound, and complete. We demonstrate the usefulness of the proposed approach via a case study of warehouse robot planning.
    @inproceedings{feng2018conterexamples,
      author = {Feng, Lu and Ghasemi, Mahsa and Chang, Kai-Wei and Topcu, Ufuk},
      title = {Counterexamples for Robotic Planning Explained in Structured Language},
      booktitle = {ICRA},
      year = {2018}
    }
    
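    The minimal/sound/complete requirement has the flavor of a covering problem. The toy integer program below (PuLP, with made-up sentences and coverage sets) picks the fewest structured sentences that together explain every violating path; the paper's actual MILP is formulated over MDP counterexample paths, so this is only an analogy.

    import pulp

    # Toy data: which violating paths each candidate sentence explains (made up).
    sentences = ["robot enters aisle 3", "robot skips charging", "robot blocks exit"]
    covers = {0: {0, 1}, 1: {1, 2}, 2: {2}}
    paths = {0, 1, 2}

    prob = pulp.LpProblem("explainable_counterexample", pulp.LpMinimize)
    pick = [pulp.LpVariable(f"s{i}", cat="Binary") for i in range(len(sentences))]
    prob += pulp.lpSum(pick)                       # minimality: fewest sentences
    for p in paths:                                # completeness: explain every path
        prob += pulp.lpSum(pick[i] for i in covers if p in covers[i]) >= 1
    prob.solve()
    explanation = [s for i, s in enumerate(sentences) if pick[i].value() == 1]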
  6. Word and sentence embedding tools to measure semantic similarity of Gene Ontology terms by their definitions

    Dat Duong, Wasi Uddin Ahmad, Eleazar Eskin, Kai-Wei Chang, and Jingyi Jessica Li, in Journal of Computational Biology, 2018.
    Full Text Code
    The Gene Ontology (GO) database contains GO terms that describe biological functions of genes. Previous methods for comparing GO terms have relied on the fact that GO terms are organized into a tree structure. Under this paradigm, the locations of two GO terms in the tree dictate their similarity score. In this paper, we introduce two new solutions for this problem, by focusing instead on the definitions of the GO terms. We apply neural-network-based techniques from the natural language processing (NLP) domain. The first method does not rely on the GO tree, whereas the second indirectly depends on the GO tree. In our first approach, we compare two GO definitions by treating them as two unordered sets of words. The word similarity is estimated by a word embedding model that maps words into an N-dimensional space. In our second approach, we account for the word ordering within a sentence. We use a sentence encoder to embed GO definitions into vectors and estimate how likely one definition entails another. We validate our methods in two ways. In the first experiment, we test the model’s ability to differentiate a true protein-protein network from a randomly generated network. In the second experiment, we test the model in identifying orthologs from randomly matched genes in human, mouse, and fly. In both experiments, a hybrid of the NLP-based and GO-tree-based methods achieves the best classification accuracy.
    @article{DAECL18,
      author = {Duong, Dat and Ahmad, Wasi Uddin and Eskin, Eleazar and Chang, Kai-Wei and Li, Jingyi Jessica},
      title = {Word and sentence embedding tools to measure semantic similarity of Gene Ontology terms by their definitions},
      journal = {Journal of Computational Biology},
      year = {2018}
    }
    
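    The first, GO-tree-free method compares definitions as unordered bags of words in embedding space. A simplified sketch follows (NumPy, using a mean-vector comparison rather than the paper's word-pair matching; emb is a hypothetical word-to-vector dict):

    import numpy as np

    def definition_similarity(def_a, def_b, emb):
        # emb: hypothetical dict mapping a word to its N-dimensional vector;
        # words missing from the vocabulary are simply skipped.
        va = np.mean([emb[w] for w in def_a.split() if w in emb], axis=0)
        vb = np.mean([emb[w] for w in def_b.split() if w in emb], axis=0)
        # cosine similarity of the two mean definition vectors
        return float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))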