Unified Pre-training for Program Understanding and Generation

Wasi Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang, in NAACL, 2021.

Top-10 cited paper at NAACL 21


Abstract

Code summarization and generation empower conversion between programming language (PL) and natural language (NL), while code translation avails the migration of legacy code from one PL to another. This paper introduces PLBART, a sequence-to-sequence model capable of performing a broad spectrum of program and language understanding and generation tasks. PLBART is pre-trained on an extensive collection of Java and Python functions and associated NL text via denoising autoencoding. Experiments on code summarization in the English language, code generation, and code translation in seven programming languages show that PLBART outperforms or rivals state-of-the-art models. Moreover, experiments on discriminative tasks, e.g., program repair, clone detection, and vulnerable code detection, demonstrate PLBART’s effectiveness in program understanding. Furthermore, analysis reveals that PLBART learns program syntax, style (e.g., identifier naming convention), logical flow (e.g., if block inside an else block is equivalent to else if block) that are crucial to program semantics and thus excels even with limited annotations.
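
As a rough illustration of the denoising autoencoding objective mentioned in the abstract, the sketch below corrupts a tokenized function by masking some tokens and pairs the corrupted sequence with the original as the reconstruction target. The abstract does not specify the noise function, so simple token masking is assumed here; the `<mask>` symbol, the `mask_tokens` helper, and the 35% ratio are hypothetical choices made for this example, not details taken from the paper.

```python
import random

MASK = "<mask>"  # hypothetical mask symbol, chosen for this sketch


def mask_tokens(tokens, mask_ratio=0.35, seed=0):
    """Build one (noisy_input, target) pair for a denoising objective.

    A fixed fraction of positions is replaced by MASK; the target is the
    untouched original sequence that a seq2seq model would learn to rebuild.
    """
    rng = random.Random(seed)  # seeded so the corruption is reproducible
    n_mask = max(1, round(len(tokens) * mask_ratio)) if tokens else 0
    positions = set(rng.sample(range(len(tokens)), n_mask))
    noisy = [MASK if i in positions else tok for i, tok in enumerate(tokens)]
    return noisy, list(tokens)


# Whitespace tokenization is an assumption; real models use subword vocabularies.
source = "def add ( a , b ) : return a + b".split()
noisy, target = mask_tokens(source)
print("input :", " ".join(noisy))
print("target:", " ".join(target))
```

A sequence-to-sequence model trained on such pairs reads the corrupted input with its encoder and generates the original sequence with its decoder, which is the general shape of denoising pre-training.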



Bib Entry

@inproceedings{ahmad2021unified,
  title = {Unified Pre-training for Program Understanding and Generation},
  author = {Ahmad, Wasi and Chakraborty, Saikat and Ray, Baishakhi and Chang, Kai-Wei},
  booktitle = {NAACL},
  presentation_id = {https://underline.io/events/122/sessions/4197/lecture/20024-unified-pre-training-for-program-understanding-and-generation},
  year = {2021}
}

Related Publications

  1. METAL: A Multi-Agent Framework for Chart Generation with Test-Time Scaling, ACL, 2025
  2. MQT-LLaVA: Matryoshka Query Transformer for Large Vision-Language Models, NeurIPS, 2024
  3. DACO: Towards Application-Driven and Comprehensive Data Analysis via Code Generation, NeurIPS (Datasets and Benchmarks Track), 2024
  4. VDebugger: Harnessing Execution Feedback for Debugging Visual Programs, EMNLP-Finding, 2024
  5. AVATAR: A Parallel Corpus for Java-Python Program Translation, ACL-Finding (short), 2023
  6. Retrieval Augmented Code Generation and Summarization, EMNLP-Finding, 2021