Unified Pre-training for Program Understanding and Generation

Wasi Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang, in NAACL, 2021.

Top-10 cited paper at NAACL 21

Abstract

Code summarization and generation empower conversion between programming language (PL) and natural language (NL), while code translation avails the migration of legacy code from one PL to another. This paper introduces PLBART, a sequence-to-sequence model capable of performing a broad spectrum of program and language understanding and generation tasks. PLBART is pre-trained on an extensive collection of Java and Python functions and associated NL text via denoising autoencoding. Experiments on code summarization in the English language, code generation, and code translation in seven programming languages show that PLBART outperforms or rivals state-of-the-art models. Moreover, experiments on discriminative tasks, e.g., program repair, clone detection, and vulnerable code detection, demonstrate PLBART’s effectiveness in program understanding. Furthermore, analysis reveals that PLBART learns program syntax, style (e.g., identifier naming convention), logical flow (e.g., if block inside an else block is equivalent to else if block) that are crucial to program semantics and thus excels even with limited annotations.
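
To make the denoising autoencoding objective mentioned in the abstract concrete, the following is a minimal sketch of how one training pair could be built: the input is a corrupted token sequence and the target is the original sequence. The abstract does not specify the noise function, so simple random token masking is assumed here; the `<mask>` symbol, the `mask_tokens` helper, and the `mask_prob` value are hypothetical and chosen only for illustration.

```python
import random

MASK = "<mask>"  # hypothetical mask symbol, not taken from the paper


def mask_tokens(tokens, mask_prob=0.35, seed=0):
    """Build one (noisy_input, target) pair for a denoising objective.

    Assumption: noise is plain random token masking. Each token is
    independently replaced by MASK with probability `mask_prob`.
    The target is always the uncorrupted sequence.
    """
    rng = random.Random(seed)  # local RNG so the example is reproducible
    noisy = [MASK if rng.random() < mask_prob else tok for tok in tokens]
    return noisy, list(tokens)


# Usage: corrupt a whitespace-tokenized Python function.
source = "def add ( a , b ) : return a + b".split()
noisy, target = mask_tokens(source)
print("input :", " ".join(noisy))
print("target:", " ".join(target))
```

A sequence-to-sequence model trained on such pairs would read the noisy input and be asked to reproduce the target, which is what lets the same pre-training apply to both code and the associated NL text.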

De-noising pretraining excels for dual modeling of programming language (e.g., source code) + natural language (e.g., code comment). See our new @NAACLHLT paper https://t.co/YrLFIJE1RH. Thanks to awesome collaborations by Wasi Ahmed, Saikat Chakraborty, @kaiwei_chang.
— Baishakhi Ray (@baishakhir) March 13, 2021

Bib Entry

@inproceedings{ahmad2021unified,
  title = {Unified Pre-training for Program Understanding and Generation},
  author = {Ahmad, Wasi and Chakraborty, Saikat and Ray, Baishakhi and Chang, Kai-Wei},
  booktitle = {NAACL},
  presentation_id = {https://underline.io/events/122/sessions/4197/lecture/20024-unified-pre-training-for-program-understanding-and-generation},
  year = {2021}
}

Related Publications

AutoSUIT Bench - Automated Security UnIt Test Benchmark for LLM Coding, ACL-Findings, 2026
METAL: A Multi-Agent Framework for Chart Generation with Test-Time Scaling, ACL, 2025
MQT-LLaVA: Matryoshka Query Transformer for Large Vision-Language Models, NeurIPS, 2024
DACO: Towards Application-Driven and Comprehensive Data Analysis via Code Generation, NeurIPS (Datasets and Benchmarks Track), 2024
VDebugger: Harnessing Execution Feedback for Debugging Visual Programs, EMNLP-Findings, 2024
AVATAR: A Parallel Corpus for Java-Python Program Translation, ACL-Findings (short), 2023
Retrieval Augmented Code Generation and Summarization, EMNLP-Findings, 2021