AVATAR: A Parallel Corpus for Java-Python Program Translation
Wasi Ahmad, Md Golam Rahman Tushar, Saikat Chakraborty, and Kai-Wei Chang, in ACL-Finding (short), 2023.
Abstract
Program translation refers to migrating source code from one programming language to another. It has tremendous practical value in software development, as porting software across different languages is time-consuming and costly. Automating program translation is of paramount importance in software migration, and researchers have recently explored unsupervised approaches due to the unavailability of parallel corpora. However, the availability of pre-trained language models for programming languages enables supervised fine-tuning with a small number of labeled examples. In this work, we present a corpus of 8,475 programming problems and their solutions written in two popular languages, Java and Python. We collect the dataset from competitive programming sites, online platforms, and open source repositories. We present several baselines, including models trained from scratch or pre-trained on a large-scale source code collection and fine-tuned on our proposed dataset. Experimental results show that while the models perform relatively well in terms of lexical match, they fall short in generating code that is accurate in terms of syntax and data-flow match.
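The gap between lexical match and syntactic accuracy noted in the abstract can be illustrated with a small sketch. This is not the paper's evaluation code: the function names `unigram_precision` and `parses`, the regex tokenizer, and the toy reference/candidate pair are all assumptions made for illustration, and clipped unigram precision is only a simplified stand-in for a lexical-match score.

```python
import ast
import re
from collections import Counter

# Hypothetical, simplified tokenizer: identifiers, integers, single punctuation marks.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|[^\s\w]")


def tokens(source: str) -> list[str]:
    """Split source text into coarse lexical tokens."""
    return TOKEN.findall(source)


def unigram_precision(candidate: str, reference: str) -> float:
    """Fraction of candidate tokens that also occur in the reference (clipped counts)."""
    cand = Counter(tokens(candidate))
    ref = Counter(tokens(reference))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[tok]) for tok, count in cand.items())
    return matched / total


def parses(source: str) -> bool:
    """Return True if the text is syntactically valid Python."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


# Toy data (illustrative only): the candidate differs by one stray parenthesis.
reference = "def add(a, b):\n    return a + b\n"
candidate = "def add(a, b):\n    return a + b)\n"

if __name__ == "__main__":
    print(f"lexical overlap: {unigram_precision(candidate, reference):.3f}")
    print(f"candidate parses: {parses(candidate)}")
```

In this toy case the candidate shares 12 of its 13 tokens with the reference yet fails to parse, which is the kind of mismatch a purely lexical score does not penalize much.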
Bib Entry
@inproceedings{ahmad2021avatar,
title = {AVATAR: A Parallel Corpus for Java-Python Program Translation},
author = {Ahmad, Wasi and Tushar, Md Golam Rahman and Chakraborty, Saikat and Chang, Kai-Wei},
booktitle = {ACL-Finding (short)},
year = {2023}
}
Related Publications
- METAL: A Multi-Agent Framework for Chart Generation with Test-Time Scaling, ACL, 2025
- MQT-LLaVA: Matryoshka Query Transformer for Large Vision-Language Models, NeurIPS, 2024
- DACO: Towards Application-Driven and Comprehensive Data Analysis via Code Generation, NeurIPS (Datasets and Benchmarks Track), 2024
- VDebugger: Harnessing Execution Feedback for Debugging Visual Programs, EMNLP-Finding, 2024
- Retrieval Augmented Code Generation and Summarization, EMNLP-Finding, 2021
- Unified Pre-training for Program Understanding and Generation, NAACL, 2021