layout: page
title: Awards
permalink: /awards/
weight: 1.5
show: 1

Award Papers

  1. OpenThoughts: Data Recipes for Reasoning Models

    Etash Kumar Guha, Ryan Marten, Sedrick Keh, Negin Raoof, Georgios Smyrnis, Hritik Bansal, Marianna Nezhurina, Jean Mercat, Trung Vu, Zayne Rea Sprague, Ashima Suvarna, Benjamin Feuer, Leon Liangyu Chen, Zaid Khan, Eric Frankel, and others, in ICLR, 2026.
    Reasoning models have made rapid progress on many benchmarks involving math, code, and science. Yet, there are still many open questions about the best training recipes for reasoning since state-of-the-art models often rely on proprietary datasets with little to no public information available. To address this, the goal of the OpenThoughts project is to create open-source datasets for training reasoning models. After initial explorations, our OpenThoughts2-1M dataset led to OpenThinker2-32B, the first model trained on public reasoning data to match DeepSeek-R1-Distill-32B on standard reasoning benchmarks such as AIME and LiveCodeBench. We then improve our dataset further by systematically investigating each step of our data generation pipeline with 1,000+ controlled experiments, which led to OpenThoughts3. Scaling the pipeline to 1.2M examples and using QwQ-32B as teacher yields our OpenThinker3-7B model, which achieves state-of-the-art results: 53% on AIME 2025, 51% on LiveCodeBench 06/24-01/25, and 54% on GPQA Diamond, improvements of 15.3, 17.2, and 20.5 percentage points compared to the DeepSeek-R1-Distill-Qwen-7B.
    @inproceedings{guha2026openthoughts,
      title = {OpenThoughts: Data Recipes for Reasoning Models},
      author = {Guha, Etash Kumar and Marten, Ryan and Keh, Sedrick and Raoof, Negin and Smyrnis, Georgios and Bansal, Hritik and Nezhurina, Marianna and Mercat, Jean and Vu, Trung and Sprague, Zayne Rea and Suvarna, Ashima and Feuer, Benjamin and Chen, Leon Liangyu and Khan, Zaid and Frankel, Eric and others},
      booktitle = {ICLR},
      year = {2026}
    }
    
  2. MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

    Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao, in ICLR, 2024.
    Although Large Language Models (LLMs) and Large Multimodal Models (LMMs) exhibit impressive skills in various domains, their ability for mathematical reasoning within visual contexts has not been formally examined. Equipping LLMs and LMMs with this capability is vital for general-purpose AI assistants and showcases promising potential in education, data analysis, and scientific discovery. To bridge this gap, we present MathVista, a benchmark designed to amalgamate challenges from diverse mathematical and visual tasks. We first taxonomize the key task types, reasoning skills, and visual contexts from the literature to guide our selection from 28 existing math-focused and visual question answering datasets. Then, we construct three new datasets, IQTest, FunctionQA, and PaperQA, to accommodate for missing types of visual contexts. The problems featured often require deep visual understanding beyond OCR or image captioning, and compositional reasoning with rich domain-specific tools, thus posing a notable challenge to existing models. We conduct a comprehensive evaluation of 11 prominent open-source and proprietary foundation models (LLMs, LLMs augmented with tools, and LMMs). The best-performing model, Multimodal Bard, achieves only 58% of human performance (34.8% vs 60.3%), indicating ample room for further improvement. Given this significant gap, MathVista fuels future research in the development of general-purpose AI agents capable of tackling mathematically intensive and visually rich real-world tasks.
    @inproceedings{lu2024mathvista,
      title = {MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts},
      author = {Lu, Pan and Bansal, Hritik and Xia, Tony and Liu, Jiacheng and Li, Chunyuan and Hajishirzi, Hannaneh and Cheng, Hao and Chang, Kai-Wei and Galley, Michel and Gao, Jianfeng},
      booktitle = {ICLR},
      year = {2024}
    }
    
  3. ParaAMR: A Large-Scale Syntactically Diverse Paraphrase Dataset by AMR Back-Translation

    Kuan-Hao Huang, Varun Iyer, I-Hung Hsu, Anoop Kumar, Kai-Wei Chang, and Aram Galstyan, in ACL, 2023.
    Paraphrase generation is a long-standing task in natural language processing (NLP). Supervised paraphrase generation models, which rely on human-annotated paraphrase pairs, are cost-inefficient and hard to scale up. On the other hand, automatically annotated paraphrase pairs (e.g., by machine back-translation), usually suffer from the lack of syntactic diversity – the generated paraphrase sentences are very similar to the source sentences in terms of syntax. In this work, we present ParaAMR, a large-scale syntactically diverse paraphrase dataset created by abstract meaning representation back-translation. Our quantitative analysis, qualitative examples, and human evaluation demonstrate that the paraphrases of ParaAMR are syntactically more diverse compared to existing large-scale paraphrase datasets while preserving good semantic similarity. In addition, we show that ParaAMR can be used to improve on three NLP tasks: learning sentence embeddings, syntactically controlled paraphrase generation, and data augmentation for few-shot learning. Our results thus showcase the potential of ParaAMR for improving various NLP applications.
    @inproceedings{huang2023paraarm,
      author = {Huang, Kuan-Hao and Iyer, Varun and Hsu, I-Hung and Kumar, Anoop and Chang, Kai-Wei and Galstyan, Aram},
      title = {ParaAMR: A Large-Scale Syntactically Diverse Paraphrase Dataset by AMR Back-Translation},
      booktitle = {ACL},
      presentation_id = {https://underline.io/events/395/posters/15227/poster/76600-paraamr-a-large-scale-syntactically-diverse-paraphrase-dataset-by-amr-back-translation},
      year = {2023}
    }
    
  4. The Tail Wagging the Dog: Dataset Construction Biases of Social Bias Benchmarks

    Nikil Roashan Selvam, Sunipa Dev, Daniel Khashabi, Tushar Khot, and Kai-Wei Chang, in ACL (short), 2023.
    How reliably can we trust the scores obtained from social bias benchmarks as faithful indicators of problematic social biases in a given language model? In this work, we study this question by contrasting social biases with non-social biases stemming from choices made during dataset construction that might not even be discernible to the human eye. To do so, we empirically simulate various alternative constructions for a given benchmark based on innocuous modifications (such as paraphrasing or random-sampling) that maintain the essence of their social bias. On two well-known social bias benchmarks (Winogender and BiasNLI) we observe that these shallow modifications have a surprising effect on the resulting degree of bias across various models. We hope these troubling observations motivate more robust measures of social biases.
    @inproceedings{roashan2023tail,
      author = {Selvam, Nikil Roashan and Dev, Sunipa and Khashabi, Daniel and Khot, Tushar and Chang, Kai-Wei},
      title = {The Tail Wagging the Dog: Dataset Construction Biases of Social Bias Benchmarks},
      presentation_id = {https://underline.io/events/395/posters/15337/poster/76963-the-tail-wagging-the-dog-dataset-construction-biases-of-social-bias-benchmarks},
      booktitle = {ACL (short)},
      year = {2023}
    }
    
  5. Controllable Text Generation with Neurally-Decomposed Oracle

    Tao Meng, Sidi Lu, Nanyun Peng, and Kai-Wei Chang, in NeurIPS, 2022.
    We propose a general and efficient framework to control auto-regressive generation models with NeurAlly-Decomposed Oracle (NADO). Given a pre-trained base language model and a sequence-level boolean oracle function, we propose to decompose the oracle function into token-level guidance to steer the base model in text generation. Specifically, the token-level guidance is approximated by a neural model trained with examples sampled from the base model, demanding no additional auxiliary labeled data. We present the closed-form optimal solution to incorporate the token-level guidance into the base model for controllable generation. We further provide a theoretical analysis of how the approximation quality of NADO affects the controllable generation results. Experiments conducted on two applications: (1) text generation with lexical constraints and (2) machine translation with formality control demonstrate that our framework efficiently guides the base model towards the given oracle while maintaining high generation quality.
    @inproceedings{meng2022controllable,
      title = {Controllable Text Generation with Neurally-Decomposed Oracle},
      author = {Meng, Tao and Lu, Sidi and Peng, Nanyun and Chang, Kai-Wei},
      booktitle = {NeurIPS},
      year = {2022}
    }
    
  6. Grounded Language-Image Pre-training

    Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, Kai-Wei Chang, and Jianfeng Gao, in CVPR, 2022.
    This paper presents a grounded language-image pre-training (GLIP) model for learning object-level, language-aware, and semantic-rich visual representations. GLIP unifies object detection and phrase grounding for pre-training. The unification brings two benefits: 1) it allows GLIP to learn from both detection and grounding data to improve both tasks and bootstrap a good grounding model; 2) GLIP can leverage massive image-text pairs by generating grounding boxes in a self-training fashion, making the learned representation semantic-rich. In our experiments, we pre-train GLIP on 27M grounding data, including 3M human-annotated and 24M web-crawled image-text pairs. The learned representations demonstrate strong zero-shot and few-shot transferability to various object-level recognition tasks. 1) When directly evaluated on COCO and LVIS (without seeing any images in COCO during pre-training), GLIP achieves 49.8 AP and 26.9 AP, respectively, surpassing many supervised baselines. 2) After fine-tuning on COCO, GLIP achieves 60.8 AP on val and 61.5 AP on test-dev, surpassing prior SoTA. 3) When transferred to 13 downstream object detection tasks, a 1-shot GLIP rivals a fully-supervised Dynamic Head.
    @inproceedings{li2022grounded,
      title = {Grounded Language-Image Pre-training},
      author = {Li, Liunian Harold and Zhang, Pengchuan and Zhang, Haotian and Yang, Jianwei and Li, Chunyuan and Zhong, Yiwu and Wang, Lijuan and Yuan, Lu and Zhang, Lei and Hwang, Jenq-Neng and Chang, Kai-Wei and Gao, Jianfeng},
      booktitle = {CVPR},
      year = {2022}
    }
    
  7. Men Also Like Shopping: Reducing Gender Bias Amplification using Corpus-level Constraints

    Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang, in EMNLP, 2017.
    Language is increasingly being used to define rich visual recognition problems with supporting image collections sourced from the web. Structured prediction models are used in these tasks to take advantage of correlations between co-occurring labels and visual input but risk inadvertently encoding social biases found in web corpora.
    In this work, we study data and models associated with multilabel object classification and visual semantic role labeling. We find that (a) datasets for these tasks contain significant gender bias and (b) models trained on these datasets further amplify existing bias. For example, the activity cooking is over 33% more likely to involve females than males in a training set, but a trained model amplifies the disparity to 68% at test time. We propose to inject corpus-level constraints for calibrating existing structured prediction models and design an algorithm based on Lagrangian relaxation for the resulting inference problems. Our method results in no performance loss for the underlying recognition task but decreases the magnitude of bias amplification by 33.3% and 44.9% for multilabel classification and visual semantic role labeling, respectively.
    @inproceedings{zhao2017men,
      author = {Zhao, Jieyu and Wang, Tianlu and Yatskar, Mark and Ordonez, Vicente and Chang, Kai-Wei},
      title = {Men Also Like Shopping: Reducing Gender Bias Amplification using Corpus-level Constraints},
      booktitle = {EMNLP},
      year = {2017}
    }
    
  8. Large Linear Classification When Data Cannot Fit In Memory

    Hsiang-Fu Yu, Cho-Jui Hsieh, Kai-Wei Chang, and Chih-Jen Lin, in TKDD, 2012.
    Recent advances in linear classification have shown that for applications such as document classification, the training can be extremely efficient. However, most of the existing training methods are designed by assuming that data can be stored in the computer memory. These methods cannot be easily applied to data larger than the memory capacity due to the random access to the disk. We propose and analyze a block minimization framework for data larger than the memory size. At each step a block of data is loaded from the disk and handled by certain learning methods. We investigate two implementations of the proposed framework for primal and dual SVMs, respectively. As data cannot fit in memory, many design considerations are very different from those for traditional algorithms. Experiments using data sets 20 times larger than the memory demonstrate the effectiveness of the proposed method.
    @inproceedings{yu2010large,
      author = {Yu, Hsiang-Fu and Hsieh, Cho-Jui and Chang, Kai-Wei and Lin, Chih-Jen},
      title = {Large Linear Classification When Data Cannot Fit In Memory},
      booktitle = {TKDD},
      year = {2012}
    }
    

Other Award and Top-Cited Papers

Top-cited papers are determined according to Paper Digest.

Browse all award and highlighted papers