HoneyBee: Data Recipes for Vision-Language Reasoners

Hritik Bansal, Devendra Singh Sachan, Kai-Wei Chang, Aditya Grover, Gargi Ghosh, Wen-tau Yih, and Ramakanth Pasunuru, in CVPR, 2026.

Abstract

Recent advances in vision-language models (VLMs) have made them highly effective at reasoning tasks. However, the principles underlying the construction of performant VL reasoning training datasets remain poorly understood. In this work, we introduce several data curation approaches and study their impacts on VL reasoning capabilities by carefully controlling training and evaluation setups. We analyze the effects of context (image and question pair) sources, implement targeted data interventions, and explore scaling up images, questions, and chain-of-thought (CoT) solutions. Our findings reveal that (a) context source strategies significantly affect VLM performance, (b) interventions such as auxiliary signals from image captions and the inclusion of text-only reasoning yield substantial gains, and (c) scaling all data dimensions (e.g., unique questions per image and unique CoTs per image-question pair) consistently improves reasoning capability. Motivated by these insights, we introduce HoneyBee, a large-scale, high-quality CoT reasoning dataset with 2.5M examples consisting of 350K image-question pairs. VLMs trained with HoneyBee outperform state-of-the-art models across model sizes. For instance, a HoneyBee-trained VLM with 3B parameters outperforms the SOTA model and the base model by 7.8% and 24.8%, respectively, on MathVerse. Furthermore, we propose a test-time scaling strategy that reduces decoding cost by 73% without sacrificing accuracy. Overall, this work presents improved strategies for VL reasoning dataset curation research. Data is available at https://huggingface.co/datasets/facebook/HoneyBee.

Bib Entry

@inproceedings{bansal2026honeybee,
  title = {HoneyBee: Data Recipes for Vision-Language Reasoners},
  author = {Bansal, Hritik and Sachan, Devendra Singh and Chang, Kai-Wei and Grover, Aditya and Ghosh, Gargi and Yih, Wen-tau and Pasunuru, Ramakanth},
  booktitle = {CVPR},
  year = {2026}
}

Related Publications

VisRet: Visualization Improves Knowledge-Intensive Text-to-Image Retrieval, ACL, 2026
MotionEdit: Benchmarking and Learning Motion-Centric Image Editing, CVPR, 2026
LaViDa: A Large Diffusion Language Model for Multimodal Understanding, NeurIPS, 2025
PARTONOMY: Large Multimodal Models with Part-Level Visual Understanding, NeurIPS, 2025
SlowFast-VGen: Slow-Fast Learning for Action-Driven Long Video Generation, ICLR, 2025
Verbalized Representation Learning for Interpretable Few-Shot Generalization, ICCV, 2025
STIV: Scalable Text and Image Conditioned Video Generation, ICCV, 2025
Contrastive Visual Data Augmentation, ICML, 2025
SYNTHIA: Novel Concept Design with Affordance Composition, ACL, 2025
Towards a holistic framework for multimodal LLM in 3D brain CT radiology report generation, Nature Communications, 2025
Enhancing Large Vision Language Models with Self-Training on Image Comprehension, NeurIPS, 2024
CoBIT: A Contrastive Bi-directional Image-Text Generation Model, ICLR, 2024
DesCo: Learning Object Recognition with Rich Language Descriptions, NeurIPS, 2023
Text Encoders are Performance Bottlenecks in Contrastive Vision-Language Models, EMNLP, 2023
What's 'up' with vision-language models? Investigating their struggle to understand spatial relations, EMNLP, 2023
MetaVL: Transferring In-Context Learning Ability From Language Models to Vision-Language Models, ACL (short), 2023
REVEAL: Retrieval-Augmented Visual-Language Pre-Training with Multi-Source Multimodal Knowledge, CVPR, 2023
Grounded Language-Image Pre-training, CVPR, 2022
How Much Can CLIP Benefit Vision-and-Language Tasks?, ICLR, 2022