HoneyBee: Data Recipes for Vision-Language Reasoners

Hritik Bansal, Devendra Singh Sachan, Kai-Wei Chang, Aditya Grover, Gargi Ghosh, Wen-tau Yih, and Ramakanth Pasunuru, in CVPR, 2026.

Code | LinkedIn Post

Download the full text


Abstract

Recent advances in vision-language models (VLMs) have made them highly effective at reasoning tasks. However, the principles underlying the construction of performant VL reasoning training datasets remain poorly understood. In this work, we introduce several data curation approaches and study their impacts on VL reasoning capabilities by carefully controlling training and evaluation setups. We analyze the effects of context (image and question pair) sources, implement targeted data interventions, and explore scaling up images, questions, and chain-of-thought (CoT) solutions. Our findings reveal that (a) context source strategies significantly affect VLM performance, (b) interventions such as auxiliary signals from image captions and the inclusion of text-only reasoning yield substantial gains, and (c) scaling all data dimensions (e.g., unique questions per image and unique CoTs per image-question pair) consistently improves reasoning capability. Motivated by these insights, we introduce HoneyBee, a large-scale, high-quality CoT reasoning dataset with 2.5M examples spanning 350K image-question pairs. VLMs trained with HoneyBee outperform state-of-the-art models across model sizes. For instance, a HoneyBee-trained VLM with 3B parameters outperforms the SOTA model and the base model by 7.8% and 24.8%, respectively, on MathVerse. Furthermore, we propose a test-time scaling strategy that reduces decoding cost by 73% without sacrificing accuracy. Overall, this work presents improved strategies for VL reasoning dataset curation research. Data is available at https://huggingface.co/datasets/facebook/HoneyBee.
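
The released data can be pulled directly from the Hugging Face Hub. Below is a minimal loading sketch using the `datasets` library; the dataset id comes from the link in the abstract, while the split name and field names are assumptions for illustration and may differ from the actual dataset card.

from datasets import load_dataset

# Dataset id taken from the link in the abstract; the "train" split and the
# field names printed below are assumptions, not documented on this page.
honeybee = load_dataset("facebook/HoneyBee", split="train")

example = honeybee[0]
print(example.keys())           # inspect which fields each CoT example carries
print(example.get("question"))  # hypothetical field name used for illustration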



Bib Entry

@inproceedings{bansal2026honeybee,
  title = {HoneyBee: Data Recipes for Vision-Language Reasoners},
  author = {Bansal, Hritik and Sachan, Devendra Singh and Chang, Kai-Wei and Grover, Aditya and Ghosh, Gargi and Yih, Wen-tau and Pasunuru, Ramakanth},
  booktitle = {CVPR},
  year = {2026}
}

Related Publications

  1. VisRet: Visualization Improves Knowledge-Intensive Text-to-Image Retrieval, ACL, 2026
  2. MotionEdit: Benchmarking and Learning Motion-Centric Image Editing, CVPR, 2026
  3. LaViDa: A Large Diffusion Language Model for Multimodal Understanding, NeurIPS, 2025
  4. PARTONOMY: Large Multimodal Models with Part-Level Visual Understanding, NeurIPS, 2025
  5. SlowFast-VGen: Slow-Fast Learning for Action-Driven Long Video Generation, ICLR, 2025
  6. Verbalized Representation Learning for Interpretable Few-Shot Generalization, ICCV, 2025
  7. STIV: Scalable Text and Image Conditioned Video Generation, ICCV, 2025
  8. Contrastive Visual Data Augmentation, ICML, 2025
  9. SYNTHIA: Novel Concept Design with Affordance Composition, ACL, 2025
  10. Towards a holistic framework for multimodal LLM in 3D brain CT radiology report generation, Nature Communications, 2025
  11. Enhancing Large Vision Language Models with Self-Training on Image Comprehension, NeurIPS, 2024
  12. CoBIT: A Contrastive Bi-directional Image-Text Generation Model, ICLR, 2024
  13. DesCo: Learning Object Recognition with Rich Language Descriptions, NeurIPS, 2023
  14. Text Encoders are Performance Bottlenecks in Contrastive Vision-Language Models, EMNLP, 2023
  15. "What's 'up' with vision-language models? Investigating their struggle to understand spatial relations.", EMNLP, 2023
  16. MetaVL: Transferring In-Context Learning Ability From Language Models to Vision-Language Models, ACL (short), 2023
  17. REVEAL: Retrieval-Augmented Visual-Language Pre-Training with Multi-Source Multimodal Knowledge, CVPR, 2023
  18. Grounded Language-Image Pre-training, CVPR, 2022
  19. How Much Can CLIP Benefit Vision-and-Language Tasks?, ICLR, 2022