Contrastive Visual Data Augmentation

Yu Zhou, Bingxuan Li, Tang Mohan, Xiaomeng Jin, Te-Lin Wu, Kuan-Hao Huang, Heng Ji, Kai-Wei Chang, and Nanyun Peng, in ICML, 2025.


Abstract

Large multimodal models (LMMs) often struggle to recognize novel concepts, as they rely on pre-trained knowledge and have limited ability to capture subtle visual details. Domain-specific knowledge gaps in training also make them prone to confusing visually similar, commonly misrepresented, or low-resource concepts. To help LMMs better align nuanced visual features with language, improving their ability to recognize and reason about novel or rare concepts, we propose a Contrastive visual Data Augmentation (CoDA) strategy. CoDA extracts key contrastive textual and visual features of target concepts against the known concepts they are misrecognized as, and then uses multimodal generative models to produce targeted synthetic data. Automatic filtering of extracted features and augmented images is implemented to guarantee their quality, as verified by human annotators. We show the effectiveness and efficiency of CoDA on low-resource concept and diverse scene recognition datasets including iNaturalist and SUN. We additionally collect NovelSpecies, a benchmark dataset consisting of newly discovered animal species that are guaranteed to be unseen by LMMs. LLaVA-1.6 1-shot updating results on these three datasets show that CoDA significantly improves over SOTA visual data augmentation strategies, with absolute accuracy gains of 12.3% (NovelSpecies), 5.1% (SUN), and 6.0% (iNaturalist).
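
The three stages named in the abstract (contrastive feature extraction, automatic filtering, targeted generation) can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the string-matching notion of "contrastive", the scoring callable, the threshold, and the prompt template are all assumptions made for illustration, and the call to an actual generative model is left out.

```python
from typing import Callable, List


def contrastive_features(target: List[str], confused: List[str]) -> List[str]:
    """Keep attributes of the target concept that the confused concept lacks.

    Hypothetical simplification: features are plain strings compared
    case-insensitively.
    """
    confused_set = {f.lower() for f in confused}
    return [f for f in target if f.lower() not in confused_set]


def filter_features(
    features: List[str],
    score: Callable[[str], float],
    threshold: float = 0.5,
) -> List[str]:
    """Drop features whose (assumed) quality score falls below the threshold."""
    return [f for f in features if score(f) >= threshold]


def build_prompt(concept: str, features: List[str]) -> str:
    """Turn the surviving features into a text prompt for an image generator."""
    if not features:
        return f"A photo of a {concept}"
    return f"A photo of a {concept} with " + ", ".join(features)


# Toy usage: the target concept is confused with a known look-alike.
target_feats = ["red crest", "long tail", "brown wings"]
confused_feats = ["Long tail", "brown wings", "white belly"]

kept = contrastive_features(target_feats, confused_feats)
kept = filter_features(kept, score=lambda f: 0.9)  # stand-in scorer (assumption)
prompt = build_prompt("novel bird species", kept)
print(prompt)
```

In a fuller pipeline, the prompt would be passed to a text-to-image model and the resulting images would go through a second filtering pass before being used to update the LMM.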

Bib Entry

@inproceedings{zhou2025contrastive,
  title = {Contrastive Visual Data Augmentation},
  author = {Zhou, Yu and Li, Bingxuan and Mohan, Tang and Jin, Xiaomeng and Wu, Te-Lin and Huang, Kuan-Hao and Ji, Heng and Chang, Kai-Wei and Peng, Nanyun},
  booktitle = {ICML},
  year = {2025}
}

Related Publications

VisRet: Visualization Improves Knowledge-Intensive Text-to-Image Retrieval, ACL, 2026
HoneyBee: Data Recipes for Vision-Language Reasoners, CVPR, 2026
MotionEdit: Benchmarking and Learning Motion-Centric Image Editing, CVPR, 2026
LaViDa: A Large Diffusion Language Model for Multimodal Understanding, NeurIPS, 2025
PARTONOMY: Large Multimodal Models with Part-Level Visual Understanding, NeurIPS, 2025
STIV: Scalable Text and Image Conditioned Video Generation, ICCV, 2025
Verbalized Representation Learning for Interpretable Few-Shot Generalization, ICCV, 2025
SYNTHIA: Novel Concept Design with Affordance Composition, ACL, 2025
SlowFast-VGen: Slow-Fast Learning for Action-Driven Long Video Generation, ICLR, 2025
Towards a holistic framework for multimodal LLM in 3D brain CT radiology report generation, Nature Communications, 2025
Enhancing Large Vision Language Models with Self-Training on Image Comprehension, NeurIPS, 2024
CoBIT: A Contrastive Bi-directional Image-Text Generation Model, ICLR, 2024
DesCo: Learning Object Recognition with Rich Language Descriptions, NeurIPS, 2023
What's "up" with vision-language models? Investigating their struggle to understand spatial relations, EMNLP, 2023
Text Encoders are Performance Bottlenecks in Contrastive Vision-Language Models, EMNLP, 2023
MetaVL: Transferring In-Context Learning Ability From Language Models to Vision-Language Models, ACL (short), 2023
REVEAL: Retrieval-Augmented Visual-Language Pre-Training with Multi-Source Multimodal Knowledge, CVPR, 2023
Grounded Language-Image Pre-training, CVPR, 2022
How Much Can CLIP Benefit Vision-and-Language Tasks?, ICLR, 2022