Verbalized Representation Learning for Interpretable Few-Shot Generalization

Cheng-Fu Yang, Da Yin, Wenbo Hu, Heng Ji, Nanyun Peng, Bolei Zhou, and Kai-Wei Chang, in ICCV, 2025.

Code

Download the full text


Abstract

Humans recognize objects after observing only a few examples, a remarkable capability enabled by their inherent language understanding of the real-world environment. Developing verbalized and interpretable representations can significantly improve model generalization in low-data settings. In this work, we propose Verbalized Representation Learning (VRL), a novel approach for automatically extracting human-interpretable features for object recognition using few-shot data. Our method uniquely captures inter-class differences and intra-class commonalities in the form of natural language by employing a Vision-Language Model (VLM) to identify key discriminative features between different classes and shared characteristics within the same class. These verbalized features are then mapped to numeric vectors through the VLM, and the resulting feature vectors can be used to train downstream classifiers and to perform inference. Experimental results show that, at the same model scale, VRL achieves a 24% absolute improvement over prior state-of-the-art methods while using 95% less data and a smaller model. Furthermore, compared to human-labeled attributes, the features learned by VRL exhibit a 20% absolute gain when used for downstream classification tasks.
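
To make the pipeline described above concrete, the sketch below walks through the three stages: a VLM is prompted to verbalize inter-class differences and intra-class commonalities from a few support images, each image is then scored against those verbalized features to produce a numeric vector, and a lightweight classifier is trained on the resulting vectors. This is a minimal illustration under stated assumptions, not the paper's implementation: the query_vlm helper, the prompts, and the yes/no scoring are hypothetical stand-ins for whatever VLM interface and feature-to-vector mapping the paper actually uses.

# Hypothetical sketch of the VRL flow; query_vlm is a placeholder, not the paper's API.
from dataclasses import dataclass
from typing import List
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

@dataclass
class Example:
    image_path: str
    label: str

def query_vlm(prompt: str, image_paths: List[str]) -> str:
    """Placeholder for a Vision-Language Model call; swap in a real VLM here."""
    raise NotImplementedError

def verbalize_features(support: List[Example]) -> List[str]:
    """Ask the VLM for natural-language features that discriminate between
    classes and that images within the same class share."""
    labels = sorted({ex.label for ex in support})
    prompt = (
        f"These images belong to the classes {labels}. "
        "List short visual features that distinguish the classes from one another "
        "and features that images of the same class have in common."
    )
    response = query_vlm(prompt, [ex.image_path for ex in support])
    return [line.strip("- ").strip() for line in response.splitlines() if line.strip()]

def embed(image_path: str, features: List[str]) -> np.ndarray:
    """Map one image to a numeric vector with one entry per verbalized feature,
    here via a simple yes/no presence check answered by the VLM."""
    scores = []
    for feat in features:
        answer = query_vlm(f"Does this image show: {feat}? Answer yes or no.", [image_path])
        scores.append(1.0 if answer.strip().lower().startswith("yes") else 0.0)
    return np.array(scores)

def fit_classifier(support: List[Example], features: List[str]) -> KNeighborsClassifier:
    """Train a lightweight downstream classifier on the feature vectors."""
    X = np.stack([embed(ex.image_path, features) for ex in support])
    y = [ex.label for ex in support]
    clf = KNeighborsClassifier(n_neighbors=1)
    clf.fit(X, y)
    return clf

The binary presence scores and the nearest-neighbor classifier are deliberately simple choices for illustration; any downstream classifier can be trained on the feature vectors once the verbalized features have been mapped to numbers.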


Bib Entry

@inproceedings{yang2025verbalized,
  title = {Verbalized Representation Learning for Interpretable Few-Shot Generalization},
  author = {Yang, Cheng-Fu and Yin, Da and Hu, Wenbo and Ji, Heng and Peng, Nanyun and Zhou, Bolei and Chang, Kai-Wei},
  booktitle = {ICCV},
  year = {2025}
}

Related Publications

  1. HoneyBee: Data Recipes for Vision-Language Reasoners, CVPR, 2026
  2. MotionEdit: Benchmarking and Learning Motion-Centric Image Editing, CVPR, 2026
  3. LaViDa: A Large Diffusion Language Model for Multimodal Understanding, NeurIPS, 2025
  4. PARTONOMY: Large Multimodal Models with Part-Level Visual Understanding, NeurIPS, 2025
  5. STIV: Scalable Text and Image Conditioned Video Generation, ICCV, 2025
  6. Contrastive Visual Data Augmentation, ICML, 2025
  7. SYNTHIA: Novel Concept Design with Affordance Composition, ACL, 2025
  8. SlowFast-VGen: Slow-Fast Learning for Action-Driven Long Video Generation, ICLR, 2025
  9. Towards a holistic framework for multimodal LLM in 3D brain CT radiology report generation, Nature Communications, 2025
  10. Enhancing Large Vision Language Models with Self-Training on Image Comprehension, NeurIPS, 2024
  11. CoBIT: A Contrastive Bi-directional Image-Text Generation Model, ICLR, 2024
  12. DesCo: Learning Object Recognition with Rich Language Descriptions, NeurIPS, 2023
  13. "What's 'up' with vision-language models? Investigating their struggle to understand spatial relations.", EMNLP, 2023
  14. Text Encoders are Performance Bottlenecks in Contrastive Vision-Language Models, EMNLP, 2023
  15. MetaVL: Transferring In-Context Learning Ability From Language Models to Vision-Language Models, ACL (short), 2023
  16. REVEAL: Retrieval-Augmented Visual-Language Pre-Training with Multi-Source Multimodal Knowledge, CVPR, 2023
  17. Grounded Language-Image Pre-training, CVPR, 2022
  18. How Much Can CLIP Benefit Vision-and-Language Tasks?, ICLR, 2022