"What’s ’up’ with vision-language models? Investigating their struggle to understand spatial relations."

Amita Kamath, Jack Hessel, and Kai-Wei Chang, in EMNLP, 2023.


Abstract

Recent vision-language (VL) models have reached human parity on VQAv2, but does that mean they can distinguish "left" from "right"? We curate three new corpora to precisely quantify how well models comprehend basic spatial relations: COCO-prep from COCO, GQA-prep from GQA, and RealCLEVR from images we capture ourselves with even tighter controls. Compared to prior evaluations, which conflate several types of reasoning, our three tests offer precise evaluations of spatial relations. For example, our RealCLEVR benchmark is controlled so that only the preposition changes between images within a set (e.g., a mug on/under/left of/right of a table). This enables us to evaluate model performance on pairs or sets of prepositions. We evaluate 18 VL models and find that all fall far behind human performance (even models such as BLIP2 that surpass human performance on VQAv2); most achieve only a few points above random chance across all benchmarks. We then study the LAION-2B dataset, which was used to train OpenCLIP models, to investigate whether pre-training data can provide clues as to why spatial relation understanding doesn’t emerge. We find that prepositions are infrequent and often ambiguous in LAION-2B. Based on this corpus analysis, we investigate a few training strategies to address this shortcoming. While up-weighting preposition-containing instances and fine-tuning on IID data improve accuracy slightly, our three spatial relation benchmarks remain challenging for all VL models we test. We will release code and data.
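
The evaluation on pairs or sets of prepositions described above can be illustrated with a minimal sketch. It is not the authors' released code: the score-matrix layout, the function names, and the example numbers are all assumptions made for illustration. It assumes that, for each controlled set, a model yields an image-text matching score for every image/caption combination, and that caption i is the correct caption for image i.

```python
from typing import Sequence

# Assumed layout (hypothetical): scores[i][j] is the model's matching score
# for image i and caption j within one controlled set; caption i is correct
# for image i.
Scores = Sequence[Sequence[float]]


def per_image_accuracy(scores: Scores) -> float:
    """Fraction of images whose highest-scoring caption is the correct one."""
    correct = 0
    for i, row in enumerate(scores):
        # Index of the best-scoring caption (first index wins on ties).
        best = max(range(len(row)), key=lambda j: row[j])
        correct += int(best == i)
    return correct / len(scores)


def set_accuracy(all_sets: Sequence[Scores]) -> float:
    """Fraction of sets in which every image is matched to its own caption."""
    solved = sum(1 for s in all_sets if per_image_accuracy(s) == 1.0)
    return solved / len(all_sets)


def random_chance(num_captions: int) -> float:
    """Expected per-image accuracy when captions are picked uniformly."""
    return 1.0 / num_captions


# Made-up scores for two sets of four prepositions (on/under/left of/right of).
set_a = [
    [0.9, 0.1, 0.2, 0.3],
    [0.2, 0.8, 0.1, 0.1],
    [0.1, 0.2, 0.7, 0.3],
    [0.3, 0.1, 0.2, 0.6],
]
set_b = [
    [0.5, 0.6, 0.1, 0.1],  # wrong: caption 1 outscores caption 0
    [0.2, 0.8, 0.1, 0.1],
    [0.1, 0.2, 0.7, 0.3],
    [0.6, 0.1, 0.2, 0.3],  # wrong: caption 0 outscores caption 3
]

print(per_image_accuracy(set_a), per_image_accuracy(set_b))
print(set_accuracy([set_a, set_b]), random_chance(4))
```

Set-level accuracy is the stricter of the two numbers: a set counts as solved only if every preposition in it is resolved correctly, so a model that guesses right on some images by chance gets no credit for the set.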

Bib Entry

@inproceedings{kamath2023whatsup,
  title = {What's 'up' with vision-language models? Investigating their struggle to understand spatial relations},
  author = {Kamath, Amita and Hessel, Jack and Chang, Kai-Wei},
  booktitle = {EMNLP},
  year = {2023}
}

Related Publications

VisRet: Visualization Improves Knowledge-Intensive Text-to-Image Retrieval, ACL, 2026
HoneyBee: Data Recipes for Vision-Language Reasoners, CVPR, 2026
MotionEdit: Benchmarking and Learning Motion-Centric Image Editing, CVPR, 2026
LaViDa: A Large Diffusion Language Model for Multimodal Understanding, NeurIPS, 2025
PARTONOMY: Large Multimodal Models with Part-Level Visual Understanding, NeurIPS, 2025
STIV: Scalable Text and Image Conditioned Video Generation, ICCV, 2025
Verbalized Representation Learning for Interpretable Few-Shot Generalization, ICCV, 2025
Contrastive Visual Data Augmentation, ICML, 2025
SYNTHIA: Novel Concept Design with Affordance Composition, ACL, 2025
SlowFast-VGen: Slow-Fast Learning for Action-Driven Long Video Generation, ICLR, 2025
Towards a holistic framework for multimodal LLM in 3D brain CT radiology report generation, Nature Communications, 2025
Enhancing Large Vision Language Models with Self-Training on Image Comprehension, NeurIPS, 2024
CoBIT: A Contrastive Bi-directional Image-Text Generation Model, ICLR, 2024
DesCo: Learning Object Recognition with Rich Language Descriptions, NeurIPS, 2023
Text Encoders are Performance Bottlenecks in Contrastive Vision-Language Models, EMNLP, 2023
MetaVL: Transferring In-Context Learning Ability From Language Models to Vision-Language Models, ACL (short), 2023
REVEAL: Retrieval-Augmented Visual-Language Pre-Training with Multi-Source Multimodal Knowledge, CVPR, 2023
Grounded Language-Image Pre-training, CVPR, 2022
How Much Can CLIP Benefit Vision-and-Language Tasks?, ICLR, 2022