"What’s ’up’ with vision-language models? Investigating their struggle to understand spatial relations."

Amita Kamath, Jack Hessel, and Kai-Wei Chang, in EMNLP, 2023.


Recent vision-language (VL) models have reached human parity on VQAv2, but does that mean they can distinguish "left" from "right"? We curate three new corpora to precisely quantify model ability to comprehend basic spatial relations: COCO-prep from COCO, GQA-prep from GQA, and RealCLEVR from images we capture ourselves with even tighter controls. Compared to prior evaluations, which conflate several types of reasoning, our three tests offer precise evaluations of spatial relations. For example, our RealCLEVR benchmark is controlled so that only the preposition changes between images within a set (e.g., a mug on/under/left of/right of a table). This enables us to evaluate model performance on pairs or sets of prepositions. We evaluate 18 VL models, finding that all fall far behind human performance (despite surpassing human performance on VQAv2, as in the case of BLIP2); most only achieve a few points above random chance across all benchmarks. We then study the LAION-2B dataset, which was used to train OpenCLIP models, to investigate whether pre-training data can provide clues as to why spatial relation understanding doesn’t emerge. We find that prepositions are infrequent and often ambiguous in LAION-2B. Based on this corpus analysis, we investigate a few training strategies to address this shortcoming. While up-weighting preposition-containing instances and fine-tuning on IID data improve accuracy slightly, our three spatial relation benchmarks remain challenging for all VL models we test. We will release code and data.

Bib Entry

  title = {"What's 'up' with vision-language models? Investigating their struggle to understand spatial relations."},
  author = {Kamath, Amita and Hessel, Jack and Chang, Kai-Wei},
  booktitle = {EMNLP},
  year = {2023}

Related Publications