The Hard Positive Truth about Vision-Language Compositionality
Amita Kamath, Cheng-Yu Hsieh, Kai-Wei Chang, and Ranjay Krishna, in ECCV, 2024.
Abstract
Several benchmarks have concluded that our best vision-language models (e.g., CLIP) are lacking in compositionality. Given an image, these benchmarks probe a model’s ability to identify its associated caption amongst a set of compositional distractors. In response, a surge of recent proposals show improvements by finetuning CLIP with distractors as hard negatives. Our investigations reveal that these improvements have been overstated, because existing benchmarks do not probe whether finetuned models remain invariant to hard positives. By curating an evaluation dataset of 112,382 hard negatives and hard positives, we uncover that including hard positives decreases CLIP’s performance by 12.9%, while humans perform effortlessly at 99%. CLIP finetuned with hard negatives shows an even larger decrease, up to 38.7%. With this finding, we then produce a training set of 1,775,259 examples with both hard negative and hard positive captions. By training with both, we see improvements on existing benchmarks while simultaneously improving performance on hard positives, indicating an improvement in compositionality. Our work suggests the need for future research to rigorously test and improve CLIP’s understanding of semantic relationships between related “positive” concepts.
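To make the distinction in the abstract concrete, here is a minimal sketch of why a hard-negative-only check can overstate compositionality. It is not the paper's evaluation code: the pass criteria, the function names, and the two-dimensional toy embeddings are all assumptions chosen for illustration, with cosine similarity standing in for a CLIP-style image-text score.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def passes_negative_only(image, original, hard_negative):
    """Hypothetical criterion: the original caption must outscore the hard negative."""
    return cosine(image, original) > cosine(image, hard_negative)


def passes_with_hard_positive(image, original, hard_positive, hard_negative):
    """Hypothetical stricter criterion: the original caption AND its
    meaning-preserving rewrite (hard positive) must both outscore the hard negative."""
    neg_score = cosine(image, hard_negative)
    return (cosine(image, original) > neg_score
            and cosine(image, hard_positive) > neg_score)


# Toy embeddings (assumed values, not taken from any real model).
image = [1.0, 0.0]
original = [0.9, 0.1]        # caption that matches the image
hard_positive = [0.5, 0.8]   # same meaning, different wording
hard_negative = [0.8, 0.4]   # similar wording, different meaning

negative_only = passes_negative_only(image, original, hard_negative)
with_positive = passes_with_hard_positive(
    image, original, hard_positive, hard_negative
)
print("negative-only check passed:", negative_only)
print("check with hard positive passed:", with_positive)
```

In this toy case the model ranks the original caption above the hard negative, but ranks the hard positive below it, so the example counts as solved under the first criterion and as failed under the second.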
Bib Entry
@inproceedings{kamath2024hard,
  title = {The Hard Positive Truth about Vision-Language Compositionality},
  author = {Kamath, Amita and Hsieh, Cheng-Yu and Chang, Kai-Wei and Krishna, Ranjay},
  booktitle = {ECCV},
  year = {2024}
}