How Much Can CLIP Benefit Vision-and-Language Tasks?

Sheng Shen, Liunian Harold Li, Hao Tan, Mohit Bansal, Anna Rohrbach, Kai-Wei Chang, Zhewei Yao, and Kurt Keutzer, in ICLR, 2022.

Top-10 cited paper at ICLR 22


Abstract

Most existing Vision-and-Language (V&L) models rely on pre-trained visual encoders, using a relatively small set of manually-annotated data (as compared to web-crawled data), to perceive the visual world. However, it has been observed that large-scale pretraining usually can result in better generalization performance, e.g., CLIP (Contrastive Language-Image Pre-training), trained on a massive amount of image-caption pairs, has shown a strong zero-shot capability on various vision tasks. To further study the advantage brought by CLIP, we propose to use CLIP as the visual encoder in various V&L models in two typical scenarios: 1) plugging CLIP into task-specific fine-tuning; 2) combining CLIP with V&L pre-training and transferring to downstream tasks. We show that CLIP significantly outperforms widely-used visual encoders trained with in-domain annotated data, such as BottomUp-TopDown. We achieve competitive or better results on diverse V&L tasks, while establishing new state-of-the-art results on Visual Question Answering, Visual Entailment, and V&L Navigation tasks.


Bib Entry

@inproceedings{shen2022how,
  title = {How Much Can CLIP Benefit Vision-and-Language Tasks?},
  author = {Shen, Sheng and Li, Liunian Harold and Tan, Hao and Bansal, Mohit and Rohrbach, Anna and Chang, Kai-Wei and Yao, Zhewei and Keutzer, Kurt},
  booktitle = {ICLR},
  year = {2022}
}

Related Publications

  1. DesCo: Learning Object Recognition with Rich Language Descriptions

    Liunian Harold Li, Zi-Yi Dou, Nanyun Peng, and Kai-Wei Chang, in NeurIPS, 2023.
    Ranks 1st at the OmniLabel Challenge of CVPR 2023
    Recent developments in vision-language approaches have instigated a paradigm shift in learning visual recognition models from language supervision. These approaches align objects with language queries (e.g., "a photo of a cat") and improve the models’ adaptability to identify novel objects and domains. Recently, several studies have attempted to query these models with complex language expressions that include specifications of fine-grained semantic details, such as attributes, shapes, textures, and relations. However, simply incorporating language descriptions as queries does not guarantee accurate interpretation by the models. In fact, our experiments show that GLIP, the state-of-the-art vision-language model for object detection, often disregards contextual information in the language descriptions and instead relies heavily on detecting objects solely by their names. To tackle these challenges, we propose a new description-conditioned (DesCo) paradigm of learning object recognition models with rich language descriptions consisting of two major innovations: 1) we employ a large language model as a commonsense knowledge engine to generate rich language descriptions of objects based on object names and the raw image-text caption; 2) we design context-sensitive queries to improve the model’s ability to decipher intricate nuances embedded within descriptions and force the model to focus on context rather than object names alone. On two novel object detection benchmarks, LVIS and OmniLabel, under the zero-shot detection setting, our approach achieves 34.8 APr minival (+9.1) and 29.3 AP (+3.6), respectively, surpassing the prior state-of-the-art models, GLIP and FIBER, by a large margin.
    @inproceedings{li2023desco,
      author = {Li, Liunian Harold and Dou, Zi-Yi and Peng, Nanyun and Chang, Kai-Wei},
      title = {DesCo: Learning Object Recognition with Rich Language Descriptions},
      booktitle = {NeurIPS},
      year = {2023}
    }
    
  2. Text Encoders are Performance Bottlenecks in Contrastive Vision-Language Models

    Amita Kamath, Jack Hessel, and Kai-Wei Chang, in EMNLP, 2023.
    Performant vision-language (VL) models like CLIP represent captions using a single vector. How much information about language is lost in this bottleneck? We first curate CompPrompts, a set of increasingly compositional image captions that VL models should be able to capture (e.g., single object, to object+property, to multiple interacting objects). Then, we train text-only recovery probes that aim to reconstruct captions from single-vector text representations produced by several VL models. This approach doesn’t require images, allowing us to test on a broader range of scenes compared to prior work. We find that: 1) CLIP’s text encoder falls short on object relationships, attribute-object association, counting, and negations; 2) some text encoders work significantly better than others; and 3) text-only recovery performance predicts multi-modal matching performance on ControlledImCaps, a new evaluation benchmark we collect and release, consisting of fine-grained compositional images and captions. Specifically, our results suggest text-only recoverability is a necessary (but not sufficient) condition for modeling compositional factors in contrastive vision-language models.
    @inproceedings{kamath2023text,
      author = {Kamath, Amita and Hessel, Jack and Chang, Kai-Wei},
      title = {Text Encoders are Performance Bottlenecks in Contrastive Vision-Language Models},
      booktitle = {EMNLP},
      year = {2023}
    }
    
  3. What’s ‘up’ with vision-language models? Investigating their struggle to understand spatial relations

    Amita Kamath, Jack Hessel, and Kai-Wei Chang, in EMNLP, 2023.
    Recent vision-language (VL) models have reached human parity on VQAv2, but does that mean they can distinguish "left" from "right"? We curate three new corpora to precisely quantify model ability to comprehend basic spatial relations: COCO-prep from COCO, GQA-prep from GQA, and RealCLEVR from images we capture ourselves with even tighter controls. Compared to prior evaluations that conflate several types of reasoning, our three tests offer precise evaluations of spatial relations. For example, our RealCLEVR benchmark is controlled so that only the preposition changes between images within a set (e.g., a mug on/under/left of/right of a table). This enables us to evaluate model performance on pairs or sets of prepositions. We evaluate 18 VL models, finding that all fall far behind human performance (despite surpassing human performance on VQAv2, as in the case of BLIP2); most only achieve a few points above random chance across all benchmarks. We then study the LAION-2B dataset, which was used to train OpenCLIP models, to investigate if pre-training data can provide clues as to why spatial relation understanding doesn’t emerge. We find that prepositions are infrequent and often ambiguous in LAION-2B. Based on this corpus analysis, we investigate a few training strategies to address this shortcoming. While up-weighting preposition-containing instances and fine-tuning on IID data improve accuracy slightly, our three spatial relation benchmarks remain challenging for all VL models we test. We will release code and data.
    @inproceedings{kamath2023whatsup,
      title = {What's 'up' with vision-language models? Investigating their struggle to understand spatial relations},
      author = {Kamath, Amita and Hessel, Jack and Chang, Kai-Wei},
      booktitle = {EMNLP},
      year = {2023}
    }
    
  4. REVEAL: Retrieval-Augmented Visual-Language Pre-Training with Multi-Source Multimodal Knowledge

    Ziniu Hu, Ahmet Iscen, Chen Sun, Zirui Wang, Kai-Wei Chang, Yizhou Sun, Cordelia Schmid, David A. Ross, and Alireza Fathi, in CVPR, 2023.
    In this paper, we propose an end-to-end Retrieval-Augmented Visual Language Model (REVEAL) that learns to encode world knowledge into a large-scale memory, and to retrieve from it to answer knowledge-intensive queries. REVEAL consists of four key components: the memory, the encoder, the retriever, and the generator. The large-scale memory encodes various sources of multimodal world knowledge (e.g., image-text pairs, question-answering pairs, and knowledge graph triplets) via a unified encoder. The retriever finds the most relevant knowledge entries in the memory, and the generator fuses the retrieved knowledge with the input query to produce the output. A key novelty in our approach is that the memory, encoder, retriever, and generator are all pre-trained end-to-end on a massive amount of data. Furthermore, our approach can use a diverse set of multimodal knowledge sources, which is shown to result in significant gains. We show that REVEAL achieves state-of-the-art results on visual question answering and image captioning.
    @inproceedings{hu2023reveal,
      author = {Hu, Ziniu and Iscen, Ahmet and Sun, Chen and Wang, Zirui and Chang, Kai-Wei and Sun, Yizhou and Schmid, Cordelia and Ross, David A. and Fathi, Alireza},
      title = {REVEAL: Retrieval-Augmented Visual-Language Pre-Training with Multi-Source Multimodal Knowledge},
      booktitle = {CVPR},
      year = {2023}
    }
    
  5. Grounded Language-Image Pre-training

    Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, Kai-Wei Chang, and Jianfeng Gao, in CVPR, 2022.
    Best Paper Finalist, 33 out of 8161 submissions, top 0.4%
    This paper presents a grounded language-image pre-training (GLIP) model for learning object-level, language-aware, and semantic-rich visual representations. GLIP unifies object detection and phrase grounding for pre-training. The unification brings two benefits: 1) it allows GLIP to learn from both detection and grounding data to improve both tasks and bootstrap a good grounding model; 2) GLIP can leverage massive image-text pairs by generating grounding boxes in a self-training fashion, making the learned representation semantic-rich. In our experiments, we pre-train GLIP on 27M grounding data, including 3M human-annotated and 24M web-crawled image-text pairs. The learned representations demonstrate strong zero-shot and few-shot transferability to various object-level recognition tasks. 1) When directly evaluated on COCO and LVIS (without seeing any images in COCO during pre-training), GLIP achieves 49.8 AP and 26.9 AP, respectively, surpassing many supervised baselines. 2) After fine-tuning on COCO, GLIP achieves 60.8 AP on val and 61.5 AP on test-dev, surpassing prior SoTA. 3) When transferred to 13 downstream object detection tasks, a 1-shot GLIP rivals a fully-supervised Dynamic Head.
    @inproceedings{li2022grounded,
      title = {Grounded Language-Image Pre-training},
      author = {Li, Liunian Harold and Zhang, Pengchuan and Zhang, Haotian and Yang, Jianwei and Li, Chunyuan and Zhong, Yiwu and Wang, Lijuan and Yuan, Lu and Zhang, Lei and Hwang, Jenq-Neng and Chang, Kai-Wei and Gao, Jianfeng},
      booktitle = {CVPR},
      year = {2022}
    }
    