CoBIT: A Contrastive Bi-directional Image-Text Generation Model
Haoxuan You, Mandy Guo, Zhecan Wang, Kai-Wei Chang, Jason Michael Baldridge, and Jiahui Yu, in ICLR, 2024.
Abstract
The field of Vision-and-Language (VL) has witnessed a proliferation of pretrained foundation models. Current techniques typically employ only one type of training objective: (1) contrastive objectives (like CLIP), (2) image-to-text generative objectives (like PaLI), or (3) text-to-image generative objectives (like Parti). However, all three objectives are mutually relevant and are all based on image-text pairs. Intuitively, the two generative objectives can be considered complementary projections between the two modalities, while contrastive learning preserves global alignment and generation facilitates fine-grained understanding. Inspired by this, we present a Contrastive Bi-directional Image-Text generation model (CoBIT), which, for the first time, unifies the three pre-training objectives in one framework. Specifically, CoBIT employs a novel unicoder-decoder structure consisting of an image unicoder, a text unicoder, and a cross-modal decoder. The image/text unicoders can switch between encoding and decoding in different tasks, enabling flexibility and shared knowledge that benefits both image-to-text and text-to-image generation.
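To make the unified objective concrete, below is a minimal sketch of how the three losses might be combined on a single batch of image-text pairs. This is not the authors' code: all names (`image_unicoder`, `text_unicoder`, `decoder`, the `mode="encode"` switch, the loss weights) are hypothetical, and the sketch simplifies the paper's design by routing both generation directions through one cross-modal decoder, omitting the unicoders' own decoding mode.

```python
import torch
import torch.nn.functional as F

def cobit_loss(image_unicoder, text_unicoder, decoder,
               images, texts, image_tokens, text_tokens,
               temperature=0.07, w_con=1.0, w_i2t=1.0, w_t2i=1.0):
    """Combine the three pre-training losses on one batch of image-text pairs.

    Assumed (hypothetical) contracts:
      unicoder(x, mode="encode") -> (pooled_emb [B, D], token_feats [B, L, D])
      decoder(target_tokens [B, T], context=[B, L, D]) -> logits [B, T, V]
    """
    # Run both unicoders once in encoding mode.
    img_pooled, img_feats = image_unicoder(images, mode="encode")
    txt_pooled, txt_feats = text_unicoder(texts, mode="encode")

    # (1) Contrastive objective: symmetric InfoNCE over pooled embeddings,
    # pulling matched image-text pairs together (as in CLIP).
    img_emb = F.normalize(img_pooled, dim=-1)
    txt_emb = F.normalize(txt_pooled, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    loss_con = (F.cross_entropy(logits, labels)
                + F.cross_entropy(logits.t(), labels)) / 2

    # (2) Image-to-text generation: predict caption tokens autoregressively
    # with teacher forcing, cross-attending to image features.
    i2t_logits = decoder(text_tokens[:, :-1], context=img_feats)
    loss_i2t = F.cross_entropy(i2t_logits.transpose(1, 2), text_tokens[:, 1:])

    # (3) Text-to-image generation: predict discrete image tokens (e.g. from
    # a VQ image tokenizer), cross-attending to text features.
    t2i_logits = decoder(image_tokens[:, :-1], context=txt_feats)
    loss_t2i = F.cross_entropy(t2i_logits.transpose(1, 2), image_tokens[:, 1:])

    return w_con * loss_con + w_i2t * loss_i2t + w_t2i * loss_t2i
```

Because all three losses consume the same image-text pairs and share the unicoders, a single forward pass through each unicoder serves the contrastive term and both generative terms, which is the shared-knowledge benefit the abstract describes.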
Bib Entry
@inproceedings{you2024cobit,
title = {CoBIT: A Contrastive Bi-directional Image-Text Generation Model},
author = {You, Haoxuan and Guo, Mandy and Wang, Zhecan and Chang, Kai-Wei and Baldridge, Jason Michael and Yu, Jiahui},
booktitle = {ICLR},
year = {2024},
month = jan,
day = {16}
}
Related Publications
- HoneyBee: Data Recipes for Vision-Language Reasoners, CVPR, 2026
- MotionEdit: Benchmarking and Learning Motion-Centric Image Editing, CVPR, 2026
- LaViDa: A Large Diffusion Language Model for Multimodal Understanding, NeurIPS, 2025
- PARTONOMY: Large Multimodal Models with Part-Level Visual Understanding, NeurIPS, 2025
- STIV: Scalable Text and Image Conditioned Video Generation, ICCV, 2025
- Verbalized Representation Learning for Interpretable Few-Shot Generalization, ICCV, 2025
- Contrastive Visual Data Augmentation, ICML, 2025
- SYNTHIA: Novel Concept Design with Affordance Composition, ACL, 2025
- SlowFast-VGen: Slow-Fast Learning for Action-Driven Long Video Generation, ICLR, 2025
- Towards a holistic framework for multimodal LLM in 3D brain CT radiology report generation, Nature Communications, 2025
- Enhancing Large Vision Language Models with Self-Training on Image Comprehension, NeurIPS, 2024
- DesCo: Learning Object Recognition with Rich Language Descriptions, NeurIPS, 2023
- "What's 'up' with vision-language models? Investigating their struggle to understand spatial relations.", EMNLP, 2023
- Text Encoders are Performance Bottlenecks in Contrastive Vision-Language Models, EMNLP, 2023
- MetaVL: Transferring In-Context Learning Ability From Language Models to Vision-Language Models, ACL (short), 2023
- REVEAL: Retrieval-Augmented Visual-Language Pre-Training with Multi-Source Multimodal Knowledge, CVPR, 2023
- Grounded Language-Image Pre-training, CVPR, 2022
- How Much Can CLIP Benefit Vision-and-Language Tasks?, ICLR, 2022