CoBIT: A Contrastive Bi-directional Image-Text Generation Model
Haoxuan You, Mandy Guo, Zhecan Wang, Kai-Wei Chang, Jason Michael Baldridge, and Jiahui Yu, in ICLR, 2024.
Download the full text
Abstract
The field of Vision-and-Language (VL) has witnessed a proliferation of pretrained foundation models. Current techniques typically employ only one type of training objective: (1) contrastive objectives (like CLIP), (2) image-to-text generative objectives (like PaLI), or (3) text-to-image generative objectives (like Parti). However, these three objectives are mutually relevant and are all based on image-text pairs. Intuitively, the latter two objectives can be considered complementary projections between the two modalities, while contrastive learning preserves global alignment and generation facilitates fine-grained understanding. Inspired by this, we present a Contrastive Bi-directional Image-Text generation model (CoBIT) that, for the first time, unifies the three pre-training objectives in one framework. Specifically, CoBIT employs a novel unicoder-decoder structure consisting of an image unicoder, a text unicoder, and a cross-modal decoder. The image/text unicoders can switch between encoding and decoding in different tasks, enabling flexibility and shared knowledge that benefits both image-to-text and text-to-image generation.
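For intuition, here is a minimal PyTorch-style sketch of how the three pre-training objectives described in the abstract might be combined into a single joint loss. This is an illustrative assumption, not the paper's actual implementation: the function and argument names (`cobit_style_loss`, the weights `w_con`, `w_i2t`, `w_t2i`, the temperature) are hypothetical, and the sketch omits the unicoder-decoder architecture itself, showing only the objective over batched embeddings and token logits.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """InfoNCE-style loss over a batch of paired image/text embeddings
    (as in CLIP). Temperature value is an illustrative assumption."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature      # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy: image-to-text and text-to-image directions.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

def cobit_style_loss(img_emb, txt_emb,
                     i2t_logits, text_tokens,
                     t2i_logits, image_tokens,
                     w_con=1.0, w_i2t=1.0, w_t2i=1.0):
    """Weighted sum of the three objectives; the weights are hypothetical."""
    l_con = contrastive_loss(img_emb, txt_emb)
    # Image-to-text generation: token-level cross-entropy over text tokens,
    # logits shaped (B, T_text, vocab), targets shaped (B, T_text).
    l_i2t = F.cross_entropy(i2t_logits.flatten(0, 1), text_tokens.flatten())
    # Text-to-image generation: cross-entropy over discrete image tokens
    # (a Parti-style tokenized-image objective).
    l_t2i = F.cross_entropy(t2i_logits.flatten(0, 1), image_tokens.flatten())
    return w_con * l_con + w_i2t * l_i2t + w_t2i * l_t2i
```

In this view, a single image-text batch supplies supervision for all three terms at once, which is the intuition the abstract gives for unifying them in one framework.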
Bib Entry
@inproceedings{you2024cobit,
  title     = {CoBIT: A Contrastive Bi-directional Image-Text Generation Model},
  author    = {You, Haoxuan and Guo, Mandy and Wang, Zhecan and Chang, Kai-Wei and Baldridge, Jason Michael and Yu, Jiahui},
  booktitle = {ICLR},
  year      = {2024},
  month     = jan,
  day       = {16}
}