May 5, 2022, 1:10 a.m. | Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, Yonghui Wu

Exploring large-scale pretrained foundation models is of significant interest
in computer vision because these models can be quickly transferred to many
downstream tasks. This paper presents Contrastive Captioner (CoCa), a
minimalist design to pretrain an image-text encoder-decoder foundation model
jointly with a contrastive loss and a captioning loss, thereby subsuming model
capabilities from contrastive approaches like CLIP and generative methods like
SimVLM. In contrast to standard encoder-decoder transformers where all decoder
layers attend to encoder outputs, CoCa omits cross-attention in the first …
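
To make the joint objective concrete, the following is a minimal sketch of how a contrastive loss and a captioning loss might be combined into one training loss. It is an illustration under stated assumptions, not the authors' implementation: the function names, the temperature value, the averaging of the two contrastive directions, and the loss weights `w_con` and `w_cap` are all hypothetical choices made for this example.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric image-to-text / text-to-image cross-entropy over a batch.

    Row k of image_emb is assumed to be paired with row k of text_emb.
    The temperature and the averaging of both directions are assumptions.
    """
    n = len(image_emb)
    img = [l2_normalize(v) for v in image_emb]
    txt = [l2_normalize(v) for v in text_emb]
    # Pairwise similarity logits: logits[i][j] = <img_i, txt_j> / temperature
    logits = [
        [sum(a * b for a, b in zip(i, t)) / temperature for t in txt]
        for i in img
    ]
    # Image-to-text: each image should pick out its own caption.
    i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    # Text-to-image: each caption should pick out its own image.
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    t2i = -sum(log_softmax(columns[k])[k] for k in range(n)) / n
    return 0.5 * (i2t + t2i)


def captioning_loss(token_logits, target_ids):
    """Mean next-token cross-entropy for one caption.

    token_logits[t] holds vocabulary logits at position t;
    target_ids[t] is the index of the correct token at that position.
    """
    total = 0.0
    for logits, target in zip(token_logits, target_ids):
        total -= log_softmax(logits)[target]
    return total / len(target_ids)


def joint_loss(image_emb, text_emb, token_logits, target_ids,
               w_con=1.0, w_cap=1.0):
    """Weighted sum of the two losses; the weights are hypothetical."""
    return (w_con * contrastive_loss(image_emb, text_emb)
            + w_cap * captioning_loss(token_logits, target_ids))
```

In this sketch the contrastive term only sees pooled image and text embeddings, while the captioning term only sees per-token decoder logits, so the two can share one forward pass and be summed into a single scalar for backpropagation.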

