Web: http://arxiv.org/abs/2205.01917

May 5, 2022, 1:12 a.m. | Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, Yonghui Wu

cs.LG updates on arXiv.org

Exploring large-scale pretrained foundation models is of significant interest
in computer vision because these models can be quickly transferred to many
downstream tasks. This paper presents Contrastive Captioner (CoCa), a
minimalist design to pretrain an image-text encoder-decoder foundation model
jointly with contrastive loss and captioning loss, thereby subsuming model
capabilities from contrastive approaches like CLIP and generative methods like
SimVLM. In contrast to standard encoder-decoder transformers where all decoder
layers attend to encoder outputs, CoCa omits cross-attention in the first …
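Below is a minimal sketch of how CoCa's joint training objective could be written in PyTorch. It is an illustration under stated assumptions, not the authors' implementation: the pooled image/text embeddings and the decoder's caption logits are assumed to come from hypothetical encoder/decoder modules, and the temperature and loss weights are placeholder defaults.

```python
# Sketch of a CoCa-style joint objective: CLIP-style contrastive loss on pooled
# image/text embeddings plus an autoregressive captioning loss on decoder logits.
import torch
import torch.nn.functional as F

def coca_loss(image_emb, text_emb, caption_logits, caption_targets,
              temperature=0.07, contrastive_weight=1.0, caption_weight=2.0):
    """Combine a symmetric contrastive loss with a captioning (LM) loss.

    image_emb:       (batch, dim)   pooled image embeddings (hypothetical encoder)
    text_emb:        (batch, dim)   pooled unimodal text embeddings
    caption_logits:  (batch, seq, vocab) multimodal decoder outputs
    caption_targets: (batch, seq)   caption token ids
    """
    # Contrastive term: symmetric InfoNCE over L2-normalized embeddings.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    contrastive = (F.cross_entropy(logits, labels) +
                   F.cross_entropy(logits.t(), labels)) / 2

    # Captioning term: token-level cross-entropy on the decoder's predictions.
    captioning = F.cross_entropy(
        caption_logits.reshape(-1, caption_logits.size(-1)),
        caption_targets.reshape(-1),
    )
    return contrastive_weight * contrastive + caption_weight * captioning
```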

arxiv cv image models text
