Localization vs. Semantics: Visual Representations in Unimodal and Multimodal Models. (arXiv:2212.00281v2 [cs.CV] UPDATED)
cs.CL updates on arXiv.org
Despite the impressive advancements achieved through vision-and-language
pretraining, it remains unclear whether this joint learning paradigm also helps
models understand each individual modality. In this work, we conduct a comparative
analysis of the visual representations in existing vision-and-language models
and vision-only models by probing them on a broad range of tasks, aiming to assess the
quality of the learned representations in a nuanced manner. Interestingly, our
empirical observations suggest that vision-and-language models are better at
label prediction tasks like object and attribute prediction, …
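The probing evaluation the abstract refers to is commonly implemented as a linear probe trained on frozen features. The sketch below illustrates that generic protocol only, not the paper's actual setup: the encoder is stubbed out with random vectors, and all names (extract_frozen_features, the class count, the feature size) are illustrative assumptions.

```python
# Minimal sketch of linear probing on frozen visual features, assuming the
# common protocol: keep the encoder fixed and train only a linear classifier
# on a label-prediction task (e.g. object or attribute prediction).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def extract_frozen_features(n_images: int, n_features: int = 512) -> np.ndarray:
    """Stand-in for running images through a frozen encoder (the visual
    backbone of a vision-only or vision-and-language model)."""
    return rng.normal(size=(n_images, n_features))

# Hypothetical probing task: one label per image.
n_images, n_classes = 2000, 10
features = extract_frozen_features(n_images)
labels = rng.integers(0, n_classes, size=n_images)

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=0
)

# Only the linear probe is trained; its test accuracy indicates how linearly
# accessible the task's labels are in the frozen representation.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)
print(f"probe accuracy: {accuracy_score(y_test, probe.predict(X_test)):.3f}")
```

Comparing this probe accuracy across encoders (here, vision-and-language versus vision-only backbones) on many such tasks is what allows the kind of comparative, task-by-task assessment the abstract describes.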