MaMMUT: A simple vision-encoder text-decoder architecture for multimodal tasks
Google AI Blog ai.googleblog.com
Vision-language foundation models are built on the premise of a single pre-training stage followed by adaptation to multiple downstream tasks. Two main, disjoint training scenarios are popular: CLIP-style contrastive learning and next-token prediction. Contrastive learning trains the model to predict whether image-text pairs correctly match, effectively building visual and text representations for the corresponding image and text inputs, whereas next-token prediction predicts the most likely next …
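To make the contrastive scenario concrete, here is a minimal NumPy sketch of a CLIP-style symmetric contrastive loss. It is an illustration of the general technique, not MaMMUT's or CLIP's actual implementation; the function name, batch shapes, and temperature value are assumptions for the example.

```python
import numpy as np

def clip_style_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss: matching image/text pairs share a row index."""
    # Normalize embeddings so the dot product is cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    # Pairwise similarity matrix, scaled by a temperature.
    logits = image_emb @ text_emb.T / temperature
    n = logits.shape[0]

    def cross_entropy(l):
        # Correct pairs lie on the diagonal; apply softmax cross-entropy per row.
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(n), np.arange(n)].mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

# Toy batch of 4 image/text embedding pairs (texts nearly match their images).
rng = np.random.default_rng(0)
img = rng.normal(size=(4, 8))
txt = img + 0.01 * rng.normal(size=(4, 8))
loss = clip_style_contrastive_loss(img, txt)
```

The loss is low when each image is most similar to its own caption and high otherwise, which is what drives the paired representations described above.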