Nov. 23, 2022, 2:17 a.m. | Yuanze Lin, Chen Wei, Huiyu Wang, Alan Yuille, Cihang Xie

cs.CL updates on arXiv.org

Video-language pre-training is crucial for learning powerful multi-modal
representations. However, it typically requires a massive amount of computation.
In this paper, we develop SMAUG, an efficient pre-training framework for
video-language models. The foundation of SMAUG is the masked autoencoder.
Unlike prior works that mask only textual inputs, our masking strategy
considers both the visual and textual modalities, yielding better cross-modal
alignment and further reducing pre-training cost. On top of that, we introduce
a space-time token sparsification module, which leverages …
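The dual-modality masking described above can be sketched as random token masking applied independently to video patch tokens and text tokens. This is only an illustrative sketch, not the paper's implementation: the function name `mask_tokens` and the masking ratios are assumptions chosen for demonstration.

```python
import numpy as np

def mask_tokens(tokens, mask_ratio, rng):
    """Randomly keep a (1 - mask_ratio) fraction of tokens per sample.

    tokens: array of shape (batch, num_tokens, dim)
    Returns the kept tokens and a boolean mask (True = masked out).
    """
    b, n, d = tokens.shape
    num_keep = max(1, int(n * (1 - mask_ratio)))
    kept = np.empty((b, num_keep, d), dtype=tokens.dtype)
    mask = np.ones((b, n), dtype=bool)
    for i in range(b):
        # A random permutation decides which tokens survive this sample.
        keep_idx = rng.permutation(n)[:num_keep]
        kept[i] = tokens[i, keep_idx]
        mask[i, keep_idx] = False
    return kept, mask

rng = np.random.default_rng(0)
# Hypothetical token tensors: video patch tokens and text token embeddings.
video_tokens = rng.standard_normal((2, 196, 768))
text_tokens = rng.standard_normal((2, 32, 768))

# Mask both modalities, with a high visual ratio (as in masked autoencoders)
# and a lower textual ratio; the exact values here are illustrative.
vis_kept, vis_mask = mask_tokens(video_tokens, mask_ratio=0.75, rng=rng)
txt_kept, txt_mask = mask_tokens(text_tokens, mask_ratio=0.15, rng=rng)
```

Because the encoder only processes the surviving tokens, a high visual masking ratio shrinks the input sequence substantially, which is where the pre-training savings come from.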

