Multimodal Transformer Distillation for Audio-Visual Synchronization
March 19, 2024, 4:50 a.m. | Xuanjun Chen, Haibin Wu, Chung-Che Wang, Hung-yi Lee, Jyh-Shing Roger Jang
cs.CV updates on arXiv.org arxiv.org
Abstract: Audio-visual synchronization aims to determine whether the mouth movements and speech in a video are synchronized. VocaLiST reaches state-of-the-art performance by incorporating multimodal Transformers to model audio-visual interaction information. However, it requires high computing resources, making it impractical for real-world applications. This paper proposes the MTDVocaLiST model, which is trained with the proposed multimodal Transformer distillation (MTD) loss. The MTD loss enables the MTDVocaLiST model to deeply mimic the cross-attention distribution and value-relation in the Transformer of …
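The abstract is truncated before it defines the MTD loss, so the following is only a rough sketch of what "mimicking the cross-attention distribution and value-relation" could look like. It assumes a KL divergence between teacher and student maps and a scaled dot-product value-relation, in the style of common attention-distillation methods; the paper's actual formulation may differ. All function names (`scaled_scores`, `mean_row_kl`, `mtd_style_loss`) are hypothetical and written in plain Python for readability.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_scores(a, b):
    """Row-wise softmax of (a @ b^T) / sqrt(d), where d is the feature size of a."""
    d = len(a[0])
    return [
        softmax([sum(x * y for x, y in zip(ra, rb)) / math.sqrt(d) for rb in b])
        for ra in a
    ]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def mean_row_kl(teacher, student):
    """Average KL divergence over matching rows of two distribution matrices."""
    return sum(kl_divergence(t, s) for t, s in zip(teacher, student)) / len(teacher)


def mtd_style_loss(t_q, t_k, t_v, s_q, s_k, s_v):
    """Assumed form: attention-distribution KL plus value-relation KL.

    Queries come from one modality and keys/values from the other
    (cross-attention). Teacher and student may use different hidden sizes,
    because only the (length x length) maps are compared.
    """
    attn_loss = mean_row_kl(scaled_scores(t_q, t_k), scaled_scores(s_q, s_k))
    value_loss = mean_row_kl(scaled_scores(t_v, t_v), scaled_scores(s_v, s_v))
    return attn_loss + value_loss
```

Under these assumptions the loss is zero when the student reproduces the teacher's maps exactly and positive otherwise, which is what lets a smaller student be trained toward the teacher's cross-modal behavior rather than only its final output.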