Multimodal Transformer Distillation for Audio-Visual Synchronization

March 19, 2024, 4:50 a.m. | Xuanjun Chen, Haibin Wu, Chung-Che Wang, Hung-yi Lee, Jyh-Shing Roger Jang

cs.CV updates on arXiv.org arxiv.org

arXiv:2210.15563v3 Announce Type: replace
Abstract: Audio-visual synchronization aims to determine whether the mouth movements and speech in a video are synchronized. VocaLiST reaches state-of-the-art performance by incorporating multimodal Transformers to model audio-visual interaction information. However, it requires high computing resources, making it impractical for real-world applications. This paper proposes MTDVocaLiST, a model trained with our proposed multimodal Transformer distillation (MTD) loss. The MTD loss enables the MTDVocaLiST model to deeply mimic the cross-attention distribution and value-relation in the Transformer of …
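
The abstract is cut off before it defines the MTD loss, so the following is only a rough sketch of what "mimicking the cross-attention distribution and value-relation" could look like, not the authors' formulation. It assumes a single attention head, KL divergence as the distance between teacher and student distributions, and a value-relation defined as a softmax over scaled dot products between value vectors (in the style of earlier attention-distillation work). All function names here (`cross_attention`, `value_relation`, `mtd_style_loss`) are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def kl_rows(p, q, eps=1e-12):
    """Mean KL(p || q) over the rows of two row-stochastic matrices."""
    per_row = np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)
    return float(np.mean(per_row))


def cross_attention(query, key):
    """Attention distribution of one modality's queries over the other's keys."""
    d = query.shape[-1]
    return softmax(query @ key.T / np.sqrt(d))


def value_relation(value):
    """Assumed definition: softmax over scaled value-value dot products."""
    d = value.shape[-1]
    return softmax(value @ value.T / np.sqrt(d))


def mtd_style_loss(t_q, t_k, t_v, s_q, s_k, s_v):
    """Assumed loss: KL on cross-attention maps plus KL on value-relations.

    Teacher (t_*) and student (s_*) may use different hidden sizes, because
    both terms compare sequence-by-sequence matrices, not hidden vectors.
    """
    attention_term = kl_rows(cross_attention(t_q, t_k), cross_attention(s_q, s_k))
    relation_term = kl_rows(value_relation(t_v), value_relation(s_v))
    return attention_term + relation_term


# Toy example: 5 query frames (e.g. audio) attending over 7 key frames (e.g. video).
rng = np.random.default_rng(0)
teacher = [rng.normal(size=(5, 8)), rng.normal(size=(7, 8)), rng.normal(size=(7, 8))]
student = [rng.normal(size=(5, 4)), rng.normal(size=(7, 4)), rng.normal(size=(7, 4))]
loss = mtd_style_loss(*teacher, *student)
```

Comparing attention maps and value-relations instead of hidden states is what lets the toy student above use a smaller hidden size (4) than the teacher (8); whether the paper exploits this in the same way is not stated in the truncated abstract.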
