Feb. 9, 2024, 5:46 a.m. | Zhiyuan Ma, Xiangyu Zhu, Guojun Qi, Chen Qian, Zhaoxiang Zhang, Zhen Lei

cs.CV updates on arXiv.org

Speech-driven 3D facial animation is important for many multimedia applications. Recent work has shown promise in using either diffusion models or Transformer architectures for this task. However, simply combining the two does not improve performance. We suspect this is due to a shortage of paired audio-4D data, which the Transformer needs in order to act as an effective denoiser within the diffusion framework. To tackle this issue, we present DiffSpeaker, a Transformer-based network equipped with novel biased conditional attention …


