March 26, 2024, 4:44 a.m. | Puyuan Peng, Po-Yao Huang, Daniel Li, Abdelrahman Mohamed, David Harwath

cs.LG updates on arXiv.org arxiv.org

arXiv:2403.16973v1 Announce Type: cross
Abstract: We introduce VoiceCraft, a token infilling neural codec language model, that achieves state-of-the-art performance on both speech editing and zero-shot text-to-speech (TTS) on audiobooks, internet videos, and podcasts. VoiceCraft employs a Transformer decoder architecture and introduces a token rearrangement procedure that combines causal masking and delayed stacking to enable generation within an existing sequence. On speech editing tasks, VoiceCraft produces edited speech that is nearly indistinguishable from unedited recordings in terms of naturalness, as evaluated …

arxiv cs.ai cs.cl cs.lg cs.sd editing eess.as speech text text-to-speech type zero-shot

Founding AI Engineer, Agents

@ Occam AI | New York

AI Engineer Intern, Agents

@ Occam AI | US

AI Research Scientist

@ Vara | Berlin, Germany and Remote

Data Architect

@ University of Texas at Austin | Austin, TX

Data ETL Engineer

@ University of Texas at Austin | Austin, TX

Lead GNSS Data Scientist

@ Lurra Systems | Melbourne