Feb. 6, 2024, 5:54 a.m. | Dan Lyth, Simon King

cs.CL updates on arXiv.org

Text-to-speech models trained on large-scale datasets have demonstrated impressive in-context learning capabilities and naturalness. However, control of speaker identity and style in these models typically requires conditioning on reference speech recordings, limiting creative applications. Alternatively, natural language prompting of speaker identity and style has demonstrated promising results and provides an intuitive method of control. However, reliance on human-labeled descriptions prevents scaling to large datasets.
Our work bridges the gap between these two approaches. We propose a scalable method for labeling …
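The two control pathways the abstract contrasts, conditioning on a reference recording versus a natural-language style prompt, can be illustrated with a minimal sketch. The `TTSRequest` interface and `conditioning_mode` helper below are hypothetical, not the paper's API; they only make the distinction between the two approaches concrete.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TTSRequest:
    """A hypothetical request to a text-to-speech model."""
    text: str
    reference_audio: Optional[bytes] = None  # control by example recording
    style_prompt: Optional[str] = None       # control by natural-language description

def conditioning_mode(req: TTSRequest) -> str:
    """Report which control pathway a request uses (illustrative only)."""
    if req.reference_audio is not None:
        return "reference"   # requires a speech recording of the target voice
    if req.style_prompt is not None:
        return "prompt"      # intuitive, but needs labeled descriptions at training time
    return "unconditioned"
```

For example, `conditioning_mode(TTSRequest("Hello", style_prompt="a calm, low-pitched voice"))` returns `"prompt"`, the pathway whose training data the paper's labeling method aims to generate at scale.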

