June 30, 2022, 1:12 a.m. | Peter Makarov, Ammar Abbas, Mateusz Łajszczak, Arnaud Joly, Sri Karlapati, Alexis Moinet, Thomas Drugman, Penny Karanasou

cs.CL updates on arXiv.org

Generating expressive and contextually appropriate prosody remains a
challenge for modern text-to-speech (TTS) systems. This is particularly evident
for long, multi-sentence inputs. In this paper, we examine simple extensions to
a Transformer-based FastSpeech-like system, with the goal of improving prosody
for multi-sentence TTS. We find that long context, powerful text features, and
training on multi-speaker data all improve prosody. More interestingly, they
result in synergies. Long context disambiguates prosody, improves coherence,
and plays to the strengths of Transformers. Fine-tuning word-level …
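Below is a minimal, hypothetical sketch of the kind of extension the abstract describes: a FastSpeech-like (non-autoregressive) Transformer encoder that reads phonemes for the current sentence plus surrounding context, conditioned on word-level text features (e.g. from a pretrained language model) and a speaker embedding for multi-speaker training. The module names, dimensions, and wiring are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch: FastSpeech-like encoder with long (multi-sentence)
# context, word-level text features, and a speaker embedding.
# All sizes and the feature-injection scheme are illustrative assumptions.
import torch
import torch.nn as nn


class ContextualFastSpeechEncoder(nn.Module):
    def __init__(self, n_phonemes=100, n_speakers=10, d_model=256,
                 d_word_feat=768, n_heads=4, n_layers=4):
        super().__init__()
        self.phoneme_emb = nn.Embedding(n_phonemes, d_model)
        self.speaker_emb = nn.Embedding(n_speakers, d_model)
        # Project word-level LM features (assumed d_word_feat-dim) to d_model
        # so they can be added to the phonemes of the word they cover.
        self.word_proj = nn.Linear(d_word_feat, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, phonemes, word_feats, word_to_phoneme, speaker_id):
        # phonemes:        (B, T)  phoneme ids for context + current sentence
        # word_feats:      (B, W, d_word_feat)  word-level text features
        # word_to_phoneme: (B, T)  index of the word covering each phoneme
        # speaker_id:      (B,)    speaker index (multi-speaker training)
        x = self.phoneme_emb(phonemes)                          # (B, T, d)
        w = self.word_proj(word_feats)                          # (B, W, d)
        # Broadcast each word's feature onto its phonemes.
        w_per_phoneme = torch.gather(
            w, 1, word_to_phoneme.unsqueeze(-1).expand(-1, -1, w.size(-1)))
        x = x + w_per_phoneme + self.speaker_emb(speaker_id).unsqueeze(1)
        return self.encoder(x)                                  # (B, T, d)


# Example: context sentences and the current sentence flattened into one
# long phoneme sequence (the "long context" input).
enc = ContextualFastSpeechEncoder()
B, T, W = 2, 50, 12
out = enc(torch.randint(0, 100, (B, T)),
          torch.randn(B, W, 768),
          torch.randint(0, W, (B, T)),
          torch.randint(0, 10, (B,)))
print(out.shape)  # torch.Size([2, 50, 256])
```

In this sketch, "long context" simply means the encoder attends over phonemes from neighbouring sentences as well as the current one, which is where the Transformer's global attention is assumed to help disambiguate prosody and improve coherence across sentence boundaries.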

arxiv tts
