Achieving realistic, vivid, and human-like synthesized conversational
gestures conditioned on multi-modal data is still an unsolved problem due to
the lack of available datasets, models and standard evaluation metrics. To
address this, we build the Body-Expression-Audio-Text dataset, BEAT, which has
i) 76 hours of high-quality, multi-modal data captured from 30 speakers talking
with eight different emotions and in four different languages, and ii) 32
million frame-level emotion and semantic relevance annotations. Our statistical
analysis on BEAT demonstrates the correlation of conversational gestures with …

