On the Impact of Noises in Crowd-Sourced Data for Speech Translation. (arXiv:2206.13756v1 [cs.CL])
June 29, 2022, 1:12 a.m. | Siqi Ouyang, Rong Ye, Lei Li
cs.CL updates on arXiv.org arxiv.org
Training speech translation (ST) models requires large and high-quality
datasets. MuST-C is one of the most widely used ST benchmark datasets. It
contains around 400 hours of speech-transcript-translation data for each of the
eight translation directions. This dataset passes several quality-control
filters during creation. However, we find that MuST-C still suffers from three
major quality issues: audio-text misalignment, inaccurate translation, and
unnecessary speaker names. What impact do these data quality issues have
on model development and evaluation? In this …