Sept. 2, 2022, 1:15 a.m. | Yan Xia, Zhou Zhao, Shangwei Ye, Yang Zhao, Haoyuan Li, Yi Ren

cs.CL updates on arXiv.org

In this paper, we introduce a new task, spoken video grounding (SVG), which
aims to localize the desired video fragments from spoken language descriptions.
Compared with using text, employing audio requires the model to directly
exploit the useful phonemes and syllables related to the video from raw speech.
Moreover, we randomly add environmental noises to this speech audio, further
increasing the difficulty of this task and better simulating real applications.
To rectify the discriminative phonemes and extract video-related information
from …
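The abstract mentions randomly adding environmental noise to the speech audio. The paper excerpt does not specify the mixing procedure, but a common approach is to scale a noise clip so the mixture hits a randomly sampled signal-to-noise ratio (SNR). A minimal sketch, assuming NumPy waveforms and a hypothetical `add_noise` helper:

```python
import numpy as np

def add_noise(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix an environmental-noise clip into a speech waveform at a target SNR (dB)."""
    # Tile or trim the noise clip to match the speech length.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale noise so that 10 * log10(speech_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Example: sample a random SNR per clip (the 0-20 dB range is an assumption).
rng = np.random.default_rng(0)
speech = np.sin(np.linspace(0, 100, 16000))       # placeholder waveform
noise = rng.standard_normal(4000)                 # placeholder noise clip
noisy = add_noise(speech, noise, rng.uniform(0.0, 20.0))
```

Sampling the SNR per clip, rather than fixing it, exposes the model to a range of noise severities, which is what makes the localization task harder and closer to real recording conditions.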

Tags: arxiv, curriculum learning, video
