March 27, 2024, 4:43 a.m. | Anil Batra, Davide Moltisanti, Laura Sevilla-Lara, Marcus Rohrbach, Frank Keller

cs.LG updates on arXiv.org arxiv.org

arXiv:2311.15964v2 Announce Type: replace-cross
Abstract: Procedural videos show step-by-step demonstrations of tasks like recipe preparation. Understanding such videos is challenging, involving the precise localization of steps and the generation of textual instructions. Manually annotating steps and writing instructions is costly, which limits the size of current datasets and hinders effective learning. Leveraging large but noisy video-transcript datasets for pre-training can boost performance, but demands significant computational resources. Furthermore, transcripts contain irrelevant content and exhibit style variation compared to instructions written …

abstract arxiv cs.ai cs.cl cs.cv cs.lg current datasets localization pre-training recipe show step-by-step tasks textual training type understanding videos writing

Data Architect

@ University of Texas at Austin | Austin, TX

Data ETL Engineer

@ University of Texas at Austin | Austin, TX

Lead GNSS Data Scientist

@ Lurra Systems | Melbourne

Senior Machine Learning Engineer (MLOps)

@ Promaton | Remote, Europe

Research Scientist, Demography and Survey Science, University Grad

@ Meta | Menlo Park, CA | New York City

Computer Vision Engineer, XR

@ Meta | Burlingame, CA