April 23, 2024, 4:47 a.m. | Tengda Han, Max Bain, Arsha Nagrani, G\"ul Varol, Weidi Xie, Andrew Zisserman

cs.CV updates on arXiv.org arxiv.org

arXiv:2404.14412v1 Announce Type: new
Abstract: Generating Audio Description (AD) for movies is a challenging task that requires fine-grained visual understanding and an awareness of the characters and their names. Currently, visual language models for AD generation are limited by a lack of suitable training data, and also their evaluation is hampered by using performance measures not specialized to the AD domain. In this paper, we make three contributions: (i) We propose two approaches for constructing AD datasets with aligned video …

abstract arxiv audio characters cs.cv data evaluation fine-grained iii language language models movies pixels training training data type understanding visual visual language models

Founding AI Engineer, Agents

@ Occam AI | New York

AI Engineer Intern, Agents

@ Occam AI | US

AI Research Scientist

@ Vara | Berlin, Germany and Remote

Data Architect

@ University of Texas at Austin | Austin, TX

Data ETL Engineer

@ University of Texas at Austin | Austin, TX

Lead GNSS Data Scientist

@ Lurra Systems | Melbourne