AutoAD III: The Prequel -- Back to the Pixels
April 23, 2024, 4:47 a.m. | Tengda Han, Max Bain, Arsha Nagrani, Gül Varol, Weidi Xie, Andrew Zisserman
cs.CV updates on arXiv.org arxiv.org
Abstract: Generating Audio Description (AD) for movies is a challenging task that requires fine-grained visual understanding and an awareness of the characters and their names. Currently, visual language models for AD generation are limited by a lack of suitable training data, and their evaluation is hampered by performance measures not specialized to the AD domain. In this paper, we make three contributions: (i) We propose two approaches for constructing AD datasets with aligned video …