VideoAgent: Long-form Video Understanding with Large Language Model as Agent
March 18, 2024, 4:45 a.m. | Xiaohan Wang, Yuhui Zhang, Orr Zohar, Serena Yeung-Levy
Source: cs.CV updates on arXiv.org (arxiv.org)
Abstract: Long-form video understanding represents a significant challenge within computer vision, demanding a model capable of reasoning over long multi-modal sequences. Motivated by the human cognitive process for long-form video understanding, we emphasize interactive reasoning and planning over the ability to process lengthy visual inputs. We introduce a novel agent-based system, VideoAgent, that employs a large language model as a central agent to iteratively identify and compile crucial information to answer a question, with vision-language foundation …