all AI news
VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding
March 19, 2024, 4:49 a.m. | Yue Fan, Xiaojian Ma, Rujie Wu, Yuntao Du, Jiaqi Li, Zhi Gao, Qing Li
cs.CV updates on arXiv.org arxiv.org
Abstract: We explore how reconciling several foundation models (large language models and vision-language models) with a novel unified memory mechanism could tackle the challenging video understanding problem, especially capturing the long-term temporal relations in lengthy videos. In particular, the proposed multimodal agent VideoAgent: 1) constructs a structured memory to store both the generic temporal event descriptions and object-centric tracking states of the video; 2) given an input task query, it employs tools including video segment localization …
abstract agent arxiv cs.cv explore foundation language language models large language large language models long-term memory multimodal novel relations temporal type understanding video videos video understanding vision vision-language models
More from arxiv.org / cs.CV updates on arXiv.org
Jobs in AI, ML, Big Data
Software Engineer for AI Training Data (School Specific)
@ G2i Inc | Remote
Software Engineer for AI Training Data (Python)
@ G2i Inc | Remote
Software Engineer for AI Training Data (Tier 2)
@ G2i Inc | Remote
Data Engineer
@ Lemon.io | Remote: Europe, LATAM, Canada, UK, Asia, Oceania
Artificial Intelligence – Bioinformatic Expert
@ University of Texas Medical Branch | Galveston, TX
Lead Developer (AI)
@ Cere Network | San Francisco, US