OLViT: Multi-Modal State Tracking via Attention-Based Embeddings for Video-Grounded Dialog
Feb. 21, 2024, 5:46 a.m. | Adnen Abdessaied, Manuel von Hochmeister, Andreas Bulling
cs.CV updates on arXiv.org
Abstract: We present the Object Language Video Transformer (OLViT), a novel model for video dialog that operates over a multi-modal attention-based dialog state tracker. Existing video dialog models struggle with questions requiring both spatial and temporal localization within videos, long-term temporal reasoning, and accurate object tracking across multiple dialog turns. OLViT addresses these challenges by maintaining a global dialog state based on the output of an Object State Tracker (OST) and a Language State Tracker (LST): …
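The abstract describes a global dialog state that is refreshed each turn from the outputs of two trackers (OST and LST) via attention. The paper's actual architecture is not given here, so the following is only a minimal illustrative sketch of that idea: a state vector attends over hypothetical object and language embeddings and is replaced by the attention readout. All names (`DialogStateTracker`, `attend`, the embedding inputs) are assumptions for illustration, not the authors' API.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attend(query, keys, values):
    """Scaled dot-product attention: weight each value by query-key similarity."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    out = [0.0] * len(values[0])
    for w, v in zip(weights, values):
        for i, vi in enumerate(v):
            out[i] += w * vi
    return out

class DialogStateTracker:
    """Hypothetical sketch of OLViT-style state tracking: a global dialog
    state attends over object-tracker (OST) and language-tracker (LST)
    embeddings each turn and takes the readout as the new state."""

    def __init__(self, dim):
        self.state = [0.0] * dim  # global dialog state, persists across turns

    def update(self, ost_embeddings, lst_embeddings):
        # Pool both trackers' outputs into one attention memory.
        memory = ost_embeddings + lst_embeddings
        self.state = attend(self.state, memory, memory)
        return self.state

# Example: two dialog turns over toy 2-d embeddings.
tracker = DialogStateTracker(dim=2)
state = tracker.update([[1.0, 0.0]], [[0.0, 1.0]])  # initial zero state → uniform weights
state = tracker.update([[1.0, 0.0]], [[0.0, 1.0]])  # second turn reuses the carried state
```

Carrying `self.state` across `update` calls is what makes this a state *tracker* rather than per-turn attention; the first turn (zero query) simply averages the memory, and later turns bias the readout toward embeddings similar to the accumulated state.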