Feb. 21, 2024, 5:46 a.m. | Adnen Abdessaied, Manuel von Hochmeister, Andreas Bulling

cs.CV updates on arXiv.org arxiv.org

arXiv:2402.13146v1 Announce Type: new
Abstract: We present the Object Language Video Transformer (OLViT), a novel model for video dialog operating over a multi-modal attention-based dialog state tracker. Existing video dialog models struggle with questions requiring both spatial and temporal localization within videos, long-term temporal reasoning, and accurate object tracking across multiple dialog turns. OLViT addresses these challenges by maintaining a global dialog state based on the output of an Object State Tracker (OST) and a Language State Tracker (LST): …
