OLViT: Multi-Modal State Tracking via Attention-Based Embeddings for Video-Grounded Dialog
Feb. 21, 2024, 5:46 a.m. | Adnen Abdessaied, Manuel von Hochmeister, Andreas Bulling
cs.CV updates on arXiv.org
Abstract: We present the Object Language Video Transformer (OLViT), a novel model for video dialog that operates over a multi-modal attention-based dialog state tracker. Existing video dialog models struggle with questions requiring both spatial and temporal localization within videos, long-term temporal reasoning, and accurate object tracking across multiple dialog turns. OLViT addresses these challenges by maintaining a global dialog state based on the output of an Object State Tracker (OST) and a Language State Tracker (LST): …