OLViT: Multi-Modal State Tracking via Attention-Based Embeddings for Video-Grounded Dialog
Feb. 21, 2024, 5:46 a.m. | Adnen Abdessaied, Manuel von Hochmeister, Andreas Bulling
cs.CV updates on arXiv.org
Abstract: We present the Object Language Video Transformer (OLViT), a novel model for video dialog that operates over a multi-modal attention-based dialog state tracker. Existing video dialog models struggle with questions requiring both spatial and temporal localization within videos, long-term temporal reasoning, and accurate object tracking across multiple dialog turns. OLViT addresses these challenges by maintaining a global dialog state based on the output of an Object State Tracker (OST) and a Language State Tracker (LST): …
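The abstract describes a global dialog state that is refreshed each turn from the outputs of two trackers (OST and LST) via attention. The paper's actual architecture is not given here, so the following is only a minimal illustrative sketch of that idea: a state vector attends over hypothetical object and language embeddings and is replaced by the attention readout. All names (`DialogStateTracker`, `attend`, the embedding inputs) are assumptions for illustration, not the authors' API.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attend(query, keys, values):
    """Scaled dot-product attention: weight each value by query-key similarity."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    out = [0.0] * len(values[0])
    for w, v in zip(weights, values):
        for i, vi in enumerate(v):
            out[i] += w * vi
    return out

class DialogStateTracker:
    """Hypothetical sketch of OLViT-style state tracking: a global dialog
    state attends over object-tracker (OST) and language-tracker (LST)
    embeddings each turn and takes the readout as the new state."""

    def __init__(self, dim):
        self.state = [0.0] * dim  # global dialog state, persists across turns

    def update(self, ost_embeddings, lst_embeddings):
        # Pool both trackers' outputs into one attention memory.
        memory = ost_embeddings + lst_embeddings
        self.state = attend(self.state, memory, memory)
        return self.state

# Example: two dialog turns over toy 2-d embeddings.
tracker = DialogStateTracker(dim=2)
state = tracker.update([[1.0, 0.0]], [[0.0, 1.0]])  # initial zero state → uniform weights
state = tracker.update([[1.0, 0.0]], [[0.0, 1.0]])  # second turn reuses the carried state
```

Carrying `self.state` across `update` calls is what makes this a state *tracker* rather than per-turn attention; the first turn (zero query) simply averages the memory, and later turns bias the readout toward embeddings similar to the accumulated state.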