OLViT: Multi-Modal State Tracking via Attention-Based Embeddings for Video-Grounded Dialog

Feb. 21, 2024, 5:46 a.m. | Adnen Abdessaied, Manuel von Hochmeister, Andreas Bulling

cs.CV updates on arXiv.org

arXiv:2402.13146v1 Announce Type: new
Abstract: We present the Object Language Video Transformer (OLViT), a novel model for video dialog operating over a multi-modal attention-based dialog state tracker. Existing video dialog models struggle with questions requiring both spatial and temporal localization within videos, long-term temporal reasoning, and accurate object tracking across multiple dialog turns. OLViT addresses these challenges by maintaining a global dialog state based on the output of an Object State Tracker (OST) and a Language State Tracker (LST): …


