March 6, 2024, 5:45 a.m. | Chun-Peng Chang, Shaoxiang Wang, Alain Pagani, Didier Stricker

cs.CV updates on arXiv.org

arXiv:2403.03077v1 Announce Type: new
Abstract: 3D visual grounding involves matching natural language descriptions with their corresponding objects in 3D spaces. Existing methods often struggle with object-recognition accuracy and with interpreting complex linguistic queries, particularly descriptions that involve multiple anchors or are view-dependent. In response, we present the MiKASA (Multi-Key-Anchor Scene-Aware) Transformer. Our novel end-to-end trained model integrates a self-attention-based scene-aware object encoder and an original multi-key-anchor technique, enhancing object recognition accuracy and the understanding of …
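To make the task concrete: a minimal, illustrative sketch of anchor-based 3D grounding. This is not the MiKASA model; real systems score learned language and 3D point-cloud features, whereas here the query is assumed to be pre-parsed into a (target, relation, anchor) triple and the spatial relation is a hand-coded geometric test. All names (`SceneObject`, `ground`) are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class SceneObject:
    label: str       # object category, e.g. "chair"
    position: tuple  # (x, y, z) centroid in scene coordinates

def ground(target_label: str, anchor_label: str, relation: str,
           objects: list) -> SceneObject:
    """Resolve a query like 'the chair closest to the table'.

    Illustrative only: the 'language' side is assumed already parsed
    into (target, relation, anchor); a learned model would instead
    score every object against the full sentence.
    """
    anchors = [o for o in objects if o.label == anchor_label]
    candidates = [o for o in objects if o.label == target_label]
    if not anchors or not candidates:
        raise ValueError("scene must contain both categories")
    anchor = anchors[0]
    # "closest to": minimize squared Euclidean distance to the anchor
    if relation == "closest to":
        return min(candidates,
                   key=lambda o: sum((a - b) ** 2
                                     for a, b in zip(o.position,
                                                     anchor.position)))
    raise ValueError(f"unsupported relation: {relation}")

# Toy scene for "the chair closest to the table"
scene = [
    SceneObject("chair", (0.0, 0.0, 0.0)),
    SceneObject("chair", (3.0, 1.0, 0.0)),
    SceneObject("table", (2.5, 1.0, 0.0)),
]
print(ground("chair", "table", "closest to", scene).position)  # → (3.0, 1.0, 0.0)
```

View-dependent descriptions ("the chair on the left of the table") are harder because the correct answer changes with the observer's viewpoint, which is one of the failure modes the abstract highlights.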

