Feb. 13, 2024, 5:44 a.m. | Soroush Nasiriany, Fei Xia, Wenhao Yu, Ted Xiao, Jacky Liang, Ishita Dasgupta, Annie Xie, Danny Driess

cs.LG updates on arXiv.org arxiv.org

Vision language models (VLMs) have shown impressive capabilities across a variety of tasks, from logical reasoning to visual understanding. This opens the door to richer interaction with the world, for example in robotic control. However, VLMs produce only textual outputs, while robotic control and other spatial tasks require continuous outputs such as coordinates, actions, or trajectories. How can we enable VLMs to handle such settings without fine-tuning on task-specific data?
In this paper, we propose a novel visual prompting approach for VLMs that …
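The abstract is truncated, but based on the setup it describes (eliciting continuous spatial outputs from a model that only emits text), a minimal sketch of one way such an iterative visual prompting loop could work is shown below. The `query_vlm` callable, the Gaussian resampling schedule, and the numbered-marker annotation style are all assumptions for illustration, not the paper's actual method.

```python
# Hedged sketch of an iterative visual prompting loop for spatial outputs.
# `query_vlm` is a hypothetical callable (image, prompt) -> text answer;
# it stands in for whatever VLM API is actually used.

import numpy as np
from PIL import Image, ImageDraw

def annotate_candidates(image: Image.Image, points: np.ndarray) -> Image.Image:
    """Draw numbered markers at candidate 2D points so the VLM can refer to them by label."""
    annotated = image.copy()
    draw = ImageDraw.Draw(annotated)
    for idx, (x, y) in enumerate(points):
        draw.ellipse([x - 8, y - 8, x + 8, y + 8], outline="red", width=2)
        draw.text((x + 10, y - 10), str(idx), fill="red")
    return annotated

def iterative_visual_prompting(image, task, query_vlm, n_candidates=8, n_iters=3):
    """Ask the VLM to pick among annotated candidate points, then resample around its choice."""
    w, h = image.width, image.height
    mean = np.array([w / 2, h / 2], dtype=float)
    std = np.array([w / 4, h / 4], dtype=float)
    for _ in range(n_iters):
        # Sample candidate 2D targets and clip them to the image bounds.
        points = np.random.normal(mean, std, size=(n_candidates, 2))
        points = np.clip(points, [0, 0], [w - 1, h - 1])
        prompt = (f"Task: {task}. The image shows numbered candidate points. "
                  f"Answer with the number of the best point.")
        # Assumes the VLM's text answer can be parsed as an integer label.
        choice = int(query_vlm(annotate_candidates(image, points), prompt))
        # Shrink the sampling distribution around the selected candidate.
        mean = points[choice]
        std = std * 0.5
    return mean  # final 2D coordinate, usable as a spatial target
```

The key design choice in this sketch is that the continuous decision is reframed as a discrete multiple-choice question rendered directly onto the image, which a text-only VLM can answer without any fine-tuning; iterating the sample-select-refit loop narrows the answer toward a precise coordinate.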

