Feb. 13, 2024, 5:44 a.m. | Soroush Nasiriany Fei Xia Wenhao Yu Ted Xiao Jacky Liang Ishita Dasgupta Annie Xie Danny Driess

cs.LG updates on arXiv.org

Vision language models (VLMs) have shown impressive capabilities across a variety of tasks, from logical reasoning to visual understanding. This opens the door to richer forms of interaction with the world, such as robotic control. However, VLMs produce only textual outputs, while robotic control and other spatial tasks require outputting continuous coordinates, actions, or trajectories. How can we enable VLMs to handle such settings without fine-tuning on task-specific data?
In this paper, we propose a novel visual prompting approach for VLMs that …
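The abstract is truncated here, but the keywords ("pivot", "iterative", "visual prompting") suggest an iterative visual-prompting loop: draw candidate actions as numbered markers on the image, ask the VLM to pick the most promising ones, refit the sampling distribution around its choices, and repeat. The sketch below is only an illustration of that general loop under stated assumptions, not the paper's implementation: the real image annotation and VLM query are replaced by a toy `score` function standing in for the model's preference, and all names (`pivot_sketch`, `score`) are hypothetical.

```python
import random
import statistics

def pivot_sketch(score, center=0.5, spread=0.5, iters=3, n=8, k=3, seed=0):
    """Iterative visual-prompting loop (sketch).

    Each round: sample n candidate 1-D actions from a Gaussian
    (standing in for numbered markers drawn on an image), let a
    'VLM' keep the top-k by score (standing in for the model's
    textual choice of marker numbers), then refit the Gaussian to
    the survivors and repeat.
    """
    rng = random.Random(seed)
    for _ in range(iters):
        candidates = [rng.gauss(center, spread) for _ in range(n)]
        # Stand-in for: annotate image with markers, query the VLM.
        best = sorted(candidates, key=score, reverse=True)[:k]
        center = statistics.mean(best)
        spread = max(statistics.stdev(best), 1e-3)  # avoid collapse
    return center

# Toy preference: the 'VLM' favors actions near 0.8.
target = 0.8
estimate = pivot_sketch(lambda a: -abs(a - target))
```

Because the textual model only ever has to name a marker, the loop sidesteps the need for the VLM to emit continuous coordinates directly; the optimization over the continuous action space happens in the resampling step.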
