Feb. 13, 2024, 5:44 a.m. | Siddharth Karamcheti Suraj Nair Ashwin Balakrishna Percy Liang Thomas Kollar Dorsa Sadigh

cs.LG updates on arXiv.org

Visually-conditioned language models (VLMs) have seen growing adoption in applications such as visual dialogue, scene understanding, and robotic task planning; this adoption has fueled a wealth of new models such as LLaVa, InstructBLIP, and PaLI-3. Despite the volume of new releases, key design decisions around image preprocessing, architecture, and optimization remain under-explored, making it challenging to understand which factors account for model performance, a challenge further complicated by the lack of objective, consistent evaluations. To address these gaps, we …

