March 14, 2024, 4:47 a.m. | Boyuan Zheng, Boyu Gou, Jihyung Kil, Huan Sun, Yu Su

cs.CV updates on arXiv.org arxiv.org

arXiv:2401.01614v2 Announce Type: replace-cross
Abstract: The recent development on large multimodal models (LMMs), especially GPT-4V(ision) and Gemini, has been quickly expanding the capability boundaries of multimodal models beyond traditional tasks like image captioning and visual question answering. In this work, we explore the potential of LMMs like GPT-4V as a generalist web agent that can follow natural language instructions to complete tasks on any given website. We propose SEEACT, a generalist web agent that harnesses the power of LMMs for …

agent arxiv cs.ai cs.cl cs.cv cs.ir gpt gpt-4v type web

Senior Machine Learning Engineer

@ GPTZero | Toronto, Canada

ML/AI Engineer / NLP Expert - Custom LLM Development (x/f/m)

@ HelloBetter | Remote

Senior Research Engineer/Specialist - Motor Mechanical Design

@ GKN Aerospace | Bristol, GB

Research Engineer (Motor Mechanical Design)

@ GKN Aerospace | Bristol, GB

Senior Research Engineer (Electromagnetic Design)

@ GKN Aerospace | Bristol, GB

Associate Research Engineer Clubs | Titleist

@ Acushnet Company | Carlsbad, CA, United States