Autonomous visual information seeking with large language models
Google AI Blog (ai.googleblog.com)
There has been great progress in adapting large language models (LLMs) to accommodate multimodal inputs for tasks such as image captioning, visual question answering (VQA), and open-vocabulary recognition. Despite these achievements, current state-of-the-art vision-language models (VLMs) perform inadequately on visual information seeking datasets, such as Infoseek and OK-VQA, where external knowledge is required to answer the questions.
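To make the distinction concrete, here is a minimal sketch contrasting a perception-only VQA query with a visual information seeking query in the style of OK-VQA and Infoseek. The filenames and question wordings are hypothetical illustrations, not taken from either dataset.

```python
# Hypothetical examples illustrating why visual information seeking
# is harder than perception-only VQA (filenames/questions invented).
perception_only_query = {
    "image": "bird_photo.jpg",  # hypothetical filename
    "question": "What color is the bird?",
    # Answerable directly from the pixels; no outside knowledge needed.
}

info_seeking_query = {
    "image": "bird_photo.jpg",  # same hypothetical image
    "question": "On which continent does this bird species primarily breed?",
    # Requires first recognizing the species from the image, then
    # consulting external knowledge (e.g., a knowledge base or web
    # search) that is not present in the pixels themselves.
}

def needs_external_knowledge(query: dict) -> bool:
    """Toy heuristic marking queries whose answers are not visible
    in the image (a stand-in for the real dataset annotations)."""
    knowledge_cues = ("species", "breed", "invented", "founded", "named after")
    return any(cue in query["question"].lower() for cue in knowledge_cues)
```

The gap the blurb describes is exactly the second kind of query: a VLM may caption the image correctly yet still lack the factual knowledge needed to answer.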