Feb. 7, 2024, 5:47 a.m. | Trilok Padhi, Ugur Kursuncu, Yaman Kumar, Valerie L. Shalin, Lane Peterson Fronczek

cs.CV updates on arXiv.org

The prevalence of smart devices that can capture moments in multiple modalities has enabled users to experience multimodal information online. However, Large Language Models (LLMs) and Large Vision Models (LVMs) are still limited in capturing holistic meaning with cross-modal semantic relationships. Without explicit commonsense knowledge (e.g., a knowledge graph), Visual Language Models (VLMs) learn only implicit representations by capturing high-level patterns in vast corpora, missing essential contextual cross-modal cues. In this work, we design a framework to couple …
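Although the abstract is truncated, it points to a framework that couples explicit knowledge-graph (KG) signals with a VLM's implicit representations. The sketch below is a minimal illustration of one such coupling, not the authors' implementation: it fuses CLIP image/text embeddings with a precomputed KG entity embedding via concatenation and a small MLP head. The CLIP checkpoint, the fusion head, and the KG embedding source are all assumptions.

```python
# Illustrative sketch only: one plausible way to couple explicit knowledge-graph
# (KG) embeddings with a vision-language model, per the abstract's framing.
# The CLIP checkpoint, concatenation fusion, and KG source are assumptions.
import torch
import torch.nn as nn
from transformers import CLIPModel

class KGFusedVLM(nn.Module):
    def __init__(self, kg_dim: int = 300, num_classes: int = 2):
        super().__init__()
        self.vlm = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
        proj_dim = self.vlm.config.projection_dim  # 512 for this checkpoint
        # Fuse image, text, and KG vectors by concatenation + a small MLP head.
        self.fusion = nn.Sequential(
            nn.Linear(2 * proj_dim + kg_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, pixel_values, input_ids, attention_mask, kg_embedding):
        # Implicit cross-modal representations from the pretrained VLM.
        img = self.vlm.get_image_features(pixel_values=pixel_values)
        txt = self.vlm.get_text_features(input_ids=input_ids,
                                         attention_mask=attention_mask)
        # Explicit commonsense signal, e.g. averaged ConceptNet Numberbatch
        # vectors (dim 300) for entities in the caption (an assumption).
        fused = torch.cat([img, txt, kg_embedding], dim=-1)
        return self.fusion(fused)
```

In this hypothetical setup, `kg_embedding` would come from an external commonsense resource such as ConceptNet, retrieved for entities detected in the image or caption; the paper's actual coupling mechanism may differ.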

