ViT-Lens: Initiating Omni-Modal Exploration through 3D Insights
March 27, 2024, 4:46 a.m. | Weixian Lei, Yixiao Ge, Jianfeng Zhang, Dylan Sun, Kun Yi, Ying Shan, Mike Zheng Shou
cs.CV updates on arXiv.org
Abstract: Despite the success of CLIP-based training recipes for vision-language models, their scalability to more modalities (e.g., 3D, audio, etc.) depends on large-scale paired data, which is expensive or even unobtainable for rare modalities. In this paper, we present ViT-Lens, which facilitates efficient omni-modal representation learning by perceiving novel modalities with a pretrained ViT and aligning them to a pre-defined space. Specifically, a modality-specific lens is tuned to project multimodal signals into the shared embedding space, which …
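The abstract outlines the core mechanism: a small, trainable lens maps signals from a new modality into the token space of a frozen, pretrained ViT, and the resulting features are aligned to a pre-defined embedding space (e.g., CLIP's) with a contrastive objective. The sketch below is a minimal, hypothetical PyTorch illustration of that idea, not the authors' implementation; the point-cloud lens design, module names, dimensions, and the InfoNCE-style loss are all assumptions for demonstration.

# Hypothetical sketch of the ViT-Lens idea (not the authors' code): a trainable
# "lens" maps a novel modality (here, a 3D point cloud) into the token space of
# a frozen, pretrained ViT, and the output is aligned to an anchor embedding
# (e.g., a CLIP image/text feature) with a contrastive-style loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PointCloudLens(nn.Module):
    """Modality-specific lens: point cloud -> sequence of ViT-style tokens."""

    def __init__(self, num_tokens=32, token_dim=768):
        super().__init__()
        # Per-point feature extractor followed by learned query tokens that
        # cross-attend to the point features, producing a fixed-length sequence.
        self.point_mlp = nn.Sequential(
            nn.Linear(3, 256), nn.GELU(), nn.Linear(256, token_dim)
        )
        self.queries = nn.Parameter(torch.randn(num_tokens, token_dim))
        self.attn = nn.MultiheadAttention(token_dim, num_heads=8, batch_first=True)

    def forward(self, points):                    # points: (B, N, 3)
        feats = self.point_mlp(points)            # (B, N, D)
        q = self.queries.expand(points.size(0), -1, -1)
        tokens, _ = self.attn(q, feats, feats)    # queries read the point features
        return tokens                             # (B, num_tokens, D)


class ViTLensModel(nn.Module):
    """Trainable lens + frozen pretrained ViT trunk + projection head."""

    def __init__(self, vit_encoder, token_dim=768, embed_dim=512):
        super().__init__()
        self.lens = PointCloudLens(token_dim=token_dim)
        self.vit = vit_encoder                    # assumed pretrained; kept frozen
        for p in self.vit.parameters():
            p.requires_grad = False
        self.proj = nn.Linear(token_dim, embed_dim)

    def forward(self, points):
        tokens = self.lens(points)
        encoded = self.vit(tokens)                # reuse the frozen ViT blocks
        pooled = encoded.mean(dim=1)              # simple mean pooling over tokens
        return F.normalize(self.proj(pooled), dim=-1)


def alignment_loss(modality_emb, anchor_emb, temperature=0.07):
    """InfoNCE-style loss pulling each 3D embedding toward its paired anchor."""
    logits = modality_emb @ anchor_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)


if __name__ == "__main__":
    # Stand-in for a pretrained ViT trunk; in practice one would load an
    # actual checkpoint (e.g., via timm or open_clip).
    vit_stub = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
        num_layers=2,
    )
    model = ViTLensModel(vit_stub)
    points = torch.randn(4, 1024, 3)              # batch of 4 point clouds
    anchors = F.normalize(torch.randn(4, 512), dim=-1)
    print(alignment_loss(model(points), anchors).item())

Only the lens and projection head are trained here; freezing the ViT is what lets a pretrained vision backbone be reused for new modalities without large-scale paired data.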