March 27, 2024, 4:46 a.m. | Weixian Lei, Yixiao Ge, Jianfeng Zhang, Dylan Sun, Kun Yi, Ying Shan, Mike Zheng Shou

cs.CV updates on arXiv.org

arXiv:2308.10185v2 Announce Type: replace
Abstract: Despite the success of CLIP-based training recipes for vision-language models, their scalability to more modalities (e.g., 3D, audio) is limited by the need for large-scale data, which is expensive or even unavailable for rare modalities. In this paper, we present ViT-Lens, which facilitates efficient omni-modal representation learning by perceiving novel modalities with a pretrained ViT and aligning them to a pre-defined space. Specifically, a modality-specific lens is tuned to project multimodal signals into the shared embedding space, which …
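
The abstract describes the mechanism only at a high level. As a rough illustration of that idea (not the authors' implementation), the sketch below shows one way a small trainable "lens" could map tokens from a new modality (e.g., a point cloud) into the token space of a frozen pretrained ViT, with the ViT output then aligned to a fixed CLIP-like anchor space via a contrastive loss. All module names, dimensions, the latent cross-attention design, and the stand-in frozen encoder are assumptions made for illustration.

```python
# Hypothetical sketch of the ViT-Lens idea (not the authors' code):
# a trainable "lens" projects new-modality tokens into a frozen ViT,
# and the pooled ViT output is aligned to a fixed CLIP-like space.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityLens(nn.Module):
    """Trainable adapter: modality tokens -> ViT token space (assumed design)."""
    def __init__(self, in_dim: int, vit_dim: int, num_latents: int = 16):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, vit_dim) * 0.02)
        self.proj = nn.Linear(in_dim, vit_dim)
        self.attn = nn.MultiheadAttention(vit_dim, num_heads=8, batch_first=True)

    def forward(self, x):                      # x: (B, N, in_dim) modality tokens
        kv = self.proj(x)                      # project into the ViT width
        q = self.latents.expand(x.size(0), -1, -1)
        out, _ = self.attn(q, kv, kv)          # latents cross-attend to modality tokens
        return out                             # (B, num_latents, vit_dim)

class ViTLensSketch(nn.Module):
    def __init__(self, in_dim=256, vit_dim=768, clip_dim=512):
        super().__init__()
        self.lens = ModalityLens(in_dim, vit_dim)
        # Stand-in for a pretrained ViT encoder; kept frozen while the lens is tuned.
        enc_layer = nn.TransformerEncoderLayer(vit_dim, nhead=12, batch_first=True)
        self.vit = nn.TransformerEncoder(enc_layer, num_layers=2)
        for p in self.vit.parameters():
            p.requires_grad = False
        self.head = nn.Linear(vit_dim, clip_dim)   # maps into the anchor (CLIP-like) space

    def forward(self, modality_tokens):
        tokens = self.lens(modality_tokens)
        feats = self.vit(tokens).mean(dim=1)        # pooled representation
        return F.normalize(self.head(feats), dim=-1)

def contrastive_loss(z_modality, z_anchor, temperature=0.07):
    """InfoNCE-style alignment between new-modality and anchor embeddings."""
    logits = z_modality @ z_anchor.t() / temperature
    targets = torch.arange(z_modality.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)

if __name__ == "__main__":
    model = ViTLensSketch()
    x = torch.randn(4, 128, 256)                   # e.g., 4 point clouds, 128 tokens each
    z_anchor = F.normalize(torch.randn(4, 512), dim=-1)  # paired anchor-space embeddings
    print(contrastive_loss(model(x), z_anchor).item())
```

Only the lens (and projection head) is trained here, which is what makes the recipe cheap for rare modalities: the pretrained ViT and the anchor embedding space stay fixed.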

