April 26, 2024, 4:45 a.m. | Niclas Popp, Jan Hendrik Metzen, Matthias Hein

cs.CV updates on arXiv.org

arXiv:2404.16637v1 Announce Type: new
Abstract: Multi-modal foundation models such as CLIP have showcased impressive zero-shot capabilities. However, their applicability in resource-constrained environments is limited due to their large number of parameters and high inference time. While existing approaches have scaled down the entire CLIP architecture, we focus on training smaller variants of the image encoder, which suffices for efficient zero-shot classification. The use of synthetic data has shown promise in distilling representations from larger teachers, resulting in strong few-shot and …
