March 13, 2024, 4:42 a.m. | Chenyu You, Yifei Min, Weicheng Dai, Jasjeet S. Sekhon, Lawrence Staib, James S. Duncan

cs.LG updates on arXiv.org

arXiv:2403.07241v1 Announce Type: cross
Abstract: Fine-tuning pre-trained vision-language models, like CLIP, has yielded success on diverse downstream tasks. However, several pain points persist for this paradigm: (i) directly tuning entire pre-trained models becomes both time-intensive and computationally costly. Additionally, these tuned models tend to become highly specialized, limiting their practicality for real-world deployment; (ii) recent studies indicate that pre-trained vision-language classifiers may overly depend on spurious features -- patterns that correlate with the target in training data, but are not …

