Feb. 7, 2024, 5:47 a.m. | Zijun Long, Zaiqiao Meng, Gerardo Aragon Camarasa, Richard McCreadie

cs.CV updates on arXiv.org

Vision Transformers (ViTs) have emerged as popular models in computer vision, demonstrating state-of-the-art performance across various tasks. This success typically follows a two-stage strategy: pre-training on large-scale datasets with self-supervised signals, such as reconstructing randomly masked patches, followed by fine-tuning on task-specific labeled datasets with a cross-entropy loss. However, this reliance on cross-entropy loss has been identified as a limiting factor in ViTs, affecting their generalization and transferability to downstream tasks. Addressing this critical challenge, we introduce a novel Label-aware Contrastive …
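
Since the abstract stops before describing the method, here is a minimal sketch of what a label-aware (supervised) contrastive fine-tuning objective typically looks like, in the spirit of SupCon-style losses: embeddings that share a class label are pulled together, all other samples in the batch are pushed apart. The function name, temperature default, and exact formulation below are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(features: torch.Tensor,
                                labels: torch.Tensor,
                                temperature: float = 0.1) -> torch.Tensor:
    """Generic label-aware contrastive loss (illustrative sketch).

    features: (N, D) batch of embeddings, e.g. ViT [CLS] outputs
    labels:   (N,) integer class labels
    """
    # Work in cosine-similarity space.
    features = F.normalize(features, dim=1)
    sim = features @ features.T / temperature  # (N, N) similarity logits

    # Exclude each sample's similarity with itself.
    n = features.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=features.device)
    sim = sim.masked_fill(self_mask, float("-inf"))

    # Positives are pairs that share a label (self excluded).
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask

    # Log-probability of each pair under a softmax over the batch,
    # then average the log-probs of the positive pairs per anchor.
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_counts = pos_mask.sum(dim=1).clamp(min=1)  # avoid div-by-zero
    per_anchor = -torch.where(pos_mask, log_prob,
                              torch.zeros_like(log_prob)).sum(dim=1) / pos_counts
    return per_anchor.mean()

# Hypothetical usage: fine-tune an encoder's embeddings with this loss
# in place of (or alongside) cross-entropy.
features = torch.randn(32, 256)        # a batch of 32 embeddings
labels = torch.randint(0, 10, (32,))   # 10 hypothetical classes
print(supervised_contrastive_loss(features, labels))
```

The design point this illustrates is the one the abstract motivates: unlike cross-entropy, which only scores each sample against its own label, a label-aware contrastive loss shapes the geometry of the embedding space directly, which is the property usually credited for better transferability.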
