all AI news
DeiT-LT Distillation Strikes Back for Vision Transformer Training on Long-Tailed Datasets
April 4, 2024, 4:42 a.m. | Harsh Rangwani, Pradipto Mondal, Mayank Mishra, Ashish Ramayee Asokan, R. Venkatesh Babu
cs.LG updates on arXiv.org arxiv.org
Abstract: Vision Transformer (ViT) has emerged as a prominent architecture for various computer vision tasks. In ViT, we divide the input image into patch tokens and process them through a stack of self attention blocks. However, unlike Convolutional Neural Networks (CNN), ViTs simple architecture has no informative inductive bias (e.g., locality,etc. ). Due to this, ViT requires a large amount of data for pre-training. Various data efficient approaches (DeiT) have been proposed to train ViT on …
abstract architecture arxiv attention cnn computer computer vision convolutional neural networks cs.ai cs.cv cs.lg datasets distillation however image networks neural networks process simple stack strikes tasks them through tokens training transformer type vision vit
More from arxiv.org / cs.LG updates on arXiv.org
Jobs in AI, ML, Big Data
Founding AI Engineer, Agents
@ Occam AI | New York
AI Engineer Intern, Agents
@ Occam AI | US
AI Research Scientist
@ Vara | Berlin, Germany and Remote
Data Architect
@ University of Texas at Austin | Austin, TX
Data ETL Engineer
@ University of Texas at Austin | Austin, TX
Lead GNSS Data Scientist
@ Lurra Systems | Melbourne