Feb. 29, 2024, 9:10 p.m. | /u/PM_ME_JOB_OFFER

Machine Learning www.reddit.com

I'm still reading around, but nearly every computer vision paper I come across uses ViT as its backbone rather than Swin or similar architectures — why is that?



The ViT paper had to pre-train on the 303M-image JFT dataset to beat earlier convolutional models on ImageNet, whereas Swin achieves better performance without any such large-scale pre-training. I imagine Swin would reach comparable, if not higher, performance on ImageNet if it were pre-trained the same way, though admittedly I haven't seen …
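For context on one architectural difference between the two (separate from the pre-training question): ViT applies global self-attention over all patch tokens, while Swin restricts attention to local windows, which changes how the cost scales with resolution. A rough back-of-the-envelope sketch, using the standard configurations (ViT-B/16: 16×16 patches on a 224×224 input, giving 196 patch tokens, ignoring the class token; Swin-T: 4×4 patches and 7×7-token windows):

```python
def attention_pairs(num_tokens, window=None):
    """Count query-key pairs scored in one self-attention layer.

    Global attention (ViT-style): every token attends to every token,
    so the count is quadratic in the number of tokens.
    Windowed attention (Swin-style): tokens attend only within
    non-overlapping windows of `window` tokens, so the count grows
    linearly with the number of windows.
    """
    if window is None:
        return num_tokens * num_tokens
    num_windows = num_tokens // window
    return num_windows * window * window

# ViT-B/16 on a 224x224 image: (224/16)^2 = 196 patch tokens, global attention
vit_pairs = attention_pairs(196)               # 196^2 = 38_416 pairs
# Swin on the same image: (224/4)^2 = 3136 tokens, 7x7 = 49-token windows
swin_pairs = attention_pairs(3136, window=49)  # 64 windows * 49^2 = 153_664 pairs
```

Swin's windowing is what lets it keep many more, finer-grained tokens at the same layer while the per-layer attention cost stays linear in image area; with global attention, 3136 tokens would need ~9.8M pairs per layer.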

