Simple Open-Vocabulary Object Detection with Vision Transformers

Web: http://arxiv.org/abs/2205.06230

May 13, 2022, 1:10 a.m. | Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, et al.

cs.CV updates on arXiv.org

Combining simple architectures with large-scale pre-training has led to
massive improvements in image classification. For object detection,
pre-training and scaling approaches are less well established, especially in
the long-tailed and open-vocabulary setting, where training data is relatively
scarce. In this paper, we propose a strong recipe for transferring image-text
models to open-vocabulary object detection. We use a standard Vision
Transformer architecture with minimal modifications, contrastive image-text
pre-training, and end-to-end detection fine-tuning. Our analysis of the scaling
properties of this setup …
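To make the described recipe concrete, here is a minimal sketch of the core open-vocabulary scoring step: per-box embeddings from a contrastively pre-trained image tower are matched against text embeddings of free-text class names via scaled cosine similarity. This is an illustration under assumed names and shapes, not the paper's actual implementation; `open_vocab_logits`, the embedding dimension, and the temperature value are all assumptions.

```python
# Minimal sketch (not the authors' code) of open-vocabulary detection scoring:
# compare per-box image embeddings to text embeddings of class-name prompts.
import torch
import torch.nn.functional as F

def open_vocab_logits(box_embeds: torch.Tensor,
                      text_embeds: torch.Tensor,
                      temperature: float = 0.07) -> torch.Tensor:
    """Score each predicted box against each query class name.

    box_embeds:  [num_boxes, dim]   per-box embeddings from the image tower
    text_embeds: [num_classes, dim] embeddings of the class-name prompts
    Returns:     [num_boxes, num_classes] similarity logits
    """
    # L2-normalize so the dot product is a cosine similarity,
    # then scale by a temperature as in contrastive pre-training.
    box_embeds = F.normalize(box_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    return box_embeds @ text_embeds.T / temperature

# Illustrative usage with random tensors standing in for real encoder outputs.
boxes = torch.randn(100, 512)   # e.g. 100 box proposals
queries = torch.randn(5, 512)   # e.g. 5 free-text class names
logits = open_vocab_logits(boxes, queries)
print(logits.shape)             # torch.Size([100, 5])
```

Because the class set enters only through the text embeddings, new categories can be queried at inference time without retraining the detector, which is what makes the setup open-vocabulary.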

arxiv cv detection open transformers vision
