April 15, 2024, 4:45 a.m. | Timothée Darcet, Maxime Oquab, Julien Mairal, Piotr Bojanowski

cs.CV updates on arXiv.org

arXiv:2309.16588v2 Announce Type: replace
Abstract: Transformers have recently emerged as a powerful tool for learning visual representations. In this paper, we identify and characterize artifacts in feature maps of both supervised and self-supervised ViT networks. The artifacts correspond to high-norm tokens appearing during inference primarily in low-informative background areas of images, that are repurposed for internal computations. We propose a simple yet effective solution based on providing additional tokens to the input sequence of the Vision Transformer to fill that …
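The two ideas in the abstract, spotting high-norm tokens and adding extra tokens to the input sequence, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names, the norm threshold, the placement of the extra tokens at the end of the sequence, and the step that drops them from the output are all assumptions made for illustration. In a real model the extra tokens would be learned parameters and the sequence would pass through the transformer layers between the two steps.

```python
import math


def token_norms(tokens):
    """Return the L2 norm of each token (a token is a list of floats)."""
    return [math.sqrt(sum(x * x for x in tok)) for tok in tokens]


def high_norm_indices(tokens, threshold):
    """Indices of tokens whose norm exceeds `threshold`.

    The threshold is a hypothetical cutoff chosen by the caller.
    """
    return [i for i, n in enumerate(token_norms(tokens)) if n > threshold]


def add_extra_tokens(patch_tokens, extra_tokens):
    """Append extra tokens to the patch sequence (placement is an assumption)."""
    return list(patch_tokens) + list(extra_tokens)


def drop_extra_tokens(output_tokens, num_extra):
    """Assumption: the extra tokens are discarded from the output sequence."""
    return list(output_tokens[:len(output_tokens) - num_extra])
```

For example, with `patches = [[3.0, 4.0], [0.1, 0.2], [30.0, 40.0]]`, calling `high_norm_indices(patches, 10.0)` flags only the last token, and `drop_extra_tokens(add_extra_tokens(patches, extras), len(extras))` returns the original patch sequence.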
