May 8, 2024, 4:46 a.m. | Dong Lao, Yangchao Wu, Tian Yu Liu, Alex Wong, Stefano Soatto

cs.CV updates on arXiv.org

arXiv:2310.03967v2 Announce Type: replace
Abstract: Vision Transformer (ViT) architectures represent images as collections of high-dimensional vectorized tokens, each corresponding to a rectangular non-overlapping patch. This representation trades spatial granularity for embedding dimensionality, and results in semantically rich but spatially coarsely quantized feature maps. To retrieve spatial details beneficial to fine-grained inference tasks, we propose a training-free method inspired by "stochastic resonance". Specifically, we apply sub-token spatial transformations to the input data, and aggregate the resulting ViT features after …
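The abstract is truncated, but the mechanism it describes (sub-token shifts of the input followed by aggregation of the resulting ViT features) can be illustrated with a minimal PyTorch sketch. Everything here is an assumption beyond the shift-and-aggregate idea: `vit_features` is a hypothetical stand-in for any ViT backbone whose patch tokens are reshaped into a 2D grid, the 16-pixel patch size is assumed, `torch.roll` is used as a simple translation, and averaging is one plausible aggregation rule.

import torch
import torch.nn.functional as F

def stochastic_resonance_features(vit_features, image, shifts, patch=16):
    """Sketch: aggregate ViT feature maps over sub-token input shifts.

    vit_features: assumed callable mapping an image batch (B, 3, H, W)
        to a coarse feature map (B, C, H // patch, W // patch);
        a hypothetical stand-in for a real ViT backbone.
    shifts: iterable of (dy, dx) pixel offsets, each smaller than `patch`.
    """
    B, _, H, W = image.shape
    acc = None
    for dy, dx in shifts:
        # Translate the input by a sub-token (sub-patch) pixel offset.
        shifted = torch.roll(image, shifts=(dy, dx), dims=(2, 3))
        feat = vit_features(shifted)  # (B, C, H // patch, W // patch)
        # Upsample the coarse token grid back to pixel resolution.
        up = F.interpolate(feat, size=(H, W), mode="bilinear",
                           align_corners=False)
        # Undo the shift so features realign with the original image.
        up = torch.roll(up, shifts=(-dy, -dx), dims=(2, 3))
        acc = up if acc is None else acc + up
    # Averaging over the perturbations yields a feature map with
    # finer effective spatial resolution than any single forward pass.
    return acc / len(shifts)

For example, sampling offsets on a 4-pixel grid inside one 16-pixel patch, `shifts = [(dy, dx) for dy in range(0, 16, 4) for dx in range(0, 16, 4)]`, trades 16 forward passes for a spatially sharper feature map, at no training cost.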

