all AI news
Uniform Masking: Enabling MAE Pre-training for Pyramid-based Vision Transformers with Locality. (arXiv:2205.10063v1 [cs.CV])
cs.CV updates on arXiv.org
Masked AutoEncoder (MAE) has recently led the trend in visual self-supervised learning with an elegant asymmetric encoder-decoder design, which significantly improves both pre-training efficiency and fine-tuning accuracy. Notably, the success of the asymmetric structure relies on the "global" property of the vanilla Vision Transformer (ViT), whose self-attention mechanism reasons over an arbitrary subset of discrete image patches. However, it is still unclear how the more advanced Pyramid-based ViTs (e.g., PVT, Swin) can be adopted in MAE pre-training, as they commonly introduce operators …
arxiv cv enabling pre-training training transformers uniform vision