Web: http://arxiv.org/abs/2206.11895

June 24, 2022, 1:12 a.m. | Jinghuan Shang, Srijan Das, Michael S. Ryoo

cs.CV updates on arXiv.org

Humans are remarkably flexible in understanding viewpoint changes because the
visual cortex supports the perception of 3D structure. In contrast, most
computer vision models that learn visual representations from a pool of 2D
images often fail to generalize to novel camera viewpoints. Recently, vision
architectures have shifted towards convolution-free visual Transformers,
which operate on tokens derived from image patches. However, neither these
Transformers nor 2D convolutional networks perform explicit operations to
learn viewpoint-agnostic representations for visual …
