Jan. 24, 2022, 2:10 a.m. | Kishaan Jeeveswaran, Senthilkumar Kathiresan, Arnav Varma, Omar Magdy, Bahram Zonooz, Elahe Arani

Convolutional Neural Networks (CNNs), architectures consisting of
convolutional layers, have been the standard choice in vision tasks. Recent
studies have shown that Vision Transformers (VTs), architectures based on
self-attention modules, achieve comparable performance in challenging tasks
such as object detection and semantic segmentation. However, the image
processing mechanism of VTs differs from that of conventional CNNs. This
raises several questions about their generalizability, robustness, reliability,
and texture bias when used to extract features for complex tasks. To address
these …

