The Vision Transformer architecture has proven competitive in computer vision
(CV), where it has dethroned convolution-based networks on
several benchmarks. Nevertheless, Convolutional Neural Networks (CNNs) remain
the preferred architecture for the representation module in Reinforcement
Learning. In this work, we study pretraining a Vision Transformer using several
state-of-the-art self-supervised methods and assess the data-efficiency gains from
this training framework. We propose a new self-supervised learning method
called TOV-VICReg that extends VICReg to better capture temporal relations
between …

