Web: http://arxiv.org/abs/2206.06346

June 16, 2022, 1:13 a.m. | Elad Ben-Avraham, Roei Herzig, Karttikeya Mangalam, Amir Bar, Anna Rohrbach, Leonid Karlinsky, Trevor Darrell, Amir Globerson

cs.CV updates on arXiv.org

Recent action recognition models have achieved impressive results by
integrating objects, their locations and interactions. However, obtaining dense
structured annotations for each frame is tedious and time-consuming, making
these methods expensive to train and less scalable. At the same time, if a
small set of annotated images is available, either within or outside the domain
of interest, how could we leverage these for a video downstream task? We
propose a learning framework, StructureViT (SViT for short), which demonstrates
how utilizing …

Tags: arxiv, clip, cv, image, tokens, video

