Feb. 9, 2024, 5:46 a.m. | Ivana Balažević, Yuge Shi, Pinelopi Papalampidi, Rahma Chaabouni, Skanda Koppula, Olivier J. Hénaff

cs.CV updates on arXiv.org

Most transformer-based video encoders are limited to short temporal contexts due to their quadratic complexity. While various attempts have been made to extend this context, this has often come at the cost of both conceptual and computational complexity. We propose to instead re-purpose existing pre-trained video transformers by simply fine-tuning them to attend to memories derived non-parametrically from past activations. By leveraging redundancy reduction, our memory-consolidated vision transformer (MC-ViT) effortlessly extends its context far into the past and exhibits excellent …
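The idea in the abstract, attending to a compact memory built from past activations without adding learned parameters, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not say how memories are consolidated, so k-means is used here purely as an assumed example of redundancy reduction, and the names `consolidate`, `attend`, and `encode_video` are hypothetical. The attention has a single head and no learned projections, so only the data flow is shown.

```python
import numpy as np


def consolidate(activations, num_memories, num_iters=5, seed=0):
    """Compress (n, d) activations into at most num_memories centroids.

    Assumption: plain k-means stands in for the consolidation step.
    """
    rng = np.random.default_rng(seed)
    n = activations.shape[0]
    k = min(num_memories, n)
    # Initialise centroids from randomly chosen activations.
    centroids = activations[rng.choice(n, size=k, replace=False)].copy()
    for _ in range(num_iters):
        # Squared distance from every activation to every centroid: (n, k).
        d2 = ((activations[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = d2.argmin(axis=1)
        for j in range(k):
            members = activations[assign == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return centroids


def attend(queries, context):
    """Softmax attention of queries over context (no learned weights)."""
    d = queries.shape[-1]
    scores = queries @ context.T / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ context


def encode_video(segments, num_memories=4):
    """Process segments in order; each attends to itself plus the memory."""
    memory = None
    outputs = []
    for tokens in segments:
        if memory is None:
            context = tokens
        else:
            context = np.concatenate([memory, tokens], axis=0)
        outputs.append(attend(tokens, context))
        # Consolidate this segment's activations and append them to memory.
        new_memory = consolidate(tokens, num_memories)
        if memory is None:
            memory = new_memory
        else:
            memory = np.concatenate([memory, new_memory], axis=0)
    return outputs, memory


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    video = [rng.normal(size=(16, 8)) for _ in range(3)]  # 3 segments, 16 tokens each
    outs, mem = encode_video(video, num_memories=4)
    print(len(outs), outs[0].shape, mem.shape)
```

In this sketch each segment attends to its own 16 tokens plus 4 memory entries per earlier segment, so the attention context grows far more slowly than it would if every past token were kept.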

