June 27, 2024, 4:47 a.m. | Ke Fan, Zechen Bai, Tianjun Xiao, Dominik Zietlow, Max Horn, Zixu Zhao, Carl-Johann Simon-Gabriel, Mike Zheng Shou, Francesco Locatello, Bernt Schiele

cs.CV updates on arXiv.org arxiv.org

arXiv:2309.09858v2 Announce Type: replace
Abstract: In this paper, we show that recent advances in video representation learning and pre-trained vision-language models allow for substantial improvements in self-supervised video object localization. We propose a method that first localizes objects in videos via an object-centric approach with slot attention and then assigns text to the obtained slots. The latter is achieved by an unsupervised way to read localized semantic information from the pre-trained CLIP model. The resulting video object localization is entirely …

abstract advances arxiv attention cs.cv improvements language language models localization object objects paper replace representation representation learning show text type unsupervised via video videos vision vision-language vision-language models

VP, Enterprise Applications

@ Blue Yonder | Scottsdale

Data Scientist - Moloco Commerce Media

@ Moloco | Redwood City, California, United States

Senior Backend Engineer (New York)

@ Kalepa | New York City. Hybrid

Senior Backend Engineer (USA)

@ Kalepa | New York City. Remote US.

Senior Full Stack Engineer (USA)

@ Kalepa | New York City. Remote US.

Senior Full Stack Engineer (New York)

@ Kalepa | New York City., Hybrid