June 10, 2024, 4:48 a.m. | Feiyu Pan, Hao Fang, Xiankai Lu

cs.CV updates on arXiv.org

arXiv:2406.04842v1 Announce Type: new
Abstract: Referring video object segmentation (RVOS) relies on natural language expressions to segment target objects in video, which places the emphasis on modeling dense text-video relations. Current RVOS methods typically use independently pre-trained vision and language models as backbones, resulting in a significant domain gap between video and text. In cross-modal feature interaction, text features are used only for query initialization, so important information in the text is not fully utilized. In this work, we propose using frozen pre-trained …


