RaP: Redundancy-aware Video-language Pre-training for Text-Video Retrieval. (arXiv:2210.06881v1 [cs.CV])
cs.CV updates on arXiv.org
Video-language pre-training methods have mainly adopted sparse sampling
techniques to alleviate the temporal redundancy of videos. Though effective,
sparse sampling still suffers from inter-modal redundancy: visual redundancy and
textual redundancy. Compared with highly generalized text, sparsely sampled
frames usually contain text-independent portions, called visual redundancy.
Sparse sampling is also likely to miss important frames corresponding to some
text portions, resulting in textual redundancy. Inter-modal redundancy leads to
a mismatch of video and text information, hindering the model from better
learning …