Oct. 14, 2022, 1:16 a.m. | Xing Wu, Chaochen Gao, Zijia Lin, Zhongyuan Wang, Jizhong Han, Songlin Hu

cs.CV updates on arXiv.org

Video-language pre-training methods have mainly adopted sparse sampling
techniques to alleviate the temporal redundancy of videos. Though effective,
sparse sampling still suffers from inter-modal redundancy, which takes two
forms: visual redundancy and textual redundancy. Compared with highly
generalized text, sparsely sampled frames usually contain text-independent
portions, referred to as visual redundancy. Sparse sampling is also likely to
miss important frames corresponding to some text portions, resulting in
textual redundancy. Inter-modal redundancy leads to a mismatch between video
and text information, hindering the model from better learning …
