March 4, 2024, 5:45 a.m. | Jiaqi Xu, Bo Liu, Yunkuo Chen, Mengli Cheng, Xing Shi

cs.CV updates on arXiv.org arxiv.org

arXiv:2303.05707v2 Announce Type: replace
Abstract: Video-and-language understanding has a variety of industrial applications, such as video question answering, text-video retrieval, and multi-label classification. Existing video-and-language understanding methods generally adopt heavy multi-modal encoders and feature fusion modules, which incur high computational costs. In particular, they have difficulty handling the dense video frames and long text prevalent in industrial applications. This paper proposes MuLTI, a highly accurate and efficient video-and-language understanding model that achieves efficient and effective feature fusion and rapid …

