all AI news
MuLTI: Efficient Video-and-Language Understanding with Text-Guided MultiWay-Sampler and Multiple Choice Modeling
March 4, 2024, 5:45 a.m. | Jiaqi Xu, Bo Liu, Yunkuo Chen, Mengli Cheng, Xing Shi
cs.CV updates on arXiv.org arxiv.org
Abstract: Video-and-language understanding has a variety of applications in the industry, such as video question answering, text-video retrieval, and multi-label classification. Existing video-and-language understanding methods generally adopt heavy multi-modal encoders and feature fusion modules, which consume high computational costs. Specially, they have difficulty dealing with dense video frames or long text prevalent in industrial applications. This paper proposes MuLTI, a highly accurate and efficient video-and-language understanding model that achieves efficient and effective feature fusion and rapid …
abstract applications arxiv classification computational costs cs.cl cs.cv cs.mm feature fusion industry language language understanding modal modeling modules multi-modal multiple question question answering retrieval text type understanding video
More from arxiv.org / cs.CV updates on arXiv.org
Jobs in AI, ML, Big Data
Artificial Intelligence – Bioinformatic Expert
@ University of Texas Medical Branch | Galveston, TX
Lead Developer (AI)
@ Cere Network | San Francisco, US
Research Engineer
@ Allora Labs | Remote
Ecosystem Manager
@ Allora Labs | Remote
Founding AI Engineer, Agents
@ Occam AI | New York
AI Engineer Intern, Agents
@ Occam AI | US