June 17, 2024, 4:41 a.m. | Dachuan Shi, Chaofan Tao, Anyi Rao, Zhendong Yang, Chun Yuan, Jiaqi Wang

cs.CL updates on arXiv.org

arXiv:2305.17455v4 Announce Type: replace-cross
Abstract: Recent vision-language models have achieved tremendous advances. However, their computational costs are also escalating dramatically, making model acceleration exceedingly critical. To pursue more efficient vision-language Transformers, this paper introduces Cross-Guided Ensemble of Tokens (CrossGET), a general acceleration framework for vision-language Transformers. This framework adaptively combines tokens in real time during inference, significantly reducing computational costs while maintaining high performance. CrossGET features two primary innovations: 1) Cross-Guided Matching and Ensemble. CrossGET leverages cross-modal guided token matching and …
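The abstract is cut off before it describes how tokens are matched and combined, so the following is only a generic sketch of the broader idea of reducing token count by merging similar tokens. It is not CrossGET's algorithm: it uses no cross-modal guidance, it picks pairs by plain cosine similarity, and it merges by an unweighted average. The function names `cosine_similarity` and `merge_most_similar_tokens` are hypothetical and chosen for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def merge_most_similar_tokens(tokens, num_merges):
    """Greedily merge the most similar pair of tokens, num_merges times.

    Illustrative assumption: a merged token is the unweighted mean of the pair.
    Each merge shortens the sequence by one token.
    """
    tokens = [list(t) for t in tokens]  # copy so the caller's data is untouched
    for _ in range(num_merges):
        if len(tokens) < 2:
            break  # nothing left to merge
        best = None  # (similarity, i, j) of the best pair seen so far
        for i in range(len(tokens)):
            for j in range(i + 1, len(tokens)):
                sim = cosine_similarity(tokens[i], tokens[j])
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        _, i, j = best
        tokens[i] = [(x + y) / 2 for x, y in zip(tokens[i], tokens[j])]
        del tokens[j]  # j > i, so index i is still valid
    return tokens


if __name__ == "__main__":
    toy_tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
    print(merge_most_similar_tokens(toy_tokens, num_merges=1))
```

In this toy input the first two vectors point in nearly the same direction, so they are the pair that gets merged and the sequence shrinks from three tokens to two. Because self-attention cost grows with sequence length, fewer tokens means less computation in later layers, which is the general motivation the abstract gives for combining tokens during inference.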
