Feb. 26, 2024, 5:41 a.m. | Sungjin Park, Edward Choi

cs.LG updates on arXiv.org

arXiv:2402.15096v1 Announce Type: new
Abstract: Transformer-based models have significantly improved performance across a range of multimodal understanding tasks, such as visual question answering and action recognition. However, multimodal Transformers suffer from the quadratic complexity of multi-head attention with respect to the input sequence length, especially as the number of modalities increases. To address this, we introduce Low-Cost Multimodal Transformer (LoCoMT), a novel multimodal attention mechanism that aims to reduce computational cost during training and inference with minimal performance loss. Specifically, …
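The quadratic bottleneck the abstract refers to is easiest to see in vanilla multi-head self-attention over a single sequence formed by concatenating all modality tokens. The sketch below is not the LoCoMT mechanism (the abstract is truncated before those details); it only illustrates, under assumed token counts and with learned Q/K/V projections omitted, how the (N x N) attention matrix grows as modalities are added.

```python
# Minimal sketch of the cost the paper targets: standard multi-head self-attention
# over a concatenated multimodal sequence. Not the LoCoMT mechanism; projections
# are omitted for brevity, and the token counts below are illustrative assumptions.
import torch
import torch.nn.functional as F

def multihead_self_attention(x, num_heads):
    """Vanilla scaled dot-product self-attention.
    The (N x N) score matrix per head is the quadratic bottleneck,
    where N is the total number of tokens across all modalities."""
    B, N, D = x.shape
    head_dim = D // num_heads
    # Split the model dimension into heads (identity Q/K/V for this sketch).
    q = x.view(B, N, num_heads, head_dim).transpose(1, 2)
    k = x.view(B, N, num_heads, head_dim).transpose(1, 2)
    v = x.view(B, N, num_heads, head_dim).transpose(1, 2)
    # (B, heads, N, N) score matrix: O(N^2) compute and memory.
    scores = q @ k.transpose(-2, -1) / head_dim ** 0.5
    out = F.softmax(scores, dim=-1) @ v
    return out.transpose(1, 2).reshape(B, N, D)

# Example: hypothetical video (512 tokens), audio (256), and text (64) streams
# concatenated into one sequence of N = 832 tokens.
video = torch.randn(1, 512, 256)
audio = torch.randn(1, 256, 256)
text = torch.randn(1, 64, 256)
tokens = torch.cat([video, audio, text], dim=1)
_ = multihead_self_attention(tokens, num_heads=8)
print("attention matrix entries per head:", tokens.shape[1] ** 2)  # 832^2, grows quadratically with N
```

Adding another modality extends N directly, so the score matrix, and with it training and inference cost, grows quadratically; reducing this cost with minimal performance loss is the stated goal of LoCoMT.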

