Feb. 1, 2024, 12:45 p.m. | Yingqian Cui, Jie Ren, Pengfei He, Jiliang Tang, Yue Xing

cs.LG updates on arXiv.org

We present a theoretical analysis of the performance of transformers with softmax attention on in-context learning with linear regression tasks. While the existing literature predominantly focuses on the convergence of transformers with single-/multi-head attention, our research centers on comparing their performance. We conduct an exact theoretical analysis to demonstrate that multi-head attention with a substantial embedding dimension performs better than single-head attention. When the number of in-context examples D increases, the prediction loss using single-/multi-head attention is O(1/D), and …
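To make the setup in the abstract concrete, here is a minimal sketch (not the authors' code) of in-context learning for linear regression with a single softmax attention head: D in-context pairs (x_i, y_i) with y_i = wᵀx_i are packed into a prompt together with a query x_q, and the attention output at the query token is read off as the prediction. The identity projections and all variable names are illustrative assumptions; with untrained weights the head only computes a kernel-weighted average of the in-context labels, which illustrates the mechanism rather than the trained predictor analyzed in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, D = 5, 200                      # input dimension, number of in-context examples

# One linear regression task: noiseless labels y_i = w^T x_i.
w = rng.normal(size=d)             # task vector
X = rng.normal(size=(D, d))        # in-context inputs
y = X @ w                          # in-context labels
x_q = rng.normal(size=d)           # query input whose label we want to predict

# Prompt tokens: each column stacks (x_i, y_i); the query token's label slot is 0.
Z = np.hstack([np.vstack([X.T, y[None, :]]),
               np.concatenate([x_q, [0.0]])[:, None]])   # shape (d+1, D+1)

# One softmax attention head with illustrative (identity) projections.
W_Q = W_K = W_V = np.eye(d + 1)
Q, K, V = W_Q @ Z, W_K @ Z, W_V @ Z
scores = Q.T @ K / np.sqrt(d + 1)                         # (D+1, D+1) attention scores
attn = np.exp(scores - scores.max(axis=1, keepdims=True))
attn /= attn.sum(axis=1, keepdims=True)                   # row-wise softmax
out = attn @ V.T                                          # (D+1, d+1) token outputs

# The prediction for the query is the label coordinate of the query token's output.
y_pred = out[-1, -1]
print(f"attention-weighted prediction {y_pred:.3f} vs. true label {w @ x_q:.3f}")
```

A multi-head variant would apply several such heads with different projection matrices and combine their outputs; the paper's claim is that, with a sufficiently large embedding dimension, this multi-head predictor attains a strictly smaller prediction loss than the single-head one as D grows.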

Tags: analysis, attention, context, convergence, cs.ai, cs.lg, head, in-context learning, linear, linear regression, literature, multi-head, multi-head attention, performance, regression, research, softmax, stat.ml, tasks, transformer, transformers
