Training Dynamics of Multi-Head Softmax Attention for In-Context Learning: Emergence, Convergence, and Optimality
March 1, 2024, 5:43 a.m. | Siyu Chen, Heejune Sheen, Tianhao Wang, Zhuoran Yang
cs.LG updates on arXiv.org
Abstract: We study the dynamics of gradient flow for training a multi-head softmax attention model for in-context learning of multi-task linear regression. We establish the global convergence of gradient flow under suitable choices of initialization. In addition, we prove that an interesting "task allocation" phenomenon emerges during the gradient flow dynamics, where each attention head focuses on solving a single task of the multi-task model. Specifically, we prove that the gradient flow dynamics can be split …
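The setting in the abstract — a single softmax attention layer trained on prompts of (input, label) pairs drawn from fresh multi-task linear-regression instances — can be sketched as a toy experiment. The code below is an illustrative reconstruction, not the paper's actual setup: all dimensions, the per-head (key-query, output-value) parameterization, and the finite-difference Euler discretization of gradient flow are assumptions made for brevity. It only demonstrates that the training loss decreases under gradient descent; it does not reproduce the paper's convergence or task-allocation results.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, H, T = 3, 8, 2, 2      # input dim, context length, heads, tasks (toy choices)
tok = d + T                   # token dimension: (x_i, y_i) stacked

def make_prompt(rng):
    """One in-context multi-task linear-regression prompt with a masked query."""
    W = rng.standard_normal((T, d))            # fresh task weights per prompt
    X = rng.standard_normal((n + 1, d))
    Y = X @ W.T                                # (n+1, T) multi-task labels
    Z = np.concatenate([X, Y], axis=1)         # prompt tokens (x_i, y_i)
    Z[-1, d:] = 0.0                            # query token: label zeroed out
    return Z, Y[-1]

def forward(theta, Z):
    """One multi-head softmax attention layer; per head a KQ and an OV matrix."""
    KQ = theta[: H * tok * tok].reshape(H, tok, tok)
    OV = theta[H * tok * tok:].reshape(H, T, tok)
    q, ctx = Z[-1], Z[:-1]
    pred = np.zeros(T)
    for h in range(H):
        s = ctx @ KQ[h] @ q
        a = np.exp(s - s.max()); a /= a.sum()  # softmax attention weights
        pred += OV[h] @ (a @ ctx)              # heads' outputs are summed
    return pred

def batch_loss(theta, prompts):
    return np.mean([np.sum((forward(theta, Z) - y) ** 2) for Z, y in prompts])

def num_grad(theta, prompts, eps=1e-4):
    """Central finite-difference gradient (fine at this tiny parameter count)."""
    g = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta); e[i] = eps
        g[i] = (batch_loss(theta + e, prompts)
                - batch_loss(theta - e, prompts)) / (2 * eps)
    return g

theta = 0.05 * rng.standard_normal(H * tok * tok + H * T * tok)  # small init
prompts = [make_prompt(rng) for _ in range(16)]
loss0 = batch_loss(theta, prompts)
for _ in range(150):                           # Euler steps approximating gradient flow
    theta -= 0.01 * num_grad(theta, prompts)
loss_final = batch_loss(theta, prompts)
print(f"loss before: {loss0:.3f}  after: {loss_final:.3f}")
```

Gradient flow is the continuous-time limit of this loop as the step size goes to zero; the paper's analysis concerns that limit and specific initializations, whereas this sketch just runs a few discrete steps from a small random initialization.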