March 1, 2024, 5:43 a.m. | Siyu Chen, Heejune Sheen, Tianhao Wang, Zhuoran Yang

cs.LG updates on arXiv.org

arXiv:2402.19442v1 Announce Type: new
Abstract: We study the dynamics of gradient flow for training a multi-head softmax attention model for in-context learning of multi-task linear regression. We establish the global convergence of gradient flow under suitable choices of initialization. In addition, we prove that an interesting "task allocation" phenomenon emerges during the gradient flow dynamics, where each attention head focuses on solving a single task of the multi-task model. Specifically, we prove that the gradient flow dynamics can be split …

