June 11, 2024, 4:46 a.m. | Simon Schug, Seijin Kobayashi, Yassir Akram, João Sacramento, Razvan Pascanu

cs.LG updates on arXiv.org arxiv.org

arXiv:2406.05816v1 Announce Type: new
Abstract: Transformers can, under some circumstances, generalize to novel problem instances whose constituent parts might have been encountered during training but whose compositions have not. What mechanisms underlie this ability for compositional generalization? By reformulating multi-head attention as a hypernetwork, we reveal that a low-dimensional latent code specifies key-query specific operations. We find empirically that this latent code is highly structured, capturing information about the subtasks performed by the network. Using the framework of attention as …
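The reformulation mentioned in the abstract can be illustrated with a small numerical check. The sketch below is not the authors' code; it only demonstrates the algebraic identity that standard multi-head attention can be regrouped so that, for each query-key pair, the vector of attention scores across heads acts as a low-dimensional latent code that linearly mixes per-head value-output matrices. All names here (`make_params`, `standard_mha`, `hypernetwork_mha`, the parameter keys) are hypothetical, and details such as masking, biases, or any variants studied in the paper are left out.

```python
import math
import random


def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def matmul(A, B):
    """Multiply matrices A and B, both given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]


def make_params(rng, n_heads, d_model, d_head):
    """Random per-head projections (hypothetical layout, for illustration only)."""
    def rand(rows, cols):
        return [[rng.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]
    return [
        {"Wq": rand(d_head, d_model), "Wk": rand(d_head, d_model),
         "Wv": rand(d_head, d_model), "Wo": rand(d_model, d_head)}
        for _ in range(n_heads)
    ]


def attention_scores(params, X):
    """Return A with A[h][q][k] = softmax-normalized score of head h."""
    A = []
    for p in params:
        Q = [matvec(p["Wq"], x) for x in X]
        K = [matvec(p["Wk"], x) for x in X]
        scale = math.sqrt(len(Q[0]))
        A.append([
            softmax([sum(a * b for a, b in zip(q, k)) / scale for k in K])
            for q in Q
        ])
    return A


def standard_mha(params, X):
    """Usual view: each head mixes its values, then output projections are summed."""
    A = attention_scores(params, X)
    out = []
    for q in range(len(X)):
        y = [0.0] * len(X[0])
        for h, p in enumerate(params):
            V = [matvec(p["Wv"], x) for x in X]
            head = [sum(A[h][q][k] * V[k][i] for k in range(len(X)))
                    for i in range(len(V[0]))]
            y = [a + b for a, b in zip(y, matvec(p["Wo"], head))]
        out.append(y)
    return out


def hypernetwork_mha(params, X):
    """Regrouped view: the scores across heads form a latent code per (q, k)."""
    A = attention_scores(params, X)
    # One d_model x d_model value-output matrix per head.
    W_heads = [matmul(p["Wo"], p["Wv"]) for p in params]
    d_model = len(X[0])
    out = []
    for q in range(len(X)):
        y = [0.0] * d_model
        for k in range(len(X)):
            latent = [A[h][q][k] for h in range(len(params))]  # one entry per head
            # Linear "hypernetwork": the latent code mixes the per-head matrices.
            W_qk = [[sum(z * W[i][j] for z, W in zip(latent, W_heads))
                     for j in range(d_model)] for i in range(d_model)]
            y = [a + b for a, b in zip(y, matvec(W_qk, X[k]))]
        out.append(y)
    return out
```

Calling both functions on the same random inputs should give outputs that agree up to floating-point error, which is all this sketch is meant to show: the latent code has one dimension per head, so it is low-dimensional relative to the weight matrix it configures.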


