June 26, 2024, 4:45 a.m. | Connor Kissane, Robert Krzyzanowski, Joseph Isaac Bloom, Arthur Conmy, Neel Nanda

cs.LG updates on arXiv.org

arXiv:2406.17759v1 Announce Type: new
Abstract: Decomposing model activations into interpretable components is a key open problem in mechanistic interpretability. Sparse autoencoders (SAEs) are a popular method for decomposing the internal activations of trained transformers into sparse, interpretable features, and have been applied to MLP layers and the residual stream. In this work we train SAEs on attention layer outputs and show that, here too, SAEs find a sparse, interpretable decomposition. We demonstrate this on transformers from several model families and …
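For readers unfamiliar with the setup, the sketch below shows the basic SAE recipe the abstract refers to: a single hidden layer trained to reconstruct activations under a sparsity penalty, here applied to attention-output activations. This is an illustrative reconstruction, not the authors' released code; the dimensions, the L1 coefficient, and the use of attention outputs as a plain tensor are assumptions.

```python
# Minimal sketch of a sparse autoencoder (SAE) trained on attention layer outputs.
# Assumptions: d_model=768, an 8x dictionary expansion, and an L1 sparsity penalty;
# the paper's actual hyperparameters and training details may differ.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_hidden)   # encoder: activation -> feature coefficients
        self.dec = nn.Linear(d_hidden, d_model)   # decoder: features -> reconstructed activation
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor):
        f = self.relu(self.enc(x))   # sparse, non-negative feature activations
        x_hat = self.dec(f)          # reconstruction of the attention output
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty that encourages sparse features
    # (a common SAE objective; the exact coefficient here is an assumption).
    recon = (x_hat - x).pow(2).mean()
    sparsity = f.abs().sum(dim=-1).mean()
    return recon + l1_coeff * sparsity

# Example: decompose a batch of attention-output activations.
sae = SparseAutoencoder(d_model=768, d_hidden=8 * 768)
attn_out = torch.randn(32, 768)      # stand-in for hooked attention-layer outputs
x_hat, features = sae(attn_out)
loss = sae_loss(attn_out, x_hat, features)
loss.backward()
```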

