Interpreting Attention Layer Outputs with Sparse Autoencoders
June 26, 2024, 4:45 a.m. | Connor Kissane, Robert Krzyzanowski, Joseph Isaac Bloom, Arthur Conmy, Neel Nanda
cs.LG updates on arXiv.org
Abstract: Decomposing model activations into interpretable components is a key open problem in mechanistic interpretability. Sparse autoencoders (SAEs) are a popular method for decomposing the internal activations of trained transformers into sparse, interpretable features, and have been applied to MLP layers and the residual stream. In this work we train SAEs on attention layer outputs and show that here, too, SAEs find a sparse, interpretable decomposition. We demonstrate this on transformers from several model families and …
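For readers unfamiliar with the technique the abstract describes, below is a minimal sketch of the standard SAE recipe: an overcomplete autoencoder trained to reconstruct cached activations (here, attention layer outputs) under an L1 sparsity penalty. This is an illustrative sketch of the general method, not the authors' implementation; the names (SparseAutoencoder, d_model, d_hidden, l1_coeff) and the loss weighting are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE sketch (hypothetical sizes, not the paper's code).

    d_model: width of the attention layer output being decomposed.
    d_hidden: number of dictionary features, typically >> d_model.
    """

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        # Non-negative feature activations; the ReLU plus the L1 penalty
        # below is what drives the code toward sparsity.
        f = torch.relu(self.encoder(x))
        # Reconstruct the original activation as a sparse combination
        # of decoder directions (the "interpretable features").
        x_hat = self.decoder(f)
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty on feature activations;
    # l1_coeff trades off reconstruction fidelity against sparsity.
    return ((x_hat - x) ** 2).mean() + l1_coeff * f.abs().sum(dim=-1).mean()
```

In practice such an SAE is trained on a large buffer of activations cached from the target layer; the rows of the decoder weight matrix then serve as the candidate interpretable feature directions.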