June 26, 2024, 4:45 a.m. | Connor Kissane, Robert Krzyzanowski, Joseph Isaac Bloom, Arthur Conmy, Neel Nanda

cs.LG updates on arXiv.org

arXiv:2406.17759v1 Announce Type: new
Abstract: Decomposing model activations into interpretable components is a key open problem in mechanistic interpretability. Sparse autoencoders (SAEs) are a popular method for decomposing the internal activations of trained transformers into sparse, interpretable features, and have been applied to MLP layers and the residual stream. In this work we train SAEs on attention layer outputs and show that, here too, SAEs find a sparse, interpretable decomposition. We demonstrate this on transformers from several model families and …
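For readers unfamiliar with the setup, the sketch below shows the basic SAE recipe the abstract refers to: a single hidden layer trained to reconstruct activations under a sparsity penalty, here applied to attention-output activations. This is an illustrative reconstruction, not the authors' released code; the dimensions, the L1 coefficient, and the use of attention outputs as a plain tensor are assumptions.

```python
# Minimal sketch of a sparse autoencoder (SAE) trained on attention layer outputs.
# Assumptions: d_model=768, an 8x dictionary expansion, and an L1 sparsity penalty;
# the paper's actual hyperparameters and training details may differ.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_hidden)   # encoder: activation -> feature coefficients
        self.dec = nn.Linear(d_hidden, d_model)   # decoder: features -> reconstructed activation
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor):
        f = self.relu(self.enc(x))   # sparse, non-negative feature activations
        x_hat = self.dec(f)          # reconstruction of the attention output
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty that encourages sparse features
    # (a common SAE objective; the exact coefficient here is an assumption).
    recon = (x_hat - x).pow(2).mean()
    sparsity = f.abs().sum(dim=-1).mean()
    return recon + l1_coeff * sparsity

# Example: decompose a batch of attention-output activations.
sae = SparseAutoencoder(d_model=768, d_hidden=8 * 768)
attn_out = torch.randn(32, 768)      # stand-in for hooked attention-layer outputs
x_hat, features = sae(attn_out)
loss = sae_loss(attn_out, x_hat, features)
loss.backward()
```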

