Feb. 8, 2024, 5:46 a.m. | Zhengxuan Wu, Atticus Geiger, Thomas Icard, Christopher Potts, Noah D. Goodman

cs.CL updates on arXiv.org

Obtaining human-interpretable explanations of large, general-purpose language models is an urgent goal for AI safety. However, it is just as important that our interpretability methods are faithful to the causal dynamics underlying model behavior and able to robustly generalize to unseen inputs. Distributed Alignment Search (DAS) is a powerful gradient descent method grounded in a theory of causal abstraction that has uncovered perfect alignments between interpretable symbolic algorithms and small deep learning models fine-tuned for specific tasks. In the present …


