March 15, 2024, 4:42 a.m. | Shauli Ravfogel, Yoav Goldberg, Ryan Cotterell

cs.LG updates on arXiv.org

arXiv:2210.10012v4 Announce Type: replace
Abstract: Methods for erasing human-interpretable concepts from neural representations that assume linearity have been found to be tractable and useful. However, the impact of this removal on the behavior of downstream classifiers trained on the modified representations is not fully understood. In this work, we formally define the notion of log-linear guardedness as the inability of an adversary to predict the concept directly from the representation, and study its implications. We show that, in the binary …
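
To make the definition in the abstract concrete, here is a minimal toy sketch, assuming the "adversary" is a logistic-regression (log-linear) classifier trained on the representations. It compares the adversary's log-loss before and after a linear erasure step against the entropy of the concept labels, which is the loss of a predictor that ignores the input. The synthetic data, the mean-difference projection, and all function names (`make_data`, `erase_mean_difference`, `fit_logistic`, and so on) are illustrative assumptions, not the paper's method or experiments, and the check is done on the training sample only.

```python
import math
import random


def make_data(n=400, dim=3, seed=0):
    """Toy representations: a binary concept z shifts the first coordinate."""
    rng = random.Random(seed)
    xs, zs = [], []
    for _ in range(n):
        z = rng.randint(0, 1)
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        x[0] += 2.0 * z - 1.0
        xs.append(x)
        zs.append(z)
    return xs, zs


def class_mean(xs, zs, label):
    rows = [x for x, z in zip(xs, zs) if z == label]
    return [sum(col) / len(rows) for col in zip(*rows)]


def erase_mean_difference(xs, zs):
    """Project out the direction separating the class means (an assumed, simple eraser)."""
    mu1, mu0 = class_mean(xs, zs, 1), class_mean(xs, zs, 0)
    d = [a - b for a, b in zip(mu1, mu0)]
    norm = math.sqrt(sum(v * v for v in d))
    u = [v / norm for v in d]
    erased = []
    for x in xs:
        coef = sum(a * b for a, b in zip(x, u))
        erased.append([a - coef * b for a, b in zip(x, u)])
    return erased


def sigmoid(t):
    # Numerically stable logistic function.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def fit_logistic(xs, zs, steps=500, lr=0.5):
    """Full-batch gradient descent on the mean logistic loss."""
    dim, n = len(xs[0]), len(xs)
    w, b = [0.0] * dim, 0.0
    for _ in range(steps):
        gw, gb = [0.0] * dim, 0.0
        for x, z in zip(xs, zs):
            err = sigmoid(sum(a * c for a, c in zip(w, x)) + b) - z
            gb += err
            for j in range(dim):
                gw[j] += err * x[j]
        w = [wj - lr * g / n for wj, g in zip(w, gw)]
        b -= lr * gb / n
    return w, b


def log_loss(xs, zs, w, b):
    """Mean negative log-likelihood of the labels under the classifier."""
    total = 0.0
    for x, z in zip(xs, zs):
        t = sum(a * c for a, c in zip(w, x)) + b
        s = t if z == 1 else -t
        total += max(0.0, -s) + math.log1p(math.exp(-abs(s)))
    return total / len(xs)


def prior_entropy(zs):
    """Log-loss of the best predictor that ignores the representation."""
    p = sum(zs) / len(zs)
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))


xs, zs = make_data()
baseline = prior_entropy(zs)
loss_before = log_loss(xs, zs, *fit_logistic(xs, zs))
erased = erase_mean_difference(xs, zs)
loss_after = log_loss(erased, zs, *fit_logistic(erased, zs))
print(f"label entropy:         {baseline:.4f}")
print(f"adversary loss before: {loss_before:.4f}")
print(f"adversary loss after:  {loss_after:.4f}")
```

In this toy setup the adversary's loss on the original representations falls well below the label entropy, while on the erased representations it matches the entropy: once the class-conditional means coincide, the constant predictor is the optimum of the convex logistic loss. What this implies for downstream classifiers trained on the erased representations is the question the abstract raises, and this sketch does not address it.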
