Feb. 16, 2024, 5:41 a.m. | Shashwat Singh, Shauli Ravfogel, Jonathan Herzig, Roee Aharoni, Ryan Cotterell, Ponnurangam Kumaraguru

cs.LG updates on arXiv.org arxiv.org

arXiv:2402.09631v1 Announce Type: new
Abstract: Language models often exhibit undesirable behaviors, such as gender bias or toxic language. Interventions in the representation space were shown effective in mitigating such issues by altering the LM behavior. We first show that two prominent intervention techniques, Linear Erasure and Steering Vectors, do not enable a high degree of control and are limited in expressivity.
We then propose a novel intervention methodology for generating expressive counterfactuals in the representation space, aiming to make representations …

abstract arxiv behavior bias cs.cl cs.lg gender gender bias language language models linear representation show space toxic language type vectors

Data Architect

@ University of Texas at Austin | Austin, TX

Data ETL Engineer

@ University of Texas at Austin | Austin, TX

Lead GNSS Data Scientist

@ Lurra Systems | Melbourne

Senior Machine Learning Engineer (MLOps)

@ Promaton | Remote, Europe

Field Sample Specialist (Air Sampling) - Eurofins Environment Testing – Pueblo, CO

@ Eurofins | Pueblo, CO, United States

Camera Perception Engineer

@ Meta | Sunnyvale, CA