May 14, 2024, 4:49 a.m. | Rohan Ajwani, Shashidhar Reddy Javaji, Frank Rudzicz, Zining Zhu

arXiv:2405.06800v1 (cs.CL) Announce Type: new
Abstract: Large Language Models (LLMs) are becoming vital tools that help us solve and understand complex problems by acting as digital assistants. LLMs can generate convincing explanations, even when only given the inputs and outputs of these problems, i.e., in a "black-box" approach. However, our research uncovers a hidden risk tied to this approach, which we call *adversarial helpfulness*. This happens when an LLM's explanations make a wrong answer look right, potentially leading people to trust …

