Opening the Black Box of Large Language Models: Two Views on Holistic Interpretability
Feb. 19, 2024, 5:47 a.m. | Haiyan Zhao, Fan Yang, Himabindu Lakkaraju, Mengnan Du
cs.CL updates on arXiv.org
Abstract: As large language models (LLMs) grow more powerful, concerns around potential harms like toxicity, unfairness, and hallucination threaten user trust. Ensuring beneficial alignment of LLMs with human values through model alignment is thus critical yet challenging, requiring a deeper understanding of LLM behaviors and mechanisms. We propose opening the black box of LLMs through a framework of holistic interpretability encompassing complementary bottom-up and top-down perspectives. The bottom-up view, enabled by mechanistic interpretability, focuses on component …
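To make the abstract's "bottom-up" (mechanistic) view concrete, here is a minimal sketch of component-level inspection: registering a forward hook on one attention block of GPT-2 and capturing its output activations for a prompt. The choice of model, layer index, and hook-based approach are illustrative assumptions, not the paper's actual method.

```python
# Sketch of component-level (mechanistic) inspection, assuming a
# HuggingFace GPT-2 model; the paper's own setup may differ.
import torch
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")
model.eval()

captured = {}

def save_activation(module, inputs, output):
    # GPT-2 attention modules return a tuple; the first element is the
    # attention output of shape (batch, seq_len, hidden_dim).
    captured["attn_out"] = output[0].detach()

# Hook the attention block of layer 5 (an arbitrary illustrative choice).
handle = model.h[5].attn.register_forward_hook(save_activation)

inputs = tokenizer("Interpretability opens the black box.", return_tensors="pt")
with torch.no_grad():
    model(**inputs)
handle.remove()

print(captured["attn_out"].shape)  # e.g. torch.Size([1, 7, 768])
```

Captured activations like these are the raw material for bottom-up analyses such as probing or activation patching on individual components.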