Feb. 29, 2024, 5:42 a.m. | Julian Coda-Forno, Marcel Binz, Jane X. Wang, Eric Schulz

cs.LG updates on arXiv.org

arXiv:2402.18225v1 Announce Type: cross
Abstract: Large language models (LLMs) have significantly advanced the field of artificial intelligence. Yet, evaluating them comprehensively remains challenging. We argue that this is partly due to the predominant focus on performance metrics in most benchmarks. This paper introduces CogBench, a benchmark that includes ten behavioral metrics derived from seven cognitive psychology experiments. This novel approach offers a toolkit for phenotyping LLMs' behavior. We apply CogBench to 35 LLMs, yielding a rich and diverse dataset. We …
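To make the idea of "phenotyping" concrete, below is a minimal, hypothetical sketch of the kind of evaluation the abstract describes: running a language model through a cognitive-psychology-style task (here a two-armed bandit) and scoring a behavioral metric rather than an accuracy metric. The function names, prompt format, and the exploration-rate metric are illustrative assumptions, not CogBench's actual interface.

```python
# Hypothetical sketch: behavioral phenotyping of an LLM on a two-armed bandit
# task, scoring how often the model deviates from the empirically greedy arm
# (a crude "exploration rate"). `query_llm` is a placeholder, not a real API.

import random


def query_llm(prompt: str) -> str:
    """Placeholder for a real model call; returns 'A' or 'B'."""
    return random.choice(["A", "B"])


def run_bandit_episode(n_trials: int = 20, p_a: float = 0.7, p_b: float = 0.3) -> float:
    history = []      # (choice, reward) pairs fed back into the prompt
    non_greedy = 0    # choices that ignore the arm with the higher observed mean

    for t in range(n_trials):
        prompt = (
            "You are playing a two-armed bandit. Past outcomes: "
            + "; ".join(f"{c}->{r}" for c, r in history)
            + ". Which arm do you choose, A or B? Answer with a single letter."
        )
        choice = query_llm(prompt).strip().upper()[:1]
        reward = int(random.random() < (p_a if choice == "A" else p_b))

        # Greedy arm = the one with the higher observed mean reward so far.
        means = {
            arm: sum(r for c, r in history if c == arm)
            / max(1, sum(1 for c, _ in history if c == arm))
            for arm in ("A", "B")
        }
        greedy_arm = max(means, key=means.get)
        if history and choice != greedy_arm:
            non_greedy += 1

        history.append((choice, reward))

    return non_greedy / max(1, n_trials - 1)


if __name__ == "__main__":
    print("exploration rate:", run_bandit_episode())
```

The point of a metric like this is that it characterizes *how* a model decides (e.g., how willing it is to explore) rather than whether its answers are correct, which is the kind of behavioral signature the benchmark aggregates across tasks and models.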

