Feb. 20, 2024, 5:52 a.m. | Rajan Vivek, Kawin Ethayarajh, Diyi Yang, Douwe Kiela

cs.CL updates on arXiv.org arxiv.org

arXiv:2309.08638v2 Announce Type: replace
Abstract: Modern language models often exhibit powerful but brittle behavior, which has driven the development of ever larger and more diverse benchmarks to assess that behavior reliably. Here, we suggest that model performance can be benchmarked and elucidated with much smaller evaluation sets. We first show that, on six popular language classification benchmarks, model confidence in the correct class is strongly correlated across models for many pairs of points. We build upon this phenomenon to propose Anchor Point …
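The phenomenon the abstract describes — per-example confidences that are strongly correlated across models — suggests that a few well-chosen examples can stand in for a much larger benchmark. A minimal sketch of that idea is below, assuming a synthetic confidence matrix (models × examples) and a hypothetical greedy, k-medoids-style `select_anchor_points` helper; this is an illustration of the general selection principle, not the paper's exact algorithm.

```python
import numpy as np

# Hypothetical setup: rows = models, columns = evaluation examples.
# Entry [m, j] is model m's confidence in the correct class on example j.
rng = np.random.default_rng(0)
n_models, n_examples = 8, 50
latent = rng.normal(size=(n_models, 5))      # shared low-rank structure
weights = rng.normal(size=(5, n_examples))
conf = 1 / (1 + np.exp(-(latent @ weights))) # confidences in (0, 1)

# Correlate per-example confidence vectors across models: examples whose
# columns are highly correlated carry largely redundant information.
corr = np.corrcoef(conf.T)  # shape (n_examples, n_examples)

def select_anchor_points(corr, k):
    """Greedily pick k examples so every remaining example is highly
    correlated with at least one chosen anchor (k-medoids-style)."""
    chosen = []
    for _ in range(k):
        best_gain, best_j = -np.inf, None
        for j in range(corr.shape[0]):
            if j in chosen:
                continue
            cols = chosen + [j]
            # Coverage score: each example's best correlation with an anchor.
            gain = corr[:, cols].max(axis=1).sum()
            if gain > best_gain:
                best_gain, best_j = gain, j
        chosen.append(best_j)
    return chosen

anchors = select_anchor_points(corr, k=5)
print(anchors)
```

Evaluating a new model only on the selected anchors, then predicting its behavior on the rest via the correlation structure, is how a small evaluation set can approximate the full benchmark.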
