Feb. 20, 2024, 5:52 a.m. | Rajan Vivek, Kawin Ethayarajh, Diyi Yang, Douwe Kiela

cs.CL updates on arXiv.org

arXiv:2309.08638v2 Announce Type: replace
Abstract: Modern language models often exhibit powerful but brittle behavior, leading to the development of larger and more diverse benchmarks to reliably assess their behavior. Here, we suggest that model performance can be benchmarked and elucidated with much smaller evaluation sets. We first show that in six popular language classification benchmarks, model confidence in the correct class on many pairs of points is strongly correlated across models. We build upon this phenomenon to propose Anchor Point …
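
The correlation claim in the abstract can be illustrated with a small sketch. This is only an illustration under stated assumptions, not the authors' method: the function names `pearson` and `point_pair_correlations` are hypothetical, the toy confidence values are made up for demonstration, and the use of Pearson correlation is an assumption, since the excerpt does not say which correlation measure is used. Each row holds one model's confidence in the correct class for every evaluation point, and the sketch computes how strongly each pair of points co-varies across models.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences.

    Assumes neither sequence is constant (otherwise the denominator is zero).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def point_pair_correlations(confidences):
    """Correlate every pair of evaluation points across models.

    confidences[m][i] is model m's probability on the correct class
    for evaluation point i (hypothetical layout for this sketch).
    Returns a dict mapping (i, j) with i < j to the correlation.
    """
    n_points = len(confidences[0])
    # One column per evaluation point: its confidence under each model.
    columns = [[row[i] for row in confidences] for i in range(n_points)]
    return {
        (i, j): pearson(columns[i], columns[j])
        for i in range(n_points)
        for j in range(i + 1, n_points)
    }


# Toy data (made up): 3 models (rows) x 3 evaluation points (columns).
toy_confidences = [
    [0.9, 0.8, 0.2],
    [0.6, 0.5, 0.5],
    [0.3, 0.2, 0.8],
]

correlations = point_pair_correlations(toy_confidences)
for pair, r in sorted(correlations.items()):
    print(pair, round(r, 3))
```

In this toy data, points 0 and 1 rise and fall together across the three models, so their correlation is close to 1, while point 2 moves in the opposite direction. If one point in a strongly correlated pair is evaluated, a model's confidence on the other can be estimated from it, which is the intuition behind evaluating on a much smaller set.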

