The Generative AI Paradox on Evaluation: What It Can Solve, It May Not Evaluate

Feb. 12, 2024, 5:46 a.m. | Juhyun Oh, Eunsu Kim, Inha Cha, Alice Oh

cs.CL updates on arXiv.org

This paper explores the assumption that Large Language Models (LLMs) skilled in generation tasks are equally adept as evaluators. We assess the performance of three LLMs and one open-source LM in Question-Answering (QA) and evaluation tasks using the TriviaQA (Joshi et al., 2017) dataset. Results indicate a significant disparity, with LLMs exhibiting lower performance in evaluation tasks compared to generation tasks. Intriguingly, we discover instances of unfaithful evaluation where models accurately evaluate answers in areas where they lack competence, underscoring …
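
The comparison the abstract describes, generation accuracy versus evaluation accuracy on the same questions, can be sketched roughly as follows. This is only an illustration under stated assumptions: the excerpt does not give the paper's scoring protocol, so exact-match scoring, the `Record` fields, the helper names, and the toy records are all hypothetical and are not data or results from the paper.

```python
from dataclasses import dataclass


@dataclass
class Record:
    question: str
    gold: str            # reference answer
    model_answer: str    # answer the model generated itself
    candidate: str       # answer the model was asked to judge
    model_verdict: bool  # model's judgment: is `candidate` correct?


def normalize(text: str) -> str:
    # Lowercase and collapse whitespace (assumed scoring rule).
    return " ".join(text.lower().split())


def is_correct(answer: str, gold: str) -> bool:
    # Exact match after normalization; an assumption, not the paper's metric.
    return normalize(answer) == normalize(gold)


def generation_accuracy(records: list[Record]) -> float:
    # Share of questions the model answers correctly itself.
    return sum(is_correct(r.model_answer, r.gold) for r in records) / len(records)


def evaluation_accuracy(records: list[Record]) -> float:
    # Share of candidates whose correctness the model judges correctly.
    hits = sum(r.model_verdict == is_correct(r.candidate, r.gold) for r in records)
    return hits / len(records)


def right_verdict_without_competence(records: list[Record]) -> list[Record]:
    # Cases where the model's own answer is wrong but its verdict is right,
    # one possible reading of the "unfaithful evaluation" the abstract mentions.
    return [
        r for r in records
        if not is_correct(r.model_answer, r.gold)
        and r.model_verdict == is_correct(r.candidate, r.gold)
    ]


# Hypothetical toy records, made up for illustration only.
toy_records = [
    Record("Capital of France?", "Paris", "Paris", "Lyon", False),
    Record("Capital of Australia?", "Canberra", "Sydney", "Canberra", True),
    Record("Largest planet in the Solar System?", "Jupiter", "Jupiter", "Saturn", True),
    Record("Chemical symbol for gold?", "Au", "Au", "Ag", True),
]

print("generation accuracy:", generation_accuracy(toy_records))
print("evaluation accuracy:", evaluation_accuracy(toy_records))
print("right verdict, wrong own answer:",
      [r.question for r in right_verdict_without_competence(toy_records)])
```

Scoring both tasks against the same gold answers keeps the two accuracies directly comparable, and the third helper isolates the per-question mismatch between what a model can answer and what it can judge.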
