Feb. 14, 2024, 5:46 a.m. | Rishav Hada, Varun Gumma, Adrian de Wynter, Harshita Diddee, Mohamed Ahmed, Monojit Choudhury, Kalika Bali

cs.CL updates on arXiv.org

Large Language Models (LLMs) excel at various Natural Language Processing (NLP) tasks, yet their evaluation, particularly in languages beyond the top 20, remains inadequate due to the limitations of existing benchmarks and metrics. Employing LLMs as evaluators to rank or score other models' outputs is emerging as a viable solution, addressing the constraints tied to human annotators and established benchmarks. In this study, we explore the potential of LLM-based evaluators, specifically GPT-4, in enhancing multilingual evaluation by calibrating them against 20K human judgments …
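
As a rough illustration of the LLM-as-evaluator setup the abstract describes, the sketch below asks GPT-4 to score a model output and returns the rating so it can later be compared against human judgments. The prompt wording, the 1–5 scale, and the `judge_output` helper are assumptions for illustration, not the paper's actual protocol.

```python
# Minimal LLM-as-judge sketch (illustrative; not the paper's exact prompt or rating scale).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge_output(source_text: str, model_output: str, language: str) -> str:
    """Ask GPT-4 to rate a model's output for a given language on a 1-5 scale."""
    prompt = (
        f"You are evaluating a {language} text generated by a language model.\n"
        f"Source / instruction:\n{source_text}\n\n"
        f"Model output:\n{model_output}\n\n"
        "Rate the output's overall quality on a scale of 1 (poor) to 5 (excellent). "
        "Reply with only the number."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

# Scores produced this way would then be calibrated against human judgments
# collected for the same outputs, e.g. via correlation or agreement statistics.
```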

