Feb. 14, 2024, 5:46 a.m. | Rishav Hada, Varun Gumma, Adrian de Wynter, Harshita Diddee, Mohamed Ahmed, Monojit Choudhury, Kalika Bali

cs.CL updates on arXiv.org

Large Language Models (LLMs) excel at various Natural Language Processing (NLP) tasks, yet their evaluation, particularly in languages beyond the top 20, remains inadequate due to the limitations of existing benchmarks and metrics. Employing LLMs as evaluators to rank or score other models' outputs is emerging as a viable solution, addressing the constraints tied to human annotators and established benchmarks. In this study, we explore the potential of LLM-based evaluators, specifically GPT-4, in enhancing multilingual evaluation by calibrating them against 20K human judgments …
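
As a rough illustration of the LLM-as-evaluator setup the abstract describes, the sketch below asks GPT-4 to score a model output and returns the rating so it can later be compared against human judgments. The prompt wording, the 1–5 scale, and the `judge_output` helper are assumptions for illustration, not the paper's actual protocol.

```python
# Minimal LLM-as-judge sketch (illustrative; not the paper's exact prompt or rating scale).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge_output(source_text: str, model_output: str, language: str) -> str:
    """Ask GPT-4 to rate a model's output for a given language on a 1-5 scale."""
    prompt = (
        f"You are evaluating a {language} text generated by a language model.\n"
        f"Source / instruction:\n{source_text}\n\n"
        f"Model output:\n{model_output}\n\n"
        "Rate the output's overall quality on a scale of 1 (poor) to 5 (excellent). "
        "Reply with only the number."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

# Scores produced this way would then be calibrated against human judgments
# collected for the same outputs, e.g. via correlation or agreement statistics.
```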

