Feb. 6, 2024, 5:45 a.m. | Kun-Peng Ning, Shuo Yang, Yu-Yang Liu, Jia-Yu Yao, Zhen-Hui Liu, Yu Wang, Ming Pang, Li Yuan

cs.LG updates on arXiv.org (arxiv.org)

Existing large language model (LLM) evaluation methods typically focus on testing performance on closed-environment, domain-specific benchmarks with human annotations. In this paper, we explore a novel unsupervised evaluation direction, utilizing peer-review mechanisms to measure LLMs automatically. In this setting, both open-source and closed-source LLMs operate in the same environment, capable of answering unlabeled questions and evaluating each other, where each LLM's response score is jointly determined by the other anonymous models. To obtain the ability hierarchy among these …
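To make the peer-review setup concrete, here is a minimal sketch of how such mutual scoring could be wired up. This is not the paper's actual algorithm: the `ask` and `score` helpers, the model names, and the plain averaging of peer scores are all assumptions for illustration only.

```python
import random
from statistics import mean

# Hypothetical stand-ins: a real system would call each LLM's API here,
# both to generate an answer and to grade another model's answer.
def ask(model: str, question: str) -> str:
    return f"{model}'s answer to: {question}"

def score(reviewer: str, question: str, answer: str) -> float:
    # Placeholder rating; a real reviewer prompt would ask the LLM
    # to rate the (anonymous) answer on a fixed scale.
    return random.uniform(0.0, 1.0)

def peer_review_eval(models: list[str], questions: list[str]) -> dict[str, float]:
    """Each model answers every unlabeled question; every answer is then
    scored by all the other models, and a model's overall score is the
    average of the peer scores its answers receive."""
    totals: dict[str, list[float]] = {m: [] for m in models}
    for q in questions:
        answers = {m: ask(m, q) for m in models}
        for author, answer in answers.items():
            peer_scores = [score(r, q, answer) for r in models if r != author]
            totals[author].append(mean(peer_scores))
    return {m: mean(s) for m, s in totals.items()}

if __name__ == "__main__":
    ranking = peer_review_eval(
        ["model_a", "model_b", "model_c"],
        ["What causes tides?", "Explain overfitting."],
    )
    for model, s in sorted(ranking.items(), key=lambda kv: -kv[1]):
        print(f"{model}: {s:.3f}")
```

The resulting averages induce an ability ranking without any human-annotated labels, which is the unsupervised angle the abstract describes.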

