Feb. 6, 2024, 5:45 a.m. | Kun-Peng Ning Shuo Yang Yu-Yang Liu Jia-Yu Yao Zhen-Hui Liu Yu Wang Ming Pang Li Yuan

cs.LG updates on arXiv.org

Existing evaluation methods for large language models (LLMs) typically focus on testing performance on closed-environment, domain-specific benchmarks with human annotations. In this paper, we explore a novel unsupervised evaluation direction that uses peer-review mechanisms to measure LLMs automatically. In this setting, both open-source and closed-source LLMs operate in the same environment, answering unlabeled questions and evaluating one another, so that each LLM's response score is jointly determined by the other, anonymous models. To obtain the ability hierarchy among these …
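To make the described setup concrete, below is a minimal sketch of such a peer-review evaluation loop. The `ToyModel` wrapper, its `answer`/`score` methods, and the mean-score aggregation are illustrative assumptions standing in for real LLM calls, not the paper's exact formulation.

```python
# A minimal sketch of the peer-review evaluation loop described in the abstract.
# The model wrappers and the aggregation by mean peer score are assumptions.
import random
from statistics import mean


class ToyModel:
    """Stand-in for an LLM: answers questions and scores peers' answers."""

    def __init__(self, name: str, skill: float):
        self.name = name
        self.skill = skill  # hidden "ability" used only to simulate behaviour

    def answer(self, question: str) -> str:
        return f"{self.name}'s answer to: {question}"

    def score(self, question: str, answer: str) -> float:
        # A real reviewer model would judge the answer text; here we return
        # a noisy score so the example runs without any API access.
        return max(0.0, min(1.0, random.gauss(0.5 + 0.1 * self.skill, 0.1)))


def peer_review_rank(models, questions):
    """Each model answers every unlabeled question; every *other* model scores
    the answer anonymously; a model's final score is its mean peer score."""
    totals = {m.name: [] for m in models}
    for q in questions:
        answers = {m.name: m.answer(q) for m in models}
        for author in models:
            peer_scores = [
                reviewer.score(q, answers[author.name])
                for reviewer in models
                if reviewer is not author  # reviewers never score themselves
            ]
            totals[author.name].append(mean(peer_scores))
    # Higher mean peer score = higher position in the ability hierarchy.
    return sorted(
        ((name, mean(scores)) for name, scores in totals.items()),
        key=lambda kv: kv[1],
        reverse=True,
    )


if __name__ == "__main__":
    random.seed(0)
    models = [ToyModel("model_a", 3), ToyModel("model_b", 1), ToyModel("model_c", 2)]
    questions = ["Q1: an unlabeled question", "Q2: another unlabeled question"]
    for name, s in peer_review_rank(models, questions):
        print(f"{name}: {s:.3f}")
```

Swapping `ToyModel` for real API-backed models would give the unsupervised, annotation-free ranking the abstract describes, with each model's standing determined solely by its anonymous peers.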

