June 4, 2024, 4:54 a.m. | Zhumin Chu, Qingyao Ai, Yiteng Tu, Haitao Li, Yiqun Liu

cs.CL updates on arXiv.org

arXiv:2401.15641v2 Announce Type: replace-cross
Abstract: The impressive performance of large language models (LLMs) has attracted considerable attention from both academia and industry. Beyond the question of how to construct and train LLMs, how to effectively evaluate and compare their capabilities is widely recognized as an important yet difficult problem. Existing paradigms rely on either human annotators or model-based evaluators to assess LLM performance on different tasks. However, these paradigms often suffer from high cost, low generalizability, …

