March 6, 2024, 5:48 a.m. | Hui Huang, Yingqi Qu, Jing Liu, Muyun Yang, Tiejun Zhao

cs.CL updates on arXiv.org

arXiv:2403.02839v1 Announce Type: new
Abstract: Recently, there has been a growing trend of utilizing Large Language Model (LLM) to evaluate the quality of other LLMs. Many studies have employed proprietary close-source models, especially GPT4, as the evaluator. Alternatively, other works have fine-tuned judge models based on open-source LLMs as the evaluator. In this study, we conduct an empirical study of different judge models on their evaluation capability. Our findings indicate that although the fine-tuned judge models achieve high accuracy on …
