SemScore: Automated Evaluation of Instruction-Tuned LLMs based on Semantic Textual Similarity. (arXiv:2401.17072v1 [cs.CL])
cs.CL updates on arXiv.org
Instruction-tuned Large Language Models (LLMs) have recently showcased
remarkable advancements in their ability to generate fitting responses to
natural language instructions. However, many current works rely on manual
evaluation to judge the quality of generated responses. Since such manual
evaluation is time-consuming, it does not easily scale to the evaluation of
multiple models and model variants. In this short paper, we propose a
straightforward but remarkably effective evaluation metric called SemScore, in
which we directly compare model outputs to gold …
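The core idea can be sketched in a few lines: embed the model output and the gold reference into a shared vector space and score their cosine similarity. The paper's metric uses sentence-transformer embeddings; the sketch below substitutes a toy bag-of-words embedding (an assumption for illustration only, so it runs without any model download), and `semscore_like` is a hypothetical name, not the authors' API.

```python
from collections import Counter
import math

def embed(text):
    # Toy bag-of-words embedding; SemScore itself uses learned
    # sentence embeddings, but any fixed text-embedding function
    # illustrates the comparison step.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def semscore_like(prediction, reference):
    # Score a model output against a gold response by the
    # similarity of their embeddings.
    return cosine(embed(prediction), embed(reference))
```

Because the score is computed automatically from embeddings, it scales to many models and model variants without the manual judging the abstract criticizes.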