Jan. 31, 2024, 4:41 p.m. | Ansar Aynetdinov, Alan Akbik

cs.CL updates on arXiv.org

Instruction-tuned Large Language Models (LLMs) have recently showcased
remarkable advancements in their ability to generate fitting responses to
natural language instructions. However, many current works rely on manual
evaluation to judge the quality of generated responses. Since such manual
evaluation is time-consuming, it does not easily scale to the evaluation of
multiple models and model variants. In this short paper, we propose a
straightforward but remarkably effective evaluation metric called SemScore, in
which we directly compare model outputs to gold …
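
The truncated abstract indicates that SemScore compares model outputs to gold target responses on a semantic level rather than via manual judgment. As a rough illustration of such an embedding-based comparison, the sketch below scores an output against a reference with sentence-embedding cosine similarity; the specific embedding model and the scoring function are assumptions for illustration, not necessarily the paper's exact setup.

```python
# Hypothetical sketch: score a model response against a gold reference
# using sentence-embedding cosine similarity. The embedding model chosen
# here is an assumption, not confirmed by the excerpt above.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")


def semantic_similarity(model_output: str, gold_response: str) -> float:
    """Return the cosine similarity between the embeddings of the two texts."""
    embeddings = embedder.encode(
        [model_output, gold_response], convert_to_tensor=True
    )
    return util.cos_sim(embeddings[0], embeddings[1]).item()


# Example usage: higher scores indicate closer semantic agreement,
# which lets many model variants be compared without manual review.
print(semantic_similarity(
    "The capital of France is Paris.",
    "Paris is France's capital city.",
))
```
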
