May 16, 2024, 4:46 a.m. | Milan Gritta, Gerasimos Lampouras, Ignacio Iacobacci

arXiv:2405.09186v1 Announce Type: new
Abstract: Language models (LMs) as conversational assistants have recently become popular tools that help people accomplish a variety of tasks. These typically result from adapting LMs pretrained on general-domain text sequences through further instruction-tuning and possibly preference optimisation methods. The evaluation of such LMs would ideally be performed using human judgement; however, this is not scalable. On the other hand, automatic evaluation featuring auxiliary LMs as judges and/or knowledge-based tasks is scalable but struggles with assessing …

