April 29, 2022, 1:11 a.m. | Thórhildur Thorleiksdóttir, Cedric Renggli, Nora Hollenstein, Ce Zhang

cs.CL updates on arXiv.org

Collecting human judgements is currently the most reliable evaluation method
for natural language generation systems. Automatic metrics have documented flaws
when used to measure quality aspects of generated text and have been shown
to correlate poorly with human judgements. However, human evaluation is time-
and cost-intensive, and there is no consensus on how to design and conduct human
evaluation experiments. There is therefore a need for streamlined approaches for
efficient collection of human judgements when evaluating natural language
generation systems. To this end, we present …
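As a point of reference for the correlation claim above, the agreement between an automatic metric and human judgements is commonly quantified with a rank correlation. The sketch below is illustrative only and not taken from the paper; the metric scores and human ratings are made-up placeholder values.

# Illustrative sketch (not from the paper): quantifying how well an automatic
# metric's scores track human quality ratings for the same generated texts.
from scipy.stats import spearmanr

# Hypothetical automatic metric scores (e.g., from a surface-overlap metric)
metric_scores = [0.31, 0.45, 0.27, 0.52, 0.38]

# Hypothetical human quality ratings for the same outputs (e.g., 1-5 scale)
human_ratings = [3.2, 2.8, 3.5, 4.1, 2.9]

# Spearman rank correlation: values near 0 mean the metric ranks outputs very
# differently from human judges, i.e. it correlates poorly with human judgement.
rho, p_value = spearmanr(metric_scores, human_ratings)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")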

