On the Blind Spots of Model-Based Evaluation Metrics for Text Generation. (arXiv:2212.10020v2 [cs.CL] UPDATED)
cs.CL updates on arXiv.org
In this work, we explore a useful but often neglected methodology for
robustness analysis of text generation evaluation metrics: stress tests with
synthetic data. Basically, we design and synthesize a wide range of potential
errors and check whether they result in a commensurate drop in the metric
scores. We examine a range of recently proposed evaluation metrics based on
pretrained language models, for the tasks of open-ended generation,
translation, and summarization. Our experiments reveal interesting
insensitivities, biases, or even loopholes …
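To make the stress-test idea concrete, below is a minimal sketch (not the paper's code) of one such probe: corrupt a faithful hypothesis with a synthesized error, such as a meaning-flipping negation, and check whether a pretrained-LM-based metric drops commensurately. The choice of BERTScore, the example sentences, and the negation perturbation are illustrative assumptions; the sketch assumes the `bert_score` package is installed.

```python
# Illustrative stress test (assumption: not the authors' code): does a
# model-based metric penalize a synthesized error as much as it should?
# Assumes the `bert_score` package (pip install bert-score).
from bert_score import score

reference = ["The vaccine was approved by regulators last week."]
correct = ["Regulators approved the vaccine last week."]
# Synthetic error: a negation flips the meaning while keeping high lexical overlap.
negated = ["Regulators never approved the vaccine last week."]

# F1 BERTScore for the faithful and the corrupted hypothesis.
_, _, f1_correct = score(correct, reference, lang="en", verbose=False)
_, _, f1_negated = score(negated, reference, lang="en", verbose=False)

drop = f1_correct.item() - f1_negated.item()
print(f"correct: {f1_correct.item():.4f}  "
      f"negated: {f1_negated.item():.4f}  drop: {drop:.4f}")
# A blind spot would show up here as a negligible drop
# despite the flipped meaning.
```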