On the Blind Spots of Model-Based Evaluation Metrics for Text Generation. (arXiv:2212.10020v2 [cs.CL] UPDATED)
cs.CL updates on arXiv.org
In this work, we explore a useful but often neglected methodology for
robustness analysis of text generation evaluation metrics: stress tests with
synthetic data. Basically, we design and synthesize a wide range of potential
errors and check whether they result in a commensurate drop in the metric
scores. We examine a range of recently proposed evaluation metrics based on
pretrained language models, for the tasks of open-ended generation,
translation, and summarization. Our experiments reveal interesting
insensitivities, biases, or even loopholes …
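To make the stress-test idea concrete, below is a minimal sketch (not the paper's code) of one such probe: corrupt a faithful hypothesis with a synthesized error, such as a meaning-flipping negation, and check whether a pretrained-LM-based metric drops commensurately. The choice of BERTScore, the example sentences, and the negation perturbation are illustrative assumptions; the sketch assumes the `bert_score` package is installed.

```python
# Illustrative stress test (assumption: not the authors' code): does a
# model-based metric penalize a synthesized error as much as it should?
# Assumes the `bert_score` package (pip install bert-score).
from bert_score import score

reference = ["The vaccine was approved by regulators last week."]
correct = ["Regulators approved the vaccine last week."]
# Synthetic error: a negation flips the meaning while keeping high lexical overlap.
negated = ["Regulators never approved the vaccine last week."]

# F1 BERTScore for the faithful and the corrupted hypothesis.
_, _, f1_correct = score(correct, reference, lang="en", verbose=False)
_, _, f1_negated = score(negated, reference, lang="en", verbose=False)

drop = f1_correct.item() - f1_negated.item()
print(f"correct: {f1_correct.item():.4f}  "
      f"negated: {f1_negated.item():.4f}  drop: {drop:.4f}")
# A blind spot would show up here as a negligible drop
# despite the flipped meaning.
```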