Feb. 22, 2024, 5:48 a.m. | Zongxia Li, Ishani Mondal, Yijun Liang, Huy Nghiem, Jordan Boyd-Graber

cs.CL updates on arXiv.org

arXiv:2401.13170v2 Announce Type: replace
Abstract: Question answering (QA) can only make progress if we know whether an answer is correct, but for many of the most challenging and interesting QA examples, current evaluation metrics for determining answer equivalence (AE) often do not align with human judgments, particularly for the more verbose, free-form answers produced by large language models (LLMs). There are two challenges: a lack of data, and that models are too big: LLM-based scorers can correlate better with human judges, but this …
