Feb. 20, 2024, 5:51 a.m. | Jiahao Ying, Yixin Cao, Bo Wang, Wei Tang, Yizhe Yang, Shuicheng Yan

cs.CL updates on arXiv.org

arXiv:2402.11894v1 Announce Type: new
Abstract: Due to their expanding capabilities and pre-training data, Large Language Models (LLMs) face increasingly serious evaluation challenges. On one hand, the data leakage issue causes overestimation on existing benchmarks. On the other hand, periodically curating datasets manually is costly. In this paper, we propose to automate dataset updates for reliable and timely evaluation. The basic idea is to generate unseen and high-quality testing samples based on existing ones to mitigate leakage issues. Specifically, …
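The abstract is truncated before the method details, but the core idea it states, generating unseen test samples from existing benchmark items, can be sketched roughly as below. This is a minimal illustration, not the paper's actual pipeline: `call_llm`, `update_item`, and the prompt wording are all hypothetical stand-ins for whatever generation setup the authors use.

```python
# Minimal sketch of automated dataset updating: prompt an LLM to
# rewrite an existing benchmark item into an "unseen" variant that
# tests the same skill, mitigating leakage from pre-training data.
# `call_llm` is a hypothetical stand-in for any chat-completion client.
import json
from typing import Callable

PROMPT_TEMPLATE = """You are updating a benchmark.
Rewrite the question below so it tests the same skill but is not
verbatim from any public dataset. Keep the difficulty comparable.
Return JSON: {{"question": "...", "answer": "..."}}.

Original question: {question}
Original answer: {answer}
"""

def update_item(item: dict, call_llm: Callable[[str], str]) -> dict:
    """Generate one unseen variant of an existing benchmark item."""
    prompt = PROMPT_TEMPLATE.format(**item)
    raw = call_llm(prompt)
    return json.loads(raw)  # assumes the model returns valid JSON

def update_dataset(items: list[dict],
                   call_llm: Callable[[str], str]) -> list[dict]:
    """Regenerate a whole benchmark from its existing items."""
    return [update_item(item, call_llm) for item in items]
```

In practice such a pipeline would also need a quality filter (e.g. checking that the rewritten answer is still correct), which the truncated abstract hints at with "high-quality testing samples" but does not specify.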
