VarBench: Robust Language Model Benchmarking Through Dynamic Variable Perturbation
June 26, 2024, 4:42 a.m. | Kun Qian, Shunji Wan, Claudia Tang, Youzhi Wang, Xuanming Zhang, Maximillian Chen, Zhou Yu
cs.CL updates on arXiv.org
Abstract: As large language models achieve impressive scores on traditional benchmarks, a growing number of researchers are concerned about benchmark data leakage during pre-training, commonly known as the data contamination problem. To ensure fair evaluation, recent benchmarks release only the training and validation sets, keeping the test-set labels closed-source. They require anyone wishing to evaluate a language model to submit the model's predictions for centralized processing and then publish the model's result on their …