all AI news
Have Seen Me Before? Automating Dataset Updates Towards Reliable and Timely Evaluation
Feb. 20, 2024, 5:51 a.m. | Jiahao Ying, Yixin Cao, Bo Wang, Wei Tang, Yizhe Yang, Shuicheng Yan
cs.CL updates on arXiv.org arxiv.org
Abstract: Due to the expanding capabilities and pre-training data, Large Language Models (LLMs) are facing increasingly serious evaluation challenges. On one hand, the data leakage issue cause over-estimation on existing benchmarks. On the other hand, periodically curating datasets manually is costly. In this paper, we propose to automate dataset updates for reliable and timely evaluation. The basic idea is to generate unseen and high-quality testing samples based on existing ones to mitigate leakage issues. In specific, …
abstract arxiv benchmarks capabilities challenges cs.cl data data leakage dataset datasets evaluation issue language language models large language large language models llms paper pre-training training training data type updates
More from arxiv.org / cs.CL updates on arXiv.org
Jobs in AI, ML, Big Data
Data Engineer
@ Lemon.io | Remote: Europe, LATAM, Canada, UK, Asia, Oceania
Artificial Intelligence – Bioinformatic Expert
@ University of Texas Medical Branch | Galveston, TX
Lead Developer (AI)
@ Cere Network | San Francisco, US
Research Engineer
@ Allora Labs | Remote
Ecosystem Manager
@ Allora Labs | Remote
Founding AI Engineer, Agents
@ Occam AI | New York