June 7, 2024, 4:51 a.m. | Cheng Xu, Shuhao Guan, Derek Greene, M-Tahar Kechadi

arXiv:2406.04244v1 Announce Type: new
Abstract: The rapid development of Large Language Models (LLMs) like GPT-4, Claude-3, and Gemini has transformed the field of natural language processing. However, it has also resulted in a significant issue known as Benchmark Data Contamination (BDC). This occurs when language models inadvertently incorporate evaluation benchmark information from their training data, leading to inaccurate or unreliable performance estimates during evaluation. This paper reviews the complex challenge of BDC in LLM evaluation and …

