June 27, 2024, 4:42 a.m. | Lei Zhang, Yunshui Li, Ziqiang Liu, Jiaxi Yang, Junhao Liu, Longze Chen, Run Luo, Min Yang

cs.CL updates on arXiv.org

arXiv:2312.09542v2 Announce Type: replace
Abstract: With the advancement of large language models (LLMs) and the expansion of their context windows, existing long-context benchmarks fall short in effectively evaluating the models' comprehension and reasoning abilities in extended texts. Moreover, conventional benchmarks relying on F1 metrics often inaccurately score responses: they may undervalue correct answers that differ from the reference responses and overvalue incorrect ones that resemble the reference texts. In response to these limitations, we introduce Marathon, a novel evaluation benchmark …
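The scoring failure the abstract describes is easy to reproduce with a standard token-overlap F1, the kind used by SQuAD-style QA benchmarks (a minimal sketch for illustration; the exact metric and examples here are assumptions, not taken from the paper):

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted answer and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection: how many tokens the two answers share.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# A correct answer phrased differently from the reference is undervalued:
correct = token_f1("1990", "He left in the year 1990")
# → 2/7 ≈ 0.29, despite being right.

# A wrong answer that copies the reference wording is overvalued:
wrong = token_f1("He left in the year 1890", "He left in the year 1990")
# → 5/6 ≈ 0.83, despite getting the year wrong.
```

The wrong answer outscores the correct one by a wide margin, which is exactly the mismatch Marathon's evaluation design is meant to avoid.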

