Feb. 15, 2024, 5:42 a.m. | Miltiadis Allamanis, Sheena Panthaplackel, Pengcheng Yin

cs.LG updates on arXiv.org

arXiv:2402.08699v1 Announce Type: cross
Abstract: To evaluate code large language models (LLMs), research has relied on a few small, manually curated benchmarks, such as HumanEval and MBPP, which represent only a narrow part of real-world software domains. In this work, we introduce round-trip correctness (RTC) as an alternative evaluation method. RTC allows Code LLM evaluation on a broader spectrum of real-world software domains without the need for costly human curation. RTC rests on the idea that we can ask a …

