May 2, 2024, 4:42 a.m. | Hugh Zhang, Jeff Da, Dean Lee, Vaughn Robinson, Catherine Wu, Will Song, Tiffany Zhao, Pranav Raja, …

cs.LG updates on arXiv.org arxiv.org

arXiv:2405.00332v1 Announce Type: cross
Abstract: Large language models (LLMs) have achieved impressive success on many benchmarks for mathematical reasoning. However, there is growing concern that some of this performance actually reflects dataset contamination, where data closely resembling benchmark questions leaks into the training data, instead of true reasoning ability. To investigate this claim rigorously, we commission Grade School Math 1000 (GSM1k). GSM1k is designed to mirror the style and complexity of the established GSM8k benchmark, the gold standard for measuring …
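The paper's core signal for contamination is the gap between a model's accuracy on the established GSM8k benchmark and on the freshly commissioned, stylistically matched GSM1k set: a model that has memorized benchmark-like data should score noticeably worse on the unseen questions. Below is a minimal illustrative sketch (not the authors' evaluation harness) of that comparison; the file names, record fields (`model_answer`, `gold_answer`), and the final-number answer check are assumptions for the example.

```python
# Illustrative sketch: compare accuracy on GSM8k vs. a held-out GSM8k-style set
# (e.g. GSM1k). A large accuracy drop on the held-out set is a possible sign of
# benchmark overfitting / contamination rather than genuine reasoning ability.
# File names and record format below are hypothetical.
import json
import re
from typing import Optional


def final_number(text: str) -> Optional[str]:
    """Extract the last number in an answer string (GSM-style final answer)."""
    matches = re.findall(r"-?\d[\d,]*\.?\d*", text)
    return matches[-1].replace(",", "") if matches else None


def accuracy(records: list) -> float:
    """Fraction of records whose model answer matches the gold answer numerically."""
    correct = sum(
        final_number(r["model_answer"]) == final_number(r["gold_answer"])
        for r in records
    )
    return correct / len(records)


def load(path: str) -> list:
    """Load one JSON record per line (hypothetical results format)."""
    with open(path) as f:
        return [json.loads(line) for line in f]


if __name__ == "__main__":
    gsm8k = load("gsm8k_results.jsonl")  # hypothetical model outputs on GSM8k
    gsm1k = load("gsm1k_results.jsonl")  # hypothetical model outputs on GSM1k
    acc_8k, acc_1k = accuracy(gsm8k), accuracy(gsm1k)
    print(f"GSM8k accuracy: {acc_8k:.1%}")
    print(f"GSM1k accuracy: {acc_1k:.1%}")
    print(f"Accuracy drop (possible overfitting signal): {acc_8k - acc_1k:+.1%}")
```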
