March 13, 2024, 4:44 a.m. | Xingyao Wang, Zihan Wang, Jiateng Liu, Yangyi Chen, Lifan Yuan, Hao Peng, Heng Ji

cs.LG updates on arXiv.org

arXiv:2309.10691v3 Announce Type: replace-cross
Abstract: To solve complex tasks, large language models (LLMs) often require multiple rounds of interactions with the user, sometimes assisted by external tools. However, current evaluation protocols often emphasize benchmark performance with single-turn exchanges, neglecting the nuanced interactions among the user, LLMs, and external tools, while also underestimating the importance of natural language feedback from users. These oversights contribute to discrepancies between research benchmark evaluations and real-world use cases. We introduce MINT, a benchmark that evaluates …
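To make the multi-turn protocol concrete, below is a minimal sketch of an evaluation loop in which the model under test can either request a tool call or submit an answer and, if the answer is wrong, receive natural language feedback before retrying. All names here (query_llm, run_tool, simulate_feedback, check_solution) and the "TOOL:" convention are illustrative placeholders, not the benchmark's actual interface.

from dataclasses import dataclass, field

@dataclass
class Turn:
    role: str   # "assistant", "tool", or "user"
    text: str

@dataclass
class Episode:
    task: str
    history: list = field(default_factory=list)
    solved: bool = False

def query_llm(task: str, history: list) -> str:
    """Placeholder: call the LLM under evaluation with the task and dialogue so far."""
    raise NotImplementedError

def run_tool(action: str) -> str:
    """Placeholder: execute a tool call (e.g. a Python snippet) and return its output."""
    raise NotImplementedError

def simulate_feedback(task: str, attempt: str) -> str:
    """Placeholder: a stronger model (or a human) writes natural language feedback."""
    raise NotImplementedError

def check_solution(task: str, attempt: str) -> bool:
    """Placeholder: task-specific correctness check."""
    raise NotImplementedError

def evaluate(task: str, max_turns: int = 5) -> Episode:
    """Run up to max_turns rounds of propose -> execute or feedback -> retry."""
    ep = Episode(task=task)
    for _ in range(max_turns):
        reply = query_llm(task, ep.history)
        ep.history.append(Turn("assistant", reply))
        if reply.startswith("TOOL:"):
            # The model asked to run a tool; return the tool output as a new turn.
            ep.history.append(Turn("tool", run_tool(reply[len("TOOL:"):])))
            continue
        if check_solution(task, reply):
            ep.solved = True
            break
        # Wrong answer: inject natural language feedback and let the model retry.
        ep.history.append(Turn("user", simulate_feedback(task, reply)))
    return ep

Varying max_turns, and ablating the feedback turn, is the kind of comparison a multi-turn benchmark enables but single-turn evaluation cannot capture.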

