Feb. 15, 2024, 5:43 a.m. | Siwei Yang, Bingchen Zhao, Cihang Xie

cs.LG updates on arXiv.org

arXiv:2402.09404v1 Announce Type: cross
Abstract: This paper introduces AQA-Bench, a novel benchmark to assess the sequential reasoning capabilities of large language models (LLMs) in algorithmic contexts, such as depth-first search (DFS). The key feature of the benchmark is its interactive evaluation protocol: in DFS, for example, the edges connected to a node become available only once the model has traversed to that node, so the LLM must remember which nodes it has visited and plan its subsequent moves. We …
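
The interactive protocol described in the abstract can be illustrated with a minimal sketch. This is not the AQA-Bench implementation: the class `InteractiveGraphEnv`, the function `dfs_agent`, and the toy graph are hypothetical names chosen for illustration, and the sketch assumes an undirected graph so that backtracking along an edge is always legal. The environment hides the adjacency list and reveals only the current node's edges, so the agent (here a reference DFS policy standing in for an LLM) has to keep its own record of visited nodes.

```python
from typing import Dict, List, Set


class InteractiveGraphEnv:
    """Hypothetical environment: edges are visible only at the current node."""

    def __init__(self, adjacency: Dict[int, List[int]], start: int) -> None:
        self._adjacency = adjacency  # hidden from the agent
        self.current = start
        self.visited: Set[int] = {start}

    def observe(self) -> List[int]:
        # Reveal only the neighbours of the node the agent is standing on.
        return sorted(self._adjacency[self.current])

    def step(self, target: int) -> None:
        # A move is legal only along an edge of the current node.
        if target not in self._adjacency[self.current]:
            raise ValueError(f"no edge from {self.current} to {target}")
        self.current = target
        self.visited.add(target)

    def done(self) -> bool:
        return len(self.visited) == len(self._adjacency)


def dfs_agent(env: InteractiveGraphEnv) -> List[int]:
    """Reference policy: DFS that remembers visited nodes and backtracks."""
    path = [env.current]
    stack = [env.current]   # nodes on the current DFS branch
    seen = {env.current}    # the agent's own memory of visited nodes
    while not env.done():
        unvisited = [n for n in env.observe() if n not in seen]
        if unvisited:
            nxt = unvisited[0]
            seen.add(nxt)
            stack.append(nxt)
        else:
            stack.pop()
            if not stack:   # nothing left to backtrack to
                break
            nxt = stack[-1]  # backtrack to the parent (assumes undirected edges)
        env.step(nxt)
        path.append(nxt)
    return path


# Toy undirected graph (assumed for illustration only).
graph = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}
env = InteractiveGraphEnv(graph, start=0)
trace = dfs_agent(env)
```

In an actual evaluation the reference policy would be replaced by the model's replies, with each reply parsed into a `step` call. The comparison of a model's trace against a reference traversal is left out here, since the truncated abstract does not specify how scoring works.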


