April 8, 2024, 4:46 a.m. | Harsh Kohli, Huan Sun

cs.CL updates on arXiv.org

arXiv:2404.04237v1 Announce Type: new
Abstract: The rapid progress of large language models (LLMs) has seen them excel on, and frequently surpass human performance on, standard benchmarks. This has enabled many downstream applications, such as LLM agents, to rely on their sophisticated reasoning to navigate complex task requirements. However, LLMs are known to falter unexpectedly on simple tasks and in seemingly straightforward circumstances, underscoring the need for better and more diverse evaluation setups to measure their true capabilities. To this end, …
