June 7, 2024, 4:45 a.m. | Navreet Kaur, Monojit Choudhury, Danish Pruthi

arXiv:2312.08800v2 Announce Type: replace-cross
Abstract: As corporations rush to integrate large language models (LLMs) into their search offerings, it is critical that they provide factually accurate information that is robust to any presuppositions that a user may express. In this work, we introduce UPHILL, a dataset consisting of health-related queries with varying degrees of presuppositions. Using UPHILL, we evaluate the factual accuracy and consistency of InstructGPT, ChatGPT, and BingChat models. We find that while model responses rarely disagree with true …

