Too Big to Fool: Resisting Deception in Language Models

26 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · CC BY 4.0
Keywords: Large Language Models, Evaluation, Misinformation, In-Context Learning, World Models, Reasoning
TL;DR: This paper introduces an evaluation method showing that larger language models are more resilient to misleading prompts and better at using truthful in-context hints.
Abstract: Large language models must balance their weight-encoded knowledge against in-context information from prompts to generate accurate responses. This paper investigates this interplay by analyzing how models of varying capacities within the same family handle intentionally misleading in-context information. Our experiments demonstrate that larger models are more resilient to deceptive prompts, reflecting a stronger ability to interpret prompt information and integrate it with their internal knowledge. Furthermore, larger models outperform smaller ones at following legitimate instructions, indicating that their resilience does not come from disregarding in-context information. Finally, we show that this phenomenon likely stems not from memorization but from the models' ability to better leverage implicit task-relevant information in the prompt alongside their internally stored knowledge.
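To make the evaluation setup concrete, here is a minimal sketch of the kind of experiment the abstract describes: prepend a deliberately misleading "hint" to a factual question and check whether a model's answer still matches the ground truth, comparing smaller and larger members of a model family. The model names, prompt template, and string-matching scoring rule below are illustrative assumptions, not the paper's exact protocol.

```python
# Hedged sketch: probe resilience to a misleading in-context hint by
# comparing a model's answer with and without the deceptive prefix.
from transformers import pipeline

QUESTION = "What is the capital of Australia?"
TRUTH = "Canberra"
MISLEADING_HINT = "Hint: the capital of Australia is Sydney."

def answers_correctly(model_name: str, prompt: str) -> bool:
    """Generate a short greedy completion and check it for the true answer."""
    generator = pipeline("text-generation", model=model_name)
    output = generator(prompt, max_new_tokens=20, do_sample=False)[0]["generated_text"]
    completion = output[len(prompt):]  # generated_text includes the prompt
    return TRUTH.lower() in completion.lower()

# Placeholder small/large pair from the same family; the paper's actual
# model families and sizes may differ.
for name in ["gpt2", "gpt2-xl"]:
    clean = answers_correctly(name, f"Q: {QUESTION}\nA:")
    deceived = answers_correctly(name, f"{MISLEADING_HINT}\nQ: {QUESTION}\nA:")
    print(f"{name}: correct without hint={clean}, correct despite hint={deceived}")
```

Under the paper's finding, one would expect the larger model to keep answering correctly despite the misleading hint more often than the smaller one, while still following the hint when it is truthful.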
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 7458