Mind Games Machines Play: Contrastive Cognitive Bias Detection in LLMs and Distilled Models

Published: 23 Sept 2025 · Last Modified: 17 Feb 2026 · CogInterp @ NeurIPS 2025 (Reject) · License: CC BY 4.0
Keywords: LLMs, distilled models, cognitive bias, framing bias, anchoring bias, bias detection, interpretability, model distillation, contrastive evaluation, Qwen2, DeepSeek, bias mitigation
TL;DR: Both LLMs and their distilled counterparts show strong framing and anchoring biases; this paper introduces a systematic contrastive test framework to detect and compare these biases, and proposes mitigation strategies for fairer AI.
Abstract: Large language models (LLMs) and their distilled derivatives have revolutionized natural language processing but remain vulnerable to cognitive biases that parallel systematic judgment errors in humans. This study examines the prevalence of framing and anchoring biases in two state-of-the-art models: Qwen2-7B-Instruct and DeepSeek-R1-Distill-Qwen-1.5B. Using a novel contrastive test set grounded in cognitive psychology, we demonstrate that both models exhibit significant framing and anchoring biases. We further analyze how model design, training regimes, and feedback mechanisms shape bias expression, and propose mitigation strategies. The proposed framework offers a systematic approach to auditing and comparing cognitive biases, supporting the development of fairer, more interpretable language models.
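
To make the contrastive setup concrete, here is a minimal sketch of a framing-bias probe in the spirit the abstract describes. The prompt pair, the answer-flip check, and the use of Hugging Face transformers' chat-aware text-generation pipeline are illustrative assumptions, not the paper's actual test set; an anchoring probe would pair prompts with and without a numeric anchor in the same way.

```python
# Minimal sketch of a contrastive framing-bias probe (illustrative, not the
# authors' code). Requires a recent `transformers` with chat-format pipelines.
from transformers import pipeline

# Each pair states a logically equivalent scenario under opposite framings
# (gain vs. loss), following the classic Tversky-Kahneman design.
FRAMING_PAIRS = [
    (
        "A treatment saves 200 of 600 patients. Should the hospital adopt it? Answer yes or no.",
        "Under a treatment, 400 of 600 patients die. Should the hospital adopt it? Answer yes or no.",
    ),
]

# One of the two models named in the paper; the distilled model can be
# swapped in ("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B") for comparison.
generator = pipeline("text-generation", model="Qwen/Qwen2-7B-Instruct")

def first_response(prompt: str) -> str:
    """Return the model's short greedy answer to a single probe prompt."""
    messages = [{"role": "user", "content": prompt}]
    out = generator(messages, max_new_tokens=16, do_sample=False)
    # With chat-format input, generated_text is the full message list;
    # the last entry is the assistant's reply.
    return out[0]["generated_text"][-1]["content"].strip().lower()

flips = 0
for gain_frame, loss_frame in FRAMING_PAIRS:
    a, b = first_response(gain_frame), first_response(loss_frame)
    # Framing bias shows up as divergent answers to equivalent questions.
    if ("yes" in a) != ("yes" in b):
        flips += 1
    print(f"gain-frame: {a!r} | loss-frame: {b!r}")

print(f"framing flips: {flips}/{len(FRAMING_PAIRS)}")
```

The key property is that the members of each pair are logically equivalent, so any systematic answer flip across frames is attributable to the framing manipulation rather than to the content of the question.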
Submission Number: 80