Keywords: calibration/uncertainty, reasoning
Abstract: How do large language models (LLMs) perceive task difficulty, and how does this perception shape their problem-solving capabilities? While existing work on the epistemology of LLMs has focused mainly on confidence, difficulty perception offers a novel perspective on models’ knowledge and reasoning processes. This question becomes especially relevant in the context of large reasoning models (LRMs), where test-time compute can be allocated in the form of special thinking tokens depending on problem difficulty. Our experiments with six LRMs on mathematical, competitive programming, science, and social reasoning tasks reveal that difficulty perception is not random; its ranking instead correlates highly across different models. We present evidence that models perform better on problems they deem easier, a correspondence stronger than that with verbalized confidence. We also show that cues overstating problem difficulty in prompts can cause reasoning inefficiency. Our findings establish difficulty perception as a concept distinct from verbalized confidence in model epistemology, while highlighting risks from simple prompt injections containing hints of difficulty.
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: calibration/uncertainty
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Submission Number: 10782