Are Reflective Words in Large Reasoning Models a Sign of Genuine Capability or Memorized Patterns?

ACL ARR 2026 January Submission4185 Authors

05 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: LLM Interpretability, Large Reasoning Models, Reflection Features
Abstract: Recent Large Reasoning Models (LRMs) exhibit strong reasoning abilities in tasks such as mathematics and logical inference, notably through human-like self-verification and reflection in their chains of thought. However, it remains unclear whether these reflective statements stem from genuine internal mechanisms or merely reproduce memorized patterns. From a model-interpretability perspective, this work investigates LRMs' representation space to determine whether specific features causally govern reflective capabilities. Using a difference-in-means approach, we extract "Self-Reflection Features" by contrasting model activations during self-reflection with those during affirmative answering. Further causal analysis reveals that these features strongly influence knowledge parameters associated with reflection words, suggesting that such outputs are genuine manifestations of internal mechanisms rather than memorized surface patterns. Finally, causal interventions demonstrate that modulating these features flexibly adjusts the model's self-reflective intensity.
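The difference-in-means extraction described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the activation arrays are synthetic stand-ins (in practice they would be hooked from an LRM's hidden states on reflective vs. affirmative continuations), and the `steer` helper and `alpha` scale are hypothetical names for the intervention step.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # hidden size (illustrative)

# Synthetic stand-ins for hidden-state activations collected on two
# contrastive prompt sets: self-reflection vs. affirmative answering.
reflect_acts = rng.normal(0.5, 1.0, size=(200, d))
affirm_acts = rng.normal(-0.5, 1.0, size=(200, d))

# Difference-in-means: the candidate "Self-Reflection Feature" direction.
direction = reflect_acts.mean(axis=0) - affirm_acts.mean(axis=0)
direction /= np.linalg.norm(direction)

def steer(activation: np.ndarray, alpha: float) -> np.ndarray:
    """Hypothetical intervention: shift an activation along the feature.

    alpha > 0 should intensify reflective behaviour, alpha < 0 suppress it.
    """
    return activation + alpha * direction

# The two groups separate cleanly when projected onto the direction.
proj_reflect = reflect_acts @ direction
proj_affirm = affirm_acts @ direction
print(proj_reflect.mean() > proj_affirm.mean())
```

Projecting held-out activations onto the extracted direction is the usual sanity check that the contrast captures a real axis of variation rather than noise.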
Paper Type: Short
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: LLM Interpretability, Large Reasoning Models
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 4185