Are Reflection Words in Large Reasoning Models a Sign of Genuine Capability or Memorized Patterns?

ACL ARR 2025 May Submission 7337 Authors

20 May 2025 (modified: 03 Jul 2025) · License: CC BY 4.0
Abstract: Recent Large Reasoning Models (LRMs) exhibit strong reasoning abilities on tasks such as mathematics and logical inference, notably through human-like self-verification and reflection in their chains of thought. However, it remains unclear whether these reflective statements stem from genuine internal mechanisms or are merely memorized patterns. From a model-interpretability perspective, this work investigates the representation space of LRMs to determine whether specific features causally govern reflective capabilities. Using a difference-in-means approach, we extract "Self-Reflection Features" by contrasting model activations during self-reflection with those during affirmative answering. Further causal analysis reveals that these features strongly influence the knowledge parameters associated with reflection words, suggesting that such outputs are genuine manifestations of internal mechanisms rather than memorization. Finally, causal interventions demonstrate that modulating these features flexibly adjusts the model's self-reflective intensity.
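Below is a minimal sketch of the two techniques the abstract names: difference-in-means feature extraction and a causal steering intervention. It assumes access to hidden states from a Hugging Face causal LM; the model name, contrast prompts, layer index, and steering strength `ALPHA` are illustrative assumptions, not the authors' actual configuration.

```python
# Hedged sketch: extract a "Self-Reflection Feature" via difference-in-means,
# then steer generation with it. All specifics (model, prompts, layer, ALPHA)
# are placeholders for illustration only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder; the paper studies LRMs
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

LAYER = 12   # assumed intervention layer
ALPHA = 4.0  # assumed scale: >0 amplifies self-reflection, <0 suppresses it

# Toy contrast sets: reflective vs. affirmative continuations.
reflective = ["Wait, let me double-check that step.",
              "Hmm, I should verify this answer again."]
affirmative = ["Yes, the answer is correct.",
               "The result follows directly."]

@torch.no_grad()
def mean_activation(texts, layer):
    """Mean hidden state at `layer`, averaged over tokens and examples."""
    vecs = []
    for t in texts:
        ids = tok(t, return_tensors="pt")
        out = model(**ids, output_hidden_states=True)
        vecs.append(out.hidden_states[layer].mean(dim=1).squeeze(0))
    return torch.stack(vecs).mean(dim=0)

# Difference-in-means direction: reflection minus affirmation.
direction = mean_activation(reflective, LAYER) - mean_activation(affirmative, LAYER)
direction = direction / direction.norm()

def steering_hook(module, inputs, output):
    """Add the reflection direction to this layer's residual stream."""
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + ALPHA * direction.to(hidden.dtype)
    return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

# Causal intervention: hook the chosen decoder layer, generate, then clean up.
handle = model.model.layers[LAYER].register_forward_hook(steering_hook)
ids = tok("Solve: 17 * 23 =", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=60)[0],
                 skip_special_tokens=True))
handle.remove()
```

Sweeping `ALPHA` over positive and negative values would correspond to the abstract's claim that the feature flexibly modulates self-reflective intensity.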
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: LLM Interpretability, Large Reasoning Models
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 7337