Are Reflection Words in Large Reasoning Models a Sign of Genuine Capability or Memorized Patterns?

ACL ARR 2025 May Submission 7337 Authors

20 May 2025 (modified: 03 Jul 2025) · License: CC BY 4.0
Abstract: Recent Large Reasoning Models (LRMs) exhibit strong reasoning abilities on tasks such as mathematics and logical inference, notably through human-like self-verification and reflection in their chains of thought. However, it remains unclear whether these reflective statements stem from genuine internal mechanisms or are merely memorized patterns. From a model-interpretability perspective, this work investigates the representation space of LRMs to determine whether specific features causally govern reflective capabilities. Using a difference-in-means approach, we extract "Self-Reflection Features" by contrasting model activations during self-reflection with those during affirmative answering. Further causal analysis reveals that these features strongly influence the knowledge parameters associated with reflection words, suggesting that such outputs are genuine manifestations of internal mechanisms rather than memorization. Finally, causal interventions demonstrate that modulating these features flexibly adjusts the model's self-reflective intensity.
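Below is a minimal sketch of the two techniques the abstract names: difference-in-means feature extraction and a causal steering intervention. It assumes access to hidden states from a Hugging Face causal LM; the model name, contrast prompts, layer index, and steering strength `ALPHA` are illustrative assumptions, not the authors' actual configuration.

```python
# Hedged sketch: extract a "Self-Reflection Feature" via difference-in-means,
# then steer generation with it. All specifics (model, prompts, layer, ALPHA)
# are placeholders for illustration only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder; the paper studies LRMs
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

LAYER = 12   # assumed intervention layer
ALPHA = 4.0  # assumed scale: >0 amplifies self-reflection, <0 suppresses it

# Toy contrast sets: reflective vs. affirmative continuations.
reflective = ["Wait, let me double-check that step.",
              "Hmm, I should verify this answer again."]
affirmative = ["Yes, the answer is correct.",
               "The result follows directly."]

@torch.no_grad()
def mean_activation(texts, layer):
    """Mean hidden state at `layer`, averaged over tokens and examples."""
    vecs = []
    for t in texts:
        ids = tok(t, return_tensors="pt")
        out = model(**ids, output_hidden_states=True)
        vecs.append(out.hidden_states[layer].mean(dim=1).squeeze(0))
    return torch.stack(vecs).mean(dim=0)

# Difference-in-means direction: reflection minus affirmation.
direction = mean_activation(reflective, LAYER) - mean_activation(affirmative, LAYER)
direction = direction / direction.norm()

def steering_hook(module, inputs, output):
    """Add the reflection direction to this layer's residual stream."""
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + ALPHA * direction.to(hidden.dtype)
    return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

# Causal intervention: hook the chosen decoder layer, generate, then clean up.
handle = model.model.layers[LAYER].register_forward_hook(steering_hook)
ids = tok("Solve: 17 * 23 =", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=60)[0],
                 skip_special_tokens=True))
handle.remove()
```

Sweeping `ALPHA` over positive and negative values would correspond to the abstract's claim that the feature flexibly modulates self-reflective intensity.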
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: LLM Interpretability, Large Reasoning Models
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 7337