Are Reflective Words in Large Reasoning Models a Sign of Genuine Capability or Memorized Patterns?

ACL ARR 2026 January Submission4185 Authors

05 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: LLM Interpretability, Large Reasoning Models, Reflection Features
Abstract: Recent Large Reasoning Models (LRMs) exhibit strong reasoning abilities in tasks such as mathematics and logical inference, notably through human-like self-verification and reflection in their chains of thought. However, it remains unclear whether these reflective statements stem from genuine internal mechanisms or merely reproduce memorized patterns. From a model-interpretability perspective, this work investigates LRMs' representation space to determine whether specific features causally govern reflective capabilities. Using a difference-in-means approach, we extract "Self-Reflection Features" by contrasting model activations during self-reflection with those during affirmative answering. Further causal analysis reveals that these features strongly influence knowledge parameters associated with reflection words, suggesting that such outputs are genuine manifestations of internal mechanisms rather than memorized surface patterns. Finally, causal interventions demonstrate that modulating these features flexibly adjusts the model's self-reflective intensity.
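The difference-in-means extraction described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the activation arrays are synthetic stand-ins (in practice they would be hooked from an LRM's hidden states on reflective vs. affirmative continuations), and the `steer` helper and `alpha` scale are hypothetical names for the intervention step.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # hidden size (illustrative)

# Synthetic stand-ins for hidden-state activations collected on two
# contrastive prompt sets: self-reflection vs. affirmative answering.
reflect_acts = rng.normal(0.5, 1.0, size=(200, d))
affirm_acts = rng.normal(-0.5, 1.0, size=(200, d))

# Difference-in-means: the candidate "Self-Reflection Feature" direction.
direction = reflect_acts.mean(axis=0) - affirm_acts.mean(axis=0)
direction /= np.linalg.norm(direction)

def steer(activation: np.ndarray, alpha: float) -> np.ndarray:
    """Hypothetical intervention: shift an activation along the feature.

    alpha > 0 should intensify reflective behaviour, alpha < 0 suppress it.
    """
    return activation + alpha * direction

# The two groups separate cleanly when projected onto the direction.
proj_reflect = reflect_acts @ direction
proj_affirm = affirm_acts @ direction
print(proj_reflect.mean() > proj_affirm.mean())
```

Projecting held-out activations onto the extracted direction is the usual sanity check that the contrast captures a real axis of variation rather than noise.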
Paper Type: Short
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: LLM Interpretability, Large Reasoning Models
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 4185