Abstract: While Chain-of-Thought (CoT) prompting has become a cornerstone for complex reasoning in Large Language Models (LLMs), the robustness of the generated reasoning remains an open question. We investigate the Decoupling Hypothesis: that the robustness of a model's reasoning path and the robustness of its final answer are largely independent, so correct answers can coexist with arbitrarily fragile reasoning under small input perturbations. To test this systematically, we introduce MATCHA, a novel Answer-Conditioned Probing framework. Unlike standard evaluations that focus on final output accuracy, MATCHA isolates the reasoning phase by conditioning generation on the model's predicted answer, allowing us to stress-test the stability of the rationale itself. Our experiments reveal a critical vulnerability: under imperceptible input perturbations, LLMs frequently maintain the correct answer while generating inconsistent or nonsensical reasoning, effectively being "Right for the Wrong Reasons". Using LLM judges to quantify this robustness gap, we find that multi-step and commonsense tasks are significantly more susceptible to this decoupling than logical tasks. Furthermore, we demonstrate that adversarial examples generated by MATCHA transfer non-trivially to black-box models. Crucially, we show that this fragility is not solely an artifact of our answer-conditioned protocol: while standard CoT-then-Answer generation does not permit strict answer-fixed isolation, it nevertheless exhibits similar patterns of reasoning degradation under analogous attacks. Our findings expose the illusion of CoT robustness and underscore the need for future architectures that enforce genuine answer-reasoning consistency rather than mere surface-level accuracy.
Submission Type: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: N/A
Code: https://github.com/uiuc-focal-lab/MATCHA/tree/main
Assigned Action Editor: ~Shuai_Li3
Submission Number: 7342