Keywords: chain-of-thought, reasoning, evaluation
TL;DR: LLMs remain accurate on physics problems by patching gaps in heavily deleted CoT reasoning traces, showing that their reliance on those traces is shallow rather than faithful.
Abstract: Reasoning-focused language models are increasingly applied to AI for science, but evaluation has not kept pace: benchmarks largely measure end-task accuracy while ignoring whether models genuinely depend on their own reasoning traces. This gap is critical in domains like physics problem solving, where equations, units, and structured terminology make reasoning reliability both essential and testable. We introduce a systematic deletion framework that intercepts chain-of-thought (CoT) mid-generation, removes tokens, and measures downstream effects. Applied to three open-source models—Magistral, Phi-4, and Qwen-A3B—across multiple physics benchmarks, our method shows that models remain accurate under heavy deletions (40–60%) by “cramming” reconstructed steps into final answers. Overlap analyses reveal that deleted equations and facts often reappear, but inconsistently across strategies, exposing shallow and opportunistic reliance on CoT. These findings underscore that current accuracy-based evaluations are insufficient for scientific domains, and point toward the need for methods that assess reasoning faithfulness as a core requirement for advancing AI for science.
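To make the intervention concrete, the following is a minimal sketch of such a deletion probe, not the authors' implementation: the model wrapper and its `generate_with_cot` / `answer_from_cot` methods are hypothetical placeholders standing in for whatever generation interface is actually used.

```python
# Hypothetical sketch of a CoT-deletion probe: generate a reasoning trace,
# delete a fraction of its tokens, then force the model to answer from the
# corrupted trace and check whether the final answer changes.
# `model.generate_with_cot` and `model.answer_from_cot` are assumed
# placeholder methods, not part of any specific library.
import random


def delete_tokens(cot_tokens, deletion_rate, rng):
    """Randomly drop a fraction of tokens from a chain-of-thought trace."""
    return [t for t in cot_tokens if rng.random() > deletion_rate]


def probe_model(model, question, deletion_rate=0.5, seed=0):
    """Measure the downstream effect of deleting part of the CoT."""
    rng = random.Random(seed)

    # 1. Full generation: reasoning trace plus final answer.
    cot, answer_full = model.generate_with_cot(question)

    # 2. Corrupt the trace by deleting a fixed fraction of its tokens.
    corrupted_cot = " ".join(delete_tokens(cot.split(), deletion_rate, rng))

    # 3. Resume generation conditioned only on the corrupted trace.
    answer_corrupted = model.answer_from_cot(question, corrupted_cot)

    return {
        "deletion_rate": deletion_rate,
        "answer_unchanged": answer_full == answer_corrupted,
        "corrupted_cot": corrupted_cot,
    }
```

If accuracy barely moves even at high deletion rates, the probe indicates that the model is reconstructing the missing steps at answer time rather than depending on the trace it originally generated.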
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Submission Number: 4526