Keywords: reasoning, benchmark, mathematics, evaluation, interpretability
TL;DR: MathBode is a dynamic, frequency-domain benchmark for math reasoning.
Abstract: We present $\textbf{MathBode}$, a dynamic diagnostic for mathematical reasoning in large language models (LLMs). Instead of one-shot accuracy, MathBode treats each parametric problem as a system: we drive a single parameter sinusoidally and fit first-harmonic responses of model outputs and exact solutions. This yields interpretable, frequency-resolved metrics—gain (amplitude tracking) and phase (lag)—that form Bode-style fingerprints. Across five closed-form families (linear solve, ratio/saturation, compound interest, $2\times2$ linear systems, similar triangles), the diagnostic surfaces systematic low-pass behavior and growing phase lag that accuracy alone obscures. We compare several models against a symbolic baseline that calibrates the instrument (G ≈ 1, φ ≈ 0). Results separate frontier from mid-tier models on dynamics, providing a compact, reproducible protocol that complements standard benchmarks with actionable measurements of reasoning fidelity and consistency. We open-source the dataset and code to enable further research and adoption.
Submission Number: 146
Loading