Keywords: Reasoning System, LLM
Abstract: A fundamental tension plagues complex reasoning in LLMs: models are biased towards probabilistic shortcuts and flawed decompositions, yet tasks demand logical rigor.
Existing methods, from heuristic prompting to large-scale training, fail to resolve this conflict and therefore cannot guarantee reliability at test time. This reliance on heuristics and training-time fixes limits scalability, invites reward hacking, and produces brittle, hard-to-interpret behaviors that constrain the discovery of non-human, but potentially superior, reasoning strategies.
We introduce Atomos, a training-free framework that achieves reliable reasoning by composing fully controllable atomic steps, each verified by the same base model.
The core insight is that while generating complex solutions is hard, strong models can already solve and, more importantly, verify atomic subproblems with high accuracy.
Crucially, verification is typically far cheaper than generation.
Atomos leverages this asymmetry by wrapping each step in a low-overhead self-checking loop, where the same base model acts as its own verifier.
This transforms the challenge of global reliability into a problem of test-time compute scheduling.
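As a concrete illustration, the per-step loop can be sketched as below; the function names, prompts, and retry budget are hypothetical stand-ins for whatever interface the base model exposes, not the Atomos API.

# Illustrative sketch of the per-step self-checking loop (hypothetical interface).
# `call_model` stands in for any completion call to the base model.
def solve_atomic_step(call_model, step_description: str, max_retries: int = 4):
    """Generate a candidate for one atomic subproblem and have the same base
    model verify it; retry until a candidate passes or the per-step budget runs out."""
    for _ in range(max_retries):
        candidate = call_model(f"Solve this atomic subproblem:\n{step_description}")
        verdict = call_model(
            "Verify the following solution to the subproblem. "
            "Answer strictly VALID or INVALID.\n"
            f"Subproblem: {step_description}\nSolution: {candidate}"
        )
        if verdict.strip().upper().startswith("VALID"):
            return candidate  # accepted step, composed into the current path
    return None  # step failed; the scheduler may resample the whole path instead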
We show that this reliability is governed by how compute is split between two fundamental axes: world sampling (exploring diverse reasoning paths) and path sampling (deepening verification and retries within a single path).
This trade-off yields predictable isoperformance curves and a simple rule for optimally allocating a compute budget.
Our theory further reveals that the cost to achieve a target level of correctness grows only linearly with problem complexity but polylogarithmically with the reliability requirement itself, making extreme reliability surprisingly affordable.
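In symbols (notation ours, not the paper's): writing $n$ for the number of atomic steps and $\varepsilon$ for the tolerated failure probability, this scaling corresponds to a total compute cost of roughly $C(n, \varepsilon) = O\!\left(n \cdot \mathrm{polylog}(1/\varepsilon)\right)$, so tightening the reliability target by orders of magnitude inflates the cost by only a polylogarithmic factor.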
Empirically, using the Gemini-2.5-Pro model, Atomos provides the correct answer and proof for IMO 2025 Problem 6 within two hours.
Primary Area: foundation or frontier models, including LLMs
Submission Number: 498