Offloaded Reasoning: Efficient Inference for Large Language Models via Modular Reasoning and Refinement

Published: 10 Jun 2025, Last Modified: 01 Jul 2025 · TTODLer-FM @ ICML 2025 (Oral) · CC BY 4.0
Keywords: Speculative and Parallel execution/decoding, Approximate inference methods
Abstract: Large language models (LLMs) demonstrate strong reasoning capabilities but are expensive to run at inference time, limiting their practical deployment. We propose **Offloaded Reasoning** (OR), a modular strategy where a lightweight model generates intermediate reasoning traces that are then used by a larger model to produce the final answer. We further introduce **Offloaded Reasoning with Refinement** (ORR), where the large model first edits or improves the reasoning trace before answering. Unlike token-level acceleration methods, OR and ORR operate at the reasoning level and require no retraining of the large model. Experiments on GSM8K and Math500 show that OR achieves up to 8x faster inference than full large-model reasoning with minimal accuracy loss, while ORR recovers or exceeds full accuracy at substantially lower cost. Our results highlight the potential of modular, delegation-based reasoning for building more efficient and adaptable LLM systems.
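The delegation scheme described in the abstract can be sketched as a simple two-stage pipeline. The sketch below is illustrative only: `small_model` and `large_model` are hypothetical stand-ins for the actual LLM inference calls, and the prompt templates are assumptions, not the paper's implementation.

```python
# Hypothetical sketch of Offloaded Reasoning (OR) and OR with Refinement (ORR).
# `small_model` and `large_model` are placeholder callables standing in for
# real LLM endpoints; they are NOT part of the paper's released code.

def small_model(prompt: str) -> str:
    # Stand-in for the lightweight model that drafts a reasoning trace.
    return f"[draft trace for: {prompt}]"

def large_model(prompt: str) -> str:
    # Stand-in for the large model that refines the trace or produces the answer.
    return f"[large-model output for: {prompt}]"

def offloaded_reasoning(question: str, refine: bool = False) -> str:
    """OR: the small model drafts the reasoning trace; the large model answers.

    ORR (refine=True): the large model first edits the trace before answering,
    trading some of OR's speedup for recovered accuracy.
    """
    trace = small_model(question)
    if refine:
        trace = large_model(f"Improve this reasoning trace:\n{trace}")
    return large_model(f"Question: {question}\nReasoning: {trace}\nAnswer:")
```

In this framing the large model never generates the (long) reasoning trace from scratch in OR, which is where the claimed inference savings come from; ORR adds one extra large-model pass over the trace.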
Submission Number: 28