Big Reasoning with Small Models: Instruction Retrieval at Inference Time

ACL ARR 2026 January Submission 6973 Authors

06 Jan 2026 (modified: 20 Mar 2026), CC BY 4.0
Keywords: small language models, inference-time intervention, instruction retrieval, retrieval-augmented reasoning, model efficiency, structured reasoning
Abstract: Small language models (SLMs) enable low-cost, private, on-device inference, but they often fail on problems that require specialized domain knowledge or multi-step reasoning. Existing approaches for improving reasoning either rely on scale (e.g., chain-of-thought prompting), require task-specific training that limits reuse and generality (e.g., distillation), or retrieve unstructured information that still leaves the SLM to determine an appropriate reasoning strategy. We propose instruction retrieval, an inference-time intervention that augments an SLM with structured, reusable reasoning procedures rather than raw passages. We construct an Instruction Corpus by clustering similar training questions and using a teacher model to generate generalizable guides that pair domain background with explicit step-by-step procedures. At inference, the SLM retrieves the instructions most relevant to a given query and executes the associated procedures without any additional fine-tuning. Across three challenging domains (medicine, law, and mathematics), instruction retrieval yields consistent gains for models with at least 3B parameters, improving accuracy by 9.4%, 7.9%, and 5.1%, respectively, with the strongest 14B model surpassing GPT-4o's zero-shot performance on knowledge-intensive tasks.
Paper Type: Long
Research Area: LLM Efficiency
Research Area Keywords: LLM Efficiency, NLP in resource-constrained settings, efficient models, inference methods, prompting, knowledge-augmented methods, retrieval, security and privacy
Contribution Types: Approaches for low compute settings-efficiency
Languages Studied: English
Submission Number: 6973
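The inference-time step described in the abstract, retrieving the most relevant guide from an Instruction Corpus and prepending it to the SLM's prompt, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the corpus entries, the bag-of-words cosine scorer (standing in for whatever retriever the paper uses), and the prompt template are all invented for the example.

```python
from collections import Counter
import math

# Hypothetical Instruction Corpus entries: each pairs a topic with a guide
# that combines domain background and a step-by-step procedure (assumed format).
INSTRUCTION_CORPUS = [
    {"topic": "drug interactions",
     "guide": "Background: pharmacokinetics. Steps: 1) list each drug's mechanism; 2) check shared pathways."},
    {"topic": "contract law",
     "guide": "Background: contract formation. Steps: 1) identify offer and acceptance; 2) check consideration."},
    {"topic": "modular arithmetic",
     "guide": "Background: residues. Steps: 1) reduce each term mod n; 2) combine and reduce again."},
]

def bow(text):
    """Bag-of-words term counts; a stand-in for a learned embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, corpus, k=1):
    """Return the k guides whose topic+guide text is most similar to the query."""
    q = bow(query)
    ranked = sorted(corpus,
                    key=lambda e: cosine(q, bow(e["topic"] + " " + e["guide"])),
                    reverse=True)
    return ranked[:k]

def build_prompt(query, corpus):
    """Prepend the retrieved guide(s) to the question; the SLM then executes the steps."""
    guides = "\n".join(e["guide"] for e in retrieve(query, corpus))
    return f"Instructions:\n{guides}\n\nQuestion: {query}\nFollow the steps above."
```

The key contrast with passage-level RAG is that the retrieved unit is a procedure to execute, not raw text to condition on, so no fine-tuning of the SLM is needed.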