Reasoning Under 1 Billion: Memory-Augmented Reinforcement Learning for Large Language Models

TMLR Paper 4826 Authors

10 May 2025 (modified: 14 May 2025) · Under review for TMLR · CC BY 4.0
Abstract: Recent advances in fine-tuning large language models (LLMs) with reinforcement learning (RL) have shown promising improvements on complex reasoning tasks, particularly when paired with chain-of-thought (CoT) prompting. However, these successes have largely been demonstrated on large-scale models with billions of parameters, where a strong pretraining foundation ensures effective initial exploration. In contrast, RL remains challenging for tiny LLMs with 1 billion parameters or fewer: they lack the pretraining strength needed to explore effectively and often settle into suboptimal reasoning patterns. This work introduces a novel intrinsic motivation approach that leverages episodic memory to address this challenge, improving tiny LLMs on CoT reasoning tasks. Inspired by human memory-driven learning, our method exploits successful reasoning patterns stored in memory while allowing controlled exploration to generate novel responses. Intrinsic rewards are computed efficiently with a kNN-based episodic memory, allowing the model to discover new reasoning strategies while quickly adapting to effective past solutions. Experiments on three reasoning datasets demonstrate that our approach significantly enhances smaller LLMs' reasoning performance and generalization, making RL-based reasoning improvements more accessible in low-resource settings.
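To make the kNN-based episodic memory concrete, here is a minimal sketch of how such an intrinsic reward could be computed. All specifics below are assumptions for illustration, not the paper's formulation: the class name EpisodicMemory, the use of L2 distance over trace embeddings, the mean-of-kNN novelty score with exponential squashing, and the FIFO eviction policy.

```python
import numpy as np

class EpisodicMemory:
    """Illustrative kNN-based episodic memory for intrinsic rewards.

    Stores embeddings of past successful reasoning traces; the novelty
    of a new trace is derived from its mean distance to the k nearest
    stored embeddings. Design choices here (L2 distance, mean-of-kNN
    score, fixed-capacity FIFO eviction) are assumptions, not the
    paper's exact method.
    """

    def __init__(self, k: int = 10, capacity: int = 10_000):
        self.k = k
        self.capacity = capacity
        self.embeddings: list[np.ndarray] = []

    def add(self, embedding: np.ndarray) -> None:
        # Evict the oldest entry once capacity is reached (FIFO).
        if len(self.embeddings) >= self.capacity:
            self.embeddings.pop(0)
        self.embeddings.append(embedding)

    def intrinsic_reward(self, embedding: np.ndarray) -> float:
        # With an empty memory, every trace is maximally novel.
        if not self.embeddings:
            return 1.0
        dists = np.linalg.norm(np.stack(self.embeddings) - embedding, axis=1)
        k = min(self.k, len(dists))
        knn = np.partition(dists, k - 1)[:k]
        # Squash the mean kNN distance into (0, 1): farther -> more novel.
        return float(1.0 - np.exp(-knn.mean()))

# Usage: store embeddings of successful CoT traces, then score new ones.
memory = EpisodicMemory(k=5)
rng = np.random.default_rng(0)
memory.add(rng.normal(size=128))
novelty = memory.intrinsic_reward(rng.normal(size=128))
```

In such a scheme, the novelty score would typically be added to the task reward during RL fine-tuning, trading off reuse of stored successful patterns against exploration of new ones; how the paper balances these two terms is not specified in the abstract.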
Submission Length: Regular submission (no more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=uz7Oq6W4iK
Changes Since Last Submission: Fixed a template issue from the last submission by adding the missing header.
Assigned Action Editor: ~Kamil_Ciosek1
Submission Number: 4826