Keywords: learning from experience; reinforcement learning; Group Relative Preference Optimization
Abstract: LLMs excel across many tasks but typically lack the ability to accumulate and reuse prior experience. As a result, they often reason from scratch, retracing known solution paths and repeating past mistakes. Existing work commonly relies on Retrieval-Augmented Generation (RAG) to retrieve experiential memory summarized by LLMs. However, this paradigm suffers from high latency and computational cost, and it selects memories by relevance rather than utility, resulting in suboptimal outcomes.
To address these issues, we propose \textbf{L-PEM} (A \textbf{L}ightweight model for \textbf{P}arametric \textbf{E}xperiential \textbf{M}emory), a novel approach that embeds experience into the parameters of a compact generative model. This architecture unifies memory generation and application in a single forward pass, effectively replacing the conventional store-and-retrieve paradigm.
We train L-PEM with Group Relative Preference Optimization (GRPO), using rollouts from a frozen executor as feedback, and evaluate it on multiple mathematical reasoning benchmarks. L-PEM delivers significant performance gains while maintaining low latency and computational cost. Extensive ablations and analyses further elucidate the mechanisms underlying L-PEM's effectiveness.\footnote{We release our code at https://anonymous.4open.science/r/L-PEM}
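For readers unfamiliar with the GRPO-style objective mentioned above, the sketch below illustrates the group-relative advantage it relies on: each candidate in a sampled group is scored (here, assumed to be by whether the frozen executor solves the task when conditioned on that candidate's memory) and normalized against the group's mean and standard deviation. This is a minimal illustration under our own naming and reward assumptions, not the authors' released implementation.

```python
# Minimal sketch (not the paper's code): group-relative advantages as used in
# GRPO-style training. The reward signal (success of a frozen executor's rollout
# conditioned on a generated memory) is an assumption based on the abstract.
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its rollout group's mean and std."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: one query, a group of 4 generated memory candidates, each rewarded 1.0
# if the executor reached the correct answer with that memory, else 0.0.
rewards = [1.0, 0.0, 1.0, 0.0]
print(group_relative_advantages(rewards))  # positive for helpful memories, negative otherwise
```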
Primary Area: reinforcement learning
Submission Number: 24169