Keywords: Memorization; Scientific Benchmarks; Medical Datasets
TL;DR: Reasoning models outperform base models on pure recall (lookup) tasks; we hypothesize RL helps them traverse hierarchical structure (e.g., medical code taxonomies), so the RL model did not memorize more knowledge, it just searches its existing knowledge better.
Abstract: Reinforcement learning (RL) is often credited with improving language-model reasoning and generalization, possibly at the expense of degrading memorized knowledge. We observe, however, that on tasks designed to test pure knowledge recall (e.g., "Which disease corresponds to ICD-9 code 57.95?"), RL-enhanced reasoning models (DeepSeek-R1, QwQ, Magistral) still consistently surpass their non-reasoning counterparts (DeepSeek-V3, Qwen-Instruct, Mistral-Small) by a large margin of 21 percentage points (pp). Our analysis indicates that these gains stem not from the acquisition of new knowledge during RL, but from improved access to knowledge already encoded during pretraining: RL appears to teach models to efficiently traverse hierarchical structure in the data to recall relevant information at inference time. To test this hypothesis, we demonstrate that structured prompting which explicitly instructs a similar step-by-step hierarchy traversal recovers most of the RL gains, reducing the 21pp gap to 6.1pp on MedConceptsQA, without any RL training. Taken together, these results suggest that many benefits attributed to "reasoning training" may, in fact, arise from enhanced knowledge navigation rather than improved logical capability.
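The structured-prompting intervention described in the abstract can be pictured with a minimal sketch. The prompt wording below, the `build_prompts` helper, and the chosen traversal steps are illustrative assumptions for exposition, not the paper's released prompts; the only grounded elements are the direct-recall question and the idea of instructing an explicit step-by-step walk down the code hierarchy.

```python
# Minimal sketch of the "hierarchy traversal" prompting idea from the abstract.
# Assumption: the exact templates and step decomposition are hypothetical;
# only the contrast between direct recall and guided traversal is from the paper.

DIRECT_TEMPLATE = "Which disease corresponds to ICD-9 code {code}? Answer concisely."

TRAVERSAL_TEMPLATE = """You are answering a medical coding question.
Recall the answer step by step, following the ICD-9 hierarchy:
1. Identify the broad chapter that code {code} belongs to.
2. Narrow down to the three-digit category within that chapter.
3. Use the decimal subdivision to pin down the exact concept.
4. State the final answer on its own line, prefixed with "Answer:".

Question: Which disease corresponds to ICD-9 code {code}?"""


def build_prompts(code: str) -> dict[str, str]:
    """Return the direct-recall and hierarchy-traversal prompts for one code."""
    return {
        "direct": DIRECT_TEMPLATE.format(code=code),
        "traversal": TRAVERSAL_TEMPLATE.format(code=code),
    }


if __name__ == "__main__":
    # Print both prompt variants for the abstract's example code.
    for name, prompt in build_prompts("57.95").items():
        print(f"--- {name} ---\n{prompt}\n")
```

The design point is that the traversal prompt only restructures the query; it adds no new facts, so any accuracy gain over the direct prompt must come from knowledge the model already encodes.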
Supplementary Material: zip
Primary Area: interpretability and explainable AI
Submission Number: 22175