LoRA as an Implicit KL Regularizer in GRPO Fine-Tuning: From Theory to Practice

TMLR Paper7872 Authors

10 Mar 2026 (modified: 30 Apr 2026), Withdrawn by Authors, CC BY 4.0
Abstract: Low-Rank Adaptation (LoRA) is widely used for parameter-efficient reinforcement learning fine-tuning of large language models (LLMs), often together with an explicit KL penalty toward a reference policy. In this work, we analyze how the low-rank constraint itself can restrict parameter trajectories during gradient descent and thereby limit the resulting policy shift. We study the learning dynamics of LoRA updates and derive an explicit, rank-dependent upper bound on the KL divergence between the reference and updated policies. Empirically, in Group Relative Policy Optimization (GRPO) fine-tuning of several LLM families on reasoning benchmarks, we observe that removing the explicit KL penalty yields similar evaluation accuracy while reducing training time, because reference-policy evaluations are no longer needed. Our results provide theoretical grounding for KL-free fine-tuning with LoRA, which maintains reasoning performance while allowing training speedups in practice.
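To make the role of the explicit KL penalty concrete, the following is a minimal sketch of a GRPO-style objective for one group of sampled completions. It is not the authors' implementation. It assumes sequence-level log-probabilities (practical implementations typically work per token), a clipped importance-ratio surrogate, and the non-negative estimator exp(d) - d - 1 with d = log pi_ref - log pi for the KL term. The names `group_relative_advantages`, `kl_estimate`, `grpo_loss`, `beta`, and `clip` are hypothetical and chosen for illustration only.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of completions for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def kl_estimate(logp, ref_logp):
    """Per-sample non-negative KL estimator: exp(d) - d - 1, d = ref_logp - logp."""
    d = ref_logp - logp
    return math.exp(d) - d - 1.0


def grpo_loss(logps, old_logps, rewards, ref_logps=None, beta=0.0, clip=0.2):
    """Illustrative GRPO-style loss over one group (assumed simplification).

    logps     : log-probs of each completion under the current policy
    old_logps : log-probs under the policy that generated the samples
    ref_logps : log-probs under the frozen reference policy; only read
                when beta > 0, so a KL-free run never has to compute them
    """
    advantages = group_relative_advantages(rewards)
    total = 0.0
    for i, (lp, olp, adv) in enumerate(zip(logps, old_logps, advantages)):
        ratio = math.exp(lp - olp)
        clipped = max(min(ratio, 1.0 + clip), 1.0 - clip)
        surrogate = min(ratio * adv, clipped * adv)
        penalty = 0.0
        if beta > 0.0:
            if ref_logps is None:
                raise ValueError("beta > 0 requires reference log-probs")
            penalty = beta * kl_estimate(lp, ref_logps[i])
        # Maximize the surrogate, minimize the KL penalty.
        total += -(surrogate - penalty)
    return total / len(logps)
```

Calling `grpo_loss(logps, old_logps, rewards)` with the default `beta=0.0` corresponds to the KL-free setting discussed in the abstract: `ref_logps` is never accessed, so the extra forward pass through the reference policy can be skipped. Passing `beta > 0` together with `ref_logps` restores the explicit penalty.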
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Vimal_Thilak2
Submission Number: 7872