GRIT: Geometry-Aware PEFT with K-FAC Preconditioning, Fisher-Guided Reprojection, and Dynamic Rank Adaptation

ICLR 2026 Conference Submission 18146 Authors

19 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: fine-tuning, LoRA
TL;DR: GRIT is geometry‑aware PEFT (K‑FAC + Fisher reprojection + dynamic rank) that matches/surpasses LoRA/QLoRA with far fewer parameters and less forgetting.
Abstract: Parameter-efficient fine-tuning (PEFT) is now the standard approach for adapting LLMs to specific domains and use cases, yet prominent methods such as LoRA and QLoRA are largely geometry-agnostic: they optimize within fixed, randomly oriented low-rank subspaces using plain first-order descent, ignoring local loss curvature. This inflates the parameter-update budget and increases drift along weakly constrained directions. We introduce GRIT, which turns standard LoRA updates into a dynamic, curvature-aware procedure. Concretely, GRIT retains the LoRA parameterization but: (1) preconditions gradients in the adapter's rank space using K-FAC (Kronecker-Factored Approximate Curvature) as a natural-gradient proxy; (2) periodically reprojects the low-rank basis onto dominant Fisher eigendirections to suppress drift; and (3) adapts the effective rank by reading the spectrum so capacity concentrates where the signal is. The overall effect is to steer updates into high-signal, low-interference directions while using fewer effective parameters. Across instruction-following, comprehension, and reasoning benchmarks on LLaMA backbones, GRIT matches or surpasses LoRA/QLoRA while cutting trainable parameters by $\sim\!46\%$ on average (25--80% across tasks) without degrading quality. Fine-tuning large language models typically induces catastrophic forgetting: drift from the pretraining distribution that erodes general knowledge.
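To make step (1) concrete, the following is a minimal NumPy sketch of K-FAC-style natural-gradient preconditioning applied to LoRA factor gradients. All names, shapes, and the batch-estimation details are illustrative assumptions, not GRIT's actual implementation: K-FAC approximates the Fisher of a linear layer as a Kronecker product of an input second-moment factor and an output-gradient second-moment factor, so the preconditioned full-weight gradient is $S^{-1} G C^{-1}$, which is then pushed through the LoRA chain rule.

```python
import numpy as np

# Hypothetical sketch (not GRIT's code): K-FAC-style preconditioning of
# LoRA adapter gradients. Shapes and estimation scheme are assumptions.
rng = np.random.default_rng(0)
d_in, d_out, r = 64, 64, 8

# LoRA factors: delta_W = B @ A, with A (r x d_in), B (d_out x r).
A = rng.normal(0.0, 0.02, size=(r, d_in))
B = np.zeros((d_out, r))

# Kronecker factors estimated from one batch: per K-FAC, Fisher ~ S (x) C,
# where C is the input second moment and S the output-gradient second moment.
X = rng.normal(size=(32, d_in))        # layer inputs
G_out = rng.normal(size=(32, d_out))   # backpropagated output gradients
C = X.T @ X / 32                       # (d_in x d_in)
S = G_out.T @ G_out / 32               # (d_out x d_out)

lam = 1e-3                             # Tikhonov damping for invertibility
C_inv = np.linalg.inv(C + lam * np.eye(d_in))
S_inv = np.linalg.inv(S + lam * np.eye(d_out))

# Raw full-weight gradient G = dL/dW, then LoRA chain rule through B @ A.
G = G_out.T @ X / 32                   # (d_out x d_in)
grad_B = G @ A.T                       # plain first-order LoRA gradients
grad_A = B.T @ G

# Preconditioned ("natural") gradient G_nat = S^{-1} G C^{-1}, projected
# into each factor's space -- the rank-space preconditioning idea.
nat_G = S_inv @ G @ C_inv
precond_grad_B = nat_G @ A.T
precond_grad_A = B.T @ nat_G
```

The damping term `lam` stands in for the usual K-FAC regularization; a real implementation would also maintain running averages of `C` and `S` across steps rather than a single-batch estimate.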
We model GRIT’s forgetting with a curvature-modulated power law: $$L_{pt}^{\mathrm{GRIT}} = L_{pt}^{0} + A\,\frac{D_{ft}^{\beta}}{(\Xi_{\mathrm{GRIT}}N)^{\alpha}} + E,$$ where, in compact form, $$\Xi_{\mathrm{GRIT}} = (1+\gamma_{r}r_{\mathrm{eff}})(1+\gamma_{a}\rho_{\mathrm{align}})(1+\gamma_{p}\pi_{\mathrm{proj}}),$$ capturing the roles of effective rank, alignment to Fisher eigendirections, and projection fidelity, respectively, and yielding consistently lower drift than LoRA. GRIT further matches or surpasses Orthogonal-LoRA, $IA^{3}$, DoRA/Eff-FT, and Shampoo on the frontier of parameter updates versus performance retention. Code repository: https://anonymous.4open.science/r/iclr2026-submission-18146
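The forgetting law above can be evaluated numerically. The sketch below is a hedged illustration only: the coefficient values ($\gamma$'s, $\alpha$, $\beta$, $A$, $E$) are made-up placeholders, since the abstract does not report fitted values. It shows the qualitative behavior the law encodes, namely that a larger effective-capacity multiplier $\Xi_{\mathrm{GRIT}}$ (better geometry) shrinks the drift term.

```python
# Illustrative evaluation of the curvature-modulated forgetting power law.
# All coefficient values are assumptions for demonstration, not fitted.

def xi_grit(r_eff, rho_align, pi_proj,
            gamma_r=0.1, gamma_a=0.5, gamma_p=0.5):
    """Compact multiplier Xi = (1+g_r*r_eff)(1+g_a*rho)(1+g_p*pi)."""
    return (1 + gamma_r * r_eff) * (1 + gamma_a * rho_align) * (1 + gamma_p * pi_proj)

def pretrain_loss_drift(D_ft, N, xi,
                        L0=2.0, A=1.0, alpha=0.3, beta=0.2, E=0.0):
    """L_pt = L0 + A * D_ft^beta / (xi * N)^alpha + E."""
    return L0 + A * D_ft**beta / (xi * N)**alpha + E

# LoRA-like baseline: no Fisher alignment or projection fidelity credit.
xi_baseline = xi_grit(r_eff=8, rho_align=0.0, pi_proj=0.0)
# GRIT-like setting: high alignment and projection fidelity.
xi_geo = xi_grit(r_eff=8, rho_align=0.9, pi_proj=0.9)

loss_baseline = pretrain_loss_drift(D_ft=1e6, N=1e9, xi=xi_baseline)
loss_geo = pretrain_loss_drift(D_ft=1e6, N=1e9, xi=xi_geo)
# Larger Xi -> larger effective denominator -> lower pretraining-loss drift.
```

Under these placeholder coefficients, `loss_geo < loss_baseline`, matching the claim that GRIT's geometry terms reduce drift relative to LoRA.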
Primary Area: optimization
Submission Number: 18146