Scalable LLM Math Reasoning Acceleration with Low-rank Distillation

Published: 10 Jun 2025 · Last Modified: 24 Jun 2025 · LCFM 2025 · CC BY 4.0
Keywords: large language model, efficiency, distillation, reasoning, scaling, low-rank
TL;DR: We propose an efficient inference algorithm that preserves language/math reasoning capabilities and is compatible with many existing efficient algorithms.
Abstract: Due to long generations, LLM math reasoning demands significant computational resources and time. While many existing efficient inference methods preserve performance well on language tasks, they often severely degrade math performance. We propose Caprese, a resource-efficient distillation method to recover capabilities lost when deploying efficient inference methods, focused primarily on feedforward blocks. With the original weights unperturbed, ~1% additional parameters, and only 20K synthetic training samples, we can recover much, if not all, of the math capability lost to efficient inference for thinking LLMs, without harming language-task performance for instruct LLMs. Moreover, Caprese slashes the number of active parameters (~2B cut for Gemma 2 9B and Llama 3.1 8B) and integrates cleanly into existing model layers to reduce latency (>11% reduction to generate 2048 tokens with Qwen 2.5 14B) while encouraging response brevity.
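The abstract does not spell out Caprese's architecture, but one way to picture "original weights unperturbed, ~1% additional parameters, focused on feedforward blocks" is a trainable low-rank branch added alongside a frozen (efficient) feedforward block. The PyTorch sketch below is an illustrative assumption, not the paper's implementation; the module names (`LowRankAdapter`, `CorrectedFeedForward`) and the rank value are hypothetical.

```python
import torch
import torch.nn as nn


class LowRankAdapter(nn.Module):
    """Hypothetical low-rank correction branch (~2 * d_model * rank extra params)."""

    def __init__(self, d_model: int, rank: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, rank, bias=False)
        self.up = nn.Linear(rank, d_model, bias=False)
        nn.init.zeros_(self.up.weight)  # start as a zero correction so behavior is unchanged at init

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.down(x))


class CorrectedFeedForward(nn.Module):
    """Wraps an efficient (e.g., pruned or sparsified) feedforward block with a
    trainable low-rank branch; the original/efficient weights stay frozen."""

    def __init__(self, efficient_ffn: nn.Module, d_model: int, rank: int = 64):
        super().__init__()
        self.ffn = efficient_ffn
        for p in self.ffn.parameters():
            p.requires_grad_(False)  # base weights remain unperturbed
        self.adapter = LowRankAdapter(d_model, rank)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Only the low-rank branch receives gradients during distillation.
        return self.ffn(x) + self.adapter(x)
```

Under this reading, the adapter would be trained by distillation, e.g. matching the original (unaccelerated) model's outputs on a small synthetic set, which is consistent with the "20K synthetic training samples" figure, though the exact loss and data pipeline are not described in the abstract.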
Submission Number: 10