Abstract: Inference optimizations such as quantization, pruning, format and datatype conversion, model export, and serialization can degrade language model task performance. While most efforts on performance recovery for deployment focus on robust quantization techniques, we focus on recovering model accuracy from any source that degrades model weights, such as improper model serialization. In this work, we propose Recover-LoRA, a lightweight and dataset-agnostic method to recover accuracy in degraded models. Recover-LoRA uses synthetic data and logit distillation to learn LoRA adapters on selected layers that align the degraded model with its full-precision counterpart. We investigate the utility of Recover-LoRA across a diverse set of small language models (SLMs), including models with different attention architectures, namely multi-head attention (MHA) and group-query attention (GQA), as well as several evaluation datasets. Our results show that Recover-LoRA recovers model accuracies by 5-17% on MHA and GQA SLMs.
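
To make the objective concrete, the following is a minimal sketch of the two ingredients named above: a LoRA-adapted linear layer and a logit distillation loss that compares the degraded model's output distribution with that of the full-precision model. It is an illustration under stated assumptions, not the paper's implementation. The toy weights, the function names, the choice of KL divergence as the distillation loss, and the hand-picked adapter are all hypothetical; in Recover-LoRA the adapters would be learned by gradient descent on synthetic inputs rather than set by hand.

```python
import math

def softmax(logits):
    # Numerically stable softmax over one token's logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(teacher_logits, student_logits):
    # KL(teacher || student): one common choice of logit distillation loss
    # (an assumption here; the abstract does not name the exact loss).
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def lora_forward(W, A, B, x, alpha=1.0):
    # y = W x + (alpha / r) * B (A x). W stays frozen; only A and B are trainable.
    r = len(A)
    base = matvec(W, x)
    low_rank = matvec(B, matvec(A, x))
    return [b + (alpha / r) * l for b, l in zip(base, low_rank)]

# Toy full-precision weights (3 output logits, 2 hidden dimensions).
W_full = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Toy degraded weights: W_full minus a rank-1 perturbation (hypothetical).
W_degraded = [[0.8, -0.1], [0.1, 1.05], [0.7, 0.85]]

x = [1.0, 2.0]                      # stands in for one synthetic input
teacher_logits = matvec(W_full, x)  # full-precision model output

# Standard LoRA initialization: B = 0, so the adapter changes nothing at first.
A = [[1.0, 0.5]]
B_init = [[0.0], [0.0], [0.0]]
loss_before = kl_divergence(teacher_logits, lora_forward(W_degraded, A, B_init, x))

# A hand-picked adapter that exactly cancels the rank-1 perturbation.
# Training would have to discover such values by minimizing the loss.
B_fit = [[0.2], [-0.1], [0.3]]
loss_after = kl_divergence(teacher_logits, lora_forward(W_degraded, A, B_fit, x))
```

In this toy setting `loss_before` is strictly positive and `loss_after` is zero up to floating-point error, because the perturbation was constructed to be rank 1. Real weight degradation is generally not exactly low rank, so a learned adapter would be expected to reduce the loss rather than eliminate it.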