UNMERGE: Verifiable Model Capability Attribution via Sparse Coding

14 Sept 2025 (modified: 08 Oct 2025) · Submitted to Agents4Science · CC BY 4.0

Keywords: LLM, model merging, sparse coding, LoRA

TL;DR: We reverse model merging by decomposing a merged model into a sparse combination of known micro-task vectors from a pre-built dictionary.

Abstract: Model merging has emerged as a powerful technique for combining specialized capabilities from multiple fine-tuned models. However, the inverse problem of decomposing a merged model back into its constituent capabilities remains largely unexplored, which limits our ability to verify and understand model compositions. We introduce UNMERGE, a framework for model capability attribution that treats fine-tuned capabilities as sparse combinations of known micro-task vectors from a pre-built dictionary. In experiments across 15 tasks, we create 72 merged models using 4 different merging methods and evaluate 6 decomposition algorithms. Of these, Non-negative Least Squares (NNLS) and Orthogonal Matching Pursuit (OMP) achieve perfect precision and recall for models composed entirely of known tasks. We focus on parameter-space reconstruction as a necessary first step, discuss the relationship between parameter fidelity and functional performance, and identify behavioral validation as crucial future work. Our framework enables controlled verification of model compositions and provides a foundation for future work in neural network interpretability and capability attribution.
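To make the decomposition idea concrete, the following is a minimal sketch of Orthogonal Matching Pursuit recovering which dictionary entries make up a merged vector. It is not the authors' implementation: the random dictionary, the vector dimension of 200, the mixing weights, and all function names are illustrative assumptions, and the merge is modeled as a plain weighted sum. In a realistic setting each dictionary entry would presumably be a flattened fine-tuning delta (for example, from LoRA weights) rather than random noise.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def solve_least_squares(atoms, target):
    """Least-squares coefficients for target ~ sum_j c_j * atoms[j] (normal equations)."""
    k = len(atoms)
    # Augmented Gram system [G | b] with G[i][j] = <atom_i, atom_j>, b[i] = <atom_i, target>.
    aug = [[dot(atoms[i], atoms[j]) for j in range(k)] + [dot(atoms[i], target)]
           for i in range(k)]
    # Gaussian elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, k):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, k + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution.
    coeffs = [0.0] * k
    for i in reversed(range(k)):
        rest = sum(aug[i][j] * coeffs[j] for j in range(i + 1, k))
        coeffs[i] = (aug[i][k] - rest) / aug[i][i]
    return coeffs

def omp(dictionary, target, max_atoms, tol=1e-8):
    """Orthogonal Matching Pursuit. Returns {atom_index: coefficient}."""
    selected, coeffs = [], []
    residual = list(target)
    for _ in range(max_atoms):
        if dot(residual, residual) <= tol:
            break  # target is already explained by the selected atoms
        # Score each unused atom by its normalized correlation with the residual.
        scores = [abs(dot(atom, residual)) / dot(atom, atom) ** 0.5
                  for atom in dictionary]
        for i in selected:
            scores[i] = -1.0
        best = max(range(len(dictionary)), key=lambda i: scores[i])
        selected.append(best)
        # Refit all selected coefficients jointly, then update the residual.
        coeffs = solve_least_squares([dictionary[i] for i in selected], target)
        residual = [
            t - sum(c * dictionary[i][d] for c, i in zip(coeffs, selected))
            for d, t in enumerate(target)
        ]
    return dict(zip(selected, coeffs))

# Hypothetical setup: 15 task vectors of dimension 200 (dimension chosen arbitrarily).
random.seed(0)
dim, num_tasks = 200, 15
dictionary = [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_tasks)]

# Assumed "merged model": a weighted sum of three known task vectors.
true_mix = {2: 0.5, 7: 0.3, 11: 0.2}
merged = [sum(w * dictionary[i][d] for i, w in true_mix.items()) for d in range(dim)]

recovered = omp(dictionary, merged, max_atoms=5)
found, truth = set(recovered), set(true_mix)
precision = len(found & truth) / len(found)
recall = len(found & truth) / len(truth)
print(sorted(recovered.items()), precision, recall)
```

Precision and recall here are computed over the set of selected task indices, which is one plausible reading of the metrics named in the abstract. The sketch covers only the easy case where the merged vector is an exact linear combination of dictionary entries; merging methods that prune or rescale parameters would break that assumption.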

Supplementary Material: zip

Submission Number: 176
