Abstract: Fine-tuning large language models provides strong in-domain performance but limits generalization and requires storing many specialized models. Retraining a unified multitask model is often infeasible due to data unavailability or high computational cost. Model merging offers an alternative that combines existing fine-tuned models without retraining, and the majority of model merging approaches rely on performing arithmetic operations directly on model parameters.
Although research in model merging has expanded significantly in recent years, two distinct approaches have become dominant: 1) techniques that mitigate interference from redundant parameters and sign conflicts, and 2) techniques that account for the varying sensitivity of individual parameters. However, these two approaches have developed independently, and neither draws on the other's strengths. In this work, we aim to unify these two well-established yet currently disconnected approaches by integrating insights from both.
We propose DRIFT-MEDIAN, a sensitivity-aware model merging approach that incorporates Fisher information and a coordinate-wise importance measure within a weighted median aggregation framework. Comprehensive experiments on several LLMs and CLIP models demonstrate that task-vector interference mitigation and parameter sensitivity are complementary factors in model merging. DRIFT-MEDIAN integrates both principles within a unified framework. Across the evaluated settings, this integration improves mean performance retention (PRR), although performance gains may vary across individual tasks. We make the code publicly available at https://anonymous.4open.science/r/drift-median.
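As a rough illustration of what a sensitivity-weighted, coordinate-wise median can look like, the sketch below merges a few toy task vectors. It is not the DRIFT-MEDIAN implementation: the function names, the choice of the lower weighted median, and the importance scores (stand-ins for something like per-parameter Fisher information) are all assumptions made for this example.

```python
from typing import List, Sequence


def weighted_median(values: Sequence[float], weights: Sequence[float]) -> float:
    """Lower weighted median: the smallest value whose cumulative
    weight reaches at least half of the total weight."""
    if len(values) != len(weights) or len(values) == 0:
        raise ValueError("values and weights must be non-empty and equal in length")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one weight must be positive")

    cumulative = 0.0
    pairs = sorted(zip(values, weights))  # sort by value
    for value, weight in pairs:
        cumulative += weight
        if cumulative >= total / 2:
            return value
    return pairs[-1][0]  # not reached when total > 0; kept as a safe fallback


def merge_task_vectors(
    task_vectors: Sequence[Sequence[float]],
    importances: Sequence[Sequence[float]],
) -> List[float]:
    """Hypothetical helper: for every parameter coordinate, take the
    weighted median across tasks, weighting each task's value by its
    importance score at that coordinate."""
    n_params = len(task_vectors[0])
    merged = []
    for j in range(n_params):
        column = [tv[j] for tv in task_vectors]
        column_weights = [imp[j] for imp in importances]
        merged.append(weighted_median(column, column_weights))
    return merged


# Toy example: 3 tasks, 3 parameters (all values are made up).
toy_task_vectors = [[0.2, -0.5, 0.1], [0.3, 0.4, 0.0], [-0.1, 0.6, 0.9]]
toy_importances = [[1, 1, 5], [1, 1, 1], [1, 3, 1]]
merged = merge_task_vectors(toy_task_vectors, toy_importances)
print(merged)  # [0.2, 0.6, 0.1]
```

In the second coordinate, the task with importance 3 outweighs the two others combined, so its value (0.6) is selected even though one task disagrees in sign. With equal weights, as in the first coordinate, the result reduces to the ordinary median.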
Submission Type: Long submission (more than 12 pages of main content)
Changes Since Last Submission: We would like to mention the following major changes:
* We updated all results in the paper; specifically, we reran all experiments under a controlled and consistent setup.
* We switched instruction-following evaluation from inspect evals to lm-eval. inspect evals takes a very long time to evaluate examples (as it did in the previous submission), but the lm-eval version we used previously gave inconsistent, high-variance results, so we had chosen inspect evals to ensure reproducibility. The current version of lm-eval gives consistent results, so we switched.
* We switched from the HF backend to vLLM for gsm8k/minerva_math and multilingual evaluation. (Earlier, we could not find a compatible version that worked for all the benchmarks.)
* We report the mean and standard deviation over 3 runs where applicable. However, most runs are deterministic under a consistent environment (we verified this), so we report a single score for almost all numbers in the paper.
* We removed the table for Llama2-7B, in which some rows were computed with our setup whereas others were taken from the PCB merging paper. Instead, we added PCB merging as a baseline in Table 1 and Table 2, with results from our setup.
* We moved the previous Table 1 (GPT-2 based experiments) to the Appendix, since our primary target is LLMs.
* We coarsened the hyperparameter sweep for DRIFT-MEDIAN to increments of 0.1 (previously 0.05), and we ran a more thorough hyperparameter sweep for the baseline methods.
* We added further analysis in Section 4.3 and observe that Math has the highest task vector magnitude but low agreement with the other tasks. We removed the previous Table 4, which was produced with the outdated setup, and added Tables 3, 4, 5, and 6, which contain task-wise overlap, similarity, etc., and an ablation on CLIP-based tasks. We use CLIP-based tasks here because classification tasks share the same metric, whereas the LLM tasks use different metrics, which makes the results harder to interpret. (Validation runs are also cheaper to perform.)
* We added an anonymized GitHub link for reproducibility.
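
For readers unfamiliar with the quantities mentioned in the Section 4.3 item above, the following minimal sketch shows one common way to compute a task vector's magnitude and its agreement with another task vector. The exact definitions used in the paper are not stated here, so the L2 norm, cosine similarity, and sign-agreement fraction below, along with all function names, are assumptions for illustration only.

```python
import math
from typing import Sequence


def l2_norm(v: Sequence[float]) -> float:
    """Magnitude of a task vector (fine-tuned weights minus base weights)."""
    return math.sqrt(sum(x * x for x in v))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Directional similarity between two task vectors, in [-1, 1]."""
    denom = l2_norm(a) * l2_norm(b)
    if denom == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return sum(x * y for x, y in zip(a, b)) / denom


def sign_agreement(a: Sequence[float], b: Sequence[float]) -> float:
    """Fraction of coordinates, among those where both vectors are
    nonzero, on which the two vectors have the same sign."""
    active = [(x, y) for x, y in zip(a, b) if x != 0 and y != 0]
    if not active:
        return 0.0
    same = sum(1 for x, y in active if (x > 0) == (y > 0))
    return same / len(active)


# Made-up vectors: the first has a large norm but points in a different
# direction from the second.
vec_a = [3.0, -4.0, 0.0]
vec_b = [0.1, 0.2, 0.3]
print(l2_norm(vec_a))                   # 5.0
print(cosine_similarity(vec_a, vec_b))  # negative: the directions disagree
print(sign_agreement(vec_a, vec_b))     # 0.5
```

A vector with a large norm and low agreement, like `vec_a` here, is the kind of case where plain parameter averaging can let one task dominate or conflict with the others.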
Assigned Action Editor: ~Mohammad_Emtiyaz_Khan1
Submission Number: 8004