\section{Metric Ablation Analysis}
\label{app:metric_ablation}

To validate the coefficient analysis from Section~\ref{sec:coefficient_analysis}, we perform ablation experiments in which one entire category of metrics at a time is excluded from the linear optimization.
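
As a notational sketch (the symbols below are introduced here for illustration and are not taken from Section~\ref{sec:coefficient_analysis}), let $\mathcal{M}$ denote the full metric set, $\mathcal{C}_k \subset \mathcal{M}$ the $k$-th category, and $m_i(a,b)$ the value of metric $i$ for a model pair $(a,b)$. Assuming the predictor is a linear combination of metrics, the ablated variant can be written as
\[
\hat{s}^{(-k)}(a,b) \;=\; \sum_{i \in \mathcal{M} \setminus \mathcal{C}_k} w_i^{(-k)} \, m_i(a,b),
\]
where the superscript $(-k)$ marks coefficients that are optimized again after the exclusion rather than copied from the full model.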

\subsection{Metric Categories}

We organize our 28 metrics into five categories:

\paragraph{Subspace Metrics (7 metrics):}
\texttt{right\_subspace\_overlap}, \texttt{right\_subspace\_overlap\_top\_k}, \texttt{right\_subspace\_overlap\_bottom\_k}, \texttt{subspace\_overlap}, \texttt{singular\_value\_overlap}, \texttt{interaction\_matrix\_overlap\_top\_k}, \texttt{interaction\_matrix\_overlap\_bottom\_k}

\paragraph{Gradient-Based Metrics (6 metrics):}
\texttt{encoder\_gradient\_\{cosine\_similarity, l2\_distance, dot\_product\}}, \texttt{input\_gradient\_\{cosine\_similarity, l2\_distance, dot\_product\}}

\paragraph{Effective Rank Metrics (7 metrics):}
\texttt{effective\_rank}, \texttt{effective\_rank\_mergeability\_score}, \texttt{layerwise\_effective\_rank}, \texttt{layerwise\_effective\_rank\_mergeability\_score}, \texttt{stable\_rank}, \texttt{spectral\_gap}, \texttt{singular\_value\_ratio}

\paragraph{Task Vector Metrics (5 metrics):}
\texttt{task\_vector\_cosine\_similarity}, \texttt{task\_vector\_l2\_distance}, \texttt{task\_vector\_dot\_product}, \texttt{task\_vector\_magnitude\_ratio}, \texttt{weight\_space\_angle}

\paragraph{Activation Metrics (4 metrics):}
\texttt{activation\_l2\_distance}, \texttt{activation\_cosine\_similarity}, \texttt{activation\_magnitude\_ratio}, \texttt{activation\_dot\_product}

\subsection{Results}

\begin{table}[htbp]
\centering
\caption{Full metric category ablation results. Val $r$ reported as mean across 20 LOTO folds. $\Delta$ indicates change from baseline (28 metrics).}
\label{tab:metric_ablation_full}
\small
\begin{tabular}{lcccccc}
\toprule
\textbf{Method} & \textbf{Baseline} & \textbf{No Subspace} & \textbf{No Gradient} & \textbf{No EffRank} & \textbf{No TaskVec} & \textbf{No Activ} \\
& (28) & ($-$7) & ($-$6) & ($-$7) & ($-$5) & ($-$4) \\
\midrule
Weight Avg & 0.555 & 0.424 \scriptsize{($-$0.13)} & 0.399 \scriptsize{($-$0.16)} & 0.488 \scriptsize{($-$0.07)} & 0.503 \scriptsize{($-$0.05)} & 0.514 \scriptsize{($-$0.04)} \\
Arithmetic & 0.343 & 0.348 \scriptsize{($+$0.01)} & 0.285 \scriptsize{($-$0.06)} & 0.363 \scriptsize{($+$0.02)} & 0.394 \scriptsize{($+$0.05)} & 0.419 \scriptsize{($+$0.08)} \\
TSV & 0.572 & 0.368 \scriptsize{($-$0.20)} & 0.514 \scriptsize{($-$0.06)} & 0.603 \scriptsize{($+$0.03)} & 0.592 \scriptsize{($+$0.02)} & 0.555 \scriptsize{($-$0.02)} \\
Isotropic & 0.328 & 0.165 \scriptsize{($-$0.16)} & 0.344 \scriptsize{($+$0.02)} & 0.308 \scriptsize{($-$0.02)} & 0.314 \scriptsize{($-$0.01)} & 0.376 \scriptsize{($+$0.05)} \\
\midrule
\textbf{Avg $\Delta$} & --- & \textbf{$-$0.123} & $-$0.064 & $-$0.009 & $+$0.001 & $+$0.017 \\
\bottomrule
\end{tabular}
\end{table}
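
To make the bottom row of Table~\ref{tab:metric_ablation_full} explicit, write $r_j^{\mathrm{base}}$ for the baseline Val $r$ of merging method $j$ and $r_j^{(-k)}$ for its value with category $k$ removed (notation introduced here for illustration). The reported averages are consistent with an unweighted mean over the four methods:
\[
\overline{\Delta r}^{(-k)} \;=\; \frac{1}{4} \sum_{j=1}^{4} \left( r_j^{(-k)} - r_j^{\mathrm{base}} \right).
\]
For the subspace ablation, using the three-decimal differences computed from the table entries,
\[
\overline{\Delta r}^{(-\mathrm{sub})} \;=\; \frac{(-0.131) + 0.005 + (-0.204) + (-0.163)}{4} \;=\; \frac{-0.493}{4} \;\approx\; -0.123 .
\]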

\subsection{Analysis}

\paragraph{Subspace and Gradient Metrics Confirm Coefficient Analysis.}
The ablation results align with the stable metrics identified in Section~\ref{sec:coefficient_analysis}. Subspace metrics, including \texttt{right\_subspace\_overlap} and \texttt{right\_subspace\_overlap\_top\_k}, which had consistently positive coefficients, prove most critical (average $\Delta r = -0.123$). TSV is particularly affected, dropping from $r=0.572$ to $r=0.368$, whereas Arithmetic is the only method that does not degrade ($\Delta r = +0.01$). Similarly, gradient metrics, where \texttt{encoder\_gradient\_l2\_distance} and \texttt{input\_gradient\_l2\_distance} had the largest negative coefficients, rank second in importance (average $\Delta r = -0.064$). These results support the view that the learned coefficients reflect genuine predictive importance.

\paragraph{Effective Rank Metrics Are Dispensable.}
Removing effective rank metrics causes minimal change ($\Delta r = -0.009$), consistent with their near-zero coefficients in the optimization. For TSV and Arithmetic, performance actually improves slightly, suggesting these metrics may add noise.

\paragraph{Task Vector and Activation Metrics Are Redundant.}
A notable finding is that removing task vector metrics (average $\Delta r = +0.001$) or activation metrics (average $\Delta r = +0.017$) causes no degradation on average. Arithmetic improves in both cases, while Weight Avg drops by 0.04 to 0.05. Despite \texttt{activation\_cosine\_similarity} having a moderate positive coefficient, its information appears redundant with subspace metrics. Task vector metrics, though intuitive measures of model similarity, provide no additional predictive power on average.

\subsection{Importance Ranking}

\begin{table}[htbp]
\centering
\caption{Metric categories ranked by importance (most negative $\Delta$ = most important).}
\label{tab:metric_importance_ranking}
\begin{tabular}{clccc}
\toprule
\textbf{Rank} & \textbf{Category} & \textbf{Metrics} & \textbf{Avg $\Delta r$} & \textbf{Impact} \\
\midrule
1 & Subspace & 7 & $-$0.123 & Critical \\
2 & Gradient-based & 6 & $-$0.064 & High \\
3 & Effective rank & 7 & $-$0.009 & Low \\
4 & Task vector & 5 & $+$0.001 & Negligible \\
5 & Activation & 4 & $+$0.017 & Redundant \\
\bottomrule
\end{tabular}
\end{table}

\subsection{Implications}

\paragraph{Minimal Metric Set.}
A reduced set containing only subspace and gradient-based metrics (13 total) would likely achieve performance comparable to the full 28-metric set. Dropping the other three categories reduces the number of metrics that must be computed. The retained metrics are not the cheapest ones, since subspace metrics require a singular value decomposition and gradient metrics require forward-backward passes on calibration data, but both are more informative than the simpler alternatives.

\paragraph{Understanding Mergeability.}
The importance of subspace metrics suggests that mergeability is fundamentally about \emph{geometric alignment} of learned transformations. Models merge well when their task vectors modify similar subspaces of the weight space. Gradient metrics capture complementary information about functional similarity on data.

\paragraph{Why Task Vector Metrics Are Redundant.}
Task vector metrics measure raw distances and angles in weight space, while subspace metrics analyze the \emph{structured} geometric relationships (which directions are modified, not just how much). The ablation suggests that this structural information subsumes the simpler distance metrics: if two task vectors have high subspace overlap, their raw distance becomes less informative for predicting merge success.
