\subsection{Generalization to Unseen Task Combinations}
\label{sec:l2to}

While LOTO cross-validation evaluates prediction on held-out pairs involving one familiar and one novel task, we also examine the more challenging setting where \emph{both} tasks in the validation pair are absent from training. This leave-two-tasks-out (L2TO) protocol provides a stricter test of whether learned coefficients capture fundamental mergeability principles that transfer to entirely novel task combinations.

\paragraph{Protocol.}
For each of the $\binom{20}{2} = 190$ task pairs $(A, B)$, we train on the $\binom{18}{2} = 153$ pairs in which neither task $A$ nor task $B$ appears, and evaluate on the single held-out pair. This removes all information specific to $A$ and $B$ from training, so the protocol tests whether the learned metric combinations generalize beyond the tasks used to fit them.
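
The fold construction can be sketched as follows. This is a minimal illustration rather than our exact implementation: it assumes 20 tasks (inferred from $\binom{20}{2} = 190$), and the task names and the helper \texttt{l2to\_folds} are hypothetical.

```python
from itertools import combinations


def l2to_folds(tasks):
    """Yield (held_out_pair, training_pairs) for leave-two-tasks-out CV.

    For each held-out pair (a, b), the training set keeps only the pairs
    that contain neither a nor b.
    """
    all_pairs = list(combinations(tasks, 2))
    for a, b in all_pairs:
        train = [p for p in all_pairs if a not in p and b not in p]
        yield (a, b), train


# Hypothetical task names; 20 tasks is an assumption implied by the 190 pairs.
tasks = [f"task_{i}" for i in range(20)]
folds = list(l2to_folds(tasks))

# One fold per pair, each trained on the pairs among the remaining 18 tasks.
print(len(folds), len(folds[0][1]))  # 190 153
```

Each fold holds out exactly one pair, so a per-fold correlation is undefined; validation performance has to be computed over predictions pooled across folds.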

\paragraph{Results.}
Table~\ref{tab:l2to_summary} summarizes the L2TO results. Aggregate validation correlations range from $r = -0.14$ to $r = 0.39$, substantially lower than LOTO performance. Weight Averaging shows the strongest generalization ($r = 0.39$), followed by TSV ($r = 0.25$), while Task Arithmetic and Isotropic Merging show weakly negative correlations ($r = -0.09$ and $r = -0.14$, respectively). The gap between training correlations ($r \approx 0.54$--$0.75$) and validation correlations indicates that some learned coefficients overfit to task-specific patterns.
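
Because every fold contributes a single held-out prediction, one natural reading of the aggregate validation $r$ is a Pearson correlation over the pooled predictions. The sketch below illustrates that reading under stated assumptions: the pooling scheme, the function names, and the toy values are illustrative and are not taken from our experiments.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def aggregate_validation_r(fold_results):
    """Pool one (predicted, observed) value per held-out pair, then correlate.

    Assumption: fold_results has one entry per fold, i.e. per held-out pair.
    """
    preds, obs = zip(*fold_results)
    return pearson_r(preds, obs)


# Toy values for illustration only (not experimental data).
toy = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
r = aggregate_validation_r(toy)
```

Under this reading, a negative aggregate $r$ means that the pooled predictions rank held-out pairs worse than a constant predictor would.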

\begin{table}[t]
\centering
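\caption{Leave-two-tasks-out cross-validation results. Aggregate validation $r$ measures prediction accuracy across all 190 held-out pairs where both tasks are unseen during training. Train $r$ and its standard deviation are taken over folds; Nonzero is the mean number of nonzero learned coefficients.}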
\label{tab:l2to_summary}
\begin{tabular}{lcccc}
\toprule
\textbf{Method} & \textbf{Train $r$} & \textbf{Train $r$ std} & \textbf{Val $r$} & \textbf{Nonzero} \\
\midrule
Weight Averaging & 0.72 & 0.05 & \textbf{0.39} & 18.2 \\
TSV & 0.75 & 0.06 & 0.25 & 18.0 \\
Task Arithmetic & 0.59 & 0.08 & $-$0.09 & 18.0 \\
Isotropic & 0.54 & 0.07 & $-$0.14 & 16.9 \\
\bottomrule
\end{tabular}
\end{table}

\paragraph{Interpretation.}
The L2TO results reveal an important distinction: \emph{metric importance for explaining mergeability within known tasks differs from metric importance for predicting mergeability of novel task combinations}. Weight Averaging's superior L2TO performance suggests its mergeability depends more on task-agnostic properties captured by our metrics, while other methods may rely more on task-specific factors not reflected in pairwise metrics alone.

These results do not diminish the practical utility of our framework. In realistic deployment scenarios, practitioners typically have access to results from at least some task combinations before predicting new ones (the LOTO setting). The L2TO analysis instead characterizes the \emph{limits} of purely metric-based prediction and identifies which merging methods are more amenable to such prediction.
