\section{MLP Ablation Study}
\label{app:mlp_ablation}

To test whether non-linear combinations of mergeability metrics predict post-merge performance better than our linear approach, we conducted an ablation study using multi-layer perceptrons (MLPs). For each merging method, we trained separate MLPs to predict post-merge performance from the same 29 mergeability metrics used in the linear model.

\paragraph{Experimental Setup.}
Each MLP has a single hidden layer with 8 units, ReLU activation, and dropout regularization (rate 0.4), for a total of 249 trainable parameters per model. We use the same leave-one-task-out (LOTO) cross-validation protocol as for the linear model: for each of the 20 tasks, we train on pairs drawn from the remaining 19 tasks and evaluate on pairs involving the held-out task. This yields 20 separately trained models per merging method, with predictions aggregated across all folds. We train each model for 300 epochs with the Adam optimizer (learning rate 0.001), minimizing mean squared error.
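
For concreteness, the architecture and its parameter count can be written out explicitly. The following is a sketch that assumes a single scalar output and bias terms in both layers, which is the configuration consistent with the stated count; the symbols $W_1$, $b_1$, $w_2$, and $b_2$ are introduced here for illustration only. Given an input vector $x$ of the 29 metrics, a hidden weight matrix $W_1$ of size $8 \times 29$, a hidden bias $b_1$ of length 8, an output weight vector $w_2$ of length 8, and a scalar output bias $b_2$, the prediction is
\[
\hat{y} = w_2^{\top}\, \mathrm{ReLU}(W_1 x + b_1) + b_2 ,
\]
and the number of trainable parameters is
\[
\underbrace{29 \times 8}_{W_1} + \underbrace{8}_{b_1} + \underbrace{8}_{w_2} + \underbrace{1}_{b_2} = 232 + 8 + 8 + 1 = 249 .
\]
Dropout contributes no trainable parameters, so it does not affect this count.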

\paragraph{Results.}
Table~\ref{tab:mlp_ablation} compares the validation Pearson correlations achieved by the MLP models against the linear combination approach under identical LOTO cross-validation settings.
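
For reference, the Pearson correlation over a set of $n$ evaluation pairs is the standard sample statistic shown below. The notation is introduced here for illustration: $\hat{y}_i$ and $y_i$ denote the predicted and observed post-merge performance of pair $i$, and $\mu_{\hat{y}}$ and $\mu_{y}$ denote their sample means.
\[
r = \frac{\sum_{i=1}^{n} (\hat{y}_i - \mu_{\hat{y}})(y_i - \mu_{y})}{\sqrt{\sum_{i=1}^{n} (\hat{y}_i - \mu_{\hat{y}})^2}\;\sqrt{\sum_{i=1}^{n} (y_i - \mu_{y})^2}} .
\]
The $\Delta$ column of Table~\ref{tab:mlp_ablation} is the difference $\Delta = r_{\mathrm{MLP}} - r_{\mathrm{Linear}}$; for Task Arithmetic, for example, $0.084 - 0.407 = -0.323$.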

\begin{table}[htbp]
\centering
\caption{Comparison of validation Pearson correlation ($r$) between linear combination and MLP approaches for predicting post-merge performance using leave-one-task-out cross-validation. Despite greater model capacity (249 vs.\ 29 parameters), the MLP improves upon the linear baseline for only one of the four merging methods and shows high variance across folds (standard deviations between 0.178 and 0.328).}
\label{tab:mlp_ablation}
\begin{tabular}{lcccc}
\toprule
\textbf{Method} & \textbf{Linear (Val $r$)} & \textbf{MLP (Val $r$)} & \textbf{MLP (Val $r$ std across folds)} & \textbf{$\Delta$ (MLP $-$ Linear)} \\
\midrule
Task Arithmetic & 0.407 & 0.084 & 0.227 & $-0.323$ \\
Weight Averaging & 0.525 & 0.407 & 0.308 & $-0.118$ \\
Isotropic & 0.337 & 0.426 & 0.178 & $+0.089$ \\
TSV & 0.609 & 0.570 & 0.328 & $-0.039$ \\
\bottomrule
\end{tabular}
\end{table}

The MLP underperforms the simpler linear approach on three of the four merging methods and improves on it for only one:

\begin{itemize}
    \item \textbf{Task Arithmetic}: The MLP shows severe degradation, achieving only $r = 0.084$ compared to $r = 0.407$ for the linear model. This is consistent with the MLP overfitting to spurious patterns in the training data that do not generalize across tasks.
    \item \textbf{Weight Averaging}: The MLP achieves $r = 0.407$, lower than the linear model's $r = 0.525$, with high variance across folds (std = 0.308).
    \item \textbf{Isotropic}: The MLP shows a slight improvement ($r = 0.426$ vs.\ $r = 0.337$ for the linear model), though the gain of $0.089$ is smaller than the MLP's standard deviation across folds ($0.178$).
    \item \textbf{TSV}: The two approaches perform similarly (MLP $r = 0.570$ vs.\ linear $r = 0.609$), with the MLP again showing high variance across folds (std = 0.328).
\end{itemize}

\paragraph{Why Do MLPs Underperform?}
Although MLPs are theoretically more expressive than linear models, several factors plausibly explain their inferior performance in this setting:

\begin{enumerate}
    \item \textbf{Limited training data}: With only 179 task pairs in total and approximately 150 training pairs per fold, the dataset is too small to reliably learn non-linear relationships. The MLP's 249 parameters (versus 29 for the linear model) create a high parameter-to-sample ratio that promotes overfitting. A common rule of thumb calls for roughly 10--20 samples per parameter for reliable estimation; our setting provides fewer than one sample per MLP parameter.

    \item \textbf{High-dimensional input space}: With 29 input metrics, the MLP must learn meaningful interactions in a 29-dimensional space from limited samples. The curse of dimensionality makes it exponentially harder to identify true non-linear patterns versus spurious correlations as dimensionality increases.

    \item \textbf{Task distribution shift}: The LOTO evaluation requires generalization to entirely new tasks. Non-linear models can memorize task-specific patterns that appear useful on training tasks but fail catastrophically on held-out tasks. Linear models, being more constrained, are forced to learn simpler relationships that transfer better.

    \item \textbf{Noise amplification}: If the underlying relationship is approximately linear with noise, an MLP will attempt to fit the noise through non-linear transformations, leading to worse generalization. The linear model's inductive bias acts as implicit regularization.
\end{enumerate}
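
The parameter-to-sample argument in the first item can be made concrete with the approximate figures quoted above (about 150 training pairs per fold, 249 MLP parameters, and 29 linear coefficients). The symbols $n_{\mathrm{train}}$, $p_{\mathrm{MLP}}$, and $p_{\mathrm{linear}}$ are introduced here for illustration, and the numbers are back-of-the-envelope estimates rather than exact fold sizes:
\[
\frac{n_{\mathrm{train}}}{p_{\mathrm{MLP}}} \approx \frac{150}{249} \approx 0.60,
\qquad
\frac{n_{\mathrm{train}}}{p_{\mathrm{linear}}} \approx \frac{150}{29} \approx 5.2 .
\]
Both ratios fall below the 10--20 samples-per-parameter guideline, but the linear model has roughly $249/29 \approx 8.6$ times more samples per parameter than the MLP. This raw count ignores the regularizing effect of dropout, so it should be read as an upper bound on the MLP's effective capacity rather than a precise measure.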

\paragraph{Discussion.}
These results support our choice of a linear combination approach for several reasons:
\begin{enumerate}
    \item \textbf{Interpretability}: Linear coefficients directly indicate the relative importance and direction of each metric's contribution, facilitating analysis of which metric categories matter for each merging method. This transparency is crucial for understanding \emph{why} certain task pairs are predicted to merge well or poorly.
    \item \textbf{Robustness}: The linear model shows more consistent generalization across all merging methods, whereas MLPs exhibit high variance across folds and degrade substantially for some methods.
    \item \textbf{Appropriate inductive bias}: Given the limited data and the need to generalize across diverse tasks, the linear model's simplicity is a feature rather than a limitation. It captures the dominant signal without overfitting to noise.
\end{enumerate}

The failure of MLPs to improve consistently upon linear combinations suggests that the relationship between mergeability metrics and post-merge performance is approximately linear, or at least that any non-linear patterns are too subtle to capture reliably with the available data. This is in line with the general observation that simpler models often outperform more complex ones when data are scarce relative to model capacity.