\section{Hyperparameters}
\label{app:hyperparameters}

This appendix details all hyperparameters used in our experiments, including merging method configurations, metric computation settings, and optimization parameters.

\subsection{Merging Methods}

Table~\ref{tab:merging_hyperparameters} summarizes the hyperparameters for each merging method evaluated in our benchmark.

\begin{table}[htbp]
\centering
\caption{Hyperparameters for each merging method.}
\label{tab:merging_hyperparameters}
\begin{tabular}{llp{8cm}}
\toprule
\textbf{Method} & \textbf{Parameter} & \textbf{Value / Description} \\
\midrule
\multirow{2}{*}{Task Arithmetic} & $\alpha$ (scaling) & 0.3 \\
 & Formula & $\theta_{\text{merged}} = \theta_{\text{pre}} + \alpha \sum_t \tau_t$ \\
\midrule
\multirow{2}{*}{Weight Averaging} & $\alpha$ (scaling) & 1.0 (implicit, no scaling) \\
 & Formula & $\theta_{\text{merged}} = \theta_{\text{pre}} + \frac{1}{T} \sum_t \tau_t$ \\
\midrule
\multirow{3}{*}{Isotropic Merging} & $\alpha$ (ViT-B-16, 20 tasks) & 1.0 \\
 & $\alpha$ (ViT-B-16, 14 tasks) & 1.2 \\
 & $\alpha$ (ViT-B-16, 8 tasks) & 1.4 \\
\midrule
\multirow{2}{*}{TSV} & SVD compression & Per-task (compression ratio $= 1/T$) \\
 & Non-matrix aggregation & Mean \\
\bottomrule
\end{tabular}
\end{table}

\paragraph{Task Arithmetic.} We use a fixed scaling coefficient $\alpha = 0.3$, which has been found to work well across diverse task combinations in prior work~\cite{ilharco2023editing}.
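For concreteness, a minimal sketch of the merging step follows, assuming task vectors are stored as per-parameter dictionaries of weight deltas (the function name is illustrative):

\begin{verbatim}
def task_arithmetic_merge(pretrained_sd, task_vectors, alpha=0.3):
    """theta_merged = theta_pre + alpha * sum_t tau_t, applied key by key."""
    merged = {name: w.clone() for name, w in pretrained_sd.items()}
    for tau in task_vectors:           # one dict of per-parameter deltas per task
        for name, delta in tau.items():
            merged[name] += alpha * delta
    return merged
\end{verbatim}

Weight averaging corresponds to the same update with the coefficient $1/T$ in place of a tuned $\alpha$.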

\paragraph{Isotropic Merging.} The scaling coefficient $\alpha$ varies with the number of tasks being merged: for ViT-B-16 we use $\alpha = 1.0$ with 20 tasks (our primary setting), $\alpha = 1.2$ with 14 tasks, and $\alpha = 1.4$ with 8 tasks (Table~\ref{tab:merging_hyperparameters}).

\paragraph{TSV (Task Singular Vectors).} The SVD compression ratio is set to $1/T$ where $T$ is the number of tasks being merged. Non-matrix parameters (e.g., biases, layer norms) are aggregated using the mean.
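A minimal sketch of the per-task compression step is shown below; it covers only the $1/T$ truncation applied to 2-D weight deltas, and the surrounding TSV aggregation across tasks is omitted:

\begin{verbatim}
import torch

def tsv_compress(delta, num_tasks):
    """Keep the top ~1/T of the singular spectrum of a 2-D task-vector matrix."""
    U, S, Vh = torch.linalg.svd(delta, full_matrices=False)
    k = max(1, S.numel() // num_tasks)      # compression ratio 1/T
    return (U[:, :k] * S[:k]) @ Vh[:k, :]   # low-rank reconstruction
\end{verbatim}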

\subsection{Mergeability Metrics}

Table~\ref{tab:metric_hyperparameters} lists the hyperparameters used for computing mergeability metrics.

\begin{table}[htbp]
\centering
\caption{Hyperparameters for mergeability metric computation.}
\label{tab:metric_hyperparameters}
\begin{tabular}{lll}
\toprule
\textbf{Category} & \textbf{Parameter} & \textbf{Value} \\
\midrule
\multirow{4}{*}{Subspace Overlap} & $k$ (top/bottom directions) & 10 \\
 & Singular value overlap $k$ & 100 \\
 & Applied to & Left and right singular vectors \\
 & Layers & All transformer blocks \\
\midrule
\multirow{4}{*}{Activation-Based} & Calibration samples per task & 10 \\
 & Batch size & 32 \\
 & Random seed & 42 \\
 & Target layer & \texttt{visual.transformer.resblocks.11} \\
\midrule
\multirow{4}{*}{Gradient-Based} & Calibration samples per task & 10 \\
 & Batch size & 8 \\
 & Random seed & 42 \\
 & Gradient type & Encoder and input gradients \\
\bottomrule
\end{tabular}
\end{table}

\paragraph{Subspace Overlap Metrics.} For the left and right subspace overlap metrics, as well as the interaction matrix overlap, we use $k=10$ singular directions from both the top (highest singular values) and bottom (lowest singular values) of the spectrum. This captures both the principal and residual subspaces of the task vectors. For singular value overlap, we use $k=100$ to capture a broader distribution of the singular value spectrum.
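A sketch of one such overlap computation is given below. It assumes the overlap is the normalized squared Frobenius norm of the cross-projection between the two $k$-dimensional subspaces (equal to 1 when the subspaces coincide), which may differ in detail from our exact implementation:

\begin{verbatim}
import torch

def subspace_overlap(delta_a, delta_b, k=10, side="left", top=True):
    """Overlap of the k-dim singular subspaces of two task-vector matrices."""
    Ua, _, Vha = torch.linalg.svd(delta_a, full_matrices=False)
    Ub, _, Vhb = torch.linalg.svd(delta_b, full_matrices=False)
    A = Ua if side == "left" else Vha.T     # columns = singular vectors
    B = Ub if side == "left" else Vhb.T
    A = A[:, :k] if top else A[:, -k:]      # top or bottom of the spectrum
    B = B[:, :k] if top else B[:, -k:]
    return (A.T @ B).pow(2).sum().item() / k
\end{verbatim}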

\paragraph{Activation-Based Metrics.} We extract activations from the last transformer block (\texttt{resblocks.11} for ViT-B-16) using 10 calibration samples per task. The activations are compared using L2 distance, cosine similarity, magnitude ratio, and dot product.
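The sketch below illustrates the extraction and comparison. It assumes a CLIP-style model that exposes \texttt{encode\_image} and the module path \texttt{visual.transformer.resblocks.11}; comparing flattened activations of a whole batch is a simplification:

\begin{verbatim}
import torch
import torch.nn.functional as F

@torch.no_grad()
def extract_activations(model, images, layer="visual.transformer.resblocks.11"):
    """Capture the output of the target block via a forward hook."""
    store = {}
    def hook(_module, _inputs, output):
        out = output[0] if isinstance(output, tuple) else output
        store["act"] = out.detach()
    handle = dict(model.named_modules())[layer].register_forward_hook(hook)
    model.encode_image(images)
    handle.remove()
    return store["act"]

def activation_metrics(act_a, act_b):
    """Compare activations of two checkpoints on the same calibration batch."""
    a, b = act_a.flatten(), act_b.flatten()
    return {
        "l2": (a - b).norm().item(),
        "cosine": F.cosine_similarity(a, b, dim=0).item(),
        "magnitude_ratio": (a.norm() / b.norm()).item(),
        "dot": torch.dot(a, b).item(),
    }
\end{verbatim}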

\paragraph{Gradient-Based Metrics.} We compute gradients with respect to both the encoder parameters and the input images. This requires a forward-backward pass on calibration data, using 10 samples per task with a batch size of 8 to manage memory constraints.
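A sketch of one forward-backward pass is given below. It assumes a CLIP-style zero-shot classification loss built from precomputed, normalized text features, and that the image encoder is accessible as \texttt{model.visual}; the actual calibration objective may differ:

\begin{verbatim}
import torch
import torch.nn.functional as F

def calibration_gradients(model, images, labels, text_features):
    """Gradients w.r.t. the image encoder parameters and the input images."""
    images = images.clone().requires_grad_(True)
    model.zero_grad(set_to_none=True)
    img_feat = F.normalize(model.encode_image(images), dim=-1)
    logits = 100.0 * img_feat @ text_features.t()   # zero-shot logits
    F.cross_entropy(logits, labels).backward()
    encoder_grad = torch.cat(
        [p.grad.flatten() for p in model.visual.parameters()
         if p.grad is not None]
    )
    return encoder_grad, images.grad.flatten(1)     # per-image input gradients
\end{verbatim}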

\subsection{Linear Optimization}

Table~\ref{tab:optimization_hyperparameters} details the hyperparameters for the learned linear mergeability predictor.

\begin{table}[htbp]
\centering
\caption{Hyperparameters for linear mergeability optimization.}
\label{tab:optimization_hyperparameters}
\begin{tabular}{ll}
\toprule
\textbf{Parameter} & \textbf{Value} \\
\midrule
Optimizer & Adam \\
Learning rate & 0.01 \\
Maximum iterations & 1,000 \\
Early stopping patience & 50 iterations \\
Convergence threshold & $10^{-4}$ \\
Metric normalization & Min-max to $[-1, 1]$ \\
Target metric & Normalized test accuracy (average) \\
Cross-validation & Leave-one-task-out (20 folds) \\
\bottomrule
\end{tabular}
\end{table}
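A minimal sketch of the fit is shown below, assuming a mean-squared-error objective between the linear prediction and the normalized accuracy; leave-one-task-out cross-validation simply wraps this routine, holding out one task per fold:

\begin{verbatim}
import torch

def fit_linear_predictor(X, y, lr=0.01, max_iters=1000, patience=50, tol=1e-4):
    """Fit w, b so that X @ w + b predicts normalized test accuracy."""
    # Min-max normalize each metric to [-1, 1].
    lo, hi = X.min(dim=0).values, X.max(dim=0).values
    X = 2 * (X - lo) / (hi - lo + 1e-12) - 1

    w = torch.zeros(X.shape[1], requires_grad=True)
    b = torch.zeros(1, requires_grad=True)
    opt = torch.optim.Adam([w, b], lr=lr)

    best, stale = float("inf"), 0
    for _ in range(max_iters):
        opt.zero_grad()
        loss = torch.mean((X @ w + b - y) ** 2)
        loss.backward()
        opt.step()
        if best - loss.item() > tol:       # convergence threshold 1e-4
            best, stale = loss.item(), 0
        else:
            stale += 1
            if stale >= patience:          # early stopping after 50 stale iterations
                break
    return w.detach(), b.detach()
\end{verbatim}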

\subsection{MLP Ablation}

For the MLP ablation study (Appendix~\ref{app:mlp_ablation}), we used the hyperparameters listed in Table~\ref{tab:mlp_hyperparameters}.

\begin{table}[htbp]
\centering
\caption{Hyperparameters for MLP mergeability predictor.}
\label{tab:mlp_hyperparameters}
\begin{tabular}{ll}
\toprule
\textbf{Parameter} & \textbf{Value} \\
\midrule
Architecture & Input $\rightarrow$ Hidden $\rightarrow$ Output \\
Hidden dimension & 8 \\
Activation & ReLU \\
Dropout rate & 0.4 \\
Optimizer & Adam \\
Learning rate & 0.001 \\
Weight decay (L2) & 0.001 \\
Epochs & 300 \\
Batch size & Full batch \\
Input normalization & Min-max to $[-1, 1]$ \\
\bottomrule
\end{tabular}
\end{table}
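For reference, a sketch of the predictor follows, assuming dropout is applied after the hidden activation and the output is a single scalar mergeability score:

\begin{verbatim}
import torch.nn as nn

class MergeabilityMLP(nn.Module):
    """Single-hidden-layer predictor used in the MLP ablation."""
    def __init__(self, in_dim, hidden_dim=8, dropout=0.4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)

# Trained full-batch for 300 epochs with Adam (lr=1e-3, weight decay 1e-3)
# on inputs min-max normalized to [-1, 1].
\end{verbatim}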

\subsection{Fine-Tuning}

The task-specific models were fine-tuned from a pretrained CLIP ViT-B-16 checkpoint using the hyperparameters in Table~\ref{tab:finetuning_hyperparameters}. Fine-tuning epochs vary per dataset based on convergence characteristics.

\begin{table}[htbp]
\centering
\caption{Fine-tuning hyperparameters.}
\label{tab:finetuning_hyperparameters}
\begin{tabular}{ll}
\toprule
\textbf{Parameter} & \textbf{Value} \\
\midrule
Base model & CLIP ViT-B-16 (OpenAI) \\
Optimizer & AdamW \\
Batch size & 64 \\
Gradient accumulation & 2 steps \\
Gradient clipping & 10.0 \\
Precision & FP32 \\
Epochs & Dataset-specific (1--147) \\
\bottomrule
\end{tabular}
\end{table}

The number of fine-tuning epochs varies significantly across datasets, ranging from 1 epoch for PCAM to 147 epochs for Flowers102. This reflects the different convergence characteristics and dataset sizes. Representative values include: MNIST (5), CIFAR10 (6), CIFAR100 (6), Cars (35), DTD (76), and OxfordIIITPet (82).
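A sketch of one fine-tuning epoch is given below, assuming a standard cross-entropy objective over a classification head on the image encoder; optimizer, batch size, and precision are as in Table~\ref{tab:finetuning_hyperparameters}:

\begin{verbatim}
import torch
import torch.nn.functional as F

def finetune_one_epoch(model, loader, optimizer, accum_steps=2, clip_norm=10.0):
    """One epoch with gradient accumulation (2 steps) and clipping (10.0), FP32."""
    model.train()
    optimizer.zero_grad(set_to_none=True)
    for step, (images, labels) in enumerate(loader):
        loss = F.cross_entropy(model(images), labels) / accum_steps
        loss.backward()
        if (step + 1) % accum_steps == 0:
            torch.nn.utils.clip_grad_norm_(model.parameters(), clip_norm)
            optimizer.step()
            optimizer.zero_grad(set_to_none=True)
\end{verbatim}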