% \documentclass{uai2022} % for initial submission
\documentclass[accepted]{uai2022} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2022} % ptmx math instead of Computer
                                         % Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2022} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams
\usepackage{amsmath,bm, amsfonts}
\usepackage{algorithm}
\usepackage{algorithmic}
\usepackage{adjustbox}
\usepackage{multirow}
\usepackage{subfig}

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example
\DeclareMathOperator*{\argmax}{arg\,max}
\DeclareMathOperator*{\argmin}{arg\,min}
\def\vtheta{{\bm{\theta}}}
\def\gC{{\mathcal{C}}}
\def\gT{{\mathcal{T}}}
\def\gS{{\mathcal{S}}}
\def\gQ{{\mathcal{Q}}}
\def\gF{{\mathcal{F}}}
\def\gP{{\mathcal{P}}}
\def\gD{{\mathcal{D}}}
\def\gL{{\mathcal{L}}}
\def\gZ{{\mathcal{Z}}}
\def\ve{{\bm{e}}}
\def\sR{{\mathbb{R}}}
\title{Meta-Learning without Data via\\ Wasserstein Distributionally-Robust Model Fusion (Supplementary material)}

% The standard author block has changed for UAI 2022 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{\href{mailto:<jj@example.edu>?Subject=Your UAI 2022 paper}{Zhenyi Wang}{}}
\author[2]{Xiaoyang Wang}
\author[3]{Li Shen}
\author[2]{Qiuling Suo}
\author[2]{Kaiqiang Song}
\author[2]{Dong Yu}
\author[1]{Yan Shen}
\author[1]{Mingchen Gao}
% Add affiliations after the authors
\affil[1]{%
    Department of Computer Science and Engineering\\
    State University of New York at Buffalo\\
    NY, USA
}
\affil[2]{%
    Tencent AI Lab\\
    Seattle, WA, USA
}
\affil[3]{%
    JD Explore Academy\\
    Beijing, China
  }
  
  \begin{document}
\maketitle







\section{Experiments}



\subsection{Baselines} \label{sup:baseline}
To demonstrate the effectiveness of the proposed methods, we compare them against the following baselines.



\textbf{Finetuning}. We randomly initialize the network parameters and then finetune this network on the few-shot labeled data. This method performs the worst because it does not use any information from the pre-trained models.


\textbf{Vanilla averaging (VA)}. We average the pre-trained models layer by layer: for each layer, the fused parameters are the element-wise average of the corresponding parameters across all pre-trained models. This method assumes that all pre-trained models solve the same task and that parameters at the same position correspond to each other across models. This assumption does not hold in our data-free meta-learning setting, where each pre-trained model solves a different task, so there is no such correspondence among the pre-trained models.
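
As a worked form of the averaging rule, and assuming notation that is introduced here only for illustration, let $\vtheta_i^{(l)}$ denote the layer-$l$ parameters of the $i$-th pre-trained model and $N$ the number of pre-trained models. VA then computes, for every layer $l$,
\[
\vtheta_{\mathrm{VA}}^{(l)} = \frac{1}{N}\sum_{i=1}^{N} \vtheta_i^{(l)},
\]
where the sum is taken element-wise. The $j$-th entry of the fused layer therefore depends only on the $j$-th entries of the individual models, which makes the positional-correspondence assumption explicit.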

\textbf{MAML} \citep{finn17a}. This baseline meta-trains on all tasks with their training and testing data available, a setting entirely different from ours. We use these datasets to train MAML as in standard meta-learning, which gives a sense of how MAML performs with data access compared with the data-free setting. MAML does not perform well here even with training data because the number of tasks (100, the same as the number of pre-trained models) is much smaller than in standard data-based meta-learning. Thus, it learns only weak domain knowledge.

\textbf{Optimal transport averaging (OTA)} \citep{singh2021model}. This baseline fuses the models layer by layer in four steps.

Step 1: following \citet{singh2021model}, assume we are at layer $l$ and that the neurons in the previous layers have already been aligned.

Step 2: we initialize the histograms of the probability measures for this layer with uniform distributions.

Step 3: we use layer $l$ of one randomly sampled pre-trained model as the estimate of the fused model for layer $l$. We then calculate, for each pre-trained model, the aligned model with respect to this estimate.

Step 4: we calculate the average of all the aligned models as the fused model for layer $l$.

This method also assumes that the different pre-trained models solve the same task, so that their parameters can be aligned. In the data-free meta-learning scenario, however, different models solve different tasks. In addition, it does not consider or optimize generalization to unseen tasks.
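
The four steps can be sketched in formulas. The notation is an assumption made for illustration and is not taken from the main text: $\bm{W}_i^{(l)} \in \sR^{n \times m}$ is the layer-$l$ weight matrix of the $i$-th pre-trained model after its inputs have been aligned (Step 1), with one row $\bm{W}_i^{(l)}[j]$ per neuron; $\bm{W}_{\mathrm{ref}}^{(l)}$ is the same layer of the randomly sampled reference model (Step 3); $\Pi(\bm{a},\bm{b})$ is the set of nonnegative $n \times n$ matrices with row sums $\bm{a}$ and column sums $\bm{b}$; and the ground cost is taken to be the squared Euclidean distance between neuron weight vectors, which is one common choice and not necessarily the one used in the experiments.
\[
\begin{aligned}
\bm{a} &= \bm{b} = \tfrac{1}{n}\bm{1}_n, \\
\bm{T}_i^{(l)} &= \argmin_{\bm{T} \in \Pi(\bm{a},\bm{b})} \sum_{j,k} T_{jk}\, \bigl\| \bm{W}_i^{(l)}[j] - \bm{W}_{\mathrm{ref}}^{(l)}[k] \bigr\|^2, \\
\widetilde{\bm{W}}_i^{(l)} &= n\, \bigl(\bm{T}_i^{(l)}\bigr)^{\top} \bm{W}_i^{(l)}, \\
\bm{W}_{\mathrm{OTA}}^{(l)} &= \frac{1}{N}\sum_{i=1}^{N} \widetilde{\bm{W}}_i^{(l)}.
\end{aligned}
\]
With uniform histograms, each column of $\bm{T}_i^{(l)}$ sums to $1/n$, so the factor $n$ makes every aligned neuron a convex combination of the neurons of model $i$. The final average is meaningful only if such a neuron-to-neuron matching exists, which is the same-task assumption discussed above.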




\textbf{Model fusion with Gaussian process (MFGP)} \citep{personfusion}. MFGP consists of three modules.

[1] Base Module network.
This module computes the mean vector and diagonal covariance matrix of the outer multivariate Gaussian that distributes $\bm{w}_{\oslash}$, where $\bm{w}_{\oslash}$ is a 100-dimensional vector generated from a 100-dimensional noise vector.

[2] Task-Specific Module Gaussian process parameterization.
This module consists of 10 independent sparse Gaussian processes (GPs), which represent 10 independent priors over 10 random functions mapping from the task embedding to a scalar.

[3] Crossing Module $P(\vtheta|\bm{w}_{\oslash}, \bm{w})$ parameterization.
This module computes the mean vector and diagonal covariance matrix of the outer multivariate Gaussian that models the distribution of $\vtheta$.

The above parameterization describes the generative process of $\vtheta$ from $\bm{w}$ and $\bm{w}_{\oslash}$ for a single task $\mathcal{T}_i$. The fusion model is trained with a variational lower bound.




During meta-testing, we adapt MFGP to fuse the pre-trained models with our proposed method as follows:
\[
\begin{aligned}
\ve_{init} &= \frac{1}{N}\sum_{i=1}^{N} \ve_i, \\
\vtheta_{init} &= f_{\bm{\phi}_{meta}} (\ve_{init}),
\end{aligned}
\]
where $\ve_{init}$ is the average embedding of all the pre-trained models, and $\bm{\phi}_{meta}$ is the optimal solution to Eq.~(8) of the main text.

MFGP relies on Gaussian processes, which can only handle simple networks such as MLPs, to fuse standard pre-trained models, and is therefore hard to scale to more complex problems such as our setting. Furthermore, it does not consider or optimize generalization to unseen tasks.



\subsection{More results} \label{app:results}

In this section, we present ablation and sensitivity studies to verify the effectiveness and stability of our proposed framework on the offline DFL2L task.

\textbf{Ablation Study}. We evaluate the effectiveness of DRO for model fusion by removing the DRO component. The results are shown in Table~\ref{tab:ablationCIFAR5way}. With DRO, the performance on CIFAR-FS improves by about 1.2\% (absolute) for 10-shot, from 49.23 to 50.42, and by about 1.5\% (absolute) for 20-shot, from 53.35 to 54.86.


\begin{table}[H]
\centering
\caption{Ablation study on offline DFL2L CIFAR-FS 5-way classification.}
\label{tab:ablationCIFAR5way}
\begin{tabular}{lcc}
\toprule
Method & 10-shot & 20-shot \\
\midrule
Ours (w/o DRO) & $49.23 \pm 1.7$ & $53.35 \pm 1.4$ \\
Ours (w/ DRO) & $50.42 \pm 1.5$ & $54.86 \pm 1.2$ \\
\bottomrule
\end{tabular}
\end{table}


\textbf{Hyperparameter Sensitivity}. We evaluate how sensitive the model performance is to the value of $\gamma$ in Table~\ref{tab:hyperimagenet10way}. Over the considered values, the performance is not very sensitive to $\gamma$: the results vary by at most 0.48 for 10-shot and 0.36 for 20-shot, which is well within the reported $\pm$ ranges.



\begin{table}[H]
\centering
\caption{Hyperparameter sensitivity on offline DFL2L miniImageNet 5-way classification.}
\label{tab:hyperimagenet10way}
\begin{tabular}{lcc}
\toprule
$\gamma$ & 10-shot & 20-shot \\
\midrule
$10.0$ & $37.09 \pm 1.8$ & $43.37 \pm 1.5$ \\
$2.0$ & $37.36 \pm 1.7$ & $43.67 \pm 1.6$ \\
$0.5$ & $37.57 \pm 1.5$ & $43.31 \pm 1.4$ \\
\bottomrule
\end{tabular}
\end{table}


\subsection{Hyperparameter selection}

As mentioned in the main text, we move the Wasserstein ball constraint into the objective function. After applying Lagrangian duality, the optimization becomes
\[
\max_{\bm{\phi}} \inf_{\nu \in \mathcal{P}} \mathbb{E}_{\nu} \left[\mathcal{F}(\bm{\phi}) + \gamma \left(W(\mu, \nu)-\delta\right) \right].
\]
Since $\delta$ is a constant rather than an optimization variable, it does not affect the optimization, and the constraint is implicitly enforced through the Lagrange multiplier $\gamma$. Therefore, the above problem can be equivalently formulated as
\[
\max_{\bm{\phi}} \inf_{\nu \in \mathcal{P}} \mathbb{E}_{\nu} \left[\mathcal{F}(\bm{\phi}) + \gamma W(\mu, \nu) \right].
\]
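
The equivalence of the two objectives can be checked directly. Treating $\gamma$ as a fixed hyperparameter (an assumption, consistent with selecting $\gamma$ on a validation set), and noting that neither $W(\mu,\nu)$ nor $\delta$ depends on the sample drawn from $\nu$, we have
\[
\mathbb{E}_{\nu} \left[\mathcal{F}(\bm{\phi}) + \gamma \left(W(\mu, \nu)-\delta\right) \right]
= \mathbb{E}_{\nu} \left[\mathcal{F}(\bm{\phi})\right] + \gamma W(\mu, \nu) - \gamma\delta .
\]
The last term $\gamma\delta$ depends on neither $\bm{\phi}$ nor $\nu$, so it only shifts the objective value by a constant and leaves both the inner minimizer over $\nu$ and the outer maximizer over $\bm{\phi}$ unchanged.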

In this case, $\gamma$ controls the strength of the regularization, and the problem of choosing $\delta$ becomes that of choosing $\gamma$.
To select $\gamma$, as mentioned in the main text, we use a validation set of pre-trained models.

First, we calculate the meta initialization for the validation set of pre-trained models as follows:
\[
\begin{aligned}
\ve_{init} &= \frac{1}{N}\sum_{i=1}^{N} \ve_i, \\
\vtheta_{init} &= f_{\bm{\phi}_{meta}} (\ve_{init}),
\end{aligned}
\]
where $\ve_{init}$ is the average embedding of all the pre-trained models, and $\bm{\phi}_{meta}$ is the optimal solution to Eq.~(8) of the main text.

Then, we calculate the likelihood of the pre-trained models in the validation set. The likelihood of a validation pre-trained model $\vtheta_i$ is given by the following Gaussian likelihood function:
\[
P(\vtheta_i|\vtheta_{init}) = \exp\left(-\frac{\|\vtheta_{init}-\vtheta_i\|^2}{\sigma^2}\right).
\]


Then, we use grid search and select the $\gamma$ that gives the highest likelihood on the validation set of pre-trained models.
If we instead want to work with $\delta$ directly rather than through the $\gamma$ regularization, we can use projected gradient descent to project each gradient update onto the Wasserstein ball constraint; the best $\delta$ can then be selected with the same procedure as for $\gamma$.
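
The grid search over $\gamma$ can be written compactly under two assumptions that are made here only for illustration: the validation models $\vtheta_1,\dots,\vtheta_M$ are treated as independent, so their joint likelihood is a product, and $\sigma$ is held fixed across candidates. Writing $\Gamma$ for the candidate grid (for instance, the values $\{0.5, 2.0, 10.0\}$ considered in Table~\ref{tab:hyperimagenet10way}) and $\vtheta_{init}(\gamma)$ for the meta initialization obtained with a given $\gamma$, the selection rule is
\[
\begin{aligned}
\gamma^{\ast} &= \argmax_{\gamma \in \Gamma} \sum_{i=1}^{M} \log P\bigl(\vtheta_i \,|\, \vtheta_{init}(\gamma)\bigr) \\
&= \argmin_{\gamma \in \Gamma} \sum_{i=1}^{M} \bigl\|\vtheta_{init}(\gamma) - \vtheta_i\bigr\|^2 ,
\end{aligned}
\]
where the second line uses $\log P(\vtheta_i|\vtheta_{init}) = -\|\vtheta_{init}-\vtheta_i\|^2/\sigma^2$ with $\sigma > 0$ fixed. Under these assumptions, the selected $\gamma$ is the one whose meta initialization is closest, in average squared distance, to the validation pre-trained models.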


\subsection{More Discussion}

\paragraph{Model Fusion vs Transfer Learning}
The number of available pre-trained models determines whether classical transfer learning or meta-learning should be adopted. If the number of pre-trained models is small, meta-learning is unnecessary. With only one pre-trained model, transfer learning is sufficient and has been well studied in existing work. With only a few pre-trained models, how to use them depends on the downstream tasks, practical deployment requirements, etc. For example, if we have both GPT and BERT, which one to use depends on the downstream task: for text generation we can choose GPT, and for language understanding we can use BERT. In contrast, our focus is on the meta-learning scenario, in which many pre-trained models are available and we have to design a general method for learning how to use them. Thus, the research focus is entirely different.


Fusing big models, such as BERT and GPT, would be more challenging, and most existing works still focus on much smaller and simpler networks, such as MLPs and CNNs. One possible solution is to first divide large layers into smaller blocks and then apply our method to fuse the models block by block, which would simplify the fusion process.



\bibliography{uai2022-template}


\end{document}
