% \documentclass{uai2022} % for initial submission
\documentclass[accepted]{uai2022} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2022} % ptmx math instead of Computer
                                         % Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2022} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{abbrvnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{amsmath}
\usepackage{amsfonts}
\usepackage{makecell}
\usepackage{listings, multicol}
\usepackage{xcolor}
\usepackage{subcaption}

\definecolor{codegreen}{rgb}{0,0.6,0}
\definecolor{codegray}{rgb}{0.5,0.5,0.5}
\definecolor{codepurple}{rgb}{0.58,0,0.82}
\definecolor{backcolour}{rgb}{0.95,0.95,0.92}

\lstdefinestyle{mystyle}{
    backgroundcolor=\color{backcolour},   
    commentstyle=\color{codegreen},
    keywordstyle=\color{magenta},
    numberstyle=\tiny\color{codegray},
    stringstyle=\color{codepurple},
    basicstyle=\ttfamily\scriptsize,
    breakatwhitespace=false,         
    breaklines=true,                 
    captionpos=b,                    
    keepspaces=true,                 
    numbers=left,                    
    numbersep=5pt,                  
    showspaces=false,                
    showstringspaces=false,
    showtabs=false,                  
    tabsize=2
}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%                               PROOF, THEOREM, and FRIENDS
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\newcommand{\BlackBox}{\rule{1.5ex}{1.5ex}}  % end of proof
\newenvironment{proof}{\par\noindent{\bf Proof\ }}{\hfill\BlackBox\\[2mm]}
\newtheorem{example}{Example} 
\newtheorem{theorem}{Theorem}
\newtheorem{lemma}[theorem]{Lemma} 
\newtheorem{proposition}[theorem]{Proposition} 
\newtheorem{remark}[theorem]{Remark}
\newtheorem{corollary}[theorem]{Corollary}
\newtheorem{definition}[theorem]{Definition}
\newtheorem{conjecture}[theorem]{Conjecture}
\newtheorem{axiom}[theorem]{Axiom}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\lstset{style=mystyle}

% \usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams
\usepackage{xr}

\makeatletter
\newcommand*{\addFileDependency}[1]{% argument=file name and extension
  \typeout{(#1)}
  \@addtofilelist{#1}
  \IfFileExists{#1}{}{\typeout{No file #1.}}
}
\makeatother

\newcommand*{\myexternaldocument}[1]{%
    \externaldocument{#1}%
    \addFileDependency{#1.tex}%
    \addFileDependency{#1.aux}%
}

\myexternaldocument{bartunov_444}
\numberwithin{equation}{section}

\usepackage{chngcntr}
\counterwithin{figure}{section}

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

\title{Equilibrium Aggregation: Encoding Sets via Optimization \\ (Supplementary material)}

% Encoding sets via optimization
% Optimization-based aggregation for sets
% An optimization-based aggregation operator for sets
% Optimization-based aggregation for sequences and sets
%


% The standard author block has changed for UAI 2022 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1,2,*]{\href{mailto:<sbos.net@gmail.com>?Subject=Equilibrium Aggregation}{Sergey Bartunov}{}}
\author[1,*]{Fabian B. Fuchs}
\author[1]{Timothy P. Lillicrap}
% Add affiliations after the authors
\affil[1]{%
    DeepMind\\
    London, United Kingdom
}
% \affil[3]{%
%     Work done at DeepMind
% }
\affil[2]{%
    Now at CHARM Therapeutics\\
    London, United Kingdom\\
    \vspace{3mm}
}
\affil[*]{%
    Joint first authorship
}
  
\begin{document}
\maketitle

\appendix

\section{Equilibrium aggregation as MAP inference}

Here we provide another useful perspective on Equilibrium Aggregation, connecting the method to prior work in Bayesian inference and continuing one of the arguments made by~\cite{DeepSets}.


Consider a joint distribution over a sequence of random variables $\mathbf{x}_1, \mathbf{x}_2, \ldots$.
The sequence is called infinitely exchangeable if, for any $N$, the joint probability $p(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N)$ is invariant to permutations of the indices.
Formally speaking, for any permutation $\pi$ of the indices we have
$$
    p(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N) = p(\mathbf{x}_{\pi(1)}, \mathbf{x}_{\pi(2)}, \ldots, \mathbf{x}_{\pi(N)}).
$$

According to De Finetti's theorem \citep[see, for example,][]{diaconis1987dozen}, the sequence $\mathbf{x}_1, \mathbf{x}_2, \ldots$ is infinitely exchangeable iff, for all $N$, it admits the following mixture-style decomposition:
$$
    p(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N) = \int \prod_{i=1}^N p(\mathbf{x}_i | \mathbf{y}) p(\mathbf{y}) d\mathbf{y}.
$$
Since the existence of this model for exchangeable sequences is guaranteed, one can consider the posterior distribution $p(\mathbf{y} | \mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N)$, which effectively \emph{encodes} all global information about the observed inputs.

Since full and exact posterior inference is often infeasible (the theorem does not guarantee that the prior $p(\mathbf{y})$ and the likelihood $p(\mathbf{x} | \mathbf{y})$ are conjugate or otherwise admit closed-form inference), in practice \emph{maximum a posteriori (MAP)} estimates are used when a point estimate is sufficient:
\begin{align}
    \hat{\mathbf{y}} &= \arg \max_{\mathbf{y}} \log p(\mathbf{y} | \mathbf{x}_1, \ldots, \mathbf{x}_N) \nonumber \\
    &= \arg \max_{\mathbf{y}} \left[ \sum_{i=1}^N \underbrace{\log p(\mathbf{x}_i | \mathbf{y})}_{=-F(\mathbf{x}_i, \mathbf{y})} + \underbrace{\log p(\mathbf{y})}_{=-R(\mathbf{y})} \right]. \label{eq:MAP}
\end{align}
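The second equality is Bayes' rule combined with the conditional independence of the elements given $\mathbf{y}$:
\begin{equation*}
    \log p(\mathbf{y} | \mathbf{x}_1, \ldots, \mathbf{x}_N) = \sum_{i=1}^N \log p(\mathbf{x}_i | \mathbf{y}) + \log p(\mathbf{y}) - \log p(\mathbf{x}_1, \ldots, \mathbf{x}_N),
\end{equation*}
where the last term does not depend on $\mathbf{y}$ and therefore drops out of the $\arg\max$.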

Informally speaking, this means that MAP encoding of sets under a probabilistic model with a global hidden variable (which must exist, albeit potentially in a complicated form) amounts to the optimization problem~\eqref{eq:MAP}, which is almost the same as the Equilibrium Aggregation formulation~\eqref{eq:aggregation_optimization}.
If the potential $F(\mathbf{x}, \mathbf{y})$ is allowed to be a flexible neural network, it can in principle recover the desired negative log-likelihood $-\log p(\mathbf{x} | \mathbf{y})$ (up to an additive constant).
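
As a small worked instance of this correspondence, consider an isotropic Gaussian model. This model is chosen here purely for illustration and is not one used in the experiments; the variances $\sigma^2$ and $\tau^2$ are symbols introduced for this example only. With $p(\mathbf{x} | \mathbf{y}) = \mathcal{N}(\mathbf{x}; \mathbf{y}, \sigma^2 I)$ and $p(\mathbf{y}) = \mathcal{N}(\mathbf{y}; \mathbf{0}, \tau^2 I)$ we obtain, up to additive constants,
\begin{align*}
    F(\mathbf{x}, \mathbf{y}) &= \frac{1}{2\sigma^2} || \mathbf{x} - \mathbf{y} ||_2^2, \qquad R(\mathbf{y}) = \frac{1}{2\tau^2} || \mathbf{y} ||_2^2, \\
    \hat{\mathbf{y}} &= \frac{1}{N + \sigma^2 / \tau^2} \sum_{i=1}^N \mathbf{x}_i.
\end{align*}
The second line follows from setting the gradient of $\sum_{i} F(\mathbf{x}_i, \mathbf{y}) + R(\mathbf{y})$ with respect to $\mathbf{y}$ to zero. In this case the MAP encoding is a shrunken mean, which approaches plain mean pooling as $\tau \to \infty$.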

This observation provides an additional theoretical argument in support of Equilibrium Aggregation and also suggests a number of interesting extensions one can imagine by further exploring the vast toolset of probabilistic inference.

\section{Attention as equilibrium aggregation}
\label{sec:appendix_attention}
We have already outlined how simple pooling methods can be recovered as special cases of Equilibrium Aggregation.
Here, we demonstrate how Equilibrium Aggregation can learn to model the popular attention mechanism.

We denote the interaction or query vector as $\mathbf{h}$. Note that we consider many-to-one aggregation and therefore only have one query vector. Here, the query vector is learned and independent of the input set.
For brevity, we will ignore the commonly used distinction between keys and values over which the attention is computed and will simply consider a set of vectors $X = \{ \mathbf{x}_i \}_{i=1}^N$ serving as both.
Now, we split the aggregation result as $\mathbf{y} = [\mathbf{y}_r, y_s]$ and define the potential function as follows:
\begin{equation*}
    F(\mathbf{x}, \mathbf{y}) = || \exp(\mathbf{h}^T \mathbf{x}) \mathbf{x} - \mathbf{y}_r ||_2^2 + (y_s - \exp(\mathbf{h}^T \mathbf{x}))^2.
\end{equation*}
Assuming no prior, the optimization problem~\eqref{eq:equilibrium_aggregation} would then lead to the following solution:
$$
    \mathbf{y}_r = \frac{1}{N} \sum_{i=1}^N \exp(\mathbf{h}^T \mathbf{x}_i) \mathbf{x}_i, \quad y_s = \frac{1}{N} \sum_{i=1}^N \exp(\mathbf{h}^T \mathbf{x}_i),
$$
from which the normalized result can be recovered trivially as 
$$
\frac{\mathbf{y}_r}{y_s} = \sum_{i=1}^N \frac{\exp(\mathbf{h}^T \mathbf{x}_i)}{\sum_{j=1}^N \exp(\mathbf{h}^T \mathbf{x}_j)} \mathbf{x}_i.
$$
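
The stated solution can be checked directly. Write $w_i = \exp(\mathbf{h}^T \mathbf{x}_i)$ (a shorthand introduced here) and take the first term of the potential to be the squared distance $|| w_i \mathbf{x}_i - \mathbf{y}_r ||_2^2$ between the weighted element and $\mathbf{y}_r$, which is the form under which the averages above are stationary points. Setting the gradients of the total energy to zero gives
\begin{align*}
    -2 \sum_{i=1}^N (w_i \mathbf{x}_i - \mathbf{y}_r) = 0 \quad &\Rightarrow \quad \mathbf{y}_r = \frac{1}{N} \sum_{i=1}^N w_i \mathbf{x}_i, \\
    2 \sum_{i=1}^N (y_s - w_i) = 0 \quad &\Rightarrow \quad y_s = \frac{1}{N} \sum_{i=1}^N w_i.
\end{align*}
Both terms are convex quadratics in $\mathbf{y}$, so this stationary point is the unique minimum, and the factors of $1/N$ cancel in the ratio $\mathbf{y}_r / y_s$.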

\section{Practical implementation of Equilibrium Aggregation}

While we generally found Equilibrium Aggregation to be robust to various aspects of implementation, in this appendix we share the best practices discovered in our experiments.

\subsection{Potential function}

\lstinputlisting[language=Python, caption=Potential function implementation in JAX., label={code:potential}, float=*]{code/potential.py}

In our experiments, the potential function $F(\mathbf{x}, \mathbf{y})$ was implemented as a two-layer ResNet with tanh activations, layer normalization~\citep{ba2016layer} and, importantly, a sum-of-squares output. The JAX implementation can be found in Listing~\ref{code:potential}.

The tanh activations and layer normalization ensured numerically stable gradients with respect to $\mathbf{y}$.
At the same time, the sum-of-squares output allowed the potential to exhibit richer behavior, especially when all of the potentials are summed in the total energy.

\subsection{Scaled energy}

The number of elements $N$ in the set may vary significantly across data points in a dataset, which makes it difficult to choose a single optimization schedule (learning rate and momentum) that works equally well for all values of $N$.
This is because the energy~\eqref{eq:aggregation_optimization} is a sum over all elements in the set, so the gradient $\nabla_{\mathbf{y}} E(X, \mathbf{y})$ scales linearly with $N$.

A potential solution to this problem would be to simply average the potentials instead of summing them, but this would make it very difficult, if not impossible, to reason about the number of elements in the set from $\mathbf{y}$.
Instead, we rescale the energy so that its magnitude still increases as $N$ grows, but at a sublinear rate:
\begin{equation}\label{eq:scaled_energy}
    E(X, \mathbf{y}) = \frac{R(\mathbf{y}) + \sum_{i=1}^N F(\mathbf{x}_i, \mathbf{y})}{N + \epsilon} \log_2 (N + 1),
\end{equation}
where $\epsilon = 10^{-8}$ is a small constant to prevent division by zero in the case of an empty set.
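
To make the effect of the rescaling in~\eqref{eq:scaled_energy} concrete, the following minimal Python sketch evaluates the scaled energy for sets of unit potentials. The names \texttt{scale\_factor} and \texttt{scaled\_energy} are hypothetical helpers introduced for this illustration and are not taken from the experiment code.

\begin{lstlisting}[language=Python]
import math

EPS = 1e-8  # epsilon from the text; guards against N = 0


def scale_factor(n, eps=EPS):
    """Multiplier log2(N + 1) / (N + eps)."""
    return math.log2(n + 1) / (n + eps)


def scaled_energy(potentials, regularizer=0.0, eps=EPS):
    """Scaled energy for potentials F(x_i, y) and R(y)."""
    n = len(potentials)
    total = regularizer + sum(potentials)
    return total * scale_factor(n, eps)


# With every potential equal to 1, the plain sum grows
# like N, while the scaled energy grows like log2(N + 1).
for n in (1, 10, 100, 1000):
    print(n, round(scaled_energy([1.0] * n), 3))
\end{lstlisting}

For an empty set the factor $\log_2(N + 1)$ is zero, so the scaled energy vanishes and $\epsilon$ only serves to keep the division well defined.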

\subsection{Initialization}

In all experiments, $\mathbf{y}^{(0)}$ was set to the zero vector, which we found facilitated faster training.

\subsection{Inner-loop optimization algorithm}

\lstinputlisting[language=Python, caption=Optimizer code in JAX., label={code:optimizer}, float=*]{code/optimizer.py}

We used gradient descent with Nesterov-accelerated momentum~\citep{nesterov1983method} as an algorithm for optimizing~\eqref{eq:equilibrium_aggregation}.
We provide the full code in Listing~\ref{code:optimizer}.
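
For readers who want the update rule at a glance, the following is a minimal, dependency-free Python sketch of gradient descent with Nesterov momentum in its common look-ahead form, applied to a scalar $y$ with the quadratic potential $F(x, y) = (x - y)^2$ and no regularizer. It is an illustration under stated assumptions and may differ from the listing: the fixed \texttt{learning\_rate} and \texttt{momentum} values are arbitrary choices (in our experiments these parameters are trained), and the function names are hypothetical.

\begin{lstlisting}[language=Python]
import math


def grad_energy(y, xs, eps=1e-8):
    """d/dy of the scaled energy, F(x, y) = (x - y)^2."""
    n = len(xs)
    scale = math.log2(n + 1) / (n + eps)
    return scale * sum(2.0 * (y - x) for x in xs)


def nesterov_minimize(grad_fn, y0=0.0, learning_rate=0.05,
                      momentum=0.9, num_steps=15):
    """Gradient descent with Nesterov momentum."""
    y, velocity = y0, 0.0  # zero initialization
    for _ in range(num_steps):
        # Gradient is taken at the look-ahead point.
        lookahead = y + momentum * velocity
        step = learning_rate * grad_fn(lookahead)
        velocity = momentum * velocity - step
        y = y + velocity
    return y


xs = [1.0, 2.0, 3.0, 6.0]
y_hat = nesterov_minimize(lambda y: grad_energy(y, xs),
                          num_steps=200)
print(y_hat)  # close to the mean of xs, 3.0
\end{lstlisting}

With this quadratic potential the minimizer is the mean of the set, so the sketch recovers mean pooling; a small number of steps (the default of 15 here) only yields an approximation of the minimum.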

\begin{figure}
    \centering
    \includegraphics[width=\linewidth]{figures/optimizer_stats.pdf}
    \caption{Evolution of various trainable parameters of the inner-loop optimizer.}
    \label{fig:optimizer_stats}
\end{figure}

Figure~\ref{fig:optimizer_stats} shows the evolution of the trainable learning rate and momentum parameters of the optimizer, as well as the regularization weight, on the MOLPCBA-GIN experiment.
All three parameters largely stabilize after the first $10^6$ training steps.

\subsection{Implicit differentiation}

In the course of this work, we briefly explored the possibility of employing implicit differentiation.
However, in this regime it is not trivial to train, e.g., the learning rate together with the model end-to-end, and we found it difficult to propose an optimization schedule that would work well in all phases of training.
Larger step sizes led to unstable training, while smaller step sizes required too many iterations to converge, making implicit differentiation less efficient computationally than the straightforward explicit differentiation, which we ended up using for all experiments.
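
For reference, the implicit-differentiation route rests on the following standard identity. The symbol $\theta$ for the trainable parameters of $F$ and $R$ is notation introduced for this sketch, and the identity is a textbook consequence of the implicit function theorem, not necessarily how the explored variant was implemented. If $\mathbf{y}^*$ is a stationary point of the energy, i.e. $\nabla_{\mathbf{y}} E(X, \mathbf{y}^*; \theta) = 0$, and the Hessian $\nabla^2_{\mathbf{y}\mathbf{y}} E$ is invertible there, then differentiating the stationarity condition with respect to $\theta$ gives
\begin{equation*}
    \frac{\partial \mathbf{y}^*}{\partial \theta} = - \left[ \nabla^2_{\mathbf{y}\mathbf{y}} E(X, \mathbf{y}^*; \theta) \right]^{-1} \nabla^2_{\mathbf{y}\theta} E(X, \mathbf{y}^*; \theta).
\end{equation*}
The identity holds only at a stationary point, which is why the inner loop has to converge accurately before it can be applied.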

\section{Further experimental details}

\subsection{Median Estimation}
\label{sec:median_details}

\paragraph{Data Creation} The data is generated indefinitely on the fly. For each sample, one of three probability distributions is first selected at random: \textit{uniform} (between $0$ and $1$), \textit{gamma} (scale $0.2$, shape $0.5$), or \textit{normal} (mean $0.5$, standard deviation $0.4$). Then, 100 values are randomly drawn from the selected distribution. The label is the median of these 100 values.
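
The sampling procedure can be sketched with the Python standard library as follows. This is an illustrative reconstruction rather than the original data pipeline: the function name \texttt{sample\_median\_task} is hypothetical, and choosing among the three distributions with equal probability is an assumption.

\begin{lstlisting}[language=Python]
import random
import statistics


def sample_median_task(rng, set_size=100):
    """Return one (values, median) training pair."""
    # Assumption: the three options are equally likely.
    name = rng.choice(["uniform", "gamma", "normal"])
    if name == "uniform":
        draw = lambda: rng.uniform(0.0, 1.0)
    elif name == "gamma":
        # gammavariate takes (shape, scale), in that order.
        draw = lambda: rng.gammavariate(0.5, 0.2)
    else:
        draw = lambda: rng.gauss(0.5, 0.4)
    values = [draw() for _ in range(set_size)]
    return values, statistics.median(values)


rng = random.Random(0)
values, label = sample_median_task(rng)
print(len(values), label)
\end{lstlisting}

Because the set size is even, the label is the average of the two middle values of the sorted set.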

\paragraph{Evaluation} For \textit{average performance} (bold lines in Fig.~\ref{fig:exp_toy}), we average across seeds, apply exponential smoothing, and report the performance after 10 million training steps. Equilibrium Aggregation is roughly one order of magnitude better. For \textit{best performing seed} (faded lines in Fig.~\ref{fig:exp_toy}), we report the best performing evaluation step (each evaluation step uses $80{,}000$ samples) across all seeds.

\subsection{MOLPCBA}

As mentioned in the main text, we performed a brief hyperparameter search for the weight of the auxiliary loss $L_{\text{aux}}$~\eqref{eq:aux_loss}.
Based on these results, we proceeded with a weight of $1$ for both architectures.
We did not optimize this hyperparameter for local aggregation and simply used the value of $10^{-4}$, as in the rest of the experiments.
Both local and global aggregation used 15 iterations of energy minimization.

\section{Ablation Studies}
We performed several ablation studies that we hope add helpful context.

\subsection{Number of Gradient Steps \& Performance}

The model takes gradient steps to find the minimum of the energy function in the aggregation operator. More gradient steps should help find a more accurate approximation of the minimum and could therefore be expected to increase overall model performance. The following is an ablation on MOLPCBA + GCN + EA showing how the number of gradient steps influences the performance:
\begin{center}
\begin{tabular}{l l}
    \toprule
    \textbf{\# Gradient Steps} & \textbf{Best Valid. Performance} \\
    \midrule
    $1$ & $0.235$ \\
    $2$ & $0.257$ \\
    $5$ & $0.263$ \\
    $10$ & $0.268$ \\
    \bottomrule
\end{tabular}
\end{center}

This shows increasing performance with an increasing number of steps, with the expected leveling-off at higher step counts.

\subsection{Compute Time \& Number of Gradient Steps}
The performance benefits of additional gradient steps observed above raise the question of how high their computational cost is.
In the following, we measure how much time it takes for different networks with the same number of embeddings and layers to complete 2 million training steps on MOLPCBA:
\begin{center}
\begin{tabular}{l l}
    \toprule
    \textbf{Method} & \textbf{Time} \\
    \midrule
    Sum/Deep Sets & 5h30min \\
    EA with 2 gradient steps & 7h50min \\
    EA with 5 gradient steps & 10h08min \\
    EA with 10 gradient steps & 15h44min \\
    \bottomrule
\end{tabular}
\end{center}
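
The timings in the table can be turned into relative slowdowns with a few lines of Python. The snippet below is only an arithmetic aid using the four wall-clock times reported above; the helper \texttt{to\_minutes} is a name introduced for this illustration.

\begin{lstlisting}[language=Python]
def to_minutes(hours, minutes):
    """Convert a wall-clock duration to minutes."""
    return 60 * hours + minutes


timings = {
    "Sum/Deep Sets": to_minutes(5, 30),
    "EA with 2 gradient steps": to_minutes(7, 50),
    "EA with 5 gradient steps": to_minutes(10, 8),
    "EA with 10 gradient steps": to_minutes(15, 44),
}

baseline = timings["Sum/Deep Sets"]
slowdown = {k: t / baseline for k, t in timings.items()}
for name, ratio in slowdown.items():
    print(f"{name}: {ratio:.2f}x")
\end{lstlisting}

This gives roughly $1.4\times$, $1.8\times$ and $2.9\times$ the training time of sum pooling for 2, 5 and 10 gradient steps, respectively.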

We see two research directions for increasing the speed of EA: (1) exploiting the implicit function theorem, and (2) using fewer gradient steps during training than at test time.


\subsection{Auxiliary Loss \& Performance}
Here, we examine the influence of the weighting of the auxiliary loss in~\eqref{eq:aux_loss} on the performance. We found this loss to be generally helpful for performance. It encourages the network to find a minimum, as tracked by the norm of the final gradient step in Figure~5. This is an ablation study on MOLPCBA + GIN + EA:
\begin{center}
\begin{tabular}{l l}
    \toprule
    \textbf{Auxiliary Loss Weight} & \textbf{Best Valid. Performance} \\
    \midrule
    $10^{-4}$ & $0.250$ \\
    $10^{-3}$ & $0.261$ \\
    $10^{-2}$ & $0.257$ \\
    $10^{-1}$ & $0.254$ \\
    $1$ & $0.263$ \\
    \bottomrule
\end{tabular}
\end{center}

This shows relatively stable behavior across different loss weightings: the lowest weight ($10^{-4}$) performs worst and the highest ($1$) performs best, although the trend in between is not monotonic.




\subsection{Capacity of EA \& Performance}
Furthermore, we provide an ablation on MOLPCBA + GCN + EA, where the first column specifies the number of embeddings in the energy function relative to the configuration in Section~\ref{sec:molpcba} (the number of weights scales roughly quadratically with this quantity). We made the rest of the graph network smaller to reduce the computational cost; hence the scores are lower overall.
\begin{center}
\begin{tabular}{l l}
    \toprule
    \textbf{Embeddings in Energy Function} & \textbf{Best Valid. Performance} \\
    \midrule
    $10\%$ & $0.208$ \\
    $30\%$ & $0.218$ \\
    $60\%$ & $0.228$ \\
    $100\%$ & $0.233$ \\
    $130\%$ & $0.222$ \\
    \bottomrule
\end{tabular}
\end{center}

This shows a drop in performance when going to 30\% and 10\% of the original network capacity. For larger capacities, the performance differences seem less significant.






\bibliography{bartunov_444}


\end{document}

