% \documentclass{uai2023} % for initial submission
\documentclass[accepted]{uai2023} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2023} % ptmx math instead of Computer
                                         % Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2023} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams


\hypersetup{
colorlinks   = true, %Colours links instead of ugly boxes
urlcolor     = blue, %Colour for external hyperlinks
linkcolor    = blue, %Colour of internal links
citecolor    = blue %Colour of citations
}

\usepackage{hyperref}
\usepackage{url}

%------------------------------------------------------------------------------
% Put added packages and preambles here
\usepackage{graphicx}
\usepackage{tabularx}
\usepackage{subfigure}
\usepackage{amsmath,amsthm,amsfonts,bm,amssymb}
\usepackage{multirow,colortbl}
\usepackage{enumitem}
\usepackage{wrapfig}
\usepackage{algorithm,algpseudocode}
\usepackage{caption}

\definecolor{Gray}{gray}{0.9}
\newcommand{\gc}{\cellcolor{Gray}}
%% \newcommand{\la}{METR-LA}
%% \newcommand{\bay}{PEMS-BAY}
%% \newcommand{\pmub}{PMU-B}
%% \newcommand{\pmuc}{PMU-C}
\newcommand{\la}{\textsc{metr-la}}
\newcommand{\bay}{\textsc{pems-bay}}
\newcommand{\pmub}{\textsc{pmu-b}}
\newcommand{\pmuc}{\textsc{pmu-c}}
\newcommand{\fone}{\textsc{f1}}
\newcommand{\prauc}{\textsc{prauc}}
%% \newcommand{\rocauc}{\textsc{rocauc}}
\newcommand{\rocauc}{\textsc{auc}}
\newcommand{\acc}{\textsc{acc}}
\newcommand{\perm}{\texttt{p}}
\newcommand{\ones}{\bm{1}}
\newcommand{\DD}{\mathcal{D}}
\newcommand{\gauss}{\mathcal{N}}
\newcommand{\unif}{\mathcal{U}}
\newcommand{\mean}{\mathbb{E}}
\newcommand{\real}{\mathbb{R}}
\newcommand{\todo}{{\color{red} TODO}}
%% \newcommand{\rd}[1]{\textcolor{red}{#1}}
%% \newcommand{\bl}[1]{\textcolor{blue}{#1}}
\newcommand{\rd}[1]{#1}
\newcommand{\bl}[1]{#1}
\newcommand{\no}[1]{#1}

\DeclareMathOperator{\softmax}{softmax}
\DeclareMathOperator{\embed}{embedding}
\DeclareMathOperator{\relu}{ReLU}
\DeclareMathOperator{\lstm}{LSTM}
\DeclareMathOperator{\gru}{GRU}
\DeclareMathOperator{\sigmoid}{sigmoid}
\DeclareMathOperator{\entropy}{entropy}
\DeclareMathOperator{\gumbel}{Gumbel}
\DeclareMathOperator{\cat}{Cat}
\DeclareMathOperator{\ber}{Ber}
%\DeclareMathOperator{\unif}{Uniform}
\DeclareMathOperator{\erf}{erf}
\DeclareMathOperator{\bias}{Bias}
\DeclareMathOperator{\diag}{diag}
% customized commands
\DeclareMathOperator{\argmax}{\arg\max}
\DeclareMathOperator{\argmin}{\arg\min}
\DeclareMathOperator*{\minimize}{\text{minimize}}
\DeclareMathOperator*{\maximize}{\text{maximize}}
\DeclareMathOperator*{\st}{\text{subject to}}

\theoremstyle{definition} \newtheorem{definition}{Definition}
\theoremstyle{remark}     \newtheorem{remark}{Remark}
\theoremstyle{remark}     \newtheorem{example}{Example}
\theoremstyle{definition}      \newtheorem{theorem}{Theorem}%[section]
\theoremstyle{plain}      \newtheorem{conjecture}[theorem]{Conjecture}
\theoremstyle{plain}      \newtheorem{proposition}[theorem]{Proposition}
\theoremstyle{plain}      \newtheorem{corollary}[theorem]{Corollary}
\theoremstyle{plain}      \newtheorem{lemma}[theorem]{Lemma}

\graphicspath{{figs/}}


%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

\title{Federated Learning of Models Pre-Trained on Different Features with Consensus Graphs}

% The standard author block has changed for UAI 2023 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{Tengfei Ma}
\author[2]{Trong Nghia Hoang}
\author[3,1]{\href{mailto:<chenjie@us.ibm.com>?Subject=Your UAI 2023 paper}{Jie Chen}{}}
% Add affiliations after the authors
\affil[1]{%
    IBM Research
}
\affil[2]{%
    Washington State University
}
\affil[3]{%
    MIT-IBM Watson AI Lab
  }
  
\begin{document}
\maketitle

%------------------------------------------------------------------------------
\begin{abstract}
Learning an effective global model on private and decentralized datasets has become an increasingly important challenge of machine learning when applied in practice. Existing distributed learning paradigms, such as Federated Learning, enable this via model aggregation, which enforces a strong form of modeling homogeneity and synchronicity across clients. This is, however, not suitable for many practical scenarios. For example, in distributed sensing, heterogeneous sensors reading data from different views of the same phenomenon would need to use different models for different data modalities. Local learning therefore happens in isolation but inference requires merging the local models to achieve consensus. To enable consensus among local models, we propose a feature fusion approach that extracts local representations from local models and incorporates them into a global representation that improves the prediction performance. Achieving this requires addressing two non-trivial problems. First, we need to learn an alignment between similar feature components which are arbitrarily arranged across clients to enable representation aggregation. Second, we need to learn a consensus graph that captures the high-order interactions between local feature spaces and how to combine them to achieve a better prediction. This paper presents solutions to these problems and demonstrates them in real-world applications on time series data such as power grids and traffic networks.
\end{abstract}

%------------------------------------------------------------------------------
\section{Introduction}
\label{sec:intro}
To improve the scalability and practicality of machine learning applications in situations where training data are becoming increasingly decentralized and proprietary, Federated Learning (FL)~\citep{McMahan2017,Yang2019,Li2019,Kairouz2019} has been proposed as a new model training paradigm that allows data owners to collaboratively train a common model without having to share their private data with others. The FL formalism is therefore poised to resolve the computation bottleneck of model training on a single machine and the risk of privacy violation, in light of recent policies such as the General Data Protection Regulation~\citep{Albrecht2016}.

However, FL requires a strong form of homogeneity and synchronicity among the data owners (clients) that might not be ideal in practice. First, it requires all clients to agree in advance to a common model architecture and parameterization. Second, it requires clients to synchronously communicate their model updates to a common server, which assembles the local updates into a global learning feedback. This is rather restrictive in cases where different clients draw observations from different data modalities of the phenomenon being modeled. It leads to heterogeneous data complexities across clients, which in turn requires customized forms of modeling. Otherwise, enforcing a common model with high complexity might not be affordable to clients with low compute capacity; and vice versa, switching to a model with low complexity might result in the failure to unlock important inferential insights from data modalities.

A variant of FL, named vertical FL (VFL)~\citep{Hardy2017,Hu2019,Chen2020}, has been proposed to address the first challenge; it embraces the concept of vertically partitioned data. This concept is figuratively named after cutting the data matrix vertically along the feature axis, rather than the data axis. Existing approaches maintain separate local model parameters distributed across clients and global parameters on a central server. All parameters are then learned together, which causes a practical drawback: 

{\bf Coordination among clients and the central server incurs overhead: it requires engineering protocols that enable multiple rounds of communication (i.e., synchronicity) and coordination effort to converge on universal choices of models and training algorithms (i.e., homogeneity), which can be practically expensive depending on the scale of the application.} 

To mitigate both constraints on homogeneity and synchronicity\footnote{Note that in our case, synchronicity requires co-training among clients which is a weaker constraint than its usual meaning of further requiring clients to synchronize their updates per iteration.} satisfactorily, we ask the following question and subsequently develop an answer to it:

{\bf Can we separate global consensus prediction from local model training?}

As shown later in our experiments, we will address this question in a real-world context of the national electricity grid, over which thousands of phasor measurement units (PMUs) were deployed to monitor the grid condition and data were recorded in real-time by each PMU~\citep{smartgrid.gov}. PMU measurements, as time series data, are owned by several parties, each of which may employ different technologies leading to heterogeneous recordings under varying sampling frequencies and measured attributes. These data can be used to train machine learning models that identify grid events (e.g., fault, oscillation, and generator trip). Such an event detection system relies on collective series measurements at the same time window but distributed across owners. Using VFL to build a common model on such decentralized and heterogeneous data is plausible but not practical, because of a lack of autonomy to facilitate coordination among owners.

{\bf Main Contribution.} To resolve the challenge, we introduce a feature fusion perspective to this setting, which aims to minimize coordination among clients and maximize their autonomy via a local--global model framework. Therein, each client trains a customized local model with its data modalities. The training is independent and incurs no coordination. Once trained, local feature representations of each client can then be extracted from the penultimate layer of the corresponding local models. Then, a central server collects and aggregates these representations into a more holistic global representation, used to train a model for global inference. There are two technical challenges that need to be addressed to substantiate the envisioned framework. 

{\bf C1.} {\bf There is an ambiguity regarding the correspondence between components of local feature representations across different clients.} This ambiguity arises because local models were trained separately in isolation and there is no mechanism to enforce that their induced feature dimensions would be aligned. As a matter of fact, it is possible to permute the induced feature dimensions without changing the prediction outcome. Thus, if two models are trained separately, they might end up looking at the same feature space but with permuted dimensions.

{\bf C2.} {\bf There are innate local interactions among subsets of clients that need to be accounted for.} Naively concatenating or averaging the local feature representations accounts for the global interaction but ignores such local interactions, which are important to boost the accuracy of global prediction as shown later in our experiments.

To address {\bf C1}, note that the feature dimension alignment problem is discrete in nature; furthermore, there is no direct feedback to optimize for such alignment. To sidestep this challenge, we develop a neuralized alignment layer whose parameters are differentiable and can therefore be part of a larger network, including the feature aggregation and prediction layers, which can be trained end-to-end via gradient back-propagation (Section~\ref{sec:ambiguity}). To address {\bf C2}, we employ graph neural networks as the global inference model, where the graph corresponds to the explicit or implicit relational structure of the data owners. As such a graph might not be given in advance, we treat the combinatorial graph structure as a random variable of a product of Bernoulli distributions whose (differentiable) parameters can also be optimized via gradient-based approach (Section~\ref{sec:bernoulli}). The technical contributions of this work are summarized below.

{\bf 1.} We formalize a feature fusion perspective for distributed learning, in settings where data is vertically partitioned. This is an alternative view to VFL but as elaborated above, is more applicable when iterative training synchronicity is not possible among clients (Section~\ref{sec:fed.inf}).

{\bf 2.} We formulate a federated feature fusion (F$^3$) framework that consists of a network of pre-trained local models and a central model that collects and fuses the local feature representations (induced from these pre-trained models) to generate a global model with better predictive performance (Section~\ref{sec:framework}). This is achieved via addressing {\bf C1} (Section~\ref{sec:ambiguity}) and {\bf C2} (Section~\ref{sec:bernoulli}) above.

{\bf 3.} We demonstrate experiments with four real-life data sets, including power grids and traffic networks, and show the effectiveness of the proposed framework (Section~\ref{sec:exp}).

%------------------------------------------------------------------------------
\section{Problem Formulation}\label{sec:fed.inf}
Federated Feature Fusion (F$^3$) is a new, more practical setup for VFL \citep{Hu2019,Chen2020}; it aims to enable collaboration between data owners that possess private access to different sets of features describing the same set of training data points. However, unlike VFL, which requires clients to synchronize their training processes \citep{yang2019parallel,li2021label,fu2021vf2boost,cheng2021secureboost,Hu2019,diao2021heterofl} in multiple iterations of communication, F$^3$ allows data owners to train their own local models in isolation and only requires one round of communication in which local feature representations induced from the heterogeneously pre-trained local models are shared with a trusted server for feature fusion. 

{\bf Relation to FL with Heterogeneous Clients.} We note that similar ideas on extending federated learning to accommodate clients with heterogeneous models \citep{tan2022federated,tan2022fedproto,lin2020ensemble,chen2022personalized} have been proposed. However, these methods are still restricted to horizontal settings. Local models still need to operate on the same feature space, and some of these methods also require client models to be trained together via multiple rounds of communication. As such, their focus is on addressing different forms of heterogeneity: (1) heterogeneous data distributions; (2) heterogeneous model architectures; and (3) heterogeneous pre-training. These are all important but differ from feature heterogeneity, which is the new form of heterogeneity we seek to address.

To further emphasize the novelty of our setting and the significance of our solution, we review and discuss the formulations of VFL and F$^3$ below, arguing with concrete, real-life examples why the F$^3$ setting is more practical and how this practicality entails significant technical challenges that necessitate new solutions in Sections~\ref{sec:ambiguity} and~\ref{sec:bernoulli}.

{\bf Federated Learning with Vertically Partitioned Data.} From a data perspective, the decentralized nature of data in VFL is a transposition to that of the traditional horizontal federated learning (HFL) \citep{McMahan2017}. Instead of owning the same set of features for different sets of data points as in HFL, the data owners in VFL now own different sets of features for the same set of data points; and they share a common label set of these data points. 

From the existing literature, two lines of work are noted. One takes the data matrix literally -- by assuming tabular data and studying linear models -- where model parameters have natural correspondence to the data parts~\citep{Hardy2017,Nock2018,Heinze2014,Heinze2016}. Often, these approaches are hard to generalize to complex data with many owners. Another line of work advocates the use of models with modular structure in which separate parts of the model are responsible for locally aggregating different sets of local features owned by different owners; and a global parameterization is used to combine these local features. This is similar in spirit to F$^3$ but requires clients to synchronize the training processes of their assigned model parts, which incurs expensive communication and creates dependence among the clients~\citep{Hu2019,Chen2020}.%
\footnote{Note that the approach proposed by~\citet{Hu2019} assumes no parameters for the global model. Were global parameters present, gradient communication would be inevitable.} 

Mathematically, for each datum $\mathbf{x}_k$ with label $y_k$, let $\mathbf{x}^i_k$ be the feature set of the datum that the $i$-th owner possesses. That is, $\mathbf{x}_k = (\mathbf{x}^1_k,\mathbf{x}^2_k,\ldots,\mathbf{x}^n_k)$ with $n$ data owners. VFL aims to find aggregation parameter $\mathbf{w}$ and local representation parameters $\{\boldsymbol{\theta}_i\}_{i=1}^n$ that minimize 
\begin{eqnarray}
\hspace{-6.5mm}\mathbf{L}\left(\mathbf{w}, \boldsymbol{\theta}\right) \hspace{-2mm}&\triangleq&\hspace{-2mm} \frac{1}{m}\sum_{k=1}^m \ell\Bigg[g\Bigg(\Big\{\phi_i\left(\mathbf{x}^i_k; \boldsymbol{\theta}_i\right)\Big\}_{i=1}^n; \mathbf{w}\Bigg), y_k\Bigg] \label{eq:VFL}
\end{eqnarray}
where each $\phi_i(\mathbf{x}^i_k; \boldsymbol{\theta}_i)$ is a (learnable) local embedding of $\mathbf{x}_k^i$ parameterized by a separate parameter vector $\boldsymbol{\theta}_i$ owned by the $i$-th owner, $g(\phi_1, \phi_2, \ldots, \phi_n; \mathbf{w})$ is an aggregation function parameterized by $\mathbf{w}$, and $\ell$ is a prediction loss, e.g., the cross-entropy loss for classification or the $\ell_2$ loss for regression. The loss in Eq.~\eqref{eq:VFL} is averaged over all training data points $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_m$.

{\bf Federated Feature Fusion.} The setting of F$^3$ is similar to VFL, except that the data owners share neither data nor models with each other to ensure a higher degree of privacy compliance, which is often the more practical setting in industry -- see the example on power grid at the end of this section. For this reason, the VFL minimization task in Eq.~\eqref{eq:VFL} above is changed to finding a single set of aggregation parameters $\mathbf{w}$ that minimizes
\begin{eqnarray} 
\hspace{-11.5mm}\mathbf{L}\left(\mathbf{w}\right) \hspace{-1mm}&\triangleq&\hspace{-1mm} \frac{1}{m}\sum_{k=1}^m \ell\Bigg[g\Big(\mathbf{h}_k^1, \ \mathbf{h}_k^2, \ldots, \ \mathbf{h}_k^n; \mathbf{w}\Big), y_k\Bigg] \label{eq:FFM}
\end{eqnarray}
where $\mathbf{h}_k^i = \phi_i^\ast(\mathbf{x}_k^i)$ with $\phi_i^\ast = \argmin_{\phi_i} \ell_i\left(\phi_i(\mathbf{x}_k^i), y_k\right)$ which characterizes the locally optimal feature representation obtained in isolation by the $i$-th owner. As such, Eq.~\eqref{eq:FFM} only requires one round of communication where $\{\mathbf{h}_k^i\}_{k,i}$ are communicated to a trusted server. Prior to that, each data owner can freely learn their own feature representation model $\phi_i(\mathbf{x}_k^i)$ with different parameterization and architecture, catering towards their own compute capacities and data representation. This avoids forcing the data owners to participate in a joint training scheme which often requires expensive coordination and is not practical. 

However, in exchange for this practicality, two key challenges arise. First, as local models are separately trained, the correspondence between components of induced feature representations across local models becomes ambiguous since there is no mechanism to enforce their alignment. Second, for the same reason, there are potential innate local interactions among subsets of clients, and a naive concatenation or averaging of their corresponding feature representations will likely ignore such interactions, resulting in decreased performance. These correspond to the high-level challenges {\bf C1} and {\bf C2} in Section~\ref{sec:intro}, which will be addressed in Sections~\ref{sec:ambiguity} and~\ref{sec:bernoulli} as our key technical contributions.

{\bf Separation of Local and Global Model Training.} We remark that the separation of the local and global model training is driven by practical constraints in the real world. For example, there are cases in which data owners prefer to only release pre-trained models for the collaboration, which are not updatable. This is often preferable in production systems that compartmentalize into separately pre-trained workflows maintained by different product groups~\citep{Su2018}, which needs a decoupled architecture such that updates to such workflow models can be implemented fast to scale the business. A similar design appears in many previous works, including~\citet{Wang2022,Lam2021,Yurochkin2019,NghiaICML19,Yurochkin2019a,NghiaAAAI19,NghiaICML20}.

\begin{figure}[h]
  \centering
  \includegraphics[width=\linewidth]{figs/illustration.png}
  \caption{Federated Feature Fusion: A global prediction is produced collectively based on a set of global features which are the result of fusing local feature representations supplied by the data owners. These feature representations are induced from locally trained models on raw local data which might be heterogeneous.}
  \label{fig:illustration}
\end{figure}

\textbf{Data Example.} Let us consider the power grid monitoring task as an example. Figure~\ref{fig:illustration} visualizes PMU measurements distributed across data owners. A panel of time series corresponds to a specific time window and the series collectively represent one data point, which the event detection system classifies. In this simplified illustration, each data owner possesses one series recorded by one PMU; but in practice they may own different numbers of PMUs (and thus series). Moreover, the series may differ in length because of varying sampling frequencies; and the series are multivariate with possibly different numbers of variates. All these variations contribute to feature heterogeneity, which necessitates the construction of separate local models. Note that if an event does not cascade over the entire grid, some local models may report an event whereas others report normal conditions, resulting in conflicting opinions. A consensus global model is responsible for resolving the conflict. Additionally, missing data may occur.

%------------------------------------------------------------------------------
\section{Federated Feature Fusion}\label{sec:framework}
As detailed above, the proposed framework for F$^3$ consists of local models $\phi_i$ and a global feature fusion model $g$, such that their composition minimizes the loss in Eq.~\eqref{eq:FFM}. Each data owner $i$ possesses a local model trained with its data, independently of other owners. This way, no data sharing is invoked and privacy is of minimal concern. However, because the local models lack a global vision and may be conflicting, a central (global) model is key to coordinating the local opinions for final prediction. To maintain autonomy, local models are frozen once pre-trained and will not join the training of the global model. Data owners send local data representations to a centralized server for global model training (and inference). In other words, the global model queries neither the raw data nor the local models from data owners. As long as owners agree to send the less decipherable representations to the central server, global inference can be made.

\textbf{Local Models.} We treat a neural network except the final output layer as a feature extractor, which produces the representation $\mathbf{h}^i_k$ of an input fragment $\mathbf{x}^i_k$; and treat for simplicity the output layer as a logistic regression. That is, a local model $g_i(\mathbf{x}_k^i)$ reads:
\begin{eqnarray}\label{eqn:local}
\hspace{-26mm}g_i\left(\mathbf{x}^i_k\right) &\triangleq& \softmax\left(\mathbf{W}_i\cdot\mathbf{h}^i_k + \mathbf{b}_i\right)
\end{eqnarray}
where $\mathbf{h}^i_k = \phi_i\left(\mathbf{x}^i_k\right)$.
Hereafter, we will interchangeably use \emph{representation}, \emph{embedding}, and \emph{latent vector} to mean $\mathbf{h}^i_k$. These $\mathbf{h}^i_k$'s are assumed to have the same shape across $i$, although $\mathbf{x}^i_k$ can have different shapes and the embedding function can have different architectures to cope with feature heterogeneity.
%In fact, $h^i$ may even be trained in an unsupervised manner (e.g., an autoencoder without labeling information), although we use a supervised local model for ease of exposition.
A simple example of the embedding function is a fully connected layer $\mathbf{h}^i_k = \relu(\mathbf{U}_i\cdot\mathbf{x}^i_k+\mathbf{c}_i)$; but an arbitrarily complex function is also applicable.

\textbf{Global Model.} The global model $g$ melds together all local representations to generate a prediction:
\begin{eqnarray}\label{eqn:global}
\hspace{-27mm}y_k \ \simeq\ \widehat{y}_k &\triangleq& g\left(\mathbf{h}^1_k,\mathbf{h}^2_k,\ldots,\mathbf{h}^n_k; \mathbf{w}\right)
\end{eqnarray}
which is parameterized by $\mathbf{w}$. For example, the parameterization $\mathbf{w} = \{\mathbf{W}_0, \mathbf{W}_1, \mathbf{b}_0, \mathbf{b}_1\}$ characterizes two fully connected (FC) layers interleaved with mean pooling:
\begin{eqnarray}\label{eqn:global.ex}
\hspace{-6mm}g\Big(\Big\{\mathbf{h}^i_k\Big\}_{i=1}^n; \mathbf{w}\Big) \hspace{-2mm}&=&\hspace{-2mm}\softmax\left(\mathbf{W}_1\cdot\boldsymbol{\alpha}\ +\ \mathbf{b}_1\right) \mathrm{where}\nonumber\\
\hspace{-6mm}\boldsymbol{\alpha} \hspace{-2mm}&=&\hspace{-2mm} \frac{1}{n}\sum_{i=1}^n\relu\Big(\mathbf{W}_0\cdot\mathbf{h}^i_k\ +\ \mathbf{b}_0\Big)
\end{eqnarray}
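As a dimension check, under notation introduced here purely for illustration (the symbols $d$, $d'$, and $C$ are assumptions and are not used elsewhere), suppose each $\mathbf{h}^i_k \in \real^{d}$ and there are $C$ classes. One consistent choice of shapes is then
\[
\mathbf{W}_0 \in \real^{d'\times d},\quad \mathbf{b}_0,\boldsymbol{\alpha} \in \real^{d'},\quad \mathbf{W}_1 \in \real^{C\times d'},\quad \mathbf{b}_1 \in \real^{C}.
\]
Because $\mathbf{W}_0$ and $\mathbf{b}_0$ are shared by all owners, the number of parameters in this example does not depend on $n$, and the mean pooling makes the output invariant to the order in which the owners are listed.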
Given a particular parameterization $\mathbf{w}$, we can substantiate Eq.~\eqref{eqn:global} above and plug it into Eq.~\eqref{eq:FFM}. The optimal value for $\mathbf{w}$ can then be achieved by solving the corresponding minimization task therein. However, designing the form of $\mathbf{w}$ is highly non-trivial and is in fact tied to the previously mentioned challenges {\bf C1} and {\bf C2}, as elaborated below.

\begin{figure*}[t]
  \centering
  \includegraphics[width=\linewidth]{architecture-crop.pdf}
  \caption{Federated Feature Fusion Framework. Local models are trained independently and separately from the global model. The algorithm is summarized in Algorithm~\ref{alg:framework}.}
  \label{fig:architecture}
\end{figure*}

\textbf{Challenges.} Two considerations are pertinent to the design of $\mathbf{w}$. First, when the latent dimensions have semantic meaning -- e.g., when the local models are trained to yield disentangled representations~\citep{Higgins2018} -- the latent features of different local representations may not match, because an arbitrary permutation of the latent dimensions does not change a local model. Second, a naive mean pooling as in~\eqref{eqn:global.ex} misses the inter-dependencies between local data, leading to a less well performing global model. Such inter-dependencies occur in the power grid example because of the physics of an electricity network. Hence, we use latent alignment to address the first problem and a graph neural network to address the second one. Incorporating these two components, we show the full proposed framework in Figure~\ref{fig:architecture} and Algorithm~\ref{alg:framework}. The solutions to these challenges will be discussed in Sections~\ref{sec:ambiguity} and~\ref{sec:bernoulli}.

\begin{algorithm}[h]
\caption{Federated Feature Fusion (F$^3$)}
\label{alg:framework}
\begin{algorithmic}[1]
\Function{Training}{$\{(\mathbf{x}^i_k,y_k)_{k=1}^m\}_{i=1}^n$}
\State Local clients learn $\{g_i\}_{i=1}^n$ with $\{(\mathbf{x}^i_k,y_k)_{k=1}^m\}_{i=1}^n$
\State Local clients share $\{\mathbf{h}_k^i \triangleq \phi_i(\mathbf{x}_k^i)\}_{i=1}^n$ via Eq.~\eqref{eqn:local}
\State Minimize Eq.~\eqref{eq:FFM} composed with Eq.~\eqref{eqn:global.ex.gcn}
\State The minimized loss is averaged over samples of $\widehat{\mathbf{A}}$
\State Entries of $\widehat{\mathbf{A}}$ are sampled using~\eqref{eqn:z}
\EndFunction
\Statex
\Function{Inference}{$\{\mathbf{x}_k\}_{k=1}^m$ where $\mathbf{x}_k = (\mathbf{x}_k^i)_{i=1}^n$}
\State Local clients evaluate and send $\{\mathbf{h}_k^i\}_{i,k}$ to server
\State Prediction is produced via Eq.~\eqref{eqn:global.ex.gcn}
\EndFunction
\end{algorithmic}
\end{algorithm}

%------------------------------------------------------------------------------
\section{Aligning Representations}\label{sec:ambiguity}
For the global model to be meaningful, the feature dimensions of the local representations $\mathbf{h}^i_k$ should be aligned under the same feature space. For example, in~\eqref{eqn:global.ex}, all $\mathbf{h}^i_k$'s multiply the same weight matrix $\mathbf{W}_0$: each element of $\mathbf{h}^i_k$ thus corresponds to one input neuron of the initial fully connected layer. Permuting the elements destroys this correspondence: arbitrary arrangements of the feature dimensions of the latent vectors make it ambiguous what an optimal global model should be.

To elaborate, let us use a vector $\perm$ to denote the index (column) permutation of a vector (matrix). Then, the $i$th local model~\eqref{eqn:local} can be equivalently written as
\begin{eqnarray}\label{eqn:local.perm}
\hspace{-7mm}g_i\left(\mathbf{x}^i_k\right) \hspace{-2mm}&\triangleq&\hspace{-2mm} \softmax\Big(\mathbf{W}_i\left[:,\perm_i\right]\cdot \mathbf{h}^i_k\left[\perm_i\right] \ +\  \mathbf{b}_i\Big)
\end{eqnarray}
where $\mathbf{h}^i_k \triangleq \phi_i\left(\mathbf{x}_k^i\right)$. This is true
for any permutation $\perm_i$ as long as the embedding function is able to produce a permuted $\mathbf{h}^i_k[\perm_i]$ under the same input $\mathbf{x}^i_k$. Such a requirement can be easily satisfied if the embedding function is a fully connected layer such as $\mathbf{h}[\perm] = \relu(\mathbf{W}[\perm,:]\cdot \mathbf{x} + \mathbf{b}[\perm])$. In fact, it is satisfied by most neural networks as well. The supplement gives another example: GRU~\citep{Cho2014}.
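
To see why the linear term in~\eqref{eqn:local.perm} is unaffected by the permutation, the following short derivation may help; it is a sketch only. The symbol $d$ for the latent dimension and the row index $c$ are introduced here for illustration, and the owner and datum indices are dropped for brevity. For a permutation $\perm$ of $\{1,\ldots,d\}$,
\begin{align*}
\big(\mathbf{W}[:,\perm]\cdot\mathbf{h}[\perm]\big)_c
&= \sum_{j=1}^{d} \mathbf{W}[c,\perm(j)]\;\mathbf{h}[\perm(j)] \\
&= \sum_{j'=1}^{d} \mathbf{W}[c,j']\;\mathbf{h}[j']
= \big(\mathbf{W}\cdot\mathbf{h}\big)_c ,
\end{align*}
where the second equality re-indexes the sum by $j' = \perm(j)$, which is valid because $\perm$ visits every index exactly once.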

Hence, we propose to align the feature dimensions across all local vectors $\mathbf{h}^i_k$ to resolve the ambiguity. This proposal amounts to adapting the global model~\eqref{eqn:global} to:
\begin{eqnarray}\label{eqn:global2}
\hspace{-9mm}y_k \ \simeq\ \widehat{y}_k &\triangleq& g\left(\mathbf{P}_1 \cdot \mathbf{h}^1_k,\mathbf{P}_2\cdot \mathbf{h}^2_k,\ldots, \mathbf{P}_n\cdot \mathbf{h}^n_k\right),
\end{eqnarray}
where $\mathbf{P}_i$ is an alignment matrix for data owner $i$, which expresses the (manual) index or column permutation above as a matrix product. We can then treat each $\mathbf{P}_i$ as a free parameter matrix to optimize. It may be square or rectangular, the latter case indicating a change in the number of features. In the supplement, we also show an alternative hard alignment that parameterizes $\mathbf{P}_i$ as a permutation matrix.
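
For illustration (a sketch with an assumed three-dimensional representation $\mathbf{h}=(h_1,h_2,h_3)^\top$), the index permutation $\perm=(3,1,2)$, for which $\mathbf{h}[\perm]=(h_3,h_1,h_2)^\top$, corresponds to the permutation matrix
\[
\mathbf{P}=\begin{bmatrix}0&0&1\\1&0&0\\0&1&0\end{bmatrix},
\qquad
\mathbf{P}\cdot\mathbf{h}=\begin{bmatrix}h_3\\h_1\\h_2\end{bmatrix}=\mathbf{h}[\perm].
\]
Allowing the entries of $\mathbf{P}$ to take arbitrary real values gives the free parameter matrix; a rectangular $\mathbf{P}$ of size $d'\times d$ with $d'\neq d$ (the symbols $d$ and $d'$ are introduced here only for illustration) maps a $d$-dimensional representation to $d'$ features.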

%------------------------------------------------------------------------------
\section{Learning a Consensus Graph}\label{sec:bernoulli}
The example global model~\eqref{eqn:global.ex} performs a naive averaging of the local representations. Since data owners are often interconnected, a more expressive model is needed to exploit their relational interactions to improve inference~\citep{Battaglia2018}. Here, we use a graph neural network (GNN)~\citep{Zhang2020,Wu2021} to model and learn these high-order relational interactions.

{\bf A. Modeling Consensus via GCN with Latent Graph.} All variants of GNN are applicable to our setting but we focus on the most basic GCN~\citep{Kipf2017} for presentation clarity. Let $\mathbf{A}$ be the graph's adjacency matrix and let $\mathbf{H}_k$ be the matrix of aligned local representations:
\[
\mathbf{H}_k\ \ \triangleq\ \ \begin{bmatrix}
- (\mathbf{P}_1\mathbf{h}^1_k)^\top - \\
\vdots \\
- (\mathbf{P}_n\mathbf{h}^n_k)^\top - \\
\end{bmatrix}.
\]
The global prediction $y_k \simeq \widehat{y}_k = g(\mathbf{h}^1_k, \ldots, \mathbf{h}^n_k; \mathbf{w})$ in Eq.~\eqref{eqn:global} is then instantiated as
\begin{eqnarray}\label{eqn:global.ex.gcn}
\hspace{-2mm}g\Big(\Big\{\mathbf{h}^i_k\Big\}_{i=1}^n; \mathbf{w}\Big) \hspace{-2mm}&\triangleq&\hspace{-2mm}  \softmax\left(\frac{1}{n}\ones^\top \mathbf{A}^\dagger\cdot \mathbf{W}_1\right) \quad \mathrm{with}\nonumber\\
\hspace{-2mm}\mathbf{A}^\dagger \hspace{-2mm}&=&\hspace{-2mm}\widehat{\mathbf{A}}\cdot\relu\left(\widehat{\mathbf{A}}\mathbf{H}_k\mathbf{W}_0\right)
\end{eqnarray}
where $\widehat{\mathbf{A}}$ is a normalization of $\mathbf{A}$~\citep{Kipf2017} and $\mathbf{w} = (\mathbf{W}_0, \mathbf{W}_1, \{\mathbf{P}_i\}_{i=1}^n)$ are learnable parameters. 

Here, we adapt the traditional GCN prediction by including $\frac{1}{n}\ones^\top$ as a pooling step before the output. Apart from this modification, Eq.~\eqref{eqn:global.ex.gcn} is the traditional form used in the literature with bias terms omitted. Note that the GCN model~\eqref{eqn:global.ex.gcn} and the graph-agnostic model~\eqref{eqn:global.ex} are equivalent when $\widehat{\mathbf{A}}$ is replaced by $\mathbf{I}$ and the bias terms are omitted.
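
To spell out this reduction (a sketch that uses only the definitions above, with bias terms omitted), set $\widehat{\mathbf{A}}=\mathbf{I}$ in~\eqref{eqn:global.ex.gcn}. Then $\mathbf{A}^\dagger=\relu(\mathbf{H}_k\mathbf{W}_0)$, whose $i$th row is $\relu\big((\mathbf{P}_i\mathbf{h}^i_k)^\top\mathbf{W}_0\big)$, and hence
\[
\frac{1}{n}\ones^\top \mathbf{A}^\dagger\cdot\mathbf{W}_1
\ =\ \bigg(\frac{1}{n}\sum_{i=1}^n \relu\Big(\big(\mathbf{P}_i\mathbf{h}^i_k\big)^\top\mathbf{W}_0\Big)\bigg)\cdot\mathbf{W}_1 .
\]
That is, every owner's representation passes through the same layer $\mathbf{W}_0$, the resulting hidden vectors are averaged over the $n$ owners, and the average is mapped to class scores by $\mathbf{W}_1$, with no information exchanged between owners.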

In GCN, $\mathbf{A}$ corresponds to the consensus graph with the local owners as graph nodes. If such a graph is not available, it is possible to learn one such that~\eqref{eqn:global.ex.gcn} still outperforms~\eqref{eqn:global.ex}. In this case, we treat $\mathbf{A}$ as a random variable following the matrix Bernoulli distribution, where the success probabilities are free parameters to learn. Formally, the elements $\mathbf{A}_{ij}$ are independent and each follows $\ber(\theta_{ij})$, where $\theta_{ij}$ denotes the corresponding probability~\citep{Kipf2018,Shang2021}. Then, the global model $g$ has $\mathbf{W}_0$, $\mathbf{W}_1$, the $\mathbf{P}_i$'s, as well as $\theta$, as parameters. Following~\citet{Franceschi2019,Shang2021}, we formulate the training loss as an expectation over $\mathbf{A}$'s distribution and draw a sample $\mathbf{A}$ to obtain an unbiased estimate of the loss and its gradient. %in each stochastic optimization step. 
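
In symbols, writing $\ell$ for a per-sample loss such as cross-entropy, $N$ for the number of training samples, and $\widehat{y}_k(\mathbf{A})$ for the output of~\eqref{eqn:global.ex.gcn} computed with adjacency matrix $\mathbf{A}$ (these notations are introduced here for illustration only), the objective can be sketched as
\[
\min_{\mathbf{w},\,\theta}\ \ \mathrm{E}_{\mathbf{A}_{ij}\sim\ber(\theta_{ij})}\bigg[\frac{1}{N}\sum_{k=1}^{N}\ell\big(\widehat{y}_k(\mathbf{A}),\,y_k\big)\bigg]
\ \approx\ \frac{1}{N}\sum_{k=1}^{N}\ell\big(\widehat{y}_k(\mathbf{A}'),\,y_k\big),
\]
where $\mathbf{A}'$ is a single draw with independent entries $\mathbf{A}'_{ij}\sim\ber(\theta_{ij})$, so that the right-hand side is a one-sample Monte Carlo estimate of the expectation.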

{\bf Security in Transmitting Data Representations.} Transmitting data representations might risk exposing raw data, and mitigating such risks over multiple rounds of communication is often non-trivial. In our case, however, there is a single communication round, so sanitizing the data representations does not pose a new challenge. It can be addressed using a variety of existing, well-established techniques such as Shamir's Secret Sharing (SSS)~\citep{shamir1979}. Each local representation (a secret) can be split into multiple shares distributed to multiple central entities, which process their shares and combine the results to reproduce the desired output. As the central entities process the shares independently and only communicate the results to a coordinator, the SSS protocol can guarantee that no single central entity has enough information to access any local representation.
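
For reference, the standard form of SSS can be sketched as follows, for a scalar secret over a prime field (how real-valued representations would be quantized into field elements is an assumption not addressed here, and the symbols $p$, $t$, $m$, $a_j$ and $f$ are introduced only for this sketch). To share a secret $s$ among $m$ entities with threshold $t\le m$, draw $a_1,\ldots,a_{t-1}$ uniformly at random modulo a prime $p>\max(s,m)$ and set
\[
f(x)\ \triangleq\ s + a_1x + a_2x^2 + \cdots + a_{t-1}x^{t-1} \pmod{p},
\]
giving the share $(j, f(j))$ to entity $j=1,\ldots,m$. Any $t$ shares determine $f$ by Lagrange interpolation and thus recover $s=f(0)$, whereas any $t-1$ or fewer shares reveal nothing about $s$.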

{\bf B. Differentiable Graph Sampling.} The central challenge of learning the graph by sampling is that the sample $\mathbf{A}_{ij}$ is not differentiable with respect to the corresponding Bernoulli parameter $\theta_{ij}$, which in turn makes the training loss non-differentiable with respect to $\theta$. To sidestep this difficulty, we propose the following reparameterization, a learnable (differentiable) transformation that maps a sample drawn from a continuous distribution to a relaxed Bernoulli sample. This transformation is detailed in Definition~\ref{def:1} below, followed by Theorem~\ref{thm:cdf.z}, which shows that the transformed sample converges in distribution to the desired Bernoulli distribution.

\begin{definition}\label{def:1}
Let $F$ be the CDF of an arbitrary continuous probability distribution. Sample $s$ from this {\bf reference distribution} and let
\begin{eqnarray}\label{eqn:z}
\hspace{-14mm}z &\triangleq& \sigmoid\left(\frac{1}{\tau}\Big(F^{-1}(\theta)-s\Big)\right),\quad\tau>0.
\end{eqnarray}
We call this the \textbf{ICDF} reparameterization, named after its use of the inverse CDF $F^{-1}$.
\end{definition}

\begin{theorem}\label{thm:cdf.z}
For all $\tau>0$, $\theta\in(0,1)$ and $t\in[0,1]$, if the distribution with CDF $F$ is finitely supported on $[a,b]$, then
\begin{eqnarray}\label{eqn:cdf.z}
\hspace{-10mm}\Pr(z \le t) \hspace{-2mm}&=&\hspace{-2mm} \begin{cases}
0 \quad\text{if}\ t \ <\  \sigma((F^{-1}(\theta)-b)/\tau),\\
1 \quad\text{if}\ t \ >\ \sigma((F^{-1}(\theta)-a)/\tau)
\end{cases}
\end{eqnarray}
or otherwise,
\begin{eqnarray}\label{eqn:cdf.zo}
\hspace{-10mm}\Pr(z \le t) \hspace{-2mm}&=&\hspace{-2mm}1-F\left(F^{-1}(\theta)+\tau\log\left(\frac{1}{t}-1\right)\right).
\end{eqnarray}
In case the distribution is not finitely supported (i.e., $a=-\infty$ and/or $b=+\infty$), Eqs.~\eqref{eqn:cdf.z} and~\eqref{eqn:cdf.zo} still hold because either (or both) of the first two cases will not be invoked. Thus, the distribution of $z$ converges to $\ber(\theta)$ as $\tau\to0$.
\end{theorem}
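
The final claim of Theorem~\ref{thm:cdf.z} can be read off from~\eqref{eqn:cdf.zo}; the following is a brief sketch rather than a full proof. Fix any $t\in(0,1)$. Whenever $a<F^{-1}(\theta)<b$, the two thresholds in~\eqref{eqn:cdf.z} tend to $0$ and $1$ as $\tau\to0$, so~\eqref{eqn:cdf.zo} eventually applies, and by the continuity of $F$,
\[
\lim_{\tau\to0}\Pr(z\le t)\ =\ 1-F\big(F^{-1}(\theta)\big)\ =\ 1-\theta .
\]
Since this holds for every $t\in(0,1)$, the limiting distribution places mass $1-\theta$ at $0$ and mass $\theta$ at $1$. As a concrete instance (chosen for illustration only), let the reference distribution be uniform on $[0,1]$, so that $F^{-1}(\theta)=\theta$ and $z=\sigmoid\big((\theta-s)/\tau\big)$. Then
\[
\Pr\big(z>\tfrac{1}{2}\big)\ =\ \Pr(s<\theta)\ =\ \theta \quad\text{for every}\ \tau>0,
\]
and decreasing $\tau$ only pushes $z$ closer to $0$ or $1$.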

{\bf Discussion.} An alternative to the above is the Gumbel softmax reparameterization~\citep{Jang2017,Maddison2017}, which also provides a differentiable relaxation of the Bernoulli distribution that approximates it asymptotically. However, to obtain one Bernoulli sample, the Gumbel trick requires sampling the Gumbel distribution twice. 

In contrast, our proposed reparameterization requires sampling from the reference distribution only once. We also show that the {\bf ICDF} reparameterization converges as fast as the Gumbel softmax: both approaches have an asymptotic convergence rate on the order of $O(\tau^2)$, as shown in the supplement. Empirically, we also show that {\bf ICDF} yields marginally better performance than Gumbel softmax. This is why we prefer {\bf ICDF} to Gumbel in our work.
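
To make the comparison concrete, the two-class Gumbel softmax with class probabilities $\theta$ and $1-\theta$ can be written as follows (a standard identity, stated here with $g_1,g_2$ denoting independent standard Gumbel samples, a notation introduced only for this remark):
\[
z_{\mathrm{Gumbel}}\ =\ \sigmoid\left(\frac{1}{\tau}\Big(\log\frac{\theta}{1-\theta} + g_1 - g_2\Big)\right).
\]
The difference $g_1-g_2$ follows the standard logistic distribution, which is symmetric about zero and whose CDF is the sigmoid function. Choosing the logistic distribution as the reference in~\eqref{eqn:z} gives $F^{-1}(\theta)=\log\frac{\theta}{1-\theta}$, so $z$ in~\eqref{eqn:z} then has the same distribution as $z_{\mathrm{Gumbel}}$ while using a single draw $s$.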

%------------------------------------------------------------------------------
\section{Related Work}
The term federated learning was coined by~\citet{McMahan2017} and the topic has attracted surging interest since then. Recent literature reviews~\citep{Yang2019,Li2019,Kairouz2019} have comprehensively studied the topic, summarized systems and infrastructures, and also suggested open problems. Among these, one interesting direction is a new family~\citep{Hardy2017,Nock2018,Heinze2014,Heinze2016} of federated learning that studies a scenario where features (instead of samples) are split across owners. This setting bears resemblance to our federated feature fusion scenario, but a key distinction is that the literature in this direction focuses on local models that can be trained together, whereas in our scenario, local models are trained in isolation to avoid the cost and overhead of coordination among different parties.

To build consensus among local models, our framework learns parameter matrices to align their local representations. Such alignments similarly appear in model fusion, where a number of models are fused into a common model by aligning model parameters~\citep{Yurochkin2019a}. In the context of deep learning, if the neural networks come from the same model family, their weights can be matched layer-wise, even if the numbers of weights are different~\citep{Yurochkin2019,Wang2020}. The referenced work treats the problem as bipartite graph matching, where the cost matrix is inferred from maximum a posteriori estimation. Then, the Hungarian algorithm~\citep{Kuhn1955} is applied to find the matching. In our work, we instead treat the permutation alignment as a differentiable parameterization with the help of Sinkhorn--Knopp~\citep{Sinkhorn1967,Mena2018,Emami2018}, so that it can be learned end-to-end with the other parameters.

Our framework also advocates learning a graph of data owners in the global model. Graph structure learning appears under various contexts. One field of study is grounded in the context of probabilistic graphical models, whereby a directed acyclic structure is learned. Gradient-based approaches in this context include~\citet{Zheng2018,Yu2019,Lachapelle2020}. On the other hand, a general graph may still be useful without resorting to causality. Recent approaches supporting GNN-based modeling include~\citet{Kipf2018,Franceschi2019,Wu2020,Shang2021}, wherein a graph structure is simultaneously learned together with the GNN parameters.

%------------------------------------------------------------------------------
\section{Experiments}\label{sec:exp}
This section reports comprehensive experimental results to demonstrate the effectiveness of our proposed federated feature fusion (F$^3$), instantiated with the techniques developed in Sections~\ref{sec:ambiguity} and~\ref{sec:bernoulli}.

%------------------------------------------------------------------------------
\textbf{Datasets.} We use four real-life time series datasets. Two are PMU data collected from multiple data owners of the U.S. power grid. For proof of concept, we smooth out heterogeneity and prepare homogeneous datasets. Such pre-processing suffices to test the proposed techniques while minimizing the complications introduced by otherwise diverse local models. Since the PMU datasets are proprietary, we also use two public traffic datasets~\citep{Li2018} for experimentation. A summary of these datasets is given in Table~\ref{tab:dataset}, while other processing details are deferred to the supplement due to limited space.

\begin{table}[t]
  \centering
  \caption{Dataset Statistics.}
  \label{tab:dataset}
  \setlength{\tabcolsep}{1.4mm}
  \begin{tabular}{lllll}
    \toprule
    & \la & \bay & \pmub & \pmuc\\
    \midrule
    \# samples    & 2856 & 4343 & 4853 & 1884\\
    \# owners     & 207  & 325  & 43   & 188\\
    \# features   & 1    & 1    & 2    & 2\\
    \# classes    & 2    & 2    & 4    & 4\\
    series length & 12   & 12   & 30   & 30\\
    missing data  & N    & N    & Y    & Y\\
    given graph   & Y    & Y    & N    & N\\
    \bottomrule
  \end{tabular}
\end{table}

\begin{figure}[t]
  \centering
  \includegraphics[width=\linewidth]{entropy.png}
  \caption{Distributions of prediction entropy.}
  \label{fig:entropy}
\end{figure}

%------------------------------------------------------------------------------
\textbf{Experiment Setting.}
All local models are LSTMs~\citep{Hochreiter1997} with the same hyperparameters, but pre-trained separately using local data. The local models are not fine-tuned during the training of the global model. Each dataset is split randomly for training/validation/testing. See the supplement for further details.


\begin{table*}[t]
  \centering
  \caption{Effectiveness of latent alignment in a graph-based global model. Superscript numbers are standard deviations.}
  \label{tab:results}
  \begin{tabularx}{\textwidth}{lcccccccc}
    \toprule
    & \multicolumn{2}{c}{\la}
    & \multicolumn{2}{c}{\bay}
    & \multicolumn{2}{c}{\pmub}
    & \multicolumn{2}{c}{\pmuc}\\
    & \fone & \rocauc & \fone & \rocauc & \fone & \rocauc & \fone & \rocauc\\
    \midrule
    \hspace{-2mm}{\bf A}: Horizontal FL
    & .25$^{.000}$ & -           & .33$^{.000}$ & -           & .36$^{.000}$ & -           & .29$^{.000}$  & -\\

    \hspace{-2mm}{\bf B}: Majority Voting
    & .11$^{.000}$ & -           & .09$^{.000}$ & -           & .29$^{.000}$ & -           & .18$^{.000}$  & -\\

    \hspace{-2mm}{\bf C}: Binary Thresholding
    & .69$^{.000}$ & -           & .64$^{.000}$ & -           & -           & -           & -            & -\\

    \hspace{-2mm}{\bf D}: Best Model Selection
    & .53$^{.000}$ & .70$^{.000}$ & .55$^{.000}$ & .79$^{.000}$ & .37$^{.000}$ & .69$^{.000}$ & .32$^{.000}$ & .62$^{.000}$\\

    \hspace{-2mm}{\bf E}: Mean Pooling -- Eq.~\eqref{eqn:global.ex}
    & .77$^{.009}$ & .96$^{.004}$ & .74$^{.012}$ & .93$^{.001}$ & .38$^{.008}$ & .71$^{.006}$ & .34$^{.008}$ & .64$^{.010}$\\

    \hspace{-2mm}{\bf F}: Transformer
    & .78$^{.023}$ & .94$^{.018}$ & .72$^{.045}$ & .93$^{.027}$ & \bf.39$^{.003}$ & .70$^{.009}$ & .40$^{.053}$ & .67$^{.058}$\\

    \hspace{-2mm}{\bf G}: Concatenation
    & \bf.83$^{.008}$ & \bf.97$^{.002}$ & .80$^{.066}$ & .96$^{.028}$ & \bf.39$^{.006}$ & .71$^{.036}$ & .40$^{.025}$ & .68$^{.040}$\\

    \midrule
    \hspace{-2mm}{\bf H}: F$^3$ w. no alignment
    & .80$^{.009}$ & .96$^{.004}$ & .75$^{.009}$ & .94$^{.001}$ & \bf.39$^{.003}$ & .73$^{.015}$ & .40$^{.020}$ & .66$^{.018}$\\

    \hspace{-2mm}{\bf J}: F$^3$ w. parameter tying 
    & .82$^{.009}$ & \bf.97$^{.001}$ & .75$^{.009}$ & .94$^{.004}$ & \bf.39$^{.006}$ & .72$^{.010}$ & .37$^{.012}$ & .66$^{.008}$\\

    \hspace{-2mm}{\bf K}: F$^3$ w. alignment
    & \bf.83$^{.010}$ & \bf.97$^{.001}$ & \bf.86$^{.005}$ & \bf.98$^{.002}$ & \bf.39$^{.008}$ & .73$^{.008}$ & \bf.45$^{.015}$ & .72$^{.003}$\\

    \midrule
    \hspace{-2mm}{\bf L}: VFL w. graph/alignment
    & \bf.83$^{.012}$ & \bf.97$^{.001}$ & \bf.86$^{.014}$ & \bf.98$^{.002}$ & \bf.39$^{.006}$ & \bf.74$^{.009}$ & \bf.45$^{.015}$ & \bf.73$^{.003}$\\

    \hspace{-2mm}{\bf M}: VFL w.o. pre-trained local
    & .77$^{.020}$ & .94$^{.021}$ & .77$^{.014}$ & .95$^{.006}$ & .34$^{.014}$ & .69$^{.012}$ & .35$^{.008}$ & .65$^{.014}$\\
    \bottomrule
  \end{tabularx}
\end{table*}

\begin{table*}[t]
  \centering
  \caption{Impact of learning a graph across different alignment settings.}
  \label{tab:results.graph}
  \begin{tabularx}{\textwidth}{clcccccccc}
    \toprule
    && \multicolumn{2}{c}{\la}
    & \multicolumn{2}{c}{\bay}
    & \multicolumn{2}{c}{\pmub}
    & \multicolumn{2}{c}{\pmuc}\\
    && \fone & \rocauc & \fone & \rocauc & \fone & \rocauc & \fone & \rocauc\\
    \midrule
    \multirow{4}{*}{\rotatebox[origin=c]{90}{\textbf{No Align}}}
    & No Graph
    & .768$^{.009}$ & .957$^{.004}$ & .738$^{.012}$ & .935$^{.001}$ & .381$^{.008}$ & .711$^{.006}$ & .342$^{.008}$ & .636$^{.010}$\\

    & Given Graph
    & .763$^{.020}$ & .957$^{.007}$ & .742$^{.024}$ & .942$^{.005}$ & -            & -            & -            & -\\

    & $\kappa$-NN Graph
    & .715$^{.015}$ & .952$^{.004}$ & .695$^{.013}$ & .934$^{.004}$ & .372$^{.001}$ & .711$^{.013}$ & \bf.404$^{.016}$ & \bf.680$^{.014}$\\

    & ICDF
    & \bf.798$^{.009}$ & \bf.963$^{.004}$ & \bf.755$^{.009}$ & \bf.943$^{.001}$ & \bf.387$^{.003}$ & \bf.734$^{.015}$ & .403$^{.020}$ & .663$^{.018}$\\
    \midrule
    \midrule
    \multirow{4}{*}{\rotatebox[origin=c]{90}{\textbf{Align}}}
    & No Graph
    & .813$^{.009}$ & .970$^{.002}$ & .846$^{.008}$ & .977$^{.001}$ & .386$^{.009}$ & .725$^{.012}$ & .386$^{.008}$ & .694$^{.005}$\\

    & Given Graph
    & .828$^{.007}$ & .974$^{.001}$ & .854$^{.003}$ & .977$^{.001}$ & -            & -            & -            & -\\

    & $\kappa$-NN Graph
    & .803$^{.020}$ & .968$^{.002}$ & .855$^{.003}$ & .973$^{.002}$ & .378$^{.002}$ & .718$^{.015}$ & .418$^{.007}$ & .702$^{.009}$\\

    & ICDF
    & \bf.835$^{.010}$ & \bf.975$^{.001}$ & \bf.860$^{.005}$ & \bf.980$^{.002}$ & \bf.390$^{.008}$ & \bf.734$^{.008}$ & \bf.451$^{.015}$ & \bf.725$^{.003}$\\
    \bottomrule
  \end{tabularx}
\end{table*}

%------------------------------------------------------------------------------
\textbf{Conflicting Local Predictions.}
We first show that local models do not produce consistent predictions, which justifies the effort of training a global model and performing federated feature fusion. For each datum, we compute the entropy of the labels predicted by the local models and summarize the entropies over all data into a distribution, plotted in Figure~\ref{fig:entropy}. Recall that the lower the entropy, the more consistent the local predictions. The figure, however, shows that a substantial fraction of the entropies are away from zero, suggesting that local predictions are inconsistent.
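
One plausible formalization of this entropy, stated as an assumption since the exact computation is not spelled out here, is the following. Let $\widehat{y}^i_k$ denote the label predicted by the $i$th local model for datum $k$, and let $C$ be the number of classes (both notations are introduced for illustration). Then
\[
p_{k,c}\ \triangleq\ \frac{1}{n}\sum_{i=1}^{n}\mathbf{1}\big[\widehat{y}^i_k=c\big],
\qquad
\mathcal{H}_k\ \triangleq\ -\sum_{c=1}^{C} p_{k,c}\log p_{k,c},
\]
with the convention $0\log 0=0$. Under this definition, $\mathcal{H}_k=0$ exactly when all $n$ local models predict the same class, and $\mathcal{H}_k\le\log C$, with larger values indicating predictions spread more evenly across classes.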

%------------------------------------------------------------------------------
\textbf{Effectiveness of Federated Feature Fusion.}
We make two sets of comprehensive comparisons to evaluate the effectiveness of the proposed framework. The first set, as outlined in Table~\ref{tab:results}, compares F$^3$ with a number of non-graph baselines ({\bf A--G}), including: (a) horizontal FL ({\bf A}), which requires both model homogeneity and training synchronicity among clients, neither of which is available in our setting; (b) a set of standard ensemble strategies ({\bf B--F}) that combine the local models, such as voting, binary thresholding, best local model, mean pooling via Eq.~\eqref{eqn:global.ex}, as well as a simplified Set Transformer with 2 layers and 4 heads~\citep{lee2019set}; and (c) a vertical FL baseline ({\bf G}) via feature concatenation. 

This set also contains several variants of our proposed federated feature fusion model, which serve as an ablation study of our model components: ({\bf H}) F$^3$ without alignment; ({\bf J}) F$^3$ with partially tied parameters among local models; and ({\bf K}) F$^3$ with learnable alignment. Note that {\bf J} is an alternative to alignment that comes at the cost of imposing strong homogenization -- though not as strong as {\bf A} -- among local models despite the different nature of their data. All variants {\bf H--K} use the {\bf ICDF} reparameterization to learn the graph structure. 

From Table~\ref{tab:results}, we observe that baselines {\bf A--D}, lacking either local models or a holistic global model, perform substantially worse than the other methods (including our F$^3$ variants, the ensemble via mean pooling, and concatenation). Baselines {\bf E--G} perform better than {\bf A--D}, but they either lack a proper alignment of local models or impose a strong form of homogenization among local models to sidestep alignment. As expected, {\bf K}, which performs alignment, matches or outperforms them on every dataset and metric. 

We also compare with two variants of our model {\bf K} in the vertical FL setting. {\bf L} uses the same local and global models as {\bf K} but allows gradients to be sent back to local clients, so that local models can be updated. It achieves performance similar to {\bf K} but incurs a much higher communication cost due to multiple rounds of gradient messages. {\bf M} assumes no pre-trained local models; all local and global models are trained jointly from scratch. Its performance is much worse than that of {\bf K} and {\bf L}, which demonstrates the merit of pre-trained local models. Another typical VFL baseline, {\bf G}, with pre-trained local models and a simple concatenation-based global model, is also inferior.

%------------------------------------------------------------------------------
\textbf{Impact of Learning a Graph.} Our next set of experiments, as outlined in Table~\ref{tab:results.graph}, demonstrates the impact of learning a graph that characterizes the innate local interactions among subsets of clients, following our challenge statement {\bf C2} in the introduction, on both alignment and non-alignment baselines. This provides an ablation study on the isolated impact of having a specific graph learning component. In particular, for each alignment setting, we report the performance when (a) not using a graph; (b) using a predefined graph; (c) learning the graph structure using a {\bf $\kappa$-NN} baseline~\citep{fatemi2021slaps}; and (d) learning the graph structure using {\bf ICDF}. The reported numbers suggest that, regardless of whether the model performs alignment, learning the graph with {\bf ICDF} always improves performance over not using a graph.

{\bf Remark.} The {\bf $\kappa$-NN} baseline ($\kappa = 10$) is implemented following the description by~\citet{fatemi2021slaps}. Specifically, during training, we generate a local graph for each batch of node features $\mathbf{X}$ via a symmetrization of $\Tilde{\mathbf{A}} = \text{\bf $\kappa$-NN}(\mathrm{MLP}(\mathbf{X}))$, which (1) feeds the node features through an MLP block; and (2) draws an edge between each node and its $\kappa$ nearest neighbors, where the neighborhood is defined by cosine similarity in the space of MLP-projected feature vectors.

%------------------------------------------------------------------------------
\section{Conclusion}
In this paper, we study federated feature fusion, a less addressed scenario of federated learning where data owners or clients need to customize their own local models to accommodate different sets of features. Unlike in conventional federated learning, the clients learn their own models separately in isolation and only communicate their local feature representations afterwards. We motivate the practicality of federated feature fusion with a power grid example and propose a local--global model framework for it. Two important components of the framework are the alignment of the data representations produced by local models and the learning of the global model by using a graph neural network. Comprehensive experiments suggest the feasibility and the effectiveness of federated feature fusion. We release our code at \url{https://github.com/matenure/federated_feature_fusion}.

%------------------------------------------------------------------------------
\begin{acknowledgements} % will be removed in pdf for initial submission,
						 % (without ‘accepted’ option in \documentclass)
                         % so you can already fill it to test with the
                         % ‘accepted’ class option
This material is based upon work supported by the Department of Energy under Award Number(s) DE-OE0000910, and the Defense Advanced Research Projects Agency (DARPA) through Cooperative Agreement D20AC00004 awarded by the U.S. Department of the Interior (DOI), Interior Business Center. This report was prepared as an account of work sponsored by an agency of the United States Government.  Neither the United States Government nor any agency thereof, nor any of their employees, makes any warranty, express or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights.  Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or any agency thereof.  The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof.
\end{acknowledgements}

%------------------------------------------------------------------------------
% References
\bibliography{reference}
\end{document}
