%\documentclass[accepted]{uai2022} % for initial submission
\documentclass[accepted]{uai2022} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2022} % ptmx math instead of Computer
                                         % Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2022} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:

\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example


%%%%
\usepackage[utf8]{inputenc} % allow utf-8 input
\usepackage[T1]{fontenc}    % use 8-bit T1 fonts
\usepackage{hyperref}       % hyperlinks
\usepackage{url}            % simple URL typesetting
\usepackage{amsfonts}       % blackboard math symbols
\usepackage{nicefrac}       % compact symbols for 1/2, etc.
\usepackage{microtype}      % microtypography
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{mathrsfs}
\usepackage{graphicx}
\usepackage{algorithm}
\usepackage{algorithmic}
\usepackage{amsthm}
\usepackage{subcaption}
\usepackage{xcolor}
\usepackage[toc,page]{appendix}
\usepackage{enumitem}

\newtheorem{theorem}{Theorem}
\newtheorem{proposition}{Proposition}
\newtheorem{lemma}{Lemma}
\newtheorem{definition}{Definition}
\newtheorem{example}{Example}
\newtheorem{remark}{Remark}


\newcommand{\rf}[1]{{\color{blue} #1}}
\newcommand{\ros}[1]{{\color{red} #1}}
\newcommand{\rosanna}[1]{{\color{green} #1}}
\newcommand{\ar}[1]{{\color{orange} #1}}
\newcommand{\mass}[1]{{\color{brown} #1}}

\usepackage[textwidth=2.0cm, textsize=tiny]{todonotes} % for writing
\newcommand{\rt}[2][noinline]{\todo[color=red!20,#1]{{\bf Ros:} #2}}
\newcommand{\remi}[2][noinline]{\todo[color=blue!20,#1]{{\bf Remi:} #2}}
\newcommand{\gino}[2][noinline]{\todo[color=yellow!20,#1]{{\bf Gino:} #2}}
\newcommand{\mas}[2][noinline]{\todo[color=brown!20,#1]{{\bf Mas:} #2}}
\newcommand{\alain}[2][noinline]{\todo[color=orange!20,#1]{{\bf Alain:} #2}}
\newcommand{\leo}[2][noinline]{\todo[color=black!20,#1]{{\bf Leo:} #2}}

%\usepackage{caption} 
%\captionsetup[table]{skip=10pt}
\newcommand{\red}[1]{\textcolor{blue}{#1}}
\hypersetup{
     colorlinks = true,
     linkcolor = blue,
     anchorcolor = blue,
     citecolor = blue,
     filecolor = blue,
     urlcolor = blue
     }

%%%%
\title{Multi-source Domain Adaptation via Weighted Joint Distributions Optimal Transport}

%\aistatsauthor{ Rosanna Turrisi \And Rémi Flamary \And Alain Rakotomamonjy \And  Massimiliano Pontil}

% The standard author block has changed for UAI 2022 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{\href{mailto:<rosanna.turrisi@edu.unige.it>?Subject=Your UAI 2022 paper}{Rosanna Turrisi}{}}
\author[2]{ Rémi Flamary}
\author[3]{Alain Rakotomamonjy}
\author[4,5]{Massimiliano Pontil}

% Add affiliations after the authors
\affil[1]{%
    DIBRIS, MaLGa, University of Genova, Genoa; CTSNC, Istituto Italiano di Tecnologia, Ferrara, Italy}
%\affil[2]{%
%    CTSNC, Istituto Italiano di Tecnologia, Ferrara, Italy}
\affil[2]{%
    CMAP, École Polytechnique, Institut Polytechnique de Paris
}
\affil[3]{%
    Criteo AI Lab, Paris}
\affil[4]{%
    CSML, Istituto Italiano di Tecnologia, Genoa, Italy}
\affil[5]{%
    Dept. of Computer Science, University College London, U.K.}
  \begin{document}
\maketitle



\begin{abstract}
This work addresses the problem of domain adaptation on an unlabeled \textit{target} dataset using knowledge from multiple labelled \textit{source} datasets. Most current approaches tackle this problem by searching for an embedding that is invariant across source and target domains, which corresponds to searching for a universal
classifier that works well on all domains.
In this paper, we address this problem from a new perspective: instead of crushing the diversity of the \textit{source} distributions, we exploit it to adapt better to the \textit{target} distribution.
Our method, named Multi-Source Domain Adaptation via Weighted Joint Distribution Optimal Transport (MSDA-WJDOT), simultaneously seeks an Optimal Transport-based alignment between the \textit{source} and \textit{target} distributions and a
re-weighting of the \textit{source} distributions.
We discuss the theoretical aspects of the method and propose a conceptually simple algorithm. Numerical experiments indicate that the proposed method achieves
state-of-the-art performance on simulated and real datasets.

% The problem of domain adaptation on an unlabeled \textit{target}  dataset using knowledge from multiple labelled \textit{source} datasets is becoming increasingly important. A key challenge is to design an approach that overcomes the covariate and \textit{target}  shift both among the sources, and between the \textit{source} and \textit{target}  domains. In this paper, we address this problem from a new perspective: 
%instead of looking for a latent representation invariant between \textit{source} and \textit{target}  domains, we exploit the diversity of \textit{source} distributions by 
%tuning their weights to the \textit{target}  task at hand. 
%Our method, named Weighted Joint Distribution Optimal Transport (WJDOT), aims at finding simultaneously an Optimal Transport-based alignment between the \textit{source} and \textit{target}  distributions and a re-weighting of the \textit{sources} distributions. 
%We discuss the theoretical aspects of the method and propose a conceptually simple algorithm. Numerical experiments indicate that the proposed method achieves
%state-of-the-art performance on simulated and real-life datasets.

\end{abstract}

\section{Introduction}

Many machine learning algorithms assume that the test and training datasets are sampled from the same distribution. However, in many real-world applications, new data can exhibit a distribution change (\textit{domain shift}) that degrades the algorithm's performance. This shift can be observed, for instance, in computer vision when the background, location, illumination or pose of the test images changes, or in speech recognition when the recording conditions or speaker accents vary.
%
To overcome this problem, Domain Adaptation (DA) \citep{jiang2008literature, kouw2019review} attempts to leverage labelled data from a \textit{source}
domain, in order to learn a classifier for unseen or unlabelled data in a \textit{target} domain. 

Several DA methods incorporate a distribution discrepancy loss into a neural network to overcome the domain gap. The distances between distributions are usually measured through 
an adversarial loss  \citep{ganin2016domain,Ghifary2016, Tzeng2015, Tzeng2017} or 
integral probability metrics, such as the maximum mean discrepancy \citep{Mingsheng2016, Tzeng2014}. DA techniques based on Optimal Transport have been proposed by \cite{Courty2015, Courty2017, Damodaran2018} and justified theoretically by \cite{Redko2017}. 

In this work, we focus on the setting, more common in practice, in which several labelled \textit{sources} are available, referred to in the following as the multi-source domain adaptation (MSDA) problem.
Many recent approaches motivated by theoretical considerations 
have been proposed for this problem. For instance, \cite{mansour2009domain,hoffman2018algorithms} provided theoretical guarantees on how several \textit{source} predictors can be combined using proxy measures, such as the accuracy of a hypothesis. This approach can achieve a low error predictor on the \textit{target}  domain, under the assumption that the \textit{target}  distribution can be written as a convex combination of the \textit{source} distributions. %\ros{In \cite{montesuma2021}, the sources are aggregated into an intermediate domain through the Wasserstein barycenter. Once the aggregation step is done, standard DA between the barycenter and target domain is employed.}

Other MSDA methods \citep{Peng2018, Zhao2018,Wen2019} look for a single hypothesis that minimizes the convex combination of its error on all \textit{source} domains, and they provide theoretical bounds on the error of the obtained hypothesis on the \textit{target} domain. Those guarantees generally involve terms depending on the distance between each \textit{source} distribution and the \textit{target} distribution, and suggest finding an embedding in which the feature distributions of the \textit{sources} and the \textit{target} are as close as possible, by using Adversarial Learning \citep{Zhao2018,Xu2018,Lin2020} or Moment Matching \citep{Peng2018}. 
However, it may not be possible to find an embedding preserving discrimination even when the distances between \textit{source} and \textit{target} marginals are small. One such example is given in Figure \ref{fig:visu_method}, in which a rotation between the \textit{sources} prevents the existence of such an invariant embedding, as theorized by \cite{pmlr-v97-zhao19a}. Finally, we mention the very recent line of work on MSDA considering approaches inspired by imitation learning \citep{nguyen2021most,nguyen2021stem} and the work by \cite{montesuma2021} building on Wasserstein barycenters.

%In this paper, we address the MSDA problem following a radically different route. Instead looking for a latent representation in which all \textit{source} distributions are similar to the \textit{target}
%Instead of looking for a latent representation in which all \textit{source} distributions
%are similar to the \textit{target}  one, we embrace the diversity of \textit{source} distributions
%and look for a convex combination of the joint distribution of \textit{sources} with minimal distance to the \textit{target} one, {without referring to a proxy measure such  as the accuracy of \textit{source} predictors.} 

%\paragraph{Contributions} 

\noindent {\bf Contributions~} In this paper, we address the MSDA problem following a radically different route {from the usual
approach, which consists of looking for a latent representation in which all \textit{source} distributions are similar to the \textit{target}}.
The approach we advocate embraces the diversity of \textit{source} distributions
and looks for a convex combination of the joint \textit{source} distributions with minimal Wasserstein distance to an estimated \textit{target} distribution, without relying on a proxy measure such as the accuracy of \textit{source} predictors.\\
%
%\mass{<MAYBE HERE WE SHOULD START A SUBSECTION "Contributions" where we push a bit more the novelty of what we do>}
%
We support this novel conceptual approach by deriving a generalization bound on the \textit{target} error.
%, involving the above Wasserstein distance. 
%The bound is obtained through a  technical result
%that has its own interest (XXXnot sure about this for lemma 2?XXX). 
Our algorithm consists
of optimizing {a key term} in this generalization bound, given
by the Wasserstein distance between the estimated joint \textit{target} distribution and a weighted sum of the joint \textit{source} distributions.
One unique feature of our approach is that the weights of the \textit{source} distributions are learned simultaneously with the classification function, which allows us to distribute the mass based on the similarity of the 
\textit{sources} with the \textit{target}, both in the feature and in the output spaces. As such, our model can also handle problems in which only target shift occurs. % After having derived a new generalization bound on the \textit{target} involving that distance, 
% we propose to optimize the Wasserstein distance, defined on the feature/label product space, similar to what was proposed in \cite{Courty2017} but between the \textit{target}  domain and a weighted sum of the labelled \textit{sources}.
% A unique feature of our approach is that the weights are learned simultaneously with the classification function, which allows us to distribute the mass based on the similarity of the 
% \textit{sources} with the \textit{target}, both in the feature and in the output spaces. 
Interestingly, the estimated
 weights provide a measure of domain relatedness and interpretability. We refer to the proposed method as Multi-Source Domain Adaptation via Weighted Joint Distribution Optimal Transport (MSDA-WJDOT). 
 

%The rest of the paper is organized as follows. In Section \ref{sec:RelatedWork}, we recall the basics of Optimal Transport (OT) problem and the Joint Distribution Optimal Transport (JDOT), which is the precursor of the method presented here. In Section \ref{sec:WJDOT}, we present a theoretical analysis of multi-source DA and introduce the proposed MSDA-WJDOT method. In Section \ref{sec:experiment}, we provide experimental results on both synthetic data and real life applications. Finally in Section \ref{sec:conc} we summarize our findings and draw final remarks.


%\paragraph{Notations} 
\noindent {\bf Notations~} Let $g:\mathcal{X}\rightarrow \mathcal{G}$ be a differentiable embedding function, with  $\mathcal{G}$ the embedding space. Throughout the paper all input distributions are in this  embedding space. 
We let $p_S$ and $p_T$ be the true joint distributions in the \textit{source} and \textit{target} domains, respectively. Both distributions are supported on the product space $\mathcal{G}\times \mathcal{Y}$, where $\mathcal{Y}$ is the label space. 
In practice we only have access to a finite number $N_S$ of samples in the \textit{source} domain leading to the empirical \textit{source} distribution
$\hat p_S=\frac{1}{N_S}\sum_{i=1}^{N_S}\delta_{g(x_S^i),y_S^i}$
where $\delta$ is the Dirac measure. In the \textit{target} domain, only a finite number $N_{T}$ of unlabeled samples in the feature space is available. % and let $\hat\mu _{T}=\frac{1}{N_{T}}\sum_{i=1}^{N_{T}}\delta_{g(x_{T}^i)}$ be corresponding empirical \textit{target}  marginal distribution.\\
We denote by $\Delta^J:=\{\pmb{\alpha}\in [0,1]^{J} \mid \sum_{i=1}^{J} \alpha_i = 1\}$ the $(J-1)$-dimensional simplex. Finally, given a loss function $L$ and a joint distribution $p$, the expected loss of a function $f$ is defined as $\varepsilon_{p}(f)=\mathbb{E}_{(x,y)\sim p} [L(y,f(x))]$.




\section{Optimal Transport and DA}\label{sec:RelatedWork}
In this section we first
recall the Optimal Transport problem and the notion of Wasserstein distance. 
%, playing a central role in our approach. 
Then we discuss how they were exploited for domain adaptation (DA) in the Joint Distribution Optimal Transport (JDOT) formulation that will be central in our approach.
%that will be an important part of our contribution.

\paragraph{Optimal Transport}
The Optimal Transport (OT) problem was originally introduced by 
%Monge in 1781 
\cite{Monge} and reformulated as a relaxation by 
%Kantorovich 
\cite{Kantorovich}.
Let $\hat\mu _{S}=\sum_{i} a^S_i\delta_{x_S^i}$, $\hat\mu _{T}=\sum_{i} a^T_i\delta_{x_T^i}$ be discrete probability measures with $\pmb{a^S},\pmb{a^T} \in\Delta^J$. 
%elements of the simplex. 
%{with $\sum_{i=1}^J a^i_{k} = 1$ and $a^i_{k} \geq 0, \forall i,k$}. 
The OT problem searches for a transport plan $\pi\in \Pi(\hat\mu_{S}, \hat\mu_{T})$, where
$$\Pi(\hat\mu_{S}, \hat\mu_{T}) := \Big\{\pi\geq 0 ~\big|~ \sum_{i=1}^J \pi_{i,j}=a^T_j, \sum_{j=1}^J \pi_{i,j}=a^S_i\Big\}
$$
is the set of joint probabilities with marginals $\hat\mu_{S}$ and $\hat\mu_{T}$, that solves the following problem:
% \begin{equation}
%  \sum _{ij} C_{ij} \cdot \gamma _{ij},
% \end{equation}
% where 
% $C \text{ is a cost matrix in } \mathbb{R}_{+}^{N_{1}\times N_{2}}$.
% The quantity 
\begin{equation}\label{eq:OT}%\textstyle
W_{C}(\hat\mu_{S}, \hat\mu_{T}) =  \operatorname*{min}_{\pi\in \Pi(\hat\mu_{S}, \hat\mu_{T})} \sum _{i,j=1}^J C_{i,j} \cdot \pi _{i,j}
\end{equation}
where $C_{i,j} = c(x_{S}^{i}, x_{T}^{j})$ represents the cost of transporting mass between $x_S^i$ and $x_T^j$ for a given ground cost function $c:\mathcal{X}\times \mathcal{X}\rightarrow \mathbb{R}_{+}$. The ground cost is often chosen to be the Euclidean distance, which recovers the classical $W_1$ Wasserstein distance. Given a ground cost $C$, $W_{C}(\hat\mu_{S}, \hat\mu_{T})$ corresponds to the minimal cost of mapping one distribution to the other, and the minimizer $\pi^\star$ of \eqref{eq:OT} is the OT matrix describing the relations between \textit{source} and \textit{target} samples. OT, and in particular the Wasserstein distance, has been used with success in numerous machine learning applications such as Generative Adversarial Modeling \citep{arjovsky2017wasserstein,genevay2018learning} and DA \citep{Courty2015,Courty2017,shen2018wasserstein}.
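
The following toy computation makes \eqref{eq:OT} concrete. It is purely illustrative: the points and weights are hypothetical and are not taken from our experiments.
\begin{example}
Assume two source points $x_S^1=0$, $x_S^2=3$ and two target points $x_T^1=1$, $x_T^2=4$ on the real line, with uniform weights $\pmb{a^S}=\pmb{a^T}=(\tfrac12,\tfrac12)$ and ground cost $c(x,x')=|x-x'|$. The cost matrix is
\[
C=\begin{pmatrix} 1 & 4\\ 2 & 1\end{pmatrix},
\]
and every feasible plan can be written as $\pi_{1,1}=\pi_{2,2}=t$, $\pi_{1,2}=\pi_{2,1}=\tfrac12-t$ with $t\in[0,\tfrac12]$. The transport cost is then
\[
\sum_{i,j=1}^{2} C_{i,j}\,\pi_{i,j} = 2t + 6\left(\tfrac12 - t\right) = 3-4t,
\]
which is minimized at $t=\tfrac12$. Hence $\pi^\star=\operatorname{diag}(\tfrac12,\tfrac12)$ and $W_C(\hat\mu_S,\hat\mu_T)=1$: each source point is matched to the target point one unit to its right.
\end{example}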
%In \cite{Courty2017}, authors also show the advantages of the Euclidean distance in some experiments. Nevertheless, other cost
%functions could be investigated on the base of the nature of data and the type of application.%\\

\paragraph{Joint Distribution Optimal Transport (JDOT)} This method was proposed by \cite{Courty2017} to address the problem of unsupervised DA with only one joint \textit{source} distribution $\hat p_S$ and the feature marginal \textit{target} distribution $\hat\mu _{T}$. 
%The Kantorovich formulation in \eqref{eq:OT} can in principle be used for joint distributions instead of the feature marginal ones. 
Since no labels are available in the \textit{target} domain, the authors proposed to use a proxy joint empirical distribution $\hat p_{T}^f$ whereby labels are replaced by the prediction of a classifier
$f:\mathcal{G}\rightarrow\mathcal{Y}$, that is
\begin{equation}
    \hat p_{T}^f=\frac{1}{N_{T}} \sum\limits_{i=1}^{N_{T}} \delta _{g(x_{T}^{i}), f(g(x_{T}^{i}))}.
    \label{eq:proxy}
\end{equation}
In order to use a joint distribution in the Wasserstein distance, they defined, for $z,z'\in {\cal G}$ and $y,y' \in {\cal Y}$, the cost
\[
D(z,y;z',y') = \beta \|z-z'\|^2 + {L}(y,y')
\]
where ${L}$ is a loss between classes and $\beta$ weights the strength of the feature loss. This cost takes into account both embedding and label discrepancy. To train a meaningful classifier on the \textit{target} domain, 
%the authors of 
\cite{Courty2017} solved the  problem
 \begin{equation}\label{eq:JDOT}
 \operatorname*{min}_{ f}  W_D(\hat p_S,\hat p_T^f) 
\end{equation}
where the minimization is over a suitable set of classifiers and the objective 
%function 
$W_D(\hat p_S,\hat p_T^f)$ is a Wasserstein distance between the joint \textit{source} and joint ``predicted'' \textit{target}, 
\[
\operatorname*{min}_{\pi\in \Pi(\hat p_{S}, \hat p_T^f)}\sum_{i,j=1}^J
D(g(x_{S}^{i}), y_{S}^{i};g(x_T^{j}), f(g(x_T^{j}))) \cdot \pi _{i,j}. \]
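
For intuition, the following evaluation of the cost $D$ may help. It is only a sketch: the value $\beta=1$, the 0-1 loss for $L$ and the distance value are illustrative assumptions.
\begin{example}
Let $(z,y)$ be an embedded source sample and $z'$ an embedded target sample with $\|z-z'\|^2=0.04$. If the classifier predicts $f(z')=y$, then
\[
D(z,y;z',f(z')) = 1\cdot 0.04 + 0 = 0.04,
\]
whereas if $f(z')\neq y$, then
\[
D(z,y;z',f(z')) = 1\cdot 0.04 + 1 = 1.04.
\]
The transport plan therefore favors coupling target samples with nearby source samples whose label agrees with the prediction, and minimizing over $f$ in \eqref{eq:JDOT} pushes the predictions toward the labels of the coupled source samples.
\end{example}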

JDOT is supported by generalization error guarantees \citep[see][for a discussion]{Courty2017}.
%to be theoretically grounded and some generalization bounds were provided in the paper.  
JDOT was later extended to a deep learning framework by \cite{Damodaran2018}, where the embedding $g$ is estimated simultaneously with the classifier $f$ via an efficient stochastic optimization procedure. 
%A very important 
A key aspect of JDOT that was overlooked 
%in the original paper, and its deep extension
by the domain adaptation community
is the fact that the optimization problem {involves} the joint embedding/label distribution. This is in contrast to the large majority of DA approaches \citep{ganin2016domain,sun2016deep,shen2018wasserstein}, which use divergences only on the marginal distributions, whereas using feature and label information simultaneously is the basis of most generalization bounds, as discussed in the next section.% but discards the labels when measuring the divergence between both embedded distributions.
%\mas{I do not fully understand the rational of this paragraph}
%\remi{better? Alain had a nice sentence when we talked but is think i did not express it well ;)}


% , as
% \begin{equation}
%  \operatorname*{inf}_{\gamma\in \Pi(p_{1}, p_{2})} \sum _{ij} D_{ij} \cdot \gamma _{ij}
% \end{equation}

% where $D_{ij} := D_{ij}(x_{1}^{i}, y_{1}^{i}; x_{2}^{j}, y_{2}^{j}) =c(x_{1}^{i}, x_{2}^{j}) +  \beta \cdot  \mathcal{L}(y_{1}^{i}, y_{2}^{j})$, $\mathcal{L}$ 
% is a loss function measuring the discrepancy between $y_{1}^{i}$ ad $y_{2}^{j}$ and $\beta\in\mathbb{R}_{+}$. \\

% The authors in \cite{Courty2017} take into account the problem in which the set $Y_{2}$ is unknown and the goal is 
% to predict it by learning a function $f_{2}:X_{2}\rightarrow \mathcal{Y}$. In this case, they propose to replace $y_{2}$ with $f_{2}(x_{2})$ and
% $p_{2}$ with $p_{f_{2}}:=\frac{1}{N_{2}} \sum _{i=1}^{N_{2}} \delta _{(x_{2}^{i}, f_{2}(x_{2}^{i}))} $. Therefore, the Joint Distribution Optimal Transport (JDOT) algorithm problem looks for
%  \begin{equation}\label{eq:JDOT}
%  \operatorname*{inf}_{\gamma\in \Pi(p_{1}, p_{2}), f_{2}} \sum_{ij} D(x_{1}^{i}, y_{1}^{i}; x_{2}^{j}, f_{2}(x_{2}^{j})) \cdot \gamma _{ij}
% \end{equation}
% with
% \begin{equation}\label{eq:cost}
%     D_{ij}(x_{1}^{i}, y_{1}^{i}; x_{2}^{j},f_{2}(x_{2}^{j})) = c(x_{1}^{i}, x_{2}^{j}) + \beta _{2}\mathcal{L}(y_{1}^{i},f_{2}(x_{2}^{j})).
% \end{equation}
% The function solution $f_{2}$ produces predictions to optimally match \textit{source} labels in the transport plan.

\section{Multi-source DA via Weighted Joint Optimal Transport}\label{sec:WJDOT}
We now discuss our MSDA approach. 
We assume that $J$ \textit{sources} are available, with joint distributions $p_{S,j}$, for $1\leq j\leq J$. %When confusion does not arise we use the shorthand $p_j \equiv p_{S,j}$.  %For the simplicity's sake, we will denote the joint distribution of the $j$-th \textit{source} with $p_{j}$.\\ % and the number of samples with $N_j$.
%first discuss the problem of generalization  for %multi-source DA (MSDA) 
%the MSDA problem and 
We define a convex combination of the \textit{source} distributions
\begin{equation}
    p_{S}^{\alpha} = \sum _{j=1}^{J} \alpha _{j}p_{S,j}
\end{equation}with $\pmb{\alpha}\in\Delta^{J}$, and we present a novel generalization bound for the MSDA problem that depends on $p_S^{\alpha}$. %We present a novel generalization bound for %multi-source DA (MSDA) the MSDA problem that depends on a weighting of the \textit{source} distributions. 
Then, we introduce the MSDA-WJDOT optimization problem and propose an algorithm to solve it. Finally, we discuss the relation between MSDA-WJDOT and other MSDA approaches.
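
To make the role of the weights concrete, the following sketch writes out the empirical counterpart of $p_S^{\alpha}$. The sample notation $(x_{S,j}^i,y_{S,j}^i)$ is introduced here for illustration only, and the weight vector is a hypothetical choice.
\begin{example}
Assume that the $j$-th source provides $N_j$ labelled samples $(x_{S,j}^i,y_{S,j}^i)$ with empirical distribution $\hat p_{S,j}$. Plugging the empirical distributions into the convex combination gives
\[
\hat p_S^{\alpha} = \sum_{j=1}^{J}\alpha_j\, \hat p_{S,j} = \sum_{j=1}^{J}\sum_{i=1}^{N_j} \frac{\alpha_j}{N_j}\,\delta_{g(x_{S,j}^i),\,y_{S,j}^i},
\]
so that every sample of source $j$ carries mass $\alpha_j/N_j$ and the total mass is $\sum_{j=1}^{J} N_j\cdot \alpha_j/N_j=1$. For instance, with $J=4$ and $\pmb{\alpha}=(0,0.5,0.5,0)$, the samples of sources 1 and 4 receive zero mass, while those of sources 2 and 3 receive mass $0.5/N_2$ and $0.5/N_3$, respectively.
\end{example}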
 


\subsection{Generalization Bound}
%for Multi-source DA}



The theoretical limits of DA have been well studied since the work of \cite{ben2010impossibility}, which provided an ``impossibility theorem'' showing that, if the \textit{target} distribution is too different from the \textit{source} distribution, adaptation is not possible. However, in the case of MSDA, 
%\mas{I slightly rephrase this}
one can exploit the diversity of the \textit{source} domains and use only the \textit{sources} close to the \textit{target}  distribution, thereby 
obtaining a better generalization bound.
%intuitively use the variability of the \textit{source} domains and use only the \textit{sources} close to the \textit{target}  distribution to obtain a better generalization bound. 
%For this purpose, an idea that has been investigated recently in ML \cite{mansour2009domain} to find a convex combination of the \textit{source} distributions {which is most suited to the target}. 
For this purpose, a relevant assumption, already considered in \cite{mansour2009domain}, is that the \textit{target}  distribution is a convex combination of the \textit{source} distributions. 
The soundness of such an approach is illustrated by the following lemma.
%where $\varepsilon_{p}(f)=\mathbb{E}_{(x,y)\sim p} [L(y,f(x))]$ is the expected loss of function $f$ on the joint distribution $p$.
%\rt{maybe we can move the definition of $\varepsilon_{p}(f)$ in the notation}
\begin{lemma}\label{lemma1}
For any hypothesis $f \in \mathcal{H}$, denote by $\varepsilon_{p_{T}}(f)$ and $\varepsilon_{p_S^{\boldsymbol{\alpha}}}(f)$, the expected loss of $f$ on the \textit{target}  distribution and on the weighted sum of the \textit{source} distributions, with respect to a loss function $L$ bounded by $B$. Then
\begin{equation}\label{eq:bound_tv}\textstyle
    \varepsilon_{p_{T}}(f) \leq \varepsilon_{p_S^{\boldsymbol{\alpha}}}(f) + B\cdot D_{TV}\left(p_S^{\boldsymbol{\alpha}},p_T\right)
\end{equation}
where $D_{TV}$ is the total variation distance.
\end{lemma}
This simple inequality, whose proof is presented in the appendix, tells us that the key point
for \textit{target} generalization is to have a function $f$ with low error on a
combination of the joint \textit{source} distributions, and that this combination should be ``close'' to
the \textit{target} distribution.
%that if one can match the weighted joint 
%\textit{source} distributions then one can find an hypothesis that has low generalization error on the target. 
Note that this also holds for the single-\textit{source} DA problem, corroborating the recent findings that just matching marginal distributions may not be sufficient \citep{pmlr-v97-wu19f}. While the above lemma provides simple and principled guidance for a multi-source
DA algorithm, it cannot be used for training since %poses two issues:  first, the total variation
%distance requires 
%imposes 
%the estimation of the underlying distributions, and most importantly,
it assumes that labels in the \textit{target}  domain are known. In the following, we provide 
%theoretical results for generalization 
%\mas{I rewrote this part - please have a look} 
a generalization bound in a realistic scenario where no \textit{target}  labels are available  and a self-labelling strategy is employed to compensate for the missing labels.
%and by considering a divergence measure on empirical distributions
%and a self-labelling strategy. %\rt{is this sentence incomplete?}
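
One way to see why the bound of Lemma \ref{lemma1} holds is sketched below; it may differ from the proof given in the appendix. The sketch assumes that $0\leq L\leq B$ and uses the convention $D_{TV}(p,q)=\sup_{A}|p(A)-q(A)|$.
\begin{remark}
For any $f\in\mathcal{H}$,
\begin{equation*}
\begin{split}
\varepsilon_{p_T}(f)-\varepsilon_{p_S^{\boldsymbol{\alpha}}}(f) &= \int L(y,f(x))\,\mathrm{d}\big(p_T-p_S^{\boldsymbol{\alpha}}\big)(x,y)\\
&\leq B\,\big(p_T-p_S^{\boldsymbol{\alpha}}\big)^{+}(\mathcal{G}\times\mathcal{Y}) = B\cdot D_{TV}\left(p_S^{\boldsymbol{\alpha}},p_T\right),
\end{split}
\end{equation*}
where $(p_T-p_S^{\boldsymbol{\alpha}})^{+}$ denotes the positive part of the signed measure $p_T-p_S^{\boldsymbol{\alpha}}$. In particular, if $p_T=p_S^{\boldsymbol{\alpha}}$ for some $\boldsymbol{\alpha}\in\Delta^J$, the total variation term vanishes and, by linearity of the expectation,
\[
\varepsilon_{p_T}(f)=\varepsilon_{p_S^{\boldsymbol{\alpha}}}(f)=\sum_{j=1}^{J}\alpha_j\,\varepsilon_{p_{S,j}}(f).
\]
\end{remark}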

Taking inspiration from the result in Lemma \ref{lemma1}, we propose a theoretically grounded framework for learning from multiple \textit{sources}. To this end, we first recall the notion of Probabilistic Transfer Lipschitzness (PTL) of a classifier \citep{Courty2017}, which will be used in our method. 
\begin{definition}{{\em (PTL Property)}} 
%Let $p_S$ and $p_T$ be respectively the \textit{source} and \textit{target}  distributions. 
Let $D$ be a metric on $\mathcal{G}$ and let $\phi : \mathbb{R} \rightarrow [0,1]$. A labeling function $f : \mathcal{G} \rightarrow \mathbb{R}$ and a joint distribution $\pi\in\Pi(\mu_S,\mu_T)$ %with marginals $\mu_S$ and $\mu_T$ 
are $\phi$-Lipschitz transferable if for all $\lambda > 0$, we have
$$
{\rm Prob}_{(x_S,x_T)\sim \pi}\big [|f(x_S) - f(x_T) | > \lambda D(x_S,x_{T}) \big ] \le \phi(\lambda).
$$
\label{def1}
\end{definition}
The PTL property is a reasonable assumption for DA that was introduced in \cite{Courty2017}; it provides a bound on the probability of finding a pair of \textit{source}-\textit{target} samples with different labels within a $1/\lambda$-ball. 
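
As a sanity check of Definition \ref{def1}, consider the following special case. It is given only as an illustration, under the assumption that $f$ is globally Lipschitz with respect to $D$; the constant $K$ is our notation.
\begin{example}
Assume that $|f(x)-f(x')|\leq K\,D(x,x')$ for all $x,x'\in\mathcal{G}$. Then, for every $\lambda\geq K$ and every coupling $\pi\in\Pi(\mu_S,\mu_T)$, the event $|f(x_S)-f(x_T)|>\lambda D(x_S,x_T)$ has probability zero, so the PTL property holds with
\[
\phi(\lambda)=\begin{cases} 1 & \text{if } \lambda< K,\\ 0 & \text{if } \lambda\geq K.\end{cases}
\]
The PTL property relaxes this deterministic requirement by allowing the Lipschitz inequality to fail on a set of \textit{source}-\textit{target} pairs of probability at most $\phi(\lambda)$.
\end{example}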

Our approach is based on the idea that one can compensate for the lack of \textit{target} labels by using a hypothesis labelling function $f$, which provides the joint distribution $p_{T}^f$ in \eqref{eq:proxy}, where $f$ is chosen so as to align $p_{T}^f$ with a weighted combination of \textit{source} distributions $p_S^{\boldsymbol{\alpha}}$.
Following this idea, %and building upon previous work on
%single-\textit{source} domain adaptation JDOT \cite{Courty2017}, 
we introduce a similarity measure and a new generalization bound for MSDA. %

\begin{definition}{{\em (Similarity measure)}} 
Let $\mathcal{H}$ be a space of $M$-Lipschitz labelling functions. Assume 
%also that the input space is so 
that, for every $f \in \mathcal{H}$ and $x,x' \in {\cal G}$, $|f(x) - f(x^\prime)| \leq M$.
Consider the following measure of similarity between $p_{S}^{\boldsymbol{\alpha}}$ and $p_T$ introduced in \cite[Def. 5]{ben2010impossibility}
\begin{equation}
\Lambda(p_S^{\boldsymbol{\alpha}},p_T)=\min_{f\in\mathcal{H}} \varepsilon_{p_S^{\boldsymbol{\alpha}}}(f) + \varepsilon_{p_T}(f),\label{eq:lambda}
\end{equation}
% $$%\texstyle
% \Lambda(p^{\boldsymbol{\alpha}},p^T)=\min_{f\in\mathcal{H}} \varepsilon_{p^{\boldsymbol{\alpha}}}(f) + \varepsilon_{p^T}(f)
% $$ 
where the risk is measured w.r.t. a symmetric and $k$-Lipschitz loss function that satisfies the triangle inequality.
\label{defLambda}
\end{definition}

\begin{theorem}\label{thm:gen2}
%\ros{We already used $\delta$ for the Dirac function!} \rf{i think it is ok it is clear from context that this delta is a real}
%Under the assumptions of Theorem 1, 

%\ros{The assumptions of $L$ should be stated at the beginning of Sec.3}
Let $\mathcal{H}$ be the space introduced in Definition \ref{defLambda}
 and assume that the function $f^*$ minimizing \eqref{eq:lambda} satisfies the PTL property (Definition \ref{def1}).
Let $\hat p_{S,j}$ be the empirical distribution of the $j$-th source with $N_j$ samples and $\hat p_T$ the empirical target distribution with $N_T$ samples. Then for all $\lambda>0$, with $\beta=\lambda k$ in the ground metric $D$, we have with probability at least $1-\eta$ that
%\rt{I think we should remind who's $\beta$}
\begin{equation}
\nonumber%\label{eq:gen2}
\scriptstyle
\begin{split}
    \varepsilon_{p_T}(f) \leq &W_D\left(\hat p_S^{\boldsymbol{\alpha}}, \hat p_T^f\right)+\sqrt{\frac{2}{c'}\log \frac{2}{\eta}}\left(\frac{1}{N_T}+\sum_{j=1}^J\frac{\alpha_j}{ N_j}\right)\\
&+ \Lambda(p_S^{\boldsymbol{\alpha}},p_T)%\ W_D( p^f,p^T_\dagger) + \ W_D( p^T_\dagger, p^T) + 
 + kM \phi(\lambda).
 \end{split}
\end{equation}
\end{theorem}
Note that the quantity $\Lambda(p_S^{\boldsymbol{\alpha}},p_T)$ in the bound 
%the bound above depends on the term $\Lambda$ measuring 
measures the discrepancy between the true \textit{target} distribution and the ``best'' combination of the \textit{source} distributions and, 
similarly to some terms in the DA bounds of \cite{ben2010impossibility}, it is not directly controllable. However, we have checked experimentally that our approach
minimizes an upper bound of this term $\Lambda$ (see the discussion in Section \ref{sec:experiment} and Figure \ref{fig:lambda} in the appendix). 
%\rt{who is $\alpha$? is it the term that weights the embedding distance in jdot? Note that 1) we have a $\beta$ term weighting the labels, not the features 2) this notation is misleading as here $\alpha$ are the weights}
%Note that  %number of samples in each empirical distribution has an impact on the generalization, but
Interestingly, the $1/N_j$ ratios in the bound are weighted by $\alpha_j$, which means that even if one \textit{source} is poorly sampled it will not have a large impact as long as the coefficient $\alpha_j$ stays small. 
This suggests investigating a regularization of the weights ${\boldsymbol{\alpha}}$, but since it would introduce an additional hyperparameter we leave it to future work and, in the following, focus only on optimizing the first term of the bound.   
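As a purely illustrative computation (the numbers are hypothetical and not taken from our experiments), consider $J=2$ \textit{sources} with $N_1=1000$ and $N_2=10$ samples. The source-dependent part of the sampling term in Theorem \ref{thm:gen2} then reads
\begin{equation}
\nonumber
\sum_{j=1}^{2}\frac{\alpha_j}{N_j}=
\begin{cases}
\frac{0.5}{1000}+\frac{0.5}{10}=0.0505 & \text{for } \boldsymbol{\alpha}=(0.5,0.5),\\[2pt]
\frac{0.9}{1000}+\frac{0.1}{10}=0.0109 & \text{for } \boldsymbol{\alpha}=(0.9,0.1),
\end{cases}
\end{equation}
so down-weighting the poorly sampled \textit{source} reduces this term by a factor of about $4.6$.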


\subsection{MSDA-WJDOT Problem}
%In this section we introduce our contribution WJDOT and we discuss the optimization procedure. Finally we discuss the relations of our approach w.r.t. the state of the art.

\paragraph{MSDA-WJDOT Optimization Problem} 

\begin{figure*}[ht]
\begin{minipage}[b]{1.0\linewidth}
    \centering
    \includegraphics[width=.85\linewidth]{Figures/visu_method.pdf}
    
    \end{minipage}
\vspace{-.5truecm}
\caption{2D simulated data. (Left) illustration of 4 \textit{source} distributions corresponding to 4 increasing rotations. The color of the sample corresponds to the class. (Center Left) \textit{source} distributions and \textit{target} distribution, shown in black because no class information is available. (Center Right) \textit{source} distributions weighted by the optimal $\pmb{\alpha}^\star=[0,0.5,0.5,0]$ from MSDA-WJDOT: only Sources 2 and 3 have a weight $>0$ because they are the closest to the \textit{target} in the Wasserstein sense. (Right) Final MSDA-WJDOT \textit{target} classification.}
    \label{fig:visu_method}
\end{figure*}



%We first consider a scenario in which the embedding function $g$ is known or the data is already linearly separated ($g\equiv$ identity function).
%We assume that we have access to the target, the empirical measure $\mu$ with support on $G:=\{g(x^{i})\}_{i=1}^{N}$, and \textit{source} empirical mesures $\mu _{s}$ on $G_{s}:=\{g(x^{i}_{s})\}_{i=1}^{N_{s}} $ for $s=1,\cdots,S$. 
%We denote with $p_{s}$ the joint feature/label empirical distributions $p_{s}=\frac{1}{N_{s}}\sum _{i=1}^{N_{s}} \delta_{(g(x_{s}^{i}), y_{s}^{i})}$ for all $s\in\{1, \cdots, S\}$. Similarly to JDOT we define the joint feature/classifier distribution $p^{f}=\frac{1}{N}\sum _{i=1}^{N} \delta _{(g(x^{i}), f(g(x^{i})))}$.


Our approach aims at finding a function $f$ that aligns the distribution $p_T^{f}$ with a convex combination $\sum _{j=1}^{J} \alpha _{j} p_{S,j}$ of the \textit{source} distributions, with weights $\pmb{\alpha}\in\Delta^J$ on the simplex.
We express the multi-source domain adaptation problem as
\begin{equation}\label{eq:DA1}
 %\pmb{\alpha}^{*}, f^{*} = 
 \operatorname*{min}_{\pmb{\alpha}, f} \quad
W_D\left(\hat p_T^{f}, \sum _{j=1}^{J} \alpha _{j} \hat p_{S,j}\right).
\end{equation}
 %where $\lambda\in\mathbb{R}_{+}$. 
 Problem \eqref{eq:DA1} is a minimization of the first term in the bound from Theorem \ref{thm:gen2} %\mas{Wrong reference} \eqref{eq:gen2}
 with respect to both $f$ and $\pmb{\alpha}$. %We chose here not to regularized the weights since this would introduce a new parameter that can be hard to validate. 
 The role of the weights $\pmb{\alpha}$ is crucial because, in practice, they allow selecting (when $\pmb{\alpha}$ is sparse) the \textit{source} distributions that are the closest to the \textit{target} in the Wasserstein sense and transferring label knowledge from those distributions only. % with JDOT. 
 An example of the method %called 
 %Weighted Joint Distribution Optimal Transport (WJDOT)
 is provided in Figure \ref{fig:visu_method}, which shows 4 \textit{source} distributions in 2D obtained by rotation. %The \textit{target}  distribution is actually generated to have an angle in between the \textit{sources} 2 and 3 and the final solution for $\pmb{\alpha}^\star=[0,0.5,0.5,0]$ is illustrated in the center right plot of the Figure \ref{fig:visu_method}.  
  One interesting property of our approach is that it can adapt to a large variability in the \textit{source} distributions, as long as the distributions lie on a distribution manifold and this manifold is well sampled by the \textit{source} distributions. %  which allows to interpolate between close \textit{sources}. 
  For instance, the linear weights allow interpolating between \textit{source} distributions and recovering the weighted \textit{source} that is the closest to the \textit{target} on the manifold of distributions, hence providing a tighter generalization bound, as shown in the previous section.
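To make the objective in \eqref{eq:DA1} explicit, the following sketch writes it as a discrete OT problem. It assumes uniform weights $1/N_T$ on the \textit{target} samples $x_T^{t}$, $t=1,\dots,N_T$ (a notation introduced here only for illustration), and denotes by $D_{(j,i),t}$ the ground cost $D$ between the \textit{source} pair $(g(x_j^i),y_j^i)$ and the \textit{target} pair $(g(x_T^t),f(g(x_T^t)))$:
\begin{equation}
\nonumber
\begin{split}
W_D\Big(\hat p_T^{f}, \sum_{j=1}^{J}\alpha_j \hat p_{S,j}\Big)=\min_{\pi\geq 0}\ &\sum_{j=1}^{J}\sum_{i=1}^{N_j}\sum_{t=1}^{N_T}\pi_{(j,i),t}\, D_{(j,i),t}\\
\text{s.t. }\ &\sum_{t=1}^{N_T}\pi_{(j,i),t}=\frac{\alpha_j}{N_j},\quad \sum_{j=1}^{J}\sum_{i=1}^{N_j}\pi_{(j,i),t}=\frac{1}{N_T}.
\end{split}
\end{equation}
In this form, $\pmb{\alpha}$ only enters through the marginal constraint on the \textit{source} side, so that setting $\alpha_j=0$ removes all samples of \textit{source} $j$ from the transport problem.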
 
 
 
%  \ros{ The role of the weight $\pmb{\alpha}$ is crucial as a good similarity measure provides a higher accuracy. Let us take into account the trivial case in which the \textit{target} dataset is one of the \textit{source} domains. In this case, without any weight, the presence of the other \textit{sources} would cause an underperforming classification. On the contrary, our approach is able to recover $\pmb{\alpha}$ as one-hot vector allocating all the weight to the \textit{source} domain that coincides with the \textit{target}.}
 



%\begin{minipage}[h]{0.45 \textwidth}
%Intuitively the embedding function determines how to organize the items in the 
%space $\mathcal{G}$ in a way to best optimize the final objective.\\
%\end{minipage}
%\begin{minipage}[h]{0.5\textwidth}
% \includegraphics[scale=0.3]{funcomposition.png}
% \captionof{figure}{MTL for $S=5$}
%\end{minipage}

%\rosanna{TODO: comparison with competitors}

\paragraph{Optimization Algorithm}  Problem \eqref{eq:DA1} can be solved with a block coordinate descent similar to the one proposed in \cite{Courty2017}. However, with the introduction of the weights $\pmb{\alpha}$, we observed numerically that this approach easily gets stuck in a local minimum with poor performance. We therefore propose the optimization approach in Algorithm \ref{alg:wjdot}, which is an alternating projected gradient descent \textit{w.r.t.} the parameters $\pmb{\theta}$ of the classifier $f_{\pmb{\theta}}$ and the weights $\pmb{\alpha}$ of the sources. Note that the sub-gradient $\nabla_{\pmb{\theta}}W$ is computed by solving the OT problem and using the fixed OT matrix to compute the gradient, similarly to \cite{Damodaran2018}. %The sub-gradient $\nabla_{\pmb{\alpha}}W$ can be computed in closed form from the optimal dual variable of the OT problem.
It is well known that the sub-gradient \emph{w.r.t.} the weights of a distribution can be expressed as $\nabla_{\pmb{w}} W(\mu,\sum_{i=1}^J w_i\delta_{x_i})=\pmb{\beta}$ where $\pmb{\beta}$ is the optimal right dual variable of the problem. By the chain rule, since each sample of source $j$ carries the weight $\alpha_j/N_j$, the sub-gradient $\nabla_{\pmb{\alpha}}W$ can be computed in closed form as
$$\nabla_{\alpha_j} W_D\left(\hat p_T^{f}, \sum _{l=1}^{J} \frac{\alpha _{l}}{N_l} \sum_{i=1}^{N_l} \delta_{(g(x_{l}^{i}),y_{l}^{i})}\right) = \frac{1}{N_j}\sum_{i=1}^{N_j} \beta^*_{j,i} $$
where $\beta^*_{j,i}$ is the optimal dual variable for sample $i$ in source domain $j$.
The definition of the projection onto the simplex $P_{\Delta^J}$ is provided in the supplementary material. Also note that, while we did not need it in the numerical experiments, Algorithm \ref{alg:wjdot} can be run on mini-batches by sub-sampling the \textit{source} and \textit{target} distributions on very large datasets, as suggested in \cite{Damodaran2018}; this strategy has been shown to provide robust estimators in \cite{fatras2019learning}.
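For completeness, one standard way to compute a projection onto the simplex is the sort-based rule sketched below. It is stated only as an illustration, under the assumption that the projection used in Algorithm \ref{alg:wjdot} is the Euclidean one; the definition in the supplementary material remains the reference. Given $\pmb{v}\in\mathbb{R}^J$, sort its entries as $u_1\geq\dots\geq u_J$, let $\rho=\max\{k : u_k-\frac{1}{k}(\sum_{i=1}^{k}u_i-1)>0\}$ and $\tau=\frac{1}{\rho}(\sum_{i=1}^{\rho}u_i-1)$. Then
\begin{equation}
\nonumber
P_{\Delta^J}(\pmb{v})_j=\max(v_j-\tau,0),\quad j=1,\dots,J.
\end{equation}
For instance, for the hypothetical vector $\pmb{v}=(0.7,0.5,-0.2)$ we get $\rho=2$ and $\tau=0.1$, hence $P_{\Delta^3}(\pmb{v})=(0.6,0.4,0)$, which is nonnegative, sums to one and is sparse.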

%\begin{figure}[ht]
%\centering\includegraphics[scale=0.45]{Figures/algorithm_stability_over50trials.pdf}
%\caption{Loss function and $\alpha$ values for different initializations.}
%\label{fig:stability2W}
%\centering\includegraphics[scale=0.45]{Figures/stability_for_big_S_nmax2000.pdf}
%\caption{Loss function and $\alpha$ sparsity for increasing number of sources $S$.}
%\label{fig:varyingS}
%\end{figure}


{\small
\begin{algorithm}[t]
\caption{Optimization for MSDA-WJDOT\label{alg:wjdot}}
\begin{algorithmic}
\STATE Initialize $\pmb{\alpha}=\frac{1}{J}\mathbf{1}_J$, the parameters $\pmb{\theta}$ of $f_{\pmb{\theta}}$, and the step sizes $\mu_{\pmb{\alpha}}$ and $\mu_{\pmb{\theta}}$.
\REPEAT %\STATE{<text>} 
\STATE $\pmb{\theta}\leftarrow \pmb{\theta}-\mu_{\pmb{\theta}}\nabla_{\pmb{\theta}} W_D\Big(\hat p_T^{f}, \sum _{j=1}^{J} \alpha _{j} \hat p_{S,j}\Big)$
\STATE $\pmb{\alpha}\leftarrow P_{\Delta^J}\Big(\pmb{\alpha}-\mu_{\pmb{\alpha}}\nabla_{\pmb{\alpha}} W_D(\hat p_T^{f}, \sum _{j=1}^{J} \alpha _{j}\hat p_{S,j}) \Big)$
\UNTIL{Convergence}
\end{algorithmic}
\end{algorithm}
}


\subsection{Related work}

\paragraph{MSDA approaches learning only the classifier}
MSDA-WJDOT is %obviously strongly 
related to JDOT \citep{Courty2017} but proposes a non-trivial extension of it to multi-source domain adaptation.
%but opens the door for a more general approach that can adapt to MSDA.
Indeed, there are two simple ways to apply JDOT to multi-source DA, which we refer to as Concatenated JDOT (\texttt{CJDOT}) and Multiple JDOT (\texttt{MJDOT}). The first one consists in concatenating all the \textit{source} samples into one \textit{source} distribution (equivalent to a uniform $\pmb{\alpha}$ if all $N_j$ are equal) and using classical JDOT on the resulting distribution. The second one consists in optimizing a sum of JDOT losses, one for each \textit{source} distribution, but again this leads to a uniform impact of the \textit{sources} on the estimation. %some \textit{sources} may be destructive and finding an optimal weighting is important. 
Clearly, neither approach is robust when some \textit{source} distributions are very different from the \textit{target} (those would have a small weight in MSDA-WJDOT). 
{Recently, \cite{montesuma2021} proposed to compute a  Wasserstein barycenter to aggregate the source marginal distributions. Once the intermediate domain is computed, they transport the Wasserstein barycenter into the target domain using the Sinkhorn algorithm \citep{Cuturi2013} with (\texttt{WTB}$_{reg}$) or without (\texttt{WTB}) class regularization.
The Wasserstein barycenter is also used in another MSDA approach, called \texttt{JCPOT} \citep{redko2019optimal}, to estimate the class proportion. This method, based on \cite{Courty2015}, has been proposed to address only \textit{target} shift (change in proportions between the classes) and satisfies a generalization bound showing that estimating the class proportion in the \textit{target} distribution is key to recovering good performances. 
MSDA-WJDOT can also handle the \textit{target} shift as a special case since the reweighting $\pmb{\alpha}$ is directly related to the proportion of classes. A crucial difference between MSDA-WJDOT and the barycenter-based approaches described above  is that they rely  only on aligning marginal distributions,  whereas the proposed method 
%estimates the class proportion and classifier simultaneously
aligns joint distributions
by optimizing a Wasserstein distance in the joint embedding/label space.}\\
Also note that MSDA-WJDOT relies on a weighting of the samples where the weight is shared inside the \textit{source} domains. This is a similar approach {to} 
%as 
DA approaches such as Importance Weighted Empirical Risk Minimization (\texttt{IWERM}) \citep{Sugiyama2007}, designed for covariate shift, that use a reweighting of all the samples. One major difference is that we only estimate a relatively small number of weights in $\pmb{\alpha}$, leading to a better-posed statistical estimation problem. It is indeed well known that estimating a continuous density, which is necessary for a proper individual reweighting of the samples, is a very difficult problem in high dimension. 
{None of the above-mentioned methods requires learning an embedding, whose estimation may be computationally expensive and unnecessary (e.g., when a pre-trained model is available). Further, there exist numerous examples of \textit{source} variability in real life (such as rotation between the full distributions) that cannot be handled with a global embedding.}

{\bf MSDA approaches estimating an embedding~}
%\paragraph{MSDA approaches estimating an embedding}
As discussed in the introduction, the majority of recent DA approaches based on deep learning \citep{ganin2016domain,sun2016deep,shen2018wasserstein} rely on the estimation of an embedding that is invariant to the domain, which means that the final classifier is shared across all domains once the embedding $g$ is estimated. Those approaches have been extended to multiple \textit{sources} with the objective that the embedded distributions of the \textit{sources} and the \textit{target} are similar. {The authors of \cite{Xu2018} propose an algorithm based on adversarial learning, named Deep Cocktail Network (\texttt{DCTN}), to learn a feature extractor, domain discriminators and \textit{source} classifiers. The domain discriminator provides multiple source-target-specific perplexity scores that are used to weight the source-specific classifier predictions and produce the \textit{target} estimation. In \cite{Peng2018}, the embedding is learned by aligning moments of the \textit{source} and \textit{target} distributions, by an approach called Moment matching (\texttt{M}$^{\pmb{3}}$\texttt{SDA}).}
Our approach differs greatly here as we do not try to cancel the variability across \textit{sources} but to embrace it by allowing the approach to automatically find the \textit{source} domains closest in terms of embedding and labeling function.

\section{Numerical Experiments}\label{sec:experiment}
\begin{figure*}[ht]
\begin{minipage}[b]{1.0\linewidth}
\centering\includegraphics[width=.95\linewidth]{Figures/algorithm.pdf}
\vspace{-.1truecm}
\caption{(Left and Center-Left) Loss function and $\alpha$ coefficients for different weight initializations. (Center-Right and Right) Loss function and $\alpha$ sparsity for an increasing number of sources $J$.}
\label{fig:stability}
\end{minipage}
\end{figure*}   

In this section, we first discuss the implementation and the robustness of MSDA-WJDOT. We then evaluate and compare it with state-of-the-art MSDA methods, on both simulated and real data. {The numerical implementation relies on the PyTorch \citep{paszke2017automatic} and Python Optimal Transport \citep{flamary2021pot} toolboxes and will be released upon publication.}
%For research reproducibility, all the Python/Pytorch \cite{paszke2017automatic} code will be released upon publication.

%\paragraph{
{\bf Practical Implementation} We used in all numerical experiments the MSDA-WJDOT solver from Algorithm \ref{alg:wjdot}. We recall that in this paper we assume access to a meaningful (i.e., discriminant) embedding $g$. This is a realistic scenario due to the wide availability of pre-trained models and the advent of reproducible research. Nevertheless, we discuss here how to estimate such an embedding when none is available. %First note that as discussed in section \ref{sec:WJDOT}, WJDOT uses the variability of the \textit{source} distributions in the embedding spaces which implies that any embedding estimated by minimizing the distributions divergence between embedding only will lose this important information. 
% Until now we considered the \textit{embedding function} $g$ as given. Indeed, we suppose that the its learning takes place in an earlier stage. It may be learned by optimizing the classification problem on a labelled dataset $\mathcal{D}$ as
% \begin{equation}
%   \operatorname*{min}_{g, c} \frac{1}{\#\mathcal{D}} \sum _{(x, y)\in\mathcal{D}} \mathcal{L}(c\circ g(x), y),
% \end{equation}
% where $c:\mathcal{G}\rightarrow\mathcal{Y}$ and $\mathcal{D}$ can be, for example, the union of the \textit{sources} $\big( \cup_{s=1}^{s} X_{s}, \cup_{s=1}^{s} Y_{s}\big)$. However, we believe that this optimization can be not very effective for our method as it results in an embedding that is invariant across \textit{sources}. 
To keep the variability of the \textit{sources} that is used by MSDA-WJDOT
we propose to estimate $g$ with the  Multi-Task Learning framework originally proposed in \cite{Caruana1997}, i.e. 
 \begin{equation}\label{standardMTL}\textstyle
 \operatorname*{min}_{g, \{f_{j}\}_{j=1}^{J}} \quad\sum _{j=1}^{J} \frac{1}{N_{j}} \sum _{i=1}^{N_{j}} \mathcal{L}(f_{j}\circ g(x_{j}^{i}), y_{j}^{i}).
\end{equation} 
This approach for estimating an embedding $g$ makes sense because it promotes a $g$ that is discriminant for all tasks but allows variability thanks to the task-specific final classifiers $f_j$, which is an assumption at the core of MSDA-WJDOT. 
We refer to MSDA-WJDOT where the embedding $g$ is learned with the above procedure as \texttt{MSDA-WJDOT}$_{MTL}$. Note that this is a two-step procedure.\\
An important question, especially when performing unsupervised DA, is how to validate the parameters, including early stopping. We propose here to use the sum of squared errors (SSE) between the \textit{target} points in the embedding and their cluster centroids. Specifically, we estimate cluster membership on the outputs of $f\circ g$. Then the SSE is computed in the embedding $g$ using the estimated clusters. Intuitively, if the SSE decreases, it means that $f$ attributes the same label to samples of the target domain that are close in the embedding.
We also explored another strategy, based on the classifier accuracy on the sources, that is discussed and reported in the supplementary material.\\
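As a sketch of the SSE criterion, and assuming (for illustration only) that the cluster of a \textit{target} sample is the class with the largest output of $f\circ g$, let $C_c=\{i : \arg\max_{c'} [f(g(x_T^i))]_{c'}=c\}$ and let $m_c=\frac{1}{|C_c|}\sum_{i\in C_c} g(x_T^i)$ be the centroid of cluster $c$ in the embedding. The validation score can then be written as
\begin{equation}
\nonumber
\mathrm{SSE}=\sum_{c\in\mathcal{Y}}\sum_{i\in C_c}\big\|g(x_T^i)-m_c\big\|^2,
\end{equation}
where the notations $x_T^i$, $C_c$ and $m_c$ are introduced here for illustration and empty clusters are skipped.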
%\ros{In this paper, we do not investigate the Lipschitz constant of $f$ and suppose it being constant in the experiments. An alternative strategy could be penalizing an estimate of this constant, but this would add an extra parameter that requires validation parameter We chose instead to limit the complexity of $f$ with early stopping, which has the advantage of making optimization shorter.}
% Moreover, as the lack of the \textit{target} labels, standard early stopping cannot be adopted during the training. To overcome the problem, w explore two alternative strategies:
% \begin{itemize}[noitemsep,topsep=0pt,leftmargin=*]
% \item WJDOT$^{sse}$: the algorithm stops when the sum of the squared errors (SSE) between the estimated outputs and the estimated cluster centroids of the unlabelled set $X$ does not increase anymore.
% \item WJDOT$^{acc}$: the used measure is the weighted accuracy on the labelled sets. More precisely, it is defined as
% $$ \sum _{s=1}^{S} \alpha _{s} \Bigg[ \sum _{j=1}^{N_{s}} \mathbf{1}_{y^{s}_{j}} \Big(f(g(x_{j}^{s}))\Big) \Bigg],$$
% where $\pmb{\alpha}$ is the learning parameter of formula \eqref{eq:DA1} and $\mathbf{1} _{y}(x)$ is the indicator function that provides 1 if $x=y$ and 0 otherwise.  
% \end{itemize}
{In addition, to provide a lower and an upper bound of the MSDA performance, we implemented supervised classification methods trained on the \textit{sources} (\texttt{Baseline}), on the \textit{target} (\texttt{Target}), and on both \textit{sources} and \textit{target} (\texttt{Baseline+Target}). We consider \texttt{Baseline} as a performance lower bound as the \textit{target} domain is not used during training, whereas \texttt{Target} and \texttt{Baseline+Target} are two unrealistic approaches that use labels in the \textit{target}. Note that \texttt{Target} trains a classifier using only \textit{target} labels and is more prone to overfitting since fewer samples are available. Since we have access to labels for \texttt{Target} and \texttt{Baseline+Target}, we validate the model by using the classification accuracy on the \textit{target} validation set, making those two approaches clear upper bounds on the attainable performance for each dataset. %\\
All methods are compared on the same dataset split into training (70\%), validation (20\%) and testing (10\%) sets, but the validation set is used only for \texttt{Baseline+Target} and \texttt{Target}.}

%We also provide performances for \texttt{Baseline} that trains a classifier that maximizes performance among all \textit{source} domains. This approach measure the ability to train a unique classifier that is robust to all domains and performs well on target. %For the last methods, the parameters and early stopping validation is based on the classification accuracy on the \textit{source} domains. 
%Finally, we also compare to two unrealistic approaches that use labels in target:   \texttt{Baseline+Target} is similar to \texttt{Baseline} but also use labels in the \textit{target}  domain. \texttt{Target} trains a classifier using only \textit{target} labels and is more prone to overfitting since less samples are available. Since we have access to labels for the two last approaches, we validate the model by using the classification accuracy on the \textit{target}  validation set making those two approaches clear upper bounds on the attainable performance for each dataset. %\\


\begin{figure*}[!ht]
\begin{minipage}[b]{.95\linewidth}
\centering\includegraphics[width=1.0\linewidth]{Figures/toy3D.pdf}
%\centering\includegraphics[width=.99\linewidth]{Figures/Exp1_acc_alpha_3_30_sources2.pdf}
\end{minipage}
\vspace{-.1truecm}
\caption{Simulated dataset. Methods' accuracy and recovered  $\boldsymbol\alpha$ weights for an increasing rotation angle of the \textit{target} samples: (Left and Center-Left) $J=3$ and (Right and Center-Right) $J=30$ sources.}
\label{toydata_boxplot}
\end{figure*}



%\paragraph{
{\bf Algorithm convergence and stability}
In Figure \ref{fig:stability} (\textit{Left} and \textit{Center-left}) we show the stability of the algorithm for different weight initializations. The loss function always converges and the $\pmb{\alpha}$ coefficients are not affected by the initialization. Moreover, we observed in practice that choosing the same step for $\pmb{\alpha}$ and $\pmb{\theta}$ does not degrade the performance, and in all experiments we validated it via early stopping. We also noticed a fast convergence of the weights $\pmb{\alpha}$, meaning that the relevant domains are quickly identified. This behavior is illustrated in Fig. \ref{fig:stability} (\textit{Right}), where the sparsity of $\pmb{\alpha}$ rapidly increases for any choice of $J$, illustrating that only a few relevant source distributions are used in practice. We also report the loss convergence for an increasing number of sources $J$ (\textit{Center-right}).





%Note that those 

 
%  \paragraph{Compared methods} We compare our approach with the following MSDA methods among which two non obvious extention of the JDOT formulation:
% \begin{itemize}[noitemsep,topsep=0pt,leftmargin=*]
% \item \texttt{IWERM} : Importance Weighted Empirical Risk Minimization \cite{Sugiyama2007} that is a variant of ERM where the samples are weighted by the ratio of the \textit{target} and \textit{source} densities. The final optimization problem minimize the sum of the IWERM for each sources.
% %$$ \frac{1}{S} \sum_{s=1}^{S} \frac{1}{N_{s}}\sum _{i=1}^{N_{s}} \Bigg(\frac{\mu(x_{s}^{i})}{\mu_{s}(x_{s}^{i})}\Bigg)^{\lambda}\mathcal{L}(y_{s}^{i}, c(x_{s}^{i})),$$
% %where $0\leq\lambda\leq1$, $c:\mathcal{X}\rightarrow \mathcal{Y}$ is a training classifier.
% \item \texttt{CJDOT}: JDOT where all the samples from the \textit{source} distribution are concatenated and have uniform weights. When all the \textit{sources} have the same number of samples \texttt{CJDOT} is equivalent to \texttt{WJDOT} with fixed weights $\pmb{\alpha}=\frac{1}{S}\mathbf{1}_S$.
% %We refer to this method as \textbf{CJDOT(E)} if OT is computed by Exact OT, and \textbf{ CJDOT(B)} when Bures-Wasserstein distance is used instead. When  not specified, we always consider the exact Wasserstein distance.
% \item \texttt{MJDOT}: we optimize $f$ with respect to $S$ independent JDOT losses (\ref{eq:JDOT}) for each \textit{source} corresponding to minimizing $\sum_s W(p_s,p^f)$. %\texttt{MJDOT} is also investigated with exact OT and Bures-Wassretsin.
% \rosanna{
% \item \texttt{DCTN}: Deep cocktail network \cite{Xu2018} (TO ADD)
% \item \textbf{MM}: Moment matching for MSDA \cite{Peng2018} (TO ADD)
% }
%  \end{itemize}

  %\\
% Further, we also provide performances for baselines simple baselines and for methods that have access to labelled samples on target:
%  \begin{itemize}[noitemsep,topsep=0pt,leftmargin=*]
%   \item \texttt{Baseline}: we train an average classifier on all the samples from \textit{sources} with no information from target. This approach measure the ability to train a unique classifier that is robust to the domain and perform well on target.
%  \item \texttt{Baseline+Target}: same as \texttt{Baseline} but also uses includes labelled samples from target. This method uses \textit{target}  samples and can be seen as a multi-task learning in the embedded space. It cannot be used in practice in the unsupervised case we are studying. 
% \item \texttt{Target}: We train directly on the labelled \textit{target}  samples. note that this method has access to a smaller number of samples as the previous one and might lead to overfitting. 


%\paragraph{
{\bf Simulated Data: Domain Shift}
We consider a classification problem similar to the one illustrated in Figure \ref{fig:visu_method}, but with 3 classes, i.e. $\mathcal{Y}=\{0,1,2\}$, and in 3D. For the \textit{sources} and \textit{target} we generate $N_{j}$ and $N_T$ samples from $J+1$ Gaussian distributions rotated by an angle $\theta _{j}\in [0, \frac{3}{2}\pi]$ around the $x$-axis. %The \textit{target}  set is obtained with teh same approach. 
As the data is already linearly separable, we set $g$ as the identity function in this experiment. %Consequently, $p_{s}$ and $p^{f}$ will be the joint distributions on the product space $\mathcal{X}\times\mathcal{Y}$. %\\
We carried out several experiments to study the effect of different parameters such as the number of \textit{source} domains $J$, of \textit{source} samples $N_{j}$ and of \textit{target} samples $N_T$. Each experiment has been repeated 50 times. 
We report in Fig. \ref{toydata_boxplot} the accuracy of all methods with $N_{j}=N_{T}=300$ for $J=3$ (Left) and $J=30$ (Right). All competing methods are clearly outperformed by \texttt{MSDA-WJDOT} both in terms of performance and variance, even for a limited number of sources. Interestingly, MSDA-WJDOT can even outperform \texttt{Target} due to its access to a larger number of samples. Another important aspect of MSDA-WJDOT is the obtained weights $\boldsymbol\alpha$, which can be used for interpretation. We show in Fig. \ref{toydata_boxplot} the $\boldsymbol\alpha$ weights that are attributed to the \textit{sources} (ordered on the $x$-axis by increasing rotation angles), for an increasing rotation angle in the \textit{target} samples ($y$-axis). The estimated weights tend to be sparse and put more mass on \textit{sources} that have a similar angle, \emph{i.e.} we automatically recover the closest \textit{sources} in the joint distribution manifold. Note that we only report the methods' performance on those two configurations; the results for other experiments can be found in the supplementary material.\\
{We next investigate how the function $\Lambda$ in \eqref{eq:lambda} behaves when the weights $\pmb{\alpha}$ are optimized \emph{w.r.t.} the first term of the bound in Theorem \ref{thm:gen2}. To this end we computed for $30$ sources an upper bound of $\Lambda$ with the $0$-$1$ loss by using the estimated $\hat f$  instead of the minimizer in 
%optimizing it as in Eq. 
\eqref{eq:lambda}. We recover a value of $0.57$ that is very close to twice the Bayes error, corresponding to the best possible value for $\Lambda$ in this experiment. On the other hand, the value of the upper bound of $\Lambda$ is $0.64$ for a uniform $\pmb{\alpha}$ and $0.65$ on average for 10000 randomly drawn values of $\pmb{\alpha}$. This suggests that optimizing $\pmb{\alpha}$ with MSDA-WJDOT leads to a minimization of $\Lambda$ in the generalization bound. 
}
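The fact that plugging in $\hat f$ yields an upper bound follows directly from the minimum in \eqref{eq:lambda}: for any $\hat f\in\mathcal{H}$,
\begin{equation}
\nonumber
\Lambda(p_S^{\boldsymbol{\alpha}},p_T)=\min_{f\in\mathcal{H}} \varepsilon_{p_S^{\boldsymbol{\alpha}}}(f)+\varepsilon_{p_T}(f)\ \leq\ \varepsilon_{p_S^{\boldsymbol{\alpha}}}(\hat f)+\varepsilon_{p_T}(\hat f).
\end{equation}
Assuming the two risks on the right-hand side are estimated on finite samples, the reported values should be read as empirical estimates of this upper bound.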
% \end{itemize}




\begin{figure*}[ht]
\begin{minipage}[b]{1.0\linewidth}
    \centering
    \includegraphics[width=.95\linewidth, height=3.4cm]{Figures/targetshift2D_exp.pdf}
    \end{minipage}
    \vspace{-.4truecm}
    \caption{Illustration of MSDA-WJDOT on the target shift problem. (Left) illustration of 2 \textit{source} distributions and the \textit{target} distribution with unbalanced classes. (Center-Left) \textit{source} distributions weighted by $\pmb{\alpha}$ and estimated target classifier. (Center-Right and Right) classification accuracy of MSDA-WJDOT and JCPOT and $\pmb\alpha$ coefficients for varying class proportions in the \textit{target} dataset.}
    \label{fig:targetshift}
\end{figure*}

%\paragraph{
{\bf Simulated Data: Target Shift~}
%\noindent {\bf Simulated Data: Target Shift~}
We consider the target shift problem with 2D \textit{source} and \textit{target} datasets that have different class proportions. The proportion of class $c$ in \textit{source} $j$ is defined as $P_{j}^{c}=\frac{\#\{y_{j}^{i}=c\}}{N_j}$ (and similarly for the \textit{target}). We address a binary classification task and sample the \textit{source} and \textit{target} datasets from the same Gaussian distribution. In Fig. \ref{fig:targetshift} (Left and Center-Left) we illustrate two \textit{source} distributions and the \textit{target} distribution, and how MSDA-WJDOT reweights the \textit{sources}. As we can see, almost all the mass is concentrated on Source 2 ($\alpha _{2} \gg \alpha_{1}$) because its class proportion is closer to that of the \textit{target}. In contrast, Source 1 has a class proportion inverted w.r.t. the \textit{target}. 
In the experiment reported in Fig. \ref{fig:targetshift} (Center-Right and Right) we have $J=20$ \textit{sources} with $P_{j}^{2}$ randomly generated between $0.1$ and $0.9$ (we ordered the \textit{sources} s.t. $P_{j}^{2}\leq P_{j+1}^{2}$). We show the average classification accuracy and the $\pmb\alpha$ weights over 50 trials for varying $P_{T}^{2}$ in $\{0.1, 0.2, \cdots, 0.9\}$. Our method always outperforms \texttt{JCPOT} and selects the \textit{sources} with a proportion of classes closer to the one in the \textit{target}. 
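The link between $\pmb{\alpha}$ and the class proportions can be made explicit: since each sample of \textit{source} $j$ carries the mass $\alpha_j/N_j$, the proportion of class $c$ in the weighted \textit{source} distribution $\sum_{j}\alpha_j\hat p_{S,j}$ is
\begin{equation}
\nonumber
\sum_{j=1}^{J}\frac{\alpha_j}{N_j}\,\#\{y_j^i=c\}=\sum_{j=1}^{J}\alpha_j P_j^{c}.
\end{equation}
As a hypothetical example (not one of the reported configurations), two \textit{sources} with $P_1^{2}=0.2$ and $P_2^{2}=0.8$ and weights $\pmb{\alpha}=(0.25,0.75)$ give a class-2 proportion of $0.25\cdot 0.2+0.75\cdot 0.8=0.65$, so any \textit{target} proportion in $[0.2,0.8]$ can be matched by a suitable $\pmb{\alpha}$.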


%Fig. \ref{fig:targetshift} (center-right and right) reports the classification accuracy and the $\pmb\alpha$ weights for varying $P_{T}^{2}$ in $\{0.1, 0.2, \cdots, 0.9\}$ ($x$-axis), with $J=20$ \textit{sources} with $P_{j}^{2}$ randomly generated between $0.1$ and $0.9$ (in the $y$-axis of the right figure, we ordered the \textit{sources} s.t. $P_{j}^{2}\leq P_{j+1}^{2}$). The $\pmb\alpha$ matrix shows a diagonal behavior, meaning that MSDA-WJDOT is able to select the \textit{sources} with a proportion of classes closer to the one in the \textit{target}. 

%\paragraph{
{\bf Object Recognition~}
% \begin{figure}[!h]
% \centering\includegraphics[scale=0.3]{Data_img.png}
% \caption{Caltech Office dataset examples.}
% \label{fig:COdataset}
% \end{figure}
%
The Caltech-Office dataset \citep{Gong2012} %\cite{Saenko2010, Gopalan2011, Gong2012,Courty2015} 
contains four different domains: Amazon, Caltech \citep{Griffin2007}, 
Webcam and DSLR. %(office environment images taken from a webcam and a high resolution digital DSLR camera, respectively). 
The variability of the different domains come from several factors: presence/absence of background, lightning conditions, noise, etc. We use for  the embedding function $g$ the output of the 7th layer of a pre-trained DeCAF model \citep{donahue2014decaf}, similarly to what was done in \cite{Courty2015}, resulting into an embedding space $\mathcal{G}\in \mathbb{R}^{4096}$.  For $f$, we employ a one-layer neural network. Training is performed with Adam optimizer with 0.9 momentum and $\epsilon = e^{-8}$. Learning rate and $\ell_2$ regularization on the parameters are validated for all methods. In JDOT extensions and MSDA-WJDOT, we also validate the $\beta$ parameter weighting the feature distance in the cost~\eqref{eq:JDOT}. \\
{The aim of this experiment is to evaluate MSDA-WJDOT and compare it with the current literature in the setting in which the embedding is given.} The performance of the methods is reported in Table \ref{tab:CaltechOffice}. {We can see that MSDA-WJDOT is state of the art, providing the best Average Rank (AR)}. Note that the DeCAF pre-trained embedding was originally designed in part to minimize the divergence across domains, which, as discussed, is not the best configuration for MSDA-WJDOT; it nevertheless performs very well, showing the robustness of MSDA-WJDOT to the embedding.
Moreover, we observed that, for each adaptation problem, MSDA-WJDOT provides a one-hot vector $\pmb{\alpha}$ (reported in the supplementary material), suggesting that only one \textit{source} is needed for the \textit{target} adaptation. Interestingly, the source selected by MSDA-WJDOT for each target is the one reported with the best performance for single-source DA in \cite{Courty2015}, which shows that MSDA-WJDOT can automatically find the relevant sources with no supervision.
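
As an illustration of how the AR column can be read, the following is a minimal sketch under an assumed tie convention, which is not specified here: tied methods share the best rank. The symbols $a_k(m)$, $r_k(m)$, and $K$ are introduced only for this illustration. Let $a_k(m)$ be the mean accuracy of method $m$ on \textit{target} $k$ and let $K$ be the number of \textit{targets}; then
\begin{gather*}
r_k(m) = 1 + \bigl\lvert \{ m' : a_k(m') > a_k(m) \} \bigr\rvert, \\
\mathrm{AR}(m) = \frac{1}{K}\sum_{k=1}^{K} r_k(m).
\end{gather*}
For instance, among the eight ranked methods of Table~\ref{tab:CaltechOffice} (\texttt{Target} and \texttt{Baseline+Target} are not ranked), the mean accuracies of \texttt{MSDA-WJDOT} give ranks $1$, $1$, $5$, and $2$ on Amazon, DSLR, Webcam, and Caltech10, respectively, so that $\mathrm{AR} = (1+1+5+2)/4 = 2.25$.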




\begin{table}[!t]
\caption{Object recognition accuracy. The last column reports the average rank (AR) across target domains. Results of methods marked by $^*$ are from \cite{montesuma2021}.
}
\label{tab:CaltechOffice}
\begin{center}
\vspace{-.2truecm}
\resizebox{\linewidth}{!}{% 
\begin{tabular}{l|c|c|c|c|c}  
\hline
\textbf{Method} & \textbf{Amazon} & \textbf{DSLR} & \textbf{Webcam} & \textbf{Caltech10} & \textbf{AR} \\
\hline
\texttt{Baseline} & $93.13 \pm 0.07 $& $94.12 \pm 0.00$ & $89.33 \pm 1.63$ & $82.65 \pm 1.84$ &  5.00 \\
\hline
\texttt{IWERM} & $93.30 \pm 0.75$ & $\pmb{100.00 \pm 0.00}$ &$89.33 \pm 1.16$ & $\pmb{91.19 \pm 2.57} $& 2.75 \\ 
\texttt{CJDOT}&  $93.71 \pm 1.57$ & $93.53\pm4.59$  & ${90.33 \pm 2.13}$ & $85.84 \pm 1.73$ &  {3.50}\\
\texttt{MJDOT}& $94.12 \pm 1.57$& $97.65\pm  2.88$ & $90.27 \pm 2.48$ & $84.72\pm 1.73$ & {3.00} \\
\texttt{JCPOT}$^*$ & $79.23\pm 3.09 $& $81.77\pm 2.81 $&$93.93\pm 0.60$&$77.91 \pm 0.45 $& {5.50}\\
\texttt{WBT}$^*$ & $59.86\pm2.48 $& $60.99\pm2.15 $&$64.13\pm 2.38$&$62.80 \pm 1.61 $&{7.25}\\
\texttt{WBT$_{reg}^*$} & $92.74\pm 0.45 $& $95.87\pm 1.43 $&$ \pmb{96.57\pm 1.76}$&$ 85.01\pm 0.84 $&{4.00}\\
\hline
\texttt{MSDA-WJDOT} & $\pmb{94.23\pm 0.90}$ & $\pmb{100.00\pm 0.00}$ & $89.33\pm 2.91$ & $85.93\pm 2.07$ & \textbf{2.25} \\
\hline
\texttt{Target} & $95.77 \pm 0.31$& $88.35\pm 2.76$ & $99.87\pm 0.65$  & $89.75\pm 0.85$ & -  \\
\texttt{Baseline+Target} & $94.78 \pm 0.48$ &  $99.88\pm 0.82$ & $100.00\pm 0.00$ & $91.89\pm 0.69$ & -  \\
\hline
\end{tabular}}
\end{center}
\end{table}
%
%\paragraph{
{\bf Music-speech Discrimination~}
{We now tackle an MSDA problem in which both the embedding and the \textit{target} classifier need to be learned. Specifically,} we consider the music-speech discrimination task introduced in \cite{Tzanetakis2002}, which includes 64 music and speech tracks of 30 seconds each. We generated 14 noisy datasets by combining the raw tracks with different types of noise from a noise dataset
%\footnote{See: {\small \url{http://spib.linse.ufsc.br/noise.html}}}
({\small \url{spib.linse.ufsc.br/noise.html}}). The noisy datasets were synthesized with the PyDub Python library \citep{pydub}. We then used the libROSA Python library \citep{librosa} to extract 13 MFCCs, computed every 10\,ms from 25\,ms Hamming windows, followed by a z-normalization per track.
We used each of the four noisy datasets F16, Buccaneer2 (B2), Factory2 (F2), and Destroyerengine (D) in turn as the \textit{target} domain, considering the remaining noisy datasets and the clean dataset as labelled \textit{source} domains. The feature extractor $g$ is a Bidirectional Long Short-Term Memory (BLSTM) recurrent network with 2 hidden layers of 50 memory blocks each. The classifier $f$ is a single feed-forward layer. Model and training details are reported in the supplementary material.
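
The per-track normalization can be sketched as follows, assuming that it is applied independently to each MFCC coefficient over the frames of a track; this choice and the symbols $x_{t,d}$, $\mu_d$, $\sigma_d$, and $T$ are assumptions made for illustration only. Let $x_{t,d}$ denote the $d$-th coefficient ($d=1,\dots,13$) of frame $t$ in a track with $T$ frames; with one frame every 10\,ms, a 30-second track yields roughly $T \approx 3000$ frames. Then
\begin{gather*}
\mu_d = \frac{1}{T}\sum_{t=1}^{T} x_{t,d},
\qquad
\sigma_d^2 = \frac{1}{T}\sum_{t=1}^{T} \left(x_{t,d}-\mu_d\right)^2, \\
\tilde{x}_{t,d} = \frac{x_{t,d}-\mu_d}{\sigma_d},
\end{gather*}
so that each coefficient has zero mean and unit variance within the track.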


Table \ref{tab:MSdiscrimination} reports the mean and standard deviation of the accuracy on the testing set of each \textit{target} dataset over 50 trials, as well as the Average Rank of each method.
First, note that on this hard adaptation problem the \texttt{Baseline+Target} approach only slightly improves on the \texttt{Baseline}, and most methods show a large variance in performance. As expected, \texttt{MSDA-WJDOT}$_{MTL}$ significantly outperforms \texttt{MSDA-WJDOT},
confirming the importance of estimating an embedding $g$ that exploits the \textit{source} variability. \texttt{MSDA-WJDOT}$_{MTL}$ achieves an Average Rank of 2.25, the best among all MSDA methods,
and also presents a low standard deviation, showing robustness to small sample size. Surprisingly, \texttt{MSDA-WJDOT}$_{MTL}$ even outperforms both the \texttt{Target} and \texttt{Baseline+Target} methods, which have access to \textit{target} labels.




\begin{table}[!t]
\caption{Music-Speech discrimination accuracy and average rank across target domains. Results of methods marked by $^*$ are from \cite{montesuma2021}.
}
\label{tab:MSdiscrimination}
\begin{center}
\vspace{-.2truecm}
\resizebox{\linewidth}{!}{% 
\begin{tabular}{l|c|c|c|c|c}  
\hline
\textbf{Method} & \textbf{F16} & \textbf{B2} & \textbf{F2} & \textbf{D} & \textbf{AR}\\
\hline
\texttt{Baseline} & $69.67 \pm 8.78$ & $57.33 \pm 7.57$ & $83.33 \pm 9.13$ & $87.33\pm 6.72$ & {9.25} \\
\hline
\texttt{IWERM}&$72.22\pm 3.93$ & $58.33 \pm 5.89$& $85.00 \pm 6.23$ & $81.64 \pm 3.33$ &  {8.75} \\
\texttt{IWERM}$_{MTL}$& $75.00\pm 0.00$ & $66.67 \pm 0.00$ & $\pmb{100.00 \pm 0.00}$ & $98.33 \pm 3.33$ & {4.00}\\
\texttt{DCTN} & $66.67\pm 3.61$ & $68.75 \pm 3.61$ & $87.50 \pm 12.5$ & $94.44 \pm 7.86$ & {6.50}\\ 
\texttt{M}$^{\pmb{3}}$\texttt{SDA} & $70.00 \pm 4.08$ & $61.67 \pm 4.08$& $85.00 \pm 11.05$ & $83.33 \pm 0.00 $ & {8.50}\\
\texttt{CJDOT} & $ 59.50 \pm 13.95 $ & $ 50.00\pm 0.00$ & $ 83.33 \pm 0.00 $ & $91.67 \pm 0.00 $ & {9.75} \\
\texttt{CJDOT}$_{MTL}$ & $ 83.83 \pm 5.11 $ & $ 74.83 \pm 1.17$ & $ \pmb{100.00 \pm 0.00}$ & $ 95.74 \pm 16.92 $ & {3.25}\\
\texttt{MJDOT}& $ 66.33  \pm 9.57 $ & $ 50.00\pm 0.00$ & $ 83.33 \pm 0.00 $ & $91.67 \pm 0.00 $&  {9.50}\\
\texttt{MJDOT}$_{MTL}$& $ 86.00 \pm 4.55 $ & $ 72.83\pm 5.73$ & $97.67 \pm 3.74$ & $97.74 \pm 8.28$ & 3.50 \\
\texttt{JCPOT}$^*$ & $88.67 \pm 1.67$& $92.55 \pm 2.11$&$ 82.41 \pm 2.22$&$ 87.89 \pm 1.39$& {5.50}\\
\texttt{WBT}$^*$ & $56.63 \pm 6.56 $& $56.88 \pm 9.54 $&$59.38 \pm 2.61$&$56.63 \pm 6.88 $&{11.75}\\
\texttt{WBT$_{reg}^{*}$}& $\pmb{94.92 \pm 0.68} $& $\pmb{96.27 \pm 1.60} $&$ 96.87 \pm 0.94$&$ 92.98 \pm 1.38 $&{3.00}\\
\hline
\texttt{MSDA-WJDOT} & $83.33 \pm 0.00$ & $58.33 \pm 6.01$ & $87.00\pm 6.05$ & $89.00 \pm 4.84$&{7.00} \\
\texttt{MSDA-WJDOT}$_{MTL}$& $87.17 \pm 4.15 $& $74.83\pm 1.20 $& $99.67 \pm 1.63$ & $\pmb{99.67 \pm 1.63} $& \textbf{2.25} \\
\hline
\texttt{Target} & $73.67 \pm 6.09$ & $69.17\pm 7.50$ & $77.33 \pm 4.73$ & $73.17\pm 9.90$ & - \\
\texttt{Baseline+Target} & $71.06\pm 9.31 $ & $67.62\pm 11.92$ & $85.33 \pm 11.85$ & $79.53 \pm 10.05$ & - \\
\hline
\end{tabular}}
\end{center}
\end{table}



\section{Conclusion}
\label{sec:conc}
We presented a novel approach for multi-source DA that relies on OT to propagate labels from the \textit{sources} and on a weighting of the \textit{source} domains that selects the most relevant \textit{sources} for the \textit{target} task at hand, in order to obtain better predictions.
We provided theoretical results that ground the proposed approach, and presented numerical experiments on simulated data that show
the effectiveness of our method on both \textit{domain} and \textit{target shift} problems. Finally, we illustrated the good performance of MSDA-WJDOT on real-world benchmark datasets.
Future work will investigate a regularization of $\boldsymbol\alpha$ and the simultaneous estimation of the embedding $g$ with MSDA-WJDOT, instead of pre-training it with multitask learning.
The embedding could indeed be updated for each new \textit{target}, which suggests an incremental formulation of MSDA-WJDOT that could be valuable in practice.
%\ros{TO ADD : future work - select only the useful sources (Fig. \ref{fig:stability})}


\subsubsection*{Acknowledgements}
This work was partially funded through the 3IA Côte
d'Azur Investments ANR-19-P3IA-0002 of the French National Research Agency (ANR), the DECIPHER-ASL -- Bando PRIN 2017 grant (2017SNW5MB - Ministry of University and Research, Italy), and
a grant from SAP SE and 5x1000, assigned to the University of Ferrara - tax return 2017. This research was produced within the framework of the Energy4Climate Interdisciplinary Center (E4C) of IP Paris and Ecole des Ponts ParisTech. This research was supported by the 3rd Programme d'Investissements d'Avenir ANR-18-EUR-0006-02. This action benefited from the support of the Chair ``Challenging Technology for Responsible Energy'' led by l'X -- Ecole polytechnique and the Fondation de l'Ecole polytechnique, sponsored by TOTAL.



\bibliography{turrisi_445}





\end{document}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


