% \documentclass{uai2022} % for initial submission
\documentclass[accepted]{uai2022} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2022} % ptmx math instead of Computer
                                         % Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2022} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}
%\usepackage{algorithm}  
%\usepackage{algorithmic} 
\usepackage{algpseudocode}
\usepackage[vlined,ruled,linesnumbered]{algorithm2e}
\usepackage{appendix}
\usepackage{hyperref}
%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{graphicx}
\usepackage{color}
\usepackage{subfigure}
\usepackage{amssymb}
\usepackage{amsmath}  
\usepackage{amsthm}
\newtheorem{definition}{\textbf{Definition}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams
\usepackage{bbm}
\usepackage{xspace}
\setlength{\textfloatsep}{4pt}
%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example
\newcommand{\alg}{CAT\xspace}

\title{Cross-Domain Adaptive Transfer Reinforcement \\ Learning Based on State-Action Correspondence (Supplementary material)}

% The standard author block has changed for UAI 2022 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{Heng You}

\author[1,2]{Tianpei Yang \thanks{Correspondence to Tianpei Yang \textless tpyang@tju.edu.cn\textgreater, Jianye Hao \textless jianye.hao@tju.edu.cn\textgreater}}

\author[1]{Yan Zheng}
\author[1]{Jianye Hao \textsuperscript{$\small{*}$}}
\author[2]{Matthew E. Taylor}

% Add affiliations after the authors

\affil[1]{College of Intelligence and Computing, Tianjin University, China}

\affil[2]{Department of Computing Science, University of Alberta and Alberta Machine Intelligence Institute, Canada}
  
  \begin{document}
\maketitle





\section*{Appendix} \label{sec:appendix}
\begin{figure}[pth]
    \centering
    \subfigure[Target Env: CentipedeEight]{
    \label{fig:explo:4,6-8}
    \includegraphics[width=0.32\textwidth]{Experiments/explo/4,6-8.pdf}
    }\hspace{-2mm}
    \subfigure[Target Env: CpCentipedeEight]{
    \label{fig:explo:4,6-cp8}
    \includegraphics[width=0.32\textwidth]{Experiments/explo/4,6-cp8.pdf}
    }\hspace{-2mm}
     \subfigure[Target Env: CentipedeSix]{
    \label{fig:explo:4,8-6}
    \includegraphics[width=0.32\textwidth]{Experiments/explo/4,8-6.pdf}
    }
    \caption{Performance of the different transfer methods: \emph{explo}, \alg, and their combination \alg + \emph{explo}.}\label{fig:explo}
\end{figure}
\textbf{Experiments Description}

For a Centipede agent, the state includes physical information such as joint angular velocities and twist angles, and the actions consist of control information for the torso bodies and legs. The same holds for the other two types of agents.
Table~\ref{tab:environments} lists the state and action dimensions and the source policy performance of all agents, where ``C-4'' denotes CentipedeFour. Although these agents differ in their state and action dimensions, they share the same dynamic principles, as well as similar physical structures and reward functions, which may benefit transfer learning between agents with different morphologies.
\begin{figure}[ht]
    \centering
    \subfigure[Target Env: CentipedeSix]{
    \label{fig:explo:4,ant-6}
    \includegraphics[width=0.32\textwidth]{Experiments/explo/4,ant-6.pdf}
    }\hspace{-2mm}
    \subfigure[Target Env: CentipedeEight]{
    \label{fig:explo:6,ant-8}
    \includegraphics[width=0.32\textwidth]{Experiments/explo/6,ant-8.pdf}
    }\hspace{-2mm}
    \subfigure[Target Env: CentipedeEight]{
    \label{fig:explo:4,cp6-8}
    \includegraphics[width=0.32\textwidth]{Experiments/explo/4,cp6-8.pdf}
    }
    \caption{Performance of the different transfer methods: \emph{explo}, \alg, and their combination \alg + \emph{explo}.}\label{fig:explo1}
\end{figure}

\begin{table}[ht]
    \centering
    \caption{The state-action dimensions and source policy performance of our environments.}\label{tab:environments}
    \begin{tabular}{cccc}
      \toprule % from booktabs package
      \bfseries Env & \bfseries State Dim & \bfseries Action Dim & \bfseries Performance \\
      \midrule % from booktabs package
      C-4 & 97 & 10 & 2600\\
      C-6 & 139 & 16 & 2000\\
      C-8 & 181 & 22 & 1500\\
      Cp-6 & 139 & 12 & 1610\\
      Cp-8 & 181 & 18 & 1440\\
      Ant & 111 & 8 & 1100\\
      \bottomrule % from booktabs package
    \end{tabular}
\end{table}


\textbf{Parameter Settings}

All networks share the same structure: two fully connected hidden layers with 64 units each.
Table~\ref{tab:CAT} lists all hyperparameters used in this paper.
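
As a rough illustration of this architecture, the network can be written out layer by layer. The sketch below assumes (this is not stated above) that the tanh activation listed in Table~\ref{tab:CAT} follows each hidden layer and that the output layer is linear; the symbols $x$, $y$, $h_i$, $W_i$, $b_i$, $d_{\mathrm{in}}$, and $d_{\mathrm{out}}$ are notation introduced here for illustration only.
\begin{align*}
  h_1 &= \tanh(W_1 x + b_1),\\
  h_2 &= \tanh(W_2 h_1 + b_2),\\
  y   &= W_3 h_2 + b_3,
\end{align*}
with $W_1 \in \mathbb{R}^{64 \times d_{\mathrm{in}}}$, $W_2 \in \mathbb{R}^{64 \times 64}$, and $W_3 \in \mathbb{R}^{d_{\mathrm{out}} \times 64}$. Under these assumptions, and ignoring any further parameters such as a learned action standard deviation, one network has
\begin{equation*}
  64\,d_{\mathrm{in}} + 4224 + 65\,d_{\mathrm{out}}
\end{equation*}
weights and biases. For instance, a policy network for C-4 with $d_{\mathrm{in}} = 97$ and $d_{\mathrm{out}} = 10$ would have $6208 + 4224 + 650 = 11082$ parameters.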


\begin{table}[h]
    \centering
    \caption{\alg hyperparameters.}\label{tab:CAT}
    \begin{tabular}{cc}
      \toprule % from booktabs package
      \bfseries Hyperparameter & \bfseries Value \\
      \midrule % from booktabs package
      Discount factor ($\gamma$) & 0.99 \\
      Activation & tanh \\
      Optimizer & Adam \\
      Learning rate & $3 \times 10^{-4}$ \\
      Clip range ($\varepsilon$) & 0.2 \\
      Evaluate steps & 200 \\
      Batch size & 64 \\
      \bottomrule % from booktabs package
    \end{tabular}
\end{table}
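
For reference, if the clip range $\varepsilon$ in Table~\ref{tab:CAT} is the clipping parameter of the standard PPO objective (an assumption made here; the main paper defines the exact loss), it enters the clipped surrogate as
\begin{align*}
  L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\big[\min\big(& r_t(\theta)\,\hat{A}_t,\\
  & \operatorname{clip}(r_t(\theta), 1-\varepsilon, 1+\varepsilon)\,\hat{A}_t\big)\big],
\end{align*}
where $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)$ is the probability ratio and $\hat{A}_t$ is an advantage estimate. With $\varepsilon = 0.2$, the ratio is clipped to the interval $[0.8, 1.2]$.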

\textbf{Different Transfer Methods}

%\MET{I'm not really sure what this section is showing. I think it's comparing CAT with two alternatives, but why are these reasonable alternatives? What design decisions do these results validate?}

In this experiment, we apply the two major transfer methods mentioned in Section 1 to the cross-domain setting through the learned state-action correspondence, in order to validate the transfer method chosen in this paper.
The transfer methods and their combination are as follows:

\begin{itemize}
    \item \emph{explo}: Reusing source policies to interact with the environment for exploration at a decreasing rate.
    \item \alg: Distilling knowledge from multiple source policy networks into the middle layers of the target policy networks used in \alg.
    \item \alg + \emph{explo}: Applying \emph{explo} while distilling knowledge from source policy networks.
\end{itemize}
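
One way to picture the \emph{explo} baseline is as a mixture of two behaviors whose mixing probability decays during training. The following is only a sketch: the reuse probability $p_t$, its initial value $p_0$, the decay factor $\lambda$, and the exponential form of the schedule are hypothetical choices for illustration, since the text above only states that reuse happens at a decreasing rate. Here $\pi_S$ stands for the selected source policy, applied through the learned state-action correspondence, and $\pi_T$ for the target policy.
\begin{equation*}
  a_t \sim
  \begin{cases}
    \pi_S(\cdot \mid s_t) & \text{with probability } p_t,\\
    \pi_T(\cdot \mid s_t) & \text{otherwise},
  \end{cases}
\end{equation*}
\begin{equation*}
  p_t = p_0\,\lambda^{t}, \qquad 0 < \lambda < 1 .
\end{equation*}
Under such a schedule, $p_t \to 0$ as $t$ grows, so the agent increasingly acts with its own policy.
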
Figures~\ref{fig:explo} and~\ref{fig:explo1} show the performance of the different transfer methods.
Apart from PPO, \emph{explo} performs worst, since it selects only one source policy at a time to guide exploration in the target task, which is insufficient and ineffective compared to \alg.
%the agent can not achieve the optimal performance only by reusing actions of source policies.
In contrast, the \alg agent extracts useful knowledge from each source policy by combining the source policy networks through adaptive weighting factors, and thus outperforms all other methods.
Finally, the performance of \alg + \emph{explo} is slightly lower than that of \alg in most cases.
We attribute this to \emph{explo} reducing the effectiveness of \alg for the reasons mentioned above.
%\alg directly distills knowledge into the middle layers of the target policy for exploration while \emph{explo} only reuses actions of source policies.
%Hence, \emph{explo} reduces the effectiveness of \alg.
These results validate the transfer method chosen in this paper.

% NOTE: necessary when ptmx or no mathfont class option is given
\providecommand{\upGamma}{\Gamma}
\providecommand{\uppi}{\pi}
\end{document}
