% \documentclass{uai2022} % for initial submission
\documentclass[accepted]{uai2022} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2022} % ptmx math instead of Computer
                                         % Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2022} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

% Optional math commands from https://github.com/goodfeli/dlbook_notation.
\input{math_commands.tex}

\usepackage{hyperref}
\usepackage{url}

\usepackage{graphics}
\usepackage{tablefootnote}
\usepackage{bbm}

\usepackage{arydshln}

\usepackage{graphicx}
\usepackage{float}
\usepackage{subcaption}

\usepackage{multirow}

\usepackage{tikz}
\usetikzlibrary{decorations.text,calc,arrows.meta}

\usepackage{algorithm}
\usepackage{algorithmic}

\usepackage{xr}

\makeatletter
\newcommand*{\addFileDependency}[1]{% argument=file name and extension
  \typeout{(#1)}
  \@addtofilelist{#1}
  \IfFileExists{#1}{}{\typeout{No file #1.}}
}
\makeatother

\newcommand*{\myexternaldocument}[1]{%
    \externaldocument{#1}%
    \addFileDependency{#1.tex}%
    \addFileDependency{#1.aux}%
}

\myexternaldocument{monteiro_150}

\title{Monotonicity Regularization: Improved Penalties and Novel Applications to Disentangled Representation Learning and Robust Classification - Supplementary material}

% The standard author block has changed for UAI 2022 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
% \author[1,*]{\href{mailto:<joao.monteiro@servicenow.com>?Subject=Your UAI 2022 paper}{João Monteiro}{}}
\author[ ]{\href{mailto:<joao.monteiro@servicenow.com>?Subject=Your UAI 2022 paper}{João Monteiro}{} \thanks{Work done while interning at Borealis AI. Currently at ServiceNow}}
\author[1]{Mohamed Osama Ahmed}
\author[1]{Hossein Hajimirsadeghi}
\author[1,2]{Greg Mori}
% Add affiliations after the authors
% \affil[1]{%
%     ServiceNow Research
% }
\affil[1]{%
    Borealis AI
}
\affil[2]{%
    Simon Fraser University
  }
% \affil[*]{%
%     Work done while at Borealis AI.
%   }
  
\begin{document}
\onecolumn

\maketitle
\vspace{1cm}
\begin{abstract}
  We study settings where gradient penalties are used alongside risk minimization with the goal of obtaining predictors satisfying different notions of monotonicity. Specifically, we present two sets of contributions. In the first part of the paper, we show that different choices of penalties define the regions of the input space where the property is observed. As such, previous methods result in models that are monotonic only in a small volume of the input space. We thus propose an approach that uses mixtures of training instances and random points to populate the space and enforce the penalty in a much larger region. As a second set of contributions, we introduce regularization strategies that enforce other notions of monotonicity in different settings. In this case, we consider applications, such as image classification and generative modeling, where monotonicity is not a hard constraint but can help improve some aspects of the model. Namely, we show that inducing monotonicity can be beneficial in applications such as: (1) allowing for controllable data generation, (2) defining strategies to detect anomalous data, and (3) generating explanations for predictions. Our proposed approaches do not introduce relevant computational overhead while leading to efficient procedures that provide extra benefits over baseline models.
\end{abstract}

\appendix
% NOTE: necessary when ptmx or no mathfont class option is given
\providecommand{\upGamma}{\Gamma}
\providecommand{\uppi}{\pi}

\section{Illustrative examples on the sphere: Mixup helps to populate the small volume interior region}
\label{sec:sphere_example}

To further illustrate the issue discussed in item 2 of Section \ref{sec:d_issues}, as well as the effect of our proposal, we discuss a simple example considering random draws from the unit $n$-sphere shown in Figure \ref{fig:spheres_illustration}, i.e., the set of points $\mathcal{B}=\{x \in \mathbb{R}^n : ||x||_2<1\}$. We further consider a concentric sphere of radius $0<r<1$ given by $\mathcal{B}_r=\{x \in \mathbb{R}^n : ||x||_2<r\}$. We are interested in the probability that a random draw from $\mathcal{B}$ lies outside of $\mathcal{B}_r$, i.e., $P(||x||_2>r)$ with $x \sim \mathcal{D}(\mathcal{B})$ for some distribution $\mathcal{D}$. We start by defining $\mathcal{D}$ as $\text{Uniform}(\mathcal{B})$, which results in $P(||x||_2>r)=1 - r^n$. In Figure \ref{fig:n_sphere}, we can see that as $n$ grows, $P(||x||_2>r)$ is very large even if $r \approx 1$, which suggests that most random draws will lie close to $\mathcal{B}$'s boundary.
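
As a brief sketch of where this expression comes from (a standard volume argument that is not spelled out above): the volume of a ball of radius $r$ in $\mathbb{R}^n$ scales as $r^n$, so for uniform draws
\begin{equation*}
    P(||x||_2 \leq r) = \frac{\text{Vol}(\mathcal{B}_r)}{\text{Vol}(\mathcal{B})} = r^n, \qquad P(||x||_2 > r) = 1 - r^n.
\end{equation*}
For instance, with the illustrative values $n=100$ and $r=0.95$ (chosen here only as an example), $r^n \approx 0.006$, so roughly $99.4\%$ of the draws fall in the outer shell of width $0.05$.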

\begin{figure}
\centering
    
\tikzset{every picture/.style={line width=0.75pt}} %set default line width to 0.75pt        

\begin{tikzpicture}[x=0.75pt,y=0.75pt,yscale=-1,xscale=1]
%uncomment if require: \path (0,300); %set diagram left start at 0, and has height of 300

%Shape: Circle [id:dp3730630354969209] 
\draw   (226,133) .. controls (226,89.92) and (260.92,55) .. (304,55) .. controls (347.08,55) and (382,89.92) .. (382,133) .. controls (382,176.08) and (347.08,211) .. (304,211) .. controls (260.92,211) and (226,176.08) .. (226,133) -- cycle ;
%Shape: Circle [id:dp49312515174935334] 
\draw   (265,133) .. controls (265,111.46) and (282.46,94) .. (304,94) .. controls (325.54,94) and (343,111.46) .. (343,133) .. controls (343,154.54) and (325.54,172) .. (304,172) .. controls (282.46,172) and (265,154.54) .. (265,133) -- cycle ;
%Straight Lines [id:da6399428958272135] 
\draw    (304,133) -- (333.44,156.74) ;
\draw [shift={(335,158)}, rotate = 218.88] [color={rgb, 255:red, 0; green, 0; blue, 0 }  ][line width=0.75]    (10.93,-3.29) .. controls (6.95,-1.4) and (3.31,-0.3) .. (0,0) .. controls (3.31,0.3) and (6.95,1.4) .. (10.93,3.29)   ;
%Straight Lines [id:da3156279706068157] 
\draw    (304,133) -- (350.75,74.56) ;
\draw [shift={(352,73)}, rotate = 488.66] [color={rgb, 255:red, 0; green, 0; blue, 0 }  ][line width=0.75]    (10.93,-3.29) .. controls (6.95,-1.4) and (3.31,-0.3) .. (0,0) .. controls (3.31,0.3) and (6.95,1.4) .. (10.93,3.29)   ;

% Text Node
\draw (324,79) node [anchor=north west][inner sep=0.75pt]   [align=left] {1};
% Text Node
\draw (322,132) node [anchor=north west][inner sep=0.75pt]   [align=left] {\textit{r}};
% Text Node
\draw (256,174) node [anchor=north west][inner sep=0.75pt]    {$\mathcal{B}$};
% Text Node
\draw (279,143) node [anchor=north west][inner sep=0.75pt]    {$\mathcal{B}_{r}$};


\end{tikzpicture}

    
    
\caption{Illustration of the unit sphere $\mathcal{B}$ and the concentric sphere $\mathcal{B}_r$ in the plane.}
\label{fig:spheres_illustration}
\end{figure}

We now evaluate the case where mixup is applied and random draws are taken in two steps: we first observe $y \sim \text{Uniform}(\mathcal{B})$, and then we perform mixup between $y$ and the origin\footnote{Similar conclusions hold for any fixed point within $\mathcal{B}$. The origin is chosen for convenience.}, i.e., $x = \lambda y$, $\lambda \sim \text{Uniform}([0,1])$. In this case, $P(||x||_2>r)$, shown in Figure \ref{fig:mixup_n_sphere} as a function of $r$ for increasing $n$, is at most $(1 - r^n)(1 - r)$, since $||x||_2>r$ requires both $||y||_2>r$ and $\lambda>r$, which are independent events. We can then observe that even for large $n$, $P(||x||_2>r)$ decays approximately linearly with $r$, i.e., we populate the interior of $\mathcal{B}$, and $x$ in this case follows a non-uniform distribution whose histogram of norms is approximately uniform for large $n$.

\begin{figure}[ht]
\begin{subfigure}{0.5\textwidth}
\includegraphics[width=\textwidth, trim={1cm 0.5cm 0 0}, clip]{figures/p_r_n.pdf}
\caption{$P(||x||_2>r)$ as a function of $r$ for various $n$ and $x \sim \text{Uniform}(\mathcal{B})$.}
\label{fig:n_sphere}
\end{subfigure}
\begin{subfigure}{0.5\textwidth}
\includegraphics[width=\textwidth, trim={1cm 0.5cm 0 0}, clip]{figures/plamb_r_n.pdf}
\caption{$P(||x||_2>r)$ as a function of $r$ for various $n$. In this case, $x = \lambda y$, $\lambda \sim \text{Uniform}([0,1])$, $y \sim \text{Uniform}(\mathcal{B})$.}
\label{fig:mixup_n_sphere}
\end{subfigure}
\caption{Illustrative example showing that uniformly distributed draws on a unit sphere in $\mathbb{R}^n$ concentrate on its boundary for large $n$. Applying mixup populates the interior of the space.}
\label{fig:sphere_example}
\end{figure}

\section{Proof-of-concept evaluation}
\label{sec:proof_of_concept_eval}

We start by describing the approach we employ to generate data with the properties required by our evaluation. Denote a design matrix by $X_{N \times D}$, such that each of its $N$ rows corresponds to a feature vector in $\mathbb{R}^D$. In order to ensure the data lies on a low-dimensional manifold, we first obtain a low-dimensional synthetic design matrix given by $X'_{N \times d}$, where each entry is sampled randomly from $\text{Uniform}([-10, 10])$. We then expand it to $\mathbb{R}^D$ by applying the following transformation:

\begin{equation}
    X = X'A,
\end{equation}
where the expansion matrix $A_{d \times D}$ is such that each of its entries is independently drawn from $\text{Uniform}([0, 1])$. Throughout our experiments, $d=\lfloor 0.3D \rfloor$ was employed.
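
To make the dimensions concrete, the settings $D \in \{100, 200, 400, 500\}$ reported in Tables \ref{tab:synth_rmse} and \ref{tab:synth_rho} correspond to
\begin{equation*}
    d = \lfloor 0.3D \rfloor \in \{30, 60, 120, 150\}.
\end{equation*}
Since every row of $X$ is a linear combination of the $d$ rows of $A$, the generated data lie in a linear subspace of $\mathbb{R}^D$ of dimension at most $d$.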

Target values for the function $f$ to be approximated are defined as sums of functions of scalar arguments applied independently over each dimension. We thus select a set of dimensions $M \subseteq [D]$ with respect to which $f$ is to be monotonic, i.e.:

\begin{equation}
    f(x)=\sum_{i \in M} g_i(x_i) + \sum_{j \in \bar{M}} h_j(x_j),
\end{equation}
where $\bar{M} = [D] \setminus M$, every $g_i:\mathbb{R} \to \mathbb{R}$ is monotonically increasing, and every $h_j:\mathbb{R} \to \mathbb{R}$ is non-monotonic.
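
One consequence of this additive form is worth spelling out. Assuming each $g_i$ is differentiable (an assumption made here only for illustration), the partial derivatives of $f$ decouple across dimensions:
\begin{equation*}
    \frac{\partial f(x)}{\partial x_i} = g_i'(x_i) \geq 0 \quad \text{for all } i \in M,
\end{equation*}
so $f$ is non-decreasing in each dimension in $M$ regardless of the values taken by the remaining dimensions. As a purely hypothetical instance, which is not necessarily the choice used in the experiments, $g_i(t)=t^3$ and $h_j(t)=\sin(t)$ would satisfy these requirements.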

We then create two evaluation datasets. One of them, referred to as the validation set, is identically distributed with respect to $X$ since it is obtained following the same procedure discussed above. In order to simulate covariate-shift, we create a test set by changing the expansion matrix $A$ to a different one:

\begin{equation}
    X_{val} = X'_{val}A, \quad X_{test} = X'_{test}A_{test},
\end{equation}
where $A_{test}$ is given by entry-wise linear interpolation between $A$, used to generate the training data, and a newly sampled expansion matrix $A'$: $A_{test} = \alpha A' + (1-\alpha)A$. The parameter $\alpha \in [0,1]$, set to $0.8$ in the reported evaluation, controls the shift between $A$ and $A_{test}$ in terms of the Frobenius norm, which in turn controls how much the test set shifts relative to the training data.
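
The role of $\alpha$ can be made explicit. Subtracting $A$ from the interpolation gives
\begin{equation*}
    A_{test} - A = \alpha (A' - A), \qquad \|A_{test} - A\|_F = \alpha \|A' - A\|_F,
\end{equation*}
so the Frobenius distance between the training and test expansion matrices grows linearly in $\alpha$: $\alpha=0$ recovers $A_{test}=A$ (no shift), $\alpha=1$ gives $A_{test}=A'$, and the value $\alpha=0.8$ used here places $A_{test}$ at $80\%$ of the distance from $A$ to $A'$.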

We thus trained models to approximate $f$ for spaces of increasing dimensions as well as for an increasing number of dimensions with respect to which $f$ is monotonic. Results are reported in Table \ref{tab:synth_rmse} in terms of RMSE on the two evaluation datasets, and in terms of monotonicity in Table \ref{tab:synth_rho} where $\hat{\rho}$ is computed both on random points and on the shifted test set. Entries in the tables correspond to the centers of 95\% confidence intervals resulting from 20 independent training runs.

We highlight the following two observations regarding the prediction performance shown in Table \ref{tab:synth_rmse}. First, different models present consistent performance across evaluations, which suggests that the different monotonicity-enforcing penalties do not significantly affect prediction accuracy. Second, the proposed approach used to generate test data under covariate-shift is effective, given the gap in performance consistently observed between the validation and the test partitions. In terms of monotonicity, results in Table \ref{tab:synth_rho} suggest that $\Omega_{random}$ and $\Omega_{train}$ are effective only on random points and only on data points, respectively, a limitation that appears to worsen as the dimension $D$ grows. $\Omega_{mixup}$, on the other hand, is effective on both sets of points, and continues to work well for growing $D$. Furthermore, covariate-shift significantly affects $\Omega_{train}$ for higher-dimensional cases, while $\Omega_{mixup}$ performs well in such a case.

\begin{table}[ht]
\centering
\resizebox{\textwidth}{!}{
\begin{tabular}{ccccccccc}
\hline
$|M|/D$  & \multicolumn{2}{c}{\textit{20/100}} & \multicolumn{2}{c}{\textit{40/200}} & \multicolumn{2}{c}{\textit{80/400}} & \multicolumn{2}{c}{\textit{100/500}} \\ \hline
                  & Valid. RMSE       & Test RMSE       & Valid. RMSE       & Test RMSE       & Valid. RMSE       & Test RMSE       & Valid. RMSE        & Test RMSE       \\ \hline
Non-mon.          & 0.007             & 0.107           & 0.006             & 0.082           & 0.007             & 0.087           & 0.011              & 0.146           \\
$\Omega_{random}$ & 0.008             & 0.117           & 0.006             & 0.081           & 0.007             & 0.093           & 0.012              & 0.125           \\
$\Omega_{train}$  & 0.008             & 0.115           & 0.006             & 0.086           & 0.007             & 0.089           & 0.012              & 0.134           \\
$\Omega_{mixup}$  & 0.008             & 0.114           & 0.007             & 0.084           & 0.008             & 0.088           & 0.012              & 0.134           \\ \hline
\end{tabular}}
\caption{Prediction performance of models trained on generated data in spaces of growing dimension ($D$) and number of monotonic dimensions ($|M|$). Different regularization strategies do not affect prediction performance. The performance gap consistently observed across the evaluation sets highlights the shift between the two sets of points. The lower the values of RMSE the better.}
\label{tab:synth_rmse}
\end{table}

\begin{table}[ht]
\centering
\begin{tabular}{ccccccccc}
\hline
$|M|/D$  & \multicolumn{2}{c}{\textit{20/100}} & \multicolumn{2}{c}{\textit{40/200}} & \multicolumn{2}{c}{\textit{80/400}} & \multicolumn{2}{c}{\textit{100/500}} \\ \hline
                  & $\hat{\rho}_{random}$   & $\hat{\rho}_{test}$   & $\hat{\rho}_{random}$   & $\hat{\rho}_{test}$   & $\hat{\rho}_{random}$   & $\hat{\rho}_{test}$   & $\hat{\rho}_{random}$    & $\hat{\rho}_{test}$   \\ \hline
Non-mon.          & 99.90\%           & 99.99\%         & 97.92\%           & 94.96\%         & 98.47\%           & 96.56\%         & 93.98\%            & 90.01\%         \\
$\Omega_{random}$ & 0.00\%            & 3.49\%          & 0.00\%            & 4.62\%          & 0.01\%            & 11.36\%         & 0.02\%             & 19.90\%         \\
$\Omega_{train}$  & 1.30\%            & 0.36\%          & 4.00\%            & 0.58\%          & 9.67\%            & 0.25\%          & 9.25\%             & 5.57\%          \\
$\Omega_{mixup}$  & 0.00\%            & 0.35\%          & 0.00\%            & 0.44\%          & 0.00\%            & 0.26\%          & 0.00\%             & 0.42\%          \\ \hline
\end{tabular}
\caption{Fraction of non-monotonic points $\hat{\rho}$ for models trained on generated data in spaces of growing dimension ($D$) and number of monotonic dimensions ($|M|$). $\Omega_{random}$ and $\Omega_{train}$ are each effective on only one of $\hat{\rho}_{random}$ or $\hat{\rho}_{test}$, while $\Omega_{mixup}$ seems effective across conditions. The lower the values of $\hat{\rho}$ the better.}
\label{tab:synth_rho}
\end{table}

\section{Models and training details for experiments reported in Section \ref{sec:monotonicity_as_regularizer}}
\label{sec:eval_details_regularizer}

For CIFAR-10, WideResNets \citep{zagoruyko2016wide} are used. The models are initialized randomly and trained both with and without the monotonicity penalty. Parameters are updated with standard stochastic gradient descent (SGD), with a learning rate that starts at 0.1 and is decreased by a factor of 10 at epochs 10, 150, 250, and 350. Training is carried out for a total of 600 epochs with a batch size of 64. For ImageNet, on the other hand, training consists of fine-tuning a pre-trained ResNet-50, where the fine-tuning phase includes the monotonicity penalty. We do so by training the model for 30 epochs on the full ImageNet training partition. In this case, given that the label set $\mathcal{Y}$ is relatively large, using the standard ResNet-50 would result in small slices $S_k$. To avoid that, we add an extra final convolutional layer with $W=15K$. Training is once more carried out with SGD, in this case using a learning rate of 0.001 that is reduced by a factor of 5 at epoch 20. In both cases, the group monotonicity property is enforced at the last convolutional layer. The strength $\gamma$ of the monotonicity penalty and the inverse temperature $\mu$ used to compute $\Omega_{group}$ are set to 1 and 50, respectively, for CIFAR-10, and to 5 and 10 for ImageNet. Momentum and weight decay are also employed, with parameters set to 0.9 and 0.0001, respectively. For MNIST classifiers, training is performed for 20 epochs using a batch size of 64 and the Adadelta optimizer \citep{zeiler2012adadelta} with a learning rate of 1.
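
For reference, the step schedules described above can be written out explicitly. Reading each reduction as a division of the learning rate $\eta$ applied from the listed epoch $e$ onwards (an assumption about the exact indexing convention), we have
\begin{equation*}
    \eta_{\text{CIFAR-10}}(e) =
    \begin{cases}
        10^{-1}, & e < 10, \\
        10^{-2}, & 10 \leq e < 150, \\
        10^{-3}, & 150 \leq e < 250, \\
        10^{-4}, & 250 \leq e < 350, \\
        10^{-5}, & e \geq 350,
    \end{cases}
    \qquad
    \eta_{\text{ImageNet}}(e) =
    \begin{cases}
        10^{-3}, & e < 20, \\
        2 \times 10^{-4}, & e \geq 20.
    \end{cases}
\end{equation*}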


\section{Enforcing group monotonicity under small samples}
\label{sec:small_sample_group_monotonicity}

Using CIFAR-10, we further evaluate how the proposed group monotonicity penalty behaves in data-constrained settings, i.e., we check whether the property can be enforced in small-sample regimes. We do so by sub-sampling the original training data, randomly selecting a fraction of the training images uniformly across classes. We then train the same WideResNet with the same computation budget, in terms of number of iterations, as the models trained on the complete set of images. The learning rate schedule also matches that of training on the full dataset, in that the learning rate is reduced at exactly the same iterations in all cases. Results are reported in Table \ref{tab:constrained_prediction_eval} for sub-samples corresponding to 10\%, 30\%, and 60\% of CIFAR-10. The three settings consistently show that, for group monotonic predictors, predictions obtained from the total activation of feature slices approximate the prediction performance of the underlying model, i.e., the extent to which the underlying model is able to accurately predict correct classes upper bounds the resulting ``level of monotonicity''. In simple terms, the better the classifier, the more group monotonic it can be made.

\begin{table}[]
\centering
\begin{tabular}{ccc}
\hline
Model                 & $\argmax_{k \in \mathcal{Y}} h(x)_k$ & $\argmax_{k \in \mathcal{Y}} T_k(x)$ \\ \hline
\multicolumn{3}{c}{10\%}                                                 \\ \hline
WideResNet            & 85.68\%          & 16.35\%                       \\
\emph{Mono}WideResNet & 85.77\%          & 82.21\%                       \\ \hline
\multicolumn{3}{c}{30\%}                                                 \\ \hline
WideResNet            & 92.12\%          & 14.51\%                       \\
\emph{Mono}WideResNet & 92.42\%          & 88.88\%                       \\ \hline
\multicolumn{3}{c}{60\%}                                                 \\ \hline
WideResNet            & 94.51\%          & 10.08\%                       \\
\emph{Mono}WideResNet & 94.86\%          & 93.81\%                       \\ \hline
\end{tabular}
\caption{Top-1 accuracy obtained by both standard and group monotonic models on sub-samples of CIFAR-10. Prediction performance obtained by classifiers defined by the total activations is upper bounded by the performance obtained at the output layer for monotonic models.}

\label{tab:constrained_prediction_eval}
\end{table}
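
The gaps can be read directly off Table \ref{tab:constrained_prediction_eval}. For the \emph{Mono}WideResNet, the difference between output-layer accuracy and slice-based accuracy is
\begin{equation*}
    85.77 - 82.21 = 3.56, \qquad 92.42 - 88.88 = 3.54, \qquad 94.86 - 93.81 = 1.05
\end{equation*}
percentage points for the 10\%, 30\%, and 60\% sub-samples, respectively, whereas for the standard WideResNet the same gap exceeds 69 points in every case (e.g., $85.68 - 16.35 = 69.33$ at 10\%).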

\section{Selecting feature maps to compute visual explanations}
\label{sec:monocam}

Approaches based on Class Activation Maps (CAM), such as Grad-CAM and its variations \citep{selvaraju2017grad,chattopadhay2018grad}, seek to extract \emph{explanations} from convolutional models. By explanation, we mean indications of the properties of the data that imply the predictions of a given model. Under such a framework, one can obtain so-called explanation heat-maps through the following steps: (1) Compute a weighted sum of activations of feature maps in a chosen layer; (2) Upscale the results in order to match the dimensions of the input data; (3) Superimpose results onto the input data. Specifically for applications to image data, following those steps results in highlighting the patches of the input that were deemed relevant to yield the observed predictions. Different approaches were then introduced in order to define the weights used in the first step. A very common choice is to use the total gradient of the output corresponding to the prediction with respect to activations of each feature map.
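
The first two steps can be summarized in symbols. The notation below is introduced here for illustration only: $A^j(x)$ denotes the $j$-th feature map of the chosen layer, $w_j$ its weight, $\text{up}(\cdot)$ an upscaling operator to the input resolution, $\hat{y}$ the predicted class, and $(u,v)$ spatial positions within a feature map.
\begin{equation*}
    L(x) = \sum_{j} w_j A^j(x), \qquad H(x) = \text{up}\big(L(x)\big), \qquad w_j = \sum_{u,v} \frac{\partial h(x)_{\hat{y}}}{\partial A^j_{u,v}(x)}.
\end{equation*}
The last expression corresponds to the gradient-based choice of weights mentioned above. Grad-CAM, for instance, averages rather than sums the gradients over spatial positions and applies a ReLU to $L(x)$; such details vary across CAM variants.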

For the case of group monotonic classifiers, we are interested in verifying whether one can define useful explanation heat-maps by considering only the feature slices corresponding to the predicted class, i.e., for a given input pair $(x,y)$, we compute explanation heat-maps considering only its corresponding feature activation slice $S_y(x)$. We thus design an experiment to evaluate the effectiveness of such an approach, using external auxiliary classifiers to perform predictions on test data that was occluded using explanation heat-maps obtained from different models and sets of representations. In other words, we use the explanation maps to remove from the data the parts that were not indicated as relevant. We then assume that good explanation maps are such that classifiers are able to correctly classify occluded data, since relevant patches are preserved. In more detail, occlusions are computed by first applying a CAM operator given a model $h$ and data $x$, which results in a heat-map with entries in $[0,1]$. We then use such a heat-map as a multiplicative mask to get an occluded version of $x$, denoted $x'$, i.e.:

\begin{equation}
    x' = \text{CAM}(x, h) \circ x,
\end{equation}
where the operator $\circ$ indicates element-wise multiplication. An example of such a procedure is shown in Figure \ref{fig:cam_occluison_example}. We apply the above procedure to all of the validation data, and then use the resulting points to assess the prediction performance of auxiliary classifiers.
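
Written entry-wise, and assuming an image $x \in \mathbb{R}^{C \times U \times V}$ with a single-channel heat-map that is shared across the $C$ color channels (an assumption about the implementation, introduced here for illustration), the occlusion reads
\begin{equation*}
    x'_{c,u,v} = \text{CAM}(x, h)_{u,v} \, x_{c,u,v}, \qquad \text{CAM}(x, h)_{u,v} \in [0,1],
\end{equation*}
so pixels whose heat-map value is close to 1 are kept nearly unchanged, while pixels whose value is close to 0 are suppressed towards zero.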

\begin{figure}
    \centering
    \includegraphics[width=0.8\textwidth]{figures/occluded.png}
    \caption{Example of explanation heat-map and corresponding occlusion obtained with Grad-CAM and a ResNet-50 trained on ImageNet. The example belongs to the validation set and corresponds to the class \emph{snowmobile}.}
    \label{fig:cam_occluison_example}
\end{figure}

Explanation maps are computed using the same models discussed in Section \ref{sec:group_monotonicity_prediction} for ImageNet. The CAM operator corresponds to a variation of Grad-CAM++ \citep{chattopadhay2018grad} where the model activations, rather than the gradients, are directly employed for weighing feature maps. We consider 4 auxiliary pre-trained classifiers corresponding to ResNext-50 \citep{xie2017aggregated}, MobileNet-v3 \citep{howard2019searching}, VGG-16 \citep{simonyan2014very}, and SqueezeNet \citep{iandola2016squeezenet}. Results are reported in Table \ref{tab:explanation_maps}, which also includes the reference performance of the auxiliary classifiers on the standard validation set in order to provide an idea of the gap in performance resulting from removing parts of test images via occlusion. We highlight the performance reported in the last row of the table. In that case, explanation maps for the group monotonic model are computed from only the features of the class slice, which is enough to closely match the performance of a standard ResNet-50 with full access to the features. This suggests that representations learned by group monotonic models are such that all the information required to explain a given class is contained in the slice reserved for that class.

\begin{table}[]
\centering
\begin{tabular}{ccccc}
\hline
\multirow{2}{*}{Model ($h$)}             & \multicolumn{4}{c}{Aux. classifier}              \\ \cline{2-5} 
                                   & ResNext-50 & MobileNet-v3 & VGG-16  & SqueezeNet \\ \hline
Reference perf.                          & 77.62\%    & 74.04\%      & 71.59\% & 58.09\%    \\ \hdashline
ResNet-50                          & 72.94\%    & 68.31\%      & 67.34\% & 49.95\%    \\
\emph{Mono}ResNet-50               & 72.88\%    & 68.75\%      & 66.99\% & 48.92\%    \\
\emph{Mono}ResNet-50 (Constrained) & 72.44\%    & 66.55\%      & 66.92\% & 45.83\%    \\ \hline
\end{tabular}
\caption{Top-1 accuracy of auxiliary classifiers evaluated on data created by occluding patches deemed irrelevant by explanation heat-maps given by different models. The performance of monotonic classifiers when constrained to consider only the feature maps within the slice corresponding to their prediction is further reported and shown to closely match the performance of cases where the full set of features is considered.}

\label{tab:explanation_maps}
\end{table}
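
To quantify how closely the constrained model matches the standard one, the differences can be computed from the entries of Table \ref{tab:explanation_maps}. The accuracy drop from the standard ResNet-50 to the constrained \emph{Mono}ResNet-50 is
\begin{equation*}
    72.94 - 72.44 = 0.50, \quad 68.31 - 66.55 = 1.76, \quad 67.34 - 66.92 = 0.42, \quad 49.95 - 45.83 = 4.12
\end{equation*}
percentage points for ResNext-50, MobileNet-v3, VGG-16, and SqueezeNet, respectively. For comparison, occlusion with the standard ResNet-50 maps already costs
\begin{equation*}
    77.62 - 72.94 = 4.68, \quad 74.04 - 68.31 = 5.73, \quad 71.59 - 67.34 = 4.25, \quad 58.09 - 49.95 = 8.14
\end{equation*}
points relative to the reference performance on the unoccluded validation set.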

\section{Examples of explanation heat-maps and occluded data}

In Figure~\ref{fig:explanation_heat_maps}, we show examples of explanation heat-maps obtained using different approaches. Corresponding occlusions resulting from the different approaches are shown in Figure~\ref{fig:occlusions}.

\begin{figure}%[h]
\centering

\begin{subfigure}{0.95\textwidth}
\includegraphics[width=\textwidth]{figures/hm_1.png}
\end{subfigure}
\begin{subfigure}{0.95\textwidth}
\includegraphics[width=\textwidth]{figures/hm_2.png}
\end{subfigure}
\begin{subfigure}{0.95\textwidth}
\includegraphics[width=\textwidth]{figures/hm_3.png}
\end{subfigure}
\begin{subfigure}{0.95\textwidth}
\includegraphics[width=\textwidth]{figures/hm_3.png}
\end{subfigure}

\caption{Examples of explanation heat-maps superimposed onto images. From left to right we have the original image, results obtained from a ResNet-50, a \emph{mono}ResNet-50, and a \emph{mono}ResNet-50 where the CAM operator only accesses the slice corresponding to the underlying class. All are obtained with Grad-CAM.}
\label{fig:explanation_heat_maps}
\end{figure}


\begin{figure}%[h]
\centering

\begin{subfigure}{0.95\textwidth}
\includegraphics[width=\textwidth]{figures/occ_1.png}
\end{subfigure}
\begin{subfigure}{0.95\textwidth}
\includegraphics[width=\textwidth]{figures/occ_2.png}
\end{subfigure}
\begin{subfigure}{0.95\textwidth}
\includegraphics[width=\textwidth]{figures/occ_3.png}
\end{subfigure}
\begin{subfigure}{0.95\textwidth}
\includegraphics[width=\textwidth]{figures/occ_4.png}
\end{subfigure}

\caption{Examples of occluded data using explanation heat-maps. From left to right we have the original image, results obtained from a ResNet-50, a \emph{mono}ResNet-50, and a \emph{mono}ResNet-50 where the CAM operator only accesses the slice corresponding to the underlying class. All are obtained with Grad-CAM.}
\label{fig:occlusions}
\end{figure}

\section{Analysis of color sequences for generated data}
\label{sec:color_analysis}

\begin{figure}
    \centering
    \includegraphics[width=0.3\textwidth]{figures/RGB_color_wheel_24.png}
    \caption{Hue circle of RGB images. Original image from: \url{https://en.wikipedia.org/wiki/Hue}.}
    \label{fig:hue_circle}
\end{figure}

We performed a set of experiments in order to evaluate whether some kind of ordering could be observed once we generate data for increasing values of $z$, specifically on dimensions that correspond to colors. To do that, we created an increasing sequence of values by defining a uniform grid in $[0,1]$ with 50 steps. We then encoded a particular image, and decoded latent vectors obtained by replacing the $z$ value in the dimension corresponding to \emph{floor color} with each of the values in the sequence.
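
In symbols, and under the assumption that the grid includes both endpoints (this is not specified above), the sequence of values and the decoded images are
\begin{equation*}
    s_t = \frac{t-1}{49}, \quad t = 1, \dots, 50, \qquad \hat{x}^{(t)} = \text{Dec}\big(z_1, \dots, z_{c-1}, s_t, z_{c+1}, \dots\big),
\end{equation*}
where $z = \text{Enc}(x)$ is the latent code of the chosen image, $c$ indexes the latent dimension associated with floor color, and $\text{Enc}$ and $\text{Dec}$ are placeholder names, introduced here, for the encoder and the decoder.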

Generated sequences of images are shown in Figures \ref{fig:floor_color_traversal_baseline} and \ref{fig:floor_color_traversal_reg} for the base and monotonic models, respectively. In each case, we plot the images on the left and their bottom-left patches of size 10x10 on the right, so as to highlight the color sequences that we observe with such an approach. Surprisingly, we observed that monotonic models tend to generate colors in a sequence that matches the hue circle for RGB images, shown in Figure \ref{fig:hue_circle} for reference. Besides visually verifying that to be the case across a number of generated examples, in Table \ref{tab:hue_eval} in Section \ref{sec:disentanglement_analysis} we check the fraction of the dataset for which such sequences of patches are sorted in terms of their hue angles.
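
One way to make the sortedness check concrete is sketched below. The exact hue computation used in the evaluation is not specified here, so the following definition is an assumption. For a patch with mean color $(R, G, B)$, a common definition of the hue angle is
\begin{equation*}
    \theta = \operatorname{atan2}\big(\sqrt{3}\,(G - B),\; 2R - G - B\big) \bmod 2\pi,
\end{equation*}
which gives $\theta = 0$ for pure red, $\theta = 2\pi/3$ for pure green, and $\theta = 4\pi/3$ for pure blue. Under this definition, a sequence of patches would be counted as sorted if $\theta_1 \leq \theta_2 \leq \dots \leq \theta_{50}$, possibly up to a wrap-around of the angle.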

\begin{figure}%[h]
\centering

\begin{subfigure}{0.4\textwidth}
\centering
\includegraphics[width=\textwidth]{figures/baseline_images.png}
\caption{Data for increasing values of the latent dimension associated with \emph{floor color}.}
\end{subfigure}
\hspace{1cm}
\begin{subfigure}{0.4\textwidth}
\centering
\includegraphics[width=\textwidth]{figures/baseline_patches.png}
\caption{Bottom-left 10x10 patches of generated images.}
\end{subfigure}

\caption{Data generated by the \emph{standard model} for traversals of $z$ on the dimension corresponding to \emph{floor color}.}
\label{fig:floor_color_traversal_baseline}
\end{figure}

\begin{figure}%[h]
\centering

\begin{subfigure}{0.4\textwidth}
\centering
\includegraphics[width=\textwidth]{figures/regularized_images.png}
\caption{Data for increasing values of the latent dimension associated with \emph{floor color}.}
\end{subfigure}
\hspace{1cm}
\begin{subfigure}{0.4\textwidth}
\centering
\includegraphics[width=\textwidth]{figures/regularized_patches.png}
\caption{Bottom-left 10x10 patches of generated images.}
\end{subfigure}

\caption{Data generated by the \emph{monotonic model} for traversals of $z$ on the dimension corresponding to \emph{floor color}.}
\label{fig:floor_color_traversal_reg}
\end{figure}


\section{Examples of data generated with standard and monotonic models}

We illustrate data generated for linear trajectories in the latent space of standard and monotonic models. To do that, we start from a fixed image, and modify one generative factor at a time. We then generate images by feeding the decoder with points in the linear trajectory between the outputs of the encoder for the pair of images. Generated data for each modified factor are shown in Figures~\ref{fig:interpolations_0_floor_hue}, \ref{fig:interpolations_1_wall_hue}, \ref{fig:interpolations_2_object_hue}, \ref{fig:interpolations_3_scale}, \ref{fig:interpolations_4_shape}, and \ref{fig:interpolations_5_orientation}.
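
Concretely, writing $z_a$ and $z_b$ for the encoder outputs of the two images in a pair (notation introduced here for illustration), the trajectory can be parameterized as
\begin{equation*}
    z(t) = (1-t)\, z_a + t\, z_b, \qquad \hat{x}(t) = \text{Dec}\big(z(t)\big),
\end{equation*}
where $\text{Dec}$ is a placeholder name for the decoder. Values $t \in [0,1]$ interpolate between the two codes, and values outside $[0,1]$ would extend the same line beyond the pair. The range and number of values of $t$ used to produce the figures are not stated here.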

\begin{figure}%[h]
\centering

\begin{subfigure}{0.6\textwidth}
\centering
\includegraphics[width=\textwidth]{figures/0_floor_hue.png}
\caption{Input pair.}
\end{subfigure}

\begin{subfigure}{0.4\textwidth}
\centering
\includegraphics[width=\textwidth]{figures/0_floor_hue_base.png}
\caption{Data generated by the standard model.}
\end{subfigure}
\hspace{1cm}
\begin{subfigure}{0.4\textwidth}
\centering
\includegraphics[width=\textwidth]{figures/0_floor_hue_reg.png}
\caption{Data generated by the monotonic model.}
\end{subfigure}

\caption{Data generated by moving along the line through the latent representations of two inputs that differ in a single factor. Varying generative factor: floor color.}
\label{fig:interpolations_0_floor_hue}
\end{figure}



\begin{figure}%[h]
\centering

\begin{subfigure}{0.6\textwidth}
\centering
\includegraphics[width=\textwidth]{figures/1_wall_hue.png}
\caption{Input pair.}
\end{subfigure}

\begin{subfigure}{0.4\textwidth}
\centering
\includegraphics[width=\textwidth]{figures/1_wall_hue_base.png}
\caption{Data generated by the standard model.}
\end{subfigure}
\hspace{1cm}
\begin{subfigure}{0.4\textwidth}
\centering
\includegraphics[width=\textwidth]{figures/1_wall_hue_reg.png}
\caption{Data generated by the monotonic model.}
\end{subfigure}

\caption{Data generated by moving along the line through the latent representations of two inputs that differ in a single factor. Varying generative factor: wall color.}
\label{fig:interpolations_1_wall_hue}
\end{figure}



\begin{figure}%[h]
\centering

\begin{subfigure}{0.6\textwidth}
\centering
\includegraphics[width=\textwidth]{figures/2_object_hue.png}
\caption{Input pair.}
\end{subfigure}

\begin{subfigure}{0.4\textwidth}
\centering
\includegraphics[width=\textwidth]{figures/2_object_hue_base.png}
\caption{Data generated by the standard model.}
\end{subfigure}
\hspace{1cm}
\begin{subfigure}{0.4\textwidth}
\centering
\includegraphics[width=\textwidth]{figures/2_object_hue_reg.png}
\caption{Data generated by the monotonic model.}
\end{subfigure}

\caption{Data generated by moving along the line through the latent representations of two inputs that differ in a single factor. Varying generative factor: object color.}
\label{fig:interpolations_2_object_hue}
\end{figure}



\begin{figure}%[h]
\centering

\begin{subfigure}{0.6\textwidth}
\centering
\includegraphics[width=\textwidth]{figures/3_scale.png}
\caption{Input pair.}
\end{subfigure}

\begin{subfigure}{0.4\textwidth}
\centering
\includegraphics[width=\textwidth]{figures/3_scale_base.png}
\caption{Data generated by the standard model.}
\end{subfigure}
\hspace{1cm}
\begin{subfigure}{0.4\textwidth}
\centering
\includegraphics[width=\textwidth]{figures/3_scale_reg.png}
\caption{Data generated by the monotonic model.}
\end{subfigure}

\caption{Data generated by moving along the line through the latent representations of two inputs that differ in a single factor. Varying generative factor: scale.}
\label{fig:interpolations_3_scale}
\end{figure}


\begin{figure}%[h]
\centering

\begin{subfigure}{0.6\textwidth}
\centering
\includegraphics[width=\textwidth]{figures/4_shape.png}
\caption{Input pair.}
\end{subfigure}

\begin{subfigure}{0.4\textwidth}
\centering
\includegraphics[width=\textwidth]{figures/4_shape_base.png}
\caption{Data generated by the standard model.}
\end{subfigure}
\hspace{1cm}
\begin{subfigure}{0.4\textwidth}
\centering
\includegraphics[width=\textwidth]{figures/4_shape_reg.png}
\caption{Data generated by the monotonic model.}
\end{subfigure}

\caption{Data generated by moving along the line through the latent representations of two inputs that differ in a single factor. Varying generative factor: shape.}
\label{fig:interpolations_4_shape}
\end{figure}



\begin{figure}%[h]
\centering

\begin{subfigure}{0.6\textwidth}
\centering
\includegraphics[width=\textwidth]{figures/5_orientation.png}
\caption{Input pair.}
\end{subfigure}

\begin{subfigure}{0.4\textwidth}
\centering
\includegraphics[width=\textwidth]{figures/5_orientation_base.png}
\caption{Data generated by the standard model.}
\end{subfigure}
\hspace{1cm}
\begin{subfigure}{0.4\textwidth}
\centering
\includegraphics[width=\textwidth]{figures/5_orientation_reg.png}
\caption{Data generated by the monotonic model.}
\end{subfigure}

\caption{Data generated by moving along the line through the latent representations of two inputs that differ in a single factor. Varying generative factor: orientation.}
\label{fig:interpolations_5_orientation}
\end{figure}

\clearpage
\bibliography{bibliography}

\end{document}

