%\comment{ % \documentclass{uai2023} % for initial submission
\documentclass[accepted]{uai2023} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2023} % ptmx math instead of Computer
                                         % Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2023} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams


% Additional user-added packages
\definecolor{linkblue}{rgb}{0.0, 0.3, 0.6}
\definecolor{dartmouthgreen}{rgb}{0.05, 0.5, 0.06}
\definecolor{frenchblue}{rgb}{0.0, 0.45, 0.73}
\definecolor{mediumred-violet}{rgb}{0.73, 0.2, 0.52}
\definecolor{darkorange}{rgb}{0.80, 0.439, 0}
\definecolor{orange(ryb)}{rgb}{0.98, 0.6, 0.01}
\definecolor{darkorchid}{rgb}{0.6, 0.2, 0.8}
\definecolor{independence}{RGB}{187,212,113}
\definecolor{independence-text}{RGB}{145,176,53} % A little darker
\definecolor{weakdependence}{RGB}{99,164,108}
\definecolor{weakdependence-text}{RGB}{68,117,75} % A little darker
\definecolor{moderatedependence}{RGB}{35,51,41}
\definecolor{moderatedependence-text}{RGB}{52,76,61} % A little lighter

% Add user-defined packages
\usepackage{hyperref}
\hypersetup{
  colorlinks = true,
  linkcolor=linkblue,   % color of internal links
  citecolor=linkblue,   % color of links to bibliography
  urlcolor=linkblue,    % color of external links
  pagebackref=true,
  implicit=false,
  bookmarks=true,
  bookmarksopen=true,
  pdfdisplaydoctitle=true
}
\usepackage{amsmath}
\usepackage{amsthm}
\usepackage{amssymb}
\theoremstyle{definition}
\newtheorem{definition}{Definition}
\newtheorem{theorem}{Theorem}
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{assumption}{Assumption}
\usepackage{xcolor}
\usepackage{float}
\usepackage{adjustbox}
\usepackage{bbm}
\usepackage{cancel}
\usepackage{soul}


\usepackage[titlenumbered,ruled]{algorithm2e}

\usepackage[inline]{trackchanges}
\addeditor{RuG}
\addeditor{coopermj}
\addeditor{TODO}
\addeditor{Attention}

\newcommand{\tikzcircle}[2][red,fill=red]{\tikz[baseline=-0.5ex]\draw[#1,radius=#2] (0,0) circle ;}%

% \newcommand{\prob}{\Pr}
\def\prob#1{\Pr(\, #1 \,)}
\def\Cprob#1#2{\prob{ #1 \,|\,#2}}

\newcommand{\coopermj}[1]{\textcolor{blue}{\textbf{coopermj}: \textit{#1}}}
\newcommand{\aligharari}[1]{\textcolor{purple}{\textbf{aligharari}: \textit{#1}}}
\newcommand{\rgk}[1]{\textcolor{red}{\textbf{rahul suggests:}: \textit{#1}}}
\def\eg{{\em e.g.},\ }
\def\ie{{\em i.e.},\ }
\long\def\comment#1{}

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example
%}

\title{Copula-Based Deep Survival Models for Dependent Censoring} % Instructions for Authors: Title in Title Case

% The standard author block has changed for UAI 2023 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is authomatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1,2]{Ali Hossein Gharari Foomani$^{*,}$}
\author[3,5]{Michael Cooper$^{*,\dagger,}$}
\author[1,2]{Russell Greiner}
\author[3,4,5]{Rahul G. Krishnan}
% Add affiliations after the authors
\affil[1]{%
Department of Computing Science, University of Alberta
}
\affil[2]{%
Alberta Machine Intelligence Institute
}
\affil[3]{%
Department of Computer Science, University of Toronto
}
\affil[4]{%
Department of Laboratory Medicine and Pathobiology, University of Toronto
}
\affil[5]{%
Vector Institute
}
  
\begin{document}
\maketitle
\def\thefootnote{*}\footnotetext{These authors contributed equally to this work.}\def\thefootnote{\arabic{footnote}}

\def\thefootnote{$\dagger$}\footnotetext{Correspondence to \href{mailto:coopermj@cs.toronto.edu}{coopermj@cs.toronto.edu}.}\def\thefootnote{\arabic{footnote}}


\begin{abstract}
A survival dataset describes a set of instances (\eg patients) and provides, for each, 
either the time until an event (\eg death), 
or the censoring time (\eg when lost to follow-up --
which is a lower bound on the time until the event). 
We consider the challenge of survival prediction:
learning, from such data, a predictive model 
that can produce an individual survival distribution for a novel instance. 
Many contemporary methods of survival 
prediction
implicitly assume that the event and censoring distributions are independent conditional on the instance’s covariates
– a strong assumption that is difficult to verify 
(as we observe only one outcome for each instance)
and which can induce significant bias when it does not hold. 
This paper presents a parametric model of survival that extends modern non-linear survival analysis 
by relaxing
the assumption of conditional independence. On synthetic and semi-synthetic
data, our approach significantly improves estimates of survival distributions compared to the standard that assumes conditional independence in the data.\footnotemark
\end{abstract} 
\footnotetext{Code available at \href{https://github.com/rgklab/copula_based_deep_survival}{this GitHub repository}.}

\begin{figure*}
\centering
\begin{tikzpicture}[->, thick]
\node[circle, draw, fill=gray!30, minimum size=24pt] at (0, -0.75) (x) {$X$};
\node[circle, draw, minimum size=24pt] (te) at (1.5, 0) {$T_E$};
\node[circle, draw, minimum size=24pt] (tc) at (1.5, -1.5) {$T_C$};
\node[circle, draw, fill=gray!30, minimum size=24pt] (t) at (3, 0) {$T_{\text{obs}}$};
\node[circle, draw, fill=gray!30, minimum size=24pt] (d) at (3, -1.5) {$\delta$};
  \path[every node/.style={sloped,anchor=south,auto=false}]
        (x) edge             [frenchblue] node {} (te)
        (x) edge             [mediumred-violet] node {} (tc)
        (te) edge            [darkorange] node {} (d)
        (te) edge            [dartmouthgreen] node {} (t)
        (tc) edge            [darkorange] node {} (d)
        (tc) edge            [dartmouthgreen] node {} (t);
\end{tikzpicture}
\qquad\qquad
\begin{tikzpicture}[->, thick]
\node[circle, draw, fill=gray!30, minimum size=24pt] at (0, -0.75) (x) {$X$};
\node[circle, draw, minimum size=24pt] (te) at (1.5, 0) {$T_E$};
\node[circle, draw, minimum size=24pt] (tc) at (1.5, -1.5) {$T_C$};
\node[circle, draw, fill=gray!30, minimum size=24pt] (t) at (3, 0) {$T_{\text{obs}}$};
\node[circle, draw, fill=gray!30, minimum size=24pt] (d) at (3, -1.5) {$\delta$};
  \path[every node/.style={sloped,anchor=south,auto=false}]
        (x) edge             [frenchblue] node {} (te)
        (x) edge             [mediumred-violet] node {} (tc)
        (te) edge            [black] node {} (tc)
        (te) edge            [darkorange] node {} (d)
        (te) edge            [dartmouthgreen] node {} (t)
        (tc) edge            [darkorange] node {} (d)
        (tc) edge            [dartmouthgreen] node {} (t);
\end{tikzpicture}
\qquad\qquad
\begin{tikzpicture}[->, thick]
\node[circle, draw, fill=gray!30, minimum size=24pt] at (0, 0) (x) {$X$};
\node[circle, draw, dashed, minimum size=24pt] at (0, -1.5) (u) {$U$};
\node[circle, draw, minimum size=24pt] (te) at (1.5, 0) {$T_E$};
\node[circle, draw, minimum size=24pt] (tc) at (1.5, -1.5) {$T_C$};
\node[circle, draw, fill=gray!30, minimum size=24pt] (t) at (3, 0) {$T_{\text{obs}}$};
\node[circle, draw, fill=gray!30, minimum size=24pt] (d) at (3, -1.5) {$\delta$};
  \path[every node/.style={sloped,anchor=south,auto=false}]
        (x) edge              [frenchblue] node {} (te)
        (x) edge              [mediumred-violet] node {} (tc)
        (u) edge              [black, dashed] node {} (te)
        (u) edge              [black, dashed] node {} (tc)
        (te) edge             [darkorange] node {} (d)
        (te) edge             [dartmouthgreen] node {} (t)
        (tc) edge             [darkorange] node {} (d)
        (tc) edge             [dartmouthgreen] node {} (t);
\end{tikzpicture}
\caption{\small Three graphical models of survival analysis, showing the dependencies between covariates $X$, event/censorship times $T_E$ / $T_C$, time of last observation $T_{\text{obs}}$, and event indicator $\delta$. Shaded nodes represent variables whose values we can observe. Blue and magenta arrows represent the event and censoring functions \textcolor{frenchblue}{$f_E : \mathcal{X} \rightarrow \mathbb{R}_+$}, \textcolor{mediumred-violet}{$f_C : \mathcal{X} \rightarrow \mathbb{R}_+$}, respectively, of arbitrary functional form. Green arrows into a node $T_{\text{obs}}$ represent the function \textcolor{dartmouthgreen}{$\mathbb{R}_+^2 \rightarrow \mathbb{R}_+$} defined by \textcolor{dartmouthgreen}{$\min\left(t_e, t_c\right)$}. Orange arrows into a node $\delta$ represent the indicator function \textcolor{darkorange}{$\mathbb{R}_+^2 \rightarrow \{0, 1\}$} defined by \textcolor{darkorange}{$\mathbbm{1}\left[t_e < t_c\right]$}. The leftmost graph demonstrates the case of conditionally independent censoring (or censoring-at-random, CAR), because conditioning on $X$ $d$-separates~\citep{geiger1990d} $T_E$ and $T_C$. The center and rightmost graphs represent cases in which the censoring and event times may be conditionally dependent (or censoring-not-at-random, CNAR): in the center graph, this is through a direct dependency between $T_E$ and $T_C$, while in the rightmost graph, this is via the unobserved confounding node, $U$, that affects both $T_E$ and $T_C$.}
\label{fig:survival-graphs}
\end{figure*}

\section{Introduction}  % 1

Clinical and epidemiological investigations often want to predict the time until the onset of an event of interest. 
For example, a clinical trial of a therapeutic cancer regimen may 
compare the time-to-mortality in patients who received an experimental therapy against
the times % that
of the patients in the control arm~\citep{emmerson2021understanding, zhang2011antiangiogenic}; 
and a study developing a clinical risk score may want to regress the time until patient mortality onto covariates of interest, in order to leverage the learned model parameters in a predictive risk algorithm~\citep{jia2019cox}.

In such time-to-event prediction tasks, % problems, 
it is common 
to only have a lower bound on the time-to-event for some 
instances in the study cohort. Here, we focus on \textit{right censored} instances -- \eg patients who left the study prior to their time of death 
(loss-to-follow-up), 
or patients who did not die prior to the conclusion of the study (administrative censoring)~\citep{leung1997censoring, lesko2018censor}.
% \note[RuG]{I use Survival PREDICTION to refer to situations where we want to find a PREDICTIVE model...
% Also, what of left-censor or interval?}
\textit{Survival prediction} refers to the development of statistical models that support time-to-event prediction
when some training instances are censored.
% NOTE: also left censored, ...
Rather than discarding 
such
censored instances, methods in 
survival analysis  
instead leverage the censoring time as a \textit{lower bound} on that individual's time-to-event~\citep{kalbfleisch2011statistical}.

Let $X^{(i)} \in \mathcal{X}$ refer to 
the covariates of the $i^{th}$ patient,
and let $T_{\text{obs}}^{(i)} \in \mathbb{R}_+$ refer to their time of last observation, taken to be the minimum of the event time 
$T_E^{(i)} \in \mathbb{R}_+$ and censorship time 
$T_C^{(i)} \in \mathbb{R}_+$.
Because a patient can be either censored or uncensored, but not both, we only observe one of $\{\,T_E, \, T_C\,\}$ for each patient.
A common assumption in survival analysis is 
% that of 
\textit{conditionally independent censoring}~\citep{kalbfleisch2011statistical}:
% . That is, 
\begin{equation}
T_E\ \perp\ T_C \ |\ X
\label{eq:conditional_indep}
\end{equation}
\ie % Equation \ref{eq:conditional_indep} assumes that 
once $X$ is known, knowing either the event or censoring time does not provide additional information about the other quantity; see Figure~\ref{fig:survival-graphs} (left). This assumption does not always hold: as Figure~\ref{fig:survival-graphs} shows, it is violated when the event time affects the censoring time (center),
or in the presence of unobserved confounding variables (right).
When Equation~\ref{eq:conditional_indep} does not hold,
we say that the data features \textit{dependent censorship}, a common feature of survival data that is unaccounted for, or assumed to be absent, in modern survival prediction.

This is not merely a theoretical concern. Consider a study assessing the survival outcomes of a cohort of chronic disease patients treated with a certain type of medication. The study collects basic demographic and medical information about each patient, their time of death or censorship, and an indicator expressing whether the patient died or was censored.

Now imagine that sicker patients often remove themselves from the study in order to explore alternative treatment options. This presents a form of selection bias: a censored patient is likely to be sicker than their uncensored counterpart, and may therefore have an earlier time of death. A statistical model that does not account for this will likely overestimate each patient's survival time, which may have implications when assessing the safety and utility of the medication in question. This motivating example, characterized by the middle graph in Figure~\ref{fig:survival-graphs}, presents a scenario that contemporary approaches to survival regression are often poorly equipped to accommodate. 


Note that it is typically impossible to verify such a dependency in practice because we observe only one outcome (either event or censorship) per instance, 
but never both. 
Moreover, the dependency between $T_E$ and $T_C$ can be quite subtle if it arises through unobserved confounding variables; the effect of such variables, $U$, is highlighted in Figure~\mbox{\ref{fig:survival-graphs}} (right).
% \note[RuG]{Where is the example about the blood factor?}

%Recent years have seen the emergence of a burgeoning subfield of survival analysis focused on r
Relaxations of the conditional independence assumption of Equation~\ref{eq:conditional_indep} have been studied previously.
However, existing approaches either do not permit the incorporation of covariates (\eg \cite{zheng1995estimates}, \cite{rivest2001martingale}, \cite{de2013generalized}), or make strict assumptions about the form of the marginal distributions $f_{T_E}$ and $f_{T_C}$ 
(\eg \cite{escarela2003fitting}). 
These limitations make it difficult to apply these ideas to survival times modeled via nonlinear functions (such as neural networks), which are increasingly used.
In this vein, our work makes the following contributions:
\begin{enumerate}[wide, labelwidth=!, labelindent=0pt]
    \item We show how to leverage copulas
    to correct for dependent censorship in neural network based models of survival outcomes. 
    We present a parametric proportional hazards model that leverages neural networks to relax assumptions on the distributional form of the marginal event and censoring functions, and employs a copula to model the dependence between event and censoring.
    We also present a method to jointly learn the model and copula parameter from right-censored survival data.
    To our knowledge, this work represents the first neural network-based model of survival analysis to account for dependent censoring.

    % \item We devise a method to learn both the model and copula parameter from data. 
    
    \item We demonstrate that conventional survival metrics, like concordance, 
    are biased under dependent censoring, and we highlight the general impossibility of unbiased evaluation in this regime.
    %We highlight several challenges associated with evaluating the performance of survival models under dependent censoring using existing metrics. 
    %\add[coopermj]
    % We additionally demonstrate how results on the impropriety of conventional scoring rules (\mbox{\eg \cite{rindt2022survival}}) generalize to the regime of dependent censoring.}
    \item It is statistically impossible to determine whether $T_E$ and $T_C$ are independent or dependent from data alone. We show how the \textit{choice of copula can represent an assumption} (prescribed via domain knowledge) over the relationship between the event and censoring distributions. Our paper cleanly characterizes the dependence assumptions underlying two common families of copula (\mbox{the Clayton and Frank families}), and provides guidance to practitioners in choosing a copula to meet their needs. The incorporation of the copula enables practitioners to improve the resulting model on a variety of different benchmarks. 
\end{enumerate}



\section{Background and Preliminaries}

For notation, we will use \color{frenchblue}$T_E$ \color{black} and \color{mediumred-violet}$T_C$ \color{black} where appropriate to refer (respectively) to the random variables representing time-of-\color{frenchblue}event \color{black} and \color{mediumred-violet}censorship\color{black}. When a time could refer to either, we will instead simply use $T$. Realizations of each random variable, such as the time-of-event for a specific patient, will be denoted with a superscript (\eg $T_E^{(i)}$).


\subsection{Survival Analysis Preliminaries}

Our work uses the following elementary quantities from survival analysis.
Let $f_{T|X}$ and $F_{T|X}$ denote the conditional density and cumulative distribution functions, respectively, over the time of an outcome of interest (\eg event or censorship). Then, we have the following definitions.

\begin{definition}[Survival Function]
The \textit{survival function}
\begin{equation}
S_{T|X}(t|X)\ \triangleq\ \Cprob{T > t}{X}\ =\ 
1-F_{T|X}(\,t\,|\,X\,)
\end{equation}
%, denoted $S_{T|X}$, 
represents the probability that the event (or censorship) will take place after a specified time, $t$.
\end{definition}

\begin{definition}[Hazard Function]
The \textit{hazard function},
\begin{equation}
\small
h_{T|X}(t|X) \triangleq \lim_{\epsilon \rightarrow 0} \frac{1}{\epsilon}\,
  \Cprob{T \in [t, t + \epsilon) }{ T \geq t, X} \ =\ \frac{f_{T|X}(t|X)}{S_{T|X}(t|X)}
 % \frac{f_{T|X}(\,t\,|\,X\,)}{S_{T|X}(\,t\,|\,X\,)}
 \label{eq:hazard_function}
\end{equation}
 represents the instantaneous rate at which the event takes place at time $t$, given that it has not yet occurred.
\end{definition}
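
As a brief worked illustration of how these quantities relate (our own example, assuming a constant hazard $h_{T|X}(t|X) = \lambda > 0$, where $\lambda$ is a symbol introduced only for this illustration), note that Equation~\ref{eq:hazard_function} and the identity $f_{T|X} = -\frac{\partial}{\partial t} S_{T|X}$ together give
\begin{equation*}
S_{T|X}(t|X) = \exp\left(-\int_0^t h_{T|X}(s|X)\,ds\right) = e^{-\lambda t},
\end{equation*}
so that $f_{T|X}(t|X) = h_{T|X}(t|X)\,S_{T|X}(t|X) = \lambda e^{-\lambda t}$, \ie the exponential distribution. Any one of $f_{T|X}$, $S_{T|X}$, or $h_{T|X}$ therefore determines the other two.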

\begin{definition}[Likelihood Function]
The general likelihood function for survival data
$\mathcal{D}\,=\,\{ (X^{(i)}, T_\text{obs}^{(i)}, \delta^{(i)}) \}_{i=1}^N$ is as follows:\footnote{The standard presentation of the survival likelihood assumes conditional independence (Equation~\mbox{\ref{eq:independence_likelihood}}), which is a special case of Equation~\mbox{\ref{eq:generallikelihood}}. For a derivation of Equation~\mbox{\ref{eq:generallikelihood}}, refer to Appendix~D.}
\begin{align}\footnotesize
\mathcal{L}(\mathcal{D}) = \prod_{i=1}^N &\color{frenchblue}{\underbrace{\color{black}\left[\int_{T^{(i)}_\text{obs}}^\infty f_{T_E, T_C | X}(T^{(i)}_{\text{obs}},\, t_c\, |\, X^{(i)})\,dt_c\right]\color{frenchblue}}_{\Pr\left(T_E = T^{(i)}_{\text{obs}},\, T_C > T^{(i)}_{\text{obs}}\, |\, X^{(i)}\right)}}^{\color{black}\delta^{(i)}}\label{eq:survivallikelihood}\\&\color{mediumred-violet}{\underbrace{\color{black}\left[\int_{T^{(i)}_{\text{obs}}}^\infty f_{T_E, T_C | X}(t_e,\, T^{(i)}_{\text{obs}}\, |\, X^{(i)})\,dt_e\right]}_{\color{mediumred-violet} \Pr\left(T_C = T^{(i)}_{\text{obs}},\, T_E > T^{(i)}_{\text{obs}}\, |\, X^{(i)}\,\right)}}^{\color{black}1-\delta^{(i)}} \nonumber
\end{align}
\label{eq:generallikelihood}
\end{definition}
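
As a sanity check on the general likelihood, consider the special case of conditionally independent censoring (Equation~\ref{eq:conditional_indep}). The following short derivation is a sketch included for illustration. Under this assumption, the joint density factorizes as $f_{T_E, T_C | X} = f_{T_E | X}\, f_{T_C | X}$, so, writing $t = T_\text{obs}^{(i)}$ and $x = X^{(i)}$, the event term becomes
\begin{align*}
\int_{t}^\infty f_{T_E | X}(t | x)\, f_{T_C | X}(t_c | x)\, dt_c
&= f_{T_E | X}(t | x) \int_{t}^\infty f_{T_C | X}(t_c | x)\, dt_c \\
&= f_{T_E | X}(t | x)\, S_{T_C | X}(t | x),
\end{align*}
and, symmetrically, the censoring term becomes $f_{T_C | X}(t | x)\, S_{T_E | X}(t | x)$. Each instance then contributes
\begin{equation*}
\small
\left[ f_{T_E|X}(t|x)\, S_{T_C|X}(t|x) \right]^{\delta^{(i)}} \left[ f_{T_C|X}(t|x)\, S_{T_E|X}(t|x) \right]^{1-\delta^{(i)}}
\end{equation*}
to the likelihood, which involves only the marginal densities and survival functions.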
% \normalsize

\subsection{Copulas and Sklar's Theorem}

\begin{definition}[Copula \citep{nelsen2007introduction}]
A copula $C(u_1, ..., u_m) : [0, 1]^m \rightarrow [0, 1]$  %,
is a function with the following properties.
\begin{enumerate}[wide, labelwidth=!, labelindent=0pt]
    \item \ul{Groundedness}: if there exists an $i \in \{1, ..., m\}$ such that $u_i = 0$, then $C(u_1, ..., u_m) = 0$.
    \item \ul{Uniform Margins}: for all $i \in \{1, ..., m\}$, if $\forall j:\ j\neq i \Rightarrow u_{j} = 1$, then $C(u_1, ..., u_m) = u_i$.
    \item \ul{$m$-Increasingness}: for all $u = (u_1, ..., u_m)$, $v = (v_1, ..., v_m)$ where $u_i \leq v_i$ for all $i = 1, ..., m$, the following holds:
    \begin{equation*}
    \sum_{l \in \{0, 1\}^m} (-1)^{l_1 + ... + l_m} C(u_1^{l_1}v_1^{1-l_1}, ..., u_m^{l_m}v_m^{1-l_m}) \geq 0
    \end{equation*}
\end{enumerate}
\label{def:copula}
\end{definition}

The utility of copulas as probabilistic objects stems primarily from the application of Sklar's Theorem \citep{sklar1959fonctions}, which demonstrates that any joint cumulative distribution function can be written in terms of a copula over the quantiles of its marginal cumulative distribution functions.

In this work, we focus on copulas that model \textit{joint survival functions}. Such copulas are known as \textit{survival copulas}, and a corresponding version of Sklar's Theorem (Equation~\ref{eq:sklar-survival}) applies to them.

\begin{figure}
\centering
\includegraphics[width=0.47\textwidth]{figures/figure1-final.png}
\caption{\small Visualization of how Sklar's Theorem (Survival) models quantile dependency using a copula. \textcolor{violet}{\textbf{(1)}} The observed event time, $T_E^{(i)}$, is \textcolor{violet}{\textbf{(2)}} mapped through the event survival function, $S_{T_E|X}$, to \textcolor{violet}{\textbf{(3)}} obtain an \textit{event quantile}, $T_E^{(i),\text{Quantile}}$. \textcolor{violet}{\textbf{(4)}} A \textit{censoring quantile} is sampled from the copula, $T_C^{(i),\text{Quantile}} \sim C_\theta(\cdot | T_E^{(i),\text{Quantile}})$; the distributions to the left of the vertical axis show the probability mass of $C_\theta(\cdot | T_E^{(i),\text{Quantile}})$ under \textcolor{independence-text}{\textbf{no}}, \textcolor{weakdependence-text}{\textbf{weak}}, and \textcolor{moderatedependence-text}{\textbf{moderate}} dependence. Notice how as the dependence increases, the distribution $C_\theta(\cdot | T_E^{(i),\text{Quantile}})$ concentrates mass around $T_E^{(i),\text{Quantile}}$. \textcolor{violet}{\textbf{(5)}} The censoring quantile is then mapped through the inverse censoring survival function, $S_{T_C|X}^{-1}$, to \textcolor{violet}{\textbf{(6)}} obtain a corresponding time-of-censorship, $T_C^{(i)}$. The distributions below the horizontal axis show the distribution of $T_C^{(i)}$ under \textcolor{independence-text}{\textbf{no}}, \textcolor{weakdependence-text}{\textbf{weak}}, and \textcolor{moderatedependence-text}{\textbf{moderate}} dependence.
}
\label{fig:quantile-figure}
\end{figure}

\begin{theorem}[Sklar's Theorem (Survival Copulas) \citep{nelsen2007introduction}]
A survival copula\footnote{The copula that relates the joint cumulative distribution $F_{T_1, ..., T_m}$ with the marginal cumulative distribution functions is typically not the same as that which relates the joint survival function $S_{T_1, ..., T_m}$ with the marginal survival functions, though both are valid copulas \citep{nelsen2007introduction}.} is a copula that applies Sklar's Theorem to survival functions, as follows:
\begin{equation}
\label{eq:sklar-survival}\small
S_{T_1, ..., T_m}(t_1,\, \dots\,,\, t_m\,)\ =\ C(\,S_{T_1}(t_1),\, \dots,\, S_{T_m}(t_m)\,)
\end{equation}
\end{theorem}
Figure~\ref{fig:quantile-figure} visualizes how a copula induces dependency between $T_E$ and $T_C$ via the quantiles of $S_{T_E | X}$ and $S_{T_C | X}$.
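
To connect Equation~\ref{eq:sklar-survival} to the survival likelihood, the following is a brief sketch (our own illustration, specializing the theorem to the bivariate, covariate-conditional case and assuming that $C$ is differentiable in its first argument). Writing $u = S_{T_E|X}(t_e|X)$ and $v = S_{T_C|X}(t_c|X)$, the joint survival function is $S_{T_E, T_C|X}(t_e, t_c|X) = C(u, v)$. Since $\frac{\partial u}{\partial t_e} = -f_{T_E|X}(t_e|X)$, the chain rule gives
\begin{align*}
\int_{t_c}^\infty f_{T_E, T_C|X}(t_e, s\,|\,X)\,ds
&= -\frac{\partial}{\partial t_e} S_{T_E, T_C|X}(t_e, t_c|X) \\
&= f_{T_E|X}(t_e|X)\, \frac{\partial C(u, v)}{\partial u}.
\end{align*}
Evaluating at $t_e = t_c = T_\text{obs}^{(i)}$ yields the contribution of an uncensored instance to the likelihood; the contribution of a censored instance follows symmetrically with $\frac{\partial C(u, v)}{\partial v}$. Under the independence copula $C(u, v) = uv$, we have $\frac{\partial C}{\partial u} = v = S_{T_C|X}(t_c|X)$, which recovers the familiar product of a density and a survival function.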

We will focus on two families of copulas, the Clayton \citep{clayton1978model} and Frank \citep{frank1979simultaneous} families. Within these families, the copula $C_\theta$ is parameterized by a single parameter, $\theta$, interpreted as the degree of dependence between the marginal distributions under Equation \ref{eq:sklar-survival}. A larger value of $\theta$ implies greater dependency between the marginal distributions, and both families of copulas converge to the independence copula as $\theta$ approaches 0. We additionally restrict ourselves to \textit{bivariate} survival copulas, although in principle, these methods could be directly extended to accommodate an arbitrary number of competing events.
Such uniparametric copulas provide a parameter-efficient means of modeling the joint survival function: given that survival analysis already provides tools to model the marginal survival functions $S_{T_E}$, $S_{T_C}$, a model that couples these distribution functions via a uniparametric copula $C_\theta$ only requires adding one additional parameter to the model.
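
For concreteness, the bivariate Clayton and Frank copulas are commonly written as follows \citep{nelsen2007introduction}. The parameter ranges used here ($\theta > 0$ for Clayton, $\theta \neq 0$ for Frank) are the usual conventions, and are stated as an assumption about the parameterization:
\begin{align*}
C_\theta^{\text{Clayton}}(u, v) &= \left(u^{-\theta} + v^{-\theta} - 1\right)^{-1/\theta}, \\
C_\theta^{\text{Frank}}(u, v) &= -\frac{1}{\theta} \log\left(1 + \frac{(e^{-\theta u} - 1)(e^{-\theta v} - 1)}{e^{-\theta} - 1}\right).
\end{align*}
In both cases, taking $\theta \rightarrow 0$ recovers the independence copula $C(u, v) = uv$, so the single parameter $\theta$ interpolates between independence and increasingly strong dependence.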

\section{Related Work}
\label{sec:relatedwork}
\textbf{Deep Learning in Survival Analysis: }
Linear models of survival analysis make the (often unrealistic) assumption that an individual's time-to-event is determined by a linear function of their covariates. \cite{faraggi1995neural} presented the first neural network-based model of survival, by incorporating a neural network into a Cox Proportional Hazards (CoxPH) model \citep{cox1972regression}. Although subsequent experimentation found the Faraggi-Simon model unable to outperform its linear CoxPH counterpart \citep{mariani1997prognostic, xiang2000comparison}, DeepSurv \citep{katzman2018deepsurv} leveraged modern tools from deep learning, such as SELU units \citep{klambauer2017self} and the Adam optimizer \citep{kingma2014adam}, to learn a practical neural network-based CoxPH model that reliably outperformed the linear CoxPH on nonlinear outcome data. Since then, variations of neural network-based models of survival, such as DeepHit \citep{lee2018deephit} (and its extension to time-varying data, Dynamic-DeepHit \citep{lee2019dynamic}), Deep Survival Machines \citep{nagpal2021deep}, SuMo-net \citep{rindt2022survival}, Transformer-based survival models \citep{hu2021transformer, wang2022survtrace}, and methods based on Neural ODEs \citep{tang2022soden} have been introduced to model survival outcomes. Though these models successfully relax assumptions around the functional form of marginal risk, they do not jointly model the event and censoring times, a limitation that prevents them from appropriately accounting for dependent censorship.

DeepSurv has enjoyed enduring success in part due to its broad applicability and strong performance on clinical data (\eg \cite{kim2019deep, hung2019deep, she2020development}). Therefore, our investigation will focus on relaxing the conditional independence assumption in a parametric proportional hazards model; we leave to future work the relaxation of the conditional independence assumption in other classes of neural-network-based survival models.

\textbf{Missing/Censored-Not-At-Random Data and Identification:}
Since we do not simultaneously observe $T_E$ and $T_C$, we can treat the problem of survival analysis as one of missing data. The standard taxonomy \citep{rubin1976inference, tsiatis2006semiparametric} of missing data partitions variables into one of three classes: \textit{missing completely at random (MCAR)} where the missingness process is independent of the value of any observed variable, \textit{missing at random (MAR)} where the missingness process may depend on the value of one or more observed covariates, and \textit{missing not at random (MNAR)} where the missingness process may depend on unobserved variables (such as unobserved confounding or self-masking). Similarly, censorship in survival analysis can take place \textit{completely at random (CCAR)}, \textit{at random (CAR)}, or \textit{not at random (CNAR)} \citep{leung1997censoring, lipkovich2016sensitivity}. The conditional independence assumption of Equation \ref{eq:conditional_indep} is equivalent to asserting CAR in the data. 
%\textbf{Identifiability of the Joint Distribution: }

MNAR data, in the general case, is non-identifiable \citep{nabi2020full}; but survival analysis imposes stronger assumptions on the data than general models of missing data, since observed event time acts as a lower bound for unobserved event time (in the case of censored data). Therefore, prior work has focused on investigating the scenarios in which model parameters of survival data can be uniquely identified. \cite{tsiatis1975nonidentifiability} established that the joint over $M$ variables, $\Pr(T_1, ..., T_M)$, is not generally identifiable from observations of the random variable $T = \min\left(T_1, ..., T_M\right)$; although if the joint distribution is defined in terms of a known copula $C$, and the marginals are continuous, then identifiability holds \citep{zheng1996identifiability, carriere1995removing}. \cite{crowder1991identifiability} extended this line of work and showed that even if all the marginal distributions $f_1, ..., f_M$ are known, the joint distribution remains non-identifiable. Research in statistics has since defined tuples of marginals and copulas for which the joint distribution is identifiable. Notably, \cite{schwarz2013identifiability} prove that if the marginals $f_E$ and $f_C$ are known, several sub-classes of Archimedean copulas are identifiable in the bivariate case.
\cite{zheng1996identifiability, carriere1995removing} highlight conditions for identifiability when the form and parameter of the copula are known \textit{a priori}.  
%The (\eg by being drawn from one of the 
\cite{schwarz2013identifiability} categorize copulas into sub-classes wherein the ground-truth copula, $C_{\theta^*}$, is identifiable. Our current analysis does not address the identifiability of the joint distribution in the context of neural-network-based models of survival outcomes, though the success of our method does highlight this as an important area for future study.
Many machine learning models remain non-identified \citep{bona2021parameter} while still being useful as predictive and descriptive models. We consider our method similar in this respect.

\textbf{Copula-Based Models of Dependent Censoring: }
Prior literature has leveraged copulas to model the relationship between the event and censoring distributions in order to account for the effect of dependent censoring \citep{emura2018analysis}. To our knowledge, the first such works were those of \cite{zheng1995estimates} and \cite{rivest2001martingale}, whose development of the nonparametric Copula-Graphic Estimator extended the Kaplan-Meier Estimator \citep{kaplan1958nonparametric} to cases where the dependence between $T_E$ and $T_C$ takes the form of an assumed copula (both $C$, $\theta$ assumed to be known). Though parametric estimators for this problem have been proposed in prior literature, they tend to make strict assumptions over the distributional form of $f_{T|X}$ (\eg that it is a linear-Weibull function \citep{escarela2003fitting}\footnote{Although Escarela does not directly model dependent censoring, but rather dependent competing events, the approach can be directly extended to this domain.}). Proposed semi-parametric estimators \citep{chen2010semiparametric, emura2017joint, deresa2022copula} suffer from much the same problem, as both of these approaches assume that the hazard is a linear function of the instance covariates. To our knowledge, no copula-based model exists that accommodates more complex relationships between covariates and risk while also accounting for dependent censoring. This is the gap our research aims to fill.

\section{Model and Optimization} % formerly "Methodology"

We now present our extension of the Weibull CoxPH model~\citep{barrett2014weibull}, and discuss the problem of learning nonlinear models of survival outcomes under dependent censorship. Our approach entails modeling each outcome -- event and censorship -- independently with an extension of the Weibull CoxPH model, and linking them via a copula in the likelihood function during training. Our approach makes the following assumptions.

% \subsection{Model Assumptions}

\begin{assumption}[Known Form of the Copula]
\label{assmp:knownform}
%Though we do not assume prior knowledge of the copula parameter, $\theta$, w
We assume prior knowledge of the functional form of the copula (\eg that $C_{\theta^*}$, the copula associated with the data-generating process, is a Clayton copula).\footnote{In some experiments, we weaken this assumption, and we will explicitly note where this is the case.}
\end{assumption}%Our work jointly estimates the copula parameter, $\theta$, and the parameters of the marginal distributions. 

\begin{assumption}[Proportional Hazards \citep{cox1972regression}]
\label{assmp:proportionalhazards}
The hazard for each outcome (event/censorship) can be decomposed into some \textit{baseline hazard} $\lambda_0$, dependent only on time, and some \textit{covariate hazard} $g$, dependent only on the covariates $X$. 
That is, there exists some 
appropriate
$\lambda_0, g$ for which
$h_{T|X}(t| X) = \ \lambda_0(t) \,\exp(\,g(X)\,)$.
\end{assumption}

\subsection{The Weibull CoxPH Model} % 4.2
\label{sec:Weibull-Cox}

Let $\lambda_0(t) = \left(\frac{\nu}{\rho}\right)\left(\frac{t}{\rho}\right)^{\nu-1}$ denote the baseline hazard of the Weibull CoxPH model, 
and let $g_{\psi}$ denote a neural network 
with parameters $\psi$
mapping the covariate space $\mathcal{X}$ to the real line. 
Then, leveraging the proportional hazards assumption, we define our model in terms of its hazard:
\begin{equation}
\label{eq:model-hazard}
\hat{h}_{T|X}(\,t|X\,)\ =\ \left(\frac{\nu}{\rho}\right)\left(\frac{t}{\rho}\right)^{\nu-1} \exp\left(\,g_\psi(X)\,\right)
\end{equation}

Let $\phi = \{\nu, \rho, \psi\}$ denote the complete set of model parameters, and observe that the Weibull CoxPH model is a fully parametric model over these \textit{marginal parameters} $\phi$. By rearranging Equation \ref{eq:model-hazard}, this class of models readily admits $\hat{S}_{T|X}$, the estimated survival function over each outcome, and $\hat{f}_{T|X}$, the corresponding probability density function. These two quantities will allow us to perform maximum likelihood estimation --
their derivations are provided in Appendix C.4.1 and C.4.2.
\begin{align}
\hat{S}_{\,T|X}(\,t|X\,)\ &=\ \exp\left(-\left(\frac{t}{\rho}\right)^\nu \exp\left(\,g_\psi(X)\,\right)\right)\\
\hat{f}_{T|X}(\,t|X\,)\ &=\ \hat{h}_{T|X}(\,t|X\,)\ \hat{S}_{T|X}(\,t|X\,)
\end{align}
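
As a brief sketch of how these expressions follow from Equation \ref{eq:model-hazard} (the full derivations are in the appendix; $\hat{H}$ is notation introduced here for the cumulative hazard), one can integrate the hazard over time and use the identity $S = \exp(-H)$:
\footnotesize
\begin{align*}
\hat{H}_{T|X}(t|X) &= \int_0^t \hat{h}_{T|X}(s|X)\,ds = \left(\frac{t}{\rho}\right)^{\nu}\exp\left(g_\psi(X)\right),\\
\hat{S}_{T|X}(t|X) &= \exp\left(-\hat{H}_{T|X}(t|X)\right),\\
\log \hat{f}_{T|X}(t|X) &= \log\hat{h}_{T|X}(t|X) - \hat{H}_{T|X}(t|X).
\end{align*}
\normalsize
The last line is the form that enters a log-likelihood. Evaluating it directly, rather than forming $\hat{f}_{T|X}$ and then taking its logarithm, may help avoid numerical underflow when survival probabilities are small.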

\subsection{Maximum Likelihood Learning Under Dependent Censorship}
\label{sec:likelihood}

Let $\mathcal{D} = \{(X^{(i)}, T_{\text{obs}}^{(i)}, \delta^{(i)})\}_{i=1}^N$ represent a dataset comprising $N$ i.i.d. draws from some data-generating distribution. Let $X^{(i)} \in \mathcal{X}$ refer to a set of baseline covariates collected about each individual $i$. Let $T_{\text{obs}}^{(i)} \in \mathbb{R}_+$ refer to their time of last observation, taken to be the minimum of latent variables $T_E^{(i)} \in \mathbb{R}_+$, $T_C^{(i)} \in \mathbb{R}_+$, representing the event and censoring times, respectively. Finally, let $\delta^{(i)} \in \{0, 1\}$ represent an event indicator taking on the value $\mathbbm{1}[T_E^{(i)} < T_C^{(i)}]$. 
Let $C$ represent a survival copula. 
Given $\mathcal{D}$, we learn by maximizing the likelihood of the observed data.

Under conditional independence, Equation \ref{eq:survivallikelihood} factorizes and simplifies into the familiar form of the survival likelihood.
\footnotesize
\begin{align}
\mathcal{L}(\mathcal{D}) = \prod_{i=1}^N &\left[f_{T_E|X}(T^{(i)}_{\text{obs}} | X^{(i)})S_{T_C|X}(T^{(i)}_{\text{obs}}|X^{(i)})\right]^{\delta^{(i)}} \label{eq:independence_likelihood}\\& \left[f_{T_C|X}(T^{(i)}_{\text{obs}} | X^{(i)})S_{T_E|X}(T^{(i)}_{\text{obs}}|X^{(i)})\right]^{1-\delta^{(i)}}\nonumber
\end{align}
\normalsize

However, when $T_E,T_C$ are no longer conditionally independent, we can no longer rely on this clean decomposition of the likelihood. Instead, we make use of the following lemma.
\begin{lemma}[Conditional Survival Function Under Sklar's Theorem (Survival)]
\label{lemma:copula-conditional}
If $S_{T_E, T_C | X}(t_e, t_c | x) = \left.C(u_1, u_2)\middle|_{\substack{{u_1=S_{T_E|X}(t_e|x)}\\ {u_2=S_{T_C|X}(t_c|x)}}}\right.$, then,
\footnotesize
\begin{equation}
\int_{t_c}^\infty f_{T_C | T_E, X}(s | t_e, x)\, ds = \frac{\partial}{\partial u_1} C(u_1, u_2)|_{\substack{{u_1=S_{T_E|X}(t_e|x)}\\ {u_2=S_{T_C|X}(t_c|x)}}}.
% \right.
\nonumber
\end{equation}
\end{lemma}
\normalsize
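
A short sketch of why the lemma holds, assuming the marginals admit densities and $C$ is differentiable in its first argument (the dummy variable $s$ is introduced here for the integration): differentiating the joint survival function with respect to $t_e$ gives
\footnotesize
\begin{align*}
-\frac{\partial}{\partial t_e} S_{T_E, T_C|X}(t_e, t_c|x) &= \int_{t_c}^\infty f_{T_E, T_C|X}(t_e, s|x)\,ds \\
&= \frac{\partial C}{\partial u_1}(u_1, u_2)\; f_{T_E|X}(t_e|x),
\end{align*}
\normalsize
where the second equality applies the chain rule with $u_1 = S_{T_E|X}(t_e|x)$ and $\partial u_1/\partial t_e = -f_{T_E|X}(t_e|x)$. Dividing both sides by $f_{T_E|X}(t_e|x)$ leaves the conditional survival probability $\Pr(T_C > t_c \mid T_E = t_e, X = x)$ on the left and $\partial C/\partial u_1$ on the right.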

Applying Lemma~\ref{lemma:copula-conditional} to Equation \ref{eq:survivallikelihood} yields the log-likelihood for survival models under dependent censorship.
\footnotesize
\begin{align}
\label{eq:loglikelihood}
\ell(\mathcal{D}) &= \sum_{i=1}^N 
{ } \delta^{(i)}\log \left[f_{T_E|X}(T^{(i)}_\text{obs}| X^{(i)})\right] + \\&\delta^{(i)} \log \left[\frac{\partial}{\partial u_1}C(u_1, u_2)\middle\vert_{\substack{u_1 = S_{T_E|X}(T^{(i)}_\text{obs}|X^{(i)})\\u_2 = S_{T_C|X}(T^{(i)}_\text{obs}|X^{(i)})} }\right] +\nonumber\\
&(1-\delta^{(i)})\log \left[f_{T_C|X}(T^{(i)}_\text{obs}| X^{(i)})\right] + \nonumber\\
&(1-\delta^{(i)}) \log \left[\frac{\partial}{\partial u_2}C(u_1, u_2)\middle\vert_{\substack{u_1 = S_{T_E|X}(T^{(i)}_\text{obs}|X^{(i)})\\u_2 = S_{T_C|X}(T^{(i)}_\text{obs}|X^{(i)})} }\right].\nonumber
% \ell(\mathcal{D})\ &=\ \sum_{i=1}^N\ \delta^{(i)}\log f_{T_E|X}\left(T^{(i)}_{\text{obs}}| X^{(i)}\right)\ \ + \\
% &\delta^{(i)}\log \frac{\partial}{\partial u_1}\left(C(u_1, u_2)\middle\vert_{\substack{u_1 = S_{T_E|X}\left(T^{(i)}_{\text{obs}}|X^{(i)}\right)\\u_2 = S_{T_C|X}\left(T^{(i)}_{\text{obs}}|X^{(i)}\right)}}\right)\ \ + \nonumber\\
% &(1-\delta^{(i)})\log f_{T_C|X}\left(T^{(i)}_{\text{obs}}| X^{(i)}\right)\ \ + \nonumber\\
% &(1-\delta^{(i)})\log \frac{\partial}{\partial u_2}\left(C(u_1, u_2)\middle\vert_{\substack{u_1 = S_{T_E|X}\left(T^{(i)}_{\text{obs}}|X^{(i)}\right)\\u_2 = S_{T_C|X}\left(T^{(i)}_{\text{obs}}|X^{(i)}\right)} }\right)\nonumber
\end{align}
\normalsize


In this expression, the first term corresponds to the log-likelihood of observing the event at time $T_{\text{obs}}^{(i)}$. The second term corresponds to the log of the conditional probability that the censoring time exceeds the event time, given that the event time is $T_{\text{obs}}^{(i)}$. The third and fourth terms, by symmetry, represent the same quantities for the censoring time. Despite the visual complexity of Equation \ref{eq:loglikelihood}, the partial derivatives of the Clayton and Frank copulas admit closed-form expressions, so the log-likelihood has a closed form and can be maximized via gradient-based methods. Algorithm~\ref{alg:optimization} details the optimization procedure used to jointly optimize the marginal and copula parameters. Empirically, we find that scaling the gradient of $\hat{\theta}$ by a large constant factor $K$, and then clipping it prior to taking each update step, supports stable optimization in this regime ($K = 1000$ in our experiments). Additional implementation details and hyperparameters are discussed in Appendix E.2.
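
To make the closed-form claim concrete, the following sketch lists the relevant partial derivatives under the textbook parameterization of each family (assumed here for illustration). Both copulas are symmetric in their arguments, so the derivative with respect to $u_2$ follows by exchanging $u_1$ and $u_2$.
\footnotesize
\begin{align*}
\frac{\partial}{\partial u_1} C^{\text{Clayton}}_\theta &= u_1^{-(\theta+1)}\left(u_1^{-\theta} + u_2^{-\theta} - 1\right)^{-\frac{1+\theta}{\theta}},\\
\frac{\partial}{\partial u_1} C^{\text{Frank}}_\theta &= \frac{e^{-\theta u_1}\left(e^{-\theta u_2}-1\right)}{\left(e^{-\theta}-1\right) + \left(e^{-\theta u_1}-1\right)\left(e^{-\theta u_2}-1\right)}.
\end{align*}
\normalsize
As a sanity check, the independence copula $C(u_1, u_2) = u_1 u_2$ has $\partial C/\partial u_1 = u_2$ and $\partial C/\partial u_2 = u_1$, so the second and fourth terms of Equation \ref{eq:loglikelihood} reduce to $\log S_{T_C|X}$ and $\log S_{T_E|X}$, respectively, which recovers the logarithm of Equation \ref{eq:independence_likelihood}.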

\RestyleAlgo{ruled}
\SetKwComment{Comment}{$\ $\# }{ }
\SetKwComment{Commentt}{\# }{ }
{
\begin{algorithm}[h]
\small
\KwIn{
$\mathcal{D}$: survival dataset of the form $\{(X^{(i)}, T_\text{obs}^{(i)}, \delta^{(i)})\}_{i=1}^N$; $C_\theta$: a bivariate copula, parameterized by $\theta$; $\mathcal{M}$, a class of survival model parameterized by $\phi$ that can produce $\hat{S}^{(\mathcal{M})}_{T|X}(t|X)$, $\hat{f}^{(\mathcal{M})}_{T|X}(t|X)$, for each $X^{(i)} \in \mathcal{D}$; $\alpha$: learning rate for event model, censoring model, and copula parameter; $M$: number of training epochs; $K$: large constant factor; $\theta_{\text{min}}$: small positive number.
}
\KwResult{$\hat{\theta}, \hat{\phi}_E, \hat{\phi}_C$: learned parameters of the copula and each marginal survival model.}
\hrulefill\\
$\mathcal{M}_E \gets \texttt{Instantiate}(\mathcal{M}; \hat{\phi}_E^{(0)})$ \;
$\mathcal{M}_C \gets \texttt{Instantiate}(\mathcal{M}; \hat{\phi}_C^{(0)})$ \;
$C_\theta \gets \texttt{Instantiate}(C; \hat{\theta}^{(0)})$ \;
\For{$i = 1,\, ...\,,\, M$}{
    $\mathcal{L}_{i} \gets \ell\left[\mathcal{D}; \hat{f}^{\left(\mathcal{M}_E\right)}_{T|X},  \hat{f}^{\left(\mathcal{M}_C\right)}_{T|X}, \hat{S}^{\left(\mathcal{M}_E\right)}_{T|X},   \hat{S}^{\left(\mathcal{M}_C\right)}_{T|X},  
    C_{\hat{\theta}^{(i)}}\right]$\;
    $\hat{\phi}_C^{(i)} \gets $\texttt{AdamUpdate}($\mathcal{L}_i$, $\hat{\phi}_C$, $\alpha$) \;
    $\hat{\phi}_E^{(i)} \gets $\texttt{AdamUpdate}($\mathcal{L}_i$, $\hat{\phi}_E$, $\alpha$) \;
    $\nabla \hat{\theta}^{(i)} \gets \nabla \hat{\theta}^{(i)} \times K$\;
    $\nabla \hat{\theta}^{(i)} \gets \nabla \hat{\theta}^{(i)} \vert_{[-0.1, 0.1]}$\;
    $\hat{\theta}^{(i)} \gets $\texttt{AdamUpdate}($\mathcal{L}_i$, $\hat{\theta}$, $\alpha$) \;
    $\hat{\theta}^{(i)} \gets \max(\hat{\theta}^{(i)}, \theta_{\text{min}})$ \Comment{Constrain theta > 0}
}
% $\mathcal{L}_{\text{min}}$ \gets $\infty$ \;
% \For{$\theta' \in \Theta$}{
%     \If{$\ell\left[\mathcal{D}; \hat{f}^{\left(\mathcal{M}_E\right)}_{T|X},  \hat{f}^{\left(\mathcal{M}_C\right)}_{T|X}, \hat{S}^{\left(\mathcal{M}_E\right)}_{T|X},   \hat{S}^{\left(\mathcal{M}_C\right)}_{T|X},  
%     C_{\theta'}\right] < \mathcal{L}_{\text{min}}$}{
%     $\hat{\theta}^{(i)}$ \gets \theta'\;
    
%     $\mathcal{L}_{\text{min}} = \ell\left[\mathcal{D}; \hat{f}^{\left(\mathcal{M}_E\right)}_{T|X},  \hat{f}^{\left(\mathcal{M}_C\right)}_{T|X}, \hat{S}^{\left(\mathcal{M}_E\right)}_{T|X},   \hat{S}^{\left(\mathcal{M}_C\right)}_{T|X},  
%     C_\theta\right]$\;
%     }
% }
\Return{$\hat{\theta}^{(M)}$, $\hat{\phi}_E^{(M)}$, $\hat{\phi}_C^{(M)}$}
\caption{\small Learning Under Dependent Censorship \normalsize}
\label{alg:optimization}
\normalsize
\end{algorithm}
}

\section{Evaluation}

\subsection{Metrics are Biased Under Dependence}

Standard metrics such as the concordance index \citep{harrell1982evaluating, uno2011c}, time-dependent concordance (TDCI) \citep{gerds2013estimating}, and Brier score \citep{brier1950verification} cannot  effectively evaluate models learned under dependent censoring. To demonstrate this, we generate survival data under a copula, and compare the performance of the data-generating event model, $f_{T_E|X}$,  on censored and uncensored data as the dependency increases. The results of this experiment are shown in Table \mbox{\ref{tab:biased_metrics}}. As the dependence increases, both the concordance and Brier score under censoring deviate from their values without censoring. This suggests that the utility of these metrics decreases as the dependence in censoring increases. This challenges previous results that use these measures as the primary statistics of interest when assessing the performance of models under dependent censoring.

By way of analogy, we describe the connection between evaluation under dependent censoring and the potential outcomes framework from causal inference. In the case where censoring takes place completely at random, metrics like concordance and Brier score are suitable means of evaluation, akin to how a randomized controlled trial produces an unbiased estimate of the average treatment effect. Under observed confounding, weighting schemes like inverse probability of censoring weighting \citep{uno2011c, graf1999assessment} leverage a censoring model to produce an unbiased estimator of the evaluation statistic. But the unobserved confounding that gives rise to dependent censoring does not readily admit a censoring model that can be used to perform weighting adjustment, since the covariates required for such a model remain unobserved. Consequently, unbiased model evaluation under dependent censoring is fundamentally a problem of counterfactual analysis and is not feasible to solve using observational data alone.

\begin{table}[]
    \centering\scriptsize
    \setlength\tabcolsep{2pt}
    \begin{tabular}{|c||c|c|c|c|c|c|}
    % \add[coopermj]{
    \hline
    & \multicolumn{3}{c|}{C-Index ($\uparrow$)} & \multicolumn{3}{c|}{Brier Score ($\downarrow$)}\\
    $\tau$ & \multicolumn{1}{c}{Uncensored} & \multicolumn{1}{c}{Censored} & \multicolumn{1}{c|}{Abs. Diff. ($\downarrow$)} & \multicolumn{1}{c}{Uncensored} & \multicolumn{1}{c}{Censored} & \multicolumn{1}{c|}{Abs. Diff. ($\downarrow$)}\\
    \hline\hline
    0.01 & 0.6151 & 0.6187 & 0.0037 & 0.0719 & 0.0859 & 0.0140\\
    0.2 & 0.6144 & 0.6140 & 0.0004 & 0.0757 & 0.0909 & 0.0152\\
    0.4 & 0.6170 & 0.6164 & 0.0006 & 0.0726 & 0.0943 & 0.0217 \\
    0.6 & 0.6172 & 0.6342 & 0.0170 & 0.0733 & 0.0963 & 0.0230\\
    0.8 & 0.6125 & 0.6873 & 0.0748 & 0.0744 & 0.1054 & 0.0310\\
    \hline
    % }
    \end{tabular}
    \caption{\small The results of an experiment comparing the concordance index and Brier score on an uncensored population against those on a population experiencing dependent censoring, as the dependence ($\tau$) increases. The full details of this experiment are provided in Appendix E.1.}
    \label{tab:biased_metrics}
\end{table}


\begin{figure}
    \centering
    \includegraphics[width=0.4\textwidth]{figures/evaluation_metric.png}
    \caption{\small The \textit{Survival-$\ell_1$} metric, $\mathcal{C}_{\text{Survival-}\ell_1}(S, \hat{S})$, for \textcolor{frenchblue}{event} and \textcolor{mediumred-violet}{censoring} distributions. Dashed lines represent the predicted survival curves, \textcolor{frenchblue}{$\hat{S}_{T_E|X}$}, and \textcolor{mediumred-violet}{$\hat{S}_{T_C|X}$}, while solid lines represent the corresponding ground-truth survival curves, \textcolor{frenchblue}{$S_{T_E|X}$}, and \textcolor{mediumred-violet}{$S_{T_C|X}$}. The black horizontal line represents the normalizing quantile, $Q_{\lVert\cdot\rVert}$, which is used to standardize the duration of the survival curve across patients when calculating the \textit{Survival-}$\ell_1$. The area of the hatched blue region above $Q_{\lVert\cdot\rVert}$ is the value of \textcolor{frenchblue}{$\mathcal{C}_{\text{Survival-}\ell_1}(S_{T_E|X}, \hat{S}_{T_E|X})$}, while that of the hatched pink region is the value of \textcolor{mediumred-violet}{$\mathcal{C}_{\text{Survival-}\ell_1}(S_{T_C|X}, \hat{S}_{T_C|X})$}.}
    \label{fig:eval-metric}
\end{figure}

% \add[Attention]{Thoughts on keeping in the proper scoring rule results vs. discarding them? The previous paragraph makes strictly stronger claims re: the drawbacks of dependent censoring on evaluation (and in retrospect the results do seem quite obvious; additionally, no reviewers noted this as a central strength of the work). I have commented them out in this version of the draft.}

%  Though it is a known result that the Brier Score is not proper under dependent censoring, we prove that the TDCI is also not proper under such dependence. The proof of Theorem \ref{thm:tdci_improper} is in the supplement.

% \begin{theorem}
%     The time-dependent concordance index is not a proper scoring rule under dependent censoring.
%     \label{thm:tdci_improper}
% \end{theorem}

\textbf{The \textit{Survival-}$\ell_1$ Metric:} We introduce the \textit{Survival-$\ell_1$} as a means of quantifying bias in survival analysis due to dependent censoring on synthetic data. The \textit{Survival-$\ell_1$} metric $\mathcal{C}_{\text{Survival-}\ell_1} : \mathcal{S} \times \mathcal{S} \rightarrow \mathbb{R}_+$, is the $\ell_1$ distance between the ground-truth survival curve, $S_{T|X}$, and the estimate achieved by a survival model, $\hat{S}_{T|X}$ (Figure \ref{fig:eval-metric}), over the lifespan of the curves.

However, the scale of the naive $\ell_1$ measure between survival curves is proportional to the total amount of elapsed time under each survival curve. To ensure that survival curves over longer lifespans do not contribute proportionally more to the evaluation metric than those over shorter lifespans, we define the small constant \textit{normalizing quantile}, $Q_{\lVert\cdot\rVert}$ (in our experiments, $Q_{\lVert\cdot\rVert} = 0.01$). We can loosely think of the time when each survival curve reaches the normalizing quantile as the ``end time'' of that survival curve. By normalizing the area between the survival curves by the \textit{temporal normalization} value $T_{\text{max}}^{(i)} = S^{-1}_{T|X^{(i)}}\left(Q_{\lVert\cdot\rVert}\right)$, we ensure that the duration spanned by a patient's survival curve does not influence that patient's contribution to $\mathcal{C}_{\text{Survival-}\ell_1}$ relative to other patients.

Our \textit{Survival-$\ell_1$} metric therefore takes the following form:
\footnotesize
\begin{align}
\mathcal{C}_{\text{Survival-}\ell_1}(S, \hat{S}) = \sum_{i=1}^N &\frac{1}{N \times T_{\text{max}}^{(i)}} \int_{0}^{\infty} \\&\left|S_{T|X}(t|X^{(i)}) - \hat{S}_{T|X}(t|X^{(i)})\right| dt\nonumber
\end{align}
\normalsize
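
As an illustrative worked case (the exponential curves and the numbers below are hypothetical, chosen only to make the integral tractable), suppose a single patient has $S(t) = e^{-t/\rho}$ and $\hat{S}(t) = e^{-t/\hat{\rho}}$ with $\hat{\rho} > \rho$. Then $\hat{S} \geq S$ everywhere, and
\footnotesize
\begin{align*}
\int_0^\infty \left|S(t) - \hat{S}(t)\right| dt &= \hat{\rho} - \rho, \qquad T_{\text{max}} = \rho \log\left(1/Q_{\lVert\cdot\rVert}\right),\\
\mathcal{C}_{\text{Survival-}\ell_1}(S, \hat{S}) &= \frac{\hat{\rho} - \rho}{\rho \log\left(1/Q_{\lVert\cdot\rVert}\right)}.
\end{align*}
\normalsize
With $\rho = 10$, $\hat{\rho} = 12$, and $Q_{\lVert\cdot\rVert} = 0.01$, this gives $2 / (10 \times 4.605) \approx 0.043$. Rescaling time by any positive constant multiplies both the numerator and $T_{\text{max}}$ by that constant and so leaves the value unchanged, which is the purpose of the temporal normalization.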

\begin{figure*}[h!]
    \centering
    \includegraphics[width=\textwidth]{figures/results_pdf.pdf}
    \caption{\small \textbf{Left}: Plot of the bias, $\mathcal{C}_{\text{Survival-}\ell_1}$, as a function of the dependence (Kendall's $\tau$), for both independence-assuming and copula-based models on synthetic data. 
    Going from left to right on the $x$-axis denotes stronger dependence between the censoring and event times in the data-generating process. The $y$-axis is overloaded; the scales on the left-hand side of each $y$-axis correspond to bias incurred in the prediction of the event times, and the scales on the right-hand side correspond to bias incurred in the prediction of the censoring times.
    Dotted lines represent the bias in the \textcolor{frenchblue}{event} and \textcolor{mediumred-violet}{censoring} survival curves incurred by independence-assuming models, while solid lines represent the bias incurred by our copula-based approach. The copula-based approach yields a lower line for each outcome, indicating a better approximation of the ground-truth survival function. The shaded region represents the standard deviation of the \textit{Survival-}$\ell_1$ across 10 instantiations of the model with different random seeds. \textbf{Right}: For each value of $\tau$ in the left plot, we plot the recovered value of Kendall's $\tau$, $\hat{\tau}$, as a function of the true dependence, $\tau^*$. The dashed diagonal line, representing $\hat{\tau} = \tau^*$, is plotted for reference. Points close to the line indicate that the learned dependence parameter was close to that of the data-generating process.}
    \label{fig:mainresults}
\end{figure*}


\section{Experiments and Results}

\textbf{Synthetic Data: }
The \textit{Survival}-$\ell_1$ metric places strong assumptions on our knowledge of the data-generating process by assuming access to the ground-truth survival functions for each outcome. For this reason, we predominantly make use of synthetic data to evaluate the merits of our approach.

Algorithm \ref{alg:datagenerating} provides a means of generating synthetic data under a specified copula $C$ with Weibull CoxPH margins. For the \texttt{Linear-Risk} experiment shown in Figure \ref{fig:mainresults}, we generate data according to Algorithm \ref{alg:datagenerating} with $X \in \mathbb{R}^{N \times 10} \sim \mathcal{U}_{[0,1]}$, $\nu_E^* = 4, \rho_E^* = 14, g_{\psi_E^*}(X) = \beta_E^T X$, $\nu_C^* = 3, \rho_C^* = 16, g_{\psi_C^*}(X) = \beta_C^T X$, where $\beta_E, \beta_C \in [0,1]^{10} \sim \mathcal{U}_{[0,1]}$. For the \texttt{Nonlinear-Risk} experiment, we run Algorithm \ref{alg:datagenerating} with $X \in \mathbb{R}^{N \times 10} \sim \mathcal{U}_{[0,1]}$, $\nu_E^* = 4, \rho_E^* = 17, g_{\psi_E^*}(X) = \sum_{i=1}^{10}X_{i}^{2}/8$, $\nu_C^* = 3, \rho_C^* = 16, g_{\psi_C^*}(X) = \beta_{C}^{T}X^{2}/5$, where $ \beta_C \in [0,1]^{10} \sim \mathcal{U}_{[0,1]}$. Each synthetic experiment was performed on $20{,}000$ train, $10{,}000$ validation, and $10{,}000$ test samples.

The network $g_\psi$ in the model we train on the \texttt{Linear-Risk} data consists of a single linear layer, while the network $g_\psi$ in the model we train on the \texttt{Nonlinear-Risk} data consists of a three-layer fully-connected neural network with ELU activations and hidden layers consisting of $[10, 4, 4, 4, 2, 1]$ dimensions, respectively.

\textbf{Semi-Synthetic Data}: To investigate the promise of our approach on non-synthetic data, we artificially censor regression datasets according to various degrees of dependence. We choose two datasets, \texttt{STEEL} \citep{asuncion2007uci} and \texttt{AIRFOIL} \citep{misc_airfoil_self-noise_291}, from the UCI Machine Learning Repository. We induce censoring in the data according to Algorithm 4 in Appendix D.2. We then train a linear version of our method on the artificially censored dataset and evaluate our performance via the $R^2$ statistic\footnote{Note that a method like \textit{Survival-}$\ell_1$ does not apply to this context, as semi-synthetic data does not provide ground-truth survival curves.}. In this experiment, we compare our approach against two baselines: a linear Weibull CoxPH model trained on the regression data \textit{without censoring}, and a linear independence-assuming Weibull CoxPH model.


\RestyleAlgo{ruled}
\SetKwComment{Comment}{$\ $\# }{ }
\SetKwComment{Commentt}{\# }{ }
{
\begin{algorithm}[h]
\small
\KwIn{
$X \in \mathbb{R}^{N \times d}$: a set of covariates, $g_{\psi} : \mathbb{R}^{N \times d} \rightarrow \mathbb{R}$: a class of risk function parameterized by $\psi$, $C_\theta$: a class of copula parameterized by $\theta$, $(\nu^*_E, \rho^*_E, \psi^*_E), (\nu^*_C, \rho^*_C, \psi^*_C), \theta^*$: data-generating parameters associated with each outcome model and the copula, respectively.}
\KwResult{$\mathcal{D}$, a survival dataset with the desired dependence.}
\hrulefill\\
% $\mathcal{M}_E \gets \texttt{Instantiate}(\mathcal{M}; \hat{\psi}_E^{(0)})$ \;
% $\mathcal{M}_C \gets \texttt{Instantiate}(\mathcal{M}; \hat{\psi}_C^{(0)})$ \;
% $C_\theta \gets \texttt{Instantiate}(C; \hat{\theta}^{(0)})$ \;
$\mathcal{D} = \emptyset$\;
\For{$i = 1,\, ...\,,\, N$}{
    $u_1^{(i)}, u_2^{(i)} \sim C_{\theta^*}$\;
    $T_E^{(i)} \gets \left(\frac{-\log(u_1^{(i)})}{\exp\left(g_{\psi_E^*}(X^{(i)})\right)}\right)^{\frac{1}{\nu^*_E}}\rho^*_E$\;
    $T_C^{(i)} \gets \left(\frac{-\log(u_2^{(i)})}{\exp\left(g_{\psi_C^*}(X^{(i)})\right)}\right)^{\frac{1}{\nu^*_C}}\rho^*_C$\;
    $\mathcal{D} \gets \mathcal{D} \cup \{(X^{(i)}, \min(T_E^{(i)}, T_C^{(i)}), \mathbbm{1}[T_E^{(i)} < T_C^{(i)}])\}$\;
}
\Return{$\mathcal{D}$}
\caption{\small Generating Synthetic Dependent Survival Data \normalsize}
\label{alg:datagenerating}
\normalsize
\end{algorithm}
}
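
The sampling step $u_1, u_2 \sim C_{\theta^*}$ is left abstract in Algorithm \ref{alg:datagenerating}. One common way to implement it, sketched here for the Clayton family under its textbook parameterization (an assumption; the auxiliary variable $w$ is introduced here and is not part of the algorithm), is conditional inversion: draw $u_1, w \sim \mathcal{U}_{[0,1]}$ independently and solve $\partial C_\theta / \partial u_1 = w$ for $u_2$, which gives
\begin{equation*}
u_2 = \left(u_1^{-\theta}\left(w^{-\frac{\theta}{1+\theta}} - 1\right) + 1\right)^{-1/\theta}.
\end{equation*}
For the Frank family the same equation can also be solved for $u_2$ in closed form. In either case, the resulting pair $(u_1, u_2)$ has uniform margins and joint distribution $C_\theta$.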
% \textbf{Datasets: }
% We provide a brief overview of the datasets we use to generate \texttt{Outcome-Synthetic} data for our model. Further details of each dataset can be found in the supplement.

%\begin{enumerate}[wide, labelwidth=!, labelindent=0pt]
%    \item 
%     \ul{\texttt{METABRIC}} \citep{curtis2012genomic}: \texttt{METABRIC} is a benchmark survival dataset drawn from a study to perform genomic sub-group analysis of breast tumors. The dataset includes covariates and time-to-mortality/censorship information for 1,904 patients suffering from breast cancer.
% %    \item 

%     \ul{\texttt{SUPPORT}} \citep{knaus1995support}: The \texttt{SUPPORT} dataset was curated to develop a predictive model for the survival of hospitalized adults. It comprises covariates and time-to-mortality/censorship data for 8,873 patients.
%\end{enumerate}

%\section{Results and Discussion}
Our results highlight three properties of our framework. First, our model is capable of reducing the bias in the learned individual survival curve (as measured by the \textit{Survival-$\ell_1$} metric). Second, the learning algorithm does, in many cases, recover the ground-truth copula parameter when the event and censoring times are modeled with neural networks. Finally, our framework opens up new avenues to learning more complex forms of dependence between event and censoring times.

\textbf{Reducing Bias in Survival Outcomes:}
Figure \ref{fig:mainresults} (left column) plots the model bias as measured by the $\textit{Survival-}\ell_1$, and how it behaves across datasets (in rows of plots). %Within each plot, going from left to right on the $x$-axis denote stronger dependence between the survival and event time in the data generating process. %The $y$-axis is overloaded: the scales on the left depict the magnitude of bias associated with \textcolor{frenchblue}{event time} and the scale on the right denote the magnitude of bias associated with \textcolor{mediumred-violet}{censorship time}.

We highlight that our approach of modeling the dependence structure between event and censorship times reduces the bias in the model's estimation of survival curves. The bias is substantially lower under our approach for all values of $\tau > 0$, and the improvements are more pronounced for larger values of $\tau$, indicating that the benefit of our approach grows as the dependence between censorship and event time becomes stronger. We see consistent results for both the \texttt{Linear-Risk} and \texttt{Nonlinear-Risk} data-generating processes, and for both the Frank and Clayton families of copulas.
% Second, we note that the scale of the bias in the context of event time is significantly larger than the scale of the bias in the context of censorship time -- we hypothesize this is due to the fact that we have fewer patients (across all datasets) for whom we observe event time than we observe censorship time. 
In the special case where $\tau = 0$, we observe that our approach correctly recovers the independence copula, and learns an unbiased survival curve.

Our results on the artificially censored \texttt{STEEL} and \texttt{AIRFOIL} datasets suggest that our method also shows promise on non-synthetic data. On the \texttt{STEEL} dataset, our method achieves an $R^2$ of $0.508$ under high dependence ($\tau=0.8$), compared to the $R^2$ of $0.341$ achieved by the independence-assuming model. Likewise, on the \texttt{AIRFOIL} dataset, our method achieves an $R^2$ of $0.484$ under high dependence ($\tau=0.8$), compared to the $R^2$ of $0.330$ achieved by the independence-assuming model. Across different degrees of dependence, our approach reliably outperforms the independence-assuming baseline, and often approaches the performance of the model trained on the uncensored version of the data. The complete table of results can be found in Appendix G.1 (\texttt{STEEL}) and G.2 (\texttt{AIRFOIL}).


\begin{figure}
    \centering
    \includegraphics[width=0.4\textwidth, keepaspectratio]{figures/error_convex.pdf}
    \caption{\small Plot of the bias ($\mathcal{C}_{\text{Survival-}\ell_1}$) as a function of dependence (Kendall's $\tau$), for independence-assuming and copula-based Weibull CoxPH models on synthetic data with linear margins drawn from a convex combination of copulas. In this experiment, we optimize over a mixture of two copulas (one Frank, one Clayton), rather than a single uniparametric copula. As in Figure 4, the dashed lines represent the bias incurred by independence-assuming models, while the solid lines represent the bias incurred by our approach. This figure highlights that our method is capable of relaxing Assumption \ref{assmp:knownform} by way of a convex combination of copulas.}
    \label{fig:convex-combination}
\end{figure}

\textbf{Empirical Recovery of the Copula Parameter: }
How close are the recovered parameters of the copula to the true parameters used in the data-generating process? Although we do not have a formal proof of identifiability, we nevertheless study this question empirically on the two datasets in Figure \ref{fig:mainresults} (right column). Here, we find that our approach is able to reliably recover a $\hat{\theta}$ that is close to $\theta^*$ across different datasets and families of copula.

\textbf{Relaxing Assumption \ref{assmp:knownform}: }
Next, we showcase the flexibility of our framework via a relaxation of Assumption \ref{assmp:knownform}. Specifically, rather than parameterizing our model with $C_\theta$, a single copula of an assumed functional form, we instead parameterize it with a convex combination of Clayton and Frank copulas. During optimization, we learn $\theta_{\text{Frank}}$, $\theta_{\text{Clayton}}$, and $\kappa$, a mixing parameter.
Because the Clayton and Frank copulas are both Archimedean, we know that their convex mixture is also a valid Archimedean copula \citep{bacigal2010some, bacigal2015generators}. Figure \ref{fig:convex-combination} shows the results of an experiment on synthetic data with \texttt{Linear-Risk} margins and a dependency produced by a convex combination of copulas: $C_{\text{Mix.}}(u,v) = \kappa C_{\text{Frank}}(u,v) + (1-\kappa) C_{\text{Clayton}}(u,v)$. In this experiment, we fix $\kappa = 0.5$. As in the case where the functional form of $C$ was known, the mixture model reduces bias in estimation of the event and censoring distributions.
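For concreteness, the two components of this mixture can be written in their standard bivariate forms. These are the textbook definitions, with $\theta_{\text{Clayton}} > 0$ and $\theta_{\text{Frank}} \neq 0$; the parameterization used in a given implementation may differ:
\begin{align*}
C_{\text{Clayton}}(u,v) &= \left(u^{-\theta_{\text{Clayton}}} + v^{-\theta_{\text{Clayton}}} - 1\right)^{-1/\theta_{\text{Clayton}}}, \\
C_{\text{Frank}}(u,v) &= -\frac{1}{\theta_{\text{Frank}}} \log\!\left(1 + \frac{\left(e^{-\theta_{\text{Frank}} u} - 1\right)\left(e^{-\theta_{\text{Frank}} v} - 1\right)}{e^{-\theta_{\text{Frank}}} - 1}\right).
\end{align*}
Both components tend to the independence copula $uv$ as their parameters approach $0$, so under these forms the mixture $C_{\text{Mix.}}$ also reduces to independence in that limit, for any $\kappa \in [0,1]$.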


% \begin{figure}[h]
% \centering
% \includegraphics[width=0.48\textwidth]{figures/convex_combination_final.png}
% \caption{\small The \textit{Survival-$\ell_1$} incurred by independence-assuming and copula-based models when Assumption \ref{assmp:knownform} is relaxed through the use of a convex combination of multiple classes of copula.
% }
% \label{fig:convex-combination}
% \end{figure}

% \section{Discussion}

% The proportion of censored patients can influence the bias incurred in the estimation of the survival function. For the datasets shown in Figure \ref{fig:mainresults}, the 

\section{Discussion}
\label{sec:discussion}

\subsection{Dependent Censoring in Practice}
\textbf{Evaluating Survival Models on Observational Data:} Given the impossibility of evaluation from observational data alone, how should a practitioner apply our method? We propose that practitioners adopt simulation, the present gold standard of evaluation in the causal inference literature, as a primary means to test the performance of survival models under dependent censoring. Methods such as those of \mbox{\cite{parikh2022validating} and \cite{mahajan2022empirical}} provide means of generating counterfactual synthetic data that is similar to the available observational data. Model performance on the simulated data, evaluated using counterfactual metrics (like Survival-$\ell_1$), then serves as a proxy for model performance on the downstream data.

\textbf{The Assumptions Encoded by the Clayton and Frank Copulas:} Given that we only observe either the time of event or censorship, identifying the joint distribution between these variables is generally not possible. Therefore, the choice of copula represents an \textit{assumption} about the data. How can a practitioner leverage domain knowledge in order to select the right copula to use within our framework? Consider how the copula parameter, $\theta$, relates the event and censoring curves under three different circumstances. (1) If the censoring and event curves are identical, then $\theta$ grows with the probability that the times of event and censorship are the same. (2) If the censoring curve decays faster than the survival curve, $\theta$ grows with the probability that the time of censorship precedes the time of event. (3) If the survival curve decays faster than the censoring curve, $\theta$ grows with the probability that the time of event precedes the time of censorship. For a fixed $\theta$, the Clayton copula expresses this dependency as stronger at later times (lower quantiles), and weaker at earlier times (higher quantiles). The Frank copula expresses the dependency at a more uniform strength across all time periods. A visualization of these three cases, and of the quantile densities expressed by the Clayton and Frank copulas, can be found in Appendix B.3.
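One way to make this contrast concrete is through the standard lower-tail dependence coefficient, $\lambda_L = \lim_{q \to 0^+} C(q,q)/q$, which we include here only as an illustrative sketch based on the textbook forms of the two copulas:
\begin{equation*}
\lambda_L^{\text{Clayton}} = 2^{-1/\theta} \quad (\theta > 0), \qquad \lambda_L^{\text{Frank}} = 0.
\end{equation*}
Under these forms, the Clayton copula retains dependence in the lower-quantile tail (for example, $\lambda_L \approx 0.71$ at $\theta = 2$), whereas the Frank copula has no tail dependence in either tail.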

\section{Conclusion}  % 7

%\textbf{Future Work:} % 7.1
The method of using copulas to couple marginal survival distributions is a general one. As future work, we consider extending this approach to other classes of neural survival models, such as those that do not assume either proportional hazards or a Weibull baseline hazard.
Though the \textit{Survival-$\ell_1$} metric is sufficient to demonstrate the promise of our approach, it relies on knowledge of the complete survival curve for each instance; this is typically not available in real-world data. The careful study of the behavior of conventional evaluation metrics under dependence, and the design of strategies to more faithfully ascertain the performance of a model from observational data alone, remain open avenues for future work.

%\textbf{Contributions:} % 7.2
Modern statistical methods in survival analysis increasingly rely on complex, nonlinear functions of risk; however, existing applications of deep learning to survival analysis do not accommodate dependent censoring that may be present in the data. This work relaxes this key assumption, and presents the first neural network-based model of survival to accommodate dependent censoring. 
Our experimental results demonstrate the promise of our method: our approach significantly reduces the \textit{Survival-$\ell_1$} (bias) in estimation and our optimization technique is reliably able to recover the underlying dependence parameter in survival data across datasets of varying feature sizes.

% Extra citations from supplement
\nocite{seabold2010statsmodels}
\nocite{asuncion2007uci}
\nocite{Dua:2019}
\nocite{ve2021efficient}
\nocite{sathishkumar2020energy}
\nocite{sathishkumar2020industry}
\clearpage



\begin{acknowledgements} % will be removed in pdf for initial submission,
						 % (without ‘accepted’ option in \documentclass)
                         % so you can already fill it to test with the
                         % ‘accepted’ class option
The authors gratefully acknowledge the support of several funding sources without which this project would not be possible. RG is supported by NSERC, Amii, and CIFAR. RGK is supported by a Canada CIFAR AI Chair. MC is supported by a CIHR Health Science Impact Fellowship. This research was supported by a DSI Catalyst Grant from the University of Toronto.
\end{acknowledgements}

% References
\bibliography{uai2023-template}

\end{document}
