% \documentclass{uai2022} % for initial submission
\documentclass[accepted]{uai2022} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2022} % ptmx math instead of Computer
                                         % Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2022} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams
\usepackage{bm}
\usepackage{bbm}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{amsthm}
\usepackage{xcolor}
\usepackage{graphicx}
\usepackage{comment}
\usepackage{xr}
\usepackage{subcaption}
\usepackage{bbold}
\usepackage{algorithm}
\usepackage{algorithmic}
\newtheorem{theorem}{{\bf Theorem}}
\newtheorem{lemma}{{\bf Lemma}}
\newtheorem{proposition}{{\bf Proposition}}
\newtheorem{remark}{{\bf Remark}}
\newtheorem{corollary}{{\bf Corollary}}
\newtheorem{definition}{{\bf Definition}}
\newtheorem{assumption}{Assumption}
\newcommand{\clamp}{\operatorname{clamp}}
\newcommand{\E}{\mathop{\mathbb{E}}}
\newcommand{\tr}{\mathop{\rm tr}}
\usepackage{array}
\usepackage{wrapfig}
\usepackage{multirow}
\usepackage{tabularx}
%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example
%\usepackage{subfiles}

\makeatletter
\newcommand*{\addFileDependency}[1]{% argument=file name and extension
  \typeout{(#1)}
  \@addtofilelist{#1}
  \IfFileExists{#1}{}{\typeout{No file #1.}}
}
\makeatother

\newcommand*{\myexternaldocument}[1]{%
    \externaldocument{#1}%
    \addFileDependency{#1.tex}%
    \addFileDependency{#1.aux}%
}
\myexternaldocument{hasan_123-supp}

\title{Modeling Extremes with $d$-max-decreasing Neural Networks}

% The standard author block has changed for UAI 2022 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{\href{mailto:<ali.hasan@duke.edu>?Subject=Your UAI 2022 paper}{Ali Hasan}{}}
\author[2]{Khalil~Elkhalil}
\author[2]{Yuting~Ng}
\author[3]{Jo\~ao~M.~Pereira}
\author[1]{Sina~Farsiu}
\author[4]{Jose~Blanchet}
\author[2]{Vahid~Tarokh}
% Add affiliations after the authors
\affil[1]{%
    Department of Biomedical Engineering\\
    Duke University\\
    Durham, North Carolina, USA
}
\affil[2]{%
    Department of Electrical and Computer Engineering\\
    Duke University\\
    Durham, North Carolina, USA
}
\affil[3]{%
    Instituto de Matem\'atica Pura e Aplicada\\
    Rio de Janeiro, Brazil
  }
 \affil[4]{%
    Department of Management Science and Engineering\\
    Stanford University\\
    Stanford, California, USA
  }
  
  \begin{document}
\maketitle

\begin{abstract}
We propose a neural network architecture that enables non-parametric calibration and generation of multivariate extreme value distributions (MEVs). 
MEVs arise from Extreme Value Theory (EVT) as the necessary class of models when extrapolating a distributional fit over large spatial and temporal scales based on data observed in intermediate scales. 
In turn, EVT dictates that $d$-max-decreasing, a stronger form of convexity, is an essential shape constraint in the characterization of MEVs. 
As far as we know, our proposed architecture provides the first class of non-parametric estimators for MEVs that preserve these essential shape constraints. 
We show that the architecture approximates the dependence structure encoded by MEVs at parametric rate. 
Moreover, we present a new method for sampling high-dimensional MEVs using a generative model. 
We demonstrate our methodology in experimental settings spanning environmental science and financial mathematics, and verify that the structural properties of MEVs are retained, in contrast to existing methods.

\end{abstract}

\section{Introduction}

\begin{figure}
    \centering
    \includegraphics[width=0.4\textwidth]{imgs/uai/mainfig.png}
    \caption{ %\textcolor{magenta}{can the $\alpha=(0.05,0.5,0.95)$ instead?} 
    Equivalent representations of MEVs in dimension two, from dependent at the top row to independent at the bottom row. Left column, samples from MEV; Middle column, spectral representation; Right column, Pickands dependence function. We propose methods for estimating the Pickands function (section~\ref{sec:dmnn}), recovering the spectral density (section~\ref{sec:generative}) and sampling MEVs (section~\ref{sec:sampling}).}
    \label{fig:illustration}
\end{figure}
Modeling the occurrence of extreme events is an important task in many disciplines such as medicine, environmental science, engineering, and finance.
For example, understanding the probability of a patient having an adverse reaction to medication or the distribution of economic shocks is critical to mitigating the associated effects of these events~\citep{dey2016extreme}. 
However, these events are rare in occurrence and therefore are often difficult to characterize with traditional statistical tools. 
This has been the primary focus of extreme value theory (EVT), which describes how to extrapolate the occurrence of rare events outside the range of available data.
In the one-dimensional case, EVT provides remarkably simple models for the asymptotic distribution of the suitably normalized maximum of a growing number of independent and identically distributed (i.i.d.) random variables, a consequence of the celebrated Fisher--Tippett--Gnedenko theorem~\citep{embrechts_book}. 
%This is due to the fundamental result presented in the Fisher-Tippet-Gnedenko theorem~\citet{frechet1927loi, fisher1928limiting, mises1936distribution, gnedenko1943distribution}, which characterizes the class of distributions that arise as the asymptotic limit of (centered and normalized) maxima. 
These are known as the generalized extreme value (GEV) distributions \citep{dehaan_book}.

Perhaps more relevant to practical use-cases is to consider simultaneous extremes in the multi-dimensional scenario.
For example, how are extreme weather patterns related across geographical areas, or how do the extremes of different financial instruments relate?
%The problem of modeling complex interactions between different extreme variables is more difficult, especially in the face of data scarcity inherent in the definition of rare events.
Unlike the one-dimensional case, multivariate extreme value (MEV) distributions generally do not admit simple analytical forms for the underlying density.
This leads to difficulties in performing inference tasks using conventional methods.
Instead, MEV distributions are characterized by tail dependence functions embedded in extreme value copulas~\citep{pickands1981multivariate,segers_copulas}.
%In practice, extreme value copulas are hard to estimate due to the difficulty in enforcing conditions of the tail dependence functions and the lack of data.
%\vspace{-10pt}
\subsection*{Background: Extreme Value Copulas}
We start with a brief overview of multivariate EVT and provide additional background material in Appendix~\ref{sec:bg}.
Let $\Delta_{d-1} = \{\mathbf{w} \in [0,1]^d : \sum_{k=1}^d w_k = 1\}$ denote the $(d-1)$-dimensional unit simplex.
Let $X_i=(X_{1}^{(i)},\ldots,X_{d}^{(i)}) \in \mathbb{R}^d$ for $i \in \{1, \ldots, n \}$ be a sample of i.i.d. random vectors with common continuous probability distribution $F$, marginals $F_1, \ldots, F_d$ and copula $C_F$. The copula $C_F:[0, 1]^d \to [0, 1]$ is a function that satisfies:
\begin{equation*}
     C_F(\mathbf{u}) = \mathbb{P} \left[F_1(X_{1}) \leq u_1, \ldots, F_d(X_{ d}) \leq u_d \right].
\end{equation*}
Let the vector of \emph{component-wise maxima} be given by:
$
    M^{(n)} = \left(M_{1}^{(n)}, \ldots, M_{d}^{(n)} \right), 
$
where $M_{k}^{(n)} = \max_{i=1,\ldots,n} X_{k}^{(i)}$ for $k \in \{1, \ldots, d\}$.  
Let $C_n$ be the copula of $\bar{M}^{(n)}$ given by:
$
    \bar{M}^{(n)} = \left(\frac{M_{1}^{(n)} - b_{1}^{(n)}}{a_{1}^{(n)}}, \ldots, \frac{M_{d}^{(n)} - b_{d}^{(n)}}{a_{d}^{(n)}} \right),
$
where each component-wise maximum $M_{k}^{(n)}$ is normalized with sequences of real numbers $a_{k}^{(n)} > 0$ and $b_{k}^{(n)}$ such that the corresponding limiting marginal is non-degenerate.
Then $C_n$ is related to $C_F$ through the following identity, which underlies the \emph{max-stability} property of the limiting copula:
\begin{equation*}
    C_n(u_1, \ldots, u_d) = C_F(u_1^{1/n}, \ldots, u_d^{1/n})^n,\; \forall \; \mathbf{u}\in[0,1]^d.
\end{equation*}

We are interested in finding the limiting copula $C$ of $C_n$ as $n \to \infty$. 
The limiting copula is then called an \emph{extreme value copula} and we say that $C_F$ is in the \emph{maximum domain of attraction} of $C$, denoted as $C_F \in \text{MDA}(C)$. The limiting extreme value copula $C$ has the form \citep{segers2012max}:
\begin{equation}
\label{eq:pickands_copula}
    \begin{split}
    C(\mathbf{u})  =  \exp\Bigg[&\left( \sum_{k=1}^d \log u_k\right) \\
    &A\left(\frac{\log u_1}{\sum_{k=1}^d \log u_k}, \ldots, \frac{\log u_d}{\sum_{k=1}^d \log u_k} \right) \Bigg] ,
    \end{split}
\end{equation}
where $A$ is known as a \emph{Pickands dependence function} that defines the joint dependence of an MEV.

\begin{definition}[Pickands dependence function]
\label{def:pickands}
A function $A : \Delta_{d-1} \to [1/d, 1]$ is called a Pickands dependence function if it satisfies the following properties:
\begin{enumerate}
    \item $A$ is homogeneous of order 1 and $d$-max-decreasing where $d$ is the dimension;
    \item \label{pickands_bounds} $A$ satisfies $\max_{k = 1,\ldots,d} w_k \leq A(\mathbf{w}) \leq 1$ for all $\mathbf{w} \in \Delta_{d-1}$;
    \item $A(\mathbf{e}_k) = 1$ where $\mathbf{e}_k$ is the $k^\text{th}$ canonical basis vector.
\end{enumerate}
\end{definition}

We give the functional definition of \emph{$d$-max-decreasing} in Appendix~\ref{sec:fdmd}\footnote{Intuitively, $d$-max-decreasing describes a stronger form of convexity needed to ensure that subsets of margins remain valid MEVs.
See \citet[Theorem 5.2.2]{hofmann2009characterization} and \citet[Theorem 6]{ressel2013homogeneous} for further details.} and instead give the spectral correspondence of $A$ here. 

\begin{definition}[Spectral form of Pickands dependence function]
\label{definition_integral}
For any Pickands dependence function $A$, there exists a Borel measure (spectral measure) $\Lambda$ on $\Delta_{d-1}$ satisfying $\int_{\Delta_{d-1}} s_k \, \mathrm{d} \Lambda(\mathbf{s}) = 1$ for $k \in \{1,\ldots,d\}$ such that 
\begin{equation}
A(\mathbf{w}) = \int_{\Delta_{d-1}} \max_{k=1,\ldots,d}w_k s_k \, \mathrm{d}\Lambda(\mathbf{s}), \:\: \mathbf{w} \in \Delta_{d-1}.
\label{eqn:pickands_integral}
\end{equation}
\end{definition}
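
As a concrete check of \eqref{eqn:pickands_integral}, consider the two boundary cases. The discrete measures below are illustrative choices consistent with the normalization above, not additional results. For complete independence take $\Lambda = \sum_{j=1}^d \delta_{\mathbf{e}_j}$, and for complete dependence take $\Lambda = d\,\delta_{(1/d,\ldots,1/d)}$, where $\delta_{\mathbf{s}}$ denotes a point mass at $\mathbf{s}$. Both satisfy $\int_{\Delta_{d-1}} s_k \, \mathrm{d}\Lambda(\mathbf{s}) = 1$, and the integral in \eqref{eqn:pickands_integral} evaluates to
\begin{align*}
\sum_{j=1}^d \max_{k=1,\ldots,d} w_k (\mathbf{e}_j)_k &= \sum_{j=1}^d w_j = 1, \\
d \max_{k=1,\ldots,d} \frac{w_k}{d} &= \max_{k=1,\ldots,d} w_k,
\end{align*}
respectively. These are the upper and lower bounds in Definition~\ref{def:pickands}, and substituting them into \eqref{eq:pickands_copula} gives $C(\mathbf{u}) = \prod_{k=1}^d u_k$ and $C(\mathbf{u}) = \min_{k} u_k$.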

The equality $\int_{\Delta_{d-1}} s_k \, \mathrm{d} \Lambda(\mathbf{s}) = 1$ is only used as a convention to standardize the margins, and is not essential in maintaining the $d$-max-decreasing property~\citep{fougeres2013dense}.
%We use this integral representation and further discuss its implications in section \ref{sec:sampling}.
To provide some intuition on the aims of this paper, Figure~\ref{fig:illustration} illustrates the relationship between different equivalent representations for a canonical parametric MEV -- the symmetric logistic distribution with dependence parameter $\alpha=0.05$ leaning towards complete dependence and $\alpha=0.999$ leaning towards complete independence. The proposed methods estimate the Pickands function (rightmost column) and recover the spectral measure (middle column), which enables sampling MEVs (leftmost column).
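
For reference, a common parameterization of the symmetric logistic model, which we assume here matches the convention used for Figure~\ref{fig:illustration}, has Pickands dependence function
\begin{equation*}
A(\mathbf{w}) = \Bigg( \sum_{k=1}^d w_k^{1/\alpha} \Bigg)^{\alpha}, \quad \alpha \in (0, 1].
\end{equation*}
Setting $\alpha = 1$ gives $A(\mathbf{w}) = 1$ (independence), while $\alpha \to 0$ gives $A(\mathbf{w}) \to \max_{k} w_k$ (complete dependence). In dimension two at $\mathbf{w} = (1/2, 1/2)$ this reduces to $A(\mathbf{w}) = 2^{\alpha - 1}$, which is approximately $0.518$ for $\alpha = 0.05$ and approximately $0.999$ for $\alpha = 0.999$, close to the lower bound $1/2$ and the upper bound $1$, respectively.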

\paragraph{Related Work.}
A number of techniques have been developed to estimate extreme value copulas from data.
The most relevant to the present work is that of \citet{pickands1981multivariate}, where a non-parametric estimator of the Pickands function was first proposed. Subsequent works such as \citet{caperaa1997nonparametric} and \citet{bucher2011new} describe alternative approaches to estimating the dependence function. 
The above methods, however, do not guarantee that the estimate satisfies all the conditions of a valid Pickands dependence function.
In \citet{marcon2017multivariate}, the authors consider a projection of a nonparametric estimator to a convex function represented as a Bernstein polynomial.
However, the number of parameters required increases significantly with both the amount of data and the dimensionality, making the approach difficult to apply to higher-dimensional problems or problems with many data points. 
Finally, a number of Pickands estimators were compared and described in \citet{vettori2018comparison}, and notably none of the estimators reviewed satisfied all requirements of the Pickands function in cases where $d > 2$.
For additional details, please refer to the review on extreme value copulas in \citet{segers_gudendorf_copulas}. 
A theoretical review of $d$-max-decreasing functions and their applications to copulas is given in \citet{ressel2019copulas}.
%Motivated by these challenges, we propose to model the Pickands dependence function using $d$-max decreasing neural networks to enforce the properties of Pickands dependence functions while permitting a flexible model that can learn arbitrarily complex tail dependencies. In addition, we present an algorithm to train the neural network given limited data based on the non-parametric estimator of~\citet{pickands1981multivariate} instead of traditional MLE. 
%On another front, we provide a method for recovering the spectral representation \cite{de1984spectral} of MEV distributions from a given Pickands dependence function. Recovering the spectral representation allows us to leverage existing sampling algorithms such as \cite{dombry2016exact, liu2016optimal} and also gives additional insight on possible clustering behavior of the extremes \citep{engelke2018graphical}.

\textbf{Our Contributions.} 
\begin{enumerate}
    \item We present $d$-max-decreasing neural networks, an architecture constrained to represent Pickands dependence functions of MEVs.
    \item We prove that, in the limit, the proposed architecture can approximate arbitrary Pickands functions.
    \item We propose a generative neural network representation of the spectral density of Pickands functions.
    \item We propose an extension of the Pickands estimator to train neural networks.
\end{enumerate}


\section{Neural Representations of Extreme Value Distributions}
\label{sec:Pickands_ICNN}
Our main results propose two architectures for representing MEVs: a deterministic method for representing the Pickands dependence function, and a stochastic method for representing the spectral measure. 
While both represent equivalent quantities, each is more suited for a particular task.
The deterministic representation is more suitable for estimating exceedance probabilities whereas the spectral representation is more suitable for sample generation.

\subsection{$d$-max-decreasing Neural Networks}
\label{sec:dmnn}
We are interested in finding a flexible parameterization of $A$ that enforces all the properties given in Definition~\ref{def:pickands}. The most difficult property to enforce is being $d$-max-decreasing.
To that end, we propose a new architecture inspired by Maxout Networks~\citep{goodfellow2013maxout} and Input Convex Neural Networks (ICNNs)~\citep{amos2017input}.
The proposed architecture, dubbed \emph{$d$-max Neural Networks ($d$MNNs)}, has additional restrictions to fulfill the conditions of the Pickands dependence function.

\begin{theorem}[$d$-max-decreasing Neural Architecture]
\label{thm:arch}
Let $A_{\bm \theta}^{(m)}(\mathbf{w})$ be a function defined as:
\begin{equation}
 \begin{split}
& A_{\bm \theta}^{(m)}(\mathbf{w}) \\
& := \max \bigg( \max_{k = 1,\ldots,d} w_k,\,  L^{(m)}(\mathbf{w}) +(1 - L^{(m)}(\mathbf{e})^T \mathbf{w}) \bigg),
\end{split}
    \label{eq:arch}
\end{equation}
where
\begin{align*}
L^{(m)}(\mathbf{w}) &= \frac1{n_m}\sum_{j=1}^{n_m}\left ( \ell^{(m)} \circ \ell^{(m-1)}\circ \cdots \circ \ell^{(1)}(\mathbf{w}) \right)_j,  \\
\ell^{(i)}(\mathbf{h}^{(i-1)})_j &= \max_{k=1,\ldots,n_{i-1}} \left (\Theta_{j,\cdot}^{(i)} \odot \mathbf{h}^{(i-1)} \right )_{k},\\
\mathbf{h}^{(i-1)} &= \ell^{(i-1)}\circ \cdots \circ \ell^{(1)}(\mathbf{w}),\\
L^{(m)}(\mathbf{e}) &=(L^{(m)}(\mathbf{e}_1),\ldots,L^{(m)}(\mathbf{e}_d))^T,
\end{align*}
$m$ is the number of layers, $n_i$ is the width of the $i^\text{th}$ layer (with $n_0 = d$ and $\mathbf{h}^{(0)} = \mathbf{w}$), $\;\Theta^{(i)} \in  \mathbb{R}^{n_{i} \times n_{i-1}}_+$ are the weights of the $i^\text{th}$ layer, constrained to be all positive, and $\mathbf{e}_i$ is the $i^\text{th}$ canonical basis vector.
$\odot$ denotes component-wise multiplication.

Then, $A_{\bm\theta}^{(m)}(\mathbf{w})$ is a $d$-max-decreasing function. Moreover, $A_{\bm\theta}^{(m)}(\mathbf{w})$ represents a valid Pickands dependence function.
\end{theorem}

\begin{proof}[Intuition of proof]
The proof uses the idea that $\mathbb{E}_\mathbf{s}[\max_{k=1,\ldots,d}(w_ks_k)], \:\: \mathbf{s} \in \Delta_{d-1}$ is $d$-max-decreasing and certain compositions of this function retain this property. The full proof is given in Appendix~\ref{sec:proof_arch}.
\end{proof}

%\textcolor{magenta}{figure of deep architecture, e.g. the one given in deep Archimax copulas, figure 1.}

%\textcolor{magenta}{alternative normalization scheme with multiplication.}

%\textcolor{magenta}{normalization not required to guarantee d-norm but gives standardized margins for convenience of inference. cite Anne-Laure Fougères, Cécile Mercadier, John Nolan. Dense classes of multivariate extreme value distributions. 2012}
For notational convenience, we drop the $(m)$ unless needed. 
To get an intuition behind the structure of the architecture, note that in the single layer case in the limit as $n_1\to\infty$, the weights $\bm \theta$ correspond to samples of the spectral measure in Definition~\ref{definition_integral} and the expectation is computed empirically. 
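
To make this correspondence concrete, the following worked special case is a sketch under one extra assumption that is not required by Theorem~\ref{thm:arch}: the columns of the weight matrix average to one. For $m = 1$ and input dimension $d$, write $\Theta = \Theta^{(1)} \in \mathbb{R}_+^{n_1 \times d}$, so that
\begin{align*}
L^{(1)}(\mathbf{w}) &= \frac{1}{n_1}\sum_{j=1}^{n_1} \max_{k=1,\ldots,d} \Theta_{jk} w_k, \\
L^{(1)}(\mathbf{e}_k) &= \frac{1}{n_1}\sum_{j=1}^{n_1} \Theta_{jk}.
\end{align*}
If $L^{(1)}(\mathbf{e}_k) = 1$ for every $k$, the correction term in \eqref{eq:arch} vanishes on $\Delta_{d-1}$ because $\sum_{k} w_k = 1$. Moreover, an average of maxima is at least the maximum of the averages, so $L^{(1)}(\mathbf{w}) \geq \max_{k} w_k$ and the outer maximum in \eqref{eq:arch} is attained by its second argument. In this case $A_{\bm \theta}^{(1)}(\mathbf{w}) = L^{(1)}(\mathbf{w})$, which is an empirical analogue of \eqref{eqn:pickands_integral} with the integral replaced by an average over the $n_1$ rows of $\Theta$.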
%We provide an additional representation in the Appendix using weight normalization.
While the proposed architecture is guaranteed to enforce the properties of the Pickands function, and is thus $d$-max-decreasing, we are also interested in seeing how well it can approximate an arbitrary Pickands dependence function.
We present results in the following theorem:

\begin{comment}
\begin{theorem}[Approximation Capabilities]
\label{thm:approx}
A single layer, infinite width network of the form \eqref{eq:arch} can exactly represent an arbitrary Pickands dependence function.
\end{theorem}
\begin{proof}[Intuition of proof]
If we expand the expectation in \eqref{eqn:pickands_integral}, we can see that this corresponds to \eqref{eq:arch} with a single layer and $m_1 \to \infty$. 
The full proof is given in Appendix~\ref{sec:proof_convergence}. 
\end{proof}

Finally, we are interested in analyzing how strong the convergence towards the true Pickands function is assuming the network parameters correspond to samples from the true spectral measure. 
In that sense, we provide a final result that shows the estimator converges uniformly to the true Pickands function as a function of the width of the network. 
\begin{figure}
    \centering
    \includegraphics[scale=0.14]{imgs/uai/pickands.png}
    \caption{ %\textcolor{magenta}{can the $\alpha=(0.05,0.5,0.95)$ instead?} 
   }
    \label{fig:illustration}
\end{figure}
\end{comment}
\begin{theorem}[Uniform Convergence]
\label{thm:approx}
Suppose that $\bm \theta$ are samples from the true spectral measure and $A$ is the true Pickands function.
The empirical process
\[
\mathbb{G}_n = \sqrt{n} \left ( A_{\bm \theta}^{(1)}(\mathbf{w}) - A(\mathbf{w}) \right )
\]
converges to a zero-mean Gaussian process as $n \to \infty$, where $A_{\bm \theta}^{(1)}$ is a single-layer $d$MNN of width $n$.
\end{theorem}
\begin{proof}[Intuition of proof]
We first establish pointwise convergence. Then we show that $A$ is Lipschitz over a bounded set whose covering number grows at a rate compatible with the class being $P$-Donsker. 
The full proof is given in Appendix~\ref{sec:proof_pointwise}. 
\end{proof}
%\vspace{-10pt}
The result in Theorem~\ref{thm:approx} has several implications for the properties of the proposed network; for example, it allows us to quantify the uncertainty associated with our function estimates.
Using the proposed architecture, we mitigate issues faced by previous estimators, such as those of \citet{bucher2011new}, \citet{caperaa1997nonparametric}, and \citet{marcon2017multivariate}, in enforcing the $d$-max-decreasing property, the bounds, and the endpoint conditions of the function. 

\subsection{A Generative Model for the Spectral Measure}
%\vspace{-5pt}
\label{sec:generative}
While the spectral measure can be computed from the weights of the proposed $d$MNN, we propose an alternative representation of the spectral measure using a generative neural network. We model $\mathbf{y} \sim \Lambda$ in \eqref{eqn:pickands_integral} as the output of a generative neural network $G( \, \cdot \,; \bm \phi)$ with parameters $\bm \phi$ and outputs in $\mathbb{R}^d_+$, i.e., $\mathbf{y} = G(\mathbf{z}; \bm \phi)$, where the input $\mathbf{z} \sim p_z$ is drawn from a distribution that is easy to sample from (such as a multivariate Gaussian distribution). 
This leads us to a representation of $A$ in terms of the generator:
\begin{equation}
A_G(\mathbf{w}) := \mathbb{E}_{\mathbf{y} \sim G} \left [ \max_{k=1,\ldots,d} w_k y_k \right],
\label{eq:generator}
\end{equation}
where $\mathbb{E}[y_k] = 1$. 
The expectation is taken empirically with a large number of samples from $G$.
\begin{remark}
The function given by~\eqref{eq:generator} satisfies all the necessary conditions for a valid Pickands function.
\label{rmk:generator}
\end{remark}
Following Remark~\ref{rmk:generator}, we informally note that it follows from the universal approximation theorem of neural networks that if $G$ is sufficiently expressive then \eqref{eq:generator} can represent an arbitrary Pickands dependence function.
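
As a brief sketch of why the endpoint and bound conditions of Definition~\ref{def:pickands} hold for \eqref{eq:generator} (the $d$-max-decreasing property is not addressed by this calculation), note that for $\mathbf{w} \in \Delta_{d-1}$ and $\mathbf{y} \in \mathbb{R}^d_+$ with $\mathbb{E}[y_k] = 1$,
\begin{align*}
A_G(\mathbf{e}_k) &= \mathbb{E}[y_k] = 1, \\
A_G(\mathbf{w}) &\leq \mathbb{E}\Big[\sum_{k=1}^d w_k y_k\Big] = \sum_{k=1}^d w_k = 1, \\
A_G(\mathbf{w}) &\geq \max_{k=1,\ldots,d} \mathbb{E}[w_k y_k] = \max_{k=1,\ldots,d} w_k.
\end{align*}
In practice the expectation is replaced by the Monte Carlo average
\begin{equation*}
\widehat{A}_G(\mathbf{w}) = \frac{1}{N}\sum_{i=1}^{N} \max_{k=1,\ldots,d} w_k \, G(\mathbf{z}_i; \bm \phi)_k, \quad \mathbf{z}_i \sim p_z,
\end{equation*}
where $N$, the number of generator samples, is notation introduced here for illustration.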

\paragraph{Use Cases of Each Representation.}
The difference between the representation given by the $d$MNN~\eqref{eq:arch} and the generative neural network~\eqref{eq:generator} is that, in the $d$MNN case, the spectral measure is modeled by a finite number of elements as dictated by the $d$MNN architecture, while in the generator case the implicit distribution of the spectral measure is modeled. 
The $d$MNN is useful for representing probabilistic quantities since it provides a deterministic representation of the CDF and therefore
does not exhibit the variance of the generative representation. On the other hand, the generative model is capable of simulating many realizations of the MEV, which is particularly useful for sampling applications.
%\vspace{-10pt}
\section{Parameter Estimation}
Fitting high-dimensional copulas to data is often a difficult task since the probability density function (PDF) is not directly modeled. 
In general, specific parametric families are used to make the process easier, such as in Archimedean copulas.
While it is theoretically possible to first obtain the underlying PDF via differentiating the CDF and then fit the $d$MNN with Maximum Likelihood Estimation (MLE), the procedure is computationally complex, especially in high dimensions.
The main drawback of such a method lies in the need to differentiate the $d$-variate CDF, since nested differentiation with existing automatic differentiation methods may result in numerical errors \citep{margossian2019review}.
%this is computationally complex, particularly in high dimensions. 
%Specifically, nested differentiation with existing automatic differentiation methods may result in numerical errors, as described in the review by \citet{margossian2019review}.
Instead, we use specific properties of MEVs to transform the parameter fitting procedure into MLE over univariate random variables.
We additionally present the analogs for survival distributions in Appendix~\ref{sec:survival}. 

\subsection{Fitting the Dependence Function}
Let $F_k$ denote the univariate marginal CDF (which can be fitted using MLE as in \citet{embrechts_book} or the $L$-moments method of \citet{L_moments}) of the $k^\text{th}$ normalized component-wise maximum $\bar{M}_k^{(n)} = \frac{M_k^{(n)} - b_k^{(n)}}{a_k^{(n)}}$, $k \in \{1, \ldots, d\}$. In addition, let $\mathbf{w}=\left(w_1, \ldots, w_d\right) \in \Delta_{d-1}$. 
We introduce the transformation on $\bar{M}_k^{(n)}$:
\begin{align}
    \label{transform1}
    \widetilde{M}_k^{(n)} & = - \log (F_k(\bar{M}_k^{(n)})), \: \forall k \in \{1, \ldots, d\}, \\ 
    \label{transform2}
    Z_w & = \min_{k=1,\ldots,d} \widetilde{M}_k^{(n)} / w_k.
\end{align}
Then, we have: $\mathbb{P} \left[ Z_w > z\right] = e^{- z A(\mathbf{w})}$ 
% \begin{align}
%      \mathbb{P} \left[ Z_w > z\right] = e^{- z A(\mathbf{w})}
%     \label{eq:exp_a}
% \end{align}
% \begin{align}
%     & \nonumber \mathbb{P} \left[ Z_w > z\right] \\ 
%     %  & = \mathbb{P} \left[\widetilde{M}_n^{(1)} > w_1 z, \ldots, \widetilde{M}_n^{(d)} > w_d z \right] \\ 
%   \nonumber & = \mathbb{P} \left[F_k(\bar{M}_n^{(1)}) < e^{-z w_1}, \ldots, F_k(\bar{M}_n^{(d)}) < e^{-z w_d} \right] \\ 
%   \nonumber & = e^{\left( - \sum_{k=1}^d z w_k\right) A \left( \frac{-z w_1}{- \sum_{k=1}^d z w_k}, \ldots, \frac{-z w_d}{- \sum_{k=1}^d z w_k}\right)} \\
%  &= e^{- z A(\mathbf{w})}
%     \label{eq:exp_a}
% \end{align}
(for the full derivation, see Section 3 of \cite{segers_gudendorf_copulas}).
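In brief, and assuming that the copula of the normalized maxima is the extreme value copula $C$ in \eqref{eq:pickands_copula} and that $w_k > 0$ for all $k$, the identity follows from the chain
\begin{align*}
\mathbb{P}\left[ Z_w > z \right]
&= \mathbb{P}\left[ \widetilde{M}_k^{(n)} > w_k z, \; k = 1, \ldots, d \right] \\
&= \mathbb{P}\left[ F_k(\bar{M}_k^{(n)}) < e^{-z w_k}, \; k = 1, \ldots, d \right] \\
&= C\left( e^{-z w_1}, \ldots, e^{-z w_d} \right) \\
&= \exp\Big[ -z \Big( \sum_{k=1}^d w_k \Big) A(\mathbf{w}) \Big] = e^{-z A(\mathbf{w})},
\end{align*}
where the fourth equality substitutes $u_k = e^{-z w_k}$ into \eqref{eq:pickands_copula} and the last uses $\sum_{k=1}^d w_k = 1$.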
This transformation casts the original multi-dimensional distribution into the new variables $Z_w$ that are exponentially distributed with rate parameter given by the Pickands dependence function $A(\mathbf{w})$.
From this transformation, we can fit the model $A_{\bm \theta}(\mathbf{w})$ to samples $Z_w$ using MLE. 
This can be done by training the model $A_{\bm \theta}(\mathbf{w})$ with stochastic gradient descent (SGD) to match the data points $Z_w$ as follows:
\begin{align}
    \label{MLE_loss}
    A^{\star}_{\bm \theta}(\mathbf{w}) = \arg \min_{ \bm \theta} \mathbb{E}_{Z_w} \mathcal{L}(Z_w; {\bm \theta}),
\end{align}
where 
\begin{equation}
    \mathcal{L}(Z_w; {\bm\theta}) = A_{\bm\theta}(\mathbf{w}) Z_w - \log A_{\bm \theta}(\mathbf{w}).
\end{equation}
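
To see where this loss comes from, recall that $Z_w$ is exponentially distributed with rate $a = A(\mathbf{w})$ and hence has density $a e^{-a z}$ for $z \geq 0$, so its negative log-likelihood is
\begin{equation*}
- \log \left( a e^{-a Z_w} \right) = a Z_w - \log a.
\end{equation*}
As a sanity check, stated here only for illustration, fix $\mathbf{w}$, ignore the shape constraints on $A$, and minimize the sample average of this loss over $a > 0$ for observations $Z_{w,1}, \ldots, Z_{w,B}$. Setting the derivative to zero gives
\begin{equation*}
\widehat{a}(\mathbf{w}) = \Bigg( \frac{1}{B} \sum_{b=1}^{B} Z_{w,b} \Bigg)^{-1},
\end{equation*}
which has the form of the classical non-parametric Pickands estimator evaluated at $\mathbf{w}$.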

Alternative losses could be considered by reformulating the loss with respect to the estimators defined in \citet{bucher2011new} and \citet{caperaa1997nonparametric}.
We empirically found that the MLE approach described in~\eqref{MLE_loss} provides the best performance, and it follows naturally from the original formulation of~\citet{pickands1981multivariate}. 
The training procedure is summarized in Algorithm~\ref{alg_train}. 
% \textcolor{magenta}{should shorten the focus on the transformation, put it in the appendix, with a proof for how it relates to rate of exponential. show instead how we extand Pickands estimator to NN and SGD.}

%The choice of hyperparameters such as the block size, number of blocks, learning rate for Adam and batch size are provided in section \ref{sec:results} with settings depending on the application.
\begin{algorithm}[h!]
	\caption{Fitting the Pickands-$d$MNN to Data} 
	\label{alg_train}
	\begin{algorithmic}[1]
	\STATE \textbf{Input:} $\left \{ \left(X_1^{(i)}, \ldots, X_d^{(i)} \right) \right \}_{i=1}^N$, $N=B \times n$ samples of i.i.d. random vectors where $B$ is the number of blocks of data and $n$ is the size of each block.
	\STATE Take component-wise maxima over each block: $\left \{ \left(M_{1}^{(n,b)}, \ldots, M_{d}^{(n,b)} \right)\right \}_{b=1}^B$ where \newline
	\phantom{a}\hspace{35pt} $\displaystyle M_{k}^{(n,b)}=\max _{i=(b-1)n+1,\ldots,bn} X_k^{(i)},$\newline
	for $k\in \{1, \ldots, d\}$ and $b\in \{1, \ldots, B\}$.
	\STATE Fit a GEV to the component-wise maxima $\{ M_{k}^{(n,b)} \}_{b=1}^B$ of each margin, obtain the normalized maxima $\{\bar{M}_{k}^{(n,b)} \}_{b=1}^B$, and estimate the marginals $F_k$ for each $k \in \{1, \ldots, d \}$.
	\STATE \textbf{Initialize} the parameters ${\bm \theta} \geq 0$ of the $d$MNN.
    \REPEAT
    \STATE Randomly sample a minibatch of training data $\{\bar{M}_{k}^{(n,b)} \}_{b \in \text{batch}}$
    and uniformly sample $\mathbf{w} \in \Delta_{d-1}$.
    \STATE Transform samples according to Equations \eqref{transform1} and \eqref{transform2} to obtain transformed samples $\{Z_{w, b} \}_{b \in \text{batch}}$.
    \STATE Compute the gradient
     $\nabla_{{\bm\theta}}  \sum_{b \in \text{batch}} \mathcal{L}\left(Z_{w, b}; {\bm \theta} \right)$.
    \STATE Update $\bm \theta$ with Adam \citep{adam}.
    \UNTIL{convergence}
    \STATE \textbf{Output:} $A^{\star}_{\bm \theta}(\mathbf{w})$.
	\end{algorithmic} 
\end{algorithm}
%\vspace{-9pt}

\subsection{Fitting the Generator}
Recall that we have an equivalent representation of $A$ given by $A_G$ in \eqref{eq:generator} where $G(\cdot; \bm \phi)$ is a function, with parameters $\bm \phi$, of random variables.
We fit the parameters $\bm \phi$ of the generator by solving the following optimization problem:
\begin{equation}
  \min_{\bm \phi} \mathbb{E}_{Z_w} \mathcal{L}(Z_w;\bm \phi)
 + \eta \left\| \mathbb{E}_{\mathbf{y}  } [\mathbf{y}] - \mathbf{1}_d \right\|_2^2,
\label{eqn:segers_opt}
\end{equation}
with $\mathcal{L}$ now defined using the representation of $A_G$ in \eqref{eq:generator}:
\[
\mathcal{L}(Z_w; \bm \phi) = \mathbb{E}_{\mathbf{y}}\Big[\max_{k=1,\ldots,d}y_k w_k\Big]Z_w - \log \mathbb{E}_{\mathbf{y}}\Big[\max_{k=1,\ldots,d}y_k w_k\Big],
\]
where $\mathbf{y} = (y_1, \ldots, y_d) = G(\mathbf{z}; \bm \phi) \in \mathbb{R}^d_+$, $\mathbf{z} \sim p_z$ takes values in $\mathbb{R}^k$, and $\eta >0$ is a regularization factor.
Note that the second expectation in~\eqref{eqn:segers_opt} is only needed to enforce the margins. It need not be strictly enforced: enforcing it only approximately results in minor changes in the tail index.
The expectations with respect to $\mathbf{y}$ in \eqref{eqn:segers_opt} are approximated using the sample mean with samples from the generator. 
%Additionally, minimizing the objective in \eqref{eqn:segers_opt} only requires samples of the learned spectral measure $\mathbf{y}$ rather than samples from the MEV distribution, thereby bypassing additional complexities required by sampling from the full MEV distribution while training.

To summarize the parameter estimation section, we bypass the need to differentiate the CDF and use properties of MEVs to estimate the parameters of the distribution from data. 
Both representations of the Pickands function presented can be used with this technique. 
%\vspace{-10pt}
\section{Sampling}
\label{sec:sampling}
While learning MEV distributions from data is important for computing probabilities, it is also useful to simulate possible scenarios by sampling from an estimated MEV distribution.
We introduce a sampling technique using the proposed architectures to efficiently sample from arbitrary MEVs.
To the best of our knowledge, there are no general sampling methods for arbitrary extreme value copulas that scale to high dimensions.
This is because MEV sampling algorithms assume knowledge of the spectral measure and do not consider sampling when only the Pickands function is given.
It then becomes necessary to recover the spectral measure from a given Pickands function or from data, for which we previously described two methods.
We additionally note that the traditional method of conditional sampling for copulas is ineffective here, since it requires both computing high-order derivatives and using numerical root-finding techniques.
We base our sampling procedure on algorithms for the infinite-dimensional analogue of MEV distributions known as \emph{max-stable processes} \citep{dombry2016exact}. 
Max-stable processes have the property that their finite-dimensional marginals are MEVs, and stationary max-stable processes admit a spectral representation in terms of the spectral measure $\Lambda$.
This ultimately allows us to recast MEV sampling in terms of prior work on sampling from max-stable processes, where established methods exist.

\subsection{Margins of Max-Stable Processes as MEV Distributions}
A stationary max-stable process has the form: 
\begin{equation}
\label{eq:max_stable}
\max_{i \geq 1} \xi_i y_i(x), \:\: x \in \mathbb{X} \subset \mathbb{R}^k
\end{equation}
where $\xi_i$ is the $i^\text{th}$ point of a Poisson point process on $(0, \infty)$ with intensity $\xi^{-2} \mathrm{d} \xi$, and $y_i$ is the $i^\text{th}$ sample from the spectral measure. 
Additionally, $\mathbb{E}[y(x)] = 1$ for all $x \in \mathbb{X}$ is generally assumed in order to enforce unit Fr\'echet margins. 
For a finite set of $d$ locations $\{x_j\}_{j=1}^d$, this corresponds to a $d$-dimensional spectral measure with the same properties as in Definition~\ref{definition_integral}. 
The key idea is to use the representation in \eqref{eq:max_stable} to sample from the full MEV distribution with only knowledge of the spectral measure. 
%Since we recover the spectral measure in both proposed methods, we will use this technique to generate samples from the MEV distribution.
We use the algorithm of \citet[Algorithm 1]{hofert2018hierarchical} for sampling from the full distribution given samples of the spectral measure. We give the details of the algorithm in Appendix~\ref{sec:algs}, Algorithm~\ref{alg:sampling}.
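
To see why the representation in \eqref{eq:max_stable} yields an MEV at finitely many locations, the following standard argument may be helpful. It is a sketch under the assumptions that the vectors $\mathbf{y}_i = (y_i(x_1), \ldots, y_i(x_d))$ are i.i.d., nonnegative, and independent of the points $\xi_i$; the thresholds $t_1, \ldots, t_d > 0$ are notation introduced only for this illustration. The pairs $(\xi_i, \mathbf{y}_i)$ then form a Poisson point process, and the event $\{\max_{i} \xi_i y_i(x_j) \leq t_j \text{ for all } j\}$ occurs exactly when no pair satisfies $\xi > \min_j t_j / y(x_j)$, so that
\begin{align*}
\mathbb{P}\Bigl(\max_{i \geq 1} \xi_i y_i(x_j) \leq t_j,\; j = 1, \ldots, d\Bigr)
&= \exp\Bigl(-\mathbb{E}_{\mathbf{y}}\Bigl[\int_{\min_j t_j / y(x_j)}^{\infty} \xi^{-2}\, \mathrm{d}\xi\Bigr]\Bigr) \\
&= \exp\Bigl(-\mathbb{E}_{\mathbf{y}}\Bigl[\max_{j = 1, \ldots, d} \frac{y(x_j)}{t_j}\Bigr]\Bigr).
\end{align*}
Letting $t_l \to \infty$ for all $l \neq j$ and using $\mathbb{E}[y(x_j)] = 1$ gives the margin $\exp(-1/t_j)$, which is unit Fr\'echet. Writing $w_j = t_j^{-1} / \sum_{l=1}^d t_l^{-1}$, the exponent equals $\bigl(\sum_{l=1}^d t_l^{-1}\bigr)\, \mathbb{E}_{\mathbf{y}}[\max_{j} w_j\, y(x_j)]$, in which the expectation has the same form as the one appearing in \eqref{eqn:segers_opt}.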

\subsection{Sampling from the dMNN}
Suppose we fit a single-layer $d$MNN using Algorithm~\ref{alg_train} with weights ${\bm \theta } \in \mathbb{R}_+^{w \times d}$, where $w$ is the width of the network and $d$ is the data dimension.
Consider the transformation $\hat{\theta}_{i,j} = \theta_{i,j} / \sum_{l=1}^d \theta_{i, l}$, which maps each row of the weights to the unit simplex $\Delta_{d-1}$; here $i$ and $j$ refer to the row and column indices. 

We then choose a number $N$ and compute
$$
\max_{i = 1, \ldots, N} \xi_i \hat{\bm \theta}_{i + s}, \quad s \sim \text{Unif}(\{1,\ldots, w-N\}),
$$
where $\xi_i$ is defined as per~\eqref{eq:max_stable}, $\hat{\bm \theta}_{i+s}$ denotes row $i+s$ of the normalized weights, $s$ is a random row offset, and the maximum is taken component-wise.
While this method is effective in sampling, a possible issue is that the number of atoms $\hat{\bm \theta}_i$ is finite and dictated by the width $w$ of the network. 
The generative model, on the other hand, allows for unlimited generation of samples of the spectral measure. 

%\textcolor{magenta}{we need to put an algorithm, at least in the appendix.}

\subsection{Sampling from the Generative Model}
Suppose we fit a generative model $G(\mathbf{z}; \bm \phi)$ to data following the optimization procedure in \eqref{eqn:segers_opt}. 
Then sampling proceeds similarly to the case of the $d$MNN, except that we do not use the weights of the network explicitly but instead sample from the model:
$$
\max_{i=1,\ldots,N}\xi_i \mathbf{y}_i \:\: \text{where} \:\: \mathbf{y}_i = G(\mathbf{z}_i; \bm \phi), \:\: \mathbf{z}_i \sim p_z,
$$
where the notation is maintained as above and $p_z$ is an easy-to-sample prior distribution.
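
The procedure above can be sketched as follows. This is only an illustration under stated assumptions and is not meant to replace the algorithm given in the appendix: the truncation level $N$ is fixed in advance, and the Poisson points are generated through the standard construction $\xi_i = 1/(E_1 + \cdots + E_i)$ with $E_l$ i.i.d.\ standard exponential, which produces the points of a Poisson process with intensity $\xi^{-2}\mathrm{d}\xi$ in decreasing order. The running sum $\Gamma$ and the output $\mathbf{x}$ are names introduced for this sketch.
\begin{algorithm}[h]
\caption{Illustrative sketch: one truncated MEV draw from the generator}
\begin{algorithmic}[1]
\REQUIRE generator $G(\cdot\,; \bm \phi)$, prior $p_z$, truncation level $N$
\STATE $\Gamma \leftarrow 0$, $\mathbf{x} \leftarrow \mathbf{0}_d$
\FOR{$i = 1, \ldots, N$}
\STATE sample $E_i \sim \mathrm{Exp}(1)$; set $\Gamma \leftarrow \Gamma + E_i$ and $\xi_i \leftarrow 1/\Gamma$
\STATE sample $\mathbf{z}_i \sim p_z$; set $\mathbf{y}_i \leftarrow G(\mathbf{z}_i; \bm \phi)$
\STATE $\mathbf{x} \leftarrow \max(\mathbf{x}, \xi_i \mathbf{y}_i)$, taken component-wise
\ENDFOR
\RETURN $\mathbf{x}$
\end{algorithmic}
\end{algorithm}
Since $\xi_i$ decreases with $i$, later terms tend to contribute less to the component-wise maximum, which is the intuition for why a finite $N$ can be adequate in practice.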

As a final note regarding the sampling methods, one particularly useful way of combining the methods is to first estimate $A_{\bm\theta}$ from data using an estimator such as the $d$MNN.
Then, fit the generator to $A_{\bm\theta}$ by taking the mean squared error (MSE) between the two representations, i.e.
\begin{align*}
\min_{\bm \phi} \,& \mathbb{E}_{\bm{w}\sim\text{Unif}(\Delta_{d-1})}( A_{\bm\theta}(\bm{w}) - A_{G}(\bm{w}) )^2 \\
& + \eta \| \mathbb{E}[\mathbf{y}] - \mathbf{1}_d \|_2^2.
\end{align*}
%\begin{align*}
%{\bm \phi}^\star = \arg \min_{\bm \phi} %&\left \| A_{\bm\theta} - \mathbb{E}_{\mathbf{y} \sim G_{\bm \phi}} \left[ \max_{k=1, \ldots, d} w_k y_k \right] \right \| \\
%& + \eta \| \mathbb{E}[\mathbf{y}] - \mathbf{1}_d \|.
%\end{align*}
This provides a simple way to recover the spectral density of any given EVC and thus an effective way to sample from arbitrary MEVs.
We detail this algorithm in Appendix~\ref{sec:algs}, Algorithm~\ref{alg:train_gen}.
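
A minimal sketch of how this objective might be approximated in practice is given below; the sample sizes $S$ and $M$ and the exponential construction are assumptions of this illustration rather than details taken from the algorithm in the appendix. A point that is uniformly distributed on $\Delta_{d-1}$ can be drawn by normalizing i.i.d.\ standard exponential variables,
\begin{equation*}
\mathbf{w}^{(s)} = \frac{\bigl(E^{(s)}_1, \ldots, E^{(s)}_d\bigr)}{\sum_{k=1}^d E^{(s)}_k}, \qquad E^{(s)}_k \sim \mathrm{Exp}(1),
\end{equation*}
which follows a Dirichlet distribution with all parameters equal to one. With generator samples $\mathbf{y}^{(m)} = G(\mathbf{z}^{(m)}; \bm \phi)$, $\mathbf{z}^{(m)} \sim p_z$, the objective is then approximated by
\begin{align*}
& \frac{1}{S} \sum_{s=1}^{S} \Bigl( A_{\bm\theta}(\mathbf{w}^{(s)}) - \frac{1}{M} \sum_{m=1}^{M} \max_{k=1,\ldots,d} w^{(s)}_k y^{(m)}_k \Bigr)^2 \\
& \quad + \eta\, \Bigl\| \frac{1}{M} \sum_{m=1}^{M} \mathbf{y}^{(m)} - \mathbf{1}_d \Bigr\|_2^2 .
\end{align*}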
%\vspace{-15pt}
\section{Results}
\label{sec:results}
%\vspace{-5pt}
\begin{figure}
    \centering
\begin{subfigure}{.22\textwidth}
  \centering
  \includegraphics[width=\linewidth]{imgs/margins/commodities_3d_NaiveEstimator.pdf}  
  \caption{Pickands 3d Margins}
  \label{fig:pick_marg3}
\end{subfigure}
\begin{subfigure}{.22\textwidth}
  \centering
  \includegraphics[width=\linewidth]{imgs/margins/commodities_3d_CFGEstimator.pdf}  
  \caption{CFG 3d Margins}
  \label{fig:cfg_marg3}
\end{subfigure}
\begin{subfigure}{.22\textwidth}
  \centering
  \includegraphics[width=\linewidth, trim=2pt 0pt 2pt 0pt]{imgs/margins/commodities_3d_BDVEstimatorMM.pdf}  
  \caption{BDV 3d Margins}
  \label{fig:bdv_marg3}
\end{subfigure}
\begin{subfigure}{.22\textwidth}
  \centering
  \includegraphics[width=\linewidth]{imgs/margins/commodities_3d_MaxLinear_w512d1.pdf}  
  \caption{$d$MNN 3d Margins}
  \label{fig:net_marg3}
\end{subfigure}
\caption{Qualitative comparison of 3d margins from the learned 10d MEV for the commodities dataset. The $d$MNN retains margins that are valid Pickands dependence functions. The other estimators are non-convex and fall outside the required bounds. Contours are plotted with solid lines. See additional figures in Appendices~\ref{sec:large_figs} and~\ref{sec:more_experiments}, Figures~\ref{fig:net_marg_ozone} to \ref{fig:net_marg_crypto}.}
\label{fig:marg3}
%\vspace{-5pt}
\end{figure}
\begin{comment}
\begin{figure}
    \centering
\begin{subfigure}{.22\textwidth}
  \centering
  \includegraphics[width=\linewidth]{imgs/margins/spy_extremal_NaiveEstimator.pdf}  
  \caption{Pickands 2d Margins}
  \label{fig:pick_marg_spy}
\end{subfigure}
\begin{subfigure}{.22\textwidth}
  \centering
  \includegraphics[width=\linewidth]{imgs/margins/spy_extremal_CFGEstimator.pdf}  
  \caption{CFG 2d Margins}
  \label{fig:cfg_marg_spy}
\end{subfigure} 
\begin{subfigure}{.22\textwidth}
  \centering
  \includegraphics[width=\linewidth, trim=2pt 0pt 2pt 0pt]{imgs/margins/spy_extremal_BDVEstimatorMM.pdf}  
  \caption{BDV 2d Margins}
  \label{fig:bdv_marg_spy}
\end{subfigure}
\begin{subfigure}{.22\textwidth}
  \centering
  \includegraphics[width=\linewidth]{imgs/margins/spy_extremal_MaxLinear_w512d1.pdf}  
  \caption{$d$MNN 2d Margins}
  \label{fig:net_marg_spy}
\end{subfigure}
\label{fig:marg_spy}
\caption{Qualitative comparison of 28 2d margins from learned 418d MEV for the S\&P dataset. The $d$MNN is the method that retains margins that are valid Pickands dependence functions as the others are non-convex and outside the required bounds.}
\end{figure}
\end{comment}
In this section, we provide numerical results that compare the estimation capabilities of the proposed $d$MNN-based model with well-known estimators from the literature: Pickands \citep{pickands1981multivariate}, CFG \citep{caperaa1997nonparametric}, and the estimator described by \citet{bucher2011new}, which we refer to as BDV. These estimators are described in greater detail in Appendix~\ref{sec:estimators}. 
We start by evaluating the performance for estimating survival probabilities on known parametric models, followed by real data.
We conclude with experiments on sampling from an MEV, where we use the proposed generative model for high-dimensional data with different dependence structures.
% Throughout this section, we refer to the proposed ICNN-based estimator as Pickands-ICNN. 
%All hyperparameters are fixed to the same values for all experiments presented with the Pickands-$d$MNN.
%Specifically, we use a single layer $d$MNN with a width of $512$ for all estimation experiments in the manuscript 
To align with the results in Theorem~\ref{thm:approx}, for the experiments presented in this section, we use a single-layer $d$MNN with a width of $512$. Additional experiments with two different architectures are presented in Appendix~\ref{sec:more_experiments}.
Code for the experiments is available online.\footnote{\url{https://github.com/alluly/dMNN}}

\begin{figure}
    \centering
\begin{subfigure}{.23\textwidth}
  \centering
  \includegraphics[width=\linewidth]{imgs/uai/survival_sl_50_w=512d=1.pdf} 
  \caption{$A_\text{SL}$ MSE ($d=2$)}
  \label{fig:sl_survival}
\end{subfigure}
\begin{subfigure}{.23\textwidth}
  \centering
  \includegraphics[width=\linewidth]{imgs/uai/survival_asl_50_w=512d=1.pdf}
  \caption{$A_\text{ASL}$ MSE ($d=2$)}
  \label{fig:asl_survival}
\end{subfigure} 
\caption{MSE of survival probabilities for $d=2$ with $100$ samples for $A_\text{SL}$ (\ref{fig:sl_survival}) and $A_\text{ASL}$ (\ref{fig:asl_survival}). Thresholds are above the $75$th percentile.}
%\vspace{-10pt}
\end{figure}
\begin{figure*}
    \centering
\begin{subfigure}{.247\textwidth}
  \centering
  \includegraphics[width=\linewidth]{imgs/uai/box_sl_w=512d=1_Da.pdf} 
  \caption{$A_\text{SL}$ MSE ($d=256$)}
  \label{fig:sl_mse_est_all_a}
\end{subfigure}
\begin{subfigure}{.247\textwidth}
  \centering
  \includegraphics[width=\linewidth]{imgs/uai/box_asl_w=512d=1_Da.pdf}
  \caption{$A_\text{ASL}$ MSE ($d=256$)}
  \label{fig:asl_mse_est_all_a}
\end{subfigure} 
\begin{subfigure}{.247\textwidth}
  \centering
  %\includegraphics[width=\linewidth]{imgs/mse_sl_alln_thick_zoomed.pdf}  
\includegraphics[width=\linewidth]{imgs/uai/box_sl_w=512d=1_Dd.pdf}  
  \caption{$A_\text{SL}$ MSE ($\alpha=0.5$)}
  \label{fig:sl_mse_est_all_d}
\end{subfigure}
\begin{subfigure}{.24\textwidth}
  \centering
  %\includegraphics[width=\linewidth]{imgs/mse_asl_alln_thick_zoomed.pdf}  
    \includegraphics[width=\linewidth]{imgs/uai/box_asl_w=512d=1_Dd.pdf}  
  \caption{$A_\text{ASL}$ MSE ($\alpha=0.5$)}
  \label{fig:asl_mse_est_all_d}
\end{subfigure}  

\caption{Comparison of $\|\hat{A}(\mathbf{w}) - A(\mathbf{w})\|_2^2$ for different estimators $\hat{A}$: for dependence parameters $\alpha \in \{0.25, 0.50, 0.75, 1.0\}$ with fixed $d=256$ (\ref{fig:sl_mse_est_all_a}, \ref{fig:asl_mse_est_all_a}), and for fixed $\alpha=0.5$ with $d \in \{256, 512, 728, 1024\}$ (\ref{fig:sl_mse_est_all_d}, \ref{fig:asl_mse_est_all_d}). The reference functions $A(\mathbf{w})$ are $A_\text{SL}$ (\ref{fig:sl_mse_est_all_a}, \ref{fig:sl_mse_est_all_d}) and $A_\text{ASL}$ (\ref{fig:asl_mse_est_all_a}, \ref{fig:asl_mse_est_all_d}). Results are over 50 runs with 100 training samples for each run.} 
    \label{fig:sl_asl_mse}
\end{figure*}
\begin{table}[ht!]
\footnotesize
    \centering
    \begin{tabular}{@{}ll@{}}
    Pickands function & Parameters \tabularnewline
    \toprule
    $A_{\text{SL}}(\mathbf{w}) = \left( \sum_{k=1}^d w_k^{1/\alpha}\right)^{\alpha}$ & $\alpha \in (0,1]$ \tabularnewline
    \multirow{3}{*}{$A_\text{ASL}(\mathbf{w}) = \sum_{b \in \mathcal{P}_d}\bigg ( \sum_{i \in b} (\lambda_{i,b}w_i)^{1 / \alpha_b} \bigg)^{\alpha_b}$} &
    $\alpha_b \in (0, 1] $ \tabularnewline
    &  $\lambda_{i, b} \in [0, 1]$ \tabularnewline
    & $\sum_{b \ni i} \lambda_{i,b} = 1$ for each $i$
    \end{tabular}
    \caption{Parametric Pickands functions for the symmetric $A_\text{SL}$ and asymmetric $A_\text{ASL}$ logistic copulas and their valid parameter ranges. $\mathcal{P}_d$ refers to the power set of $\{1,\ldots,d\}$. All functions are defined for domain $\mathbf{w} \in \Delta_{d-1}$.}
    \label{tab:sl}
\end{table}
\paragraph{Synthetic data.} We consider two canonical families of extreme value distributions, the symmetric logistic ($A_\text{SL}$) and the asymmetric logistic ($A_\text{ASL}$) families, whose Pickands functions, given by \citet{segers_copulas}, are listed in Table~\ref{tab:sl}.
The parameter $\alpha \in (0, 1]$ models the degree of dependence between variables, ranging from complete dependence ($\alpha \to 0$) to complete independence ($\alpha=1$).
%For the asymmetric logistic, $\mathcal{P}_d$ is the power set of $\{1,\ldots,d\}$. The dependence parameters $\alpha_b \in (0, 1]$ are defined for $b \in \mathcal{P}_d$ except for all singleton sets. The asymmetry parameters $\lambda_{i, b} \in [0, 1]$ are randomly generated to satisfy: $\sum_{b \in \{c \in \mathcal{P}_d \:: \: i \in c \}} \lambda_{i, b}=1$.
Exact sampling from distributions of this type is described by \citet{stephenson2003simulating}. 
Note that for both the symmetric and asymmetric copulas, the marginals are distributed according to the standard Fr\'echet distribution. 
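As a quick check of the two limiting regimes (a worked example added for illustration), consider $A_\text{SL}$ from Table~\ref{tab:sl}. At $\alpha = 1$ we have $A_\text{SL}(\mathbf{w}) = \sum_{k=1}^d w_k = 1$, and as $\alpha \to 0$ the expression $\bigl(\sum_{k} w_k^{1/\alpha}\bigr)^{\alpha}$ tends to $\max_k w_k$. For $d = 2$ and $\mathbf{w} = (1/2, 1/2)$,
\begin{equation*}
A_\text{SL}\bigl(\tfrac{1}{2}, \tfrac{1}{2}\bigr) = \bigl(2 \cdot 2^{-1/\alpha}\bigr)^{\alpha} = 2^{\alpha - 1},
\end{equation*}
which equals $1$ at $\alpha = 1$, is approximately $0.707$ at $\alpha = 0.5$, and tends to $1/2 = \max_k w_k$ as $\alpha \to 0$.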
We start by comparing the MSE of survival probabilities for $d=2$ where the true Pickands dependence function is given by the symmetric or asymmetric model described above for different degrees of dependence $\alpha$. 
We compute the exact values of the survival probability and consider survival probabilities associated with margins above the 75th percentile. 
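To make the computation concrete for $d=2$, the following sketch assumes unit Fr\'echet margins and writes the joint CDF as $F(t_1, t_2) = \exp\{-(t_1^{-1} + t_2^{-1}) A(\mathbf{w})\}$ with $w_k = t_k^{-1} / (t_1^{-1} + t_2^{-1})$; the symbols $M^{(1)}, M^{(2)}, t_1, t_2$ are introduced only for this example, and the convention used in earlier sections may differ in notation. By inclusion-exclusion,
\begin{equation*}
\mathbb{P}\bigl(M^{(1)} > t_1, M^{(2)} > t_2\bigr) = 1 - e^{-1/t_1} - e^{-1/t_2} + F(t_1, t_2).
\end{equation*}
For instance, for $A_\text{SL}$ with $\alpha = 0.5$ and both thresholds at the $75$th percentile of the margins ($e^{-1/t} = 0.75$), we have $A(1/2, 1/2) = 2^{-1/2}$ and $F(t, t) = 0.75^{\sqrt{2}} \approx 0.666$, so the survival probability is approximately $1 - 1.5 + 0.666 = 0.166$. This lies between the values $0.0625$ under independence and $0.25$ under complete dependence.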
As shown in Figures~\ref{fig:sl_survival} and~\ref{fig:asl_survival}, the proposed Pickands-$d$MNN estimator achieves the lowest MSE for most degrees of dependence $\alpha$ for the symmetric logistic model and for all degrees of dependence for the asymmetric logistic model. 
The proposed method performs comparatively worse in the full dependence case of the symmetric logistic model (when all components of the vector are the same), which we suspect is due to difficulties in the optimization procedure of the $d$MNN.
We additionally showcase the ability of the proposed method to model high-dimensional extreme value distributions. 
To do this, we train the Pickands-$d$MNN with data for $d=256$ with $\alpha \in \{0.25, 0.50, 0.75, 1.0\}$ and for $d \in \{ 256, 512, 728, 1024\}$ with $\alpha = 0.5$. 
Then, we compute the MSE between the Pickands-$d$MNN and the true Pickands function via Monte Carlo with 10,000 uniformly sampled points in $\Delta_{d-1}$. 
The results are illustrated for varying $\alpha$ in Figures~\ref{fig:sl_mse_est_all_a} and~\ref{fig:asl_mse_est_all_a} and for $\alpha = 0.5$ in Figures~\ref{fig:sl_mse_est_all_d} and~\ref{fig:asl_mse_est_all_d}. 
While all hyperparameters were fixed at the beginning and not fine-tuned, we note that performance may improve if additional fine-tuning is performed using a validation set. 
%This remains an area for future work.

% \newlength{\oldintextsep}
% \setlength{\oldintextsep}{\intextsep}
% \setlength\intextsep{0pt}
% \begin{wraptable}{r}{0pt}
\begin{table*}[tbh!]
\newcommand{\timesten}{\text{\tiny $\times10$}}
%\newcommand{\timesten}{ \times10}
\centering
%\footnotesize
\begin{tabular}{@{}lclllll@{}} 
\toprule
 & $d$ & Train/Test & \textsc{Pickands} & \textsc{CFG} & \textsc{BDV} & \textsc{Proposed}  \\ 
\midrule
 Wind & 10 & day/week &
 ${4.48(18.6)}\timesten^{-4}$ & $\textit{4.15(15.1)}\timesten^{-4}$ &   $\bf 4.10(16.3)\timesten^{-4}$ &   $ 4.37(17.5)\timesten^{-4}$ \\
 Ozone & 4 & day/week &
 $3.06(4.66)\timesten^{-2}$ &  $2.99(4.56)\timesten^{-2}$ & $\textit{2.86(4.46)}\timesten^{-2}$ & $\bf 2.73(4.25)\timesten^{-2}$ \\
 Commodities & 10 & week/month &
 $4.34(5.82)\timesten^{-3}$  & $4.33(5.71)\timesten^{-3}$   & $\textit{1.60(1.96)}\timesten^{-3}$  &  $\bf 1.56(2.21)\timesten^{-3}$  \\
 S\&P 500 & 418 & week/month &
 $\textit{3.02(21.2)}\timesten^{-3}$ & $\textit{3.02(21.1)}\timesten^{-3}$ & $6.28(35.2)\timesten^{-3}$ & 
 $\bf 2.41(22.2)\timesten^{-3}$ \\
 Crypto & 100 & week/month &
 ${1.06(2.85)}\timesten^{-2}$ & $\textit{1.05(4.86)}\timesten^{-2}$ & ${1.34(3.44)}\timesten^{-2}$ & $\bf 8.57(26.4)\timesten^{-3}$ \\
 COVID (NC) & 100 & week/week & $4.04(7.21) \timesten^{-2}$ & $4.04(7.19) \timesten^{-2}$ & $\it 3.83(6.51) \timesten^{-2}$ & $\bf 4.37(10.7) \timesten^{-3}$ \\
  COVID (NY) & 58 & week/week & $2.74(10.4) \timesten^{-2}$ & $2.74(10.4) \timesten^{-2}$ & $\it 2.25(7.75) \timesten^{-2}$ & $\bf 4.06(9.50) \timesten^{-3}$ \\
  COVID (CA) & 58 & week/week & $ \it 1.17(3.98) \timesten^{-2}$ & $ 1.19(3.87) \timesten^{-2}$ & $ \it 1.17(3.85) \timesten^{-2}$ & $\bf 1.18(4.83) \timesten^{-3}$ \\

\bottomrule
\end{tabular}
    \caption{MSE of different estimators in estimating maxima over longer time scales. Best and second best performances are marked in \textbf{bold} and \textit{italic} respectively.}
    \label{tab:real_data}
\end{table*}
% \end{wraptable}

\begin{figure}[h!]
    \centering
\begin{subfigure}{.22\textwidth}
  \centering
  \includegraphics[width=\linewidth]{imgs/margins/cali_extremal_NaiveEstimator.pdf}  
  \caption{Pickands 2d Margins}
  \label{fig:pick_marg}
\end{subfigure}
\begin{subfigure}{.22\textwidth}
  \centering
  \includegraphics[width=\linewidth]{imgs/margins/cali_extremal_CFGEstimator.pdf}  
  \caption{CFG 2d Margins}
  \label{fig:cfg_marg}
\end{subfigure}
\begin{subfigure}{.22\textwidth}
  \centering
  \includegraphics[width=\linewidth]{imgs/margins/cali_extremal_BDVEstimatorMM.pdf}  
  \caption{BDV 2d Margins}
  \label{fig:bdv_marg}
\end{subfigure}
\begin{subfigure}{.22\textwidth}
  \centering
  \includegraphics[width=\linewidth]{imgs/margins/cali_extremal_MaxLinear_w512d1.pdf}  
  \caption{$d$MNN 2d Margins}
  \label{fig:net_margins}
\end{subfigure}
\caption{Qualitative comparison of 10 out of 45 total 2d margins from the learned 10d MEV for the California Winds dataset. The $d$MNN is the only method that retains margins that are valid Pickands dependence functions. See additional figures in Appendices~\ref{sec:large_figs} and~\ref{sec:more_experiments}, Figures~\ref{fig:net_marg_ozone} to~\ref{fig:net_marg_crypto}.}
\label{fig:marg}
\end{figure}
\paragraph{Real data.}
We test the proposed estimator with real data on extreme ozone levels $(d=4)$, wind gusts $(d=10)$, commodity prices $(d=10)$, cryptocurrency-to-USD conversion rates $(d=100)$, S\&P 500 components with sufficient history $(d=418)$, and county-level COVID-19 case counts for California $(d=58)$, New York $(d=58)$, and North Carolina $(d=100)$.
We provide details for each dataset in Appendix~\ref{sec:exp_details}.
For the environmental datasets, we compute the maximum over the different sampling periods, while for the financial data we compute the maximum drawdown.
The maximum drawdown is defined as the difference between the maximum and minimum values over a time period, normalized by the maximum value.
For the COVID-19 data, we compute the change in case counts over different time scales. 
All margins were fit with GEVs using the \verb|scipy| implementation, which computes the $a_n, b_n$ normalizing constants.

The main challenge associated with real data is the lack of a ground truth for comparison purposes.
It is extremely difficult to accurately compare different estimators on real data because we can never observe the true distribution of extremes.
Since the purpose of EVT is to extrapolate to the tails from observations not necessarily in the tails, we consider extreme events on different time scales. 
If we fit based on extreme observations on shorter time scales and test on extreme observations on longer time scales, we will obtain an estimate of how well the different methods extrapolate to tail probabilities, since longer time scales will have more extreme events.  
%We therefore resort to computing the accuracy of the model estimate w.r.t. the empirical estimate over a series of thresholds.

We compute the accuracy of the different estimators with respect to the empirical estimate on held out data over longer time scales. 
%To compare the performance of different estimators, we compute the probability of an estimator having the closest estimate to the empirical survival probability. 
Specifically, we choose a series of quantiles where we observe data and compute the difference between the estimated survival probabilities and the empirical estimate calculated from observed data. 
This is quantified as
\begin{equation}
     \frac{1}{|Q|} \sum_{\bm{\gamma} \in Q}  \left[ \frac{1}{B} \sum_{b=1}^B \mathbbm{1}_{ \{ M_{n, b} \geq \bm{\gamma} \}} - P_\theta(M_n\geq \bm{\gamma}) \right]^2,
    \label{eqn:empirical_accuracy}
\end{equation}
where $M_{n, b} = \left(M_{n, b}^{(1)}, \ldots, M_{n, b}^{(d)} \right)$ is the $d$-dimensional vector of point-wise maxima (or point-wise maximum drawdown over a period of interest),
$P_\theta$ is the estimated survival probability, and $Q$ is a set of thresholds to consider. 
%This is quantified as: \text{Acc}(\text{est}) =  $\frac{1}{|Q|} \sum_{\gamma \in Q} \mathbb{1}\{ \text{est} = \arg \min_{\text{est}' \in \text{estimators}} (\mathbb{P}_{\text{est}'}(\gamma) - \mathbb{P}_{\text{emp}}(\gamma))^2\}$, where $\mathbb{P}_{\text{est}}(\gamma)$ is the survival probability estimated by an estimator: $\text{est} \in \text{estimators}= \{\text{Pickands, CFG, BDV, Proposed}\}$ for a given threshold $\gamma$ corresponding to a specific quantile. 
%$\mathbb{P}_{\text{emp}}(\gamma) = \frac{1}{B} \sum_{b=1}^B \mathbb{1}\{ M_{n, b} \geq \gamma\}$ with $M_{n, b} = \left(M_{n, b}^{(1)}, \ldots, M_{n, b}^{(d)} \right)$ the $d-$dimensional vector of point-wise maxima (or point-wise maximum drawdown over a period of interest) denotes the empirical survival probability for a given quantile $\gamma$.
\begin{figure}[tbh!]
    \centering
\begin{subfigure}{.22\textwidth}
  \centering
  %\includegraphics[width=\linewidth]{imgs/sl_225_neurips_thick.pdf}   
  \includegraphics[width=\linewidth]{imgs/uai/sl_CFG_sampling_da.pdf}  

  \caption{SL CFG MSE $\Delta \alpha$}
  \label{fig:sl_mse_gen}
\end{subfigure}
\begin{subfigure}{.22\textwidth}
  \centering
  %\includegraphics[width=\linewidth]{imgs/asl_225_neurips_thick.pdf}  
    \includegraphics[width=\linewidth]{imgs/uai/asl_CFG_sampling_da.pdf}  
  \caption{ASL CFG MSE $\Delta \alpha$}
  \label{fig:asl_mse_gen}
\end{subfigure}
\begin{subfigure}{.22\textwidth}
  \centering
  \includegraphics[width=\linewidth]{imgs/uai/sl_CFG_sampling_dd.pdf}  
  \caption{SL CFG MSE $\Delta d$}
  \label{fig:sl_mse_gen_all_d}
\end{subfigure}
\begin{subfigure}{.22\textwidth}
  \centering
  \includegraphics[width=\linewidth]{imgs/uai/asl_CFG_sampling_dd.pdf}  
  \caption{ASL CFG MSE $\Delta d$}
  \label{fig:asl_mse_gen_all_d}
\end{subfigure}
\caption{MSE of CFG estimate for 1000 samples and 1000 simplex points for $d=225$ at various $\alpha \in (0,1)$ (\ref{fig:sl_mse_gen}, \ref{fig:asl_mse_gen}) and for $\alpha=0.5$ at various $d=\{64, 128, 256, 784, 1024\}$ (\ref{fig:sl_mse_gen_all_d}, \ref{fig:asl_mse_gen_all_d}). Data sampled from generative model (blue), $d$MNN (orange), and ground truth (green), where the distributions considered were $A_\text{SL}$ (\ref{fig:sl_mse_gen}) and  $A_\text{ASL}$ (\ref{fig:asl_mse_gen}). Both models were trained with 1000 data points.}
\end{figure}
We choose $Q$ to be all quantiles such that the empirical probability is greater than 0. 
This measures how well the proposed method can extrapolate to greater extremes over longer time scales.
The results are presented in Table~\ref{tab:real_data} and suggest that while most estimators perform similarly, the proposed method most consistently performs the best in terms of the evaluation metric.
%Additionally, none of the baseline estimators consistently perform well on the different data, since the second best estimator changes between datasets.
We would like to emphasize that empirical evaluation on real data is very challenging, and the high variances prevent us from making meaningful statements on the efficacy of any of the methods.
However, from Figures~\ref{fig:net_marg3} and~\ref{fig:net_margins}, we see that our proposed estimator is the only one that satisfies the necessary properties of the Pickands function, which is the main purpose of the proposed method.
Specifically, if we consider the properties of convexity and bounds, the proposed estimator is the only one that retains these. 
The other estimators are not convex and attain values greater than 1, which leads to incorrect results when computing conditional probabilities. 
Additional figures in Appendix~\ref{sec:large_figs} showcase this property on additional datasets, and Figures~\ref{fig:net_wind} to \ref{fig:net_crypto} in Appendix~\ref{sec:more_experiments} compare these for different architectures. 
It is critical that these properties are satisfied so that downstream quantities such as conditional probabilities can be computed.
The state-of-the-art estimators do not satisfy these properties, which severely limits their applicability.
%\begin{comment}
\paragraph{Conditional Prediction.}
One important task is computing the conditional survival probability of a random variable. 
Suppose we have a $d$-dimensional EVC and we are interested in computing the probability that the $i^\text{th}$ component exceeds a threshold conditioned on some subset of the other components, $\mathcal{C} \subseteq  \left\{1,\ldots,d\right\} \setminus \{i\}$. 
We can compute this through the relation: $$\mathbb{P}(x_i > X_i \mid \cap_{j\in\mathcal{C}} x_j > X_j) =\frac{\mathbb{P}(x_i > X_i ,\cap_{j\in\mathcal{C}} x_j > X_j) }{ \mathbb{P}(\cap_{j\in\mathcal{C}} x_j > X_j)}.$$
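Both probabilities on the right-hand side are joint survival probabilities. One standard way to obtain them from CDF values, given here as an illustrative sketch rather than as a description of the implementation, is inclusion-exclusion: for any index set $S$ (here $S = \mathcal{C}$ or $S = \mathcal{C} \cup \{i\}$),
\begin{equation*}
\mathbb{P}\Bigl(\bigcap_{j \in S} \{x_j > X_j\}\Bigr) = \sum_{T \subseteq S} (-1)^{|T|}\, \mathbb{P}\Bigl(\bigcap_{j \in T} \{x_j \leq X_j\}\Bigr),
\end{equation*}
where the term for $T = \emptyset$ equals $1$. Each term on the right is a lower-dimensional margin of the MEV CDF and can therefore be evaluated from the estimated Pickands function. The sum has $2^{|S|}$ terms, so this route is practical only for small conditioning sets.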
For performance evaluation, we can cast this as a classification problem where we consider features $X_j,\; j\in \mathcal{C}$ with a positive class associated if the combination of $\{X_i, X_j\}, j\in \mathcal{C}$ appears in the held out data. 
Therefore we only have examples of positive classes since all the examples in the held out data are realizations that did occur.
Since we cannot observe the examples that do not occur, we must evaluate how well the method is performing based on the examples that do.
This classification problem with a single class was studied in~\citet{lee2003learning}, where they propose a metric that behaves similarly to the F1 score in binary classification.
This metric is defined as \begin{equation}\frac{r^2}{\mathbb{E}[\mathbbm{1}\{\mathbb{P}(x_i > X_i | \cap_{j\in\mathcal{C}} x_j > X_j) \geq 0.5\}]},
\label{eq:score}
\end{equation} where $r = \frac1N \sum_{k=1}^N \mathbbm{1}\{\mathbb{P}(x_i^{(k)} > X_i^{(k)} | \cap_{j\in\mathcal{C}} x_j^{(k)} > X_j^{(k)}) \geq 0.5\} $ is the proportion of correctly classified examples on the held out data.
The denominator is approximated by taking an empirical average over the space $[0,1]^{|\mathcal{C}|+1}$.
The score has a range of $[0, \infty)$ where larger values indicate better performance.
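To help interpret the scale of \eqref{eq:score}, consider the following illustrative calculation with made-up numbers. A classifier that labels every point as positive has $r = 1$ and a denominator equal to $1$, so its score is $1$. A classifier with $r = 0.9$ that labels a fraction $0.75$ of the space as positive attains
\begin{equation*}
\frac{0.9^2}{0.75} = 1.08 .
\end{equation*}
Scores above $1$ therefore indicate an improvement over always predicting the positive class, and the score is unbounded above because the denominator can be arbitrarily small.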
Table~\ref{tab:score} presents the results of classification on held-out data for the COVID datasets, where the five most populous counties are used to predict the probability that the change in case counts of the $6^\text{th}$ county is at or above the observed value.
We consider the event ``greater than or equal to'' because case counts are often underestimated owing to a lack of testing.
The results in Table~\ref{tab:score} suggest that the method is an effective tool for computing the conditional probabilities needed for classification tasks.

\begin{table}[ht!]
\footnotesize
    \centering
    \begin{tabular}{@{}llll@{}}
    & NC & NY & CA \tabularnewline
    \toprule
    \textsc{Pickands} & $8.31 \times 10^{-1}$  & $9.68 \times 10^{-1}$  & $8.46 \times 10^{-1}$ \tabularnewline
    \textsc{CFG} & $8.32 \times 10^{-1}$ & $9.69 \times 10^{-1}$  & $8.46 \times 10^{-1}$ \tabularnewline
    \textsc{BDV} & $8.10 \times 10^{-1}$ & $8.04 \times 10^{-1}$  & $7.50 \times 10^{-1}$ \tabularnewline
    \textsc{Proposed} & $\bf 9.79 \times 10^{-1}$ & $\bf 1.08 \times 10^{0}$  & $\bf 1.10 \times 10^{0}$ \tabularnewline
    \end{tabular}
    \caption{Classification score~\eqref{eq:score} on held out COVID-19 data for different states conditioned on 5 counties. Higher is better. }
    \label{tab:score}
\end{table}
%\end{comment}
\begin{comment}
This is quantified as:
$
     \frac{1}{|Q|} \sum_{\mathbf{\gamma} \in Q}  \left[ \frac{1}{B} \sum_{b=1}^B \mathbbm{1}_{ \{ M_{n, b} \geq \mathbf{\gamma} \}} - P_\theta(M_n\geq \mathbf{\gamma}) \right]^2,
    % \label{eqn:empirical_accuracy}
$
where $M_{n, b} = \left(M_{n, b}^{(1)}, \ldots, M_{n, b}^{(d)} \right)$ is the $d-$dimensional vector of point-wise maxima\footnote{In the case of maximum drawdown this is referred to as the point-wise maximum drawdown over a period of interest.} and 
$P_\theta$ is the estimated survival probability, and $Q$ is a set of thresholds to consider. 
In general, we are interested in thresholds corresponding to extremes in the tail; however, due to the lack of data in that region, we estimate accuracy in areas where more data observations are available.
\end{comment}
% \arraystretch{1.5}


%\begin{figure}
%    \centering
%\begin{subfigure}{.48\textwidth}
%  \centering
%  \includegraphics[width=\linewidth]{imgs/mse_sl_256.pdf}  
%  \caption{SL MSE}
%  \label{fig:sl_mse_est_all_d}
%\end{subfigure}
%\begin{subfigure}{.48\textwidth}
%  \centering
%  \includegraphics[width=\linewidth]{imgs/mse_asl_256.pdf}  
%  \caption{ASL MSE}
%  \label{fig:asl_mse_est_all_d}
%\end{subfigure}    
%\caption{Comparison of $||\hat{A}(w) - A(w)||_2^2$ for different estimators $\hat{A}$ for different values of $\alpha$. }
%\end{figure}

\paragraph{Sampling from the copula.}
Finally, to determine the efficacy of sampling from an arbitrary Pickands copula, we consider two synthetic examples using the previously described MEV distributions in Table~\ref{tab:sl}. 
In this experiment, we train the generator $G( \, \cdot \,; \bm \phi)$ in \eqref{eqn:segers_opt} based on 1000 samples from the target distribution. 
We represent $G(\, \cdot \,;\bm \phi)$ as a two-layer multi-layer perceptron of width 256 with $\mathrm{ReLU}$ activation functions and set $\eta = 1$.
Since the Pickands function completely determines the dependency of the random variables, we compare the CFG estimate of the Pickands function from generated samples to the true Pickands function as a measure of sampling quality.
We use the CFG estimator due to its ubiquity in the literature and its highly regarded status as a standard estimator for the Pickands dependence function.
The results for generating 225-dimensional samples with varying dependence $\alpha \in (0, 1]$ are shown in Figures~\ref{fig:sl_mse_gen} and~\ref{fig:asl_mse_gen}.
The figures suggest that the generative model performs comparatively well for both distributions considered, with the worst performance occurring in the nearly independent cases ($\alpha$ close to $1$). 
This is expected, since independence implies a spectral measure with delta functions on the corners of the simplex, which is difficult to learn (see the bottom row of Figure~\ref{fig:illustration} as an example).
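The two extreme cases can be made explicit using $A(\mathbf{w}) = \mathbb{E}_{\mathbf{y}}[\max_{k} w_k y_k]$ with $\mathbb{E}[\mathbf{y}] = \mathbf{1}_d$; this is a worked illustration, and the index $J$ and unit vectors $\mathbf{e}_k$ are notation introduced here. If $\mathbf{y} = d\, \mathbf{e}_J$ with $J$ uniform on $\{1, \ldots, d\}$, then
\begin{equation*}
A(\mathbf{w}) = \frac{1}{d} \sum_{k=1}^{d} d\, w_k = \sum_{k=1}^d w_k = 1,
\end{equation*}
which is the independence case. If instead $\mathbf{y} = \mathbf{1}_d$ almost surely, then $A(\mathbf{w}) = \max_k w_k$, which is complete dependence. In the independent case the generator must therefore concentrate all of its mass on $d$ isolated points, whereas a single point suffices under complete dependence.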
The figures additionally suggest that sampling using the learned weights of the $d$MNN has lower variance (since the spectral measure in this case is a finite discrete approximation) but does not perform as well as sampling from the generative model.
The errors of the CFG estimate for the proposed sampling methods (blue and orange) and for exact sampling (green) follow very similar trends, suggesting that both sampling methods recover the true spectral measure.
%The experiments are then repeated where we vary the dimension but fix the dependence parameter $(\alpha= 0.5)$ in Figures~\ref{fig:sl_mse_gen_all_d} and~\ref{fig:asl_mse_gen_all_d}.
%In all experiments, errors of the exact samples and the generated samples are roughly of the same order of magnitude, suggesting effective recovery of the spectral measure.
%The variances are due to the fluctuations that occur when sampling high dimensional spaces. 
\section{Concluding Remarks}\label{sec:conclusion}
We introduce a new neural network architecture for modeling MEV distributions while enforcing all the properties of the distribution.
We additionally show that the architecture can approximate any Pickands function, which allows for precise representations of MEV distributions.
Finally, we present a generative model for recovering the spectral representation.
Numerical results empirically demonstrate the effectiveness of the methods in their respective tasks.
However, the proposed methods have some limitations.
%However, it is worth mentioning that these methods can still be further be improved by considering the following limitations. 
% While we empirically show that the proposed methods are effective at their respective tasks, there are additional considerations that should be made for these methods. 

\paragraph{Limitations of Pickands-$d$MNNs and Generative Model.}
The main challenges associated with modeling using $d$MNNs are optimization and architectural choices.
Choosing appropriate hyperparameters is a difficult and opaque task that requires additional care.
This is a case where non-parametric methods are advantageous, at the cost of being unable to guarantee the necessary properties of the function. 
In general, we suggest using a wide architecture of depth one, since this is the architecture that most of the theory builds upon.
Additional progress on understanding the training of deep neural networks should improve the representational capabilities of $d$MNNs, given their theoretical potential to approximate any Pickands function to arbitrary precision.
Optimization of the generative model suffers from the same issues.
Furthermore, since the proposed method requires training a neural network for estimation, the non-parametric methods have a significant computational advantage. 
In practice, this is not a major issue since these estimators are generally fit once and the proposed method takes only a few seconds on a GPU to fit.  

\paragraph{Future Work.}
The proposed methods have possible applications in a variety of modeling situations. 
One possibility is to extend the use of $d$MNNs for estimating conditional probabilities to other classification tasks, such as out-of-distribution detection.
Another is to use the spectral measure to find groups of variables that are extreme simultaneously, as in \citet{engelke2021sparse}.
%An important extension of the present work lies in capturing clusters of variables and understanding the joint interaction between them. 
Finally, applications of extremes are important in understanding robustness properties of neural networks \citep{weng2018evaluating}, and the proposed work provides a foundation for high-dimensional extensions.
\begin{acknowledgements}
Material in this paper is based upon work supported by the Air Force Office of Scientific Research under award number FA9550-20-1-0397.
AH was supported by an NSF Graduate Research Fellowship.
\end{acknowledgements}

% \textbf{Societal Impact.}
% This work does not present any foreseeable ethical or societal consequences.
% However, notions of ``extremeness'' are subjective and thus the proposed methodology should be interpreted within the context of the modeling application.
\bibliography{hasan_123}
%\subfile{hasan_123-supp}
\end{document}
