\documentclass[accepted]{uai2023} % for initial submission
% \documentclass[accepted]{uai2023} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2023} % ptmx math instead of Computer
                                         % Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2023} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
% \usepackage[american]{babel}
\usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

%%%
\usepackage{amsmath}
\usepackage{amsfonts}
\usepackage{multicol}
\usepackage{multirow}
\usepackage{caption}
\usepackage{subcaption}
%%%

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

\title{Mixture of Normalizing Flows for European Option Pricing}

% The standard author block has changed for UAI 2023 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is authomatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{\href{mailto:<yongxin.yang@qmul.ac.uk>?Subject=UAI 2023 MoNF Option Pricing}{Yongxin Yang}{}}
\author[2,3]{\href{mailto:<t.hospedales@ed.ac.uk>?Subject=UAI 2023 MoNF Option Pricing}{Timothy M. Hospedales}{}}
% Add affiliations after the authors
\affil[1]{%
    Queen Mary University of London
}
\affil[2]{%
    University of Edinburgh
}
\affil[3]{%
    Samsung AI Centre, Cambridge
}
  
\begin{document}

\maketitle

\begin{abstract}
We present a mixture of normalizing flows (MoNF) approach to European option pricing with guarantees that its estimations are free from static arbitrage. In contrast to many existing methods that meet economic rationality constraints (e.g., non-arbitrage) by introducing auxiliary losses, our solution meets those constraints exactly by design. To achieve this, we propose to build a model for risk neutral density using normalizing flows, which results in a pricing model, instead of modelling the option pricing function directly. First, we convert the constraints for direct pricing models to the constraints for models backed by risk neutral density estimation, then we design a specific NF architecture that meets these constraints. Furthermore, we find that employing a mixture of such normalizing flows improves the performance significantly, compared to using a deeper single NF. Finally, we present a mechanism to regularise the proposed model, and this regularisation can serve as a bridge between our method and any sample-based mathematical finance method. The evaluations on five option datasets show superiority of our method compared to mathematical finance solutions and some other neural networks based methods. The code is available at \url{https://github.com/qmfin/MoNF}.
\end{abstract}

\section{Introduction}

Option pricing has long been an active research area in both mathematical finance and machine learning community. Option pricing models provide a window to explain financial market mechanics, while efficient pricing models to set bid and ask prices in derivative markets are very valuable for  practitioners. The seminal work Black--Scholes \citep{Black1973Pricing} initiated  modern derivative theory by giving a theoretical estimate of European option price.  Since then many studies have been carried out on finding better option pricing models in terms of fitting real market data more reliably. 

The research paradigm of mathematical finance (MF) usually starts from a set of assumptions, e.g., what kind of stochastic process provides a sensible and accurate model of market movements. Based on these assumptions, they end up with a parametric model that takes as input some market signals (e.g., moneyness, time to maturity, and risk-free rate) then outputs the corresponding option price. The parameters in MF models usually have physical meaning in econometric theory, and thus they are more transparent and explainable, compared to their black-box machine learning counterparts. However, when a MF model fails to fit the market data, i.e., underfitting in machine learning sense, it is not straightforward to introduce more parameters.

On the other side, machine learning approaches treat option pricing as a regression problem. Though the input-output pairs are almost the same as the MF models above, researchers tend to use generic methods, such as kernel machines and neural networks \citep{Malliaris1993Neural}. As a result, ML models can fit the market data nearly perfectly (if not, one can always choose to increase the model capacity). However, ML models usually fail important sanity checks, e.g., they would give nonsense predictions for extreme cases where no training data are available. This makes them less appealing to practitioners since their behaviours can be unpredictable for the rare events.

There are some attempts to integrate the econometric axioms into a machine learning model as inductive bias. More specifically, there are several conditions for an economically rational pricing model, and some studies \citep{Dugas2000Incorporating,Yang2017Gated,Ackerer2020Deep} try to meet these conditions by architecture design, auxiliary losses, and data augmentation. However, those methods can only meet the part of conditions with guarantees, and the rest of them are dealt with penalties that can be violated.

Our insight is that meeting all conditions for a pricing model is intrinsically hard, but it becomes much easier if we choose to work on the risk neural density (the probability density of the underlying asset's price on the maturity date) \citep{Bahra1997Implied}, which comes with the corresponding conditions.% The terminology \emph{risk neutral} roughly implies \emph{no arbitrage} and its rigorous definition can be found in \citep{Jeanblanc2009Mathematical}.

In this work, we propose a normalizing flows based model for risk neutral density, which results in a deterministic pricing function in the end. By specific architecture and activation function design, our method can meet all financial axiom conditions, and yields very accurate estimates. Furthermore, our approach sheds light on the less studied issue of how to bridge mathematical finance and machine learning approaches to option pricing. 

\section{Related Work}

% some work for option pricing, some work for normalizing flows.

\subsection{Option pricing}

Option pricing is an active research area in mathematical finance. The seminal work Black--Scholes \citep{Black1973Pricing} assumed  geometric Brownian motion (GBM) for the underlying asset price and provided a closed-form pricing formula with only one free parameter -- volatility (the standard deviation of logarithmic returns). The GBM assumption is usually invalidated in  real market, and there are two mainstream research directions for a fix. The first direction introduces  randomness to the volatility, rather than treating it as a constant. For example, \citep{Heston1993Closed} allowed arbitrary correlation between volatility and asset returns. It introduced stochastic interest rates as well, and demonstrated the effectiveness for bond options and foreign currency options. \citep{Barndorff1997Normal} took the normal inverse Gaussian law as a building element, exploring the construction of analytically and statistically tractable stochastic processes. \citep{Madan1998Variance} used the Variance Gamma process (a.k.a. the Laplace motion) because it allows for a wider of skewness and kurtosis than the Brownian motion. The second direction is to add a jump process, which represents rare events, in addition to GBM. \citep{Merton1976Option} superimposed a jump component on a diffusion component. The jump component is composed of log-normal jumps driven by a Poisson process. \citep{Peter2002Fine} generalised the Variance Gamma model \citep{Madan1998Variance} to allow for both diffusion process and jump process. \citep{Kou2002Jump} proposed a double exponential jump-diffusion model, incorporate the leptokurtic feature that the return distribution of assets having a higher peak and two heavier tails than normal distribution, and strike a balance between reality and tractability. A comprehensive study over those models can be found in \citep{Jeanblanc2009Mathematical}.

Apart from working on the pricing formula directly, it is possible to build the model through other proxies. Possible indirect modelling approaches are: (i) Implied volatility \citep{Ackerer2020Deep}, which is essentially the inverse function of Black--Scholes formula; One can build the model for implied volatility curve (SVI) or implied volatility surface (SSVI) and then apply Black--Scholes formula for pricing. (ii) Risk neural density \citep{Monteiro2018Recovering}, which models the probability distribution of asset prices in future. The majority of mathematical finance methods model the stochastic process of how asset price evolves, which naturally lead to density functions along time axis. However, they may not have analytic forms for these density functions, therefore the pricing has to be done by Monte Carlo.

\subsection{Neural networks for option pricing}

The use of neural networks for option pricing can be traced back to 1990s \citep{Malliaris1993Neural,Hutchinson1994Nonparametric}. For a comprehensive review, please refer to \citep{Ruf2020Neural}. At the first glance, neural networks appear to be a good fit because they are universal approximators, and it is easy to increase the number of parameters in the case of under-fitting. In contrast, the parameters in white-box mathematical finance models usually have physical meanings, and it is non-trivial to extend those models by introducing more parameters. However, option pricing is not simply mapping the input features (e.g., strike price, time to maturity) to the market prices. Rational option pricing models have to meet a series of conditions to guarantee that they rule out any arbitrage opportunities. To this end, many neural networks models borrow the insight from the mathematical finance models. To name a few, \citep{Garcia2000Pricing} designed a NN-based option pricing model with a similar form of Black--Scholes formula. \citep{Dugas2000Incorporating} chose specific activation functions and weight constraints so that their model has the same properties on first-order and second-order derivatives as the mathematical finance models. \citep{Yang2017Gated} adapted these strategies and also utilised \emph{learning from hint} \citep{Abu1990Learning} for the conditions that are hard to meet with architecture design. \citep{Ackerer2020Deep} extended those techniques to implied volatility surface fitting and demonstrated their effectiveness. These studies showed that employing constraints required for econometric rationality produces better option pricing models compared to vanilla unconstrained feed-forward neural networks.

Overall, neural networks are promising methods for option pricing, and there exist some successful studies in which rationality constraints are exploited. However, they usually use auxiliary loss functions to enforce the constraints, as a result, those constraints are not guaranteed to hold at all times. The reason for using auxiliary losses rather than exploiting architecture design is that the constraints are in the form of taking the limit or the higher-order gradient w.r.t. the input of the neural network. So they are hard to meet as the neural networks go deeper. For example, \citep{Dugas2000Incorporating,Yang2017Gated} have only one hidden layer, and the techniques developed can not be extended beyond that. In our work, we choose to model the risk neutral density (RND), and derive the corresponding constraints for it. We find that it becomes much easier to deal with these constraints in RND space.

\subsection{Normalizing flows}

The key ingredient of the proposed method is called normalizing flows (NF), which is a popular generative model in machine learning \citep{Papamakarios2019Normalizing}. The unique advantage of NF over other generative models is that it allows exact likelihood computation, which leads to a closed-form pricing formula in our case. The basic idea of NF is to stack a number of invertible layers, such as coupling \citep{Dinh2015NICE,Dinh2017Density}, autoregressive \citep{Kingma2016Improved}, masked autoencoder \citep{Papamakarios2017Masked}, and invertible convolution \citep{Kingma2018Glow}, such that the data distribution is transformed to a pre-defined distribution (e.g., Gaussian). Studies in this area usually focus on designing an invertible operator with small computation cost of the determinant (e.g., the operator produces a triangular matrix). Recently, NFs have been extended to the continuous domain \citep{Grathwohl2019FFJORD}. Their applications beyond generating realistic data samples are also exploited, e.g., in  lossless compression \citep{Hoogeboom2019Integer}. An early attempt of using NFs for RND modelling was \citep{Loaiza2017Maximum}, and it followed the principle of maximum entropy from the previous studies \citep{Buchen1996Maximum,Neri2012Maximum}. However, it is effectively not risk-neutral because the constraint over expectation of asset's future price is realised by an auxiliary loss, which is not guaranteed to be zero.% Besides, these studies treat the entropy as objective (i.e., maximising entropy) while fitting market prices as constraints, so they are proven to be sensitive to the noisy data \citep{Buchen1996Maximum,Guo2011Maximum} thus less practical, so we do not build our model upon the principle of maximum entropy.

\section{Methodology}

\subsection{Problem setting}

A European option is a contract that gives the holder the right, but not the obligation, to buy (call option) or to sell (put option) the underlying asset (e.g., stock) at a specified price (strike price) on a certain future date (maturity date). For example, at time $t=0$ (i.e., today), a company's stock price is \$100, and an individual buys a call option with strike price \$110 and maturity date $T=5$ (five days later). After five days, if the company's stock price is \$120, he can exercise the option and get the stock at the strike price \$110. In this case, if he sells the stock immediately, he will get \$10 profit. One the other hand, if the company's stock price is below \$110, he will choose not to exercise the option, and the only loss for him is the price of the option (sometimes called premium). For the put option, he profits if the stock price is lower than the strike price on the maturity date, as he can buy the stock from the market at a lower price and sell it at the strike price.

The problem setting of option pricing is to answer: what should the price for the option at $t=0$ be? In this work, we restrict ourselves to the case of modelling option price curves, i.e., for a given (annualised) time to maturity $\tau=\frac{T-t}{365}$ and the current underlying asset price $S_0$, we seek for the ``best'' mapping from the strike price $K$ to the option price $y$.
%
For a parametric model $f_\theta$, the process of fitting market data of $N$ option prices (sometimes referred to as \emph{calibration}) usually involves the following optimisation,
%
\begin{equation}
\underset{\theta}{\min}~ \frac{1}{N} \sum_{i=1}^{N} \|
f_\theta(K_i; z_i) - y_i\|_2^2
\label{eq:basic}
\end{equation}
%
where $K_i$ refers to the $i$-th option's strike price; $z_i$ refers to other factors such as the asset price $S_0$, time to maturity $\tau$, type of option (put/call), risk-free rate, and dividend yield of the option's underlying asset (if any); $y_i$ refers to the $i$-th option's price. The optimised model can be used to price the option contract with unseen strike prices.

\subsection{The constraints for non-arbitrage}

At the first glance, Eq.~\ref{eq:basic} appears to be a univariate regression problem. However, $f_\theta$ has to meet a series of constraints such that it rules out arbitrage opportunities.

More specifically, we consider the static arbitrage for pricing curves defined in \citep{Carr2003Stochastic}, which include the following constraints for call options

\begingroup
\renewcommand{\arraystretch}{2}
\begin{tabular}{lll}
     $\frac{\partial f_\theta(K)}{\partial K} \le 0  ~\textbf{(C1)}$ & $\frac{\partial^2 f_\theta(K)}{\partial K^2} \ge 0  ~\textbf{(C2)}$ & $f_\theta(\infty) = 0  ~\textbf{(C3)}$ \\
    \multicolumn{3}{c}{$\max(0, e^{-r\tau}(F_\tau-K)) \le
f_\theta(K) \le e^{-r\tau}F_\tau ~\textbf{(C4)}$}
\end{tabular}
\endgroup

In (C4), the forward price $F_\tau$ is defined as $F_\tau = e^{(r-q)\tau} S_0$,
where $r$ is the continuously compounded zero-coupon interest rate and $q$ is the continuously paid dividend yield. Note that, $r$ and $q$ have their own structures which need to be estimated separately in real-world trading, but it is out of the scope of this work. We treat them as known constants for a given time period $\tau$ in this work.
%
For put options, we have similar constraints,

\begingroup
\renewcommand{\arraystretch}{2}
\begin{tabular}{lll}
     $\frac{\partial f_\theta(K)}{\partial K} \ge 0  ~\textbf{(P1)}$ & $\frac{\partial^2 f_\theta(K)}{\partial K^2} \ge 0  ~\textbf{(P2)}$ & $f_\theta(0) = 0  ~\textbf{(P3)}$ \\
    \multicolumn{3}{c}{$\max(0, e^{-r\tau}(K-F_\tau)) \le
f_\theta(K) \le e^{-r\tau}K ~\textbf{(P4)}$}
\end{tabular}
\endgroup

The key insight of our work is that it is hard for a parametric model (e.g., neural networks) to meet all constraints above, but it is much easier if we chose to model the risk neural density instead. 

We propose to model the density of forward log-moneyness of the asset price $S_T$ on the maturity date, i.e., $x =\log(\frac{S_T}{F_\tau})$, because the common range of forward log-moneyness is $[-3, 3]$, which is more numerically stable than the asset price itself. 

More specific, we design the following option pricing model
%
\begin{equation}
f_\theta(K;z) = e^{-r\tau} F_\tau \int_{-\infty}^{+\infty} [\mathbb{I}(z) (e^x-\frac{K}{F_\tau})]_+ q_\theta(x) \mathrm{d}x
\label{eq:rnd_m}
\end{equation}
%
where $[\cdot]_+ := \max(0, \cdot)$ and $\mathbb{I}(z)$ is an indicator function that returns $+1$ for call option and $-1$ for put option. Here we slightly abuse the notation $z$ for the triplet $(\text{Put/Call}, F_\tau, e^{-r\tau})$ while the indicator function only applies to its first element -- option type.

With Eq.~\ref{eq:rnd_m}, all constraints for pricing model are reduced to the following three constraints on the density estimator $q_\theta(x)$. (Proof can be found in the appendix.)
\begingroup
\renewcommand{\arraystretch}{2}
\begin{center}
\begin{tabular}{ll}
     $q_\theta(x) \ge 0  ~\textbf{(R1)}\quad$ & $\quad\int_{-\infty}^{+\infty} q_\theta(x) = 1  ~\textbf{(R2)}$ \\
     \multicolumn{2}{c}{$\int_{-\infty}^{+\infty} e^x q_\theta(x) \mathrm{d}x = 1  ~\textbf{(R3)}$}
\end{tabular}
\end{center}
\endgroup
(R1) and (R2) are constraints for any valid density function, and (R3) sets the expectation of $e^x$ to be constant $1$, which means the expectation of underlying asset price at maturity date, $\mathbb{E}_q[S_T]$, should be the same as the forward price $F_\tau$, and its discounted value (by risk-free rate $r$) at $t=0$ would be $e^{-q\tau}S_0$, which corresponds to the upper bound of call option price in (C4). The density that meets (R3) is referred to as \emph{risk-neutral} density, as it basically states that the expectation of an asset priced at $S_0$ will be $S_0 e^{(r-q)\tau}$ after the period of $\tau$ (or $S_0 e^{r\tau}$ if it does not pay any dividends).

\subsection{A normalizing flows model}

In this work, we build $q_\theta(x)$ by a normalizing flows model, which maps a one-dimension standard Gaussian to the distribution of interest. More specifically, we have a sample $\epsilon\sim\mathcal{N}(0,1)$. After a sequence of invertible transforms, it becomes a sample from the risk neural density, i.e., $\epsilon \xrightarrow{g_\theta} x$ where $g_\theta(\cdot)$ stands for the sequence of transforms.

To find the best $\theta$, we can estimate Eq.~\ref{eq:rnd_m} by Monte Carlo,
%
\begin{equation}
f_\theta(K;z) = e^{-r\tau} F_\tau \frac{1}{M} \sum_{m=1}^M [\mathbb{I}(z) (\exp{(g_\theta(\epsilon_m))}-\frac{K}{F_\tau})]_+
\label{eq:mc}
\end{equation}
%
where $\{\epsilon_1,\epsilon_2,\dots,\epsilon_M\}$ are samples from standard Gaussian. Then we can minimise Eq.~\ref{eq:basic} with Eq.~\ref{eq:mc} plugged-in by any gradient based method, as long as $g_\theta(\cdot)$ is differentiable.
%
\begin{equation}
\nabla_\theta \sum_{i=1}^{N} \|
e^{-r\tau} F_\tau \frac{1}{M} \sum_{m=1}^M [\mathbb{I}(z_i) (\exp{(g_\theta(\epsilon_m))}-\frac{K_i}{F_\tau})]_+ - y_i\|_2^2
\label{eq:obj}
\end{equation}
%
One may question the necessity of normalizing flows, because any generator appears to do the job in Eq.~\ref{eq:obj}. The very unique advantage of normalizing flows is that we can get the exact density of $q(x)$ by
%
\begin{equation}
q_\theta(x) = \mathcal{N}(g_\theta^{-1}(x); 0,1) \left\vert\frac{\mathrm{d} g_\theta^{-1}(x)}{\mathrm{d} x}\right\vert
\label{eq:nf}
\end{equation}
%
Therefore we can run a numerical integration instead of Monte Carlo for Eq.~\ref{eq:rnd_m} in the testing stage, such that we will have a \emph{deterministic} result. This is crucial for real-world deployed applications, though we still prefer to do Monte Carlo during training for better efficiency. Note that, the interval of integration in Eq.~\ref{eq:rnd_m} is $[-\infty,+\infty]$ theoretically, but $[-3,3]$ is sufficient in practice, e.g., $3$ means the underlying asset price increases by $\mathrm{e}^3\approx 20$ times.

\subsubsection{Enforcing the expectation constraint}

It is non-trivial to meet (R3) by the architecture design, but it becomes possible for normalizing flows as they model the density explicitly.

If we have the following expectation calculated by numerical integration $\mu_\theta = \int_{-\infty}^{+\infty} e^x q_\theta(x) \mathrm{d}x$, it is easy to see $
\int_{-\infty}^{+\infty} \frac{e^x}{\mu_\theta} q_\theta(x) \mathrm{d}x \equiv 1$. Therefore the following pricing model meets (R3) up to machine precision.
%
\begin{equation}
f_\theta(K;z) = e^{-r\tau} F_\tau \int_{-\infty}^{+\infty} [\mathbb{I}(z) (\frac{e^x}{\mu_\theta}-\frac{K}{F_\tau})]_+ q_\theta(x) \mathrm{d}x
\label{eq:rnd_rescaled_wo_sg}
\end{equation}
%
However, differentiating through $\mu_\theta$ (the numerical integrator) is unnecessarily expensive, thus we place the stop gradient operator $\operatorname{SG}(\cdot)$ over $\mu_\theta$, i.e.,
%
\begin{equation}
f_\theta(K;z) = e^{-r\tau} F_\tau \int_{-\infty}^{+\infty} [\mathbb{I}(z) (\frac{e^x}{\operatorname{SG}(\mu_\theta)}-\frac{K}{F_\tau})]_+ q_\theta(x) \mathrm{d}x
\label{eq:rnd_rescaled}
\end{equation}
%
To understand the effect of $\operatorname{SG}(\mu_\theta)$, an equivalent realisation is: (i) duplicate $\theta$ into online parameter $\theta_1$ and offline parameter $\theta_2$ and (ii) use the following pricing formula,
%
\begin{equation}
f_{\theta_1,\theta_2}(K;z) = e^{-r\tau} F_\tau \int_{-\infty}^{+\infty} [\mathbb{I}(z) (\frac{e^x}{\mu_{\theta_2}}-\frac{K}{F_\tau})]_+ q_{\theta_1}(x) \mathrm{d}x
\end{equation}
%
For optimisation, we initialise $\theta_1$ and set $\theta_2=\theta_1$ to ensure the expectation constraint (R3) is met, then alternate
%
\begin{description}
    \item[Step (1)] Update $\theta_1$ by running one-step gradient descent for reducing the pricing errors.
    \item[Step (2)] Update $\theta_2$ by setting $\theta_2=\theta_1$.
\end{description}
%
We can see that, Step (1) improves the fitness of pricing model to market data, but it breaks the expectation constraint. Step (2) fixes the expectation constraint issue, but it worsens the fitness. By alternating these two steps until convergence, we should have the final model that fits the market data well and meets the the expectation constraint. When it comes to the \emph{actual} implementation, it is easier to use stop gradient $\operatorname{SG}(\mu_\theta)$ in Eq.~\ref{eq:rnd_rescaled} instead of the explicit alternating optimisation explained above. With Eq.~\ref{eq:rnd_rescaled}, we modify the Monte Carlo estimator in Eq.~\ref{eq:mc} correspondingly.
%
\begin{equation}
f_\theta(K;z) = e^{-r\tau} F_\tau \frac{1}{M} \sum_{m=1}^M [\mathbb{I}(z) (\frac{\exp{(g_\theta(\epsilon_m))}}{\operatorname{SG}(\mu_\theta)}-\frac{K}{F_\tau})]_+
%\label{eq:mc2}
\nonumber
\end{equation}
%
%
\begin{figure}[t]
\centering
\includegraphics[width=0.99\columnwidth]{img/activation_func}
\caption{Plots of the proposed activation function (Eq.~\ref{eq:func}), its inverse, and its first order derivative.}
\label{fig:func}
\end{figure}
%
\subsubsection{Activation function design}

As we only need to transform a one-dimensional distribution to another, a layer in our model is simply,
%
\begin{equation}
h^{(L+1)} = \phi(a^{(L)} \times h^{(L)} + b^{(L)})
\label{eq:layer}
\end{equation}
%
where $\phi(\cdot)$ is an invertible activation function. Apart from being invertible, we introduce two more desired properties: (i) symmetric: we start from a standard Gaussian, which is symmetric, and we do not have bias towards positive or negative sides, i.e., we do not assume the asset price should increase or decrease. (ii) it is close to an identity function for the input with sufficiently large absolute value, because we need to avoid a possible explosion in values, in both $\epsilon \xrightarrow{f_\theta} x$ and $x \xrightarrow{f_\theta^{-1}} \epsilon$, given that multiple transformations (layers) can be stacked.

Frustratingly, the identity function $\phi(x)=x$ meets all the conditions above, but we want a degree of non-linearity, otherwise the model is reduced to Gaussian. To this end, we design the following activation function, based on some translation and rotation of the softplus function,
%
\begin{equation}
\phi(x) =
\begin{cases}
\log(1+\mathrm{e}^x) - \log(2)& \text{if $x \ge 0$}\\
-\log(1+\mathrm{e}^{-x}) + \log(2) & \text{if $x<0$}
\end{cases}   \label{eq:func}    
\end{equation}
%
The plots of $\phi(x)$, its inverse $\phi^{-1}(x)$, and its gradient $\phi'(x)$ can be found in Fig.~\ref{fig:func}.

\subsubsection{Mixture of normalizing flows}

The model that we have developed so far has one drawback: it is hard to scale up  by introducing more trainable parameters. Every layer contains two parameters only, i.e., the shift term $a$ and the bias term $b$. To add more parameters in the case of under-fitting, we have to stack more layers. However, this would cause difficulties for training, e.g. we may have the gradient vanishing-exploding problem.

To add one more axis in the parameter space, we propose to use a mixture of normalizing flows models, rather than a single one, i.e.,
%
\begin{equation}
q(x) = \sum_{i=1}^K \pi_i q_i(x) ~~~\text{s.t.}~~~ \pi_i \ge 0, \sum_i \pi_i = 1
\label{eq:mix}
\end{equation}
%
Behind each $q_i$ is an independent NF model, realised by $z \xrightarrow{f_{\theta_i}} x$.% As a result, if the model has $K$ components, and each has $L$ layers, the total number of trainable parameters will be $2(L-1)K$ (recall that the final layer does not have any trainable parameters).
%
We can also verify that all conditions are still met as long as $\pi$ is a probability simplex. In fact, the mixture of normalizing flows effectively creates $K$ pricing models, and the final price is the re-weighted sum of the outputs of those models, i.e.,
%
\begin{equation}
\hat{c} = \operatorname{SOFTMAX}(\tilde{\pi}) \cdot [\hat{c}_1,\hat{c}_2,\dots,\hat{c}_K]
\label{eq:full_model}
\end{equation}
%
where $\tilde{\pi} \in \mathbb{R}^K$ and each $\hat{c}_i$ is estimated by Monte Carlo (Eq.~\ref{eq:mc}). With Eq.~\ref{eq:full_model}, Eq.~\ref{eq:mc}, and Eq.~\ref{eq:basic}, we can train all parameters $\{\tilde{\pi}, \theta_1, \theta_2,\dots,\theta_K\}$ using gradient based methods.

%
\begin{figure*}[t]
    \centering
    \includegraphics[width=0.45\textwidth]{img/train.pdf}~
    \includegraphics[width=0.45\textwidth]{img/test.pdf}
    \caption{The two ways to split the training and testing set: Interpolation (left) and Extrapolation (right)}
    \label{fig:split}
\end{figure*}
%
%
\begin{table*}[ht]
\centering
\resizebox{0.95\linewidth}{!}{
\begin{tabular}{l|rrrrr|rrrrr}
\toprule
{} & \multicolumn{5}{|c|}{Number of Contracts} & \multicolumn{5}{c}{Average Prices} \\
 &   Alphabet &     Apple & Microsoft &  NASDAQ100 &       S\&P500 &   Alphabet & Apple & Microsoft & NASDAQ100 &  S\&P500 \\
\midrule
Call & 1,939,066 & 809,333 & 63,0234 & 4,292,133 & 10,029,359 &     253.69 & 35.75 &     30.29 &   1131.10 & 441.96 \\
Put & 1,943,777 & 803,095 & 63,4802 & 4,186,962 & 10,492,160 &      96.24 & 23.13 &     16.72 &    458.22 &  95.50 \\
\bottomrule
\end{tabular}%
}
\caption{Some Statistics of the Dataset: Number of Contracts (Left) and Average Prices (Right)}
\label{tab:stats}
\end{table*}
%
\subsubsection{Regularisation}\label{sec:reg}

The mixture of normalizing flows can fit  market data nearly perfectly, but it may result in an overly complicated density function (see zig-zag in Fig.~\ref{fig:reg}). Though a non-smooth density function will not lead to a non-smooth pricing function, because the density function is just the second order derivative of pricing function, we want a mechanism for regularisation to reduce the risk of over-fitting\footnote{Strictly speaking, only non-smooth pricing function raise concerns over over-fitting. Hence the regularisation is mainly for elegance in machine learning rather than generalization.}.
%treatment for OCD of machine learning researchers.}.

Assume that we have a reference risk neutral density $\bar{q}(x)$, which could come from a well-calibrated mathematical finance model, or other model with much fewer parameters. To regularise our model $q_\theta(x)$, we can minimise the divergence of $q_\theta(x)$ and $\bar{q}(x)$, i.e., 
%
\begin{equation}
\underset{\theta}{\min}~\mathcal{D}(q_\theta(x)||\bar{q}(x))
\label{eq:div}
\end{equation}
%
Here we choose $\mathcal{D}(\cdot,\cdot)$ to be the Wasserstein distance (a.k.a. earth mover's distance), because it has a closed-form solution for sample-based approximation under the one-dimensional case. For example, if we draw $M$ samples $\{\bar{x}_1, \bar{x}_2,\dots,\bar{x}_M\}$ from $\bar{q}(x)$, sort them in ascending order, i.e., $\bar{x}_{[1]}<\bar{x}_{[2]}<\cdots<\bar{x}_{[M]}$. Similarly, we get the sorted samples from $q_\theta(x)$, say, $x_{[1]}<x_{[2]}<\cdots<x_{[M]}$, then the estimate of Eq.~\ref{eq:div} is,
%
\begin{equation}
\underset{\theta}{\min}~ \frac{1}{M} \sum_{i=1}^{M} |x_{[i]} - \bar{x}_{[i]}|
\label{eq:emd}
\end{equation}
%
To calculate the gradient for Eq.~\ref{eq:emd}, we need to differentiate a random sample $x$ w.r.t. distribution parameters $\theta$ (in this case $\theta=\{\tilde{\pi}, \theta_1, \theta_2,\dots,\theta_K\}$). Since $q_\theta(x)$ is one-dimensional mixture of distributions, this can be done by the implicit gradient technique developed in \citep{Graves2016Stochastic,Figurnov2018Implicit}.
%
\begin{equation}
\frac{\partial x}{\partial \theta} = - \frac{1}{q_\theta(x)} \frac{\partial }{\partial \theta} \int^x_{-\infty} q_\theta(u) \mathrm{d}u
\label{eq:reparam}
\end{equation}
%
\begin{table*}[t]
\centering
\resizebox{0.95\linewidth}{!}{
\begin{tabular}{lrrrrrrrrrr}
\toprule
 & \multicolumn{2}{c}{Alphabet} & \multicolumn{2}{c}{Apple} & \multicolumn{2}{c}{Microsoft} & \multicolumn{2}{c}{NASDAQ100} & \multicolumn{2}{c}{S\&P500} \\
 &      Test &    Train &   Test &  Train &      Test &  Train &      Test &     Train &       Test &    Train \\
\midrule
Black Scholes  &   0.3436 &   0.3448 & 0.2284 & 0.2240 &    0.2149 & 0.2175 &    0.3156 &    0.3189 &    0.5354 &    0.5358 \\
Heston        &   0.2855 &   0.2872 & 0.1825 & 0.1810 &    0.1185 & 0.1154 &    1.2811 &    1.3029 &    0.8356 &    0.8377 \\
Kou Jump       &   1.0465 &   0.9165 & 0.6261 & 0.6589 &    0.4294 & 0.4246 &    1.3204 &    1.2927 &    1.7156 &    1.7563 \\
Merton Jump    &   0.3384 &   0.3385 & 0.2284 & 0.2240 &    0.2149 & 0.2175 &    0.4049 &    0.4055 &    0.3794 &    0.3797 \\
Variance Gamma &   0.2351 &   0.2365 & 0.1150 & 0.1106 &    1.0360 & 1.3023 &    0.1324 &    0.1314 &    0.1494 &    0.1499 \\
\midrule
MNN    &   0.3427 &   0.3456 & 0.2325 & 0.2586 &    0.2275 & 0.2113 &    0.4221 &    0.4824 &    0.3883 &    0.3825 \\
GNN    &   0.2896 &   0.3021 & 0.1921 & 0.2242 &    0.1300 & 0.1147 &    1.0503 &    1.2572 &    0.7849 &    0.8472 \\
\midrule
MoNF (Linear)         & 0.4791 & 0.5077 & 0.1209 & 0.1325 & 0.0626 & 0.0510 & 0.0559 & 0.0849 & 0.0479 & 0.0671\\
MoNF          & \textbf{0.1715} & \textbf{0.1649} & \textbf{0.0448} & \textbf{0.0391} &   \textbf{0.0391}& \textbf{0.0363} & \textbf{0.0229} & \textbf{0.0220} & \textbf{0.0310} & \textbf{0.0334} \\
\bottomrule
\end{tabular}
}
\resizebox{0.95\linewidth}{!}{
\begin{tabular}{lrrrrrrrrrr}
\toprule
Black Scholes  &    0.4950 &   0.1840 & 0.3676 & 0.0994 &    0.3498 & 0.0889 &    0.4717 &    0.1705 &     0.6185 &   0.4270 \\
Heston        &    0.4228 &   0.1066 & 0.3472 & 0.0661 &    1.8010 & 0.4541 &    1.4452 &    0.3736 &     0.6884 &   1.0936 \\
Kou Jump       &    4.0404 &   0.2047 & 1.1308 & 0.0602 &    1.0607 & 0.0471 &    7.1047 &    0.1038 &     6.1269 &   0.2024 \\
Merton Jump    &    1.0406 &   0.1824 & 0.3676 & 0.0994 &    0.3498 & 0.0889 &    1.4126 &    0.1100 &     0.5640 &   0.2514 \\
Variance Gamma &    0.3924 &   0.0811 & 0.2389 & 0.0469 &    0.1937 & 0.0344 &    0.2279 &    0.0323 &     0.5670 &   0.0904 \\
\midrule
MNN    &    1.0287 &   0.1906 & 0.3987 & 0.1023 &    0.3214 & 0.0934 &     1.3352 &   0.1054 &    0.5568 &   0.2459 \\
GNN    &    0.4204 &   0.1099 & 0.3544 & 0.0633 &    2.1005 & 0.4199 &     1.2218 &   0.3088 &    0.7339 &   0.9670 \\
\midrule
MoNF (Linear)         & 0.5706 & 0.0951 & 0.3490 & 0.0626 & 0.1620 & 0.0854 & 0.1184 & 0.0385 & 0.6351 & 0.1691 \\
    MoNF          & \textbf{0.1706} & \textbf{0.0289} & \textbf{0.1564} & \textbf{0.0221} &    \textbf{0.0982} & \textbf{0.0222} & \textbf{0.1084} & \textbf{0.0185} & \textbf{0.3130} & \textbf{0.0564} \\
\bottomrule
\end{tabular}
}
\caption{Option Pricing Performance (Mean Absolute Error): Interpolation Setting (Top) and Extrapolation Setting (bottom)}\label{tab:mae}
\end{table*}
%
The integral part in Eq.~\ref{eq:reparam} can be estimated by numerical integration, starting from a sufficiently small value, i.e., $-3$, and ending by a value which is sufficiently close to $x$.

The techniques developed here can serve as a bridge between our machine learning model and  mathematical finance models, enabling us to use a well-calibrated (well-trained) mathematical finance model to \emph{regularise} our model, so that it  behaves closely to the MF model. This is potentially useful to improve out-of-distribution predictions of our model.

Besides, many MF models do not have a risk neutral density or a pricing function in an analytic form, as they are essentially sample based. Therefore, we can approximate a MF model and obtain an analytic expression for RND using Eq.~\ref{eq:emd}, which can be used for further analysis or simply a means for producing a deterministic price in a more efficient manner (numerical integration v.s. Monte Carlo).

\section{Experiments}

We evaluate our method as well as several baselines on five option datasets including three stock options: Alphabet, Apple Inc, and Microsoft; two index options: NASDAQ100 and S\&P500. Since there are not real ground truths for option prices, we assume that the frequently traded options reflect the true value of option contracts.

\subsection{Data pre-processing}

The full options dataset is provided by a third party, and we release an anonymised snapshot of data with the coverage over the period from 01-Jan-2017 to 31-Dec-2021\footnote{\url{https://github.com/qmfin/option_data}}.

The option price is the mid-point of best bid and best offer quotes by the end of day. Besides, we apply three filters: (i) we exclude all options with time to maturity shorter than $7$ days or longer than $2$ years (ii) we discard options with implied volatility larger than $1.5$ as they are less frequently traded. (iii) we discard options with best offer smaller than $0.05$ for the same reason. In total we have $35$ million option contracts left after filtering. Some statistics on the number of contracts for each asset and average prices can be found in Table.~\ref{tab:stats}

\subsubsection{Train-test split}

The dataset for each individual experiment is formed by options with the same maturity on every trading day, and we have around $120,000$ datasets/experiments. On average, each dataset has $300$ option contracts/samples. For those samples, we have two strategies to split them into the training and testing sets:
\begin{description}
\item[Interpolation] We sort all options by their strike prices, and choose the even-indexed ones as training and odd-indexed ones as testing.
\item[Extrapolation] We choose $50\%$ all options with the strike prices near the current asset price for the training, and leave the remaining options with strike prices far away from the current asset price for the testing.
\end{description}

The illustration of these two splits can be found in Fig.~\ref{fig:split}. 

\subsection{Implementation}

We implement our method using PyTorch \citep{Paszke2019PyTorch}, and tune the hyper-parameters using the data from the day before the starting date then keep them fix for the whole five years' experiments. For the main  experiments, we use a mixture model with $8$ NFs and each NF has $4$ layers, and the reference distribution is a single $4$-layer NF. We also evaluate the performance for MoNF with identify function instead of the proposed activation function, denoted as MoNF (Linear). Note that, the model is actually reduced to Gaussian Mixture Model (GMM), as the linear transformation of Gaussian is still Gaussian.

For neural network based methods, we choose Modular Neural Networks (MNN) \citep{Gradojevic2009Option} and Gated Neural Networks (GNN) \citep{Yang2017Gated}, which partially meet the constraints. For mathematical finance baselines, we choose Black--Scholes \citep{Black1973Pricing}, Heston \citep{Heston1993Closed}, Variance Gamma \citep{Madan1998Variance}, Kou Jump \citep{Kou2002Jump}, and Merton Jump \citep{Matsuda2004Introduction}, as they represent different diffusion and jump process designs.

All models are trained to minimise the mean absolute error of the real market v.s. predicted prices, and the true difference over prices is reported. This is different from some studies that train and evaluate their methods using the normalised price (e.g., option price divided by asset price), as that may not reflect the performance in trading.

\subsection{Results analysis}

In option pricing, we care both the training and testing error, because training error reflects how well the model explains market behaviour, and the testing error reflects how well the model performs if we ask for a quote of an unseen strike price. In real market data, the strike prices are some common numbers (e.g., $50$, $100$, $150$), and a practical use of option pricing is to estimate the price for an arbitrary strike (e.g., $76.54$). From the summary of results in Table~\ref{tab:mae},  we can easily see the advantage of the proposed method over other baselines. Interestingly, other neural network based methods, such as MNN and GNN, do not outperform the mathematical finance models constantly, which use significantly fewer parameters. We further investigate why this happened. Both MNN and GNN are designed for call option pricing, and the designs are made to meet some (not all) of the constraints so it is not easy to adapt the same design for put options, therefore, we have to convert put options to call options via put-call parity, and convert them back after pricing. However, the re-converted put options have much larger errors, even though their converted call options are well fitted by MNN and GNN. This is reasonable: put-converted call options are usually much cheaper than their original prices, so the errors over put-converted call options will be amplified when converting back to put options.


\subsubsection{Is mixture model necessary?}

One may question that whether we really need a mixture of normalizing flows, or the similar performance can be achieved by using a single but deeper NF model. Here we evaluate the model performance for three pairs of models: (i) an 8-layer NF v.s. a 4 $\times$ 2-layer MoNF  (ii) a 16-layer NF v.s. a 4 $\times$ 4-layer MoNF (iii) a 32-layer NF v.s. a 8 $\times$ 4-layer MoNF. The results can be found in Fig.~\ref{fig:mix}. Increasing the number of parameters helps to reduce the testing error in both cases, but it is more effective to use a mixture of NFs models rather than a single deep NF model. Besides, we have also found that the deep single model is more sensitive to initialisation empirically because of the numerical stability issues when more layers/transformations are added, so the mixture of shallower normalising flows is preferred.

%
\begin{figure}[t]
\centering
\includegraphics[width=0.875\linewidth]{img/mix}
\caption{Single NF models vs.\ Mixture of NFs (Test-MAE)}
\label{fig:mix}
\end{figure}
\begin{figure}[t]
\centering
\includegraphics[width=0.875\linewidth]{img/reg}
\caption{The effect of regularisation on MoNF.}
\label{fig:reg}
\end{figure}
%
\subsubsection{Regularisation}

We investigate the effect of the regularisation term (Sec~\ref{sec:reg}). First, we train a single NF model and use it as a prior distribution. Then, we train our model with/without the regularisation term and plot the risk neutral density in Fig.~\ref{fig:reg}. We can see that the density without regularisation tends to be non-smooth, due to its nature of being a mixture model. In contrast, the density with regularisation is much closer to the single NF model's. We also compare the performance of MoNF without regularisation on S\&P500 index option. As we can see in Table~\ref{tab:mae_wo_reg}, the performance is improved slightly without regularization. We reiterate that the model without regularization does not invalidate any constraints as the density is still valid despite being non-smooth. A more realistic use of regularization is to form a rather simple density function, e.g., a uni-modal distribution, so one can answer the questions like ``what is the option inferred \emph{mode} of future asset price from market data''.
%
\begin{table}[ht]
\centering
\resizebox{0.89\linewidth}{!}{
\begin{tabular}{llrr}
\toprule
  & & Test &    Train \\
\multirow{2}{*}{Interpolation} & MoNF (w. Reg.)  & 0.0310 & 0.0334 \\
& MoNF (w.o. Reg) &  \textbf{0.0301} & \textbf{0.0290}\\
\hline
\multirow{2}{*}{Extrapolation} & MoNF (w. Reg.)   & \textbf{0.3130} & 0.0564\\
& MoNF (w.o. Reg)         & 0.3224 & \textbf{0.0519}
\\\bottomrule
\end{tabular}
}
\caption{Impact of regularization (S\&P 500, MAE).}
\label{tab:mae_wo_reg}
\end{table}
%
\section{Conclusion}

In this work, we presented a European option pricing model based on the normalizing flows. It achieved excellent performance in terms of fitting the market data as well as generalising to unseen strike prices. More crucially, it met all conditions for an economically rational model, so we are assured that it will behave rationally in out-of-distribution scenarios and rare events. Finally, we demonstrate that neural networks for risk neutral density could potentially be a better choice than directly modelling the pricing function. For the future work, we plan to extend the model from pricing curves to pricing surfaces.

\textbf{Disclaimer:} All authors are faculty. Neither graduate students nor small animals were hurt while producing this paper.

\bibliography{finance}

\end{document}
