%%%%%%%% ICML 2022 EXAMPLE LATEX SUBMISSION FILE %%%%%%%%%%%%%%%%%
%\documentclass{uai2022} % for initial submission
\documentclass[accepted]{uai2022} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2022} % ptmx math instead of Computer
                                         % Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2022} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams


\usepackage{microtype}
\usepackage{graphicx}
\usepackage{subfigure}

% hyperref makes hyperlinks in the resulting PDF.
% If your build breaks (sometimes temporarily if a hyperlink spans a page)
% please comment out the following usepackage line and replace
% \usepackage{icml2022} with \usepackage[nohyperref]{icml2022} above.
%\usepackage{hyperref}
\usepackage[capitalize,noabbrev]{cleveref}
% For theorems and such
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{amsthm}
%\usepackage{zref-xr}
%\zxrsetup{tozreflabel=false, toltxlabel=true, verbose}
%\zexternaldocument{liu_286-supp}
% if you use cleveref..

\newcommand{\RR}{\mathbb{R}}
\newcommand{\CC}{\mathbb{C}}
\newcommand{\br}{\boldsymbol{r}}
\newcommand{\bx}{\boldsymbol{x}}
\newcommand{\bz}{\boldsymbol{z}}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% THEOREMS
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\theoremstyle{plain}
\newtheorem{theorem}{Theorem}[section]
\newtheorem{proposition}[theorem]{Proposition}
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{corollary}[theorem]{Corollary}
\theoremstyle{definition}
\newtheorem{definition}[theorem]{Definition}
\newtheorem{assumption}[theorem]{Assumption}
\theoremstyle{remark}
\newtheorem{remark}[theorem]{Remark}
\usepackage{xr}
\externaldocument{liu_286-supp}

% Todonotes is useful during development; simply uncomment the next line
%    and comment out the line below the next line to turn off comments
%\usepackage[disable,textsize=tiny]{todonotes}
\usepackage[textsize=tiny]{todonotes}


\title{PathFlow: A Normalizing Flow Generator that Finds Transition Paths}

% The standard author block has changed for UAI 2022 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is authomatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author{Tianyi~Liu}
\author{Weihao~Gao}
\author{Zhirui~Wang}
\author{Chong Wang}
% Add affiliations after the authors
\affil{%
   ByteDance Inc.
}

\begin{document}

\maketitle
\begin{abstract}
Sampling from a Boltzmann distribution to calculate important macro statistics is one of the central tasks in the study of large atomic and molecular systems. Recently, a one-shot configuration sampler, the Boltzmann generator \citep{noe2019boltzmann}, was introduced. Although a Boltzmann generator can directly generate independent metastable states, it lacks the ability to find transition pathways and describe the whole transition process. In this paper, we propose PathFlow, which can function as a one-shot generator as well as a transition pathfinder. More specifically, a normalizing flow model is constructed to simultaneously map the base distribution and a linearly interpolated path in the latent space to the Boltzmann distribution and a minimum (free) energy path in the configuration space. PathFlow can be trained by standard gradient-based optimizers using the proposed gradient estimator, which comes with a theoretical guarantee. PathFlow is validated on extensively studied examples, including a synthetic M\"{u}ller potential and alanine dipeptide, and shows remarkable performance.
\end{abstract}

\section{Introduction}
In the study of large atomic and molecular systems, the calculation of important macro statistics
such as the total energy of the system or the folding probability of a protein is of fundamental importance \citep{tuckerman2010statistical}. One may turn to Monte Carlo methods that require unbiased sampling of the equilibrium distribution. In many applications, the distribution can be expressed by the Boltzmann distribution: $$p(\br) = \frac{1}{Z}\exp(-\mathcal{K}(\br)),$$ where $\br$ is one configuration of the system, $Z$ is the normalizing constant, and $\mathcal{K}(\br)$ is a dimensionless energy determined by the potential energy of the system together with thermodynamic quantities such as the temperature. The statistics typically rely on sufficient observation of all important configurations. However, enumerating these configurations is usually infeasible.


%Since was first introduced in the early 1950s, Molecular Dynamics (MD) has become one of the most successful sampling methods. As a MCMC (Markov Chain Monte Carlo) type methods, MD simulates the movement of the particle and under certain conditions the trajectory can converge to the stationary distribution. However, this convergence can be extremely slow  since the existence of high energy barrier lies among  metastable states makes the transition a rare event. To tackle this difficulty, one line of research focuses on speeding up the simulation. Enhanced sampling approaches such as Umbrealla Sampling, Meta Dynamics introduce bias to the potential along certain coordinates (also known as collective variables) to encourage simulation trajectory to climb over high energy barriers and thus enhance the probability of the transition events. These methods has been widely applied in many real world applications such as drug discovery, protein folding. 

Recently, \cite{noe2019boltzmann} introduced a machine-learning-based Boltzmann distribution sampler, known as the Boltzmann generator. Following the idea of normalizing flows, Boltzmann generators seek an invertible mapping $F_{ZX}(z)$ from a latent space $Z$ to the configuration space $X$ that maps a simple Gaussian distribution to the targeted Boltzmann distribution. Unlike molecular dynamics (MD) sampling methods, which require long simulations, Boltzmann generators can produce uncorrelated, low-energy samples from different metastable states in one shot.  %that jumps out of the frame work of classical MD. This method tries to find an invertible transformation $F_{ZX}(z)$ from a latent space $Z$ to the configuration space $X$ that maps a simple Gaussian distribution to the targeted Boltzmann distribution. This idea, also known as Normalizing Flow, has been extensively employed by many other machine learning tasks such as vision, speech recognition and etc. After training a normalizing flow model using short MD simulation data and energy data, Boltzmann Generator can produce uncorrelated and low energy samples from different metastable states. 

Though the Boltzmann generator successfully repacks the high-probability regions of the configuration space into a concentrated latent space density, its ability to explore high-energy regions and to find transition pathways is not well justified. {The synthetic experiments in \cite{noe2019boltzmann} report that transition pathways with low energy and high probability can be obtained by mapping linearly interpolated paths in the latent space.} However, there are neither theoretical results nor physical constraints to guarantee the physical meaning behind this observation. The transition path between metastable states is an important concept in molecular dynamics because it describes the transition mechanism. For instance, the transition path can be used to evaluate the lowest energy barrier and the transition rate, and the rate is a useful metric for assessing materials in applications such as catalyst discovery. The transition path can also help identify favorable conditions for chemical reactions. The lack of a physical interpretation of direct paths in the latent space limits the application of Boltzmann generators to transition path finding. To the best of our knowledge, there has been no successful effort yet to improve the path-finding ability of Boltzmann generators.

\begin{figure}[!t]
    \centering
    \includegraphics[width=\linewidth]{figure/illustration.pdf}
    \caption{Illustration of PathFlow, which maps the base distribution and a linearly interpolated path to the Boltzmann distribution and a transition path simultaneously.}
    \label{fig:illustration}
\end{figure}

In this paper, an extended normalizing flow method, named PathFlow, is proposed to improve the learning of transition paths. Besides retaining the ability to generate independent samples from the Boltzmann distribution, PathFlow introduces physical constraints during training to regularize the mapping of linearly interpolated paths between two metastable states to the {\it minimum energy path} (MEP) or the {\it minimum free energy path} (MFEP). A simple illustration of this mapping is provided in Figure \ref{fig:illustration}.

Specifically, a system with two metastable states centered around $A$ and $B$ is considered. An invertible function $F$ is learnt in two modes: 

\emph{Learning on examples} follows the general training of normalizing flows where we collect data of metastable states from MD and then train the model by minimizing the negative log-likelihood loss function $L_{\text{NF}}.$

\emph{Learning on paths} is the main principle behind PathFlow. Following the physical definition of the MEP and the MFEP, another loss function $L_{\text{path}}$ is designed to measure how well $F$ maps the linearly interpolated path in the latent space to a physically meaningful transition path. On-the-fly estimators of the physical quantities required to calculate $L_{\text{path}}$ and its gradient $\nabla L_{\text{path}}$ are provided based on restraint dynamics \citep{maragliano2006string,maragliano2006temperature}. 

Therefore, unlike other path finding methods \citep{jonsson1998nudged, weinan2002string}, PathFlow can be trained by applying gradient-based methods to minimize the total loss:
$$L = w_{\text{NF}} L_{\text{NF}} + w_{\text{path}} L_{\text{path}}.$$
In experiments on the extensively studied synthetic M\"{u}ller potential and on real-world alanine dipeptide, PathFlow achieves remarkable performance. Our contributions are summarized below:
\begin{itemize}
\item We introduce physical constraints into normalizing flows, which leads to a new machine learning model with knowledge of both the high-energy and the low-energy regions of a system. This new model can serve as a data generator as well as a transition path finder.
\item We design a loss function $L_{\text{path}}$ to measure the quality of a candidate transition path and provide its estimator based on restraint dynamics. Theoretical bounds on the estimation error are also provided.
\end{itemize}


\section{Related Literature}
%\noindent $\bullet$ 
{\bf Molecular Dynamics.}
The first molecular dynamics simulations date back to the mid-20th century \citep{osti_4322875,mccammon1977dynamics}.
Over the past several decades, with the fast development of computational sciences, MD has been successfully applied to physics, chemistry, biology, materials science, and several other fields. One of the greatest challenges of MD is to sample the rare events of state transitions. Enhanced sampling is thus needed to accelerate the dynamics. One line of research focuses on adding bias to the potential along pre-defined collective variables (CVs) to lower the energy barrier. Such methods include, but are not limited to, the widely used umbrella sampling \citep{torrie1977nonphysical}, the adaptive biasing force method \citep{darve2001calculating}, metadynamics \citep{laio2002escaping}, and variationally enhanced sampling \citep{valsson2014variational}. However, in many systems, proper CVs are not easily identified. In such situations, CV-free methods can be helpful. A number of such methods have been proposed, such as parallel tempering \citep{swendsen1986replica}, replica exchange molecular dynamics \citep{sugita1999replica} and integrated tempering sampling \citep{gao2008integrate}.

%\noindent $\bullet$ 
{\bf Transition Path Finding.} 
The study of the transition between metastable states is one of the most fundamental problems in chemistry. Existing literature such as transition state theory \citep{pechukas1981transition}, transition path sampling \citep{dellago2002transition} and transition path theory \citep{vanden2006transition} establishes theoretical foundations for understanding the mechanism of the transition. The well-known transition state theory states that the system has to pass through the transition state, which is a saddle point on the potential energy surface. The most probable transition path for the reaction is the MEP. Popular methods for finding the MEP include the nudged elastic band (NEB) method \citep{jonsson1998nudged}, the string method \citep{weinan2002string} and its variations \citep{weinan2007simplified,maragliano2007fly,pan2008finding}. \cite{maragliano2006string} extend the definition of the MEP to the free energy space and modify the string method to find the MFEP. Since then, the MFEP has been widely explored \citep{branduardi2007b,chen2013efficiently} and used in various applications \citep{hu2007qm,matsunaga2012minimum}.


%\noindent $\bullet$ 
{\bf Normalizing Flow.} Normalizing flows (NFs) are a family of generative models with tractable distributions for which both sampling and density evaluation can be efficient and exact. They were popularized by \cite{mohamed2015variational} in the context of variational inference. Popular architectures include, but are not limited to, the planar flow, nonlinear independent components estimation (NICE) \citep{dinh2014nice}, real non-volume preserving flow (RealNVP) \citep{45819}, and masked autoregressive flow (MAF) \citep{papamakarios2017masked}. Recent work on neural ordinary differential equations \citep{chen2018neural} extends discrete flow models to continuous flows. Normalizing flows have been widely applied in machine learning applications such as image generation \citep{ho2019flow++}, noise modelling \citep{abdelhamed2019noise}, and video generation \citep{kumar2019videoflow}. Besides Boltzmann generators, normalizing flows have also received great attention in physics \citep{kohler2019equivariant,kanwar2020equivariant,wirnsberger2020targeted,wong2020gravitational,wu2020stochastic}.


\section{Model}
Consider a system in the NVT ensemble where the coordinates of $D$ atoms are given by $\boldsymbol{r} = (\boldsymbol{r}_1,\boldsymbol{r}_2,...,\boldsymbol{r}_{3D})\in \RR^{3D}.$ The potential energy of the system is denoted by $V(\br).$ It is known that $\br$ follows a Boltzmann distribution:
$$p(\boldsymbol{r}) = \frac{1}{Z}\exp(-\beta V(\boldsymbol{r})),$$
where $Z =\int_{\RR^{3D}}\exp(-\beta V(\boldsymbol{r})) d\boldsymbol{r}$ is the partition function and $\beta = \frac{1}{\kappa_\beta T}$ is the inverse temperature. Here, $\kappa_\beta$ is the Boltzmann constant and $T$ is the temperature.

Suppose the system has two metastable states $A$ and $B$, which, for instance, may represent the reactant and product states of a reaction. Using MD simulations started from $A$ and $B$, the data $\{\boldsymbol{r}_A^i\}_{i=1}^n$ and $\{\boldsymbol{r}_B^i\}_{i=1}^n$ can be sampled. However, the transition between these two states can hardly be observed without an enhanced sampling technique, because of the high energy barrier present in the potential energy landscape. In addition, long simulations are typically required to obtain statistically independent samples for both metastable states. 

This section describes the PathFlow model, which avoids the aforementioned challenges and generates independent samples of the metastable states as well as the transition path. To achieve these two goals, the model is trained in two modes: {\it learning on examples} and {\it learning on paths}.
\subsection{Learning on Examples}
Given a target random variable $X$ with probability density $p_X$, normalizing flows (NFs) aim to find a learnable and invertible function $F_\theta: \RR^d\to\RR^d$, usually represented by a neural network with parameter $\theta$, that transforms a base random variable $Z$ with density $p_Z$ into the target $X$, i.e., $X = F_\theta(Z)$ and $Z= F_\theta^{-1}(X).$ By the change-of-variables formula, $$p_X(x)=p_Z(F_\theta^{-1}(x))\left|\det(J_{F^{-1}_\theta}(x))\right|,$$ where $J_{F^{-1}_\theta}(x)$ is the Jacobian matrix of $F^{-1}_\theta$ at $x$. Given $n$ realizations $\{x_i\}_{i=1}^n$ of $X,$ NFs can be trained by minimizing the negative log-likelihood:
\begin{align*}
    -\sum_{i=1}^n \log p_X(x_i) &= -\sum_{i=1}^n \Big[\log p_Z(F_\theta^{-1}(x_i)) \\&+ \log\left|\det(J_{F^{-1}_\theta}(x_i))\right|\Big].
\end{align*}
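
As a concrete illustration of the change-of-variables formula (a toy case chosen here for exposition, not a model used in this paper), consider a one-dimensional affine flow $F_\theta(z) = az + b$ with $\theta = (a,b)$, $a\neq 0$, and a standard Gaussian base density $p_Z$. Then $F_\theta^{-1}(x) = (x-b)/a$ and $J_{F_\theta^{-1}}(x) = 1/a$, so that
\begin{align*}
-\log p_X(x) = \frac{(x-b)^2}{2a^2} + \log|a| + \frac{1}{2}\log(2\pi).
\end{align*}
Minimizing the negative log-likelihood over $(a,b)$ recovers the sample mean $b = \frac{1}{n}\sum_{i} x_i$ and the sample variance $a^2 = \frac{1}{n}\sum_{i}(x_i-b)^2$. The log-determinant term $\log|a|$ is what prevents the degenerate solution $|a|\rightarrow\infty$, which would map every sample onto the mode of $p_Z$.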


The base distribution $Z$ is usually chosen as a unimodal Gaussian distribution or a uniform distribution. However, \cite{cornish2020relaxing} point out that NFs can hardly map a unimodal base distribution to a multimodal distribution such as the Boltzmann distribution considered in this paper. To overcome this issue, we opt to use two separate base distributions $Z_A$ and $Z_B$ for states $A$ and $B,$ respectively. Unlike Boltzmann generators, which use two mappings for two disconnected states, we transform the two base distributions using the same mapping $F_\theta.$ We expect that:
$$Z_A= F^{-1}_\theta(\boldsymbol{r}_A)~~~\text{and}~~~Z_B= F^{-1}_\theta(\boldsymbol{r}_B).$$

The negative log-likelihood loss to find the best parameter $\theta$ can then be written as:
\begin{align}\label{loss_NF}
&L_{\mathrm{NF}}(\theta; w_A, w_B) \nonumber\\&= w_A L_{\mathrm{NF}}^A(\theta) + w_B L_{\mathrm{NF}}^B (\theta)\nonumber\\
& =  -w_A\sum_{i=1}^n \left[\log p_{Z_A}(F_\theta^{-1}(\boldsymbol{r}_A^i)) + \log\left|\det(J_{F^{-1}_\theta}(\boldsymbol{r}_A^i))\right|\right]\nonumber\\
&~~-w_B\sum_{i=1}^n \left[\log p_{Z_B}(F_\theta^{-1}(\boldsymbol{r}_B^i)) + \log\left|\det(J_{F^{-1}_\theta}(\boldsymbol{r}_B^i))\right|\right],
\end{align}
where $(w_A, w_B)$ are the weights of the two states.

%\noindent $\bullet$  \textbf{Selection of Base Distribution}

\subsection{Learning on Paths} 
To enable PathFlow to find physically meaningful transition pathways, we introduce physical constraints to the model training. Here, we are especially interested in finding the { minimum energy path}  or the {minimum free energy path}.


 \subsubsection{Minimum Energy Path (MEP)}
An MEP is a path that connects two minima of $V(\boldsymbol r)$ via a saddle point and corresponds to the steepest descent path on $V(\boldsymbol r)$ from this saddle point. More specifically, each point on the MEP is a local potential energy minimum on the hyperplane perpendicular to the path. This implies that the force $-\nabla V$ must be everywhere tangent to the MEP. Denote the MEP by a curve $\boldsymbol r(\alpha),$ where $\alpha \in [0,1]$ is a parametrization of the path. We then have, for all $\alpha \in [0,1],$
\begin{align}\label{MEP_parallel}
\nabla V(\boldsymbol r(\alpha)) \text{ is parallel to } \frac{d \boldsymbol r(\alpha)}{d\alpha},
\end{align}
or equivalently,
\begin{align}\label{tangent}
\nabla V(\boldsymbol r(\alpha)) - (\nabla V(\boldsymbol r(\alpha)) \cdot \hat t)\hat t = 0,
\end{align}
where $\hat t$ is the unit tangent vector along the path at $\boldsymbol r(\alpha).$ Eq. \eqref{tangent} is not a numerically efficient way to assess a path, because computing the tangent vector is expensive. Instead, \cite{olender1997yet} prove that finding the MEP is equivalent to solving the following variational optimization problem:
\begin{align}\label{opt_variational}
P_{\text{MEP}} =\underset{P: A\rightarrow B}{\mathrm{argmin}} \int_P\|\nabla V\|_2|dl|,
\end{align}
for a gradient system
$$\frac{d\boldsymbol{r}}{d\alpha} = -\nabla V(\boldsymbol{r}(\alpha)).$$
Suppose the path is divided into $S$ segments $\{l_{i}\}_{i=1}^S$ with arc lengths $\{|dl_{i}|\}_{i=1}^S.$ Let $-\nabla V_{i}$ be the force at the starting point of the $i$-th path segment. Discretizing the objective in Eq. \eqref{opt_variational} gives a natural loss function for measuring the quality of a candidate path $P$:
\begin{align}\label{loss_path_potential}
L_{\text{MEP}}(P) = \sum_{i=1}^S \|\nabla V_{i}\|_2|dl_{i}|.
\end{align}
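
The link between the tangency condition and the objective in Eq. \eqref{opt_variational} can be made explicit with a short argument, sketched here under the assumption that the path $\boldsymbol r(\alpha)$ is piecewise smooth. By the Cauchy--Schwarz inequality and $\|\hat t\|_2 = 1$,
\begin{align*}
\int_P \|\nabla V\|_2 |dl| \;\geq\; \int_P |\nabla V\cdot \hat t|\,|dl| \;=\; \int_0^1 \left|\frac{d V(\boldsymbol r(\alpha))}{d\alpha}\right| d\alpha,
\end{align*}
with equality if and only if $\nabla V$ is parallel to $\hat t$ almost everywhere along the path, which is the condition in Eq. \eqref{MEP_parallel}. The right-hand side is the total variation of $V$ along the path, which is at least $2\max_\alpha V(\boldsymbol r(\alpha)) - V(A) - V(B)$, so paths that climb higher than necessary are penalized. When evaluating Eq. \eqref{loss_path_potential} on points $\boldsymbol r_0,\ldots,\boldsymbol r_S$ along the path, one simple choice (an assumption made here for illustration) is $\nabla V_i = \nabla V(\boldsymbol r_{i-1})$ and $|dl_i| = \|\boldsymbol r_i - \boldsymbol r_{i-1}\|_2$.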


\subsubsection{Minimum Free Energy Path (MFEP)}
Finding the MEP is difficult because of the extremely high dimensionality of the system and the non-smoothness of the potential energy landscape. This difficulty can be reduced by introducing collective variables (CVs) and mapping the MEP to the CV space (denoted as $\mathcal{X}$). Given $N$ predefined CVs denoted by $\bx (\br) = (x_1(\br),...,x_N(\br))$, the free energy associated with $\bx(\br)$ is defined as follows:
\begin{align}\label{free_energy}
U(\bz) &= -\beta^{-1}\ln\Big(Z^{-1}\int_{\RR^{3D}}e^{-\beta V(\br)}\nonumber\\
& \times \prod_{i=1}^N\delta(x_i(\br)-z_i)d\br\Big), \forall \bz\in \mathcal{X},
\end{align}
where $\delta$ is the Dirac delta function. On the free energy surface, the path of our interest is the minimum free energy path (MFEP). Letting $\bz(\alpha) = \bx(\br(\alpha)),$  \cite{maragliano2006string} show that MFEP $\bz(\alpha)$ must satisfy
\begin{align}\label{MFEP_parallel}
\frac{d\bz(\alpha)}{d\alpha} \text{ is parallel to } M(\bz(\alpha)) \nabla_{\bz} U(\bz(\alpha)) ,
\end{align}
where 
\begin{align}\label{M_matrix}
M_{ij}(\bz) = Z^{-1} e^{\beta U(\bz)}&\int_{\RR^{3D}} \sum_k \frac{\partial x_i(\br)}{\partial r_k}\frac{\partial x_j(\br)}{\partial r_k}\nonumber\\ &e^{-\beta V(\br)}\prod_{l=1}^N \delta(z_l-x_l(\br)) d\br.
\end{align}

\cite{maragliano2006string} also prove that the MFEP is the most likely path of transitions between $A$ and $B$. Hence, it can greatly help us understand the underlying physical mechanism of the transition. Similar to Eq. \eqref{loss_path_potential}, the following loss can be used to measure the quality of a candidate path on the free energy surface:
\begin{align}\label{loss_path_free}
L_{\text{MFEP}}(P) = \sum_{i=1}^S \|M_i\nabla U_{i}\|_2|dl_{i}|,
\end{align}
where $P$ is a candidate path in $\mathcal{X}$ connecting $A$ and $B$, and $M_i$ and $\nabla U_i$ are evaluated at the starting point of the $i$-th path segment.


\begin{remark}\label{remark_equivalent}
The minimum energy path can be viewed as a special case of the minimum free energy path. Specifically, if we choose $\bx(\br) =\br $, i.e., an identity mapping, the free energy $U$ equals the potential energy $V$ up to an additive constant, and the transition matrix defined in Eq. \eqref{M_matrix} reduces to the identity matrix. Therefore, Eq. \eqref{loss_path_free} is the same as Eq. \eqref{loss_path_potential}.

\end{remark}
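
The claim in Remark \ref{remark_equivalent} can be checked directly from the definitions; the short computation below is included for completeness. With $\bx(\br) = \br$ we have $N = 3D$ and $\partial x_i/\partial r_k = \delta_{ik}$ (here $\delta_{ik}$ with two indices denotes the Kronecker delta, not the Dirac delta), so $\sum_k \frac{\partial x_i}{\partial r_k}\frac{\partial x_j}{\partial r_k} = \delta_{ij}$. The Dirac delta constraints in Eq. \eqref{free_energy} pin $\br = \bz$, and applying the same constraints in Eq. \eqref{M_matrix} gives
\begin{align*}
U(\bz) &= -\beta^{-1}\ln\left(Z^{-1}e^{-\beta V(\bz)}\right) = V(\bz) + \beta^{-1}\ln Z,\\
M_{ij}(\bz) &= Z^{-1}e^{\beta U(\bz)}\,\delta_{ij}\,e^{-\beta V(\bz)} = \delta_{ij}.
\end{align*}
Hence $\nabla U = \nabla V$ and $M$ is the identity, so $\|M_i\nabla U_i\|_2 = \|\nabla V_i\|_2$ term by term. The constant $\beta^{-1}\ln Z$ does not affect the loss because only the gradient of $U$ enters.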
\subsection{Total Loss Design}

We seek an invertible mapping $F_\theta$ that 1) maps the base distribution to the target Boltzmann distribution and 2) maps a base path in the latent space to a transition path in the configuration or CV space. Given a base path $P_\text{base}$, the path after mapping is denoted as $F_\theta(P_\text{base}).$ We can then use Eqs. \eqref{loss_NF}, \eqref{loss_path_potential} and \eqref{loss_path_free} to measure how well the parameter $\theta$ achieves these two goals. Denote the path loss as
\begin{equation}\label{loss_path}
   L_{\text{path}} (\theta) = 
\begin{cases}
 L_{\text{MEP}}(F_\theta(P_\text{base})), \text{or}\\
 L_{\text{MFEP}}(F_\theta(P_\text{base})).
\end{cases}
\end{equation}

Combining the path loss Eq. \eqref{loss_path} with NF loss  Eq. \eqref{loss_NF}, we obtain the loss to train PathFlow:
\begin{align}
L(\theta) = w_{\text{NF}} L_{\text{NF}}(\theta) + w_{\text{path}} L_{\text{path}}(\theta),
\end{align} 
where $w_{\text{NF}}$ and $ w_{\text{path}}$ are two hyperparameters that control the weights of the two losses. 
To ease the training of our model, the base path $P_\text{base}$ and the base distributions $Z_A, Z_B$ should be carefully selected. First, the end-points $A$ and $B$ of the transition path need to be determined. In some applications, such as the study of chemical reactions, the start and end of the transition path are already known. In other cases, $A$ and $B$ can be set based on the simulation data. For example, the end-points can be chosen as 
\begin{equation}
\boldsymbol \mu_A,\boldsymbol \mu_B =\begin{cases}
\frac{1}{n}\sum_i\br_A^i, \frac{1}{n}\sum_i\br_B^i,\text{ if in configuration space;}\\
 \frac{1}{n}\sum_i\bx(\br_A^i),  \frac{1}{n}\sum_i\bx(\br_B^i),  \text{ if in CV space,}
\end{cases}
\end{equation}
 the means of the samples from states $A$ and $B$, respectively. Since the transition path must have $A$ and $B$ as its end-points, we require the base path to start from $F_\theta^{-1}(A)$ and end at $F_\theta^{-1}(B).$ A natural choice for the whole base path is the linearly interpolated path between these two points, i.e.,
$$P_\text{base}(\alpha) = (1-\alpha)  F_\theta^{-1}(A) + \alpha  F_\theta^{-1}(B). $$
In the sampling space, the simulation data should center around the points with minimal potential or free energy, which at the same time are the end-points of the transition path.  Therefore, we prefer to set 
\begin{align*}
    &Z_A\sim \mathrm{Gaussian}(F_\theta^{-1}(A), \sigma \boldsymbol  I),\\
    &Z_B\sim \mathrm{Gaussian}(F_\theta^{-1}(B),\sigma  \boldsymbol  I).
\end{align*}
Here, $\sigma$ controls the concentration of the base distribution. When $\sigma$ is small, most of the linearly interpolated path lies outside the region where $Z_A$ and $Z_B$ concentrate, so the model can focus on learning the path there. On the other hand, co-training of the path and the generator may be difficult around $A$ and $B$, since $F_\theta$ has to minimize two losses at the same time. However, since we are most interested in the transition process that happens around the high energy barrier, we can avoid this conflict by reducing the weight of path samples near the end-points. 
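
To make the path loss concrete in the configuration-space (MEP) case, one possible discretization evaluates the mapped path at $\alpha_i = i/S$, $i = 0,\ldots,S$. This is only a sketch: the uniform grid $\alpha_i$, the chord-length approximation of the arc length, and the weights $c_i$ are illustrative choices that are not specified above.
\begin{align*}
\br_i &= F_\theta\big((1-\alpha_i)F_\theta^{-1}(A) + \alpha_i F_\theta^{-1}(B)\big),\\
L_{\text{path}}(\theta) &\approx \sum_{i=1}^S c_i\, \|\nabla V(\br_{i-1})\|_2\, \|\br_i - \br_{i-1}\|_2.
\end{align*}
By construction, $\br_0 = A$ and $\br_S = B$ for every $\theta$, so the end-point requirement holds automatically. Setting $c_i = 1$ recovers Eq. \eqref{loss_path_potential}, while choosing smaller $c_i$ for $\alpha_i$ close to $0$ or $1$ is one way to reduce the weight of path samples near the end-points.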

\subsection{Gradient-Based Training}
A gradient-descent-type algorithm is applied to update the model parameter $\theta$ to minimize $L(\theta).$ Notice that
$$\nabla_\theta L(\theta) = w_{\text{NF}} \nabla_\theta  L_{\text{NF}}(\theta) + w_{\text{path}} \nabla_\theta  L_{\text{path}}(\theta).$$
The gradient $\nabla_\theta L_{\text{NF}}(\theta)$ can be calculated by backpropagation, which is already implemented in popular deep learning frameworks such as TensorFlow \citep{45381} and PyTorch \citep{10.5555/3454287.3455008}. The gradient $ \nabla_\theta L_{\text{path}}(\theta) $, however, involves the potential mean force, the transition matrix and their gradients, and therefore cannot be obtained by automatic differentiation alone. In the next section, we provide efficient estimators of all the physical quantities appearing in $ \nabla_\theta L_{\text{path}}(\theta) $ using restraint dynamics.

\section{Gradient Estimator by Restraint Dynamics}\label{sec:estimator}
In this section, we provide an estimator of the gradient $ \nabla_\theta L_{\text{path}}(\theta) $ to facilitate the gradient-based training of our model. Since the MEP can be viewed as a special case of the MFEP, as mentioned in Remark \ref{remark_equivalent}, we only consider the case where $ L_{\text{path}}= L_{\text{MFEP}}.$ Suppose that, under parameter $\theta$, the candidate path is $P(\theta)$ with arc length $l(P(\theta)).$ We uniformly divide $P(\theta)$ into $S$ segments, each with arc length $|dl_i| = l(P(\theta))/S$ in Eq. \eqref{loss_path_free}. We then have:
$$L_{\text{MFEP}}(P) = l(P(\theta))\frac{1}{S} \sum_{i=1}^S \|M_i\nabla U_{i}\|_2,$$
as well as the gradient:
\begin{align}\label{path_grad}
\nabla_\theta L_{\text{path}}(\theta) =&  \nabla_\theta l(P(\theta))  \frac{1}{S}\sum_{i=1}^S \|M_i\nabla U_{i}\|_2\nonumber\\ &~~+ l(P(\theta))\frac{1}{S} \sum_{i=1}^S \nabla_\theta \|M_i\nabla U_{i}\|_2.\end{align}
The calculation of $\|M_i\nabla U_{i}\|_2$ and its gradient $\nabla_\theta \|M_i\nabla U_{i}\|_2$ can be done by regular backpropagation only if the free energy surface (FES) $U$ or an analytical approximation of it is known. In practice, however, establishing the FES requires a large number of simulations, which can be even harder than finding the path. Therefore, it is more efficient to estimate these values on the fly at the given sample points on $P(\theta)$. We adopt the approach of restrained dynamics \citep{maragliano2006string,maragliano2006temperature}. For a given point $\bz=(z_1,...,z_N)$ in the CV space, this method adds a harmonic restraint to the potential of the system, which acts as a set of springs tying the CVs $\bx(\br)$ to the target point $\bz$:
\begin{align}\label{extended_potential}
V_{k}(\br;\bz)= V(\br) + \frac{k}{2}\sum_{i=1}^N (x_i(\br)-z_i)^2,
\end{align}
where $k$ is a parameter that controls the strength of the restraint. The motion of the particles under this extended potential can then be characterized by the overdamped Langevin dynamics:
\begin{align}\label{dynamics_extended_potential}
\dot{\br}(t) = - \nabla V_{k}(\br(t); \bz)+ \sqrt{2\kappa_\beta T}\, \eta(t),
\end{align}
where $\eta(t)$ is a white Gaussian noise with unit variance. 
It can be shown that Eq. \eqref{dynamics_extended_potential} has the following Boltzmann-Gibbs density as its stationary distribution:
$$p_{k}(\br;\bz) = \frac{1}{Z_{k}(\bz)}\exp(-\beta V_{k}(\br;\bz)),$$
where $Z_{k}(\bz) = \int \exp(-\beta V_{k}(\br; \bz))d\br.$  
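
In practice, Eq. \eqref{dynamics_extended_potential} has to be integrated numerically. A minimal sketch, assuming the standard Euler--Maruyama scheme with a step size $\Delta t$ (both are assumptions made here for illustration; no particular integrator is prescribed above), reads
\begin{align*}
\br_{n+1} &= \br_n - \Delta t\,\nabla_{\br} V_k(\br_n;\bz) + \sqrt{2\kappa_\beta T\,\Delta t}\;\boldsymbol{\xi}_n, \qquad \boldsymbol{\xi}_n\sim\mathcal{N}(0, I),\\
\nabla_{\br} V_k(\br;\bz) &= \nabla V(\br) + k\sum_{i=1}^N (x_i(\br)-z_i)\,\nabla x_i(\br).
\end{align*}
The second line follows from differentiating Eq. \eqref{extended_potential} with respect to $\br$. A larger $k$ makes the restraint stiffer and typically calls for a smaller $\Delta t$, and the discretized chain samples $p_k(\br;\bz)$ only approximately, with a bias that vanishes as $\Delta t\rightarrow 0$.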

{\bf Estimation of $\|M \nabla U\|_2$.} 
Define the effective free energy corresponding to $V_{k}(\br;\bz)$ as 
$$U^{(k)}(\bz) = -\beta^{-1} \ln\left(Z^{-1}\int_{\RR^{3D}}\exp(-\beta V_k(\br;\bz))d\br\right).$$
\cite{maragliano2006string} prove that the gradient of $U^{(k)}$ converges to that of $U$ as $k$ grows, 
$$\lim_{k\rightarrow\infty}\nabla U^{(k)}(\bz) = \nabla U(\bz),$$
where $$\nabla_i U^{(k)} = \int_{\RR^{3D}} k (z_i- x_i(\br)) p_k(\br;\bz)d\br,~~ \forall i\leq N.$$
If we further assume the ergodicity of dynamics Eq. \eqref{dynamics_extended_potential},  we can obtain an estimator of the potential mean force:
\begin{align}\label{pmf_estimator}
\nabla_i U^{(T,k)}(\bz) = \frac{k}{T}\int_0^T (z_i - x_i(\br(t)))dt.
\end{align}
 Similar analysis can be done on $M,$ and an estimator of $M_{ij}(\bz)$ can be derived from Eq. \eqref{M_matrix} as follows:
 \begin{align}\label{M_estimator}
M^{(T,k)}_{ij}(\bz) &=  \frac{1}{T}\int_0^T \sum_{k} \frac{\partial x_i(\br(t))}{\partial r_k}\frac{\partial x_j(\br(t))}{\partial r_k} dt.\end{align}
Combining Eqs. \eqref{pmf_estimator} and \eqref{M_estimator}, we obtain an estimate of $\|M\nabla U\|_2$ as $\|M^{(T,k)}\nabla U^{(T,k)}\|_2$.
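
When the trajectory of Eq. \eqref{dynamics_extended_potential} is recorded at $n$ equally spaced times $t_1,\ldots,t_n$, the time integrals in Eqs. \eqref{pmf_estimator} and \eqref{M_estimator} become sample averages. The following sketch assumes such a uniform sampling, which is an illustrative choice rather than a prescription of this paper; the index $l$ runs over the $3D$ coordinates:
\begin{align*}
\nabla_i U^{(T,k)}(\bz) &\approx \frac{k}{n}\sum_{m=1}^n \big(z_i - x_i(\br(t_m))\big),\\
M^{(T,k)}_{ij}(\bz) &\approx \frac{1}{n}\sum_{m=1}^n\sum_{l=1}^{3D} \frac{\partial x_i(\br(t_m))}{\partial r_l}\frac{\partial x_j(\br(t_m))}{\partial r_l}.
\end{align*}
Both averages use the same samples, so a single restrained simulation per path point yields both quantities. As a sign check, if the restrained samples sit on average at $x_i(\br) < z_i$, the first estimate is positive, indicating that the free energy increases in the $z_i$ direction at $\bz$.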

{\bf Estimation of $\nabla_{\theta} \|M \nabla U\|_2$.} To estimate $\nabla_{\theta} \|M \nabla U\|_2$ in the second term of Eq. \eqref{path_grad}, one naive approach is the finite difference method, which requires at least $O(N)$ simulation trials under Eq. \eqref{dynamics_extended_potential} and may be computationally challenging in practice. Instead, we propose a new estimator that can be obtained simultaneously with Eqs. \eqref{pmf_estimator} and \eqref{M_estimator}. Specifically, we rewrite the gradient of $\|M\nabla U\|_2$ as follows:
$$\nabla_\theta \|M\nabla U\|_2=\frac{J_\theta(  M\nabla U)^\top  M\nabla U}{\|M\nabla U\|_2},$$ where $J_\theta(\cdot)$ is the Jacobian matrix of a given function with respect to $\theta$.
Since $M\nabla U$ depends on $\theta$ only through the sample point $\bz$ on $P(\theta)$, the chain rule gives $J_\theta(M\nabla U) = J(M\nabla U)\,\partial \bz/\partial\theta$, where $\partial \bz/\partial\theta$ can be obtained by backpropagation through $F_\theta$ and $J(\cdot)$ denotes the Jacobian with respect to $\bz$. The latter can be further decomposed as
\begin{align}\label{jacobian}
    J(M\nabla U)&= \nabla M \nabla U  + M\nabla^2 U,
\end{align}
where $ \nabla M \nabla U = [\nabla_{z_1} M\nabla U ,\ldots,  \nabla_{z_N} M\nabla U].$
Recall that the estimators in Eqs. \eqref{pmf_estimator} and \eqref{M_estimator} can both be viewed as time-average estimates of the expectation of a function $f(\br,\bz)$ over the distribution $p_k(\br;\bz),$ i.e., $\int_{\RR^{3D}}f(\br,\bz) p_k(\br;\bz)d\br.$ Specifically, for $M_{ij},$ $f(\br,\bz)$ is taken as $\sum_{m} \frac{\partial x_i(\br)}{\partial r_m}\frac{\partial x_j(\br)}{\partial r_m}$ and for $\nabla_i U(\bz),$ $f(\br,\bz)$ is taken as $k (z_i- x_i(\br)).$ For this expectation, \cite{maragliano2006string} have proved that
\begin{align*}
 &\lim_{k\rightarrow \infty} \int_{\RR^{3D}}f(\br,\bz)p_k(\br;\bz)d\br  \\=&  Z^{-1}e^{\beta U(\bz)} \int_{\RR^{3D}}f(\br,\bz) e^{-\beta V(\br)} \prod_{i=1}^N \delta(z_i-x_i(\br))d\br.
\end{align*}
Under regularity conditions that allow us to interchange the derivative with the limit and with the integral, we obtain the following equation:
\begin{align*}
     &\lim_{k\rightarrow \infty} \int_{\RR^{3D}}\frac{\partial f(\br,\bz)p_k(\br;\bz)}{\partial z_l}d\br \nonumber\\
    =&\lim_{k\rightarrow \infty} \frac{\partial \int_{\RR^{3D}}f(\br,\bz)p_k(\br;\bz)d\br }{\partial z_l} \nonumber\\
    =&  \frac{\partial Z^{-1}e^{\beta U(\bz)} \int_{\RR^{3D}}f(\br,\bz) e^{-\beta V(\br)} \prod_{i=1}^N \delta(z_i-x_i(\br))d\br}{\partial z_l}.
\end{align*}

Estimating $\frac{\partial M}{\partial z_l}$ and $\nabla^2 U$ both reduce to the same problem: using the simulation trajectory $\br(t)$ to estimate $\int_{\RR^{3D}}\frac{\partial f(\br,\bz)p_k(\br;\bz)}{\partial z_l}d\br .$ Applying the product rule and differentiating $p_k$ with respect to $z_l$, we have
\begin{align*}
   &\int_{\RR^{3D}}\frac{\partial f(\br,\bz)p_k(\br;\bz)}{\partial z_l}d\br= \int_{\RR^{3D}}\frac{\partial f(\br,\bz)}{\partial z_l}p_k(\br;\bz)d\br \,\\
   &+ \int_{\RR^{3D}}f(\br,\bz)\beta k(x_l(\br)-z_l)p_k(\br;\bz)d\br\\ &- \int_{\RR^{3D}}f(\br,\bz)p_k(\br;\bz)d\br \int_{\RR^{3D}}\beta k(x_l(\br)-z_l)p_k(\br;\bz)d\br.
\end{align*}
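To see where the last two terms come from, the following sketch assumes that the restrained potential has the harmonic form $V_k(\br;\bz) = V(\br) + \frac{k}{2}\sum_{i=1}^N (x_i(\br)-z_i)^2$, which is consistent with the expression for $\nabla_i U^{(k)}$ given earlier. Under this assumption,
\begin{align*}
\frac{\partial}{\partial z_l}\exp(-\beta V_k(\br;\bz)) &= \beta k (x_l(\br)-z_l)\exp(-\beta V_k(\br;\bz)),\\
\frac{\partial \ln Z_k(\bz)}{\partial z_l} &= \int_{\RR^{3D}}\beta k (x_l(\br)-z_l)p_k(\br;\bz)d\br,
\end{align*}
so that
\begin{align*}
\frac{\partial p_k(\br;\bz)}{\partial z_l} = p_k(\br;\bz)\left[\beta k (x_l(\br)-z_l) - \frac{\partial \ln Z_k(\bz)}{\partial z_l}\right].
\end{align*}
Substituting this into $\int_{\RR^{3D}}\left(\frac{\partial f}{\partial z_l}p_k + f\frac{\partial p_k}{\partial z_l}\right)d\br$ recovers the three terms of the identity.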
All terms are expectations under the density $p_k(\br;\bz).$ Therefore, under ergodicity, we can use time averages to construct the estimator:
\begin{align}
   &\int_{\RR^{3D}}\frac{\partial f(\br,\bz)p_k(\br;\bz)}{\partial z_l}d\br \approx \frac{1}{T}\int_0^T\frac{\partial f(\br(t),\bz)}{\partial z_l}dt \nonumber\\
   &+ \frac{1}{T}\int_0^T f(\br(t),\bz)\beta k(x_l(\br(t))-z_l)dt \nonumber\\ 
   &- \frac{1}{T}\int_0^T f(\br(t),\bz)dt \, \frac{1}{T}\int_0^T\beta k(x_l(\br(t))-z_l)dt \nonumber\\
   &\triangleq \mathcal{F}_l(f(\br,\bz),T,k).
\end{align}
Substituting $\sum_{m} \frac{\partial x_i(\br)}{\partial r_m}\frac{\partial x_j(\br)}{\partial r_m}$ or $k (z_j- x_j(\br))$ for $f(\br, \bz)$, we get the estimators of $\nabla_l M_{ij}(\bz)$ and $\nabla^2_{ij} U(\bz)$:
\begin{align}\label{second_order_estimator}
   \nabla_l M^{(T,k)}_{ij}(\bz) &= \mathcal{F}_l\left(\sum_{m} \frac{\partial x_i(\br)}{\partial r_m}\frac{\partial x_j(\br)}{\partial r_m},T,k\right),\nonumber\\
   \nabla^2_{ij} U^{(T,k)}(\bz) & = \mathcal{F}_i\left(k (z_j- x_j(\br)),T,k\right).
\end{align}
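To make the structure of these estimators concrete, the following sketch evaluates $\mathcal{F}$ for the two choices of $f$. The shorthand $\langle h\rangle_T = \frac{1}{T}\int_0^T h(\br(t))dt$, the sample covariance $\operatorname{Cov}_T(a,b) = \langle ab\rangle_T - \langle a\rangle_T\langle b\rangle_T$, and $g_{ij}(\br)$ for the integrand of Eq. \eqref{M_estimator} are notation introduced here for illustration. Since $g_{ij}$ does not depend on $\bz$ and $\partial [k(z_j - x_j(\br))]/\partial z_i = k\delta_{ij}$ with $\delta_{ij}$ the Kronecker delta, the terms proportional to $z_l$ cancel and
\begin{align*}
\nabla_l M^{(T,k)}_{ij}(\bz) &= \beta k \operatorname{Cov}_T(g_{ij}, x_l),\\
\nabla^2_{ij} U^{(T,k)}(\bz) &= k\delta_{ij} - \beta k^2 \operatorname{Cov}_T(x_i, x_j).
\end{align*}
The factor $k^2$ in front of a sampled covariance suggests why the second-order estimator is the most sensitive to sampling noise when $k$ is large.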

Using Eqs. \eqref{pmf_estimator}, \eqref{M_estimator} and \eqref{second_order_estimator}, we obtain an approximation of the Jacobian matrix in Eq. \eqref{jacobian}. We are then ready to use a gradient-based algorithm to find the $\theta$ that optimizes $L(\theta).$ 

\subsection{Estimation Error}
The following theorem bounds the estimation error of the estimators in Eqs. \eqref{pmf_estimator}, \eqref{M_estimator} and \eqref{second_order_estimator}.
\begin{theorem}\label{thm:rate}
Suppose the dynamics in Eq. \eqref{dynamics_extended_potential} are ergodic. Then for all $i,j,l\leq N$ and $\bz \in \mathcal{X},$ the estimation errors of $M^{(T,k)}_{ij}(\bz),\nabla_i U^{(T,k)}(\bz),\nabla_l M^{(T,k)}_{ij}(\bz),\nabla^2_{ij} U^{(T,k)}(\bz)$ satisfy
\begin{align*}
    &|M^{(T,k)}_{ij}(\bz) - M_{ij}(\bz)| \leq O\left(\frac{1}{k}\right) + O\left(\frac{1}{\sqrt{T}}\right), \\
    &|\nabla_i U^{(T,k)}(\bz) - \nabla_i U(\bz)| \leq O\left(\frac{1}{k}\right) + O\left(\frac{k}{\sqrt{T}}\right), \\
    &|\nabla_l M^{(T,k)}_{ij}(\bz) - \nabla_l M_{ij}(\bz)| \leq O\left(\frac{1}{k}\right) + O\left(\frac{k}{\sqrt{T}}\right), \\
    &|\nabla^2_{ij} U^{(T,k)}(\bz) - \nabla^2_{ij} U(\bz)| \leq O\left(\frac{1}{k}\right) + O\left(\frac{k^2}{\sqrt{T}}\right).
\end{align*}
\end{theorem}
The proof of Theorem \ref{thm:rate} can be found in Appendix \ref{appendix:thm_rate}. To achieve an error of order $\epsilon,$ $M^{(T,k)}_{ij}(\bz),$ $\nabla_i U^{(T,k)}(\bz)$ and $\nabla_l M^{(T,k)}_{ij}(\bz)$ require at most $T = O(1/\epsilon^4),$ while $\nabla^2_{ij} U^{(T,k)}(\bz)$ requires $T = O(1/\epsilon^6).$ This is consistent with our empirical observation that using $\nabla^2_{ij} U^{(T,k)}(\bz)$ to estimate $\nabla^2 U$ can be statistically unstable, which leads to high variance in the estimate of the whole Jacobian matrix. 
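
These budgets follow from balancing the two error terms in Theorem \ref{thm:rate}. As a rough sketch that treats the constants hidden in the $O(\cdot)$ notation as fixed, choose $k = \Theta(1/\epsilon)$ so that the $O(1/k)$ term is of order $\epsilon$. The remaining terms then require
\begin{align*}
\frac{k}{\sqrt{T}} \leq \epsilon &\iff T \geq \frac{k^2}{\epsilon^2} = \Theta(\epsilon^{-4}),\\
\frac{k^2}{\sqrt{T}} \leq \epsilon &\iff T \geq \frac{k^4}{\epsilon^2} = \Theta(\epsilon^{-6}).
\end{align*}
For $M^{(T,k)}_{ij}$, the condition $1/\sqrt{T} \leq \epsilon$ needs only $T = \Theta(\epsilon^{-2})$, which lies within the stated $O(1/\epsilon^4)$ budget. The Hessian estimator is therefore the bottleneck of the direct approach.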

To overcome this issue, we propose a method that uses one more simulation trial to avoid estimating $\nabla^2 U$ entrywise. Note that by Eq. \eqref{jacobian}, $\nabla_\theta \|M\nabla U\|_2$ can be decomposed as $$\nabla_\theta \|M\nabla U\|_2=\frac{(\nabla M \nabla U)^\top  M\nabla U}{\|M\nabla U\|_2} + \nabla^2 U \frac{ M ^\top  M\nabla U}{\|M\nabla U\|_2}.$$ The second-order term $\nabla^2 U$ appears in the second term only through a Hessian-vector product, which can be estimated directly with one additional simulation trial, independently of $N$. Specifically, let $v = \frac{ M ^\top  M\nabla U}{\|M\nabla U\|_2}$; then
\begin{align*}
     \nabla^2 U v \approx \frac{\nabla U(\bz+ \delta v) -\nabla U(\bz) }{\delta}.
\end{align*}
Only one extra restraint simulation centered at $\bz+ \delta v$ is required to get this estimate. Moreover, to increase stability, the product can also be estimated by a central difference, at the cost of a second extra simulation centered at $\bz- \delta v$:
\begin{align*}
     \nabla^2 U v \approx \frac{\nabla U(\bz+ \delta v) -\nabla U(\bz- \delta v)  }{2\delta}.
\end{align*}
Using this Hessian-vector product trick, we obtain a new estimate of the second term:
\begin{align*}
 \nabla^2 U \frac{ M ^\top  M\nabla U}{\|M\nabla U\|_2}
 \approx &\frac{\nabla U^{(T,k)}(\bz+ \delta v^{(T,k)}(\bz))}{2\delta}\\&-\frac{\nabla U^{(T,k)}(\bz- \delta v^{(T,k)}(\bz))  }{2\delta},
\end{align*}
where $v^{(T,k)}(\bz) = \frac{ (M^{(T,k)}) ^\top  M^{(T,k)}\nabla U^{(T,k)}(\bz)}{\|M^{(T,k)}\nabla U^{(T,k)}(\bz)\|_2}.$ Empirically, we find that this trick greatly stabilizes the estimation with an acceptable increase in simulation budget. For a more detailed error analysis, please refer to Appendix \ref{appendix:hession}.
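
A short Taylor expansion indicates why the central difference is the more accurate choice. The sketch assumes that $U$ is four times continuously differentiable near $\bz$, an assumption not stated above, and it ignores the sampling error of $\nabla U^{(T,k)}$. Writing $\nabla^3 U(\bz)[v,v]$ for the third-derivative tensor contracted twice with $v$ (notation introduced here), we have
\begin{align*}
\nabla U(\bz \pm \delta v) = \nabla U(\bz) \pm \delta \nabla^2 U(\bz) v + \frac{\delta^2}{2}\nabla^3 U(\bz)[v,v] + O(\delta^3),
\end{align*}
and therefore
\begin{align*}
\frac{\nabla U(\bz+ \delta v) -\nabla U(\bz)}{\delta} &= \nabla^2 U(\bz) v + O(\delta),\\
\frac{\nabla U(\bz+ \delta v) -\nabla U(\bz- \delta v)}{2\delta} &= \nabla^2 U(\bz) v + O(\delta^2),
\end{align*}
because the even-order terms cancel in the central difference. If each mean-force estimate carries an error of size at most $e$, the central difference quotient inherits an error of at most $e/\delta$, so $\delta$ cannot be taken arbitrarily small.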









\begin{figure}[!t]
    \centering
    \includegraphics[width=\linewidth]{figure/muller2.png}
    \caption{Experimental results on the M\"{u}ller potential. PathFlow generates samples that fill the two low-energy regions, and the transition path it finds passes near the transition state. The energy barrier found has an energy of $-38$, which is close to the ground-truth value of $-40.$}
    \label{fig:muller}
\end{figure}
\section{Numerical Example: M\"{u}ller Potential }
We first illustrate PathFlow using a two-dimensional M\"{u}ller potential that has
metastable states separated by high energy barriers. The M\"{u}ller potential has an explicit formulation:
\begin{align}\label{muller_potential}
    V(x,y) = \sum_{k=1}^4 A_k e^{B_k},
\end{align}
where we take
\begin{align*}
&B_k = a_k(x-x_k^0)^2+ b_k(x-x_k^0)(y-y_k^0)+c_k(y-y_k^0)^2.
\end{align*}
Values of all parameters can be found in Appendix \ref{appendix:muller}.
The two metastable states of the M\"{u}ller potential are located around $A = [-0.56,  1.44]$ and $B = [-0.05,  0.47],$ while the transition state is located around  $C = [-0.82,0.62].$ 
For simplicity, we consider finding the minimum energy path (MEP) starting from state $A$ and ending at state $B$. Starting from $A$ and $B$, respectively, we collect 100 data points using Markov chain Monte Carlo for learning from examples. Our normalizing flow is a masked autoregressive flow (MAF) model with 10 autoregressive layers and hidden units of shape $ [256,128, 64]$ with ReLU activation.

Given the explicit formulation of $V(x,y),$ there is no need to estimate the gradient of $L(\theta)$ using the method proposed in Section \ref{sec:estimator}. All the gradients can be obtained automatically by backpropagation as implemented in TensorFlow 2.3. We train the model with the Adam optimizer. As shown in Figure~\ref{fig:muller}, PathFlow can learn the transition path and the sampler of metastable states at the same time. 1) In terms of path finding, PathFlow finds a transition path that passes near the transition state $C.$ The optimal energy barrier has an energy of around $-40$, and the energy barrier we find is around $-38$, which is close to the ground truth. 2) In terms of sample generation, we can successfully generate data points for the metastable states in one shot. 
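
Although backpropagation computes these gradients automatically, the gradient of Eq. \eqref{muller_potential} also has a simple closed form that can serve as a sanity check on an implementation. With $B_k$ as defined above, the chain rule gives
\begin{align*}
\frac{\partial V}{\partial x} &= \sum_{k=1}^4 A_k e^{B_k}\left[2a_k(x-x_k^0) + b_k(y-y_k^0)\right],\\
\frac{\partial V}{\partial y} &= \sum_{k=1}^4 A_k e^{B_k}\left[b_k(x-x_k^0) + 2c_k(y-y_k^0)\right].
\end{align*}
Both components vanish at stationary points of $V$, so they should be close to zero near the metastable states $A$ and $B$ and near the transition state $C$ reported above.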

\begin{figure}
         \centering
        \includegraphics[width=1\linewidth]{figure/path.png}
        \caption{Experimental results on alanine dipeptide in vacuum at room temperature (300 K). The underlying density plot is a kernel density estimate of the Boltzmann distribution generated by metadynamics. Transition pathways found by PathFlow, the string method and NEB overlap in most regions. An energy barrier of about 8.6 kcal/mol lies on all paths.}
     \label{fig::adp_compare}
\end{figure}

\section{Numerical Example: Alanine dipeptide }
In this section, we provide a practical example to illustrate the performance of our proposed models.

We study the isomerization transition and sampling of Alanine dipeptide modeled by the CHARMM27 force field \citep{brooks2009charmm} at 300 K in vacuum. This transition happens between two metastable states named $C_{7eq}$ and $C_{7ax}.$
%\begin{figure}[!t]
  %  \centering
  %  \includegraphics[width=\linewidth]{figure/adp.png}
  %  \caption{Two metastable states of Alanine dipeptide in vacuum under room temperature 300K.}
  %  \label{fig::adp}
%\end{figure}
We choose two torsion angles $\phi(C,N,C_\alpha,C)$ and $\psi(N,C_\alpha,C,N)$ as our CVs for this system, i.e., $\bz = (\phi, \psi).$ All MD simulations are performed with GROMACS 2021 \citep{lindahl_2021_4457626} linked with PLUMED 2.7 \citep{tribello2014plumed}. To generate data in the two metastable states, we run brute-force MD simulations starting from $C_{7eq}$ and $C_{7ax}$ for 100 picoseconds (ps) each. The CV values along the MD trajectories are computed and recorded every 0.2 ps. We randomly select 100 data points for each state to train the sampler. On each candidate path in the CV space, we sample a point every 0.1 arc length. For each sample on the path, we run three restraint simulations with $k= 500$ kJ/mol/rad for 100 ps. The CV values along these trajectories are computed and recorded every 0.01 ps to estimate the mean force, the transformation matrix $M,$ and their derivatives. We choose a masked autoregressive flow with 15 autoregressive layers and hidden units of shape $ [512, 256,128, 64]$ with ELU activation as our normalizing flow model.


 \noindent {\bf Path Finding.} To illustrate the path-finding ability of PathFlow, we compare our model with Nudged Elastic Band (NEB) and the string method with swarms of trajectories. All the methods are implemented with 40 images. The detailed setup of the string method follows Section III.1 of \cite{pan2008finding}.

Figure \ref{fig::adp_compare} plots the transition pathways found by NEB (averaged over iterations 30--40), the string method (averaged over iterations 60--70) and PathFlow. We observe that the transition paths
found by PathFlow, NEB and the string method overlap in most regions. They all pass the same energy barrier, with a free energy difference of 8.6 kcal/mol. The three pathways differ around $C_{7ax},$ which may be caused by the conflict between $L_{\text{NF}}$ and $L_{\text{path}}$ during training. However, the free energy profile of our pathway in Figure \ref{fig::free_energy_profile} closely matches that of the string method in \cite{pan2008finding}. %Second, in terms of generator, our newly generated samples locate in two low energy regions. Since the generator is trained using short simulation data which may not contain all metastable configurations, our generator can not sample these unknown low energy regions. This issue can be easily solved by using data from longer simulation trials. In terms of speed of path finding, PathFlow can find the path in 5 iterations requiring 60 ns restraint simulation, which is significantly faster than the string method that converges in about 60 iterations and takes 84 ns simulation (60 ns restraint simulation, 24 ns free simulation).
\begin{figure}
         \centering
        \includegraphics[width=0.9\linewidth]{figure/free_energy_profile.pdf}
        \caption{Free energy profile of the transition pathway found by PathFlow. The free energy in $C_{7eq}$ ($\alpha=0$) is set to $0$. The configuration plots were made by \cite{cuny2017metadynamics}.}
     \label{fig::free_energy_profile}
\end{figure}

\noindent {\bf Configuration Generation.} We also compare PathFlow with the Boltzmann generator on alanine dipeptide configurations. The Boltzmann generator is trained using a Gaussian base distribution and simulation samples from both states $C_{7eq}$ and $C_{7ax}.$
Because we expect a single Boltzmann generator to be ineffective at sampling separated, disconnected states, we further train two separate Boltzmann generators (BG Separate), one for each state. We test all three models on 100 samples from each state. The test negative log-likelihood is listed in Table~\ref{tab1}.
\begin{table}
\begin{center}
\begin{tabular}{|l|l|l|l|}
\hline
            & $C_{7eq}$ & $C_{7ax}$ & Average \\ \hline
Boltzmann   & -0.3889   & 1.689     & 0.6498  \\ \hline
PathFlow    & -1.005    & 0.1581    & -0.4235 \\ \hline
BG Separate & -1.097    & 0.03027   & -0.5333 \\ \hline
\end{tabular}

\caption{Test negative log-likelihood of the Boltzmann generator, PathFlow and BG Separate on each metastable state (lower is better).}
\label{tab1}
\end{center}

\end{table}
%As a comparison, we also implement the string method with swarms of trajectory using a string with 40 images. Other setting-up follows that in  \cite{pan2008finding} Section III.1.

We observe that BG Separate performs well on both states, whereas the single Boltzmann generator achieves the worst test loss among all models. This confirms that the Boltzmann generator is not effective at sampling multi-modal distributions with two metastable states, which is widely known as a major challenge for generative models. By introducing two base distributions, however, PathFlow significantly outperforms the Boltzmann generator in sampling multi-modal distributions. PathFlow obtains a test loss close to that of BG Separate while using only half the model size.



\section{Conclusion and Perspective}
In summary, PathFlow is a promising tool for generating Boltzmann samples and discovering transition paths that describe transition mechanisms. Unlike existing path-finding algorithms (e.g., NEB \citep{jonsson1998nudged} and the string method \citep{weinan2002string}), PathFlow is trained by standard gradient-based optimizers together with the efficient gradient estimator developed in Section \ref{sec:estimator}. The estimator is of independent interest and has the potential to be employed by other machine-learning-based path-finding algorithms. In particular, we find empirically that gradient-based training leads to faster path finding with fewer simulation trials. In addition, PathFlow can be viewed as a successful application of multitask learning to physics, and we expect more multitask learning techniques to demonstrate their power in scientific research. Future research directions include applying normalizing flows or other machine-learning-based methods to transition tube \citep{vanden2006transition} sampling and to CV discovery.

%\textcolor{red}{Talk about future work if page limit permits? (1) Transition tube sampling. (2) Machine learning CV.}

\bibliographystyle{plainnat}
\bibliography{example_paper}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% APPENDIX
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


\end{document}


% This document was modified from the file originally made available by
% Pat Langley and Andrea Danyluk for ICML-2K. This version was created
% by Iain Murray in 2018, and modified by Alexandre Bouchard in
% 2019 and 2021 and by Csaba Szepesvari, Gang Niu and Sivan Sabato in 2022. 
% Previous contributors include Dan Roy, Lise Getoor and Tobias
% Scheffer, which was slightly modified from the 2010 version by
% Thorsten Joachims & Johannes Fuernkranz, slightly modified from the
% 2009 version by Kiri Wagstaff and Sam Roweis's 2008 version, which is
% slightly modified from Prasad Tadepalli's 2007 version which is a
% lightly changed version of the previous year's version by Andrew
% Moore, which was in turn edited from those of Kristian Kersting and
% Codrina Lauth. Alex Smola contributed to the algorithmic style files.
