\documentclass[accepted]{uai2023}
\usepackage[american]{babel}

\usepackage{natbib}
    \bibliographystyle{abbrvnat}
\usepackage{booktabs}
\usepackage{tikz}

\usepackage{mathtools}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{amsthm}
\usepackage{amsfonts}

\usepackage{graphicx}
\usepackage{bm}
\usepackage{comment}
\usepackage{enumitem}

\usepackage{xcolor}
\usepackage{sidecap}
\usepackage{xr}
\externaldocument[supp:]{yi_95-supp}

\makeatletter
\newcommand*{\addFileDependency}[1]{
\typeout{(#1)}
\@addtofilelist{#1}
\IfFileExists{#1}{}{\typeout{No file #1.}}
}\makeatother
\newcommand*{\myexternaldocument}[1]{%
\externaldocument{#1}%
\addFileDependency{#1.tex}%
\addFileDependency{#1.aux}%
}
\myexternaldocument{yi_95-supp}


%\usepackage{subcaption}
\usepackage{mhchem} % Chem equation

\newcommand*\mycommand[1]{\texttt{\emph{#1}}}
\def\tcb{\textcolor{blue}}
\def\tcr{\textcolor{red}}
\def\tcg{\textcolor{green}}
\def\tcm{\textcolor{magenta}}

\def\tcrr{\textcolor{brown}}

\newcommand{\fix}{\marginpar{FIX}}
\newcommand{\new}{\marginpar{NEW}}

\newcommand\soo[1]{\textcolor{orange}{#1}}
\newcommand\joonseok[1]{\textcolor{blue}{#1}}
\newcommand\seunghoon[1]{\textcolor{pink}{#1}}
\newcommand{\sw}[1]{\textcolor{green} {#1}}
\newcommand\hongkee[1]{\textcolor{cyan}{#1}}
\newcommand\jinhwan[1]{\textcolor{purple}{#1}}


\newcommand\rebuttal[1]{\textcolor{blue}{#1}}

\newcommand\checkrequired[1]{\textcolor{red}{#1 : Check required}}

\title{Towards Physically Reliable Molecular Representation Learning}

\author[1]{Seunghoon Yi}
\author[2]{Youngwoo Cho}
\author[1]{Jinhwan Sul}
\author[1]{Seung Woo Ko}
\author[3]{\\Soo Kyung Kim}
\author[2]{Jaegul Choo}
\author[2]{Hongkee Yoon$^*$}
\author[1,4]{\href{mailto:<joonseok@snu.ac.kr>?Subject=Your UAI 2023 paper}{Joonseok Lee\thanks{Corresponding authors}}{}}
% Add affiliations after the authors
\affil[1]{%
    %Graduate School of Data Science\\
    Seoul National University\\
    Seoul, Korea
}
\affil[2]{%
    %Graduate School of AI\\
    Korea Advanced Institute of Science and Technology\\
    Daejeon, Korea
}
\affil[3]{%
    Palo Alto Research Center\\
    Stanford Research Institute\\
    Palo Alto, CA, USA
  }
\affil[4]{%
    Google Research\\
    Mountain View, CA, USA
  }


\begin{document}
\maketitle

\begin{abstract}
Estimating the energetic properties of molecular systems is a critical task in material design.
Machine learning has shown remarkable promise on this task over classical force fields, but a fully data-driven approach suffers from limited labeled data: not only is the amount of available data small, but the distribution of labeled examples is also highly skewed toward stable states.
In this work, we propose a molecular representation learning method that extrapolates well beyond the training distribution, powered by physics-driven parameter estimation from classical energy equations and self-supervised learning inspired by masked language modeling.
To ensure the reliability of the proposed model, we introduce a series of novel evaluation schemes in multifaceted ways, beyond the energy or force accuracy that has been dominantly used.
From extensive experiments, we demonstrate that the proposed method is effective in discovering molecular structures, outperforming other baselines. Furthermore, we extrapolate it to the chemical reaction pathways beyond stable states, taking a step towards physically reliable molecular representation learning.
\end{abstract}


%------------------------------------------------
\section{Introduction}

Material simulation is a vast research field that spans understanding a material's optimal structure, simulating microscopic dynamics as a function of time, temperature, and pressure beyond the experimental resolution, and reducing trial-and-error loops in designing new materials.
The foundation of this simulation is defining the energy at the atomic level while considering the interactions between numerous atoms, the so-called many-body problem.
Advances in theory and computational capability, \emph{e.g.}, Density Functional Theory (DFT; \citet{kohn_self-consistent_1965,parr1980density}), have led to higher predictability of energy with greater accuracy.
Despite these tremendous advances, however, the cost of modeling many-body interactions grows exponentially with the number of atoms, and it has been a grand challenge in computational material simulations to reduce the computational cost while improving the prediction accuracy.


Recently, machine learning approaches have drawn attention as an alternative to classical force fields that rely on physical principles and human intuition.
However, pure data-driven approaches often suffer from the limited amount and quality of available data.
Sometimes one may benefit from simulations, which provide data at a larger scale than actual experiments.
It still requires, however, expensive and time-consuming DFT or molecular dynamics (MD) simulations, accompanied by significant human analysis due to our limited knowledge.

Furthermore, for some specific system of interest (\emph{e.g.}, a drug candidate), it is often essential to accurately estimate the molecular dynamics across the reaction pathway, not just the stable states before and after the reaction.
Molecular structure data, however, are vastly available only at their stable states, while it is extremely costly to collect data on their transition states during a chemical reaction.
Therefore, it is vital to have strategies for building a stable model that extrapolates well from stable structures to unstable intermediate ones.
If we can train a physically-reasonable model that performs reasonably even at unstable states from a stable-state-only dataset, we may be able to transfer it to the reaction pathway reconstruction problem, which severely suffers from data scarcity.


Another challenge in ML-based molecular modeling is validation.
It is often challenging to verify whether the model truly learns a physically reasonable potential energy surface, which is essential for comprehending molecular structural dynamics and constructing chemical reaction pathways.
In previous works, energy estimation accuracy in a stable state has been commonly used, expecting that discovering the actual potential energy surface is needed for the model to precisely estimate its energy.
However, since the test cases are confined to stable states, it is questionable whether the model captures the true geometry of the potential energy surface, or has merely fitted to the energy values.
In other words, the meta-stability of the potential energy surface cannot be verified solely through the stable-state energy estimation.
Therefore, additional metrics and evaluation schemes that complement the current scheme would benefit the community by providing a cross-check on the validity of existing and future methods.

In this paper, we tackle the aforementioned challenges in molecular structure modeling as follows:
\begin{itemize}[leftmargin=5mm]
  \item A natural direction to tackle the data scarcity issue in data-driven models is to incorporate as much physical intuition and knowledge as possible. In this paper, we propose a \emph{physics-empowered hybrid model for molecular representation learning}, which combines the expressive power of a Transformer~\citep{vaswani2017transformer} with classical force-field-style equations.

  \item To build a physically reliable model that generalizes well beyond the stable-state-only training data, we design a \emph{self-supervised learning approach} with which the model can learn the underlying chemical rules without overly relying on the scarce labels that are available only at stable states. To be specific, we propose an effective \emph{masked atomic modeling} idea, inspired by masked language modeling.
  
  \item We examine the possibility of \emph{transfer learning} from our model trained only on stable structures to \emph{chemical reaction pathways}, which requires energy estimation of molecules at transition states that are entirely unseen during training. A general understanding of the physical rules would be essential for this challenging generalization problem.
  
  \item We design a series of \emph{novel evaluation schemes} to measure reliability of the molecular potential energy surface learned by the model.
  To be specific, we propose to recover molecular structure from perturbation, to reassemble molecules from broken bonds, and to predict the entire chemical reaction pathway mentioned above.
  Together with the existing energy estimation accuracy, our evaluation methods verify the models in multifaceted ways, preventing overfitting to a single objective.
\end{itemize}


%------------------------------------------------
\section{Related Work}
\label{sec:related}

ML potentials can be categorized into three types based on model complexity and history: kernel-based descriptors, fixed atomic descriptors, and learnable descriptors.

\textbf{Kernel-based Methods.}
Kernel-regression-based potentials, among the lightest forms of ML potentials, are mainly applied to systems with a single or a few elemental species.
Gaussian approximation potential (GAP;~\citet{bartok2010gap}), smooth overlap of atomic positions (SOAP;~\citet{bartok2013soap}), and spectral neighbor analysis potential (SNAP;~\citet{chen2017accurate}) are representative examples.
These models can be trained on a small amount of data, but they are difficult to extend to chemically complex cases.



\textbf{Fixed Descriptors.}
\citet{behler_generalized_2007} use atom-centered symmetry functions to describe the local environment of each atom and pass each descriptor value to a simple feed-forward network to map it to the total energy.
They estimate the energy for each descriptor from the distance and angle information between paired atoms within a specific cutoff.
The Behler-Parrinello neural network (BPNN; \citet{behler_generalized_2007}) series is a representative practical example that increases model complexity for high-dimensional potential energy surfaces (PES) compared to previous kernel-based methods.
BPNN was the first realistic attempt to decompose the total energy as a sum of each individual atom's energy. 
A fundamental limitation of this approach is that fixed descriptors are insufficient to cover complex spatial patterns (\emph{e.g.}, ring structures, bond types, or chemical functional groups), limiting the knowledge transferability between different molecules and atoms.
Also, the original symmetry function does not reflect the chemical environment outside the cutoff at all~\citep{kulichenko_rise_2021}.
Despite these limitations, it achieved accuracy that no previous classical force field reached.
It has been shown to work for systems with many atoms in a dense system with a few species~\citep{behler_constructing_2015,kulichenko_rise_2021}.


\textbf{Deep Learning Models.}
Recently, deep neural networks have been actively applied to construct surrogate potentials.
Most models in this category allow chemical environment information to be transferred between atoms over a greater distance than traditional models, providing a higher degree of freedom.
ANI~\citep{smith2017ani} extends BPNN by modifying its angular function.
Message Passing Neural Network (MPNN)~\citep{gilmer2017neural} is specialized in learning from a graph-structured representation by updating hidden node states using messages from adjacent nodes.
MPNN significantly improves accuracy in molecule-related tasks on the QM9 dataset~\citep{ruddigkeit2012enumeration,reymond_chemical_2015,ramakrishnan2014quantum}, while the increased model capacity carries a risk of overfitting~\citep{hawkins_problem_2004,zuo_performance_2020}.
Since then, various graph-based approaches~\citep{schutt_schnet_2018,gasteiger_dimenet_2020,unke_physnet_2019} have been proposed.
Recently, the Transformer~\citep{vaswani2017transformer} has been applied to this problem~\citep{cho2021deepdft,tholke2021equivariant}, following its success in natural language processing~\citep{devlin2018bert} and computer vision~\citep{dosovitskiy2021vit,lu2019vilbert,sun2019videobert}.


%------------------------------------------------
\section{The Proposed Method}
\label{sec:method}

\subsection{Problem Definition and Notations}

Given a molecular structure graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ is a set of $N$ atoms constructing the molecule and $\mathcal{E}$ is a set of bonds between a pair of atoms with direct interaction,
we aim at a regression problem to estimate the energy $E_\text{mol} \in \mathbb{R}$ of the molecule.
%and the force $\mathbf{F} \in \mathbb{R}^{N \times 3}$ for every atom in the molecule.
The total energy at the molecule level $E_\text{mol}$ is decomposed into the atomic-level energies, denoted by $E_i$ for each atom $i = 1, ..., N$, where $E_\text{mol} = \sum_i E_i$.
Each atom $i$ in the molecule is represented by its atomic number $z_{i} \in \mathbb{R}$, its position $\mathbf{p}_i \in \mathbb{R}^3$ in Cartesian coordinates, and the electro-negativity $n_{z_{i}} \in \mathbb{R}$ of its atom type. We denote by $\mathbf{D} \in \mathbb{R}^{N\times N}$ the pairwise $L_2$ distance matrix between atoms, computed from $\{\mathbf{p}_i\}$. Here, the element $d_{i,j}$ is the radial distance between atoms $i$ and $j$. The adjacency matrix that represents the bond information of the molecule is denoted by $\mathbf{A}\in \{0,1\}^{N\times N}$.


\subsection{Atom Representations}

We represent each atom based on its atom-wise characteristics and relation with neighboring atoms in the molecule.

\textbf{Atom-wise Representation.}
Atom $i$ is represented as an embedding $\mathbf{x}^{\text{(self)}}_i \in \mathbb{R}^{d}$
based on its type $z_i$, concatenated with its electro-negativity $n_{z_i}$:
\begin{equation}
    \mathbf{x}^{\text{(self)}}_i = [E(z_{i}) ;
    n_{z_i}],
    %E_{\text{self}, n}(n_i)].
\end{equation}
where $E$ is an embedding layer.

\textbf{Radial Basis Functions.}
Inspired by the localized orbitals in DFT, we start with a simple Gaussian basis to represent the relationship between two atoms.
For a pair of two atoms $i$ and $j$ in the molecule, we assign $n_b$ basis functions following \citet{unke_physnet_2019}:
\begin{equation}
    \label{eqn:basis_function}
    \psi_{i,j,k}(d_{i,j}) \equiv \varphi(d_{i,j}) \exp \left\{ -\beta_{z_i,k} \left( \exp(-d_{i,j}) - \mu_{z_i,k} \right)^2 \right\}
\end{equation}
where $i = 1, ..., N$ is the center atom index, $j = 1, ..., N$ is a neighboring atom index, $z_i$ is the atomic number of atom $i$, and $k = 1, 2, ..., n_b$ denotes the index of the basis for each center atom type $z_i$. $\varphi$ is a cutoff function that vanishes beyond a predefined distance threshold $\tau$, so that only atom pairs with $d_{i,j} < \tau$ contribute. With a reasonable $n_b$, we can enhance the expressiveness of the model, generating more accurate potentials. $\beta_{z_i,k}$ and $\mu_{z_i,k}$ are learnable parameters for each atom type $z_i$, which control the width and center of each individual basis, respectively.
Specifically, we use a cosine envelope function~\citep{tholke2021equivariant} for $\varphi(d_{i,j})$ to guarantee smoothness at the cutoff edge, \emph{i.e.}, $\frac{\partial{\psi(d)}}{\partial{d}}|_{d = \tau} = 0$:
{\small 
\begin{equation}
    \label{eqn:envelope}
    \varphi(d_{i,j}) = 
    \begin{cases}
        \frac{1}{2}(\cos(\frac{\pi d_{i,j}}{\tau}) + 1) & \text{if $0 \leq d_{i,j} \leq \tau $,} \\
        0 & \text{otherwise.}
    \end{cases}
\end{equation}}
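As a quick check of this property (a sketch that writes the Gaussian factor of Eq.~(\ref{eqn:basis_function}) as $g_k(d)$, a shorthand introduced here for illustration only), both the envelope and its derivative vanish at the cutoff:
\begin{align*}
    \varphi(\tau) &= \tfrac{1}{2}\left(\cos \pi + 1\right) = 0, \\
    \varphi'(d) &= -\frac{\pi}{2\tau} \sin\left(\frac{\pi d}{\tau}\right)
    \quad\Rightarrow\quad \varphi'(\tau) = 0.
\end{align*}
Since $\psi = \varphi \, g_k$, the product rule gives $\psi'(\tau) = \varphi'(\tau) g_k(\tau) + \varphi(\tau) g_k'(\tau) = 0$, so each basis function approaches zero in both value and slope as $d \rightarrow \tau$ from below, matching the zero value outside the cutoff.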

\textbf{Neighbor Embedding.}
We adopt the idea of neighbor embedding~\citep{tholke2021equivariant}, which represents relative information from nearby atoms within the distance threshold $\tau$, denoted by $\mathbf{x}^\text{(neighbor)}_i \in \mathbb{R}^{d}$:
% distance information of all atoms in the molecule under $r_c$ apart from the center atom.
\begin{equation}
    \mathbf{x}^\text{(neighbor)}_i = \sum_{j=1}^{N} \mathbf{U} \left[ \mathbf{x}^\text{(self)}_j \odot \mathbf{V} \boldsymbol{\psi}^0_{i,j} \right],
\end{equation}
where $\boldsymbol{\psi}^0_{i,j} = [\psi_{i,j,1}, ..., \psi_{i,j,n_b}] \in \mathbb{R}^{n_b}$, $\mathbf{V} \in \mathbb{R}^{d \times n_b}$ is a projection matrix from radial basis functions to the atomic embedding space, and $\odot$ indicates element-wise multiplication. $\mathbf{U} \in \mathbb{R}^{d \times d}$ is another linear projection matrix. As a result, $\mathbf{x}^\text{(neighbor)}_i \in \mathbb{R}^d$, the neighbor embedding of atom $i$, is in the same atomic embedding space.
For each atom $i$, we concatenate the atomic and neighbor embeddings and project them back to the original dimensionality by $\mathbf{W} \in \mathbb{R}^{d \times 2d}$. That is, {\small $\mathbf{x}_i = \mathbf{W} [ \mathbf{x}^\text{(self)}_i ; \mathbf{x}^\text{(neighbor)}_i]$}.


\subsection{Our Transformer Model}
\label{sec:method:model}

\begin{figure*}
	\centering
	\includegraphics[width=0.99\linewidth]{model}
	\caption{(a) Our model architecture. (b) Detailed Molecular Attention Block. (c) \ce{C_2H_4} example.}
	\label{fig:Main_model}
\end{figure*}

As illustrated in Fig.~\ref{fig:Main_model}, our model is based on a Transformer. Given a molecule as a set of its $N$ atoms, encoded as $\mathbf{x}_i \in \mathbb{R}^{d}$ for $i = 1, ..., N$, our model adds an additional \texttt{[CLS]} token, denoted by $\mathbf{x}_0 \in \mathbb{R}^d$, to explicitly learn to represent the overall molecule embedding. On this input sequence, the model stacks $L$ Molecular Attention Blocks (MAB) to contextualize each atom representation across the molecule (within the cutoff distance $\tau$).
%, which will be detailed subsequently.
The atom embedding after $\ell = 0, ..., L$ stages of the MABs is denoted by {\small $\mathbf{x}_i^{(\ell)}$}.
After $L$ blocks, the final sequence of atomic embeddings {\small \{$\mathbf{x}_i^{(L)}$\}} are produced.

From these embeddings, we estimate the overall molecule-level energy in two ways that are popular with Transformers.
First, we predict the atom-level energy $E_i$ for atom $i$ by passing {\small $\mathbf{x}_i^{(L)}$} through an MLP. That is,
{\small $\hat{E}_i = f_\text{atom} (\mathbf{x}_i^{(L)})$}, % \label{eq:atom_f}
where $f_\text{atom}: \mathbb{R}^{d} \rightarrow \mathbb{R}$ is an atom-level energy regressor, and then summation over all atoms $i = 1, ..., N$ gives the molecule-level energy; that is, $\hat{E}_\text{mol} = \sum_{i=1}^N \hat{E}_i$.
Another approach is to directly compute the molecule-level energy from the \texttt{[CLS]} token by {\small 
$\hat{E}_\text{mol} = f_\text{mol} (\mathbf{x}_0^{(L)})$}, %\label{eq:mol_f}
where $f_\text{mol}: \mathbb{R}^{d} \rightarrow \mathbb{R}$ is a molecule-level energy regressor.
%For $f_\text{atom}$ and $f_\text{mol}$, we use a single fully-connected layer.
Both approaches are evaluated in Sec.~\ref{sec:exp}.
In Sec.~\ref{sec:method:physics}, we will introduce our main approach for this regression to take advantage of domain knowledge from physics.

\textbf{Details on Molecular Attention Block.}
Each Molecular Attention Block (MAB) at level $\ell$ takes a sequence of atomic embeddings {\small $\{\mathbf{x}_i^{(\ell-1)} : i = 0, ..., N\}$} from the previous level. For each atom {\small $\mathbf{x}_i^{(\ell-1)}$} as query and all atoms including $i$ as the context (keys and values), it performs self-attention as in Fig.~\ref{fig:Main_model}(b).
Following TorchMDNet~\citep{tholke2021equivariant}, we modify the vanilla Transformer~\citep{vaswani2017transformer} to explicitly reflect the relation arising from the physical distance between two atoms $i$ and $j$, in addition to the semantic relevance between them modeled by regular Transformers.
Specifically, from the radial basis {\small $\boldsymbol{\psi}^0_{i,j}$}~\citep{orr1996introduction}, we compute $\mathbf{D}^K, \mathbf{D}^V \in \mathbb{R}^{N \times N \times m}$, where $m$ is the embedding dimensionality used for query, key, and value. An element {\small $d_{i,j}^K, d_{i,j}^V \in \mathbb{R}^m$} represents physical tendency to attract each other between atom $i$ and $j$ for key-purpose and value-purpose, respectively. These are mapped from the radial basis function {\small $\boldsymbol{\psi}_{i,j}^0$} by a linear layer, followed by SiLU~\citep{elfwing2018silu} activation. This relation is represented as $\mathbb{R}^{m}$ instead of a scalar to reflect the dimension-wise relationship.

Beyond the changes introduced by \citet{tholke2021equivariant}, we additionally feed the adjacency matrix $\mathbf{A} \in \{0, 1\}^{N \times N}$ through a linear layer and SiLU activation. The resulting $\mathbf{A}$-mask is multiplied element-wise with the inferred attention weights, in order to further control the semantic relevance based on physical adjacency.
For instance, two atoms that are far apart are likely to be multiplied by a low value, which weakens their relationship even if their semantic relevance is estimated to be high. This part is optional, and we provide an ablation study in Sec.~\ref{sec:exp}.


\subsection{Physics-driven Parametric Energy Prediction}
\label{sec:method:physics}

Instead of directly regressing to the atom or molecule energy as described in Sec.~\ref{sec:method:model}, we propose to design a parametric model that reflects physical insights.
For this formulation, we use a form that simultaneously reflects the repulsive and attractive forces between two atoms $i$, $j$ within the bond energy $E_{i,j}$; namely, Coulomb's law and Lennard-Jones Potential (LJP):
{\small
\begin{equation}
    E_{i,j} = -\beta_1 \frac{\beta_0}{d_{i,j}} + \beta_2 \left[ \left( \frac{\beta_4}{d_{i,j}} \right)^{2\beta_3} - 2\left( \frac{\beta_4}{d_{i,j}} \right)^{\beta_3} \right].
    \label{eq:ljp}
\end{equation}}
$\beta_0$ corresponds to the influence of charges ($q_i q_j$) between two atoms in Coulomb potential.
$\beta_4$ is the equilibrium distance between atoms $i$ and $j$, where the repulsive and attractive forces of the LJP part balance each other, so that the net LJP force becomes zero and the LJP energy attains its minimum, $-\beta_2$ (for $\beta_2, \beta_3 > 0$).
$\beta_1$ and $\beta_2$ are linear coefficients for the Coulomb and LJP parts. It is known that $\beta_3 \approx 6$ for the London dispersion force~\citep{london_zur_1930,cornell_second_1995}, but the repulsive exponent $2\beta_3 \approx 12$ is a much cruder approximation (chosen as the square of the attractive term), so we leave $\beta_3$ as an open parameter to be learned from the data.
These five parameters, denoted by $\boldsymbol{\beta} = [\beta_0, \beta_1, \beta_2, \beta_3, \beta_4]$, are estimated by a regressor $f_\text{bond}: \mathbb{R}^{2d} \rightarrow \mathbb{R}^5$; that is, $\hat{\boldsymbol{\beta}} = f_\text{bond} ([\mathbf{x}_i ; \mathbf{x}_j])$.
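To see why $\beta_4$ acts as an equilibrium distance, one can differentiate the LJP part of Eq.~(\ref{eq:ljp}) with respect to the distance. The following is only a sketch: it treats $\boldsymbol{\beta}$ as constants (in the model they are predicted from the embeddings), and the shorthands $E_\text{LJ}$, $d = d_{i,j}$, and $u = \beta_4 / d$ are introduced here for illustration. With $E_\text{LJ}(d) = \beta_2 \left[ u^{2\beta_3} - 2 u^{\beta_3} \right]$ and $\frac{\partial u}{\partial d} = -\frac{u}{d}$,
\begin{align*}
    \frac{\partial E_\text{LJ}}{\partial d}
    &= \beta_2 \left[ 2\beta_3 u^{2\beta_3 - 1} - 2\beta_3 u^{\beta_3 - 1} \right] \left( -\frac{u}{d} \right) \\
    &= -\frac{2 \beta_2 \beta_3}{d} \left[ u^{2\beta_3} - u^{\beta_3} \right].
\end{align*}
For $\beta_2, \beta_3 > 0$, the derivative is negative for $d < \beta_4$ (repulsion), zero at $d = \beta_4$, and positive for $d > \beta_4$ (attraction), so $d = \beta_4$ is the minimum of the LJP part, with $E_\text{LJ}(\beta_4) = \beta_2 (1 - 2) = -\beta_2$. Setting $\beta_3 = 6$ recovers the familiar 12-6 form $\beta_2 \left[ (\beta_4/d)^{12} - 2 (\beta_4/d)^{6} \right]$. Note that the Coulomb term also depends on $d$, so the minimum of the full pair energy $E_{i,j}$ can be shifted away from $\beta_4$.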

The overall molecule-level energy is calculated by the sum of all pair-wise bond energies and the atomic self-energies; that is, $\hat{E}_\text{mol} = \hat{E}_\text{bond} + \hat{E}_\text{atom}$, where $\hat{E}_\text{bond}$ and $\hat{E}_\text{atom}$ are defined as
{\small 
\begin{equation}
    \hat{E}_\text{bond} = \sum_{i=1}^N \sum_{j>i}^N \hat{E}_{i,j}, \quad\mathrm{and}\quad  \hat{E}_\text{atom} = \sum_{i=1}^N f_{\text{atom}}\left(\mathbf{x}_i^{(L)}\right).
    \label{eq:bond_f_sum}
\end{equation}}

We summarize what to expect from this physics-based modeling
as follows.
First, we aim to satisfy physical conditions so that the model better extrapolates to unseen cases.
Second, by observing the predicted parameters, we can monitor whether the model actually captures the known physical properties of the molecule.
Lastly, we expect the model to predict the energy directly from $\hat{E}_\text{atom}$ if the given formula is difficult to follow.
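In Eq.~(\ref{eq:ljp}), for instance, if the inter-atomic potential does not fit well with LJP, the model can assign $\beta_3 \approx 0$, which reduces the LJP part to a distance-independent constant ($-\beta_2$) and leaves the Coulombic potential as the only distance-dependent term.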

The model minimizes the MSE loss between the predicted molecule energy $\hat{E}_\text{mol}$ and its ground truth $E_\text{mol}$; that is, $\mathcal{L}_\text{energy} = \|\hat{E}_\text{mol} - E_\text{mol}\|^2$.


\subsection{Masked Atomic Modeling}

Masked Language Modeling (MLM), originally introduced by BERT~\citep{devlin2018bert} for language modeling, has been successfully utilized as a pre-training task for various models~\citep{lu2019vilbert,sun2019videobert,zhang2020hierarchical}.
The main idea is to randomly mask a subset of tokens and let the model recover them from their context, \emph{i.e.}, the other textual or visual tokens in the input sequence.
This concept naturally supports self-supervised learning as long as the elements in the sequence are contextually relevant, requiring no human labeling.

In this paper, we propose Masked Atomic Modeling (MAM) in a similar spirit. All chemical materials are composed of multiple atoms, often with more than one type.
When the majority of atoms in a valid molecule are known, the set of possible atoms for the remaining positions is significantly reduced by the properties of each atom under the laws of chemistry, \emph{e.g.}, the octet rule and Lewis symbol analysis.
With MAM, we train our Transformer to discover such chemical restrictions purely by observing a set of valid molecules in the training examples without direct supervision.

Formally, on a sequence $\mathbf{X} \in \mathbb{R}^{N \times d}$ with $N$ atoms,
%and the \texttt{[CLS]} token,
we randomly mask each token with probability $\rho$ (we use 0.3, twice the rate used by~\citet{devlin2018bert}), replacing the masked tokens with \texttt{[MASK]}.
The model is trained to minimize the log loss over the masked tokens:
\begin{equation}
  \mathcal{L}_\text{mask} = -\log p(\mathbf{X} \otimes \mathbf{m} | \mathbf{X} \otimes (\boldsymbol{1} - \mathbf{m}) ),
  \label{eq:mam_loss}
\end{equation}
where $\mathbf{m} \in \{0, 1\}^N$ is a binary mask vector for atoms, $\boldsymbol{1}$ is a one-valued vector, and $\otimes$ indicates row-wise multiplication. $p$ is estimated by a binary classifier, where we use a two-layer MLP.
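To make Eq.~(\ref{eq:mam_loss}) more concrete, one common way to expand it is sketched below. It assumes, as in BERT-style masked language modeling, that the masked atoms are predicted independently given the unmasked context; the index set $\mathcal{M}$ and the row notation $\mathbf{x}_i$ for the $i$-th token of $\mathbf{X}$ are shorthands introduced here, and whether the model factorizes the loss exactly this way is an assumption:
\begin{equation*}
    \mathcal{L}_\text{mask} = -\sum_{i \in \mathcal{M}} \log p\left(\mathbf{x}_i \mid \mathbf{X} \otimes (\boldsymbol{1} - \mathbf{m})\right),
    \quad \mathcal{M} = \{ i : m_i = 1 \}.
\end{equation*}
Since each atom is masked independently with probability $\rho$, the expected number of masked atoms is $\rho N$; for instance, a molecule with $N = 10$ atoms and $\rho = 0.3$ has three atoms hidden on average per training pass.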


\subsection{Combining Physical Constraints}
\label{sec:method:constraints}

\textbf{Zero-Force Regularization.}
When a molecule is in its equilibrium state, the net force on each atom should be zero. This condition may provide a strong hint for the model to find the valid and optimal molecular structure, but it has not been utilized well in existing studies.
Thus, we add a regularizer that minimizes the force, computed as the partial derivatives of the predicted energy with respect to the three Cartesian coordinates ($x, y, z$). Formally,
{\small \begin{equation}
    \mathcal{L}_\text{force} =
    %\sum_{\mathbf{X} \in \mathcal{D}} 
    \sum_{i=1}^{N} \| \hat{\mathbf{F}}_i \|^2
    = %\sum_{\mathbf{X} \in \mathcal{D}} 
    \sum_{i=1}^{N} \left[ \left( \frac{\partial E_i}{\partial x}\right)^2 + \left(\frac{\partial E_i}{\partial y} \right)^2 + \left( \frac{\partial E_i}{\partial z} \right)^2 \right],
    \nonumber
\end{equation}}
where $\hat{\mathbf{F}}_i \in \mathbb{R}^3$ is the predicted force on atom $i$.
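For context, the following sketch shows how such a force follows from a distance-dependent pair energy like Eq.~(\ref{eq:ljp}). It assumes the usual convention that the force is the negative gradient of the energy with respect to the atomic position, and it treats the predicted parameters $\boldsymbol{\beta}$ as constants, which is a simplification. Since $d_{i,j} = \| \mathbf{p}_i - \mathbf{p}_j \|_2$, the chain rule gives
\begin{equation*}
    \nabla_{\mathbf{p}_i} d_{i,j} = \frac{\mathbf{p}_i - \mathbf{p}_j}{d_{i,j}},
    \qquad
    -\nabla_{\mathbf{p}_i} E_{i,j} = -\frac{\partial E_{i,j}}{\partial d_{i,j}} \, \frac{\mathbf{p}_i - \mathbf{p}_j}{d_{i,j}}.
\end{equation*}
Under these assumptions, each pair contributes a force along the line joining the two atoms, and the contribution vanishes exactly where $\partial E_{i,j} / \partial d_{i,j} = 0$. Because $\mathcal{L}_\text{force}$ only involves squared components, it does not depend on the sign convention of the force.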

\textbf{Inequality Bound Condition.}
A stable equilibrium structure of a molecule corresponds to the lowest energy under the given composition.
Such an optimal structure can be found by estimating the energy of the given structure, differentiating it with respect to the atomic positions, and updating the positions according to the resulting force.
Naturally, if there is any local deviation from the optimal structure, the energy is always higher than its ground state.
This sounds obvious physically, but a machine learning model is unaware of this and thus its estimation may be invalid.
Thereby, we apply an additional condition that the energy should be greater than the ground state when locally deviating from the stable structure, to narrow down the solution space.
During training, small Gaussian noise with an amplitude of 0.5 \AA\ is applied to the optimal structure. This is implemented by an additional loss $\mathcal{L}_\text{bound}$ based on the energy inequality condition:
\begin{equation}
\mathcal{L}_\text{bound} = \left\{ 
   \begin{array}{ c l }
    {\hat{E}_\text{mol}} - \hat{E}_\text{mol}^{*}   & \quad \textrm{if } {\hat{E}_\text{mol}^{*}} \leq  \hat{E}_\text{mol}, \\
    0                 & \quad \textrm{otherwise.}
  \end{array}
\right.
\end{equation}
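Equivalently, the piecewise definition above can be written as a hinge-style penalty,
\begin{equation*}
    \mathcal{L}_\text{bound} = \max\left(0, \; \hat{E}_\text{mol} - \hat{E}_\text{mol}^{*}\right),
\end{equation*}
which is zero whenever $\hat{E}_\text{mol}^{*} > \hat{E}_\text{mol}$ and grows linearly with the gap otherwise. One reading consistent with the inequality condition (an assumption here, since the two symbols are not defined explicitly in the text) is that $\hat{E}_\text{mol}$ is the predicted energy of the optimal structure and $\hat{E}_\text{mol}^{*}$ that of its perturbed copy, so that the loss is active only when the perturbed structure is predicted to be no higher in energy. As a purely illustrative example with made-up values, $\hat{E}_\text{mol} = -10.0$ eV and $\hat{E}_\text{mol}^{*} = -10.2$ eV give $\mathcal{L}_\text{bound} = 0.2$ eV, whereas $\hat{E}_\text{mol}^{*} = -9.8$ eV gives $\mathcal{L}_\text{bound} = 0$.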


\subsection{Overall Objective}

Combining all together, our model minimizes
\begin{equation}
    \mathcal{L} =
    \mathcal{L}_\text{energy} + \lambda_\text{mask} \mathcal{L}_\text{mask} +
    \lambda_\text{force} \mathcal{L}_\text{force} + \lambda_\text{bound} \mathcal{L}_\text{bound}, 
\end{equation}
where $\lambda_\text{force}$, $\lambda_\text{mask}$, and $\lambda_\text{bound}$ are coefficients controlling relative importance of each loss term.

%------------------------------------------------
\section{Experiments and Results}
\label{sec:exp}

We conduct experiments to answer the following questions: \textbf{Q1}. How does our model perform on energy estimation compared to other models? (Sec.~\ref{sec:exp:baseline}) \textbf{Q2}.
Do our model and the baselines truly comprehend the structure of the molecular potential energy surface?
(Sec.~\ref{sec:exp:structure_opt}--\ref{sec:exp:harder_tasks})
\textbf{Q3}. How much do the physics-driven constraints affect the prediction? (Sec.~\ref{sec:exp:mam_analysis})


\subsection{Experimental Settings}

\textbf{Datasets.}
We use three public datasets to evaluate the proposed model.
The QM9 dataset~\citep{ruddigkeit2012enumeration,ramakrishnan2014quantum} is a collection of optimal structures of 130,000 molecules composed of $\{$C, H, O, N, F$\}$ with up to 9 heavy (non-hydrogen) atoms, selected from GDB-17~\citep{ruddigkeit2012enumeration}.
This dataset contains only the stable structure of molecules.
We use 80\% for training, 5\% for validation, and 15\% for testing.
The OC20 dataset~\citep{ocp_dataset} contains stable structures and relaxation trajectories for systems of 15K bulk catalysts and 82 adsorbates.
We evaluate our model on the relaxed energy prediction with a given initial structure (IS2RE).
To evaluate performance on non-equilibrium molecular conformations and reactions, we use the Transition1x dataset~\citep{schreiner_transition1x_2022}, which contains reaction paths from 10k organic reactions, with 10M molecular conformations.

\textbf{Baselines.}
We compare our model to several state-of-the-art energy prediction models: SchNet~\citep{schutt_schnet_2018}, CGCNN~\citep{xie2018crystal}, DimeNet~\citep{gasteiger_dimenet_2020}, TorchMDNet(ET)~\citep{tholke2021equivariant}, ForceNet~\citep{hu_forcenet_2021}, and MXMNet~\citep{zhang2020molecular}.

\textbf{Evaluation Metric.}
We report the mean absolute error (MAE) between the ground truth and predicted energy (MAE$_\mathrm{E}$, in meV/mol) and force (MAE$_\mathrm{F}$, in eV/\AA), following existing studies.
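As a sketch of one common convention for these metrics (the symbols $M$ for the number of test molecules, $N_m$ for the number of atoms in molecule $m$, and the superscript $(m)$ are introduced here for illustration, and the exact averaging used for the force is an assumption):
\begin{align*}
    \mathrm{MAE}_\mathrm{E} &= \frac{1}{M} \sum_{m=1}^{M} \left| \hat{E}_\text{mol}^{(m)} - E_\text{mol}^{(m)} \right|, \\
    \mathrm{MAE}_\mathrm{F} &= \frac{1}{3 \sum_{m} N_m} \sum_{m=1}^{M} \sum_{i=1}^{N_m} \left\| \hat{\mathbf{F}}_i^{(m)} - \mathbf{F}_i^{(m)} \right\|_1.
\end{align*}
Here the force error is averaged over all Cartesian components of all atoms; some works instead average per-atom error norms, which yields different absolute values.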

More implementation details are provided in Appendix~\ref{supp:sec:exp:impl}.

\begin{table*}
    \centering
    {\small
    \begin{tabular}{l|ccc|cc}
        \toprule
        Dataset (Task) & \multicolumn{3}{c|}{QM9} & \multicolumn{2}{c}{OC20 (IS2RE)}\\
        Model & MAE$_\mathrm{E}$($\downarrow$) & MAE$_\mathrm{F}$ ($\downarrow$) & {$\Delta P$}($\downarrow$) & MAE$_\mathrm{E}$($\downarrow$) & {$\Delta P$}($\downarrow$) \\ \midrule
        SchNet~\citep{schutt_schnet_2018}             & 14.00 & 2.64 & 0.47 & 1.059 & 0.60 \\
        CGCNN~\citep{xie2018crystal}              & --    & --   & --   & 0.988 & 0.58 \\
        MXMNet~\citep{zhang2020molecular}             & \bf{5.90}  & 1.83 & 1.57    & --     & --     \\
        DimeNet~\citep{gasteiger_dimenet_2020}            & 8.02  & 1.79 & 0.58 & 1.012 & 0.55 \\
        ForceNet~\citep{hu_forcenet_2021}           & 18.62 & 0.41 & 0.21   & --     & --     \\
        TorchMDNet (ET)~\citep{tholke2021equivariant}                 & 6.15  & 1.15 & 0.32 & --     & --     \\
        \midrule
        Ours ($\mathcal{L}_\text{energy}$ only)    & 8.35 & 1.28 & 1.23 & -- & -- \\ 
        \midrule
        Ours (full model)  & 15.16\scriptsize{$\pm$0.539} & \textbf{0.0057}\scriptsize{$\pm$0.001} & \textbf{0.0251}\scriptsize{$\pm$0.01} & \textbf{0.887}\scriptsize{$\pm$0.024} & \textbf{0.10}\scriptsize{$\pm$0.01}\\ 
        $p$-value     & -- & 0 & $3.2\times 10^{-7}$ & $2.6\times 10^{-4}$ & $7.0\times 10^{-8}$ \\ 
        \bottomrule
    \end{tabular}}
    \caption{Comparison with baseline models on energy and force accuracy (in MAE) and average distortion $\Delta P$ after the structure optimization experiment.
    %{Lower values mean better performance for all metrics.}
    We report MAE$_\mathrm{E}$ in meV/mol, MAE$_\mathrm{F}$ in eV/\AA, and $\Delta P$ in \AA. All results are averaged over 5 trials with different random seeds, and $p$-values are computed against the second-best method.
    }
    \label{tab:comparison_baselines}
\end{table*}


\subsection{Comparison with Baselines}
\label{sec:exp:baseline}

In this line of research, the MAE of energy estimation has been the most widely used metric.
Primary applications of molecular energy calculation are searching for stable structures and performing molecular dynamics (MD) simulations of structural changes over temperature and time. These applications are the foundation for the design and discovery of new materials~\citep{friederich_machine-learned_2021,louie_discovering_2021}.

A glance at the $\mathrm{MAE}_\mathrm{E}$ column for QM9 in Tab.~\ref{tab:comparison_baselines} shows that our proposed model estimates molecular energy comparably to the baselines, slightly lagging behind the current state of the art.
An underlying assumption for relying on energy estimation accuracy to evaluate molecular representation learning is that the model would need to understand the actual molecular structure in order to precisely estimate its energy.
We question this assumption: although energy estimation and structure understanding are positively correlated, a model evaluated solely on energy might overfit to energy estimation, optimizing beyond what the physical rules permit.
This is because, with data-driven approaches, the model is not fully informed of physical constraints and simply optimizes the objective on a limited amount of data.

For this reason, we additionally check the MAE in force prediction.
Ideally, the net force should be 0 for a molecule in a stable state. If a model has learned the correct molecular structure, the estimated net force should be close to 0 as well.
The MAE$_\mathrm{F}$ column of Tab.~\ref{tab:comparison_baselines} reports the force estimation accuracy of each model, where the force is obtained by differentiating the energy with respect to the atomic positions.
On QM9, our full model precisely estimates zero net force (MAE$_\mathrm{F} \approx 0$), indicating that our $\mathcal{L}_\text{force}$ introduced in Sec.~\ref{sec:method:constraints} plays its expected role and the learned force condition generalizes well to the unseen test set.
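The relation underlying this check can be sketched as follows, writing $\hat{E}$ for the predicted energy and $\mathbf{r}_a$ for the position of atom $a$ in a molecule with $n$ atoms (notation assumed here for illustration):
\begin{gather*}
\hat{\mathbf{F}}_a = -\nabla_{\mathbf{r}_a} \hat{E}(\mathbf{r}_1, \dots, \mathbf{r}_n), \\
\hat{\mathbf{F}}_a \approx \mathbf{0} \quad \text{for every atom } a \text{ at a stable structure}.
\end{gather*}
Because the reference forces vanish at equilibrium, the force error on such structures reduces to the magnitude of the predicted force itself.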

Interestingly, however, other baselines achieving better energy accuracy, as well as our model trained only with $\mathcal{L}_\text{energy}$, catastrophically fail to estimate zero net force.
This contradicts the common assumption that precise energy estimation relies on a general understanding of the molecular structure and the underlying physical rules.
This result indicates that over-optimizing the single energy criterion leads the model to violate the basic constraints that it must satisfy for a valid structure, making the achieved energy accuracy meaningless as well.

The rest of Tab.~\ref{tab:comparison_baselines} reports performance on OC20, comparing against a few baselines using scores reported by the Open Catalyst Project\footnote{\scriptsize{\url{https://github.com/Open-Catalyst-Project}}}.
Our method is competitive on both tasks, outperforming all baselines.
Note that the difference in MAE$_\mathrm{F}$ is not as dramatic as in QM9, since both energy and force information are included in OC20 and utilized by all models.

In conclusion, the premise that the model with the lowest energy error is the optimal model holds only if the model is optimized under ideal conditions satisfying all physical restrictions; that is, it perfectly recovers the true potential energy surface (PES), and the energy is precisely calculated on this PES. Because a machine learning approach is not always constrained to reflect the physical restrictions of reality, it may find a solution outside the valid range, representing a case that is not physically possible. For this reason, it is important to measure additional metrics beyond the energy for more reliable learning and model selection.



\subsection{Qualitative Analysis with Structure Optimization}
\label{sec:exp:structure_opt}

\begin{figure*}[t]
	\centering
	\includegraphics[width=0.95\linewidth]{Fig_opt_2}
	\caption{
        (a) Structural optimization results. The left-most column is the initial stable structure in QM9, followed by recovery results by competing models sequentially.
        For more structural optimization results, see Appendix Fig.~\ref{supp:fig:append}.
        (b--c) Distribution of the energy difference ($\Delta E_{\rm g}$) and structural change ($\Delta P$) before and after structural optimization, on a log scale.}
\label{fig:sampleopt}
\end{figure*}

In order to see if the models actually capture the optimal structure of molecules, we design an additional structure optimization experiment.
Starting from the stable structures in the dataset, we slightly perturb each atom's position from its original optimum and optimize the structure again, expecting it to converge back to the original optimum. Upon convergence, we measure the distortion $\Delta P$, defined as the average Euclidean distance between each atom's final position and its optimal position in the ground truth.
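Written out under assumed notation (with $\mathbf{r}_a^{\mathrm{opt}}$ the position of atom $a$ after re-optimization, $\mathbf{r}_a^{\mathrm{GT}}$ its ground-truth position, and $n$ the number of atoms; we further assume that both structures are expressed in the same coordinate frame), this measure reads
\begin{equation*}
\Delta P = \frac{1}{n} \sum_{a=1}^{n} \bigl\| \mathbf{r}_a^{\mathrm{opt}} - \mathbf{r}_a^{\mathrm{GT}} \bigr\|_2 .
\end{equation*}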
%structure distortions $\Delta P$, in average Euclidean distance, as each atom moves from its optimal structure.

The $\Delta P$ columns of Tab.~\ref{tab:comparison_baselines} compare the performance of each model on this experiment.
Our physics-driven model attains a substantially lower $\Delta P$ than the other models (0.025~\AA\ versus 0.21~\AA\ for the best baseline on QM9, and 0.10~\AA\ versus 0.55~\AA\ on OC20), demonstrating its proficiency in learning the potential energy surface of the target molecule. Moreover, it is capable of reproducing a stable structure rather than over-optimizing solely on energy estimation.

Fig.~\ref{fig:sampleopt}(a) shows the structures optimized by the baseline models and ours. The left-most column displays the initial stable structures, which the baselines fail to maintain. For instance, in the case of \ce{CH4} (top row), the hydrogen atoms surrounding the carbon atom should be arranged symmetrically, but the structures optimized by the baselines lack this symmetry. In contrast, our model successfully recovers the optimal structure even in complex scenarios.

Fig.~\ref{fig:sampleopt}(b--c) shows the distributions of the energy difference $\Delta E_{\rm g}$ and the distance deviation $\Delta P$ before and after reoptimization, calculated over 256 molecules (comprising the 128 smallest and 128 randomly sampled larger molecules) from QM9.
Our model achieves a center value of $\Delta E_{\rm g}$ that is two orders of magnitude smaller than those of the other potentials, indicating its superior ability to recover the optimal structure.
Also, in Fig.~\ref{fig:sampleopt}(c), the distance deviation $\Delta P$ is mostly less than 0.1~\AA, and our model's $\Delta P$ values are at least 10 times smaller than those of the other models.
Although this task is challenging even for molecular dynamics, our model's strong performance on a stable-structure-only dataset such as QM9 indicates its capability of capturing fundamental physical principles, such as distance symmetry, from limited information. Additional examples are presented in Appendix~\ref{supp:sec:exp:examples}.


\subsection{Molecular Assembly and Chemical Reaction Pathway Prediction}
\label{sec:exp:harder_tasks}

\begin{figure}
	\centering
    \includegraphics[width=1\linewidth]{Fig3_Assemble}
	\caption{Molecule assembly results on (a) GDB-35 and (b) GDB-87.
    The original stable structure (GT) is recovered at 500 steps, connecting the broken bond.
    (c-d) Failure results by our model trained \emph{without bound conditions}.}
    \label{fig:sample_assemble}
\end{figure}



We apply our approach to two additional tasks that assess the stability of the learned potential on non-equilibrium structures: molecular assembly and chemical reaction pathway prediction.
For molecular assembly, the energy profile continuously decreases from the initial structure to the optimal one, whereas chemical reactions require overcoming an activation barrier.

\begin{figure}
	\centering
    \includegraphics[width=0.98\linewidth]{Fig4_Reaction}
	\caption{
        Examples of energy prediction following the reaction pathways on Transition1x.
        The three structures in each panel correspond to the representative structures along each reaction coordinate: the reactant, transition state, and product structure, respectively.
        For more reaction barrier results, see Appendix Fig.~\ref{supp:fig:tr1xmore}.
    }
    \label{fig:figurechemreaction}
\end{figure}


\textbf{Molecular Assembly.}
The molecular assembly task presents an additional challenge beyond the structure optimization presented in Sec.~\ref{sec:exp:structure_opt}, where the objective is to recover the stable structure from an (almost) optimal structure. This task involves breaking one or more bonds in the molecule by moving functional groups far apart, and recovering the original stable structure from this completely broken one. To accomplish this, we randomly select one or two functional groups in a molecule and disconnect the bonds between them by translating each in a different direction, with a displacement of 0.7~\AA.
We begin with the distorted structure and optimize it using the energy profile of our model to determine if it can regain the original stable structure.
Since the training dataset does not contain non-equilibrium information, it is challenging for the model to accurately discover the energy values along the pathway in which molecules are combined.
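One way to formalize the perturbation described above is sketched below; the symbols $G_k$, $\mathbf{u}_k$, and $\delta$ are introduced here for illustration only. For each selected functional group $G_k$, every atom $a \in G_k$ is translated rigidly as
\begin{gather*}
\mathbf{r}_a' = \mathbf{r}_a + \delta\,\mathbf{u}_k, \\
\|\mathbf{u}_k\|_2 = 1, \qquad \delta = 0.7~\text{\AA},
\end{gather*}
where distinct groups receive different unit directions $\mathbf{u}_k$. How these directions are chosen is left unspecified in this sketch.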

As shown in Fig.~\ref{fig:sample_assemble}, only our method succeeds in recovering the original structure, while others show catastrophic failure.
We also evaluate our model trained without the bound conditions on the same task. Fig.~\ref{fig:sample_assemble}(c-d) illustrates that our model also fails in this case.
%Fig.~\ref{fig:sample_assemble}(c-d) illustrates the failure when training model without bound conditions on the same task. 
This highlights the importance of the bound conditions to learn a physically reasonable potential, even with a limited dataset consisting only of optimal structures.

\textbf{Chemical Reaction Pathway Energy Prediction.}
Lastly, we conduct an even more complex task of predicting energies across the complete chemical reaction pathway, encompassing the structures of reactants, transition states, and products.
To accomplish this task, we adopt a transfer learning approach, initializing the weights from a model pre-trained on QM9 and subsequently fine-tuning on Transition1x.
This is because the two datasets provide different kinds of information. QM9 contains 13$\times$ more types of molecules than Transition1x, so the model is pre-trained on QM9 to learn general molecular structures at an equilibrium state.
The model is then fine-tuned on Transition1x, which covers fewer types of molecules than QM9, to learn the transition dynamics.

Fig.~\ref{fig:figurechemreaction} shows a few examples of energy profiles, following the reaction pathway on the validation set of Transition1x.
Our model accurately predicts not just the energy of the most stable structure (product) but also that of reactant and transition state structures.
A slightly higher error in energy estimation is observed near the transition state, but it is not significant enough to alter the activation barrier height that defines the chemical reaction rates.
From this result, we conclude that our approach is effective in constructing a more general potential energy surface from limited information.
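For reference, the barrier height discussed above can be written, under assumed notation, as
\begin{equation*}
\Delta E^{\ddagger} = E_{\mathrm{TS}} - E_{\mathrm{reactant}},
\end{equation*}
and its link to kinetics is commonly summarized by the Arrhenius relation $k = A \exp\bigl(-E_a / (k_\mathrm{B} T)\bigr)$, in which the activation energy $E_a$ is closely related to $\Delta E^{\ddagger}$, $A$ is a pre-exponential factor, $k_\mathrm{B}$ is the Boltzmann constant, and $T$ is the temperature. Given this exponential dependence, an energy error near the transition state affects the predicted rate only insofar as it shifts $\Delta E^{\ddagger}$.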

\subsection{Self-supervised Learning with MAM}\label{sec:exp:mam_analysis}

\begin{figure*}
	\centering
	\includegraphics[width=0.85\linewidth]{Fig_MAM}
	\caption{
		Visualization of MAM.
		(a), (c) The masked atom is moved along the pink arrow ($z$-axis), and (b), (d-e) illustrate the likelihood score along corresponding positions.}
	\label{fig:sampleattention}
\end{figure*}

Fig.~\ref{fig:sampleattention} illustrates the effect of self-supervised learning with MAM, depending on the position of atoms.
For example, Fig.~\ref{fig:sampleattention}~(a) shows the example of \ce{CH4}, where we perform MAM inference to determine an appropriate atom type along the vertical direction. Fig.~\ref{fig:sampleattention}~(b) shows the inferred atom type at each position, from atomic number 1 to 14. The atom types that QM9 covers, H, C, N, O, and F, are marked in the figure.

Fig.~\ref{fig:sampleattention}~(b) shows that around $\pm$2~\AA\ from the center, carbon is strongly favored.
On the other hand, MAM shows a very low affinity for fluorine (F), which is not chemically favored.
The nitrogen and carbon of \ce{C4NH5} also show a similar trend, as shown in Fig.~\ref{fig:sampleattention}~(c-e).
In Fig.~\ref{fig:sampleattention}~(e), carbon is favored by MAM as expected, and interestingly, nitrogen is also weakly favored, unlike in \ce{CH4}. Presumably, this is due to the shape of the \ce{C4NH5} molecule.
Note that the amplitude of the atom recommendation through MAM is maximized at the most stable energy position.
This suggests that, through MAM, the model learns by itself the relationship between surrounding atoms from the energy and the positions.
In molecule generation tasks, MAM would be more efficient than randomly connecting atoms and repeating structural optimization iteratively.


\subsection{Ablation Study}

\begin{table}
\centering
{\scriptsize
\renewcommand{\tabcolsep}{1mm}
\begin{tabular}{c|cccccc|ccc}
\toprule
No. & Base & \texttt{[CLS]} & LJP & Mask & Force & Bound & MAE$_\text{E}\downarrow$ & MAE$_\text{F}\downarrow$    & $\Delta P\downarrow$ \\ 
\midrule
1 & \checkmark    &                                     &                                   &                                    &                                     &                                     & 11.83   & \textcolor{red}{0.77}                    & \textcolor{red}{1.76}                    \\
2 & \checkmark    & \checkmark     &                                   &                                    &                                     &                                     & 9.03    & \textcolor{red}{0.90}                    & \textcolor{red}{1.11}                    \\
\midrule
3 & \checkmark    & \checkmark     & \checkmark   & \checkmark    &                                     &                                     & {9.70}    & \textcolor{red}{1.91}                    & \textcolor{red}{0.814}                   \\
4 & \checkmark    & \checkmark     & \checkmark   &                                    & \checkmark     &                                     & {10.18}   & {0.016} & \textcolor{red}{0.141}                   \\ 
5 & \checkmark    & \checkmark     & \checkmark   &                                    &                                     & \checkmark     & 16.34                                      & {0.007} & {0.038} \\
\midrule
6 & \checkmark    & \checkmark     &                                   & \checkmark    & \checkmark     & \checkmark     & 20.67                                      & {0.004} & {0.022} \\
7 & \checkmark    & \checkmark     & \checkmark   &                                    & \checkmark     & \checkmark     & 17.50                                      & {0.005} & {0.027} \\
8 & \checkmark    & \checkmark     & \checkmark   & \checkmark    &                                     & \checkmark     & 17.34                                      & {0.013} & {0.044} \\
9 & \checkmark    & \checkmark     & \checkmark   & \checkmark    & \checkmark     &                                     & {9.65}    & {0.015} & 0.083 \\
10 & \checkmark    & \checkmark     & \checkmark   & \checkmark    & \checkmark     & \checkmark     & 15.16                                      & {0.005} & {0.025} \\ 
\bottomrule
\end{tabular}}
\caption{Ablation study results, adding or subtracting components in the loss function. \tcr{Red figures} indicate unacceptably inferior results ($\text{MAE}_\text{F}, \Delta P \gg 0.1$).}
\label{tab:ablation}
\end{table}

We conduct an ablation study to identify which component improves which metric.
We start from a `Base' model, which indicates our Transformer model described in Sec.~\ref{sec:method:model} without using any physics-empowered components.
Tab.~\ref{tab:ablation} compares multiple configurations of our model using a subset of components.
Comparing \#1 and \#2, the \texttt{[CLS]} token turns out to be effective, reducing the energy error.
The remaining rows add each component separately to our base + LJP equation model (\#3--5) and remove each component from the full model (\#6--9), where \#10 is the full model.
We observe the following:
\begin{itemize}[leftmargin=5mm]
    \item \textbf{Mask} improves the energy estimation. Comparing \#7 and \#10, Mask helps the model improve MAE$_\text{E}$ without affecting MAE$_\text{F}$ or $\Delta P$. With Mask alone (\#3), the model achieves a good MAE$_\text{E}$, but its structure is suboptimal, as implied by the inferior MAE$_\text{F}$ and $\Delta P$.
    \item \textbf{Bound} condition is the most important component for understanding the overall structure. Without it (\#9), $\Delta P$ gets significantly worse than the full model (\#10), while MAE$_\text{E}$ gets better (probably in a physically invalid way) by focusing more on the energy, like the baseline models. With Bound alone (\#5), the model achieves reasonable MAE$_\text{F}$ and $\Delta P$, which is not possible with Mask alone (\#3) or Force alone (\#4).
    \item \textbf{Force} affects all metrics slightly at the same time. Without Force (\#8), all metrics get slightly worse compared to the full model (\#10). With Force alone (\#4), however, $\Delta P$ is suboptimal. We conclude that the Bound condition is also needed to get an acceptable $\Delta P$.
\end{itemize}

Appendix~\ref{supp:sec:exp:ablation_appendix} presents an additional ablation study on model size and MAM masking ratio.
%Also, Sec.~\ref{sec:exp:mam_analysis} provides additional qualitative analysis of physics-driven modeling. % on MAM and 


%------------------------------------------------
\section{Conclusion}

In this study, we present a molecular representation learning approach that harnesses physics-driven parameter estimation from classical energy equations and self-supervised learning via masked atomic modeling. This method addresses the challenges posed by data scarcity and facilitates extrapolative prediction beyond the training distribution.

Furthermore, we introduce a set of innovative evaluation schemes to assess the model's ability to generalize the structure of molecular potential energy surfaces beyond stable-state energies in the training set. Specifically, we evaluate the molecular structure optimization, molecular assembly, and chemical reaction pathway prediction capabilities of the model. Our extensive experiments on multiple benchmark datasets demonstrate that this multifaceted evaluation approach is advantageous, in addition to the widely-used evaluation scheme that relies on energy or force estimation accuracy in stable states, to ensure the reliability of the learned potential energy surface.

To conclude, we take a step towards physically reliable molecular representation learning under limited data availability. Maximally utilizing information in both model design and training would shed light on future research.

%------------------------------------------------

\clearpage

\begin{acknowledgements}
This work was supported by National Research Foundation grants (2021H1D3A2A03038607, 2022R1C1C1010627) and Institute of Information \& communications Technology Planning \& Evaluation (IITP) grants (No. 2022-0-00264, 2021-0-02068, 2019-0-00075), and the Technology Innovation Program grant (20015824) funded by the Korea government (MSIT \& MOTIE).
\end{acknowledgements}

% References
\bibliography{yi_95}

\end{document}
