\section{Introduction}
In recent years, machine learning (ML) algorithms have made several breakthroughs in issuing accurate predictions. There is, however, a growing need to improve the trustworthiness of these models. Providing accurate predictions is not enough in high-stakes applications like healthcare, where an agent (e.g., a clinician) needs to interact with the model. In these applications, the agent usually needs to understand how a particular prediction is issued. Explaining the model is especially vital when its prediction (say, a treatment plan) differs from what the clinician has in mind. Complicated ML models like neural networks are essentially black boxes to humans, which is why interpretability methods are important and have gained significant attention in recent years~\cite{ribeiro2016lime,lundberg2017shap,guidotti2018lore,arnaldo2014multiple,zhang2018interpreting,alvarez2018towards,arrieta2020explainable,lou2013accurate,doshi2017towards}.

There exist two key approaches to bringing interpretability to machine learning models: (1) designing inherently interpretable models \cite{Rudin_naturemi_2019,Chen_neurips_2019,Melis_neurips_2018}; or (2) designing post-hoc methods to understand a pre-trained model \cite{Ribeiro_icmlws_2016,Lipton_icmlws_2016}. In this work, we focus on the second approach, which includes methods to analyze a trained model locally and globally \cite{Montavon_dsp_2018}. Local interpretability methods focus on instance-wise explanations, which, although useful, provide little understanding of a model's global behaviour \cite{ribeiro2016lime,lundberg2017shap}. Hence, researchers have proposed multiple techniques to interpret how an ML model behaves for a group of instances. Some examples of global analysis methods include permutation feature importance \cite{Molnar_book_2019}, activation-maximization \cite{Erhan_tr_2009}, and learning global surrogate models \cite{ifthen1995extracting,DT1996extracting}.

Our work relates to the last approach, which aims to learn interpretable proxies by approximating the behaviour of black-box ML models over multiple instances. Some efforts in this category include methods to approximate neural networks with if-then rules~\cite{ifthen1995extracting} or decision trees~\cite{DT1996extracting}, and a method to approximate matrix factorisation models using Bayesian networks and simple logic rules~\cite{Sanchez_aaai_2015}. Our work is most closely related to a different category of approaches for learning interpretable proxies, which focuses on approximating black-box functions with symbolic metamodels.
A proper interpretable metamodel can enjoy the benefits of different categories of interpretability methods. For example, a metamodel may provide insight into the interactions of different features and how they contribute to producing results. The metamodel can be locally approximated (e.g., using a Taylor series) to generate instance-wise explanations. Moreover, it may be used for scientific discovery by revealing underlying laws governing the observed data \cite{schmidt2009distilling,wang2019symbolic,udrescu2020symbolic}.

Symbolic regression (SR) \cite{koza1994genetic} has been the primary approach for finding approximate metamodels. In SR, there is a fixed set of mathematical building blocks (e.g., the summation operation), and a genetic programming (GP) algorithm searches over the expressions that can be composed by combining these building blocks.
We will explain SR \rev{in more detail} in Section~\ref{sec:prelims} and compare it with our \rev{proposed} method in Section~\ref{sec:related}. The major limitation of SR is that it uses a limited set of predefined building blocks, and the search space grows as the number of building blocks increases. Two recent papers, which are the most relevant to our work \cite{alaa2019demystifying,crabbe2020learning}, address this issue by suggesting the use of a parametric, trainable class of functions instead of fixed building blocks. In particular, they suggest using Meijer G-functions (we briefly introduce this class in Section~\ref{sec:prelims}).
Note that these are univariate functions. In order to use them in multivariate settings, \cite{alaa2019demystifying} considers a heuristic approximation of the Kolmogorov superposition theorem (KST), and \cite{crabbe2020learning} considers the projection pursuit method (in Section~\ref{sec:related}, we show that the latter can also be considered an approximation of KST).
\rev{Both of these works start from a general framework; however, they make some restrictive assumptions that limit the usability and coverage of their methods.} For example, the simple function $x_1x_2$ (here the $x_i$'s are features) cannot be represented with the method given in \cite{crabbe2020learning}. Similarly, the method in \cite{alaa2019demystifying} fails to represent the product of three features $x_1x_2x_3$. Another limitation of these approaches is that, although most familiar functions are indeed special cases of Meijer G-functions, for almost all parameter values Meijer G-functions do not have a familiar closed-form representation. Therefore, in practice, training is very unlikely to yield a set of parameters that is ``interpretable''.

\textbf{High-level idea and contribution:} In this work, we address the above challenges by proposing a new methodology to learn symbolic metamodels. Our approach is a generalization of \cite{alaa2019demystifying} and \cite{crabbe2020learning}, as we consider a more general approximation of KST (see Section~\ref{sec:related}). We represent the KST expression using trees where edges represent simple parameterized functions (e.g., the exponential function). We use gradient descent to train the parameters of these functions and employ GP to search for the tree that most accurately approximates the black-box function. We demonstrate the efficacy of our proposed method through several experiments. The results suggest that our approach for estimating symbolic metamodels is more generic, accurate, and efficient than other symbolic metamodeling methods. Although in this work we use our proposed method to provide interpretations, it can be considered more generally as a new GP method. Our method should be classified as a memetic algorithm, where a population-based method is paired with a refinement method (in our case, gradient descent) \cite{chen2011multi}. To the best of our knowledge, this is the first method that uses gradient descent not only for training numerical constants but also for training the building blocks, i.e., the primitive functions.

\section{Preliminaries}
\label{sec:prelims}
In this section, we present a brief overview of the two building blocks of our proposed method: genetic programming and classes of trainable functions.\\
\textbf{Genetic Programming and symbolic regression:}
Genetic programming (GP), proposed by Koza in 1994 \cite{koza1994genetic}, is an optimization method inspired by the law of natural selection. It starts with a population of random programs for a particular task and then evolves the population in each iteration with operations inspired by natural genetic processes. The idea is that, after enough iterations, the population evolves and a fit program can be found in later generations. The two typical evolution operations are crossover and mutation. In crossover, we choose the fittest programs (the fitness criterion is predefined for the task at hand) as parents for the next generation and swap random parts of the selected pairs. In mutation, a random part of a program is substituted by some other randomly generated program fragment. One use of GP is for optimization in Symbolic Regression (SR), where the goal is to find a suitable mathematical expression to describe some observed data. In this setting, each program consists of primitive building blocks such as analytic functions, constants, and mathematical operations. The program is usually represented by a tree, where each node represents one of the building blocks. We refer to \cite{orzechowski2018we,wang2019symbolic} for more details on SR. As a population-based optimization method, GP can be paired with other refinement methods. For example, here we use both GP and gradient descent (GD) in our model. Methods of this type are called memetic algorithms. In particular, our method should be classified as a \textit{Lamarckian} memetic algorithm, where Lamarckian refers to the method of inheritance in the GP search. We refer to \cite{emigdio2014evaluating,chen2011multi} for more details on the taxonomy of GP methods.


\textbf{Class of trainable functions:}
\rev{In contrast with SR which uses fixed building blocks, our proposed approach (similar to \cite{alaa2019demystifying} and \cite{crabbe2020learning}) uses a class of trainable parameterized functions as building blocks.}
One such class is the Meijer G-functions~\cite{meijer1946,meijer1936uber}, which have been used in two recent approaches to learn symbolic metamodels~\cite{alaa2019demystifying,crabbe2020learning}. A Meijer G-function $G^{m,n}_{p, q}$ is defined as an integral along a path $\mathcal{L}$ in the complex plane:
\begin{align*}
    &G^{m,n}_{p, q}\left(^{a_1,\ldots,a_p}_{b_1,\ldots,b_q}\,\Big|\, x\right) =\\ &\frac{1}{2\pi i} \int_{\mathcal{L}} \frac{\prod^m_{j=1} \Gamma(b_j - s)\prod^n_{j=1} \Gamma(1 - a_j + s)}{\prod^q_{j=m+1} \Gamma(1 - b_j + s)\prod^p_{j=n+1} \Gamma(a_j - s)}\, x^s\, ds,
\end{align*}  
where $m,n,p,q$ are integers with $0\leq m\leq q$ and $0\leq n\leq p$, and $a_i,b_j\in \mathbb{R}$ for $1\leq i\leq p$ and $1\leq j\leq q$. $\mathcal{L}$ is a path which separates the poles of $\Gamma(b_j - s)$ from the poles of $\Gamma(1 - a_j + s)$. 
By fixing $m,n,p,q$ we obtain a class of parameterized functions (the $a_i$'s and $b_j$'s are the parameters), which can be trained using gradient descent. We refer to \cite{beals2013meijer} for a more detailed definition of these functions. Meijer G-functions are a rich set of functions that includes, as special cases, most of the familiar functions that we think of as interpretable. For example, \begin{equation} \resizebox{.99\hsize}{!}{$G^{0,1}_{3, 1}(^{2,2,2}_{\, \,\,\,\,1}\,|\,x) = x, \hspace{2mm} G^{1,0}_{0, 1}(\,^{-}_{\,0}\,|\,x) = e^{-x}, \hspace{2mm}  G^{1,2}_{2, 2}(^{1,1}_{1,0}\,|\,x) = \log(1+x).$} \nonumber\end{equation}
However, when trained using gradient descent (GD), the final parameters of Meijer G-functions will almost never yield an interpretable closed form. This limits insight into the functional form of the black-box model.
Hence, in this work, we propose using classes of simple, interpretable, parameterized functions that can be efficiently optimized using GD. The class of functions can be chosen by a domain expert for each particular task. We discuss the selection of primitive functions further in Appendix C. Specifically, here we demonstrate our approach using the following five parameterized functions. In Appendix C, we show that our results do not change significantly when other sets of primitive functions are used.
\begin{align*}
&f_1(a,b,c,d|x)=ax^3+bx^2+cx+d,\, f_2(a,b|x)= ae^{-bx},\\
&f_3(a,b,c|x)=a \sin(bx+c), \, f_4(a,b,c|x)=a\log(bx+c), \\ 
&f_5(a,b,c,d|x)= ax/(bx^2+cx+d).   
\end{align*}



\textbf{Remark.} It is worth emphasizing that our proposed framework is generic and can accommodate any trainable class of functions, including Meijer G-functions.


\section{Method}
Assume that a black box function $f:\mathcal{X} \to \mathbb{R}$ is trained on a dataset. Our goal is to find an interpretable function $g$ which approximates $f$. To this end, we restrict $g$ to belong to the class of functions $\mathcal{G}$ which are deemed to be interpretable. Therefore, we want to find the solution to the following optimization problem:
\begin{equation} \label{eq:opt} \argmin_{g\in \mathcal{G}} \ell (f,g),\end{equation}
where $\ell$ is our loss function of choice. In this work, we take $\ell$ to be the squared error loss:
\begin{equation}\ell (f,g)= \int_{\mathcal{X}} (g(x)-f(x))^2 dx. \label{eq:loss}\end{equation}
In order to approximate a multivariate function $f$, we employ the Kolmogorov superposition theorem (KST) \cite{kolmogorov1957}, which states that any multivariate continuous function (with $d$ variables) has a representation in terms of univariate functions as follows:
\begin{equation} g(\bm{x})=g(x_1,\cdots,x_d)=\sum_{i=1}^{2d+1}g^{out}_i\left(\sum_{j=1}^d g^{in}_{ij} (x_j)\right). \label{eq:Kol} \end{equation}
In our setting, each of $g_{ij}^{in}$ and $g_i^{out}$ can be a function from $\mathcal{G}$.
However, fully implementing this equation (especially with computationally expensive Meijer G-functions) is impractical even for moderate values of $d$. Therefore, an approximation is proposed in \cite{alaa2019demystifying} that uses a single outer function, set to be the identity, and adds the products of all pairs of attributes to capture their interactions (we discuss this method in more detail in Section~\ref{sec:related}). In this work, we propose another method for approximating Equation \eqref{eq:Kol}. 
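
For a sense of scale, the following count is a small worked illustration of the full representation in Equation \eqref{eq:Kol}. The value $d=7$ is chosen here only because it matches the number of features drawn in Figure~\ref{fig:structure}; it is not taken from the experiments. The representation requires $2d+1$ outer functions and $d(2d+1)$ inner functions, so
\begin{equation*} \underbrace{(2d+1)}_{\text{outer}}+\underbrace{d(2d+1)}_{\text{inner}}=(d+1)(2d+1), \qquad d=7:\; 15+105=120 \end{equation*}
univariate functions would have to be selected and trained.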

\subsection{Approximating KST}
In our method, we approximate KST using trees with $L<2d+1$ middle nodes, each of which is connected to only a subset of the inputs. We denote the middle nodes by $h_i$, for $1\leq i \leq L$. \rev{Our approximation} can be represented \rev{via a} three-layered tree (see Figure \ref{fig:structure}). There is a single root node at the top of the tree, which is connected to the $L$ middle nodes. Each middle node is connected to a subset of the bottom-layer nodes. The bottom layer of the tree has $d$ nodes corresponding to the $d$ features. For simplicity, when there is no risk of confusion, we refer to the node corresponding to the $i$th feature as $x_i$. 

Note that each edge in the graph represents a univariate function. We denote the function corresponding to the edge between $h_i$ and the root with $g_{h_i}$ \rev{(these are the outer functions)}, and the function corresponding to an edge between $h_i$ and $x_j$ is denoted by $g_{ij}$ \rev{(inner functions)}. The argument of $g_{ij}$ is naturally the feature it is connected to, namely $x_j$, and the argument of $g_{h_i}$ is the summation of all incoming functions to $h_i$. That is, 
$\sum_{j\in \mathcal N(h_i)} g_{ij} (x_j),$
where $\mathcal{N}(h_i)$ denotes the neighbours of node $h_i$ \rev{in the graph}.
Finally, at the root node we sum the outputs of all $L$ middle-layer functions. Therefore, each tree represents a function from $\mathcal{X}$ to $\mathbb{R}$, which can be expressed as follows:
\begin{equation} g(\bm{x})= \sum_{i=1}^L g_{h_i}\left(\sum_{j\in \mathcal N(h_i)} g_{ij} (x_j) \right).\label{eq:main}\end{equation}
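
As an illustration, and assuming the edge set drawn in Figure~\ref{fig:structure} ($L=3$, $d=7$), Equation \eqref{eq:main} can be written out explicitly. The connectivity below is read off the sample figure only and is not a learned model:
\begin{align*}
g(\bm{x}) = {}& g_{h_1}\big(g_{11}(x_1)+g_{12}(x_2)+g_{14}(x_4)\big)\\
&+ g_{h_2}\big(g_{21}(x_1)+g_{25}(x_5)+g_{27}(x_7)\big)\\
&+ g_{h_3}\big(g_{33}(x_3)+g_{35}(x_5)+g_{36}(x_6)+g_{37}(x_7)\big).
\end{align*}
This sample tree has 10 inner and 3 outer edges, i.e., 13 univariate functions in total.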
\subsection{Using GP for Training of Metamodels}
Now we want to solve the optimization problem in \eqref{eq:opt}, where $\mathcal{G}$ is the set of all functions that can be represented in the form of Equation \eqref{eq:main}, with all $g_{ij}$ and $g_{h_i}$ drawn from the class of \rev{primitive parameterized functions.} We propose solving this optimization problem by running a version of the genetic programming algorithm. The tree representation of Equation \eqref{eq:main} resembles the trees that represent programs in symbolic regression. Note that, unlike in standard GP, our trees have a fixed structure of three layers, and their edges represent functions. Hence, we need to modify GP accordingly. In this section, we explain the details of our algorithm. 


\begin{figure}[t]
\centering
\begin{tikzpicture}[scale=0.5,shorten >=1pt]
  \tikzstyle{vertex}=[circle,fill=black!25,minimum size=20pt,inner sep=2pt]
  \node[vertex] (G_1) at (0,0) {$x_1$};
  \node[vertex] (G_2) at (2,0)   {$x_2$};
  \node[vertex] (G_3) at (4,0)  {$x_3$};
  \node[vertex] (G_4) at (6,0)   {$x_4$};
  \node[vertex] (G_5) at (8,0)  {$x_5$};
  \node[vertex] (G_6) at (10,0) {$x_6$};
  \node[vertex] (G_7) at (12,0)  {$x_7$};
  \node[vertex] (G_8) at (3,2) {$h_1$};
  \node[vertex] (G_9) at (6,2)  {$h_2$};
  \node[vertex] (G_10) at (9,2) {$h_3$};
  \node[vertex] (G_11) at (6,4)  {$r$};
  \foreach \from/\to in {G_1/G_8,G_1/G_9,G_2/G_8,G_3/G_10,G_4/G_8,G_5/G_10,G_5/G_9,G_6/G_10,G_7/G_9,G_7/G_10,G_8/G_11,G_9/G_11,G_10/G_11}
  \draw (\from) -- (\to);
\end{tikzpicture}
\caption{A sample tree structure. Each edge represents a univariate function.} \label{fig:structure}
\vspace{-10pt}
\end{figure}
\subsubsection{Producing random trees}
In the first step, we produce $M$ random trees $T_1, \cdots,T_M$. 
Each tree $T_i$ has $L_i$ middle nodes, where $L_i$ is an integer in $[l_1,l_2]$. The bounds $l_1$ and $l_2$ are important hyperparameters that determine the number of middle nodes.
For each of the $L_i$ middle nodes, a random subset of the bottom-layer nodes is chosen to be connected to it. First, for all $1\leq u\leq L_i$ and $1\leq v \leq d$, we connect $h_u$ and $x_v$ with probability $p_0$, where $0<p_0\leq 1$. Then, if there exists an $x_v$ which is not connected to any of the middle nodes, we choose $1\leq u\leq L_i$ uniformly at random and connect $x_v$ and $h_u$, which ensures that every feature is connected to at least one of the middle nodes. The parameter $p_0$ controls the sparsity of the produced graphs, which is one of the main factors that determine the complexity of the training procedure. Each edge represents a function from our class of primitive functions; thus, for each edge we choose one of the function classes uniformly at random and initialize its parameters with samples from a normal distribution.
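
The effect of $p_0$ on sparsity can be quantified with a short calculation. This is a sketch that assumes the edges are drawn independently, as described above; the values $d=7$, $L_i=3$, and $p_0=0.5$ are hypothetical and chosen only for illustration. A feature is left unconnected after the first pass with probability $(1-p_0)^{L_i}$, and each such feature receives exactly one edge in the repair step, so the expected number of inner edges is
\begin{align*}
\mathbb{E}[\text{number of inner edges}] &= L_i\, d\, p_0 + d\,(1-p_0)^{L_i}\\
&= 3\cdot 7\cdot 0.5 + 7\cdot 0.5^{3} = 10.5 + 0.875 = 11.375.
\end{align*}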



\subsubsection{Training phases}
In the training phase, for each tree, we update the parameters of each edge using gradient descent. We choose a constant $k$ and apply $k$ gradient descent updates to the parameters of the functions $g_{h_i}$ and $g_{ij}$. Let $g'_{h_i}(x) = \frac{d g_{h_i} (x)}{d x}$. For a parameter $a$ of $g_{ij}$ and a parameter $b$ of $g_{h_i}$, the gradients of $g$ with respect to $a$ and $b$ can be computed as follows \rev{(recall that $g$ represents the metamodel)}:
\begin{align}
&\frac{\partial g(\bm{x})}{\partial a}=\frac{\partial g_{ij}(x_j)}{\partial a} \cdot g'_{h_i}\left(\sum_{l\in \mathcal{N}(h_i)} g_{il}(x_l)\right),\\
&\frac{\partial g(\bm{x})}{\partial b}=\frac{\partial g_{h_i}}{\partial b}\left(\sum_{j\in \mathcal{N}(h_i)} g_{ij}(x_j)\right).
\end{align}  
In this work, we choose a fixed learning rate and leave the exploration of more advanced optimization techniques for future work \rev{(this is consistent with \cite{alaa2019demystifying} and \cite{crabbe2020learning}, and allows for a fair comparison with these works)}.
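
As a worked instance of the two gradient formulas above, consider a hypothetical tree with a single middle node whose outer function is $f_2$ and whose inner edge to $x_1$ is $f_3$. The Greek parameter names, the step size $\eta$, and the sample points $\bm{x}^{(n)}$ are notation introduced here for illustration only. Write $u=\sum_{l\in\mathcal{N}(h_1)} g_{1l}(x_l)$, $g_{h_1}(u)=a e^{-bu}$, and $g_{11}(x_1)=\alpha\sin(\beta x_1+\gamma)$. Then
\begin{align*}
\frac{\partial g(\bm{x})}{\partial \beta} &= \alpha x_1\cos(\beta x_1+\gamma)\cdot\left(-ab\,e^{-bu}\right),\\
\frac{\partial g(\bm{x})}{\partial b} &= -a\,u\,e^{-bu}.
\end{align*}
Assuming the loss is estimated by the empirical mean squared error over $m$ sample points, one fixed-step update of any parameter $\theta$ would read
\begin{equation*} \theta \leftarrow \theta - \eta\,\frac{2}{m}\sum_{n=1}^{m}\big(g(\bm{x}^{(n)})-f(\bm{x}^{(n)})\big)\frac{\partial g(\bm{x}^{(n)})}{\partial \theta}. \end{equation*}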

\subsubsection{Evaluating fitness of metamodels} 
To evaluate the fitness of the trained \rev{metamodels}, we sample $m$ points uniformly at random from $\mathcal{X}$, query the outputs of the black-box $f$ and the metamodels $g_1, \cdots, g_M$ on these $m$ points, and compute the mean squared error of each metamodel as an approximation of \eqref{eq:loss} (the output of $f$ is considered the ground truth). If any of the $M$ models has a loss below a predefined threshold, we terminate the algorithm. Otherwise, we choose the $s$ fittest metamodels and discard the rest. These $s$ surviving metamodels are the parents that will populate the next generation of trees in the next round of the algorithm. 

\textbf{Regularization:} We can modify the fitness criterion to favor simpler models. To encourage sparsity of the tree, we can add a term to the MSE that penalizes trees with more edges. Denoting the total number of edges by $E$, we use the following criterion to evaluate the fitness of the trees ($\lambda$ is a hyperparameter):
\begin{equation} \text{Fitness of a given tree} = \text{MSE} + \lambda E.\end{equation}
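For concreteness, assuming that $E$ counts both inner and outer edges and that the MSE is the empirical estimate over the $m$ sampled points $\bm{x}^{(n)}$ (notation introduced here), the criterion can be written as
\begin{equation*} \frac{1}{m}\sum_{n=1}^{m}\big(g(\bm{x}^{(n)})-f(\bm{x}^{(n)})\big)^2+\lambda E, \end{equation*}
where lower values indicate fitter trees. For the sample tree in Figure~\ref{fig:structure}, $E=13$. With the hypothetical values $\text{MSE}=0.010$ and $\lambda=10^{-3}$ (not the settings used in the experiments), the criterion evaluates to $0.010+0.013=0.023$, so the sparsity penalty would be of the same order as the approximation error.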
\subsubsection{Evolution phase}
In the evolution phase, we create the next generation of metamodels from the surviving trees. Similar to the conventional GP algorithm, we define two operations to perform on each tree: crossover and mutation. For each of the $s$ chosen trees $T$, we first pass $T$ on to the next generation. Then, $\frac{M}{s}-1$ times, we randomly choose one of the two operations, perform it on $T$, and add the resulting tree to the next generation. Thus, the total number of trees in the next cohort is also $M$. We now define the two operations, both of which preserve the three-layer structure of the trees:
\begin{itemize}
\item In the crossover operation, \rev{for $T$,} we first randomly choose one of the nodes in the second layer of $T$. We then choose one of the other $s-1$ trees uniformly at random, choose one of its second-layer nodes uniformly at random, and replace the chosen node in $T$, along with all edges connected to it, with this node and its edges. Notice that the edge connected to the root node is also replaced. \rev{Moreover,} note that the new tree inherits the functions corresponding to the replaced edges, together with their parameters. 
\item
In the mutation operation, one of the following two actions is applied to the tree: 1) changing the function class of an edge, or 2) removing an edge between the middle and input layers.
In each round of mutation, we apply one of these two actions to the tree $n_m$ times.
When we change the function class of an edge, we also randomly reinitialize the parameters of the corresponding function. 
\end{itemize}
The above two operations allow us to explore different configurations of trees and function classes. Pseudocode for the algorithm and a flowchart are presented in Appendix A. We call our proposed method symbolic metamodeling using primitive functions (SMPF).
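
The population bookkeeping of the evolution phase can be checked with a short calculation, under the assumption (implicit in the description above) that $s$ divides $M$. Each surviving tree contributes itself plus $\frac{M}{s}-1$ offspring, so the size of the next cohort is
\begin{equation*} s\left(1+\left(\frac{M}{s}-1\right)\right)=M. \end{equation*}
For example, with the hypothetical values $M=100$ and $s=10$, each survivor is copied once and produces $9$ offspring by crossover or mutation, giving $10\cdot 10=100$ trees.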



\subsection{Different Types of Interpretation Using SMPF}\label{sec:taylor}
\textbf{Instance-wise feature importance:} Similar to \cite{alaa2019demystifying} and \cite{crabbe2020learning}, we can use the learned metamodel to estimate instance-wise feature importance. We can compute the Taylor expansion of the metamodel around the data point of interest $\bm{x}_0$ and analyse its coefficients:
\begin{equation} \resizebox{.98\hsize}{!}{$g(\bm{x})=g(\bm{x}_0)+\nabla g(\bm{x}_0)\cdot(\bm{x}-\bm{x}_0)+\frac{1}{2}(\bm{x}-\bm{x}_0)^{\top}H_g(\bm{x}_0)(\bm{x}-\bm{x}_0)+ \cdots,$}\end{equation}
where $H_g$ denotes the Hessian of $g$. The first-order partial derivative with respect to the $j$th feature can be computed using the chain rule:
\begin{equation} \frac{\partial g(\bm{x})}{\partial x_j}=\sum_{h_i\in \mathcal{N}(x_j)}g'_{h_i}\left(\sum_{k\in \mathcal{N}(h_i)} g_{ik}(x_k)\right)g'_{ij}(x_j). \end{equation}
We will use this method in the instance-wise experiment. Importantly, we can also compute higher order coefficients for analyzing feature interactions.

\textbf{Mathematical expressions:} The final expression of the metamodel can provide insights into the functional form of the black-box function. For example, in the first experiment, we show that the metamodel correctly identifies that the black-box is an exponential function. Moreover, inspecting the mathematical expression provides information about the interactions between the input features, and
can potentially help domain experts uncover previously unknown facts about the underlying mechanisms.
An idea to explore in future work is inspecting the final cohort of graphs. For example, if in the last iteration the average degree of a node is large across different graphs, this can indicate the importance of the corresponding feature. Similarly, when a subset of features is connected to the same middle node, this can indicate an interaction among those features.
\section{Comparison with Related Works}\label{sec:related}
In the experiments section,
we compare our approach with three symbolic metamodeling methods. This section briefly introduces these approaches, highlighting their strengths and weaknesses. A table comparing our method with a wider range of methods is provided in the supplementary material.

\textbf{Symbolic Metamodeling (SM) \cite{alaa2019demystifying}:}
SM proposes using Meijer G-functions for interpreting black-box models. The derivation of SM also starts from KST \eqref{eq:Kol}, but with a different approximation: only one outer function ($g^{out}$) is considered, and it is set to be the identity (the inner functions are all Meijer G-functions). This does not allow the features to interact. To fix this problem, the products $x_ix_j$ of all pairs of features are added as additional features. This setting has two main issues. First, the method cannot capture interactions among more than two features, and it cannot express forms of interaction other than multiplication. Second, the approach introduces many new features, which makes it impractical as $d$ increases. There are ${d \choose 2}+d$ features in total, and there is a Meijer G-function corresponding to each of them, which makes SM computationally costly.
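
To illustrate how quickly this grows, the number of Meijer G-functions required by SM is
\begin{equation*} {d \choose 2}+d=\frac{d(d+1)}{2}, \end{equation*}
which gives $28$ for $d=7$ and $210$ for $d=20$ (both values of $d$ are chosen here only as examples).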

\textbf{Symbolic Pursuit (SP) \cite{crabbe2020learning}:}
SP is a subsequent work to SM and is designed to overcome some of its flaws. In particular, SP is designed to use fewer Meijer G-functions. The method is based on the Projection Pursuit algorithm in statistics \cite{friedman1981projection}. In each step of the algorithm, a Meijer G-function will be fitted which minimizes the residual error between the metamodel and the black-box. The final metamodel will be the summation of all these Meijer G-functions. The input of each function is a linear combination of features. Thus, the final function will have the following formulation:
\begin{equation} g(\bm{x})=\sum_{i=1}^L g_i\left(\sum_{j=1}^d c_{ij}x_j\right), \label{eq:SP}\end{equation}
where $g_i$'s are Meijer G-functions. Importantly, the authors use a modified version of \eqref{eq:SP} where the arguments of Meijer G-functions are normalized such that they lie in the open interval of 0 to 1. Moreover, SP involves adding weights to the outer summation to allow mitigating the contributions of previously found functions, if needed.

Note that SP can be considered one instance of our framework. Equation \eqref{eq:SP} is compatible with KST \eqref{eq:Kol} and can be represented by a tree similar to Figure~\ref{fig:structure}. In essence, all inner functions (edges between the bottom and middle layers) are restricted to be linear, i.e., they come from the class $f(x)= cx$. There are $L$ middle nodes, and the outer functions are drawn from the class of Meijer G-functions. Also, in this setup $p_0=1$ (recall that $p_0$ is the probability of connecting two nodes). 
\rev{A major problem with SP is its limited capability to represent non-linear interactions between the features.} For example, a simple function like $x_1x_2$ cannot be represented in the SP formulation. Therefore, when using SP to explain this function, in the best-case scenario, inspecting $c_{i1}$ and $c_{i2}$ reveals that these two features are important, but not how they interact. This can potentially be resolved in our framework by using a more general class of functions as inner functions.
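
One way such a product could be expressed in the form of Equation \eqref{eq:main} is sketched below. The sketch assumes strictly positive features and uses the primitives $f_2$ and $f_4$ from Section~\ref{sec:prelims}; it shows that a representation exists, not that the search is guaranteed to find it. With a single middle node ($L=1$), inner functions $g_{11}=g_{12}=f_4(1,1,0\,|\,\cdot)=\log(\cdot)$, and outer function $g_{h_1}=f_2(1,-1\,|\,\cdot)=\exp(\cdot)$,
\begin{equation*} g(\bm{x})=\exp\big(\log x_1+\log x_2\big)=x_1x_2, \qquad x_1,x_2>0. \end{equation*}
The same construction with three inner edges yields $x_1x_2x_3$.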
\begin{table*}[t!]
\centering
\resizebox{0.73\textwidth}{!}{%
\begin{tabular}{cccccc} 
\toprule
& &$f(\bm{x}) = e^{-3x_0+x_1}$ & $f(\bm{x}) = \sin (x_0x_1)$ & $f(\bm{x}) = \frac{x_0x_1}{(x_0^2+x_1)}$ &  $f(\bm{x}) = \text{sinc}(x_0^2+x_1)$ \\
\midrule   
SMPF & $\begin{array}{c} \mbox{MSE}\\ R^2\end{array}$ & $\begin{array}{c}  \textbf{0.001}\pm \textbf{0.0002}\\ \textbf{0.996} \pm \textbf{0.002}\end{array}$ & $\begin{array}{c} 0.012\pm 0.002\\0.962 \pm 0.004\end{array} $&$\begin{array}{c} \textbf{0.002}\pm \textbf{0.0004}\\ \textbf{0.895} \pm \textbf{0.013}\end{array}$ & $\begin{array}{c} \textbf{0.004} \pm \textbf{0.0004} \\  \textbf{0.952}\pm \textbf{0.003} \end{array}$ \\\midrule
SM & $\begin{array}{c} \mbox{MSE}\\ R^2\end{array}$  & $\begin{array}{c} 0.174\pm 0.031\\0.273\pm 0.019\end{array}$   & $\begin{array}{c} 0.126\pm 0.009\\ -2.039 \pm 0.442\end{array}$  & $\begin{array}{c} 0.108\pm 0.0104\\-5.461\pm 0.746\end{array}$ & $\begin{array}{c} 0.193 \pm 0.006 \\  -0.263\pm 0.094\end{array}$\\\midrule 
SP & $\begin{array}{c} \mbox{MSE}\\ R^2\end{array}$   & $\begin{array}{c} 0.009\pm 0.004\\ 0.958\pm 0.014\end{array}$   & $\begin{array}{c} 0.0008\pm 0.0001\\ 0.978\pm 0.003\end{array}$  & $\begin{array}{c} 0.002\pm 0.0003\\0.878\pm 0.021\end{array} $ & $\begin{array}{c} 0.009 \pm 0.002 \\ 0.937\pm 0.015 \end{array}$ \\\midrule
SP$^p$ & $\begin{array}{c} \mbox{MSE}\\ R^2\end{array} $ & $\begin{array}{c} 0.009\pm 0.001\\ 0.953\pm 0.014\end{array}$   & $\begin{array}{c} 0.024 \pm 0.001 \\ 0.348 \pm 0.082 \\ \end{array}$  & $\begin{array}{c} 0.011\pm 0.001 \\ 0.345 \pm 0.807  \end{array} $ &  $\begin{array}{c} 0.010\pm 0.001 \\ 0.932 \pm 0.013 \end{array} $ \\\midrule
SR & $\begin{array}{c} \mbox{MSE}\\ R^2\end{array} $  & $\begin{array}{c} 0.078\pm 0.018\\0.658\pm 0.032\end{array}$   & $\begin{array}{c} \textbf{0.0004}\pm \textbf{0.0002}\\ \textbf{0.988}\pm \textbf{0.003}\end{array}$  & $\begin{array}{c} 0.012\pm 0.002\\ 0.256 \pm 0.144\end{array}$ & $\begin{array}{c} 0.016 \pm  0.003 \\ 0.886\pm 0.034 \end{array}$\\
\bottomrule 
\end{tabular}}
\caption{Approximating two-variable functions using SM, SP, SR and SMPF. \vspace{1mm}}
\label{exp1-table}
\vspace{-4pt}
\end{table*}

\textbf{Symbolic Regression:}
We briefly introduced SR in Section~\ref{sec:prelims}. SR searches over mathematical expressions that can be produced by combining a set of predetermined functions.
In each program, the leaf nodes are either features or numerical values, and the other nodes are mathematical operations. One main difference between SR and our method (as well as SM and SP) is that, unlike SR, these methods are based on a representation derived from KST. Furthermore, we use parametric functions \rev{(and GD)}, which cannot be accommodated in the SR setting \rev{(note that GD has been suggested in SR, but only for training leaves; see, e.g., \cite{topchy2001faster,kommenda2018local}).}
Importantly, SR has an advantage over SP and SM in that the final expression is guaranteed to be explainable, as it is a combination of functions that we chose to include as building blocks. In contrast, when Meijer G-functions are used (in SM and SP), the resulting metamodel may not have a simple and explainable representation. This issue is resolved in our framework. There are several extensions of the original SR method, including methods that leverage deep learning techniques to explore the search space. These methods can be considered in future work to improve the GP component of our method as well \cite{arnaldo2014multiple,rad2018gp,wang2019symbolic,orzechowski2018we,chen2015,udrescu2020ai,petersen2019deep,mundhenk2021symbolic}.
\section{Experiments}
\label{sec:exps}
We evaluate and compare our proposed method using three experiments. In the first experiment, we use our method to approximate four functions with simple expressions (similar to the first experiment of \cite{alaa2019demystifying}). In the second experiment, we use our method to estimate instance-wise feature importance for three synthetic datasets (similar to \cite{alaa2019demystifying} and \cite{chen2018learning}). Finally, in the third experiment, we consider black-boxes trained on real data and approximate them using metamodels (similar to \cite{crabbe2020learning}). 
Some additional results and the hyperparameters are reported in Appendix E.
\begin{figure*}[t!]
    \centering
    \includegraphics[width=0.94\textwidth]{exp2_fig.pdf}
\caption{Box-plot of feature importance for three datasets. The red lines show the median ranks under each algorithm. Lower median ranks imply better performance. DL refers to DeepLIFT.}
\label{fig:exp2}    
\end{figure*}
\subsection{Metamodels for Fixed Functions}
In this experiment,
\rev{we find metamodels for four synthetic functions with two variables.} We compare the performance of our method (SMPF) with symbolic metamodeling (SM), symbolic pursuit (SP), the polynomial approximation of SP (SP$^p$), and symbolic regression (SR) \cite{orzechowski2018we} (similar to \cite{alaa2019demystifying}, we use the gplearn library \cite{stephens2015gplearn} to implement SR). We compare the methods in terms of mean squared error (MSE) and $R^2$ score.
Generally, our algorithm achieves better accuracy than the other methods (it has the best score for three of the four functions). The results are reported in Table~\ref{exp1-table}.
Furthermore, SMPF was able to correctly identify the functional form. For the first function, the final expression of the metamodel is as follows (coefficients are rounded here):
\begin{align*}
g(\bm{x}) = 0.854\exp\Big(&-2.438\sin(1.371x_0 - 0.0318) +\\ &\frac{0.684x_1}{0.016x_1^2+0.204x_1+0.426}\Big).
\end{align*}
This shows an important advantage of our method over the other methods.
For example, the expression found by the SP algorithm has the following form (here $P1$ is a linear combination of the two inputs):
\begin{equation*} g(\bm{x}) = 0.98\, G^{2,1}_{2, 3}\left(^{\,\,\,\, 0.24,-0.06}_{0.16,-0.47, 0.43}\,|\,1.0\, [\mathrm{ReLU}(P1)] \right).\end{equation*}
Note that it was not possible to find a closed-form expression for this function.
Also, for the second function, $\sin$ is correctly chosen as the outer function by SMPF. Appendix E provides results for synthetic functions with more variables.
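
To make the structure of this closed form explicit, the SMPF expression reported above for the first function can be read as one outer function applied to a sum of univariate inner functions. The names $g^{\mathrm{out}}$, $h_0$, and $h_1$ are notation introduced here only for illustration, and the coefficients are the rounded values reported above:
\begin{align*}
g(\bm{x}) &= g^{\mathrm{out}}\big(h_0(x_0) + h_1(x_1)\big), & g^{\mathrm{out}}(z) &= 0.854\exp(z),\\
h_0(x_0) &= -2.438\sin(1.371x_0 - 0.0318), & h_1(x_1) &= \frac{0.684x_1}{0.016x_1^2+0.204x_1+0.426}.
\end{align*}
Each component is an elementary function of a single argument, so its effect can be inspected separately. A comparable reading is not directly available for the SP expression, since no closed form was found for its Meijer G-function.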

\begin{table*}[t!]
\centering
{\scriptsize
\begin{tabular}{cccccc} 
\toprule
& Method &\multicolumn{2}{c}{MLP}& \multicolumn{2}{c}{SVM} \\ 
& & MSE & $R^2$ & MSE & $R^2$ \\ \midrule
Black Box && $  0.689 \pm 0.224  $&$  0.703\pm 0.019 $ &$ 0.448 \pm 0.241  $ & $ 0.781\pm 0.061  $\\\midrule
Method vs. Black Box &$\begin{array}{c} \mbox{SMPF} \\ \mbox{SP} \end{array}$ & $\begin{array}{c} 0.007 \pm 0.003 \\ 0.008 \pm 0.011 \end{array}$ &$\begin{array}{c} 0.993\pm 0.003 \\ 0.978 \pm 0.016 \end{array}$& $\begin{array}{c} 0.029 \pm 0.013 \\ 0.014 \pm 0.015 \end{array}$ &$\begin{array}{c} 0.967\pm 0.120 \\ 0.974 \pm 0.078 \end{array}$ \\\midrule
Method vs. True Labels &$\begin{array}{c} \mbox{SMPF} \\ \mbox{SP} \end{array}$ &$\begin{array}{c} 0.674 \pm 0.211 \\ 0.682 \pm 0.225 \end{array}$&$\begin{array}{c} 0.709\pm0.015 \\ 0.697 \pm 0.027 \end{array}$ & $\begin{array}{c} 0.344 \pm 0.163 \\ 0.471 \pm 0.253 \end{array}$&$\begin{array}{c} 0.829\pm0.037 \\ 0.780 \pm 0.048 \end{array}$ \\
\bottomrule 
\end{tabular}
\caption{Interpreting black-boxes trained on real data using SMPF compared with SP\vspace{1mm}}
\label{exp3-table}
\vspace{-4pt}
\end{table*}
\subsection{Instance-wise Feature Selection}
In this experiment, we evaluate the performance of our method in estimating feature importance by repeating the second experiment of \cite{alaa2019demystifying}. Three synthetic datasets are used: XOR, Nonlinear additive features, and Feature switching. All three datasets have 10 features. In XOR, only the first two features contribute to the output. In the Nonlinear additive features and Feature switching datasets, the first four and the first five features are important, respectively.\footnote{See Appendix B of \cite{alaa2019demystifying} for more details.} First, we train a 2-layer neural network $f(\bm{x})$ with 200 hidden neurons to estimate the label of each data point. Then, we run our algorithm to find a function $g(\bm{x})$ that approximates $f(\bm{x})$. We use the coefficient of each feature in the Taylor series of $g(\bm{x})$ as a metric for its importance: the larger the coefficient, the more important the feature. We rank the features based on their importance. We consider 1000 data points, repeat the process for each data point, and find the median feature importance ranking. The median rank of the relevant features determines the accuracy of the algorithm; a smaller median rank implies better accuracy.

Figure \ref{fig:exp2} compares our algorithm with Symbolic Metamodeling (SM) \cite{alaa2019demystifying}, Symbolic Pursuit (SP) \cite{crabbe2020learning}, Symbolic Regression (SR) \cite{orzechowski2018we}, DeepLIFT \cite{shrikumar2017learning}, SHAP \cite{lundberg2017shap}, LIME \cite{ribeiro2016lime}, and L2X \cite{chen2018learning}. SMPF performs competitively with the other algorithms. On the XOR dataset it has the best median rank, and it is among the best on the Nonlinear additive features dataset. On the Feature switching dataset, SMPF performs similarly to the other global methods, i.e., SM, SP, and SR, which are our direct competitors. SHAP is the only algorithm that performs better on this dataset.
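
The Taylor-coefficient importance score used in this experiment can be sketched as follows. This is a minimal illustration that assumes a first-order expansion of $g$ around the data point of interest, denoted $\bm{x}^{*}$, and assumes that coefficients are compared by absolute value; the symbols $\bm{x}^{*}$, $d$, and $s_i$ are notation introduced here, with $d = 10$ features indexed from $0$:
\begin{align*}
g(\bm{x}) &\approx g(\bm{x}^{*}) + \sum_{i=0}^{d-1} \frac{\partial g}{\partial x_i}(\bm{x}^{*})\,\big(x_i - x^{*}_i\big), &
s_i(\bm{x}^{*}) &= \left|\frac{\partial g}{\partial x_i}(\bm{x}^{*})\right|.
\end{align*}
Under these assumptions, the features are ranked at each data point by decreasing $s_i(\bm{x}^{*})$, and the median rank of the relevant features over all data points is the quantity summarized in Figure \ref{fig:exp2}. Because $g$ is available in closed form, the partial derivatives can be obtained symbolically rather than by perturbing the inputs.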

\subsection{Black-box Approximation}
In this experiment, we evaluate the performance of our method in interpreting black boxes trained on real data, replicating the second experiment of \cite{crabbe2020learning}. A multilayer perceptron (MLP) and a support vector machine (SVM) are trained as two black boxes on the UCI Yacht dataset \cite{Dua:2019} (additional results are reported in Appendix E). To match the setting of SP, we train the MLP and SVM models using the scikit-learn library \cite{buitinck2013api} with the default parameters.
We randomly select 80\% of the data points to train both the black-box model and the SMPF model, and use the remaining 20\% to evaluate performance. This procedure is repeated five times to report averages and standard deviations. We report the MSE and $R^2$ score of the MLP and SVM against the true labels, the MSE and $R^2$ of the metamodel against the black-box models, and the MSE and $R^2$ of the metamodel against the true labels (see Table \ref{exp3-table}). We observe that both SP and SMPF approximate the black-box very well. Interestingly, SMPF outperforms the black-box on the test set for both models, which may indicate that the black-box overfits the dataset whereas SMPF, which uses simple functions, does not.
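
The three groups of rows in Table \ref{exp3-table} can be written out explicitly. As a sketch, let $\{(\bm{x}_i, y_i)\}_{i=1}^{n}$ denote the held-out test points, $f$ the black-box, and $g$ the metamodel. The notation $\mathrm{MSE}(a,b)$ and $R^2(a,b)$ is introduced here, and the standard definitions below are assumed to match the ones used for the table:
\begin{align*}
\mathrm{MSE}(a, b) &= \frac{1}{n}\sum_{i=1}^{n}\big(a_i - b_i\big)^2, &
R^2(a, b) &= 1 - \frac{\sum_{i=1}^{n}(a_i - b_i)^2}{\sum_{i=1}^{n}(b_i - \bar{b})^2},
\end{align*}
where $a$ holds the predictions, $b$ the reference values, and $\bar{b}$ is the mean of $b$. The black-box rows then correspond to $(a_i, b_i) = (f(\bm{x}_i), y_i)$, the metamodel-versus-black-box rows to $(g(\bm{x}_i), f(\bm{x}_i))$, and the metamodel-versus-labels rows to $(g(\bm{x}_i), y_i)$.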

\section{Discussion}
\textbf{Complexity:} In terms of run-time, for the last experiment, training SP for the MLP black-box takes 215 minutes, while training our algorithm takes 45 minutes (both on a personal computer). SP is more computationally expensive because it has to evaluate Meijer G-functions in each iteration of its optimization process. Evaluating a Meijer G-function is very expensive and takes about 1 to 4 seconds depending on the hyperparameters (i.e., $m,n,p,q$). This observation suggests that SMPF has a lower computational cost, which allows us to handle more variables and also makes it possible to use more complex trees, as we suggest in the future work below. However, it should be highlighted that our method (like other symbolic methods) is not appropriate for high-dimensional data such as images.

\textbf{Limitations:} Even though we demonstrated the performance of our method through extensive numerical experiments, it lacks theoretical guarantees (theoretical analysis is particularly challenging because of the use of GP). Another limitation (also inherited from GP) is that our model has several hyperparameters that specify the structure of the tree. As discussed, symbolic metamodels cannot handle high-dimensional inputs. Finally, the richness of the functions we can create is limited; this can be compensated for by using more complex classes of functions or more complex tree structures.

\textbf{Direct training vs.\ using the black-box:}
A natural question is why we do not train the metamodel directly on the training data (without using the black-box). There are two reasons for training on the black-box. First, from the user's point of view, the task may be to interpret a given black-box, i.e., the user's question may be why this particular model works, rather than a search for another interpretable model. Second, and more importantly, we may not have access to the dataset for various reasons, including privacy concerns. Our method only needs to query the black-box, and we can use random inputs (as many as we want). Directly using the dataset is also possible in all symbolic metamodeling methods (e.g., SR, SM, and SP) and can be relevant in many scenarios, e.g., discovering the underlying governing rules of a dataset \cite{udrescu2020ai,sahoo2018learning,makke2021symbolic}.
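
The query-only setting described above can be summarized by the following objective. This is a hedged sketch rather than the exact training loss: the sampling distribution $\mathcal{D}$ over the input domain, the number of queries $n$, the candidate class $\mathcal{G}$, and the squared loss are assumptions introduced here for illustration:
\begin{equation*}
\hat{g} = \arg\min_{g \in \mathcal{G}} \frac{1}{n}\sum_{i=1}^{n}\big(g(\bm{x}_i) - f(\bm{x}_i)\big)^2, \qquad \bm{x}_i \sim \mathcal{D}.
\end{equation*}
No labels appear in this objective; only the black-box outputs $f(\bm{x}_i)$ are needed, which is why access to the original dataset is not required and $n$ can be chosen freely.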

\textbf{Conclusion and future work:} 
\rev{We proposed a new generic framework for symbolic metamodeling based on the Kolmogorov superposition theorem (KST). We suggested using simple parameterized functions to obtain a closed-form and interpretable expression for the metamodel. The use of simple functions may seem restrictive when compared with SM and SP, which use Meijer G-functions (a richer class of functions). However, this is compensated in our framework by a better approximation of KST. We used genetic programming to search over different possible trees and classes of functions.
There are several directions for extending this work: 1) We can consider a more complex tree structure. For example, we can have trees with four layers instead of three, which allows us to construct more complex expressions (see Appendix D). 2) Other primitive functions can be used in our setup, e.g., Meijer G-functions.
3) The optimization in the training phase can be improved. The problem is non-convex, and gradient descent may not find the global optimum. This issue can be addressed by applying a convex relaxation or using more sophisticated non-convex optimization methods.}

\textbf{Acknowledgment: } This work is partially supported by the NSF under grants IIS-2301599 and ECCS-2301601.












\section{Ethics and Reproducibility Statements}

