\documentclass[twoside]{article}

\usepackage{aistats2025}
% If your paper is accepted, change the options for the package
% aistats2025 as follows:
%
%\usepackage[accepted]{aistats2025}
\usepackage{xiaohanmacros}
\usepackage{algorithm}
\usepackage{algpseudocode}
\newtheorem{definition}{Definition}
\newtheorem{theorem}{Theorem}
\newtheorem{assumption}{Assumption}
\newtheorem{lemma}{Lemma}
%
% This option will print headings for the title of your paper and
% headings for the authors names, plus a copyright note at the end of
% the first column of the first page.

% If you set papersize explicitly, activate the following three lines:
%\special{papersize = 8.5in, 11in}
%\setlength{\pdfpageheight}{11in}
%\setlength{\pdfpagewidth}{8.5in}

% If you use natbib package, activate the following three lines:
\usepackage[round]{natbib}
\renewcommand{\bibname}{References}
\renewcommand{\bibsection}{\subsubsection*{\bibname}}

% If you use BibTeX in apalike style, activate the following line:
%\bibliographystyle{apalike}

\begin{document}
\bibliographystyle{plainnat}

% If your paper is accepted and the title of your paper is very long,
% the style will print as headings an error message. Use the following
% command to supply a shorter title of your paper so that it can be
% used as headings.
%
%\runningtitle{I use this title instead because the last one was very long}

% If your paper is accepted and the number of authors is large, the
% style will print as headings an error message. Use the following
% command to supply a shorter version of the authors names so that
% they can be used as headings (for example, use only the surnames)
%
%\runningauthor{Surname 1, Surname 2, Surname 3, ...., Surname n}

\twocolumn[

\aistatstitle{Instructions for Paper Submissions to AISTATS 2025}

\aistatsauthor{Xiaohan Wang \And Yunzhe Zhou \And  Giles Hooker }

\aistatsaddress{Cornell University \And  University of California, Berkeley \And University of Pennsylvania} ]

\begin{abstract}
Variable importance is one of the most widely used measures for interpreting machine learning models and has long been popular in both the statistics and machine learning communities. Recently, increasing attention has been directed toward uncertainty quantification for these metrics. Existing approaches largely rely on one-step procedures, which, while asymptotically efficient, can be sensitive and unstable in finite samples. To address these limitations, we propose a novel method inspired by the {\it targeted learning} (TL) framework, designed to enhance robustness in inference for variable importance metrics. Our approach is particularly suited to permutation-based and relearning-based variable importance techniques. We show that it (i) retains the asymptotic efficiency of traditional methods, (ii) maintains comparable computational complexity, and (iii) delivers improved accuracy, especially in finite samples. We further support these findings with numerical experiments that illustrate the practical advantages of our method and validate the theoretical results.
\end{abstract}

\section{Introduction}
Machine learning (ML) models offer high-quality predictions for complex data structures and have become indispensable across various fields, including civil engineering \citep{lu2023using}, sociology \citep{molina2019machine}, and archaeology \citep{bickler2021machine}, owing to their versatility and predictive power. However, because of their complexity, the internal mechanisms of ML models are difficult to interpret \citep{hooker2017machine, hooker2021unrestricted, freiesleben2024scientific}. To address this issue, many researchers have proposed interpretable machine learning (IML) tools that provide post hoc interpretability of ML models.

Among these tools, variable importance, which measures the contribution of individual covariates to the response variable, is a widely adopted measure in IML \citep{molnar2020interpretable}. Traditionally, it has been applied to assess the performance of fixed models, such as random forests \citep{breiman2001random} and linear models \citep{gromping2007estimators}. Additionally, efforts have been made to create model-specific uncertainty quantification methods, as seen in \citet{gan2022model}. Building on these advancements, there is a growing interest in exploring model-agnostic variable importance using nonparametric techniques \citep{van2006statistical, lei2018distribution, williamson2021nonparametric, donnelly2023rashomon,verdinelli2024decorrelated}.

Despite substantial efforts devoted to developing new methodologies, less attention has been given to fully understanding these tools. Specifically, there is a notable gap in the literature concerning uncertainty quantification for variable importance metrics, where very little work has been done \citep{williamson2021nonparametric, williamson2023general,wolock2023nonparametric,freiesleben2024scientific}. Moreover, existing methods have focused on the one-step debiasing procedure, which can be numerically unstable in finite samples.

In this paper, we introduce a novel method to quantify the uncertainty of variable importance measures. Inspired by the targeted learning framework of \citet{van2006statistical}, our method provides a robust algorithm for conducting inference on variable importance. Our approach is efficient within the class of \textit{regular}\footnote{This will be formally introduced in Section~\ref{sec: theory}.} estimators, is computationally efficient, and delivers improved finite-sample accuracy relative to one-step estimators. For illustration, we focus on conditional permutation and relearn-based metrics, as these methods avoid potential issues with extrapolation \citep{hooker2021unrestricted}.


% Our contribution can be considered as two-fold.  

% Firstly, our work contributes to the literature on uncertainty quantification. 


% From the targeted learning literature perspective, we extend the scope of targeted learning to variable importance research. TL is first introduced by \cite{van2006statistical}, and since then, a number of variants have been proposed to address specific causal problems, such as cross-validation \cite{van2011cross}, one step tMLE \cite{van2016one}, and \cite{wei2023efficient}. However, most of the existing work have been focused on the casual inference, instead of a way to consider general nonparametric estimation. 



\clearpage

This paper is organized as follows. We first formally state the problem setup and introduce key concepts related to variable importance and TL. We then present our methodology and provide its theoretical guarantees. Lastly, we demonstrate the effectiveness of our methodology through concrete empirical examples of variable importance, namely the permute-and-relearn method \citep{mentch2016quantifying} and conditional permutation importance \citep{strobl2008conditional}.
\section{Variable Importance }
\subsection{Problem Setup}
Suppose that we observe \(n\) independent and identically distributed (i.i.d.) observations \( \{(Y_i, X_i, Z_i)\}_{i=1}^{n} \) drawn from an unknown joint distribution \( P^* \in \mathcal{M}\), where \(\mathcal{M}\) is a nonparametric class of distributions. That is, 
\[
(Y_i, X_i, Z_i) \overset{\text{i.i.d.}}{\sim} P^*, \quad i = 1, \dots, n.
\]

We aim to quantify the relationship between the response \( Y \in \R\) and the covariate of interest \( X  \in \mathcal{X}\), in the presence of the remaining covariates \(Z \in \mathcal{Z}\), through a predefined variable importance measure, to estimate this measure efficiently, and then to conduct inference on it. The fitted model \(f: \mathcal{X}\times\mathcal{Z}\to \R\) is evaluated by a loss function \(L(\cdot, \cdot)\). For simplicity, we focus on the case where \(\mathcal{X}\subseteq \R\) and \(\mathcal{Z} \subseteq \R^{d-1}\), yet we note that our method can be generalized to the case where \(X_i\) is a vector. 
\subsection{Variable Importance metrics}\label{sec: vi}
Variable importance metrics assign a score to a feature by measuring how much the predictive loss changes with and without the information carried by that feature. They have been widely used since their introduction by \citet{breiman2001random}. 

The classical permutation importance is obtained from the out-of-bag (OOB) loss. We first permute the feature(s) whose importance we wish to quantify: we randomly permute the entries of the column \(X\) across observations, denote the result by \(X^\pi\), and combine it with the remaining features to form the data \((Y, X^\pi, Z)\). The permutation importance is then the resulting increase in loss:
\begin{align*}
  VI^{\pi}_X  = \sum_{i=1}^n \left[ L(Y_i, f(X_i^\pi, Z_i)) - L(Y_i, f(X_i, Z_i)) \right].
\end{align*}
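
As a worked illustration, offered as a sketch under simplifying assumptions that are not part of the setup above, take the squared-error loss \(L(y, \hat{y}) = (y - \hat{y})^2\) and suppose that \(f\) is fixed and equal to the true regression function of a linear model, \(Y = \beta X + \gamma^\top Z + e\) with \(\E[e \mid X, Z] = 0\) and \(\E[e^2] = \sigma^2\). If we idealize the permuted covariate \(X^\pi\) as an independent copy of \(X\) that is independent of \((X, Z, e)\), then each summand has expectation
\begin{align*}
  & \E\left[ (Y - f(X^\pi, Z))^2 \right] - \E\left[ (Y - f(X, Z))^2 \right] \\
  & \quad = \E\left[ (\beta (X - X^\pi) + e)^2 \right] - \E\left[ e^2 \right] \\
  & \quad = \beta^2 \E\left[ (X - X^\pi)^2 \right] + \sigma^2 - \sigma^2 \\
  & \quad = 2 \beta^2 \mathrm{Var}(X),
\end{align*}
where the cross term vanishes because \(\E[Xe] = 0\) and \(\E[X^\pi e] = 0\). In this idealized case, the metric therefore grows with both the squared coefficient and the spread of the covariate.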

Building upon this, various extensions have been explored. For example, \citet{lei2018distribution, williamson2023general} proposed estimating variable importance by dropping the feature altogether, known as the leave-one-covariate-out (LOCO) method, and \citet{strobl2008conditional} proposed permuting the feature within each leaf, among many others.

In this paper, we focus on uncertainty quantification for permutation-based and relearn-based metrics, namely the permute-and-relearn importance and the conditional permutation importance defined below.
\subsubsection{Permutation and Relearn Importance}\label{subsec: pap}
This approach was first proposed by \citet{mentch2016quantifying}. In contrast to the classical permutation importance, they relearn the model on the permuted data; that is, they train a new model \(f^\pi\) on \((Y, X^\pi, Z)\). The permute-and-relearn variable importance is thus defined as:
\begin{align*}
    VI^{\pi L}_X  = \sum_{i=1}^n \left[ L(Y_i, f^\pi(X_i^\pi, Z_i)) - L(Y_i, f(X_i, Z_i)) \right].
\end{align*}

\subsubsection{Conditional Permutation Importance}
This metric is obtained from a conditionally permuted copy \(X^C\) of \(X\), drawn such that \(X_{i}^C \sim X_i \mid Z_i\). With notation similar to that above, the plug-in estimator is defined as:
\begin{align*}
    VI_X^C  = \sum_{i=1}^n \left[ L(Y_i, f(X_i^C, Z_i)) - L(Y_i, f(X_i, Z_i)) \right].
\end{align*}

This approach was first proposed by \citet{strobl2008conditional} for random forests, where the conditionally permuted version is obtained by permuting within each leaf. A similar idea also appears in \citet{fisher2019all}.

\section{Methodology}

\subsection{Existing methodology}
Existing literature on quantifying the uncertainty of variable importance is scarce. Broadly, there are two main approaches to uncertainty quantification in IML.

The first is the debiasing approach, which uses the efficient influence function to correct the bias of an initial estimator and to construct confidence intervals \citep{williamson2023general,wolock2023nonparametric}. The second is to refit the model repeatedly and construct a variance estimator directly \citep{molnar2023relating}.

In works such as \citet{williamson2021nonparametric} and \citet{williamson2023general}, the efficient influence function is leveraged to construct confidence intervals, since the plug-in estimator is shown to be efficient under mild assumptions. In contrast, \citet{wolock2023nonparametric} apply this concept to debias the estimate first and then construct the confidence interval, an approach also seen in \citet{ning2017general,10.1111/ectj.12097}. 

Additionally, \citet{molnar2023relating} and \citet{freiesleben2024scientific} consider a bootstrap-like method, in which uncertainty is quantified by refitting models on different portions of the data. The variance is then estimated from the metrics computed across the resulting collection of models. This methodology shares a similar idea with bootstrap variance estimation, as in \citet{diciccio1996bootstrap}. 

However, from a non-asymptotic perspective, the instability of empirical distributions can hinder the effectiveness of both approaches, as highlighted by \citet{booth1998monte} and \citet{van2011targeted}. To address this numerical instability, we propose a novel method for estimating variable importance that iteratively updates the distribution, which we describe in detail in the next section.
\subsection{Proposed Methodology}
In this section, we propose a novel method that provides robust statistical inference for variable importance. The steps described in the following paragraphs are summarized in Algorithm~\ref{alg: estimation}. 
%\textcolor{red}{should we make it K-fold or two fold shall be enough?}
\begin{algorithm}
  \caption{Calculation on \(I_1\)}
  \label{alg: estimation}
  \begin{algorithmic}[1]
      \Require $\{(Y_{i}, X_i, Z_{i})\}_{i=1}^{n}$ and index sets \(I_1, I_2\) such that \(I_1 \cup I_2 = \left\{ 1, \dots, n \right\}\) and \(I_1 \cap I_2 = \emptyset\).
      \State Calculate the plug-in estimate of the variable importance on \(I_1\), denoted \(\hat{\Psi}_1(\hat{P})\).
      \Repeat
          \State Calculate the plug-in estimate of the efficient influence function \(\psi\) on \(I_2\), denoted \(\hat{\psi}_1\).
          \State Find $\epsilon^*$ that maximizes the likelihood of $c(\epsilon)\hat{P}(x)e^{\epsilon \hat{\psi}_1(x;\hat{P})}$ over the observations in \(I_2\).
          \State Update $\hat{P}(x) \gets c(\epsilon^*)\hat{P}(x)e^{\epsilon^* \hat{\psi}_1(x;\hat{P})}$.
      \Until{convergence}
      \State \textbf{Return:} \(\hat{\Psi}_1(\hat{P})\) and the standard deviation estimate \(\sqrt{\frac{1}{n}\sum_{i\in I_2}\hat{\psi}_1(X_i;\hat{P})^2}\).
  \end{algorithmic}
  \end{algorithm}

Our strategy consists of two steps. We first construct the most efficient regular estimator. By classical results based on Le Cam's convolution theorem (e.g., Theorem 5.1.1 of \citet{bickel1993efficient}), this estimator is asymptotically linear. Based on this observation, we then construct a confidence interval with known asymptotic behavior, which will be formalized in the next section.

For the construction of the efficient estimator, we take a different approach from common debiasing methods: rather than applying a single one-step correction, we iteratively update the distribution, improving upon the plug-in estimator by means of the efficient influence function.

We start by calculating the plug-in estimator of the variable importance metric and the corresponding efficient influence function. We then iteratively update the estimator by perturbing the distribution along the path defined by the empirical efficient influence function until convergence\footnote{Intuitively, we can consider this as an optimization problem over distributions that moves the estimate toward the optimal distribution. This will be formalized in Section~\ref{sec: theory}.}. This yields an efficient estimator \(\hat\Psi_n^*\). By the asymptotic properties of regular asymptotically linear (RAL) estimators, we can then construct a confidence interval for the estimate and conduct hypothesis tests. 
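
To make the update step concrete, the following short derivation sketches why the exponential path is a natural choice. It assumes, for illustration only, that \(\hat{P}\) admits a density \(\hat{p}\) with respect to a dominating measure \(\mu\) and that the normalizing integral below is finite; the symbols \(\hat{p}\), \(\hat{p}_\epsilon\) and \(\mu\) are introduced here and are not part of the notation above. Writing
\begin{align*}
  \hat{p}_\epsilon(x) = c(\epsilon)\,\hat{p}(x)\,e^{\epsilon \psi(x;\hat{P})}, \qquad c(\epsilon) = \left( \int \hat{p}(x)\,e^{\epsilon \psi(x;\hat{P})}\,d\mu(x) \right)^{-1},
\end{align*}
the score of this path at \(\epsilon = 0\) is
\begin{align*}
  \frac{d}{d\epsilon}\log \hat{p}_\epsilon(x)\Big|_{\epsilon = 0} = \psi(x;\hat{P}) - E_{\hat{P}}\left[\psi(X;\hat{P})\right] = \psi(x;\hat{P}),
\end{align*}
because the efficient influence function has mean zero under \(\hat{P}\). The path therefore moves \(\hat{P}\) in the direction of the estimated efficient influence function. Moreover, the first-order condition for maximizing the log-likelihood over \(I_2\) in \(\epsilon\) reads
\begin{align*}
  \frac{1}{|I_2|}\sum_{i \in I_2} \psi(X_i;\hat{P}) = E_{\hat{p}_{\epsilon^*}}\left[\psi(X;\hat{P})\right],
\end{align*}
so that once the iteration has converged, in the sense that \(\epsilon^* \approx 0\), the empirical mean of the efficient influence function is approximately zero.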

We note that, to obtain the theoretical results under weaker conditions, we adopt sample splitting for the update, as implemented in \citet{van2011cross}, \citet{10.1111/ectj.12097}, and \citet{newey2018cross}. That is, the plug-in estimate is computed on the first set of data \(I_1\), while the iterative update is conducted on the second set \(I_2\). Lastly, our method generalizes readily to $K$ folds, which yields the same asymptotic result; choosing \(K = 4\) or \(5\) tends to give more numerically stable results.

\subsection{Illustration}
In this section, we illustrate the methodology by applying it to the permute-and-relearn and conditional permutation importance metrics. We begin with the efficient influence function of the permute-and-relearn importance.

Note that the estimand of the permute-and-relearn metric is the same as that of permutation importance, which is defined as:
\begin{align*}
  \Psi^{\pi L} = \E\left[ L(y, \hat{y}(x, z)) -  L(y, \hat{y}(x', z))\right],
\end{align*}
where \(\hat{y}(x, z) = \E\left[ y \mid X = x, Z = z \right]\).

\begin{lemma}
  Let \[\Psi^{\pi L} = \E\left[ L(y, \hat{y}(x, z)) -  L(y, \hat{y}^\pi(x', z))\right].\]The efficient influence function is:
  \begin{align*}
    \psi^{\pi L} 
    & = (\tilde{y} - \hat{y}(\tilde{x},\tilde{z}))\int L'(y,\hat{y}(\tilde{x},\tilde{z})) \frac{P(\tilde{x})P(y,\tilde{z})}{P(\tilde{x},\tilde{z})} dy \\ 
    &+  \int L(\tilde{y},\hat{y}(x',\tilde{z}))P(x')dx'\\
    &+ \int L(y,\hat{y}(\tilde{x},z))P(y,z)dydz - 2 \Psi^{\pi}(P),
  \end{align*}
  where \(x'\) denotes the permuted version of \(x\) and \(\tilde{x}\) denotes the empirical distribution.
\end{lemma}






\section{Theoretical results}\label{sec: theory}
To formally introduce the theoretical results, we briefly review a few concepts that are helpful in developing our method. Section~\ref{sec: vi} introduced the notions of variable importance that we use; in Section~\ref{sec: raltl}, we discuss the background and methodology of targeted learning, including a brief overview of regular asymptotically linear estimation. 

\subsection{Efficiency theory and TL}\label{sec: raltl}
By viewing variable importance as a {\it general parameter} \(\Psi: \mathcal{M} \to \R\), our aim is to find a ``good'' estimator of the true value \( \Psi(P^*)\) and then construct the corresponding confidence interval. We denote the empirical distribution by \(P_n\).

We seek an estimator that is ``good'' from three perspectives:
\begin{enumerate}
    \item Consistency: we would like an estimator that converges to the true value as the sample size grows, which is guaranteed by {\it asymptotic linearity}.
    \item Robustness: we would like an estimator that is robust to small perturbations of the data-generating distribution, which is guaranteed by {\it regularity}.
    \item Efficiency: we would like an estimator that is the best possible given the available information, which is ensured by the TL methodology, building on the two other requirements.
\end{enumerate}

\subsubsection{Regular Asymptotically Linear (RAL) Estimators}
To begin with, we would like our estimator to be at least consistent. One class of estimators with this property is the class of {\it asymptotically linear} estimators; classical examples for parametric models include maximum likelihood estimation (MLE) and the generalized method of moments (GMM) under mild conditions. 

Formally, it is defined as:
\begin{definition}\label{def1: asymptotically linear}
    An estimator sequence \(\{\hat{\Psi}_n(P_n)\}\) is said to be asymptotically linear with influence function \(\psi\) at distribution \(P^*\) if 
    \begin{align*}
        \sqrt{n}(\hat\Psi_n(P_n) - \Psi(P^*)) -  \frac{1}{\sqrt{n}}\sum_{i = 1}^n \psi(X_i) = o_P(1),
    \end{align*}
    where \(E_{P^*}[\psi(X_i)] = 0\) and \(E_{P^*}[\psi(X_i)^2] < \infty\).
\end{definition}
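
As a simple sanity check of Definition~\ref{def1: asymptotically linear}, stated here as a textbook illustration rather than a result of this paper, consider the mean functional \(\Psi(P) = E_P[X]\) with \(E_{P^*}[X^2] < \infty\) and the sample mean \(\hat\Psi_n(P_n) = \frac{1}{n}\sum_{i=1}^n X_i\). Taking \(\psi(x) = x - \Psi(P^*)\), we have
\begin{align*}
  \sqrt{n}(\hat\Psi_n(P_n) - \Psi(P^*)) - \frac{1}{\sqrt{n}}\sum_{i=1}^n \psi(X_i) = 0,
\end{align*}
so the sample mean is exactly linear with a mean-zero influence function. More generally, by the central limit theorem, any asymptotically linear estimator whose influence function has finite variance satisfies
\begin{align*}
  \sqrt{n}(\hat\Psi_n(P_n) - \Psi(P^*)) \rightsquigarrow N\left(0, E_{P^*}[\psi(X)^2]\right),
\end{align*}
which motivates Wald-type intervals of the form
\begin{align*}
  \hat\Psi_n(P_n) \pm z_{1-\alpha/2}\sqrt{\frac{1}{n}\cdot\frac{1}{n}\sum_{i=1}^n \hat\psi(X_i)^2},
\end{align*}
provided the variance estimate is consistent. Here \(\hat\psi\) denotes an estimate of \(\psi\) and \(z_{1-\alpha/2}\) the standard normal quantile; both symbols are introduced only for this illustration.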

We note that regularity ensures that the limiting distribution of the estimator remains stable under reasonably small perturbations of the data-generating distribution, which will be formalized after we introduce the tangent space. 

In short, we would like to have a regular asymptotically linear (RAL) estimator.

\subsubsection{Tangent Space}
The {\it tangent space} characterizes the collection of directions along which a path through \(P_0 \in \mathcal{M}\) can be constructed. It is spanned by the score functions \(h = \frac{d}{d\varepsilon}\log dP_\varepsilon\big |_{\varepsilon = 0}\) of paths \(\{P_\varepsilon\} \subseteq \mathcal{M}\) through \(P_0\), together with their linear combinations \citep{bickel1993efficient,van2000asymptotic}. A path can be thought of as a way in which one distribution moves toward another. Formally, the {\it tangent space} is defined as:
\begin{definition}
  Let \(\{V_1, \dots, V_m\}\) denote the collection of score functions at \(P_0\). The tangent space \(\dot{\mathcal{P}}_{P_0}\) at \(P_0\) is defined as the linear span of \(\{V_1, \dots, V_m\}\).
\end{definition}

For the class of nonparametric distributions \(\mathcal{M}\), the tangent space is the set of mean-zero functions in \(L_2(P_0)\) \citep{bickel1993efficient}. 

With the tangent space defined, we can say that a sequence of estimators \(\hat\Psi_n\) at \(P_0\) is {\it regular} if:
\begin{align*}
  \sqrt{n}\left(\hat\Psi_n - \Psi\left(P_{1/\sqrt{n},g}\right)\right) \overset{{P_{1/\sqrt{n},g}}}{\rightsquigarrow} L, \quad \text{every } g \in \dot{\mathcal{P}},
\end{align*}
where \(P_{1/\sqrt{n},g} = (1 + \frac{1}{\sqrt{n}}g) P_0\) and the limit law \(L\) does not depend on \(g\).
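
To connect this path to the tangent space, a brief check shows that its score at \(\varepsilon = 0\) is \(g\) itself. This is a sketch that assumes \(g\) is bounded with \(E_{P_0}[g] = 0\), so that \(P_{\varepsilon,g} = (1 + \varepsilon g)P_0\) is a probability distribution for all sufficiently small \(\varepsilon\):
\begin{align*}
  \int (1 + \varepsilon g)\,dP_0 &= 1 + \varepsilon E_{P_0}[g] = 1, \\
  \frac{d}{d\varepsilon}\log\left(1 + \varepsilon g(x)\right)\Big|_{\varepsilon = 0} &= \frac{g(x)}{1 + \varepsilon g(x)}\Big|_{\varepsilon = 0} = g(x).
\end{align*}
Regularity thus requires the limiting behavior of the estimator to be unaffected when the data-generating distribution drifts away from \(P_0\) in any such direction \(g\) at rate \(1/\sqrt{n}\).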
\subsubsection{Influence Function}
As shown in Definition~\ref{def1: asymptotically linear}, the influence function characterizes the asymptotic behavior of asymptotically linear estimators. Moreover, Le Cam's convolution theorem guarantees that the most efficient regular estimator is asymptotically normal (Theorem 5.1.1 of \citet{bickel1993efficient}).

Moreover, we 

\subsubsection{Bias correction and Z-estimation}
In this section, we'll propose a novel method to construct the 

\subsubsection{Targeted Learning}
\clearpage




\section{Experiment}

\section{Conclusion}




\clearpage


\bibliography{vitmle}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\clearpage
\section*{Checklist}




 \begin{enumerate}


 \item For all models and algorithms presented, check if you include:
 \begin{enumerate}
   \item A clear description of the mathematical setting, assumptions, algorithm, and/or model. [Yes/No/Not Applicable]
   Yes
   \item An analysis of the properties and complexity (time, space, sample size) of any algorithm. [Yes/No/Not Applicable]
   Not Applicable
   \item (Optional) Anonymized source code, with specification of all dependencies, including external libraries. [Yes/No/Not Applicable]
   Yes
 \end{enumerate}

 \item For any theoretical claim, check if you include:
 \begin{enumerate}
   \item Statements of the full set of assumptions of all theoretical results. [Yes/No/Not Applicable]
   Yes
   \item Complete proofs of all theoretical results. [Yes/No/Not Applicable]
   Yes
   \item Clear explanations of any assumptions. [Yes/No/Not Applicable]  
   Yes
 \end{enumerate}


 \item For all figures and tables that present empirical results, check if you include:
 \begin{enumerate}
   \item The code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL). [Yes/No/Not Applicable] 
   Yes
   \item All the training details (e.g., data splits, hyperparameters, how they were chosen). [Yes/No/Not Applicable]Yes
   
    \item A clear definition of the specific measure or statistics and error bars (e.g., with respect to the random seed after running experiments multiple times). [Yes/No/Not Applicable]
         Yes
    \item A description of the computing infrastructure used. (e.g., type of GPUs, internal cluster, or cloud provider). [Yes/No/Not Applicable]
         Yes
 \end{enumerate}

 \item If you are using existing assets (e.g., code, data, models) or curating/releasing new assets, check if you include:
 \begin{enumerate}
   \item Citations of the creator If your work uses existing assets. [Yes/No/Not Applicable]
   Not Applicable
   \item The license information of the assets, if applicable. [Yes/No/Not Applicable]
   Not Applicable
   \item New assets either in the supplemental material or as a URL, if applicable. [Yes/No/Not Applicable]
   Not Applicable
   \item Information about consent from data providers/curators. [Yes/No/Not Applicable]
   Not Applicable
   \item Discussion of sensible content if applicable, e.g., personally identifiable information or offensive content. [Yes/No/Not Applicable]
   Not Applicable
 \end{enumerate}

 \item If you used crowdsourcing or conducted research with human subjects, check if you included:
 \begin{enumerate}
   \item The full text of instructions given to participants and screenshots. [Yes/No/Not Applicable]
   Not Applicable
   \item Descriptions of potential participant risks, with links to Institutional Review Board (IRB) approvals if applicable. [Yes/No/Not Applicable]
   Not Applicable
   \item The estimated hourly wage paid to participants and the total amount spent on participant compensation. [Yes/No/Not Applicable]
   Not Applicable
 \end{enumerate}

 \end{enumerate}


\end{document}
