% \documentclass{uai2023} % for initial submission
\documentclass[accepted]{uai2023} % after acceptance, for a revised version; also before submission to see how the non-anonymous paper would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2023} % ptmx math instead of Computer
% Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2023} % newtx fonts (improves upon
% ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{balance}

\usepackage{mathtools} % amsmath with fixes and additions
% For theorems and such
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{amsthm}

\theoremstyle{plain}
\newtheorem{theorem}{Theorem}[section]
\newtheorem{proposition}[theorem]{Proposition}
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{corollary}[theorem]{Corollary}
\theoremstyle{definition}
\newtheorem{definition}[theorem]{Definition}
\newtheorem{assumption}[theorem]{Assumption}
\theoremstyle{remark}
\newtheorem{remark}[theorem]{Remark}
\newtheorem{example}{Example}

\usepackage[ruled,vlined]{algorithm2e}
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

\usepackage{multirow}

%% Self-defined macros
\DeclareMathOperator*{\argmax}{arg\,max}
\DeclareMathOperator*{\argmin}{arg\,min}

\def\de{\overset{\Delta}{=}}
\newcommand{\Jie}[1]{{\color{green}#1}}

\SetKwInput{KwInput}{Input}
\SetKwInput{KwOutput}{Output}
\SetKwComment{Comment}{/*}{ */}
\let\oldnl\nl% Store \nl in \oldnl
\newcommand{\nonl}{\renewcommand{\nl}{\let\nl\oldnl}}

\title{Robust Quickest Change Detection for Unnormalized Models}

% The standard author block has changed for UAI 2023 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
% \author[1]{\href{mailto:<jj@example.edu>?Subject=Your UAI 2023 paper}{Jane~J.~von~O'L\'opez}{}}
\author[1]{Suya Wu}
\author[1]{Enmao Diao}
\author[2]{Taposh Banerjee}
\author[3]{Jie Ding}
\author[1]{Vahid Tarokh}
% Add affiliations after the authors
\affil[1]{%
    Department of Electrical and Computer Engineering\\
    Duke University\\
    Durham, NC 27708, USA
}
\affil[2]{%
    Department of Industrial Engineering\\
    University of Pittsburgh\\
    Pittsburgh, PA 15213, USA
}
\affil[3]{%
    School of Statistics\\
    University of Minnesota Twin Cities\\
    Minneapolis, MN 55455, USA
}
  
\begin{document}
\maketitle
\begin{abstract}
      Detecting an abrupt and persistent change in the underlying distribution of online data streams is an important problem in many applications. This paper proposes a new robust score-based algorithm called RSCUSUM, which can be applied to unnormalized models and addresses the issue of unknown post-change distributions. RSCUSUM replaces the Kullback-Leibler divergence with the Fisher divergence between pre- and post-change distributions for computational efficiency in unnormalized statistical models and introduces a notion of the ``least favorable'' distribution for robust change detection. The algorithm and its theoretical analysis are demonstrated through simulation studies. 
      %and out-of-distribution detection on image datasets.
\end{abstract}

\section{Introduction}\label{sec: intro}
In the problem of quickest change detection, the objective is to detect an abrupt change in the 
statistical properties of an observed stochastic process. This change in the distribution has to be detected with the minimum possible delay, subject to a constraint on the rate of false alarms. This problem has applications in sensor networks, cyber-physical systems, biology, and neuroscience; see \cite{veeravalli2014quickest,  basseville1993detection, poor2008quickest, tartakovsky2014sequential}. %We review the relevant results in Section~\ref{sec:QCDreview} and Section~\ref{sec:background}. 

When the pre- and post-change distributions of the data are known, a typical optimal algorithm in the literature is a stopping rule. A sequence of statistics is calculated using the likelihood ratio of the observations, and a change is declared when the sequence of statistics crosses a pre-designed threshold. The threshold is chosen to meet a constraint on false alarms; see \cite{shiryaev1963optimum,lorden1971procedures,pollak1985optimal,moustakides1986optimal,lai1998information,tartakovsky2005general}. The three most important algorithms in the literature are the Shiryaev algorithm (\cite{shiryaev1963optimum, tartakovsky2005general}), the cumulative sum (CUSUM) algorithm (\cite{page1955test, lorden1971procedures, moustakides1986optimal, lai1998information}), and the Shiryaev-Roberts algorithm (\cite{roberts1966comparison, pollak1985optimal}). 
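
As a concrete illustration of such a stopping rule, the CUSUM algorithm is commonly written in the following recursive form. This is a sketch in notation introduced only for this illustration: $p_\infty$ and $p_1$ denote the known pre- and post-change densities, $X_n$ the $n$-th observation, $W_n$ the detection statistic, and $\tau > 0$ the threshold.
\begin{equation*}
    \begin{split}
        W_n &= \max(W_{n-1}, 0) + \log\frac{p_1(X_n)}{p_\infty(X_n)}, \qquad W_0 = 0, \\
        T &= \inf\{n \geq 1 : W_n \geq \tau\}.
    \end{split}
\end{equation*}
The statistic accumulates log-likelihood ratios and restarts the accumulation whenever it drops below zero. A larger threshold $\tau$ makes false alarms rarer at the cost of a longer detection delay.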

The main challenge in implementing a change detection algorithm in practice is that the pre- and post-change distributions are not precisely known. This challenge is amplified when the data is high-dimensional. Specifically, in several machine learning applications, the data models may not lend themselves to explicit distributions. For example, energy-based models~(\cite{LeCun2006ATO}) capture dependencies between observed and latent variables based on their associated energy (an unnormalized probability), and score-based deep generative models~(\cite{song2020score}) generate high-quality images by learning the score function (the gradient of the log density function). Normalizing these models into probability density functions can be computationally cumbersome. Thus, optimal algorithms from the change detection literature, which are likelihood ratio-based tests, are computationally expensive to implement. 

This issue is partially addressed in \cite{wuetal-aistat-2023} where the authors have proposed the SCUSUM algorithm, a Hyv\"arinen score-based (\cite{hyvarinen2005estimation}) modification of the CUSUM algorithm for quickest change detection. 
It is shown in \cite{wuetal-aistat-2023} that the SCUSUM algorithm is consistent and the authors also provide expressions for the average detection delay and the mean time to a false alarm. The 
Hyv\"arinen score is invariant to scale and hence can be applied to unnormalized models. This makes the SCUSUM algorithm highly efficient as compared to the classical CUSUM algorithm for high-dimensional models.
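
The scale invariance can be seen from a short calculation. The following is a sketch under the assumption that the density is smooth and can be written as $p(x) = \tilde{p}(x)/Z$, where the unnormalized model $\tilde{p}$ and the normalizing constant $Z > 0$ are symbols introduced here for illustration. The Hyv\"arinen score is usually defined as
\begin{equation*}
    S_{\texttt{H}}(x, P) \de \frac{1}{2}\left\|\nabla_x \log p(x)\right\|_2^2 + \Delta_x \log p(x),
\end{equation*}
where $\nabla_x$ is the gradient and $\Delta_x = \sum_{i} \partial^2/\partial x_i^2$ is the Laplacian. Since $\log p(x) = \log \tilde{p}(x) - \log Z$ and $\log Z$ does not depend on $x$, we have
\begin{equation*}
    \nabla_x \log p(x) = \nabla_x \log \tilde{p}(x), \qquad \Delta_x \log p(x) = \Delta_x \log \tilde{p}(x),
\end{equation*}
so the score can be evaluated from $\tilde{p}$ alone, without computing $Z$.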

The main drawback of the SCUSUM algorithm is that its effectiveness is contingent on knowing the precise post-change unnormalized model, i.e., knowing the post-change model within a normalizing constant. In practice, due to a limited amount of training data, the post-change model can only be learned within an uncertainty class. To detect the change effectively, an algorithm must be robust against these modeling uncertainties. The SCUSUM algorithm is not robust in this sense. Specifically, if not carefully designed, the SCUSUM algorithm can fail to detect several (in fact, infinitely many) post-change scenarios. 

In this paper, we propose a robust score-based variant of the CUSUM algorithm for quickest change detection. We refer to our algorithm as the RSCUSUM algorithm. Under the assumption that the post-change uncertainty class is convex and compact, we show that the RSCUSUM algorithm is robust, i.e., can consistently detect changes for every possible post-change model. This consistency is achieved by designing the RSCUSUM algorithm using the \textit{least favorable} distribution from the post-change class. 

The problem of optimal robust quickest change detection is studied in \cite{unnikrishnan2011minimax}. In a minimax setting, the optimal algorithm is the CUSUM algorithm designed using the least favorable distribution. 
The robust CUSUM test in \cite{unnikrishnan2011minimax} may suffer from two drawbacks: 1) It is a likelihood ratio-based test and hence may not be amenable to implementation in high-dimensional models. 2) The notion of least favorable distribution is defined using \textit{stochastic boundedness}, which may be difficult to verify for high-dimensional data. 

In contrast with the work in \cite{unnikrishnan2011minimax}, we define the notion of least favorable distribution using Fisher divergence and provide a method to effectively identify the least favorable distribution for the post-change model. 
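
For reference, the Fisher divergence between two distributions $P$ and $Q$ with smooth densities $p$ and $q$ is commonly written as follows. The notation $\mathbb{D}_{\texttt{F}}$ is used here only for illustration, and some authors omit the factor $1/2$:
\begin{equation*}
    \mathbb{D}_{\texttt{F}}(P \| Q) \de \mathbb{E}_{X \sim P}\left[\frac{1}{2}\left\|\nabla_x \log p(X) - \nabla_x \log q(X)\right\|_2^2\right].
\end{equation*}
As a simple worked example, take the scalar Gaussians $P = \mathcal{N}(\mu_1, \sigma^2)$ and $Q = \mathcal{N}(\mu_0, \sigma^2)$. Then $\nabla_x \log p(x) = -(x - \mu_1)/\sigma^2$ and $\nabla_x \log q(x) = -(x - \mu_0)/\sigma^2$, so their difference is the constant $(\mu_1 - \mu_0)/\sigma^2$ and
\begin{equation*}
    \mathbb{D}_{\texttt{F}}(P \| Q) = \frac{(\mu_1 - \mu_0)^2}{2\sigma^4},
\end{equation*}
compared with the Kullback-Leibler divergence $(\mu_1 - \mu_0)^2/(2\sigma^2)$ for the same pair. Both quantities vanish only when $\mu_1 = \mu_0$, and the Fisher divergence depends on the densities only through the gradients of their logarithms, so no normalizing constant is needed to compute it.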

% For example, energy-based models~\cite{LeCun2006ATO} capture dependencies between observed and latent variables based on their associated energy (an unnormalized probability), and score-based deep generative models~\cite{song2020score} generate high-quality images by learning the score function (the gradient of the log density function). These models can be computationally cumbersome to normalize themselves as probabilistic density functions, and therefore likelihood-based change detection algorithms are computationally expensive in implementation.

%Two common features of all the optimal algorithms discussed above are: 1) the pre- and post-change models are assumed to be known, and 2) the precise knowledge of the pre- and post-change densities are used to calculate the likelihood ratios of the observations. These likelihood ratios are then used to calculate the optimal change detection statistic. 

% In general, the post-change model may not be known precisely. Also, in some machine learning applications, calculating the likelihood ratios can be computationally challenging (see Section~\ref{sec:QCDunknown} and Section~\ref{sec:LLRissues} for detailed discussions on both issues below). In this paper, we propose a Hyv\"arinen score-based method \cite{hyvarinen2005estimation} for quickest change detection that is robust against post-change uncertainties \cite{unnikrishnan2011minimax}. It is also computationally efficient as compared to a likelihood ratio-based optimal robust test. Our proposed method can be applied for change-detection in high-dimensional models, e.g., energy-based models, that are only known or learned within a normalizing constant. For robustness, we identify a distribution that is \textit{least favorable} in the post-change uncertainty class and use it to design our test. 
% \cite{wuetal-aistat-2023}. 

% Detecting an abrupt and persistent change in the underlying statistical characteristics of online data streams is an important problem commonly encountered in many applications. For example, this problem has applications in sensor networks, cyber-physical systems, biology, and neuroscience \cite{veeravalli2014quickest}.  
% In the statistical field of sequential analysis, such a problem is formulated as a problem of quickest change detection. In this problem, observations are modeled as a realization of a stochastic process. The problem is posed as the problem of detecting a change in the distribution of a sequence of random variables. 
% A change point is defined as a time when such a change in distribution occurs.
% A quickest change detection algorithm aims to detect the change point as quickly as possible, with the minimum possible delay subject to a constraint on the rate of false alarms~\cite{veeravalli2014quickest}. 
% A typical quickest change detection algorithm is a single-threshold test where a sequence of statistics is computed over time, and an alarm is raised the first time the sequence is above a pre-designed threshold. The threshold is used to control the rate of false alarms. 

%\subsection{Quickest Change Detection}
% \label{sec:QCDreview}
% In the quickest change detection literature, the most well-studied setting is the independent and identically distributed (i.i.d.) setting. In this setting, it is assumed that the random variables are i.i.d. with a particular probability density function (written in short as density when there is no ambiguity) before the change, and are i.i.d. with another density after the change. In the i.i.d. setting, the main optimality results are obtained in \cite{shiryaev1963optimum,lorden1971procedures,pollak1985optimal,moustakides1986optimal}. In \cite{shiryaev1963optimum}, it is shown that if the change point is modeled as a geometrically distributed random variable, then the optimal algorithm is to stop the first time the \textit{a posterior} probability that the change has already occurred is above a fixed threshold. This algorithm is also called the Shiryaev algorithm and is shown to minimize the average detection delay subject to a constraint on the probability of a false alarm. A minimax problem formulation is introduced in \cite{lorden1971procedures}, where it is shown that the Cumulative Sum (CUSUM) algorithm proposed in \cite{page1955test} is asymptotically optimal as the mean running time to a false alarm goes to infinity. In \cite{pollak1985optimal}, another variant of a minimax problem formulation is considered and it is shown that the Shiryaev-Roberts algorithm, proposed in \cite{roberts1966comparison}, is asymptotically optimal, as the mean running time to a false alarm goes to infinity. In \cite{moustakides1986optimal}, it is shown that the CUSUM algorithm is exactly optimal for the formulation in \cite{lorden1971procedures}. In \cite{lai1998information}, it is shown that the CUSUM algorithm is also asymptotically optimal with respect to the minimax variant studied in \cite{pollak1985optimal}. The classical i.i.d. setting has been extended to non-i.i.d. settings, e.g., in \cite{lai1998information} and \cite{tartakovsky2005general}. 
For a more detailed discussion of the state-of-the-art theoretical results in this classical setting, we refer the readers to~\cite{veeravalli2014quickest, polunchenko2012state} and the references therein. 

% Two common features of all the optimal algorithms discussed above are: 1) the pre- and post-change models are assumed to be known, and 2) the precise knowledge of the pre- and post-change densities are used to calculate the likelihood ratios of the observations. These likelihood ratios are then used to calculate the optimal change detection statistic 
% \cite{shiryaev1963optimum,lorden1971procedures,pollak1985optimal,moustakides1986optimal,lai1998information,tartakovsky2005general}. 
% In general, the post-change model may not be known precisely. Also, in some machine learning applications, calculating the likelihood ratios can be computationally challenging (see a detailed discussion on both issues below). In this paper, we propose a score-based method for quickest change detection that is robust against post-change uncertainties. It is also computationally efficient as compared with a likelihood ratio-based optimal test. 

% \subsection{Quickest Change Detection with Unknown Post-change Distribution}
% \label{sec:QCDunknown}
% The assumption that the post-change model is known can be relaxed in the quickest change detection problem without sacrificing optimality. In the literature, three popular approaches to an unknown post-change model are the generalized likelihood ratio (GLR)-based tests, mixture-based tests, and robust tests \cite{lorden1971procedures}, \cite{lai1998information}, and \cite{unnikrishnan2011minimax}. In a GLR test, the unknown post-change parameter is replaced by its maximum likelihood estimate \cite{lorden1971procedures}, \cite{lai1998information}. In a mixture test, the likelihood ratio is integrated over an assumed prior over the unknown post-change parameter \cite{lai1998information}. It has been shown that under some conditions, both types of tests are asymptotically optimal, uniformly over the unknown parameter set. However, both GLR-based and mixture-based tests are computationally expensive, with the GLR-based test needing to solve an optimization problem at each time step, and the mixture-based test needing to perform a numerical integration at each step. In \cite{unnikrishnan2011minimax}, it is assumed that the post-change distribution belongs to a known family of distributions. It is further assumed that the post-change family has a member that is \textit{least favorable} or closest to the pre-change model in a well-defined sense. This least favorable distribution is then used to replace the unknown post-change model. Thus, the test designed is computationally as efficient as the likelihood ratio-based optimal test with known distribution. Also, the test is shown to be minimax robust optimal. Thus, in the setting of unknown post-change distributions, robust tests enjoy both optimality and better computational efficiency (as compared to GLR- or mixture-based tests). Motivated by these observations, in this paper we will focus on robust tests. 

% \subsection{Issues with Likelihood Ratio-Based Algorithms}
% \label{sec:LLRissues}
% In many machine learning applications, the data models may be high-dimensional and, in some cases, may not lend themselves to explicit distributions. For example, energy-based models~\cite{LeCun2006ATO} capture dependencies between observed and latent variables based on their associated energy (an unnormalized probability), and score-based deep generative models~\cite{song2020score} generate high-quality images by learning the score function (the gradient of the log density function). These models can be computationally cumbersome to normalize themselves as probabilistic density functions, and therefore likelihood-based change detection algorithms are computationally expensive in implementation. In Subsection~\ref{subsec:issues_llr_cusum}, we show this difficulty with two examples. When the full knowledge of pre- and post-change distributions is not available, the data-generating distributions must be modeled using the available data. In such scenarios, likelihood-based detection algorithms do not perform as well as expected. For instance, by numerical results, \citet{chen2015graph} showed issues with the performance of generalized likelihood ratio-based algorithms when the dimension of data increases. For image datasets, \cite{nalisnick2018deep} demonstrated the likelihood learned from flow-based deep generative models cannot distinguish distribution drifts from one dataset to another.

% % The Hyv\"arinen score is proposed by~\citet{hyvarinen2005estimation} for establishing an empirical estimation procedure for unnormalized models. This estimation procedure is also known as score matching. 
% Recently, \citet{wu2022score} proposed a score-based test statistic as a surrogate of the log-likelihood ratio statistic for unnormalized models based upon the Hyv\"arinen score~\cite{hyvarinen2005estimation}, which was originally developed for parameter estimation in unnormalized models. Their experimental results demonstrate significant performance gains and a reduction in computational complexity in testing unnormalized distribution drifts. Inspired by that, this paper will develop the sequential and robust version of the method developed in \cite{wu2022score} for the purpose of quickest change detection. 


% These algorithms typically rely on various pre-change and post-change statistics, e.g., cumulative means. A false alarm occurs when a change has not happened but is declared by a detection algorithm. However, reducing false alarms tends to prolong the time lag from the change event (if it occurs) to the time that a change is declared (often referred to as the detection delay). The quality of a change detection algorithm is often evaluated by the trade-off between its detection delay and false alarms, and its performance typically depends on the change point. In this light, for a given algorithm, we are interested in the minimax objectives that evaluate the trade-off between the worst conditional detection delay and the average run length to a false alarm~\cite{pollak1985optimal}.

% Classical developments in change detection assumed that pre- and post-change distributions are explicitly known. 


% In this case, if the observations before (respectively after) the change point are drawn \textit{i.i.d.} according to pre-change (respectively post-change) distribution, \citet{moustakides1986optimal} proved that the log-likelihood ratio (LLR) based CUSUM (described later in Subsection~\ref{subsec: SCUSUM_algorithm}) provides the optimal trade-off between worst-case detection delay and false alarm in the sense of Lorden's metric~\cite{lorden1971procedures}. Relaxing the independence assumption, \citet{lai1998information} developed a window-limited generalized CUSUM and proved its {asymptotic} optimality in the sense of Pollak's metric~\cite{pollak1985optimal}. Another state-of-the-art likelihood-based approach is the Shiryaev–Roberts (SR) procedure and its extensions~\cite{shiryaev1963optimum, roberts1966comparison}. These have been studied in both Bayesian and non-Bayesian settings~\cite{pollak1985optimal, moustakides2011numerical, tartakovsky2012third, polunchenko2010optimality}. For a more detailed discussion of the state-of-the-art theoretical results in this classical setting, we refer the reader to~\cite{veeravalli2014quickest, polunchenko2012state} and the references therein. 

% Unfortunately, the full knowledge of pre- and post-change distributions is not available in many modern machine learning applications, particularly when the data-generating distributions must be modeled using the available data. These models may be high-dimensional and, in some cases, may not lend themselves to explicit distributions. For example, energy-based models~\cite{LeCun2006ATO} capture dependencies between observed and latent variables based on their associated energy (an unnormalized probability), and score-based deep generative models~\cite{song2020score} generate high-quality images by learning the score function (the gradient of the log density function). These models can be computationally cumbersome to normalize themselves as probabilistic density functions, and therefore likelihood-based change detection algorithms are computationally expensive in implementation. In Subsection~\ref{subsec:issues_llr_cusum}, we show this difficulty with two examples. On the other hand, despite the fact that the likelihood can be modeled directly, the likelihood-based detection algorithms do not perform as well as expected. For instance, by numerical results, \citet{chen2015graph} showed that the performance of GLR-based algorithms is not good when the dimension of data increases. For image datasets, Nalisnick et al. in \cite{nalisnick2018deep} demonstrated the likelihood learned from flow-based deep generative models cannot distinguish distribution drifts from one dataset to another.
% In some cases, such as energy-based models~\cite{LeCun2006ATO}, graphical models~\cite{graphic_models}, and score-based deep generative models~\cite{song2020score}, the models can approximate a wide class of data-generating distributions. They may be explicit to a normalizing factor but computationally cumbersome to normalize. \textcolor{blue}{non-normalized distributions.}
% , graphical models~\cite{graphic_models}, and score-based deep generative models~\cite{song2020score}
%The likelihood of these models can be approximated by Monte Carlo-based methods (e.g.,~\cite{Hinton2002} and the references therein). However, change detection performance may suffer from underlying approximation errors. 

\subsection{Our Contributions}

% In this paper, we consider the problem of quickest change detection in high-dimensional models where the post-change distribution is not precisely known. As discussed above, a robust approach is preferable since it leads to computationally efficient procedures. Due to the difficulty associated with likelihood ratio-based algorithms discussed above, the robust algorithm from \cite{unnikrishnan2011minimax} cannot be employed in many machine learning applications. The main contribution of our work is three-fold.
We now summarize our contributions in this paper. 

$\bullet$ We propose a new robust score-based quickest change detection algorithm that can be applied to unnormalized models, namely, statistical models whose density involves an unknown normalizing constant. Specifically, we use the Hyv\"arinen score (\cite{hyvarinen2005estimation}) to propose a robust score-based variant of the SCUSUM algorithm from \cite{wuetal-aistat-2023}, which we refer to as RSCUSUM. In this variant and its subsequent theory, the role of Kullback-Leibler divergence in classical change detection is replaced with the Fisher divergence between the pre- and post-change distributions. Please see Section~\ref{sec:RSCUSUM_algorithm}.

% If the post-change law is precisely known, then the log-likelihood ratio statistic in the CUSUM algorithm is replaced by a scaled difference of the Hyv\"arinen scores.
$\bullet$ Our developed RSCUSUM algorithm can address unknown post-change models. Specifically, assuming that the post-change law belongs to a known family of distributions that is convex and compact, we identify a least favorable distribution that is closest to the pre-change distribution in terms of Fisher divergence. We then show that the RSCUSUM algorithm can consistently detect each post-change distribution from the family, and is robust in this sense. Please see Section~\ref{sec:theoritical_analysis}.

$\bullet$ 
We provide an effective method to identify the least favorable post-change distribution in a post-change family. This is in contrast to the setup in \cite{unnikrishnan2011minimax} where a stochastic boundedness characterization makes it harder to identify the least favorable distribution. Please see Section~\ref{sec:least_favorable_distribution}.

$\bullet$ 
From a theoretical perspective, unlike the CUSUM algorithm that leverages the fact that the likelihood ratios form a martingale under the pre-change model~\cite{lai1998information,woodroofe1982nonlinear}, the RSCUSUM algorithm is a score-based algorithm where cumulative scores do not enjoy a standard martingale characterization. Our delay and false alarm analysis for RSCUSUM is based on new analysis techniques. Please see Section~\ref{sec:theoritical_analysis}.
% The delay and false alarm analysis of the CUSUM algorithm are performed using martingale and renewal theoretical methods.

$\bullet$ We demonstrate the effectiveness of the RSCUSUM algorithm through simulation studies on Gaussian and Gauss-Bernoulli Restricted Boltzmann Machine (RBM) models. Please see Section~\ref{sec:results}. 
%and out-of-distribution (OOD) detection~\cite{ren2019likelihood,liu2020energy,fort2021exploring} on image datasets. OOD detection is the research area that focuses on identifying inputs that are not typically sampled from the distribution of training data, which has found many applications in anomaly detection and safety-critical systems.



% \textcolor{blue}{Robust quickest change detection literature.}
% In addressing this, various authors have considered the robust quickest change detection problem when the unknown distribution belongs to an explicit convex and closed set of distributions \cite{}\textcolor{blue}{Taposh-Cite}. 
% \textcolor{blue}{In particular, ...}
% In this setting, a robust change detection algorithm considers the element of $\mathcal{S}_1$ closest in KL-distance to $ \mathcal{P}_\infty$ and treats this as the postulated post-change distribution in CUSUM and other aforementioned algorithms.

% \textcolor{blue}{OOD literature.}
% Detecting out-of-distribution (OOD) changes by the underlying statistical characteristics of online data streams is an important problem commonly encountered in many machine applications.
% \textcolor{blue}{1) Difficulties for OOD detection: the unknown post-change distribution/online data stream} Unfortunately, in typical machine learning OOD scenarios, the post-change distribution is not known and only pre-distribution data is available \textcolor{blue}{Add Citation}.
% \textcolor{blue}{2) Composite hypothesis testing -inspired methods for OOD.}
% \textcolor{blue}{3) Change detection -inspired methods for OOD.}

% One common feature of all the aforementioned robust change detection algorithms in the quickest change detection literature is that the knowledge of distributions in $\mathcal{G}_\infty$ and $\mathcal{G}_1$, as these are used to calculate the likelihood ratios of the observations and in turn the change detection statistics.  
% In some machine learning applications, the data models may be high-dimensional and, in some cases, may not lend themselves to explicit distributions. 
% \textcolor{blue}{Drawbacks or bad examples of likelihood-based methods for OOD detection.} 

% This motivates our work in this paper, where we extend robust change detection algorithms to unnormalized models for unknown post-change distribution scenarios. In addition, we investigate the applications of out-of-distribution detection problems. 
% \textcolor{blue}{The contribution of our paper is summarized below.}
% \textcolor{blue}{Then, the outline of the paper.}
% d

\section{Problem Formulation}\label{sec:background}

%\subsection{Problem Formulation} 
\label{subsec:problem_formulation}
\noindent Let $\{X_n\}_{n\geq 1}$ denote a sequence of independent random variables defined on the probability space $(\Omega, \mathcal{F}, P_\nu)$. Let $\mathcal{F}_n$ be the $\sigma$-algebra generated by random variables $X_1,\; X_2, \;\dots,\; X_n$, and let $\mathcal{F}=\sigma(\cup_{n\geq 1}\mathcal{F}_n)$ be the $\sigma$-algebra generated by the union of sub-$\sigma$-algebras. 
Under $P_\nu$, $X_1, \; X_2, \;\dots,\; X_{\nu-1}$ are {i.i.d.} according to a density $p_\infty$ and $X_{\nu}, \; X_{\nu+1},\; \dots$ are {i.i.d.} according to a density $p_1$. We think of $\nu$ as the change point, $p_\infty$ as the pre-change density, and $p_1$ as the post-change density. We use $\mathbb{E}_{\nu}$ and $\text{Var}_{\nu}$ to denote the expectation and the variance associated with the
measure $P_\nu$, respectively. Thus, $\nu$ is seen as an unknown constant and we have an entire family $\{P_\nu\}_{1 \leq \nu \leq \infty}$ of change-point models, one for each possible change point. We use $P_\infty$ to denote the measure under which there is no change, with $\mathbb{E}_\infty$ denoting the corresponding expectation. 


A change detection algorithm is a stopping time $T$ with respect to the data stream $\{X_n\}_{n\geq 1}$:
$$
\{T \leq n\} \in \mathcal{F}_n.
$$
If $T\geq \nu$, we have made a \textit{delayed detection}; otherwise, a \textit{false alarm} has happened. 
%Intuitively, there is a trade-off between detection delay and false alarms. 
Our goal is to find a stopping time $T$ to optimize the trade-off between well-defined metrics on delay and false alarm. 
We consider two minimax problem formulations to find the best stopping rule. 
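
As a short illustration of the stopping-time condition, consider a threshold rule built from any statistic $W_n$ that is a measurable function of $X_1, \dots, X_n$. Here $W_n$ and the threshold $\tau$ are hypothetical notation introduced only for this example:
\begin{equation*}
    T = \inf\{n \geq 1 : W_n \geq \tau\}.
\end{equation*}
This rule is a stopping time because
\begin{equation*}
    \{T \leq n\} = \bigcup_{k=1}^{n} \{W_k \geq \tau\} \in \mathcal{F}_n,
\end{equation*}
since each event $\{W_k \geq \tau\}$ with $k \leq n$ is determined by $X_1, \dots, X_k$. By contrast, a rule that requires future observations, such as stopping at the time at which $W_k$ attains its maximum over a fixed horizon, is in general not a stopping time.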

% Let $\mathbb{E}_{\nu}$ and $\text{Var}_{\nu}$ respectively denote the expectation and the variance associated with the PDF $p_{\nu}(\cdot)$. 
To measure the detection performance of a stopping rule, we use the following minimax metric (\cite{lorden1971procedures}), the worst-case averaged detection delay (WADD):
\begin{equation*}
\mathcal{L}_{\texttt{WADD}}(T)\de \sup_{\nu\geq 1}\text{ess}\sup \mathbb{E}_{\nu}[(T-\nu+1)^{+}|\mathcal{F}_{\nu}],
\end{equation*}
where $(y)^{+}\de\max(y, 0)$ for any $y\in \mathbb{R}$. Here $\text{ess} \sup$ is the essential supremum, i.e., the supremum outside a set of measure zero. We also consider the version of minimax metric introduced in \citet{pollak1985optimal}, the worst conditional averaged detection delay (CADD):
\begin{equation*}
    \mathcal{L}_{\texttt{CADD}}(T)\de \sup_{\nu\geq 1}\mathbb{E}_{\nu}[T-\nu|T\geq \nu].
\end{equation*}
For false alarms, we consider the \textit{average run length} (ARL), which is defined as the mean time to a false alarm:
$$
\text{ARL}\de \mathbb{E}_\infty[T].
$$

We now formulate a robust quickest change detection problem; see \cite{unnikrishnan2011minimax}. We assume that pre- and post-change distributions are not precisely known. However, each is known within an uncertainty class: 
\begin{equation*}
    \begin{split}
        P_\infty &\in \mathcal{G}_\infty \\
        P_1 &\in \mathcal{G}_1. 
    \end{split}
\end{equation*}
For simplicity, in this paper, we will assume that the pre-change class is a singleton:
$$
\mathcal{G}_\infty = \{P_\infty\}.
$$
Our proposed method can also be extended to the
case of composite $\mathcal{G}_\infty$. The objective is to find a stopping rule to solve the following problem:
\begin{equation}
    \label{eq:lorden}
    \min_T \;\sup_{P_1 \in \mathcal{G}_1}\mathcal{L}_{\texttt{WADD}}(T)\;
    \quad \text{subject to}\;\quad \mathbb{E}_{\infty}[T]\geq \gamma,
\end{equation}
 where $\gamma$ is a constraint on the ARL. The delay $\mathcal{L}_{\texttt{WADD}}$ in the above problem is a function of the true post-change law $P_1$ and should be designated as
 $
 \mathcal{L}_{\texttt{WADD}}^{P_1}.
 $ 
 We will, however, suppress the superscript and simply write $\mathcal{L}_{\texttt{WADD}}$ for $\mathcal{L}_{\texttt{WADD}}^{P_1}$. 
 Thus, the goal in this problem is to find a stopping time $T$ to minimize the worst-case detection delay, subject to a constraint $\gamma$ on $\mathbb{E}_{\infty}[T]$. 
 
 % $\mathcal{L}_{\texttt{WADD}}(T)$ is called Lorden's metric, which evaluates the performance of a stopping rule $T$ in terms of detection delay, while $\mathbb{E}_{\infty}[T]$, also known as ARL, evaluates the stopping rule $T$ in terms of false alarms. Under the \textit{i.i.d} assumptions for pre-change (respectively post-change) observations, \citet{lorden1971procedures} showed that the asymptotic optimality of LLR-based CUSUM to Problem~\ref{eq:lorden} as $\gamma \to \infty$. Later, \citet{moustakides1986optimal} proved that the LLR-based CUSUM is an optimal solution under Lorden's metric for any $\gamma >0$. 
We are also interested in the version with the minimax metric introduced in \citet{pollak1985optimal}: 
\begin{equation}
    \label{eq:pollak}
    \min_T \;\sup_{P_1 \in \mathcal{G}_1} \mathcal{L}_{\texttt{CADD}}(T)\;
    \quad \text{subject to}\;\quad \mathbb{E}_{\infty}[T]\geq \gamma. 
\end{equation}
If the post-change family is also a singleton, $\mathcal{G}_1 = \{P_1\}$, then the above formulations reduce to the classical minimax formulations from the quickest change detection literature; see \cite{veeravalli2014quickest, tartakovsky2014sequential, poor2008quickest}. The optimal algorithm (exactly optimal for \eqref{eq:lorden} and asymptotically optimal for \eqref{eq:pollak}) is the CUSUM algorithm given by
\begin{equation*} 
    T_{\texttt{CUSUM}}=\inf\{n\geq 1:\Lambda(n)\geq \tau\},
\end{equation*}
where $\Lambda(n)$ is defined using the recursion
\begin{align}
    &\Lambda(0)=0, \nonumber \\
    &\Lambda(n) \de \biggl(\Lambda(n-1)+\log \frac{p_1(X_n)}{p_{\infty}(X_n)}\biggr)^{+}, \quad \forall n \geq 1, \label{eq:cusum_score}
 \end{align}
which leads to a computationally convenient stopping scheme. We recall that here $p_1$ is the post-change density and $p_\infty$ is the pre-change density. 
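
As an illustration, consider the hypothetical scalar Gaussian mean-shift model $p_\infty = \mathcal{N}(0,1)$ and $p_1 = \mathcal{N}(\mu, 1)$ with $\mu \neq 0$; this model is assumed only for the sake of the example. The log-likelihood ratio is then
\begin{equation*}
    \log \frac{p_1(x)}{p_\infty(x)} = -\frac{(x-\mu)^2}{2} + \frac{x^2}{2} = \mu x - \frac{\mu^2}{2},
\end{equation*}
so the recursion in \eqref{eq:cusum_score} becomes $\Lambda(n) = \bigl(\Lambda(n-1) + \mu X_n - \mu^2/2\bigr)^{+}$. The increment has mean $-\mu^2/2$ before the change and $+\mu^2/2$ after it, which is why the statistic stays near zero before $\nu$ and drifts upward afterwards.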


The asymptotic performance of the CUSUM algorithm is also characterized in \cite{lorden1971procedures} and \cite{lai1998information}. Specifically, it is shown that, as $\gamma \rightarrow \infty$,
\begin{align*}
    \mathcal{L}_{\texttt{WADD}}(T_{\texttt{CUSUM}}) \sim \mathcal{L}_{\texttt{CADD}}(T_{\texttt{CUSUM}})\sim \frac{\log \gamma}{\mathbb{D}_{\texttt{KL}}(P_{1}\|P_{\infty})}.
\end{align*}
Here $\mathbb{D}_{\texttt{KL}}(P_{1}\|P_{\infty})$ is the Kullback-Leibler divergence between the post-change distribution and pre-change distribution:
$$
\mathbb{D}_{\texttt{KL}}(P_{1}\|P_{\infty}) = \int p_1(x) \log \frac{p_1(x)}{p_\infty(x)} dx, 
$$
and the notation $g(c)\sim h(c)$ as $c\to c_0$ indicates that $\frac{g(c)}{h(c)} \to 1$ as $c\to c_0$ for any two functions $c\mapsto g(c)$ and $c\mapsto h(c)$.
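
For instance, under the illustrative assumption that $p_\infty = \mathcal{N}(0,1)$ and $p_1=\mathcal{N}(\mu,1)$ with $\mu\neq 0$ (a model chosen here only as an example), the divergence evaluates to
\begin{equation*}
    \mathbb{D}_{\texttt{KL}}(P_{1}\|P_{\infty}) = \mathbb{E}_{X\sim P_1}\Bigl[\mu X - \frac{\mu^2}{2}\Bigr] = \frac{\mu^2}{2},
\end{equation*}
so the first-order detection delay of CUSUM is approximately $2\log\gamma/\mu^2$ for large $\gamma$. In this example, halving the mean shift roughly quadruples the delay at the same ARL constraint.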

Since the CUSUM algorithm uses the likelihood ratio to compute its statistic, it is not amenable to implementation for high-dimensional models (see \cite{wuetal-aistat-2023}), where the densities $p_1$ or $p_\infty$ are often known only up to a normalizing constant. 


% In addition, \citet{pollak1985optimal} provides the Shiryaev-Roberts-Pollak change detection algorithm as an asymptotically optimal solution to Problem~(\ref{eq:pollak}) when $\gamma \to \infty$. However, for any fixed $\gamma>0$, the optimal solution remains unsolved. It is worth noting that $\mathcal{L}_{\texttt{WADD}}(T)\geq \mathcal{L}_{\texttt{CADD}}(T)$ for any stopping rule $T$~\cite{banerjee2018quickest}. Because of this, we will quantify the detection delay of change detection algorithms under Pollak's metric.

% \subsection{Likelihood Ratio-based Robust CUSUM Algorithm}
% \label{subsec:llr-cusum}
% % In this section, we review the LLR-based CUSUM algorithm and present our Score-based CUSUM (SCUSUM) algorithm. Following the scheme of CUSUM, the proposed method can be used in a recursive way, which is not too demanding in computational and memory requirements for online implementation. 
% \noindent Let $p_{\infty}$ and $p_{1}$ be the density functions of pre- and post-change distributions. If the post-change law is known, then given the data stream $\{X_n\}_{n\geq 1}$, the stopping rule of the likelihood ratio-based CUSUM algorithm is defined by
% \begin{equation} 
% \label{eq:cusumrule}
%     T_{\texttt{CUSUM}}=\inf\{n\geq 1:\Lambda(n)\geq \tau\},
% \end{equation}
% where $\Lambda(n)$ is defined using the recursion
% \begin{align}
%     &\Lambda(0)=0, \nonumber \\
%     &\Lambda(n) \de \biggr(\Lambda(n-1)+\log \frac{p_1(X_n)}{p_{\infty}(X_n)}\biggr)^{+}, \forall n \geq 1, \label{eq:cusum_score}
%  \end{align}
% which leads to a computationally efficient stopping scheme (if the densities $p_1$ and $p_{\infty}$ are precisely known). 
% In \cite{moustakides1986optimal}, it is shown that the CUSUM algorithm is exactly optimal, for every fixed constraint $\gamma$, for Lorden's problem 
% \eqref{eq:lorden}. As pointed in \cite{lai1998information}, the algorithm is also asymptotically optimal for Pollak's problem \eqref{eq:pollak}. In \cite{lorden1971procedures} and \cite{lai1998information}, the asymptotic performance of the CUSUM algorithm is also characterized. Specifically, it is shown that as $\gamma \rightarrow \infty$.
% \begin{align}
% \label{eq:optimality_cusum}
%     \mathcal{L_{\texttt{WADD}}}(T_{\texttt{CUSUM}}) \sim \mathcal{L_{\texttt{CADD}}}(T_{\texttt{CUSUM}})\sim \frac{\log \gamma}{\mathbb{D}_{\texttt{KL}}(P_{1}\|P_{\infty})}.
% \end{align}
% Here $\mathbb{D}_{\texttt{KL}}(p_{1}\|p_{\infty})$ is the Kullback-Leibler divergence between the post-change density $p_1$) and pre-change distribution $p_{\infty}$:
% $$
% \mathbb{D}_{\texttt{KL}}(P_{1}\|P_{\infty}) = \int_x p_1(x) \log \frac{p_1(x)}{p_\infty(x)} dx, 
% $$
% and the notation $g(c)\sim h(c)$ as $c\to c_0$ indicates that $\frac{g(c)}{h(c)} \to 1$ as $c\to c_0$ for any two functions $c\mapsto g(c)$ and $c\mapsto h(c)$.

% The CUSUM algorithm can successfully detect a change in law from $p_1$ to $p_\infty$ because 
% \begin{equation}
% \label{eq:driftCUSUM}
%     \begin{split}
%         \int_x &\log \frac{p_1(x}{p_\infty(x)} p_1(x) dx = \mathbb{D}_{\texttt{KL}}(P_{1}\|P_{\infty}) > 0 \\
%          \int_x &\log \frac{p_1(x}{p_\infty(x)} p_\infty(x) dx = -\mathbb{D}_{\texttt{KL}}(P_\infty\|P_1) < 0.
%     \end{split}
% \end{equation}
% Thus, the mean of the increment of $\Lambda(n)$
% in \eqref{eq:cusum_score} before the change is negative, and after the change is positive. 

% If the post-change density $p_1$ is not known and assumed to belong to a family $\mathcal{G}_1$, then the test is designed using the least favorable distribution. Specifically, in \cite{unnikrishnan2011minimax}, it is assumed that there is a density $q_1 \in \mathcal{G}_1$ such that for every $p_1 \in \mathcal{G}_1$, 
% \begin{equation}
% \label{eq:leastfavunni}
%     \begin{split}
%         \log \frac{q_1(X)}{p_\infty(X)} \bigg|_{X \sim q_1} \; \; \prec \quad \; \; \log \frac{q_1(X)}{p_\infty(X)}\bigg|_{X \sim p_1} . 
%     \end{split}
% \end{equation}
% Here the notation $\prec$ is used to denote stochastic dominance: if $W$ and $Y$ are two random variables, then $W \prec Y$ if
% $$
% P(Y \geq t) \geq P(W \geq t), \quad \text{for all } t \in (-\infty, \infty). 
% $$
% If such a density $q_1$ exists in the post-change family, then the robust CUSUM is defined as the CUSUM test with $q_1$ used as the post-change density. Such a test is exactly optimal for the problem in \eqref{eq:lorden} under additional assumptions on the smoothness of densities, and asymptotically optimal for the problem in \eqref{eq:pollak}. We refer the reader to \cite{unnikrishnan2011minimax} for a more precise optimality statement. 

% We note that in the literature on quickest change detection, the issue of the unknown post-change model has also been addressed by using a generalized likelihood ratio (GLR) test or a mixture-based test. While these tests have strong optimality properties, they are computationally even more expensive than the robust test described above; see \cite{lorden1971procedures, lai1998information, tartakovsky2014sequential}. 

% As discussed in the introduction, the robust CUSUM algorithm discussed above has two major drawbacks: 1) Due to the complicated characterization of the least favorable distribution $q_1$ \eqref{eq:leastfavunni}, it is hard to identify in high-dimensional models. 2) The robust CUSUM is a likelihood ratio-based test and is thus computationally expensive to implement for high-dimensional models. 

% In Section~\ref{sec:RSCUSUM_algorithm}, we propose the RSCUSUM algorithm to mitigate these issues. 
% \begin{enumerate}
%     \item The RSCUSUM algorithm is based on Hyv\"arinen score (\cite{hyvarinen2005estimation}) and is invariant to normalizing constants. This makes it computationally efficient for high-dimensional models which are often only learnable within a normalizing constant. 
%     \item We define the notion of least favorable distribution differently in our paper. For us, the least favorable distribution has the least Fisher divergence with respect to the pre-change model. We also provide an efficient computational method to identify the least favorable distribution. 
% \end{enumerate}

% \noindent Let $\{X_n\}_{n\geq 1}$ denote a sequence of independent random observations that take values in a set $\mathcal{X}$. We use $\mathcal{P}(\mathcal{X})$ to denote a set of probability distributions on $\mathcal{X}$. We further use $\mathcal{F}_n$ to denote the $\sigma$-algebra generated by $(X_1, X_2, \cdots, X_n)$. At some unknown time $\nu\geq 1$, the data-generating distribution of the observations switches abruptly from one distribution to another, namely, the observations $X_1, \; X_2, \;\dots,\; X_{\nu-1}$ are \textit{i.i.d.} under a distribution $P_{\infty}$, and $X_{\nu}, \; X_{\nu+1},\; \dots$ are \textit{i.i.d.} under another distribution $P_1$. We shall intuitively think of $P_{\infty}$ and $P_1$ as distributions of normal and abnormal observations, respectively. We write $\nu=\infty$ when no change ever happens, namely, the entire series $\{X_n\}_{n\geq 1}$ follow the measure $P_{\infty}$. If $\nu=1$, then the entire set of observed values $\{X_n\}_{n\geq 1}$ are abnormal, adhering to the measure $P_{1}$. We use $P_{\nu}$ and $\mathbb{E}_{\nu}$ respectively to denote the probability measure and the expectation when the change happens at time $\nu$.
% % For $j\ge i \ge 1$, let $\mathbf{X}_{[i, j]}\de(X_i, \dots, X_j)$ denote a window of $j-i+1$ successive observations between time instances $i$ and $j$. For every fixed $\nu$, the change-of-regime in the data stream $\{X_n\}_{n\geq 1}$ results in a new probability measure $P_{\nu}^{(n)}$. Given the probability density functions (PDFs) $p_{\infty}$ and $p_1$ of distributions $P_{\infty}$ and $P_{1}$, respectively, the PDF $p_v^{(n)}$ of $P_{\nu}^{(n)}$ is given by:
% % \begin{equation*}
% %    p_{\nu}^{(n)}(\mathbf{X}_{[1, n]}) = \prod_{i=1}^{\nu-1}p_{\infty}(X_i)\prod_{j=\nu}^{n}p_{1}(X_j),\; \forall n\geq v\geq 1.
% % \end{equation*}
% % Since the above construction is true for any $n$, for notational simplicity, we will drop the superscript $n$ and write $p_{\nu}$ when there is no ambiguity. Let $\mathbb{E}_{\nu}$ and $\text{Var}_{\nu}$ respectively denote the expectation and the variance associated with the PDF $p_{\nu}(\cdot)$.

% In the robust quickest change detection problem, we assume that $P_1$ and $P_{\infty}$ are not known exactly, but are known to belong to some uncertainty classes of distributions, namely, $P_1\in \mathcal{P}_1$, $P_{\infty}\in \mathcal{P}_{\infty}$, and $\mathcal{P}_1, \mathcal{P}_{\infty}\subset\mathcal{P}(\mathcal{X})$. 
% We assume that the change point $\nu$ is unknown but is deterministic and $\mathcal{P}_1 \cap \mathcal{P}_\infty = \emptyset$. 
% The quickest change detection procedure is characterized by a stopping time $T$ with respect to the data stream $\{X_n\}_{n\geq 1}$:
% % A out-of-distribution change detection algorithm is a stopping time $T$ with respect to the data stream $\{X_n\}_{n\geq 1}$:
% $$
% \{T \leq n\} \in \mathcal{F}_n,
% $$
% which is also called the stopping rule through out this manuscript. 
% If $T\geq \nu$, we have made a \textit{delayed detection}; otherwise, a \textit{false alarm} has happened. 
% Intuitively, there is a trade-off between detection delay and false alarms. 
% We consider two minimax problem formulations to find the best stopping rule. 

% \textcolor{blue}{To modify Problem (2) and Problem (4) to the robust version, namely minimizing over the two uncertainty classes $\mathcal{P}_1$ and  $\mathcal{P}_{\infty}$.}
% In \cite{lorden1971procedures}, the following minimax metric, the worst-case averaged detection delay (WADD), is defined:
% \begin{equation}
% \label{eq: wadd}
% \mathcal{L}_{\texttt{WADD}}(T)\de \sup_{\nu\geq 1}\text{ess}\sup \mathbb{E}_{\nu}[(T-\nu+1)^{+}|\mathcal{F}_{\nu}],
% \end{equation}
% where $(y)^{+}\de\max(y, 0)$ for any $y\in \mathbb{R}$. This leads to the minimax optimization problem 
% \begin{equation}
%     \label{eq:lorden}
%     \min_T \;\mathcal{L}_{\texttt{WADD}}(T)\;
%     \text{subject to}\;\mathbb{E}_{\infty}[T]\geq \gamma.
% \end{equation}

% We are also interested in the version of minimax metric introduced in \cite{pollak1985optimal}, the worst conditional averaged detection delay (CADD):
% \begin{equation}
% \label{eq:cadd}
%     \mathcal{L}_{\texttt{CADD}}(T)\de \sup_{\nu\geq 1}\mathbb{E}_{\nu}[T-\nu|T\geq \nu].
% \end{equation}
% The optimization problem becomes
% \begin{equation}
%     \label{eq:pollak}
%     \min_T \;\mathcal{L}_{\texttt{CADD}}(T)\;
%     \text{subject to}\;\mathbb{E}_{\infty}[T]\geq \gamma. 
% \end{equation}


% \subsection{Classical Robust Quickest Change Detection}
% \label{subsec:llr-cusum}
% We next review the classical robust CUSUM algorithm.
% Given the data stream $\{X_n\}_{n\geq 1}$, the stopping rule of the likelihood ratio-based CUSUM algorithm is defined by
% \begin{equation*}
% \label{eq:cusumrule}
%     T_{\texttt{CUSUM}} \de \inf \biggl\{n\geq 1: \max_{1\leq k\leq n}\sum_{i=k}^n\log \frac{p_{1}(X_i)}{p_{\infty}(X_i)}\geq \tau \biggr\},
% \end{equation*}
% where the infimum of the empty set is defined to be $+\infty$, and $\tau>0$ is referred to as the stopping threshold. The value of this threshold is clearly related to the trade-off between detection delay and false alarms. It is known~\cite{lai1998information} that $T_{\texttt{CUSUM}}$ can be written as
% \begin{equation*} 
%     T_{\texttt{CUSUM}}=\inf\{n\geq 1:\Lambda(n)\geq \tau\},
% \end{equation*}
% where $\Lambda(n)$ is defined using the recursion
% \begin{align}
%     &\Lambda(0)=0, \nonumber \\
%     &\Lambda(n) \de \biggr(\Lambda(n-1)+\log \frac{p_1(X_n)}{p_{\infty}(X_n)}\biggr)^{+}, \forall n \geq 1, \label{eq:cusum_score}
%  \end{align}
% which leads to a computationally convenient stopping scheme. 

% In \cite{moustakides1986optimal}, it is shown that the CUSUM algorithm is exactly optimal, for every fixed constraint $\gamma$, for Lorden's problem 
% \eqref{eq: wadd}. As pointed in \cite{lai1998information}, the algorithm is also asymptotically optimal for Pollak's problem \eqref{eq:cadd}. In \cite{lorden1971procedures} and \cite{lai1998information}, the asymptotic performance of the CUSUM algorithm is also characterized. Specifically, it is shown that
% \begin{align}
% \label{eq:optimality_cusum}
%     \mathcal{L_{\texttt{WADD}}}(T_{\texttt{CUSUM}}) \sim \mathcal{L_{\texttt{CADD}}}(T_{\texttt{CUSUM}})\sim \frac{\log \gamma}{\mathbb{D}_{\texttt{KL}}(P_{1}\|P_{\infty})},
% \end{align}
% as $\gamma \rightarrow \infty$.
% Here $\mathbb{D}_{\texttt{KL}}(P_{1}\|P_{\infty})$ is the Kullback-Leibler divergence between the post-change distribution $P_{1}$ (associated with the density $p_1$) and pre-change distribution $P_{\infty}$ (associated with density $p_\infty$):
% $$
% \mathbb{D}_{\texttt{KL}}(P_{1}\|P_{\infty}) = \int_x p_1(x) \log \frac{p_1(x)}{p_\infty(x)} dx, 
% $$
% and the notation $g(c)\sim h(c)$ as $c\to c_0$ indicates that $\frac{g(c)}{h(c)} \to 1$ as $c\to c_0$ for any two functions $c\mapsto g(c)$ and $c\mapsto h(c)$.

% \textcolor{blue}{To introduce the robust CUSUM\cite{unnikrishnan2011minimax}. The next until Section 3 will be heavily edited.}
% Next we consider the case where the pre-change $\mathbb{P}_{\infty}$ is known and the post-change distributions $\mathbb{P}_{1}$ is known to belong to the parametric family $\mathcal{G}_1 = \{\mathbb{P}_{1}^{\theta} \mid\theta \in \Theta_1\}$. The family $\mathcal{G}_1$ is assumed to be convex and closed (in weak topology). 
% As noted that our framework readily extends to the case of non-parametric families but for simplicity, we present Robust CUSUM only in the parametric case. 

% Let $$Q_{1}  = \arg \min_{ W \in \mathcal{G}_1} \mathbb{D}_{kl}(W\| P_{\infty}),$$ 
% where the existence of $Q_{1}$ is guaranteed by closeness $\mathcal{G}_1$ and continuity of the KL-divergence as a function of its arguments.
% %Since $\mathcal{G}_{\infy} \cap \mathcal{G}_1 = \emptyset$, it follows that $\mathbb{D}_{kl}(Q_{\infty}, \mathbb{Q}_{1}) > 0$.

% The Robust CUSUM algorithm replaces the density $p_1(x)$ of post-change distributions $P_{1}$ with the density $q_1(x)$ of $Q_{1}$ in Equation (\ref{eq:cusum_score}). The performance of this algorithm is analyzed in \cite{unnikrishnan2011minimax}.

% We note that the assumption of full knowledge of $P_{\infty}$ in the above can be relaxed. Assuming that significant pre-change data is available, the data and a model class $\mathcal{G}_\infty$ can be used to model pre-change distribution $P_{\infty}$ by its \textit{shadow} $Q_{\infty} \in \mathbb{G}_{\infty}$ given by
% $$Q_{\infty}  = \arg \min_{ W \in \mathcal{G}_{\infty}} \mathbb{D}_{kl}(P_{\infty} \| W).$$
% This shadow $Q_{\infty}$ can be computed by maximizing the maximum likelihood of the observed pre-change data over the model class $\mathcal{G}_\infty$. The Robust CUSUM algorithm then replaces the densities $p_{\infty}(x)$ and $p_1(x)$ of pre- and post-change distributions $\mathbb{P}_{\infty}$ and  $\mathbb{P}_{1}$ with respectively densities $q_{\infty}(x)$ and $q_1(x)$ of $\mathbb{Q}_{\infty}$ and  $\mathbb{Q}_{1}$ in Equation (\ref{eq:cusum_score}).

% \noindent In this section, we propose our robust score-based CUSUM (SCUSUM) algorithm for unnormalized models. To this end, we first review Fisher divergence and Hyv\"arinen Score.

% Let $X$ be a random variable with values in $\mathcal{X}\subseteq \mathbb{R}^d$, and let $\mathcal{P}$ be a family of distributions over $\mathcal{X}$. 
% Let $P \in \mathcal{P} $ and $Q \in \mathcal{P}$ with $p$ and $q$ respectively denote their corresponding densities.  The Fisher divergence from $P$ to $Q$ is defined as
% \begin{align*}
%     \mathbb{D}_{\texttt{F}} (P \| Q) \de \mathbb{E}_{X\sim P} \left[\left \| \nabla_{\mathbf{x}} \log p(X)- \nabla_{\mathbf{x}} \log q(X)\right \|_2^2 \right],
% \end{align*}
% where $\|\cdot\|_2$ denotes the Euclidean norm. 

% The Fisher divergence is particularly suitable for working with unnormalized distributions because $\nabla_{\mathbf{x}} \log p(X)$ and $\nabla_{\mathbf{x}} \log q(X)$ remain invariant if $p$and $q$ scaled by positive constants.

% \subsection{Issues with the Likelihood Ratio-based CUSUM Algorithm}
% \label{subsec:issues_llr_cusum}
% % \textcolor{blue}{Add specific examples (EXP and RBM) to explain why not CUSUM. Further, clearly refer to the Table in the Numerical Results. (For the numerical results part, we can repeat the formulas for computation.)}
% \noindent We consider the pre- and post-change densities, $p_1$ and $p_{\infty}$, respectively. We assume that the densities are potentially known only up to a normalizing constant, i.e., we have unnormalized models. In other words, instead of $p_1(x)$ and $p_{\infty}(x)$, we are given $\tilde{p}_1(x)$ and $\tilde{p}_{\infty}(x)$ with
% \begin{equation*}
%     p_{i}(x) = \frac{\tilde{p}_{i}(x)}{\int_{x\in \mathcal{X}} \tilde{p}_i(x)dx}, \; i=1,\infty.
% \end{equation*}
% As discussed in the introduction, such models are occasionally encountered in several machine learning applications. 
% In many cases, the computation of the denominator (also known as the \textit{normalizing constant} or the \textit{partition function}) can be intractable when the integral is not analytic in a closed form. For low-dimensional cases, numerical integration can be used to approximate the function. However, the number of points required for approximating the integral may grow exponentially as a function of the dimension of data space. This approximation is computationally expensive for high-dimensional data. Hence, implementing the likelihood ratio-based CUSUM algorithm is computationally cumbersome for unnormalized models. Next, we provide two examples to show this issue.
% \begin{example}[Exponential Family] We consider a subfamily of the Exponential family belonging to pairwise interaction graphical models~\citep{yu2016statistical}. Let $X\in \mathbb{R}^d$ be the random variable, and let $p_{\tau}$ represent the density, which is formulated as
% \begin{align}
%     p_{\tau}(X) =\frac{1}{Z_{\tau}} \exp\left\{-\tau\left(\sum_{i=1}^dx_i^4+\sum_{1\leq i\leq d, i\leq j\leq d}x_i^2x_j^2\right)\right\},\nonumber
% \end{align}
% where $\tau\in \mathcal{T}\subset\mathbb{R}^{+}$ is the model parameter and $Z_{\tau}$ is the normalizing constant of $p_{\tau}(X)$. Here,\begin{equation*}
% \begin{split}
%     Z_{\tau} &= \int_{x_1}\cdots \int_{x_d}
%       \exp\left\{-\tau\left(\sum_{i=1}^dx_i^4 \right. \right. \\
%       &\quad \quad \quad \quad \quad  \left. \left. +\sum_{1\leq i\leq d, i\leq j\leq d}x_i^2x_j^2\right)\right\} dx_1\cdots d x_d.
%     \end{split}
% \end{equation*}
% As shown above, this integral cannot be computed in a closed form, and therefore the density $p_{\tau}$ cannot be computed in a closed form. Besides, the numerical approximation is time-consuming when $d$ is large. Particularly, in Section~\ref{sec: results}, we show that the likelihood ratio-based CUSUM cannot be implemented in a reasonable computational time when $d=4$.
% \end{example}
% \begin{example}[Gauss-Bernoulli Restricted Boltzmann Machine]
% Restricted Boltzmann Machine (RBM)~\citep{LeCun2006ATO} is a generative graphical model defined on a bi-partite graph of hidden and visible variables. In particular, we consider the Gauss-Bernoulli RBM (RBM), which has binary-valued hidden variables $H=(h_1, \ldots, h_{d_h})^{T}\in \{0,1\}^{d_h}$, real-valued visible variables $X=(x_1, \ldots, x_{d_x})^{T}\in R^{d_x}$, and the joint density 
% \begin{equation*}
% \begin{split}
%     p(X, H) &= \frac{1}{Z}\text{exp} \left\{
%     -\left(\frac{1}{2}\sum_{i=1}^{d_x}\sum_{j=1}^{d_h}\frac{x_i}{\sigma_i}W_{ij}h_j\right. \right.\\
%     &\quad \quad \quad \quad \left. \left.+\sum_{i=1}^{d_x}b_ix_i+\sum_{j=1}^{d_h}c_jh_j-\frac{1}{2}\sum_{i=1}^{d_{x}} \frac{x_{i}^{2}}{\sigma_{i}^{2}}\right)
%     \right\}, 
%     \end{split}
% \end{equation*}
% where model parameters $\theta = (\mathbf{W}, \mathbf{b}, \mathbf{c})$ and $Z$ is the normalizing constant of $p(X, H)$. We set $\sigma_i=1$ for all $i=1,\dots, d_x$. 

% Let $p_{\theta}$ represent the density of the visible variable $X$, which can be written as $$
% p_{\theta}(X)= \sum_{h\in \{0,1\}^{d_h}}p_{\theta}(X, H) = \frac{1}{Z_{\theta}}\exp\{-F_{\theta}(X)\},
% $$
% where $Z_{\theta}$ is the normalizing constant of $p_{\theta}(X)$, and $F_{\theta}(X)$ is the free energy given by 
% \begin{equation*}
%     F_{\theta}(X) = \frac{1}{2}\sum_{i=1}^{d_x} (x_{i}-b_i)^{2}\nonumber
%     -\sum_{j=1}^{d_h} \operatorname{Softplus}\left(\sum_{i=1}^{d_x} W_{i j}x_{i}+b_{j}\right).
% \end{equation*}
% The $\operatorname{Softplus}$ function is defined as $\operatorname{Softplus}(y) \de \log(1+\exp(y))$ with a default scale parameter $\beta=1$. The same computational difficulty occurs on $p_{\theta}(X)$, and therefore the likelihood of the RBM data may not be computed exactly in practice. 
% \end{example}

\section{Robust Quickest Change Detection for Unnormalized Models} \label{sec:RSCUSUM_algorithm}
\noindent In this section, we propose a robust score-based CUSUM (RSCUSUM) algorithm. We first review the SCUSUM algorithm proposed by \cite{wuetal-aistat-2023} to address the issues with likelihood ratio-based CUSUM for unnormalized models. SCUSUM is defined in terms of the Hyv\"arinen score \citep{hyvarinen2005estimation}, which circumvents the computation of the normalizing constant. Following the SCUSUM scheme, we use the Hyv\"arinen score and propose a robust variant that relaxes the requirement of knowing the true post-change distribution: we assume that the true post-change distribution is unknown but that its uncertainty class is known.

Recall from Section~\ref{subsec:problem_formulation} that under the measure $P_\infty$, there is no change, and the density of each random variable is $p_\infty$. In the rest of the paper, we also use $P_\infty$ to denote the law of $X_1$ under $P_\infty$. Similarly, we also use $P_1$ to denote the law of $X_1$ under $P_1$. The distinction will always be clear from the context. 

We provide the definition of the Hyv\"arinen Score below.
\begin{definition}[Hyv\"arinen Score] The Hyv\"arinen score of any measure $P$ (with density $p$) is a mapping $(X, P)\mapsto \mathcal{S}_{\texttt{H}}(X, P)$ given by 
    \begin{equation*}
        % \label{eq:hyv_score}
        \mathcal{S}_{\texttt{H}}(X, P) \de \frac{1}{2} \left \| \nabla_{X} \log p(X) \right \|_2^2 + \Delta_{X} \log p(X)
    \end{equation*}
 whenever it can be well defined. Here, $\|\cdot\|_2$ denotes the Euclidean norm, $\nabla_{X}$ and $\Delta_{X} = \sum_{i=1}^d \frac{\partial^2}{\partial x_i^2}$ respectively denote the gradient and the Laplacian operators acting on $X = (x_1, \cdots, x_d)^{\top}$.
\end{definition}
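
As a simple worked case, suppose (as an assumption made only for illustration) that $P=\mathcal{N}(\mu,\sigma^2 I_d)$ on $\mathbb{R}^d$. Then $\nabla_{X} \log p(X) = -(X-\mu)/\sigma^2$ and $\Delta_{X}\log p(X) = -d/\sigma^2$, so that
\begin{equation*}
    \mathcal{S}_{\texttt{H}}(X, P) = \frac{\|X-\mu\|_2^2}{2\sigma^4} - \frac{d}{\sigma^2}.
\end{equation*}
Neither term involves the normalizing constant $(2\pi\sigma^2)^{d/2}$ of the Gaussian density.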

By using the Hyv\"arinen Score in our algorithm, the role of Kullback-Leibler divergence in the theoretical analysis of the algorithm is replaced by the Fisher divergence. 

\begin{definition}[Fisher Divergence] The Fisher divergence between two probability measures $P$ and $Q$ (with densities $p$ and $q$) is defined by
\begin{align*}
    \mathbb{D}_{\texttt{F}} (P \| Q) \de \mathbb{E}_{X\sim P} \left[\left \| \nabla_{{X}} \log p(X)- \nabla_{{X}} \log q(X)\right \|_2^2 \right],
\end{align*}
whenever the integral is well defined. %It is worth noting that in general, one does not require the density functions to define a Fisher divergence for two distributions. We assume the existence of densities only for notational convenience.
\end{definition}
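
For example, under the illustrative assumption that $P=\mathcal{N}(\mu_P,\sigma^2 I_d)$ and $Q=\mathcal{N}(\mu_Q,\sigma^2 I_d)$ (the symbols $\mu_P$, $\mu_Q$, and $\sigma$ are introduced here only for this example), the difference $\nabla_{X} \log p(X) - \nabla_{X}\log q(X) = (\mu_P-\mu_Q)/\sigma^2$ does not depend on $X$, and hence
\begin{equation*}
    \mathbb{D}_{\texttt{F}} (P \| Q) = \frac{\|\mu_P-\mu_Q\|_2^2}{\sigma^4}.
\end{equation*}
For comparison, the Kullback-Leibler divergence in the same model is $\|\mu_P-\mu_Q\|_2^2/(2\sigma^2)$, so both divergences grow quadratically in the mean shift but scale differently with $\sigma$.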


Clearly, $\nabla_{{X}} \log p(X)$, $\nabla_{{X}} \log q(X)$, and $\Delta_{X} \log q(X)$ remain unchanged if $p$ and $q$ are scaled by any positive constant that does not depend on $X$. Hence, the Fisher divergence and the Hyv\"arinen score are \textit{scale-invariant} with respect to an arbitrary constant scaling of the density functions.
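
To make the invariance explicit, let $\tilde{p}(X) = c\,p(X)$ for a constant $c>0$ that does not depend on $X$ (the symbols $\tilde{p}$ and $c$ are introduced here for illustration only). Then
\begin{equation*}
    \nabla_{X} \log\tilde{p}(X) = \nabla_{X}\bigl(\log c + \log p(X)\bigr) = \nabla_{X}\log p(X), \qquad \Delta_{X} \log \tilde{p}(X) = \Delta_{X} \log p(X),
\end{equation*}
so the Hyv\"arinen score can be evaluated from an unnormalized density without computing its normalizing constant.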

% Under some mild regularity conditions on $p$ and $q$, \citet{hyvarinen2005estimation} showed that
% \begin{align*}
%     \mathbb{D}_{\texttt{F}} (P \| Q) =\mathbb{E}_{X\sim P} \left[\frac{1}{2}\left \| \nabla_{\mathbf{x}} \log p(X) \right \|_2^2 + \mathcal{S}_{\texttt{H}}(X, Q)\right],
% \end{align*}
% where $\mathcal{S}_{\texttt{H}}(X, Q)$ a \textit{scale-invariant} proper scoring function, referred to as the Hyv\"arinen score in the framework of proper scoring rules~\cite{parry2012proper} (see a precise definition below). 
%\subsection{Least Favorable Distributions}\label{subsec:lfds}
The SCUSUM algorithm~\citep{wuetal-aistat-2023} assumes that the true pre- and post-change distributions $P_{\infty}$ and $P_{1}$ are known. It defines the detection score by
\begin{equation}
    \tilde{z}_{\lambda}(X)\de \lambda\bigl(\mathcal{S}_{\texttt{H}}(X, P_{\infty})-\mathcal{S}_{\texttt{H}}(X, P_{1})\bigr).
\end{equation}
In practice, however, the true post-change distribution is rarely known, particularly for online data streams. We assume that pre-change data is available. This data and a model class $\mathcal{G}_\infty$ are used to learn the pre-change distribution $P_{\infty}$. The post-change distribution $P_{1}$ is assumed to be modeled by an unknown element of a parametric family $\mathcal{G}_1 = \{G_{\theta}:\,\theta \in \Theta_1\}$. We note that our framework readily extends to the case of non-parametric families, but for simplicity, we present our results only in the parametric case.

We now define the notion of a least favorable distribution. This approach to defining the least favorable distribution for quickest change detection is novel. 
\begin{definition}[Least Favorable Distribution (LFD)]
\label{defination: LFD}
    Assume that the family $\mathcal{G}_1 = \{G_{\theta} :\,\theta \in \Theta_1\}$ is convex and compact. We define
    \begin{equation}
\label{eq:Q1LFD}
   Q_{1}  = \arg \min_{ G_{\theta} \in \mathcal{G}_1} \mathbb{D}_{\texttt{F}}( G_{\theta} \| P_{\infty}).
\end{equation}
\end{definition}
% \begin{equation}
% \label{eq:Q1LFD}
%    Q_{1}  = \arg \min_{ W \in \mathcal{G}_1} \mathbb{D}_{F}( W \| P_{\infty}), 
% \end{equation}

The existence of $Q_{1}$ is guaranteed by the compactness of $\mathcal{G}_1$ and the continuity of the Fisher divergence as a function of its arguments. Thus, $Q_1$ is the closest element of $\mathcal{G}_1$ to $P_\infty$ in the Fisher-divergence sense. 
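
As a hedged illustration, take $P_\infty=\mathcal{N}(0,1)$ and the location family $G_\theta = \mathcal{N}(\theta,1)$ with density $g_\theta$ and $\theta\in\Theta_1=[a,b]$, where $0<a<b$. This example and its notation are our own, and it only illustrates the minimization in \eqref{eq:Q1LFD}; we do not verify the convexity assumption on $\mathcal{G}_1$ here. Since $\nabla_{X}\log g_\theta(X) - \nabla_{X} \log p_\infty(X) = \theta$ for every $X$, we have
\begin{equation*}
    \mathbb{D}_{\texttt{F}}( G_{\theta} \| P_{\infty}) = \theta^2,
\end{equation*}
which is minimized over $[a,b]$ at $\theta=a$. Hence $Q_1 = \mathcal{N}(a,1)$, the post-change candidate with the smallest mean shift.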



% The motivation for this approach comes from the fact that for some Gaussian examples, the stochastic boundedness conditions of \cite{unnikrishnan2011minimax} is satisfied by the distribution that is closest in the KL-divergence sense to the pre-change law. The analysis of our algorithm will reveal that for our score-based method, the role of KL-divergence is played by Fisher divergence. 

% We assume that pre-change data is available. This data and a model class $\mathcal{G}_\infty$ is used to model/select a pre-change distribution $P_{\infty}$ (with density 4p_\infty$) by an element $Q_{\infty} \in \mathcal{G}_\infty$. The post-change distributions $P_{1}$ are assumed to be modeled by an unknown element of a parametric family $\mathcal{G}_1 = \{Q^{\theta} \mid\theta \in \Theta_1\}$. We note that our framework readily extends to the case of non-parametric families but for simplicity, we present our results only in the parametric case.


%\textcolor{blue}{We need a compact and convex set of distributions for the uncertainty class.}

%\subsection{Robust Score-based CUSUM Algorithm} \label{subsec:RSCUSUM}
% Recall that the detection score of LLR-based Robust CUSUM can be rewritten as 
% $$
% \log \frac{q_1(X_n)}{p_{\infty}(X_n)} = \mathcal{S}_{\texttt{L}}(X_n, P_{\infty})-\mathcal{S}_{\texttt{L}}(X_n, Q_1).$$
% Our key idea is to replace the role of log-score and KL divergence in the Robust Score-Based CUSUM (RSCUSUM) algorithm with respectively those of the Hyv\"arinen score and the Fisher Divergence. 

% To this end, we first consider the general case where the pre-change $\mathbb{P}_{\infty}$ is known (referred to as the \textit{well-specified} case) and the post-change distributions $\mathbb{P}_{1}$ is known to belong to the parametric family $\mathcal{G}_1 = \{Q^{\theta} \mid\theta \in \Theta_1\}$. The family $\mathcal{G}_1$ is assumed to be convex and closed (in weak-topology). 
% As noted that our frame-work readily extend to the case of non-parametric families but for simplicity, we present Robust Score-based CUSUM only in the parametric case. 

% Let $$Q_{1}  = \arg \min_{ W \in \mathcal{G}_1} \mathbb{D}_{F}( W \| P_{\infty}),$$ 
% where the existence of $Q_{1}$ can be guaranteed by closeness $\mathcal{G}_1$ and continuity of the Fisher-divergence as a function of its arguments. Specifically, let $X$ represent a generic random variable defined on the probability space $(\Omega, \mathcal{F}, P)$. 
Given the pre-change law $P_\infty$ (with density $p_\infty$), we now use $Q_1$ and its density $q_1$ to design the RSCUSUM algorithm. 
We define the instantaneous RSCUSUM score function $X\mapsto z_{\lambda}(X)$ by 
\begin{equation}
\label{eq:scusum_instantaneous}
    z_{\lambda}(X) \de \lambda\bigl(\mathcal{S}_{\texttt{H}}(X, P_{\infty})-\mathcal{S}_{\texttt{H}}(X, Q_{1})\bigr),
\end{equation}
where $\lambda>0$ is a pre-selected multiplier, and $\mathcal{S}_{\texttt{H}}(X, P_{\infty})$ and $\mathcal{S}_{\texttt{H}}(X, Q_1)$ are the Hyv\"arinen score functions of $P_\infty$ and $Q_1$, respectively. If the post-change model is precisely known, then $Q_1$ in the above equation is replaced by the known post-change law, and RSCUSUM is identical to SCUSUM~\citep{wuetal-aistat-2023}. In Section~\ref{sec:theoritical_analysis}, we discuss the role of $\lambda$ in the RSCUSUM algorithm in more detail. 
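
As an illustrative sketch (not part of the formal development), assume that the Hyv\"arinen score takes its standard form $\mathcal{S}_{\texttt{H}}(x, P) = \frac{1}{2}\|\nabla_x \log p(x)\|_2^2 + \Delta_x \log p(x)$, and consider the hypothetical choice $P_{\infty}=\mathcal{N}(\mu_{\infty}, I_d)$ and $Q_1=\mathcal{N}(\mu_1, I_d)$. Then $\nabla_x \log p_{\infty}(x) = -(x-\mu_{\infty})$ and $\Delta_x \log p_{\infty}(x) = -d$, and similarly for $q_1$, so that
\begin{align*}
\mathcal{S}_{\texttt{H}}(x, P_{\infty}) &= \tfrac{1}{2}\|x-\mu_{\infty}\|_2^2 - d,\\
z_{\lambda}(x) &= \tfrac{\lambda}{2}\bigl(\|x-\mu_{\infty}\|_2^2 - \|x-\mu_1\|_2^2\bigr)\\
&= \lambda\,(\mu_1-\mu_{\infty})^T\Bigl(x - \tfrac{\mu_1+\mu_{\infty}}{2}\Bigr).
\end{align*}
Under these assumptions, the instantaneous score is $\lambda$ times the log-likelihood ratio $\log\bigl(q_1(x)/p_{\infty}(x)\bigr)$, so the score-based statistic reduces to a scaled version of the classical one in this special case.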

Our proposed stopping rule is given by 
% \begin{equation}
% \label{eq:SCUSUM_rule}
%     T_{\texttt{RSCUSUM}} \de \inf \biggl\{n\geq 1: \max_{1\leq k\leq n}\sum_{i=k}^nz_{\lambda}(X_i)\geq \tau \biggr\},
% \end{equation}
% where $\tau>0$ is a stopping threshold, which is pre-selected to control false alarms. Similar to the stopping scheme of CUSUM, the stopping rule of RSCUSUM can be written as
\begin{equation} 
\label{eq:SCUSUM_rule}
    T_{\texttt{RSCUSUM}}=\inf\{n\geq 1:Z(n)\geq \tau\},
\end{equation}
where $\tau>0$ is a stopping threshold that is pre-selected to control false alarms, and $Z(n)$ can be computed recursively:
\begin{align*}
    &Z(0)=0, \\
    &Z(n) \de (Z(n-1)+z_{\lambda}(X_n))^{+},\;\forall n\geq 1.
\end{align*}
The statistic $Z(n)$ is referred to as the detection score of RSCUSUM at time $n$. The RSCUSUM algorithm is summarized in Algorithm~\ref{algm:rscusum}.

%The hyper-parameter $\lambda$ is pre-determined to satisfy Condition~(\ref{eq:conditon}). 

\begin{algorithm}
\DontPrintSemicolon
\caption{RSCUSUM Detection Algorithm}
\label{algm:rscusum}
\KwInput{Hyv\"arinen score functions $\mathcal{S}_{\texttt{H}}(\cdot, P_{\infty})$ and $\mathcal{S}_{\texttt{H}}(\cdot, Q_{1})$ of pre-change distribution and least favorable distribution in $\mathcal{G}_1$, respectively.} 
% \Comment*[r]{$(p, X)\mapsto \mathcal{S}_{\texttt{H}}(X, P)$ is defined in Equation~(\ref{eq:hyv_score})}
\KwData{$m$ previous observations $\mathbf{X}_{[-m+1,0]}$ and the online data stream $\{X_n\}_{n\geq 1}$}
\SetKwProg{Fn}{Initialization}{:}{}
  \Fn{}{
       Current time $k=0$, $\lambda>0$, $\tau>0$, and $Z(0)=0$}
\While{$Z(k)<\tau$}{
$k = k+1$\\
Update $z_{\lambda}(X_k) = \lambda(\mathcal{S}_{\texttt{H}}(X_{k}, P_{\infty})-\mathcal{S}_{\texttt{H}}(X_{k}, Q_{1}))$\\
Update $Z(k) = \max(Z(k-1)+z_{\lambda}(X_k), 0)$\;
}
Record the current time $k$ as the stopping time $T_{\texttt{RSCUSUM}}$\;
\KwOutput{$T_{\texttt{RSCUSUM}}$}
\end{algorithm}
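
To make the recursion concrete, consider a small hypothetical trace (the numbers are chosen only for illustration). Suppose $\tau = 1.5$ and the first three instantaneous scores are $z_{\lambda}(X_1)=-0.5$, $z_{\lambda}(X_2)=1.2$, and $z_{\lambda}(X_3)=0.8$. Then
\begin{align*}
Z(1) &= (0-0.5)^{+} = 0,\\
Z(2) &= (0+1.2)^{+} = 1.2 < \tau,\\
Z(3) &= (1.2+0.8)^{+} = 2.0 \geq \tau,
\end{align*}
so the algorithm stops at $T_{\texttt{RSCUSUM}}=3$. The truncation at zero in the first step discards the negative evidence accumulated before the change, which is what allows the statistic to react quickly once positive increments arrive.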

\section{Delay and False Alarm Analysis of the RSCUSUM Algorithm} \label{sec:theoritical_analysis}
\noindent In this section, we provide delay and false alarm analysis of the RSCUSUM algorithm. %using the same notations and under the same problem setting defined in Section~\ref{sec: background} and Section~\ref{sec: SCUSUM}. 
We introduce two assumptions: 1) $P_{\infty}\notin\mathcal{G}_1$, and 2) the same mild regularity conditions introduced in \cite{hyvarinen2005estimation} so that the Hyv\"arinen score is well-defined.

%We first provide an overview of the results in this section. In Lemma~\ref{lemma: drifts}, we show that, just as in the RSCUSUM algorithm, the drift of the SCUSUM algorithm is negative before the change and positive after the change. The role played by the KL-divergence in the CUSUM algorithm is replaced by the Fisher divergence in the SCUSUM algorithm. 
% Our core results are presented in Theorems~\ref{thm:arl} and~\ref{thm:cond_edd}. In Theorem~\ref{thm:arl}, we provide a lower bound of the average run length when no change has occurred. %As discussed in the introduction, the challenge here is that the bound cannot be derived using classical martingale techniques, e.g. those employed in \cite{lai1998information}. 
% This is challenging because the RSCUSUM algorithm is based on scores and not log-likelihood ratios. The latter have martingale properties that are employed by classical proofs. Our novel proof technique is developed after Lemma~\ref{lemma: drifts}, in Lemma~\ref{lemma: lambda}, and in the proof of Theorem~\ref{thm:arl}. 
% In Theorem~\ref{thm:cond_edd}, we demonstrate an upper bound of the expected detection delay when a change point occurs at $v=1$. This, in turn, provides an upper bound on the $\mathcal{L}_{\texttt{WADD}}$ (see \eqref{eq: wadd}) of the RSCUSUM algorithm. 
 %In Proposition~\ref{prop: gaussian}, we consider a special case of multivariate Normal pre- and post-change distributions and discuss the asymptotic optimality of our algorithm in this particular case. Specifically, we show that in this case the KL-divergence and the Fisher divergence coincide and the SCUSUM algorithm has the same optimality properties as the CUSUM algorithm.
 
We first prove an important lemma for our problem. If the Fisher divergence is seen as a measure of distance between two probability measures, then the following lemma provides a reverse triangle inequality for this distance, under the mild assumption that the order of integrals and derivatives can be interchanged.
\begin{lemma}%[Technical Lemma]
\label{lemma:tech}
Let $P_{\infty}$ be the pre-change distribution, $Q_1\in \mathcal{G}_1$ be the least-favorable distribution (as defined in Equation~(\ref{eq:Q1LFD})), and $Q_2 \in \mathcal{G}_1$ be any other post-change distribution. Then 
\begin{equation*}
\mathbb{D}_{\texttt{F}}\left(Q_1 \| P_{\infty}\right)\leq \mathbb{D}_{\texttt{F}}\left(Q_2 \| P_{\infty}\right) - \mathbb{D}_{\texttt{F}}\left(Q_2 \| Q_1\right).
\end{equation*}
\end{lemma}
\begin{proof}
Consider a convex set of densities \begin{align*}
\bigl\{x\mapsto q_{\xi}(x): q_{\xi}(x)=\xi q_1(x)+(1-\xi) q_2(x), \xi \in [0,1]\bigr\},
\end{align*}
where $q_1$ and $q_2$ are densities of $Q_1$ and $Q_2$, respectively. 
Let $Q_{\xi}$ denote the distribution characterized by density $q_{\xi}$. 
We note that $Q_{\xi} \in \mathcal{G}_1$ due to the convexity assumption on $\mathcal{G}_1$. We use $\mathcal{L}(\xi)$ to denote the Fisher divergence $\mathbb{D}_{\texttt{F}} \left(Q_{\xi}\| P_{\infty}\right)$, and
\begin{align*}
    \mathcal{L}(\xi)&=\int \big\|\nabla \log q_{\xi}-\nabla\log p_{\infty}\big\|^2 q_{\xi} dx\\
    &=\int \big\| \nabla \log \bigl(\xi q_1+(1-\xi)q_2\bigr)-\nabla \log p_{\infty} \big\|^2\\
    &\qquad \qquad \qquad \qquad \qquad \bigl(\xi q_1+(1-\xi)q_2\bigr) dx.
\end{align*}
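Since $Q_1$ minimizes the Fisher divergence to $P_{\infty}$ over $\mathcal{G}_1$, the function $\mathcal{L}(\xi)$ is minimized at $\xi=1$, and hence $\frac{\partial \mathcal{L}(\xi)}{\partial\xi}\big|_{\xi=1^-}\le 0$. 
Writing $\mathcal{L}^{\prime}(\xi)=\frac{\partial \mathcal{L}(\xi)}{\partial\xi}$ and interchanging differentiation and integration, we have 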
\begin{align*}
&\mathcal{L}^{\prime}(\xi)=\int\bigl(q_1-q_2\bigr)\big\|\nabla \log q_{\xi}-\nabla \log p_{\infty}\big\|^2 d x\\
& \quad+\int 2q_{\xi} \nabla\left( \frac{q_1-q_2}{q_{\xi}} \right)^T \bigl(\nabla\log q_{\xi} -\nabla \log p_{\infty}\bigr)dx.
        %transport of the gradient - a vector
\end{align*}
This implies 
\begin{align}
\label{eq:diff}
&\mathcal{L}^{\prime}(1^{-})= \int\bigl(q_1-q_2\bigr)\big\|\nabla \log q_1- \nabla \log p_{\infty}\big\|^2 dx\notag\\
    & \quad +\int 2q_1 \nabla\left(\frac{q_1-q_2}{q_1}\right)^T\bigl(\nabla \log q_1-\nabla \log p_{\infty}\bigr) dx\notag\\
&= \mathbb{D}_{\texttt{F}}\left(Q_1 \| P_{\infty}\right)-\int\underbrace{ q_2\big\|\nabla\log q_1-\nabla \log p_{\infty}\big\|^2}_{\text{term 1}}\notag\\
    &\qquad \quad+\underbrace{2 q_1 \nabla\left(\frac{q_1-q_2}{q_1}\right)^T \bigl(\nabla \log q_1-\nabla\log p_{\infty}\bigr)}_{\text{term 2}}dx.
\end{align}
For term 1, we have 
\begin{align}
\label{eq:term1}
&q_2\big\| \nabla \log q_1-\nabla\log p_{\infty}\big\|^2 \nonumber\\
&=q_2\big\|\nabla\log q_1-\nabla\log q_2\big \|^2+q_2\big\|\nabla\log q_2-\nabla\log p_{\infty}\big\|^2\nonumber\\
% &=\left\|\nabla\log q_1-\nabla\log q_2\right\|^2 +\left\|\nabla\log q_2-\nabla\log p_{\infty}\right\|^2 \nonumber\\
&\quad+\underbrace{2q_2\bigl(\nabla \log q_1-\nabla\log q_2\bigr)^T \bigl(\nabla \log q_2-\nabla\log p_{\infty}\bigr)}_{\text{term 1(a)}}.
\end{align}
We note that,
\begin{align}
    &\int_x q_2
\big\|\nabla\log q_1-\nabla\log q_2\big\|^2 dx = \mathbb{D}_{\texttt{F}}(Q_2\|Q_1), \label{eq: fisher1}\\
& \int_x q_2
\big\|\nabla\log q_2-\nabla\log p_{\infty}\big\|^2 dx = \mathbb{D}_{\texttt{F}}(Q_2\|P_{\infty}). \label{eq: fisher2}
\end{align}
For term 2, we note that 
\begin{align*}
\nabla\left( \frac{q_1-q_2}{q_1}\right)
% &=-\nabla\left(\frac{q_2}{q_1}\right)\notag\\
% &=-\frac{\left(\nabla q_2 ) q_1-\left(\nabla q_1\right) q_2\right.}{q_1^2}\notag\\
% &=-\frac{\nabla q_2}{q_1}+\frac{(\nabla q_1) q_2}{q_1^2}\notag\\
=\frac{q_2}{q_1}\bigl(\nabla\log q_1-\nabla\log q_2\bigr).
\end{align*}
Therefore, 
\begin{align}
&2 q_1 \nabla\left(\frac{q_1-q_2}{q_1}\right)^T \bigl(\nabla \log q_1-\nabla\log p_{\infty}\bigr)\nonumber\\
&=2q_2\bigl(\nabla \log q_1-\nabla\log q_2\bigr)^T \bigl(\nabla \log q_1-\nabla\log p_{\infty}\bigr).
\label{eq:term2}
\end{align}
Combining the last term in Equation (\ref{eq:term1}) with Equation (\ref{eq:term2}),
\begin{align}
\label{eq: comb_term12}
&-\text{term 1(a)} + \text{term 2} \notag\\
&=2q_2\bigl(\nabla \log q_1-\nabla\log q_2\bigr)^T \notag \\
    & \qquad \bigl(\nabla \log q_1-\nabla\log p_{\infty} -\nabla\log q_2 +\nabla\log p_{\infty} \bigr) \notag\\
&=2q_2\|\nabla \log q_1 - \nabla \log q_2\|^2.
\end{align}
Plugging Equations (\ref{eq: fisher1}), (\ref{eq: fisher2}), and (\ref{eq: comb_term12}) into Equation~(\ref{eq:diff}), 
\begin{align*}
&\mathcal{L}^{\prime}(1^{-})=\mathbb{D}_{\texttt{F}}\left(Q_1 \| P_{\infty}\right)+\mathbb{D}_{\texttt{F}}\left(Q_2 \| Q_1\right)-\mathbb{D}_{\texttt{F}}\left(Q_2 \| P_{\infty}\right).
\end{align*}
The result follows since $\frac{\partial \mathcal{L}(\xi)}{\partial \xi}\big|_{\xi=1^-}\le 0$.
\end{proof}
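
As a sanity check on Lemma~\ref{lemma:tech}, consider the following hypothetical Gaussian instance, given as a sketch rather than as part of the proof. Let $P_{\infty}=\mathcal{N}(\theta_*, I_d)$, let $\mathcal{G}_1$ be the convex hull of $\{\mathcal{N}(\theta, I_d): \theta\in\Theta_1\}$ for a compact convex set $\Theta_1$ with $\theta_*\notin\Theta_1$, and let $Q_1=\mathcal{N}(\theta_0, I_d)$, where $\theta_0$ is the Euclidean projection of $\theta_*$ onto $\Theta_1$ (this is the LFD by Theorem~\ref{theorem: lfd_example} with $V=I_d$). For $Q_2=\mathcal{N}(\theta_2, I_d)$ with $\theta_2\in\Theta_1$, and using the form of the Fisher divergence employed in the proof above, $\mathbb{D}_{\texttt{F}}(\mathcal{N}(a, I_d)\,\|\,\mathcal{N}(b, I_d)) = \|a-b\|_2^2$, the lemma reads
\begin{equation*}
\|\theta_0-\theta_*\|_2^2 \leq \|\theta_2-\theta_*\|_2^2 - \|\theta_2-\theta_0\|_2^2,
\end{equation*}
which is the classical obtuse-angle property of projections onto convex sets. For instance, in the scalar case with $\theta_*=0$, $\Theta_1=[1,3]$, $\theta_0=1$, and $\theta_2=3$, the inequality becomes $1 \leq 9-4 = 5$. A different normalization of the Fisher divergence rescales both sides by the same constant and leaves the inequality unchanged.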

We now use Lemma~\ref{lemma:tech} to prove a result on the consistency of our proposed RSCUSUM algorithm. 


\begin{lemma}[Positive and Negative Drifts]
\label{lemma: drifts}
Consider the instantaneous RSCUSUM score function $X\mapsto z_{\lambda}(X)$ as defined in Equation~(\ref{eq:scusum_instantaneous}). Recall that $P_1 \in \mathcal{G}_1$ is the true (but unknown) post-change distribution. Then,
\begin{align*}
\label{eq:expst}
    &\mathbb{E}_{\infty}\left[z_{\lambda}(X)\right] = -\lambda\mathbb{D}_{\texttt{F}}(P_{\infty} \| Q_1)<0,\; \text{and}\\
    &\mathbb{E}_{1}\left[z_{\lambda}(X)\right] 
 \ge \lambda\mathbb{D}_{\texttt{F}}(Q_1 \| P_{\infty})>0.
\end{align*}
\end{lemma}
\begin{proof}
Under some mild regularity conditions, \cite{hyvarinen2005estimation} proved that
\begin{align*}
    \mathbb{D}_{\texttt{F}} (P \| Q) =\mathbb{E}_{X\sim P} \left[\frac{1}{2}\left \| \nabla_{X} \log p(X) \right \|_2^2 + \mathcal{S}_{\texttt{H}}( X, Q)\right].
\end{align*}
We use $C_P$ to denote the term $\mathbb{E}_{X\sim P} \left[\frac{1}{2}\left \| \nabla_{X} \log p(X) \right \|_2^2\right]$. Then 
\begin{equation*}
\begin{split}
     \mathbb{E}_{\infty}&[\mathcal{S}_{\texttt{H}}(X, P_{\infty})-\mathcal{S}_{\texttt{H}}(X, Q_1)]\\
     &=\mathbb{D}_{\texttt{F}} (P_{\infty} \| P_{\infty})-C_{P_{\infty}}-\mathbb{D}_{\texttt{F}} (P_{\infty} \| Q_1)+C_{P_{\infty}}\\
     &=-\mathbb{D}_{\texttt{F}} (P_{\infty} \| Q_1),
     \end{split}
\end{equation*}
and 
\begin{equation*}
\begin{split}
     \mathbb{E}_{1}&[\mathcal{S}_{\texttt{H}}(X, P_{\infty})-\mathcal{S}_{\texttt{H}}(X, Q_1)]\\
     &=\mathbb{D}_{\texttt{F}} (P_1 \| P_{\infty})-C_{P_{1}}-\mathbb{D}_{\texttt{F}} (P_1 \| Q_1)+C_{P_{1}} \\
     &\ge \mathbb{D}_{\texttt{F}} (Q_1 \| P_{\infty}),
     \end{split}
\end{equation*}
where we applied Lemma \ref{lemma:tech}.

Since $\lambda>0$, the results follow.
\end{proof}

Lemma~\ref{lemma: drifts} shows that, prior to the change, the mean of the instantaneous RSCUSUM score $z_{\lambda}(X)$ is negative. Consequently, the accumulated score has a negative drift at each time $n$ prior to the change, and the RSCUSUM detection score $Z(n)$ is pushed toward zero before the change point. Intuitively, this makes a false alarm unlikely. In contrast, after the change, the instantaneous score has a positive mean, and the accumulated score has a positive drift. Thus, the RSCUSUM detection score increases toward infinity and leads to a change detection event.
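
The two drifts can be computed explicitly in a simple hypothetical scalar example, which we include only as an illustration. Assume the standard form $\mathcal{S}_{\texttt{H}}(x, P) = \frac{1}{2}\|\nabla_x \log p(x)\|_2^2 + \Delta_x \log p(x)$ of the Hyv\"arinen score and the normalization $\mathbb{D}_{\texttt{F}}(P\|Q)=\frac{1}{2}\mathbb{E}_{X\sim P}\|\nabla_x\log p(X)-\nabla_x\log q(X)\|_2^2$, which is consistent with the identity used in the proof of Lemma~\ref{lemma: drifts}. Let $P_{\infty}=\mathcal{N}(0,1)$, $Q_1=\mathcal{N}(1,1)$, and $P_1=\mathcal{N}(3,1)$, where $Q_1$ is the LFD when $\mathcal{G}_1$ is the convex hull of $\{\mathcal{N}(\theta,1):\theta\in[1,3]\}$ (see Theorem~\ref{theorem: lfd_example}). Then $z_{\lambda}(x)=\lambda\bigl(x-\tfrac{1}{2}\bigr)$, and
\begin{align*}
\mathbb{E}_{\infty}[z_{\lambda}(X)] &= -\tfrac{\lambda}{2} = -\lambda\,\mathbb{D}_{\texttt{F}}(P_{\infty}\|Q_1),\\
\mathbb{E}_{1}[z_{\lambda}(X)] &= \tfrac{5\lambda}{2} = \lambda\bigl(\mathbb{D}_{\texttt{F}}(P_1\|P_{\infty})-\mathbb{D}_{\texttt{F}}(P_1\|Q_1)\bigr)\\
&= \lambda\bigl(\tfrac{9}{2}-2\bigr) \geq \tfrac{\lambda}{2} = \lambda\,\mathbb{D}_{\texttt{F}}(Q_1\|P_{\infty}).
\end{align*}
In this example the post-change drift is five times larger than the guaranteed lower bound, which reflects the fact that the true post-change law is farther from $P_{\infty}$ than the LFD.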

Next, we discuss the choice of the multiplier $\lambda$ in the theoretical analysis. With a fixed stopping threshold, a larger value of $\lambda$ results in a smaller detection delay because the increments of the RSCUSUM detection score are larger and the threshold is reached sooner. However, a larger value of $\lambda$ also causes RSCUSUM to stop prematurely when no change occurs, leading to a larger false alarm probability. Hence, the value of $\lambda$ cannot be arbitrarily large (except in the degenerate case where $P_{\infty}(\mathcal{S}_{\texttt{H}}(X, P_{\infty})-\mathcal{S}_{\texttt{H}}(X, Q_1)\le 0)=1$, in which $z_{\lambda}(X)\le 0$ almost surely before the change). It needs to satisfy the following key condition:
\begin{equation}
\label{eq: condition}    
\mathbb{E}_{\infty}[\exp(z_{\lambda}(X))]\leq 1.
\end{equation}
We now present a technical lemma that guarantees the existence of a $\lambda$ satisfying Inequality~(\ref{eq: condition}). %and it can even make the equality of~(\ref{eq: condition}) hold.

\begin{lemma}[Existence of appropriate $\lambda$]
    \label{lemma: lambda}
There exists $\lambda>0$ such that Inequality~(\ref{eq: condition}) holds. Moreover, either 1) there exists $\lambda^{\star} \in (0,\infty)$ such that the equality of~(\ref{eq: condition}) holds, or 2) for all $\lambda>0$, the inequality of~(\ref{eq: condition}) is strict. As noted in \cite{wuetal-aistat-2023}, the second case is of no practical interest.
\end{lemma}
\begin{proof}
    We give the proof in the supplementary material.
\end{proof}
%However, in general, it is not easy to obtain a closed-form solution of $\lambda$ that satisfies condition~(\ref{eq:conditon}). We determine the value of hyperparameter $\lambda$ empirically by the history of events, which is discussed in Subsection~\ref{subsec: SCUSUM_algorithm}.
From now on, we consider a fixed $\lambda > 0$ that satisfies Inequality~(\ref{eq: condition}) to present our core results. In practice, it is possible to use $m$ past samples $\mathbf{X}_{[-m+1,0]}$ to determine the value of $\lambda$. In particular, $\lambda$ can be chosen as the positive root of the function $\lambda \mapsto \tilde{h}(\lambda)$ given by 
\begin{align*}
% \label{eq:empirical_conditon}
\tilde{h}(\lambda)\de\frac{1}{m}\sum_{i=1}^m[\exp(z_{\lambda}(X_{i-m}))]-1.
\end{align*}
By Lemma~\ref{lemma: lambda} and its related technical discussions, the above equation has a root greater than zero with high probability if $m$ is sufficiently large. If $\lambda$ is not chosen properly, the algorithm remains implementable, but optimal detection-delay performance is not guaranteed. We discuss this situation further in the supplementary material. 
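
As a hedged illustration of Condition~(\ref{eq: condition}), consider a hypothetical scalar example with $P_{\infty}=\mathcal{N}(0,1)$ and $Q_1=\mathcal{N}(1,1)$, and assume the standard form $\mathcal{S}_{\texttt{H}}(x, P) = \frac{1}{2}\|\nabla_x \log p(x)\|_2^2 + \Delta_x \log p(x)$ of the Hyv\"arinen score, so that $z_{\lambda}(x)=\lambda\bigl(x-\tfrac{1}{2}\bigr)$. Using the Gaussian moment generating function $\mathbb{E}_{\infty}[e^{\lambda X}]=e^{\lambda^2/2}$, we obtain
\begin{equation*}
\mathbb{E}_{\infty}[\exp(z_{\lambda}(X))] = \exp\Bigl(\tfrac{\lambda^2}{2}-\tfrac{\lambda}{2}\Bigr) = \exp\Bigl(\tfrac{\lambda(\lambda-1)}{2}\Bigr),
\end{equation*}
which is at most $1$ if and only if $\lambda\leq 1$, with equality at $\lambda^{\star}=1$. In this example, $z_{\lambda^{\star}}$ coincides with the log-likelihood ratio between $Q_1$ and $P_{\infty}$, and the empirical function $\tilde{h}(\lambda)$ would be expected to have its positive root near $1$ when $m$ is large.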

\begin{theorem}
\label{thm:arl}
Consider the stopping rule $T_{\texttt{RSCUSUM}}$ defined in Equation~(\ref{eq:SCUSUM_rule}). Then, for any $\tau>0$,
    \begin{equation*}
        \mathbb{E}_{\infty}[T_{\texttt{RSCUSUM}}]\geq  e^{\tau}.
    \end{equation*}
    To satisfy the constraint of $\mathbb{E}_{\infty}[T_{\texttt{RSCUSUM}}] \geq \gamma$, it is enough to set the threshold $\tau=\log \gamma$. 
\end{theorem}
\begin{proof}
    We give the proof in the supplementary material.
\end{proof}
%The quantity $\mathbb{E}_{\infty}[T_{\texttt{RSCUSUM}}]$ is also referred to as the \textit{Average Run Length} (ARL)~\cite{page1955test}. 
Theorem~\ref{thm:arl} implies that the ARL increases at least exponentially as the stopping threshold increases. %To satisfy the constraint

The following theorem gives the asymptotic performance of the RSCUSUM algorithm in terms of the detection delay under the control of the ARL.

\begin{theorem}
\label{thm:cond_edd}
   Subject to $\mathbb{E}_{\infty}[T_{\texttt{RSCUSUM}}]\geq \gamma>0$, the stopping rule $T_{\texttt{RSCUSUM}}$ satisfies
\begin{align*}
    \mathcal{L}_{\texttt{WADD}}&(T_{\texttt{RSCUSUM}}) \sim \mathcal{L}_{\texttt{CADD}}(T_{\texttt{RSCUSUM}}) \sim \mathbb{E}_1[T_{\texttt{RSCUSUM}}]\\
    &\sim \frac{\log \gamma}{\lambda (\mathbb{D}_{\texttt{F}}(P_1\|P_{\infty})-\mathbb{D}_{\texttt{F}}(P_1\|Q_{1}))}\\
    &\lesssim \frac{\log \gamma}{\lambda \mathbb{D}_{\texttt{F}}(Q_1 \| P_\infty)}, \quad \text{as $\gamma \to \infty$.  }
\end{align*} 
\end{theorem}
\begin{proof}
    We give the proof in the supplementary material. 
\end{proof}

In the above theorem, we have used the notation $g(c)\lesssim h(c)$ as $c\to c_0$ to indicate that $\limsup \frac{g(c)}{h(c)} \leq 1$ as $c\to c_0$ for any two functions $c\mapsto g(c)$ and $c\mapsto h(c)$.


%The value $\mathbb{E}_{1}[T_{\texttt{RSCUSUM}}]$ is also referred to as the \textit{Expected Detection Delay} (EDD) in the literature. 
Theorems~\ref{thm:arl} and~\ref{thm:cond_edd} imply that the \textit{expected detection delay} (EDD) increases linearly as the stopping threshold $\tau$ increases subject to a constraint on ARL.
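
To give a sense of scale, consider a hypothetical scalar example with $P_{\infty}=\mathcal{N}(0,1)$, $Q_1=\mathcal{N}(1,1)$, $P_1=\mathcal{N}(3,1)$, and $\lambda=1$, under the normalization $\mathbb{D}_{\texttt{F}}(P\|Q)=\frac{1}{2}\mathbb{E}_{X\sim P}\|\nabla_x\log p(X)-\nabla_x\log q(X)\|_2^2$. Then $\mathbb{D}_{\texttt{F}}(P_1\|P_{\infty})-\mathbb{D}_{\texttt{F}}(P_1\|Q_1)=\tfrac{9}{2}-2=\tfrac{5}{2}$ and $\mathbb{D}_{\texttt{F}}(Q_1\|P_{\infty})=\tfrac{1}{2}$. For a target of $\gamma=1000$, Theorem~\ref{thm:arl} suggests the threshold $\tau=\log 1000\approx 6.91$, and the first-order expressions in Theorem~\ref{thm:cond_edd} evaluate to
\begin{equation*}
\frac{\log\gamma}{\lambda\cdot\frac{5}{2}}\approx 2.76
\qquad\text{and}\qquad
\frac{\log\gamma}{\lambda\cdot\frac{1}{2}}\approx 13.8.
\end{equation*}
These values are only asymptotic approximations as $\gamma\to\infty$ and should not be read as exact delays at finite $\gamma$; they illustrate that the robust bound is governed by the LFD, while the actual delay improves when the true post-change law is farther from $P_{\infty}$.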
% \begin{remark}
% \label{remark: lambda}
% It is worth noting that although results of our core results hold for a pre-selected $\lambda$ that satisfied the Inequality~(\ref{eq: condition}), the effect of choosing any other $\lambda^{\prime}$ amounts to the scaling of all the increments of RSCUSUM by a constant factor of $\lambda^{\prime}/ \lambda$. This means that all of these results still hold adjusted for this scale factor. For instance, the result of Theorem \ref{thm:arl} can be modified to be written as 
% $$
% \mathbb{E}_{\infty}[T_{\texttt{RSCUSUM}}]\geq \exp \left\{\frac{\lambda  \tau}{\max(\lambda, \lambda^{\prime})}\right\},
% $$ 
% for any $\lambda^{\prime} > 0$. It is easy to see that this scaling will change the statement of Theorem \ref{thm:cond_edd} accordingly to 
% $$
% \mathbb{E}_{1}[T_{\texttt{RSCUSUM}}]\sim \frac{\max(\lambda, \lambda^{\prime})}{\lambda }\frac{\log \gamma}{\lambda^{\prime}(\mathbb{D}_{\texttt{F}}(P_1\|P_{\infty})-\mathbb{D}_{\texttt{F}}(P_1\|Q_{1}))},
% $$ 
% as $\gamma \to \infty$. In order to have the strongest results in Theorems~\ref{thm:arl} and \ref{thm:cond_edd}, we must choose $\lambda$ as close to $\lambda^*$ as possible.
% \end{remark}

% In the robust Likelihood-based CUSUM Algorithm, we assume that
% $\min_{  } \mathbb{D}_{\texttt{KL}}(Q_1\|P_{\infty}) $

% \begin{definition}[The gradient-log mixture] A set of distributions $\mathcal{P}$ is a \textit{gradient-log mixture} of density functions if for any $0\leq \lambda\leq 1$, if $\forall Q_1, p_2\in \mathcal{P}$, then there exists a density $p_{\star}\in \mathcal{P}$, such that \begin{align}
%     \nabla_x \log p_{\star} = \lambda \nabla_x \log Q_1+(1-\lambda) \nabla_x \log p_{2}.
%     \label{eq:gradient-log-mix}
% \end{align}
% \end{definition}

\section{Identification of the least favorable distribution}
\label{sec:least_favorable_distribution}
Consider a general parametric distribution family $\mathcal{P}$ defined on $\mathcal{X}$. We use $\mathcal{P}_m$ to denote a set of a finite number of distributions belonging to $\mathcal{P}$, namely \begin{align*}
    \mathcal{P}_m = \{P_i,\; i=1,\dots, m:\; P_i\in \mathcal{P}\},\; m\in \mathbb{N}^{+}.
\end{align*}
We use $p_i$ to denote the density of each distribution $P_i, \;i=1, \dots, m$. Then, we define a convex set of densities 
    \begin{multline}
    \label{eq:convex_post_family}
        \mathcal{A}_m \de \biggl\{ x \mapsto \sum_{i=1}^m \alpha_i p_i(x):  \sum_{i=1}^m \alpha_i=1, \alpha_i \geq 0\biggr\}. 
    \end{multline}
We further define a set of functions
    \begin{multline}
    \label{eq:convex_gradient_density}
        \mathcal{B}_m \de \biggl\{ x \mapsto \sum_{i=1}^m \beta_i(x) \nabla_x \log  p_i(x): \\
         \sum_{i=1}^m \beta_i(x)=1,\; \beta_i(x) \geq 0,\; p_i \in \mathcal{P}_m \biggr\}. 
    \end{multline}
Consider the pre-change distribution $P_{\infty}$ (with density $p_{\infty}$) such that $P_{\infty} \in \mathcal{P}$ and $P_{\infty}\notin \mathcal{A}_m$. We use $\mathbb{E}_{\infty}$ to denote expectation with respect to $p_{\infty}$. Next, we provide a result that identifies the LFD in $\mathcal{A}_m$ in terms of the Fisher divergence (see Definition~\ref{defination: LFD}). 
    
\begin{theorem} \label{thm_general_LFD}
    Assume that there exists an element $P_0 \in \mathcal{A}_m$  (with density $p_0$) such that
    \begin{multline}
      \mathbb{E}_{p_0}\biggl\{ \|\nabla_x \log p_0(X) -\nabla_x \log p_{\infty}(X) \|_2^2 \biggr\} \\
        = \min_{p \in \mathcal{A}_m, \phi \in \mathcal{B}_m} \mathbb{E}_{p} \biggl\{\|\phi (X) -\nabla_x \log p_{\infty}(X) \|_2^2 \biggr\}. 
        \label{eq10}
    \end{multline}
    Then, we have
    \begin{multline*}
        \mathbb{E}_{p_0}\biggl\{ \|\nabla_x \log p_0(X) -\nabla_x \log p_{\infty}(X) \|_2^2 \biggr\} \\
        = \min_{p \in \mathcal{A}_m} \mathbb{E}_{p} \biggl\{\|\nabla_x \log p(X) -\nabla_x \log p_{\infty}(X) \|_2^2 \biggr\}.
   \end{multline*}
\end{theorem}

\begin{proof}
    For any $p \in \mathcal{A}_m$, there exist $w_i$ such that $p = \sum_{i=1}^m w_i p_i$, where $w_i \geq 0$ and $\sum_{i=1}^m w_i = 1$. Direct calculations give
    \begin{align*}
        &\mathbb{E}_{p} \biggl\{ \|\nabla_x \log p(X) -\nabla_x \log p_{\infty}(X) \|_2^2 \biggr\}
        \\
        &= \mathbb{E}_{p}\biggl\{ \biggl\| \frac{\nabla_x p(X)}{ p(X)} -\nabla_x \log p_{\infty}(X)\biggr\|_2^2 \biggr\} 
         \\
        &= \mathbb{E}_{p}\biggl\{ \biggl\| \frac{\sum_{i=1}^m w_i \nabla_x p_i(X)}{ \sum_{i=1}^m w_i p_i(X)} -\nabla_x \log p_{\infty}(X) \biggr\|_2^2 \biggr\}  
        \\
        % &= \mathbb{E}_{\infty}\biggl\{ \biggl\| \frac{\sum_{i=1}^m w_i p_i(X) \nabla_x \log p_i(X)}{\sum_{i=1}^m w_i  p_i(X)} -\nabla_x \log p_{\infty}(X) \biggr\|_2^2 \biggr\} 
        % \\
        &= \mathbb{E}_{p}\biggl\{ \biggl\| \sum_{i=1}^m u_i(X) \nabla_x \log p_i(X) - \nabla_x \log p_{\infty}(X) \biggr\|_2^2 \biggr\},
        % \label{eq1}
    \end{align*}
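    where $u_i(X) = \frac{w_ip_i(X)}{\sum_{j=1}^m w_jp_j(X)}$ for all $i=1,\ldots,m$, so that $u_i(X)\geq 0$ and $\sum_{i=1}^m u_i(X)=1$. Clearly $\nabla_x \log u_i(x) - \nabla_x \log u_j(x)  = \nabla_x \log  p_i(x) - \nabla_x \log  p_j(x)$ for all $1 \le i,j \le m$.
    
    In particular, the map $x \mapsto \sum_{i=1}^m u_i(x) \nabla_x \log p_i(x)$ belongs to $\mathcal{B}_m$, so for every $p \in \mathcal{A}_m$ the quantity above is bounded below by the right-hand side of Condition~(\ref{eq10}). By assumption, this lower bound is attained at $p = p_0$, which concludes the proof. 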
\end{proof}
Theorem~\ref{thm_general_LFD} provides an efficient way to identify the LFD in a convex set with only knowledge of the gradient of the log density functions. 
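
The key identity in the proof above can be illustrated with a hypothetical two-component example (a sketch under the stated assumptions only). Let $p = w_1 p_1 + w_2 p_2$, where $p_i$ is the density of $\mathcal{N}(\mu_i, I_d)$, so that $\nabla_x \log p_i(x) = -(x-\mu_i)$. With $u_i(x) = w_i p_i(x)/p(x)$, we obtain
\begin{align*}
\nabla_x \log p(x) &= \sum_{i=1}^2 u_i(x)\,\nabla_x \log p_i(x)\\
&= -\Bigl(x - \sum_{i=1}^2 u_i(x)\,\mu_i\Bigr).
\end{align*}
Thus the score of the mixture at $x$ equals the score of a unit-covariance Gaussian centered at the responsibility-weighted mean $\sum_{i} u_i(x)\mu_i$, which is an element of $\mathcal{B}_m$ with $\beta_i = u_i$.
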
% We note that the above theorem remains true if we modify the definition of the set $\mathcal{B}_m$ as follows: 
% \begin{multline}
% \mathcal{B}_m \de \biggl\{ x \mapsto \sum_{i=1}^m \beta_i(x) \nabla_x \log  p_i(x):\\
% \sum_{i=1}^m \beta_i(x)=1,\; \beta_i(x) \geq 0,\; p_i \in \mathcal{P}_m \\
% \nabla_x \log \beta_i(x) - \nabla_x \log \beta_j(x)  = \nabla_x \log  p_i(x) - \nabla_x \log  p_j(x) \\
% \text{for all} \, 1 \le i,j \le m \biggr\}.
% \label{eq:convex_gradient_density_1} 
% \end{multline}

%It enables the infeasibility of RSCUSUM for unnormalized statistical models. Moreover, when the pre- and post-change distributions belong to some parametric family, we can identify the LFD in the parameter space. 
Next, we provide a method to find the LFD in a class of Gaussian mixture models. 
\begin{theorem}
\label{theorem: lfd_example}
    Let $G_{\theta}$ denote the $d$-dimensional Gaussian distribution centered at $\theta \in \mathbb{R}^d$ with a constant covariance matrix $V \in \mathbb{R}^{d \times d}$. Let the set  $\Theta_1 \subseteq \mathbb{R}^d$ be compact and convex.
    Consider the pre-change distribution $G_{\theta_*}$ and post-change distribution class $\mathcal{G}_1$ defined as all Gaussian mixture models given by the convex hull of
    $\{ G_{\theta}: \theta \in \Theta_1 \}$.  
    For any vector $v \in \mathbb{R}^d$, let $\|v\|_V = (v^T V^{-2} v)^{1/2}$.
    Assume that $\theta_* \not\in \Theta_1$, and $\theta_0 \in \Theta_1$ is the closest to $\theta_*$ under the $\|\cdot\|_V$ norm, namely $\|\theta_0-\theta_*\|_V = \min_{\theta \in \Theta_1 }\|\theta-\theta_*\|_V $. Then, $G_{\theta_0}$ is the closest to $G_{\theta_*}$ among $\mathcal{G}_1$ under the Fisher divergence.
\end{theorem}

\begin{proof}
Let $g_{\theta_0}$ and $g_{\theta_*}$ denote the densities of $G_{\theta_0}$ and $G_{\theta_*}$, respectively.  Clearly,
   \begin{multline*}
   \min_{g_{\theta} \in \mathcal{G}_1 }\,  \mathbb{E}_{g_{\theta}} \biggl\{\| \nabla_x \log g_{\theta}(X) -\nabla_x \log g_{\theta_*}(X) \|_2^2 \biggr\} \\
\le 
\mathbb{E}_{g_{\theta_0}} \biggl\{\| \nabla_x \log g_{\theta_0}(X) -\nabla_x \log g_{\theta_*}(X) \|_2^2 \biggr\}.
\end{multline*} 
We establish equality by proving the reverse inequality. To this end, consider an arbitrary element $G_1$ of $\mathcal{G}_1$, with density $g_1$. By the definition of the convex hull, this element can be written as $G_1 = \sum_{i=1}^m w_i G_{\theta_i}$ for some $m \ge 1$, weights $w_i \ge 0$, $i=1, \ldots, m$, with $\sum_{i=1}^{m} w_i = 1$, and parameters $\theta_i \in \Theta_1$, $i=1, \ldots, m$. As shown in the proof of Theorem~\ref{thm_general_LFD},
    \begin{align*}
        &\mathbb{E}_{g_1} \biggl\{ \|\nabla_x \log g_1(X) -\nabla_x \log g_{\theta_*}(X) \|_2^2 \biggr\}
        \\
        &= \mathbb{E}_{g_1}\biggl\{ \biggl\| \sum_{i=1}^m \beta_i(X) \nabla_x \log g_{\theta_i}(X) - \nabla_x \log g_{\theta_*}(X) \biggr\|_2^2 \biggr\},
    \end{align*}
    where $\beta_i(X) = \frac{w_ig_{\theta_i}(X)}{\sum_{j=1}^m w_j g_{\theta_j}(X)}$ for all $i=1,\ldots,m$.

Thus, we have
    \begin{align*}
& \mathbb{E}_{g_{1}} \biggl\{\| \nabla_x \log g_{1}(X) -\nabla_x \log g_{\theta_*}(X) \|_2^2 \biggr\} \\    
        &=  \mathbb{E}_{g_{1}}\biggl\| \sum_{i=1}^m \beta_i(X) (X - \theta_i) - (X - \theta_*) \biggr\|_V^2 \\
        &= \mathbb{E}_{g_{1}}\biggl\| \sum_{i=1}^m \beta_i(X) (\theta_* - \theta_i) \biggr\|_V^2.
        %&= \mathbb{E}_{g_{1}}\biggl\| \theta_* - \sum_{i=1}^m \beta_i(X) \theta_i \biggr\|_V^2.
        %\geq  \mathbb{E}_{g_{\theta}}\| \theta_* - \theta_0 \|_V^2 \\
        % &= 
        % \mathbb{E}_{g_{\theta_0}} \biggl\{\| \nabla_x \log g_{\theta_0}(X) -\nabla_x \log g_{\theta_*}(X) \|_2^2 \biggr\}.
    \end{align*}
    Since $\beta_i(X)\geq 0$ and $\sum_{i=1}^m \beta_i(X)=1$, the point $\sum_{i=1}^m \beta_i(X)\theta_i$ belongs to $\Theta_1$ by convexity. Using the assumption that $\|\theta_0-\theta_*\|_V = \min_{\theta \in \Theta_1 }\|\theta-\theta_*\|_V $, we have 
        \begin{align*}
        &\mathbb{E}_{g_{1}}\biggl\| \sum_{i=1}^m \beta_i(X) (\theta_* - \theta_i) \biggr\|_V^2 \\
        &= \mathbb{E}_{g_{1}}\biggl\| \theta_* - \sum_{i=1}^m \beta_i(X) \theta_i \biggr\|_V^2  \geq  \| \theta_* - \theta_0 \|_V^2 \\
        &= 
        \mathbb{E}_{g_{\theta_0}} \biggl\{\| \nabla_x \log g_{\theta_0}(X) -\nabla_x \log g_{\theta_*}(X) \|_2^2 \biggr\}.
    \end{align*}
This concludes the proof.
\end{proof}
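
The following hypothetical instance (chosen only for illustration) shows that the weighted norm $\|\cdot\|_V$, rather than the Euclidean norm, determines the LFD. Let $d=2$, $V=\operatorname{diag}(1,2)$, $\theta_*=(0,0)^T$, and $\Theta_1=\{(t,\,2-t)^T: t\in[0,2]\}$, which is compact and convex. Then $\|v\|_V^2 = v_1^2 + v_2^2/4$, and
\begin{equation*}
\|(t,\,2-t)^T-\theta_*\|_V^2 = t^2 + \tfrac{(2-t)^2}{4}
\end{equation*}
is minimized at $t=\tfrac{2}{5}$, giving $\theta_0=(\tfrac{2}{5},\tfrac{8}{5})^T$ with $\|\theta_0-\theta_*\|_V^2=\tfrac{4}{5}$. By contrast, the Euclidean projection of $\theta_*$ onto $\Theta_1$ is $(1,1)^T$, for which $\|(1,1)^T-\theta_*\|_V^2=\tfrac{5}{4}>\tfrac{4}{5}$. Under the assumptions of Theorem~\ref{theorem: lfd_example}, the LFD in this example is therefore $G_{\theta_0}$ and not the Gaussian centered at the Euclidean projection.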

For a general parametric family of potential post-change distributions, it may be difficult to identify the LFD. In Subsection~\ref{subsec: example_lfd}, we propose a method to find the LFD in parameter space. 

\section{Numerical Results} \label{sec:results}
In this section, we present numerical results for both synthetic and real data demonstrating the robustness of RSCUSUM. 
Specifically, we identify the LFD in $\mathcal{G}_1$, defined as the convex hull of given distributions $P_i$, $i=1,2, \ldots, m$. %where $m < \infty$. 
To this end, we minimize the Fisher divergence over the set $\mathcal{B}_m$ defined in Equation~(\ref{eq:convex_gradient_density}) and
invoke Theorem~\ref{thm_general_LFD}.
%when possible.
In general, we can then estimate $\nabla_x \log p_0(x)$ for the LFD by $\sum_{i=1}^{m} \beta_i(x) \nabla_x \log p_i(x)$.  

%To this end, we first provide a heuristic to find the LFD in the simulations below.

\subsection{Example of the Least Favorable Distribution}
\label{subsec: example_lfd}
We take the parametric family $\mathcal{P}$ to be, in turn, the multivariate normal distribution (MVN), a subfamily~\citep{yu2016statistical} of the exponential family (EXP), and the Gauss-Bernoulli Restricted Boltzmann Machine (RBM)~\citep{LeCun2006ATO}.
For example, in the case of MVN, 
\begin{align*}
    &\mathcal{G}_{\infty} = \{\mathcal{N}(\boldsymbol{\mu}_*, V_*)\},\\
    &\mathcal{G}_1= \left\{\sum_{i=1}^m \alpha_i\mathcal{N}(\boldsymbol{\mu}_i, V_i):\; \sum_{i=1}^m \alpha_i = 1, \;\forall \;\alpha_i\geq 0\right\}.
\end{align*}
Here the pre-change distribution is $P_{\infty}=\mathcal{N}(\boldsymbol{\mu}_*, V_*)$, and the uncertainty class $\mathcal{G}_1$ is constructed from a finite basis $\mathcal{P}_m=\{\mathcal{N}(\boldsymbol{\mu}_i, V_i), \;i=1,\ldots, m\}$ (see Equation~(\ref{eq:convex_post_family})). Each basis element $P_i$ is parameterized by the corresponding vector $\boldsymbol{\theta}_i=(\boldsymbol{\mu}_i, V_i)$. Without loss of generality, we assume $\boldsymbol{\theta}_1$ to be the closest to $\boldsymbol{\theta}_*=(\boldsymbol{\mu}_*, V_*)$ in $L_2$ (Euclidean) norm. 

By Theorem~\ref{thm_general_LFD}, it is sufficient to find $P_0$ such that Condition~\eqref{eq10} holds.  Any $\phi(x)\in\mathcal{B}_m$ is characterized by coefficients $\beta_j(\cdot), \;j=1,\ldots,m$ (see Equation~\eqref{eq:convex_gradient_density}).

We use a neural network $\operatorname{Softmax}_j\circ f_{\textit{NN}}(x)$ to estimate $\beta_j(\cdot)$, specifically, 
\begin{align*}
    \beta_j(x) = \operatorname{Softmax}_j\circ f_{\textit{NN}}(x),
\end{align*}
where $f_{\textit{NN}}$ is given by the feature extractor part of a multi-layer neural network with layer sizes 128-64-$m$ and $\operatorname{ReLU}$ activation functions in all hidden layers, followed by a final $\operatorname{Softmax}$ layer. Here $\operatorname{Softmax}_j$ denotes the $j$-th element of the Softmax function.
The use of the Softmax function ensures $\sum_{i=1}^m\beta_i(x)=1$ and $\beta_i(x) \ge 0$ for all $1 \le i \le m$.
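
For completeness, we recall the standard definition of the Softmax function, written here for a generic vector $a=(a_1,\ldots,a_m)\in\mathbb{R}^m$ (the symbol $a$ is introduced only for this illustration):
\begin{equation*}
\operatorname{Softmax}_j(a) = \frac{\exp(a_j)}{\sum_{k=1}^m \exp(a_k)}, \quad j=1,\ldots,m.
\end{equation*}
Each component lies in $(0,1)$ and the components sum to one, so the constraints in the definition of $\mathcal{B}_m$ hold by construction. For example, with $m=2$ and $a=(0,\log 3)$, we get $\operatorname{Softmax}(a)=(\tfrac{1}{4},\tfrac{3}{4})$.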

To identify $P_0$, we learn $f_{\textit{NN}}$ by minimizing the following loss function over the training sample $X_1, \ldots, X_N\sim P$:
\begin{equation*} 
    \mathcal{L} = \frac{1}{N}\sum_{i=1}^N \biggl\|\sum_{j=1}^m\beta_j(X_i)\nabla\log p_{j}(X_i)-\nabla\log p_{\infty}(X_i)\biggr\|_2^2,
\end{equation*}
where $P$ is updated at each epoch based on the learned coefficients $\beta_i(x)$ by
\[
    \nabla_{x}\log p(x) = \sum_{i=1}^m \beta_{i}(x)\nabla_{x}\log p_{i}(x).
\]
To generate samples from $P$, which is specified only through its score function $\nabla_{x} \log p(x)$, standard Markov chain Monte Carlo (MCMC) techniques (such as MALA) are employed. The neural network is trained using the Adam optimizer.
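
As an illustration of how the score drives such a sampler, the Langevin proposal underlying MALA can be sketched as follows. The step size $\epsilon>0$ and the symbols $x^{(t)}$, $\tilde{x}$, and $\xi^{(t)}$ are assumptions made for this sketch and are not settings taken from our experiments:
\begin{align*}
    \tilde{x} = x^{(t)} + \epsilon \sum_{i=1}^m \beta_i\bigl(x^{(t)}\bigr)\nabla_x \log p_i\bigl(x^{(t)}\bigr) + \sqrt{2\epsilon}\,\xi^{(t)}, \qquad \xi^{(t)}\sim\mathcal{N}(0, I).
\end{align*}
MALA then accepts or rejects the proposal $\tilde{x}$ with a Metropolis-Hastings step, while omitting that step yields the unadjusted Langevin algorithm.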

In Table~\ref{tab: lfd_coeffs}, we report the average value $\frac{1}{M}\sum_{i=1}^M\beta_j(Y_i)$ over the test sample $Y_1, \ldots, Y_M\sim P$ for the cases where the basis elements of $\mathcal{P}_m$ are MVN$_m$ (with mean shifts), MVN$_c$ (with covariance shifts), EXP, and RBMs. Details of $P_{\infty}$ and the basis elements of $\mathcal{P}_m$ are given in the Supplementary Material. In all cases, the average value of $\beta_1(y)$ is extremely close to $1$, while the average values of $\beta_j(y)$, $j=2,3,4$, are at most of order $10^{-5}$. This gives strong evidence that the LFD is achieved by one of the basis elements of $\mathcal{P}_m$, and Theorem~\ref{thm_general_LFD} can be invoked to give the LFD.
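
To see why coefficients concentrated on $j=1$ point to a single basis element, consider the idealized limit $\beta_1\equiv 1$ and $\beta_j\equiv 0$ for $j\geq 2$ (an idealization of the reported averages, not an exact property of the trained network). In this limit,
\begin{align*}
    \nabla_x \log p(x) &= \sum_{i=1}^m \beta_i(x)\nabla_x\log p_i(x) = \nabla_x \log p_1(x),\\
    \mathcal{L} &= \frac{1}{N}\sum_{i=1}^N \bigl\|\nabla\log p_{1}(X_i)-\nabla\log p_{\infty}(X_i)\bigr\|_2^2 .
\end{align*}
Thus $P$ shares its score with $P_1$ and, under the usual regularity conditions, coincides with $P_1$, while $\mathcal{L}$ becomes an empirical estimate of a constant multiple of the Fisher divergence from $P_1$ to $P_{\infty}$ (the constant depends on the normalization convention).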

\begin{table}[htbp]
    \centering
    \begin{tabular}{c|cccc}
    \toprule
         $j$ & $1$&$2$&$3$&$4$\\
         \midrule
        MVN$_m$ &1.00e+00& 4.90e-09& 2.43e-11& 6.29e-12\\
        MVN$_c$ &9.99e-01& 7.47e-06& 3.23e-08& 3.55e-08\\
        EXP & 9.99e-01& 2.84e-05& 1.37e-09& 1.01e-09\\
        RBM & 1.00e+00& 3.18e-33& 0.00e+00& 0.00e+00\\
        \bottomrule
    \end{tabular}
    \caption{Empirical average values of $\beta_j(x)$ over $10000$ test samples for the MVN$_m$, MVN$_c$, EXP, and RBM models.}
    \label{tab: lfd_coeffs}
\end{table}

\subsection{Synthetic Data} \label{subsec: synthetic_data}
As in Subsection~\ref{subsec: example_lfd}, we simulate synthetic data streams from MVNs and RBMs to evaluate the performance of RSCUSUM. The LFD in the uncertainty class is identified as in Subsection~\ref{subsec: example_lfd}. We also report the performance of SCUSUM~\citep{wuetal-aistat-2023}, which is not robust, for arbitrary \textit{wrong} distributions in the uncertainty class.

We consider a change detection scenario where the pre- and post-change distributions are modeled by MVN (respectively RBM) models with $m=4$. Both $P_\infty$ and the elements of the uncertainty class are created as described in detail in the Supplementary Material. We use Gibbs sampling with $1000$ iterations for RBMs.
In each trial, we treat one of $P_i\in \mathcal{P}_m, \, i=1,2,3,4$, as the \textit{true} post-change distribution. For each trial, we repeat the experiment for $1000$ runs.
%Further details of the distributions can be found in the supplementary materials. 
 
In all experiments, we set the change point to $\nu=50$ and the total length of each data stream to $10000$ to ensure that the generated data stream is long enough for detection.
We evaluate the detection delay for ARL values ranging from $100$ to $3000$. 

In Figure~\ref{fig:score}(a) and (b), we report the detection scores versus time for the MVN$_m$ and RBM experiments, respectively. The results demonstrate that the average increment of the detection scores is positive for RSCUSUM, while it is negative for the non-robust SCUSUM. This means that the non-robust SCUSUM fails to detect this post-change scenario, whereas the RSCUSUM algorithm detects it.
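
One way to read the sign of this average increment is sketched below under stated assumptions. Suppose the instantaneous detection score has the form $z_{\lambda}(x)=\lambda\bigl(S_H(x,P_{\infty})-S_H(x,Q)\bigr)$, where $S_H(x,Q)=\tfrac{1}{2}\|\nabla_x\log q(x)\|_2^2+\Delta_x\log q(x)$ is the Hyv\"arinen score, $\Delta_x$ is the Laplacian, $\lambda>0$ is a multiplier, $Q$ is the post-change model plugged into the detector, and $D_F(P\|Q)=\mathbb{E}_P\bigl[\tfrac{1}{2}\|\nabla_x\log p(X)-\nabla_x\log q(X)\|_2^2\bigr]$. These forms and symbols are assumed here for illustration. Under standard regularity conditions, integration by parts gives, for a true post-change distribution $P$,
\begin{align*}
    \mathbb{E}_P\bigl[z_{\lambda}(X)\bigr] = \lambda\bigl(D_F(P\|P_{\infty})-D_F(P\|Q)\bigr).
\end{align*}
Under these assumptions, the drift is positive only when $P$ is closer to $Q$ than to $P_{\infty}$ in Fisher divergence, which is consistent with negative average increments when a wrong $Q$ is used.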
\begin{figure}[htbp]
\centering
 \includegraphics[width= 0.9\linewidth]{results/score_mvn-rbm.pdf}
   \caption{Detection score versus time for (a) the MVN$_m$ and (b) the RBM experiments.}
 \label{fig:score}
\end{figure}


In Figure~\ref{fig:mvn-mean}(a) and (b), we show the empirical EDD against the log-scaled ARL for the MVN$_m$ and RBM experiments, respectively. The results demonstrate that RSCUSUM is robust and performs competitively in terms of detection delay. In particular, we observe that the EDD of RSCUSUM (left subplots) increases at a linear rate in all cases, while the EDD of the non-robust SCUSUM (right subplots) may in some cases increase at an exponential rate (compare the y-axis labels of the plots).
%\subsection{Extended Network Intrusion Data} 
%\label{subsec: real-data}
\begin{figure}[tbph]
\centering
 \includegraphics[width=0.9 \linewidth]{results/arl_mvn-rbm.pdf}
   \caption{EDD versus log-scaled ARL for (a) the MVN$_m$ and (b) the RBM experiments.}
 \label{fig:mvn-mean}
\end{figure}

%\clearpage

\section{Conclusions} \label{sec: conclusion}
In this work, we proposed the RSCUSUM algorithm, a robust score-based algorithm for quickest change detection when the post-change distribution is not precisely known.  
We defined the least favorable distribution in the sense of Fisher divergence. Using asymptotic analysis, we also analyzed the delay and false alarms of RSCUSUM in the sense of Lorden's and Pollak's metrics. 
We provided both theoretical and algorithmic methods for computing the least favorable distribution for unnormalized models. Numerical simulations were provided to demonstrate the performance of our robust algorithm.


\begin{acknowledgements} % will be removed in pdf for initial submission,
% (without ‘accepted’ option in \documentclass)
% so you can already fill it to test with the
% ‘accepted’ class option
Suya Wu and Vahid Tarokh were supported in part by Air Force Research Lab Award under grant number FA-8750-20-2-0504. Jie Ding was supported in part by the Office of Naval Research under grant number N00014-21-1-2590. Taposh Banerjee was supported in part by the U.S. Army Research Lab under grant W911NF2120295.
\end{acknowledgements}

% References
%\clearpage
\balance
\bibliography{wu_571}

\clearpage

% \include{uai2023-supplement}
\end{document}
