\documentclass{uai2023}

% \usepackage{uai2023}

\usepackage[utf8]{inputenc} % allow utf-8 input
\usepackage[T1]{fontenc}    % use 8-bit T1 fonts
\usepackage{hyperref}       % hyperlinks
\usepackage{url}            % simple URL typesetting
\usepackage{booktabs}       % professional-quality tables
\usepackage{amsfonts}       % blackboard math symbols
\usepackage{mathtools} % amsmath with fixes and additions
% For theorems and such
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{amsthm}
\usepackage{nicefrac}       % compact symbols for 1/2, etc.
\usepackage{microtype}      % microtypography
\usepackage{xcolor}         % define colors in text
\usepackage{xspace}         % fix spacing around commands
\newtheorem{comments}{Comment}[section]
\newtheorem{response}{Response}[section]

\def\de{\overset{\Delta}{=}}

\begin{document}
\onecolumn

\section{Reviewer 1}
\begin{comments}
Score matching has been used in machine learning before the cited deep generative models (Song et al. 2020), I suggest the authors include more references on score matching, as a large part of the audience is from the machine learning background.
\end{comments}
\paragraph{Response:} Thank you for this comment. The manuscript already cites some earlier work on score matching, and we will cite this literature more widely in the revision. %For instance, \cite{ding2019gradient} interprets Hyv\"arinen score from an information-theoretic perspective, and existing studies introduced the Hyv\"arinen score to Bayesian model comparison with improper or vague priors~\cite{Shao2019}. 
%We refer more citations to \cite{ding2019gradient} and references therein.

%#####################################################################################################
\vspace{1cm}
\begin{comments}
Section 2 introduces and highlights the meaning of "robustness" which is an important difference between the proposed technique with the baseline.
\end{comments}
\paragraph{Response:} We will revise the manuscript in accordance with this comment. To interpret the meaning of ``robustness'' in our problem, we recall the problem formulation (see Equation (1) in Section 2). The objective is to find a stopping rule that solves the following problem:
\begin{equation}
    \label{eq:lorden}
    \min_T \;\sup_{P_1 \in \mathcal{G}_1}\mathcal{L}_{\texttt{WADD}}(T)\;
    \quad \text{subject to}\;\quad \mathbb{E}_{\infty}[T]\geq \gamma,
\end{equation}
 which is the robust version of the quickest change detection with the Lorden criterion~\cite{lorden1970excess}. Thus, an algorithm is \textit{robust optimal} if it solves the above problem. As shown in \cite{unnikrishnan2011minimax}, the robust optimal algorithm is the robust CUSUM (RCUSUM) algorithm with the performance given by
 \begin{equation}
\label{eq:detection_delay_rcusum_robust}
    \mathcal{L}_{\texttt{WADD}}(T_{\texttt{RCUSUM}})\sim \frac{\log \gamma}{\mathbb{D}_{\texttt{KL}}(P_1\|Q_{\infty})-\mathbb{D}_{\texttt{KL}}(P_1\|Q_1)}, \quad \text{ as } \gamma \to \infty. 
\end{equation}
As seen in the above equation, the detection delay grows linearly in the log of the mean time to false alarm. The performance of the RSCUSUM algorithm is given by
\begin{equation}
\label{eq:detection_delay_rscusum_robust}
     \mathcal{L}_{\texttt{WADD}}(T_{\texttt{RSCUSUM}}) \sim \frac{\log \gamma}{\lambda (\mathbb{D}_{\texttt{F}}(P_1\|P_{\infty})-\mathbb{D}_{\texttt{F}}(P_1\|Q_{1}))}, \quad \text{ as } \gamma \to \infty. 
\end{equation}
Thus, even if the algorithm does not achieve the performance of the robust CUSUM algorithm (due to lack of knowledge of the precise distributions), it maintains the linear relationship between the delay and the log of mean time to false alarm (see Theorem 4.5). 
 % In the optimization problem formulated above, the optimal stopping rule minimizes the worst-case detection delay among all possible \textit{unknown} distributions, while satisfying the false-alarm constraint. Therefore, we expect a stopping rule such that there is a linear relationship between its expected detection delay and the average running length of a false alarm. 
 %Even with the ``uncertainty" of the post-change distribution, our proposed algorithm, RSCUSUM, can achieve this good performance, meaning that its detection delay increases at a linear rate of a log-scaled average running length (see Theorem 4.5). 
 This is the ``robustness'' we claim in this work. We also demonstrate it in our numerical results (Figure 2, Section 6). Specifically, for \textit{any} of the true post-change distributions, the expected detection delay (EDD) of RSCUSUM (left subplots) increases linearly in the log of the average run length, whereas for \textit{some} cases the EDD of the non-robust SCUSUM (right subplots) increases exponentially in the log of the average run length. In the revised version of the paper, we will include this discussion to clarify the concept of robustness. 

%#####################################################################################################
\vspace{1cm}
\begin{comments}
"By using the Hyvärinen Score in our algorithm, the role of Kullback-Leibler divergence in the theoretical analysis of the algorithm is replaced by the Fisher divergence." This is not clear, please introduce more steps to relate definitions 3.1 and 3.2.
\end{comments}
\paragraph{Response:} Thanks again for your comment. We will revise the manuscript in accordance with this comment. This sentence was intended to compare RSCUSUM with the robust CUSUM algorithm proposed in~\cite{unnikrishnan2011minimax}. Specifically, Unnikrishnan et al.~\cite{unnikrishnan2011minimax} proposed a robust CUSUM algorithm defined by the stopping rule:
\begin{equation}
T_{\texttt{RCUSUM}} = \inf \biggl\{ n\geq 1: \max_{1\leq k\leq n} \sum_{i=k}^n \log \frac{q_{1}(X_i)}{q_{\infty}(X_i)}\geq \tau\biggr\},
\end{equation}
where $\tau$ is a pre-selected threshold and $q_{\infty}$ and $q_1$ denote the density functions of $Q_{\infty}$ and $Q_1$, respectively, which form a pair of least favorable distributions (LFDs) as introduced in \cite{unnikrishnan2011minimax}. The asymptotic behavior of its worst-case average detection delay (WADD) has been analyzed: 
\begin{equation}
\label{eq:detection_delay_rcusum}
    \mathcal{L}_{\texttt{WADD}}(T_{\texttt{RCUSUM}})\sim \frac{\log\gamma}{\mathbb{D}_{\texttt{KL}}(P_1\|Q_{\infty})-\mathbb{D}_{\texttt{KL}}(P_1\|Q_1)}, \quad \text{ as } \gamma \to \infty, 
\end{equation}
where $Q_{\infty}$ can be replaced by $P_{\infty}$ if the uncertainty class for pre-change distributions is a singleton. For our RSCUSUM algorithm, we prove (in Theorem 4.5) that
\begin{equation}
\label{eq:detection_delay_rscusum}
     \mathcal{L}_{\texttt{WADD}}(T_{\texttt{RSCUSUM}}) \sim \frac{\log \gamma}{\lambda (\mathbb{D}_{\texttt{F}}(P_1\|P_{\infty})-\mathbb{D}_{\texttt{F}}(P_1\|Q_{1}))}, \quad \text{ as } \gamma \to \infty, 
\end{equation}
where the least favorable distribution is defined differently, using the Fisher divergence. 
Comparing Equations~(\ref{eq:detection_delay_rcusum}) and (\ref{eq:detection_delay_rscusum}), we note that the asymptotic worst-case average detection delay of RCUSUM involves the Kullback-Leibler (KL) divergence among the least favorable distributions and the true pre- and post-change distributions. In our result, the KL divergence is replaced by the Fisher divergence among the least favorable distribution and the true pre- and post-change distributions. This is the comparison our statement referred to. We will revise the paper to clarify this point. 
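
The link between the two rates can be made explicit through the mean post-change increment of each detection statistic. The following is a sketch under stated assumptions, not a restatement of the proofs in the manuscript. We assume that the Fisher divergence is defined as $\mathbb{D}_{\texttt{F}}(P\|Q) = \mathbb{E}_{P}\bigl[\tfrac{1}{2}\|\nabla_X \log p(X)-\nabla_X \log q(X)\|_2^2\bigr]$ and that the usual regularity conditions of score matching hold (integration by parts with vanishing boundary terms), so that $\mathbb{D}_{\texttt{F}}(P\|Q) = \mathbb{E}_{P}[\mathcal{S}_{\texttt{H}}(X,Q)] - \mathbb{E}_{P}[\mathcal{S}_{\texttt{H}}(X,P)]$, with $\mathcal{S}_{\texttt{H}}$ the Hyv\"arinen score of Definition 3.1. Writing $p_1$ for the density of $P_1$, we then have
\begin{align*}
\mathbb{E}_{P_1}\Bigl[\log \frac{q_1(X)}{q_{\infty}(X)}\Bigr] &= \mathbb{E}_{P_1}\Bigl[\log \frac{p_1(X)}{q_{\infty}(X)}\Bigr] - \mathbb{E}_{P_1}\Bigl[\log \frac{p_1(X)}{q_{1}(X)}\Bigr] = \mathbb{D}_{\texttt{KL}}(P_1\|Q_{\infty})-\mathbb{D}_{\texttt{KL}}(P_1\|Q_1), \\
\mathbb{E}_{P_1}\bigl[\lambda\bigl(\mathcal{S}_{\texttt{H}}(X, P_{\infty})-\mathcal{S}_{\texttt{H}}(X, Q_{1})\bigr)\bigr] &= \lambda\bigl(\mathbb{D}_{\texttt{F}}(P_1\|P_{\infty})-\mathbb{D}_{\texttt{F}}(P_1\|Q_{1})\bigr),
\end{align*}
where the second line follows because the term $\mathbb{E}_{P_1}[\mathcal{S}_{\texttt{H}}(X,P_1)]$ cancels in the difference. In both cases the denominator of the asymptotic delay is the mean post-change drift of the detection statistic, with the Fisher divergence taking the place of the KL divergence.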
%It can be said that ``the role of Kullback-Leibler divergence in the theoretical analysis of the is replaced by the Fisher divergence".

%#####################################################################################################
\vspace{1cm}
\begin{comments}
Section 3, what is the motivation to use Hyvärinen Score? This should be mentioned before "Our algorithm is based on Hyvärinen Score ".    
\end{comments}
\paragraph{Response:} In Section 2, we discussed the potential difficulty of calculating the normalization constant for unnormalized models. Recall the definition of the Hyv\"arinen score (see Definition 3.1): 
\begin{equation}
        % \label{eq:hyv_score}
        \mathcal{S}_{\texttt{H}}(X, P) \de \frac{1}{2} \left \| \nabla_{X} \log p(X) \right \|_2^2 + \Delta_{X} \log p(X).
\end{equation}
The score depends on $p$ only through derivatives of $\log p$ with respect to $X$, so it circumvents the computation of the normalization constant. This motivates using the Hyv\"arinen score in Section 3.
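
To spell out this step, write $p(X) = \tilde{p}(X)/C$, where $\tilde{p}$ is the unnormalized density and $C$ is the normalization constant (both symbols are introduced here only for illustration). Since $\log C$ does not depend on $X$,
\begin{equation*}
\nabla_{X} \log p(X) = \nabla_{X} \log \tilde{p}(X) - \nabla_{X} \log C = \nabla_{X} \log \tilde{p}(X), \qquad \Delta_{X} \log p(X) = \Delta_{X} \log \tilde{p}(X),
\end{equation*}
so $\mathcal{S}_{\texttt{H}}(X, P)$ can be evaluated from $\tilde{p}$ alone, whereas the log-likelihood $\log p(X) = \log \tilde{p}(X) - \log C$ requires $C$.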

%#####################################################################################################
\vspace{1cm}
\begin{comments}
    RSCUSUM Detection Algorithm: briefly give the complexity. Also, explain in more detail how the computational requirements are reduced, as in the beginning, you mentioned "which is not too demanding in computational and memory requirements for online implementation."
\end{comments}
\paragraph{Response:} A computational complexity analysis requires a bound on the cost of computing $\nabla_{X} \log p(X)$ and $\Delta_{X} \log p(X)$, which depends on the form of $p(X)$. Thus a general complexity bound is hard to obtain. However, in most practical cases, the complexity is manageable. Recall that RSCUSUM declares a change when the detection score $Z(n)$ is greater than or equal to a pre-selected threshold $\tau$ (see Algorithm 1). The detection score $Z(n)$ can be computed recursively:
\begin{align}
    &Z(0)=0, \\
    &Z(n) \de (Z(n-1)+z_{\lambda}(X_n))^{+},\;\forall n\geq 1.
\end{align}
Therefore, for online detection, RSCUSUM only needs to compute the instantaneous score $z_{\lambda}(X_n)$ and keep $Z(n)$ in memory for the next time step. The instantaneous score is defined by
\begin{equation}
   z_{\lambda}(X_k) = \lambda(\mathcal{S}_{\texttt{H}}(X_{k}, P_{\infty})-\mathcal{S}_{\texttt{H}}(X_{k}, Q_{1})).
\end{equation}
Assuming that $\nabla \log p(x)$ and $\Delta \log p(x)$ can be computed with time and memory polynomial in the dimension (which holds in most practical cases), each recursive step has polynomial complexity. The memory requirement also remains polynomial, since the recursion stores only the current value $Z(n)$. We will discuss this in the revised manuscript.
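
As an illustrative example that is not taken from the manuscript, suppose $P$ is a $d$-dimensional Gaussian with mean $\mu$ and covariance $\Sigma$ (symbols introduced here only for this example). Then
\begin{equation*}
\nabla_{X} \log p(X) = -\Sigma^{-1}(X-\mu), \qquad \Delta_{X} \log p(X) = -\operatorname{tr}(\Sigma^{-1}), \qquad \mathcal{S}_{\texttt{H}}(X, P) = \frac{1}{2}\bigl\|\Sigma^{-1}(X-\mu)\bigr\|_2^2 - \operatorname{tr}(\Sigma^{-1}).
\end{equation*}
If $\Sigma^{-1}$ and its trace are precomputed, evaluating the score at a new sample costs one matrix-vector product, that is, $O(d^2)$ operations with $O(d^2)$ memory. The instantaneous score requires two such evaluations, and the update of $Z(n)$ adds only a constant number of scalar operations.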

\section{Reviewer 2}
\begin{comments}
    It is assumed in the paper that the post-change uncertainty class is both convex and compact, although this may not always be the case, potentially restricting the proposed algorithm's practicality.
\end{comments}
\paragraph{Response:} The reviewer is correct that, in general, the convexity and compactness assumption on the uncertainty class may not hold. 

First, we humbly point out that we need to make some assumptions to prove the results. Our assumption is certainly a lot milder than the stochastic dominance assumption commonly made for analysis of RCUSUM \cite{unnikrishnan2011minimax}.

 Additionally, the paper already includes a partial remedy. Specifically, we consider the uncertainty class formed by the convex combinations of finitely many candidates for the true post-change distribution. Even when these candidates are known only in unnormalized form, we show (Theorem 5.1) that the least favorable distribution in the uncertainty class can be identified from a set of gradients of log density functions. This further softens our assumption.

%#####################################################################################################
\vspace{1cm}
\begin{comments}
    Empirical evidence to demonstrate the RSCUSUM algorithm's efficacy on real-world datasets is not presented in the paper.
\end{comments}
\paragraph{Response:} We are working on applications of our method to real-world datasets and plan to include them in the final manuscript. 

\section{Reviewer 3}
\begin{comments}
    I have a small question about why the proposed method is called "robust" CUSUM. The considered setting is not a data contamination setting. I am wondering whether author can make connections to the setting in the following paper. "Offline change detection under contamination", https://proceedings.mlr.press/v180/bhatt22a/bhatt22a.pdf
\end{comments}
\paragraph{Response:} We proposed the robust SCUSUM algorithm, named RSCUSUM, for the quickest change detection problem when the post-change distribution is unknown. The problem formulation is different from a data contamination setting but is conceptually related. Specifically, we have followed the robust quickest change detection setting analogous to that of \cite{unnikrishnan2011minimax} but with unnormalized models. The objective in this setting is to solve the optimization problem:
\begin{equation}
    \label{eq:lorden_contamination}
    \min_T \;\sup_{P_1 \in \mathcal{G}_1}\mathcal{L}_{\texttt{WADD}}(T)\;
    \quad \text{subject to}\;\quad \mathbb{E}_{\infty}[T]\geq \gamma,
\end{equation}
where $\mathcal{G}_1$ is an uncertainty class. We assume the uncertainty class is convex and compact, and then identify the least favorable distribution in the class to analyze the worst-case detection delay of RSCUSUM. The uncertainty class we consider is similar to, but different from, the $\epsilon$-contamination model of \cite{huber1965robust}, which is related to the data contamination setting. That said, we think that robust detection for contaminated unnormalized models is a very interesting problem.

%#####################################################################################################
\vspace{1cm}
\begin{comments}
I am curious about whether exists any other distance rather than Fisher divergence to achieve similar performance? Could author provide more intuitions why they consider score function in (5). Can other type of score function have similar theoretical properties.
\end{comments}
\paragraph{Response:} We are not sure whether other score functions can achieve the same performance. A well-known example of a score-based detection statistic is the CUSUM algorithm, whose detection score can be written as the difference between the log scores (negative log-likelihoods) of the pre- and post-change distributions. The motivation for using the Hyv\"arinen score is to address the computational issue that arises when likelihood-based change detection algorithms are applied to unnormalized models. Recall the definition of the Hyv\"arinen score (see Definition 3.1): 
\begin{equation}
        % \label{eq:hyv_score}
        \mathcal{S}_{\texttt{H}}(X, P) \de \frac{1}{2} \left \| \nabla_{X} \log p(X) \right \|_2^2 + \Delta_{X} \log p(X).
\end{equation}
This circumvents the need for computing normalization constants (by taking gradients of the log density function with respect to $X$).

To develop the theoretical properties, we start with a key property (see Lemma 4.2): under the pre-change law, the mean increment of the Hyv\"arinen score-based detection statistic is negative, while under the post-change law, the mean increment is positive.

%#####################################################################################################
\vspace{1cm}
\begin{comments}
    The paper is well written. The only suggestion is author can improve numerical results a bit to include more settings.
\end{comments}
\paragraph{Response:}

\section{Reviewer 4}
\begin{comments}
    The presentation of this work did not detail reveal the difference between the previous SCUSUM and RSCUSUM but begin with the CUSUM method, which brings some confusion in Section 3 to know which part mainly contributed to this work.
\end{comments}
\paragraph{Response:} We will revise the manuscript in accordance with this comment. Specifically, we will add a discussion of the SCUSUM algorithm at the beginning of Section 3 and clarify the difference between the SCUSUM and RSCUSUM algorithms. If the post-change model is precisely known, the least favorable distribution ($Q_1$) in Algorithm 1 will be replaced by the true post-change law ($P_1$), and then RSCUSUM is identical to SCUSUM.
%#####################################################################################################
\vspace{1cm}
\begin{comments}
    Again, in Section 4, the difference of the theoretical results between SCUSUM and RSCUSUM is still unclear.
\end{comments}
\paragraph{Response:} The major difference between SCUSUM and RSCUSUM is that RSCUSUM relaxes the assumption that the post-change distribution is known. SCUSUM assumes that the partial density function of the true post-change distribution is known, which is impractical for typical online data streams, whereas RSCUSUM only assumes knowledge of the uncertainty class of the true post-change distribution. RSCUSUM is identical to SCUSUM when the post-change distribution is precisely known. The false alarm analysis is similar, but the detection delay analysis is different because the delay depends on the knowledge of the post-change distribution. We emphasize that handling the uncertainty in the post-change distribution requires non-trivial technical results; see, in particular, Lemma 4.1. 

%#####################################################################################################
\vspace{1cm}
\begin{comments}
    Is it too strict to assume that all random variables in sequence $X_1,...X_{\nu-1}$ and $X_{\nu},...,X_{n}$ are i.i.d.? This seems not a typical online data stream that might contain time dependency.
\end{comments}
\paragraph{Response:} From a practical perspective, yes, the i.i.d. assumption does not always hold. We think it is possible to extend the current results to the non-i.i.d. case, but we believe this to be technically non-trivial, and more research is needed to develop results that can handle non-i.i.d. data.

\bibliographystyle{apalike}
\bibliography{UAI2023/uai2023-ref}
\end{document}
