\documentclass[accepted]{uai2023} % for initial submission
% \documentclass[accepted]{uai2023} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2023} % ptmx math instead of Computer
                                         % Modern (has noticeable issues)
% \documentclass[mathfont=newtx]{uai2023} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
\usepackage{xr}

% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}

\setlength{\parindent}{0cm}
\usepackage{multicol}
\usepackage[utf8]{inputenc} % allow utf-8 input
\usepackage[T1]{fontenc}    % use 8-bit T1 fonts
\usepackage{hyperref}       % hyperlinks
\usepackage{url}            % simple URL typesetting
\usepackage{booktabs}       % professional-quality tables
\usepackage{amsfonts}       % blackboard math symbols
\usepackage{nicefrac}       % compact symbols for 1/2, etc.
\usepackage{microtype}      % microtypography
\usepackage{xcolor}   
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{mathtools}
\usepackage{amsthm}
\usepackage{algorithm}
\usepackage{algorithmic}
\usepackage{balance}
\newcommand{\twopartdef}[4]
{
	\left\{
		\begin{array}{ll}
			#1 & \mbox{if } #2 \\
			#3 & \mbox{if } #4
		\end{array}
	\right.
}
\usepackage{subcaption}
\usepackage{multirow}
\usepackage{graphicx}


\newcommand\blfootnote[1]{%
  \begingroup
  \renewcommand\thefootnote{}\footnote{#1}%
  \addtocounter{footnote}{-1}%
  \endgroup
}

\theoremstyle{plain}
\newtheorem{theorem}{Theorem}[section]
\newtheorem{proposition}[theorem]{Proposition}
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{corollary}[theorem]{Corollary}
\theoremstyle{definition}
\newtheorem{definition}[theorem]{Definition}
\newtheorem{assumption}[theorem]{Assumption}
\theoremstyle{remark}
\newtheorem{remark}[theorem]{Remark}
\newtheorem{notation}[theorem]{Notation}
% colors
\allowdisplaybreaks

\title{Online Heavy-tailed Change-point detection}
\date{}

\author[1]{{Abishek Sankararaman\footnote{here}}}
{\author[1]{{Balakrishnan (Murali) Narayanaswamy}}}
\affil[1]{%
AWS AI Labs
}


\makeatletter
\newcommand*{\addFileDependency}[1]{% argument=file name and extension
\typeout{(#1)}% latexmk will find this if $recorder=0
% however, in that case, it will ignore #1 if it is a .aux or 
% .pdf file etc and it exists! If it doesn't exist, it will appear 
% in the list of dependents regardless)
%
% Write the following if you want it to appear in \listfiles 
% --- although not really necessary and latexmk doesn't use this
%
\@addtofilelist{#1}
%
% latexmk will find this message if #1 doesn't exist (yet)
\IfFileExists{#1}{}{\typeout{No file #1.}}
}\makeatother

\newcommand*{\myexternaldocument}[1]{%
\externaldocument{#1}%
\addFileDependency{#1.tex}%
\addFileDependency{#1.aux}%
}
\myexternaldocument{sankararaman_560-supp}

\begin{document}

\maketitle

\begin{abstract}
       We study algorithms for online change-point detection (OCPD), where potentially heavy-tailed samples are presented one at a time and a change in the underlying mean must be detected as early as possible. We present an algorithm based on clipped Stochastic Gradient Descent (SGD) that works even if we only assume that the second moment of the data-generating process is bounded. We derive guarantees on the worst-case, finite-sample false-positive rate (FPR) over the family of all distributions with bounded second moment. Thus, our method is the first OCPD algorithm that guarantees finite-sample FPR even if the data are high dimensional and the underlying distributions are heavy-tailed. The technical contribution of our paper is to show that clipped SGD can estimate the mean of a random vector and simultaneously provide confidence bounds at all confidence values. We combine this robust estimate with a union-bound argument to construct a sequential change-point algorithm with finite-sample FPR guarantees. We show empirically that our algorithm works well in a variety of situations, whether the underlying data are heavy-tailed, light-tailed, high dimensional or discrete. No other algorithm achieves bounded FPR, theoretically or empirically, over all the settings we study simultaneously.
\end{abstract}
\blfootnote{* Correspondence to Abishek Sankararaman : abisanka@amazon.com}
{\color{black}
\section{Introduction}

Online change-point detection (OCPD) is a fundamental problem in statistics where realizations of a random variable are presented one after another and we want to detect if some parameter or statistic corresponding to the underlying data generating distribution has changed. This problem has been widely studied in machine learning, mathematical statistics and information theory over the past century. In part, this is due to the wide-ranging applications of OCPD to computational biology \citep{muggeo2011efficient}, online advertising \citep{zhang2017online}, cyber-security \citep{osanaiye2016change, kurt2018bayesian, polunchenko2012nearly}, cloud-computing \citep{maghakian2019online}, finance \citep{lavielle2007adaptive}, medical diagnostics \citep{yang2006adaptive, gao2018automatic} and robotics \citep{konidaris2010constructing}. We refer interested readers to the recent surveys of \citep{aminikhanghahi2017survey} and \citep{xie2021sequential} for details of applications of OCPD. These surveys build upon the classical texts on change-point detection \citep{basseville1993detection,tartakovsky1991sequential,krichevsky1981performance}.

Classical results for OCPD have focused on algorithms that assume known distributions for either one or both of the pre- and post-change data \citep{wald1992sequential,page1954continuous,shiryaev2007optimal,lorden1971procedures,pollak1985optimal,ritov1990decision,moustakides1986optimal,tartakovsky1991sequential}. In recent years, algorithms have been developed for cases when the pre- and post-change distributions are unknown, but belong to a parametric class such as the exponential family \citep{lai2010sequential,fryzlewicz2014wild,frick2014multiscale,cho2016change}. Non-parametric algorithms have been developed in \citep{padilla2021optimal,madrid2021optimal} and the references therein, but they only give asymptotic guarantees. The algorithms of \citep{adams2007bayesian,lai2010sequential,maillard2019sequential,alami2020restarted} have finite-sample guarantees, but rely either on parametric assumptions such as an exponential family, or on tail assumptions such as sub-gaussian distribution families. The works of \citep{bhatt2022offline} and \citep{li2021adversarially} build upon the work in \citep{niu2012screening}, and give algorithms for multiple change-points with possibly heavy-tailed data in the \emph{offline} case with all data available up-front.%This paper is the first to give an online algorithm with finite sample guarantees for change-point detection without assuming any parametric family or strong tail conditions such as sub-gaussian distributions. 

In many modern applications such as cloud-computing and monitoring, data are often heavy-tailed \citep{nair2022fundamentals,loiseau2010investigating,nizam2016attack} and too complex to model with any simple parametric family \citep{barnett2016change,hallac2015network,dartmann2019big}. Given the velocity, variety and volume of modern data streams, performance of change-point detection is measured through false-positive rates in order to combat alert fatigue \citep{ruff2021unifying}, and algorithms must work for streams that have multiple change points. Motivated by these requirements, we seek an OCPD algorithm that simultaneously meets the following desiderata: it {\em(i)} detects multiple change-points, {\em(ii)} makes no parametric assumptions on the distribution of data, {\em(iii)} works with potentially heavy-tailed data, {\em (iv)} works for high-dimensional data streams, and {\em(v)} guarantees finite-sample FPR.
%None of the existing algorithms for OCPD simultaneously achieve all desiderata.



}

{\color{black}


\subsection{Main Contributions}

Our paper is the first to give an online algorithm satisfying all five desiderata listed above. Specifically, our algorithm gives finite-sample guarantees for FPR and detection delay without assuming that the data come from a specific parametric family and without strong tail conditions, such as sub-gaussian distributions. No previous algorithm for OCPD simultaneously achieves all five desiderata.
%Thus, our algorithm is the first one to provably achieve bounded FPR across a variety of data streams whether they are  We also corroborate our theoretical results empirically.
%For a data stream consisting of multiple change-points with samples drawn from a potentially heavy tailed distribution without any parametric assumptions on the distribution class, our algorithm is the first online algorithm to provably satisfy finite-sample false positive detection guarantees and detection delay bounds. 
%Our algorithm is based on an instantiation of the {\ttfamily clipped-SGD} \citep{tsai2022heavy} as an online robust mean estimator. Specifically, 
Our main technical contribution is to provide a {\ttfamily clipped-SGD} algorithm with finite-sample confidence bounds for heavy-tailed mean estimation \emph{that hold for all confidence values simultaneously}, a result of independent interest. We use these bounds to build an OCPD algorithm with finite-sample FPR guarantees.

We further show good empirical performance across a variety of data streams with heavy-tailed, light-tailed, high dimensional or discrete distributions.
However, while our algorithm is designed to work across different distributions, we observe theoretically and empirically that when the data have additional structure, such as being one-dimensional with sub-gaussian tails or being binary, specialized OCPD algorithms for those cases yield better results than our method. Closing these gaps is an ongoing direction of research.

% similar to the Generalized-Likelihood-Ratio (GLR) of \citep{maillard2019sequential}. 
%We further corroborate empirically, that our algorithm is the only one to achieve bounded FPR across a variety of data streams whether they are heavy-tailed, light-tailed, high dimensional or discrete.

% Our paper is the first algorithm to give non-asymptotic, finite sample guarantee on false positive probability and detection delay, that holds for any data stream consisting of samples drawn from potentially heavy tailed, online multiple change point detection. 


% We do so by constructing a novel variant of the clipped-SGD algorithm of \citep{tsai2022heavy} that can simultaneously estimate the mean of a sequence of heavy-tailed random variables for all confidence levels $\delta \in (0,1)$. This improves on the algorithm of \citep{tsai2022heavy} which needs the confidence level $\delta \in (0,1)$ to be given as input. The price we pay for simultaneous confidence band however is that our confidence intervals have an additional logarithmic factor compared to that of \citep{tsai2022heavy}. 



% \begin{itemize}
%     \item We can show that False positives are bounded above by $1-\delta$, uniformly over time. 
%     \item If the pre-change distribution is sufficiently long, then with probability $1-\delta$, the detection delay is bounded.
%     \item Lower bound needed for the minimal pre-change distribution needed. 
% \end{itemize}
}

{\color{red}

%\section{Related Work}

% Online change-point detection is widely studied, both in Theoretical Statistics and Information Theory literature in the past century with early seminal works of \citep{wald1992sequential}, \citep{page1954continuous}, \citep{lorden1971procedures}, \citep{krichevsky1981performance}. In recent times, this problem has gained renewed attention due to theoretical and methodological advances - \citep{fryzlewicz2014wild}, \citep{wang2018high}, \citep{frick2014multiscale}, \citep{cho2016change}. Most of these results rely on parametric assumptions on the distributions and provide asymptotic guarantees. Recent progress in non-parametric estimators have led to algorithms without parametric assumptions \citep{padilla2021optimal}, \citep{madrid2021optimal}, but only give asymptotic guarantees. The algorithms of \citep{adams2007bayesian}, \citep{lai2010sequential}, \citep{maillard2019sequential}, \citep{alami2020restarted} provide algorithms with finite-sample guarantees, but either rely on parametric assumptions such as an exponential family, or on tail assumptions such as sub-gaussian distribution families. The works of \citep{bhatt2022offline} of \citep{li2021adversarially} build upon the work in \citep{niu2012screening} and give algorithms for multiple change-points with possibly heavy-tailed data in the \emph{offline} case with all data available up-front. This paper is the first to give an online algorithm with finite sample guarantees for change-point detection without assuming any parametric family or strong tail conditions such as sub-gaussian distributions. 

% In recent times, there have been significant advances in robust mean estimation \citep{diakonikolas2020outlier}, \citep{lugosi2021robust}, \citep{depersin2022robust}, \citep{cherapanamjeri2020algorithms}, \citep{diakonikolas2022streaming} that are known to provide near optimal error bounds. However, unlike our method, none of these algorithms can directly give confidence bounds for all confidence values simultaneously.

% The simplicity of the SGD approach is that it can be incrementally updated and thus has a computational complexity of $O(t)$ at each time $t$. We need a variant because the algorithm of \citep{tsai2022heavy} needs as input both the time-horizon and a confidence parameter $\delta$. In contrast, our mean-estimator variant is any-time and does not require any confidence parameter as input. These allow us to use this estimator and design appropriate confidence bands to guarantee bounded FPR.



% However, all these algorithms are designed for the batch setting. Thus, even if we conservatively estimated the offline mean-estimation algorithm to take a run-time linear in the number of samples, at time $t$ a total of $O(t^2)$ computation is required to decide if there is a change point at time $t$. The quadratic complexity arises because these robust mean estimation algorithms cannot be incrementally updated with additional samples. 
}
\section{Problem Setup}
\label{sec:problem_formulation}

%We consider a OCPD setup where $d$ dimensional random vectors are sequentially observed, whose mean may change at one or more time steps. 
At each time $t$, a random vector $X_t \in \mathbb{R}^d$ is revealed to an OCPD algorithm. $X_t$ has a probability measure and expectation denoted by $\mathbb{P}_t$ and $\mathbb{E}_t$ respectively, and mean $\mathbb{E}_{t}[X_t] \in \mathbb{R}^d$. Subsequently, using all the samples observed so far, $X_1, \cdots, X_t$, the algorithm outputs a binary decision denoting whether a change in mean has occurred since time $t=1$ or the last time a change was output by the algorithm, whichever is later. The goal of the OCPD algorithm is to identify the change points as quickly as possible after they occur, with bounded false-positive rate (FPR). 
%Formally, at each time $t$,  a vector $X_t \sim \mathbb{P}_{t}$, with mean $\mathbb{E}_{t}[X] := \theta^*_t$ is given to the algorithm. 
The observations $(X_t)_{t \geq 1}$ are independent, but not identically distributed, and have a piece-wise constant mean. 

\begin{definition}[Piece-wise constant mean process]
Let $T$ be the time horizon (stream-length) and let $Q_T < T$ be the total number of change-points. A set of strictly increasing time-points $1 < \tau_1 < \tau_2 < \cdots < \tau_{Q_T +1} := T+1$ are called change-points if, for all $c \in \{1,\cdots, Q_T\}$, 
\begin{itemize}
\item $\forall t \in [1,T]$, $X_t \sim \mathbb{P}_t$ independently.
    \item $\forall t \in [\tau_c, \tau_{c+1})$, the mean $\mathbb{E}_t[X_t] := \theta_c$ of the observation is constant and does not depend on $t$.
    \item $\forall c \in [1,Q_T]$, $\theta_{c} \neq \theta_{c+1}$.
\end{itemize}
\end{definition}
Thus, a piece-wise constant mean process is identified by the quadruple $\mathfrak{M} := (T, Q_T, (\tau_c)_{c=1}^{Q_T}, (\mathbb{P}_t)_{t=1}^T)$. Throughout, we use probability and expectation operators $\mathbb{P}$ and $\mathbb{E}$, to denote the joint product probability distribution $(\mathbb{P}_t)_{t=1}^T$. %In our setup, the time-horizon $T$ location and values of change-points are deterministic and not random


\subsection{Assumptions}
Let $\mathcal{P}$ be a family of probability measures  on $\mathbb{R}^d$ such that the probability distributions $\mathbb{P}_t$, for all $t$, are from this family, i.e., $\mathbb{P}_t \in \mathcal{P}$, $\forall t \in [1,T]$. Throughout this paper, we make the following non-parametric assumptions on the family $\mathcal{P}$. 

\begin{assumption}
There exists a convex compact set $\Theta \subset \mathbb{R}^d$ known to the algorithm, such that for all $ \mathbb{P} \in \mathcal{P}$, $\mathbb{E}_{X \sim \mathbb{P}}[X] \in \Theta$. In words, the means of all the distributions in the family belong to a known bounded set $\Theta$ with diameter $G := \max_{\theta_1, \theta_2 \in \Theta}\| \theta_1 - \theta_2\|$.
\label{assumption:bounded}
\end{assumption}


\begin{assumption}
     There exists $\sigma > 0$ {known} to the algorithm, such that for all $\mathbb{P} \in \mathcal{P}$, $\mathbb{E}_{X \sim \mathbb{P}}[ \| X - \mathbb{E}_{X \sim \mathbb{P}}[X] \|_2^2] \leq \sigma^2$. In words, the second central moment is uniformly bounded for all distributions in $\mathcal{P}$.
     \label{assumption_stochastic}
\end{assumption}

These assumptions are very general and encompass a wide range of families such as any bounded distribution, the set of sub-Gaussian distributions and heavy-tailed distributions that do not have finite higher moments. We seek algorithms that work without knowing the length of the data stream, the number of change-points and that do not make any assumptions on the underlying distributions generating the samples, beyond Assumptions  \ref{assumption:bounded} and \ref{assumption_stochastic}.
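
As a simple illustration of a heavy-tailed distribution covered by these assumptions (a sketch for intuition only; the specific distribution is an assumption made for this example), take $d = 1$ and $X = \theta + Z$ with $\theta \in \Theta$, where $Z$ follows a Student-$t$ distribution with $\nu = 3$ degrees of freedom. Then
\begin{align*}
    \mathbb{E}[X] = \theta, \qquad \mathbb{E}[(X - \theta)^2] = \frac{\nu}{\nu - 2} = 3, \qquad \mathbb{E}[|X - \theta|^3] = \infty,
\end{align*}
so Assumption \ref{assumption_stochastic} holds with $\sigma^2 = 3$ even though $X$ is unbounded, is not sub-Gaussian and has no finite third moment.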


\subsection{Performance Measures}

Any OCPD algorithm is measured by two performance metrics -- {\em (i)} False-positive rate and {\em(ii)} Detection delay. We set notation to define these measures.

\begin{notation}
For every $1 \leq r \leq s < T$, we denote by $X_{r:s} := (X_r, X_{r+1}, \cdots, X_s)$ the sequence of observed vectors from time $r$ to time $s$, with both end-points $r$ and $s$ inclusive. 
\end{notation}


\begin{definition}[OCPD algorithm]
A sequence of measurable functions $\mathcal{A} := (\mathcal{A}_t)_{t\geq 1}$ is called an OCPD algorithm if for every time $t \geq 1$, $\mathcal{A}_t \in \{0,1\}$ and is measurable with respect to the sigma algebra generated by $X_{1:t}$. The interpretation is that if $\mathcal{A}_t = 1$ for some $t$, then the algorithm has detected a change at time $t$ and if $\mathcal{A}_t = 0$, no change is detected at time $t$. 
\end{definition}

\begin{notation}
For an OCPD algorithm $\mathcal{A}$ and for all $t \in [1,T]$, denote by $R^{(\mathcal{A})}(t) \in \mathbb{N}$ the random variable counting the number of detections made up to time $t$, i.e., $R^{(\mathcal{A})}(t) = \sum_{s=1}^t \mathcal{A}_s$.
\end{notation}

\begin{notation}
For an OCPD algorithm $\mathcal{A}$ and every $r \in \mathbb{N}$, denote by $t_{r}^{(\mathcal{A})}$ the stopping time
\begin{align*}
    t_{r}^{(\mathcal{A})} := \min( \inf \{t \in [0,T] \text{ s.t. } R^{(\mathcal{A})}(t) \geq r \}, T+1),
\end{align*}
where the $\inf$ of an empty set is defined to be $\infty$. In words, $t_{r}^{(\mathcal{A})}$ is the time at which the OCPD algorithm detects a change for the $r$th time, or $T+1$ if the algorithm makes fewer than $r$ detections by time $T$. 
\end{notation}



\begin{definition}[False Positive Detection]
The $r$th detection of an OCPD algorithm $\mathcal{A}$ is said to be a False Positive if there exists no change-point between the $(r-1)$th and the $r$th detection. Formally, the indicator (random) variable $\chi_r^{(\mathcal{A})} = \mathbf{1}(\not\exists c \in [1,Q_T] \text { s.t. } \tau_c \in (t_{r-1}^{(\mathcal{A})},t_{r}^{(\mathcal{A})}])$ denotes whether the $r$th detection of $\mathcal{A}$ is a false-positive. Note that by definition, on the event that $R^{(\mathcal{A})}(T) < r$, $\chi_r^{(\mathcal{A})} = 0$. 
\end{definition}


\begin{definition}[False Positive Rate (FPR)]
An OCPD algorithm $\mathcal{A}$ is said to have false-positive rate bounded by $\delta \in (0,1)$ if 
\begin{align}
   \sup_{\mathfrak{M}} \mathbb{E}\left[ \frac{\sum_{r=1}^{T} \chi_r^{(\mathcal{A})}}{R^{(\mathcal{A})}(T)} \mathbf{1}(R^{(\mathcal{A})}(T)>0)\right] \leq \delta.
   \label{eqn:fpr_definition}
\end{align}

In words, an OCPD algorithm $\mathcal{A}$ has bounded false positive rate, if for every piece-wise constant mean process $\mathfrak{M}$, the expected fraction of false-positives made by the algorithm $\mathcal{A}$ is bounded by $\delta$. In Equation (\ref{eqn:fpr_definition}), we take the sum till $T$ because that is the maximum number of possible change points detected. If an algorithm only detects $s < T$ change points, then by definition $\chi_r^{(\mathcal{A})} = 0$ for all $r > s$.
\end{definition}
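
To make these definitions concrete, consider a hypothetical stream (the numbers below are illustrative assumptions, not experimental results) with $T = 100$ and a single change-point $\tau_1 = 50$, and suppose that on some realization the algorithm $\mathcal{A}$ raises detections exactly at times $30$ and $55$. Then $R^{(\mathcal{A})}(T) = 2$, $t_0^{(\mathcal{A})} = 0$, $t_1^{(\mathcal{A})} = 30$, $t_2^{(\mathcal{A})} = 55$, and
\begin{align*}
    \chi_1^{(\mathcal{A})} &= \mathbf{1}(\tau_1 \notin (0, 30]) = 1, \\
    \chi_2^{(\mathcal{A})} &= \mathbf{1}(\tau_1 \notin (30, 55]) = 0, \\
    \chi_r^{(\mathcal{A})} &= 0 \quad \text{for all } r \geq 3.
\end{align*}
On this realization, the ratio inside the expectation in Equation (\ref{eqn:fpr_definition}) therefore equals $\frac{1 + 0}{2} = \frac{1}{2}$: the first detection is a false positive and the second is not.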

\begin{definition}[Worst-case Detection Delay]
For $n \in \mathbb{N}$ and $\Delta > 0$, let $X_1, X_2, \cdots, X_n, X_{n+1}, \cdots$ be an infinite stream with the following distribution. For every $t < n$, $X_t \stackrel{\text{ind}}{\sim} \mathbb{P}_t$ with $\mathbb{E}_{X \sim \mathbb{P}_t}[X] = \theta_1 \in \Theta$ and for every $t \geq n$, $X_t \stackrel{\text{ind}}{\sim} \mathbb{P}_t$ with $\mathbb{E}_{X \sim \mathbb{P}_t}[X] = \theta_2 \in \Theta$ with $\| \theta_1 - \theta_2 \| = \Delta$. Let $\mathfrak{M}^{(n,\Delta)}$ denote the set of all such infinite piece-wise constant mean processes. An algorithm $\mathcal{A}$ is said to have worst-case detection delay $\mathcal{D}(\Delta, n , \delta')$, if 
\begin{multline}
    \sup_{\mathfrak{M}^{(n,\Delta)}}\mathbb{P}\bigg[ \inf \{ t > n \text{: } \mathcal{A}_t = 1\} - n \geq  \mathcal{D}(\Delta, n, \delta')\bigg]  \leq \delta'
    \label{eqn:detection_delay_defn}
\end{multline}
 holds for all $n \in \mathbb{N}$, $\Delta > 0$ and $\delta' \in (0,1)$.
\end{definition}

In words, the detection delay function $\mathcal{D}(\Delta, n ,\delta')$ is such that for every admissible process in $\mathfrak{M}^{(n,\Delta)}$, which has a single change-point at time $n$ with jump magnitude $\Delta$, algorithm $\mathcal{A}$ detects the change-point before time $n+ \mathcal{D}(\Delta, n ,\delta' )$ with probability at least $1-\delta'$. Note that the delay metric is measured on data streams with exactly one change-point. Defining detection delay for streams with multiple change-points is ambiguous as there could be missed detections, with only a subset of the change-points being detected \citep{alami2020restarted,maillard2019sequential}. The main question this paper studies is:


\begin{center}
\textit{For each $\delta \in (0,1)$, does there exist an OCPD algorithm with FPR bounded by $\delta$ and small worst-case detection delay that only makes Assumptions \ref{assumption:bounded} and \ref{assumption_stochastic}?}
\end{center}

Observe that it is trivial to achieve an FPR of $0$, for example with the constant algorithm $\mathcal{A}_t = 0$ for all $t$, i.e., an algorithm that never detects a change-point. However, this algorithm has a worst-case detection delay of $\infty$, i.e., $\mathcal{D}(\Delta, n, \delta') = +\infty$ for all $\Delta > 0$, $n \in \mathbb{N}$ and $\delta' \in (0,1)$. Thus, the challenge is to design an algorithm that satisfies the FPR constraint of $\delta$ while having small, finite worst-case detection delay, without making parametric assumptions on the underlying data generating distributions.




\section{Online Robust Mean Estimation}
\label{sec:mean_estimation}

The central workhorse of our change-point detection algorithm is heavy-tailed online mean estimation. Suppose $X_1, X_2, \cdots$ is a sequence of independent random vectors, with the means $\mathbb{E}_t[X_t] = \theta^* \in \Theta$ being a constant independent of time $t$. Let $(\widehat{\theta}_t)_{t \geq 1}$ be a sequence of random vectors such that $\widehat{\theta}_t$ is an estimate of $\theta^*$ based on the samples $X_1, \cdots, X_{t}$, defined through the clipped-SGD algorithm described as follows. For a given non-negative sequence $(\eta_t)_{t \geq 1}$ and $\lambda > 0$, the initial estimate $\widehat{\theta}_0 \in \Theta$ is arbitrary, and $\widehat{\theta}_t$, for each $t \geq 1$, is given by
\begin{align}
    \widehat{\theta}_t \coloneqq \prod_{\Theta}(\widehat{\theta}_{t-1} - \eta_t \text{clip}(\widehat{\theta}_{t-1} - X_t, \lambda)),
    \label{eqn:sgd_update}
\end{align}
where $\prod_{\Theta}$ is the projection operator onto the convex compact set $\Theta$ and, for every $x \in \mathbb{R}^d$ and $\lambda > 0$,
\begin{align}
    \text{clip}(x,\lambda) = x \min \left( 1, \frac{\lambda}{\|x\|} \right).
\end{align}
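
As a small worked instance of the clipping operator (the numerical values are hypothetical and chosen only for illustration), note that for $x \neq 0$ the definition can equivalently be written case-wise as
\begin{align*}
    \text{clip}(x, \lambda) = \twopartdef{x}{\|x\| \leq \lambda}{\lambda \frac{x}{\|x\|}}{\|x\| > \lambda}
\end{align*}
so that $\|\text{clip}(x,\lambda)\| \leq \lambda$ always holds. For example, with the Euclidean norm, $x = (3, 4)$ and $\lambda = 1$, we have $\|x\| = 5$ and $\text{clip}(x, \lambda) = (3/5, 4/5)$, which has norm exactly $1$; with $\lambda = 10$, the same $x$ is left unchanged. Clipping thus preserves the direction of the update and only caps its magnitude, which is what limits the influence of a single heavy-tailed sample.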

Our main result on the convergence of the estimator $\widehat{\theta}_t$ to the true $\theta^*$ with increasing number of samples $t$ is the following. 

\begin{theorem}
Suppose clipped SGD in Equation (\ref{eqn:sgd_update}) is run with $\lambda = 2G$ and $\eta_t = \frac{2}{m(t+\gamma)}$ for $\gamma = \max \left(120 \lambda \sigma(\sigma+1),\frac{320\sigma^2}{m}+1 \right)$. Then, for every $t \geq 1$ and every $\delta \in (0,1)$, 
 \begin{align*}
   \mathbb{P} \left[ \| \widehat{\theta}_t - \theta^* \|_2^2 \geq {\mathcal{B}(t, \delta)} \right] \leq \frac{\delta}{t(t+1)},
 \end{align*}
where 
\begin{multline}
    \mathcal{B}(t, \delta) := C_t\bigg[ \frac{\gamma^2 G^2}{(t+1)^2} +\left(\frac{16  \sigma^2}{\lambda} +  4  \sigma^2 \right) \frac{1}{2m^2(t+1)} \\+ \frac{96 \lambda^2 \ln \left( \frac{2t^2(t+1)}{\delta}\right)\sigma(\sigma+1) }{m (t+\gamma)\sqrt{t+1}}  \bigg],
    \label{eqn:defn_B}
\end{multline}
and $C_t  = \max\left(\frac{1024\sigma^4}{G^2m^2\lambda^2}, \frac{8 \lambda \sqrt{\ln \left( \frac{2t^2(t+1)}{\delta} \right)}}{\gamma^2 G} \right)$.
% \begin{align*}
%      .
% \end{align*}
\label{thm:main_mean_est}
\end{theorem}
{\color{black}
\begin{corollary}
There exists a universal constant $A > 0$ such that for all $t \geq 1$, when clipped SGD in Equation (\ref{eqn:sgd_update}) is run with the parameters in Theorem \ref{thm:main_mean_est},
\begin{align*}
    \mathbb{P} \left[ \| \widehat{\theta}_t - \theta^* \| \geq A \max \left(\frac{\sigma^3}{\sqrt{t}} , \frac{\sigma \sqrt{\ln \left( \frac{t^3}{\delta}\right)}}{\sqrt{t}}\right) \right] \leq \frac{\delta}{t(t+1)},
\end{align*}
holds for every  $\delta \in (0,1)$.
\end{corollary}
}

The proof is given in the appendix in Section \ref{sec:mean_estimation_proofs} and uses tools from \citep{bubeck2015convex,gorbunov2020stochastic,tsai2022heavy,victor1999general}. 

\begin{remark}
Compared to \citep{tsai2022heavy}, we do not need the failure probability $\delta$ as an input and we can give simultaneous confidence intervals for all failure probabilities $\delta$. In contrast, the algorithm of \citep{tsai2022heavy} requires $\delta \in (0,1)$ as an input and only guarantees that the estimated mean is close to the true mean, up to an error probability of $\delta$. However, the bound in Theorem \ref{thm:main_mean_est} is off by logarithmic factors compared to \citep{tsai2022heavy}. Concretely, $C_t = O(1)$ for the algorithm of \citep{tsai2022heavy}, while it is $O( \log(t/\delta))$ for us. This is the price of having confidence intervals that hold for all failure probabilities simultaneously, as opposed to a single failure probability.
\end{remark}

\begin{remark}
Compared to the setting of \citep{tsai2022heavy}, our setting is \emph{weaker} as we assume that the domain $\Theta$ is compact with finite diameter $G$. This is what enables us to use an appropriately tuned learning rate and clipping parameter to make the algorithm any-time and obtain confidence intervals at all failure probabilities simultaneously. It is an open question whether the assumption that $\Theta$ is compact can be relaxed while still guaranteeing confidence intervals that hold for all failure probabilities $\delta$ and all $t$ for heavy-tailed distributions.
\end{remark}

\begin{remark}
The constants in Theorem \ref{thm:main_mean_est} are not optimal. In Section \ref{sec:experiments}, we suggest an alternative set of constants that work well empirically across a variety of settings.
\label{remark:sub_opt_constants}
\end{remark}

\begin{remark}
There have been significant recent advances in robust mean estimation \citep{diakonikolas2020outlier,lugosi2021robust,depersin2022robust,cherapanamjeri2020algorithms,diakonikolas2022streaming} that are known to provide near-optimal error bounds. However, unlike our method, none of these algorithms can give confidence bounds for all confidence values simultaneously.
\end{remark}

\begin{remark}
Theorem $4.3$ in \citep{devroye2016sub} proves that it is impossible to get a finite sample confidence bound to hold for all $\delta \in (0,1)$. Our result does not contradict this since the restriction on the allowable $\delta$ is \emph{implicit} in Theorem \ref{thm:main_mean_est}. Equation (\ref{eqn:defn_B}) gives that, for every $t \in \mathbb{N}$, as $\delta \searrow 0$, $\mathcal{B}(t, \delta) \nearrow \infty$. However, from Assumption \ref{assumption:bounded}, $\| \widehat{\theta}_t - \theta^* \|_2^2 \leq G^2$, so if $\mathcal{B}(t,\delta) \geq G^2$, then the statement of Theorem \ref{thm:main_mean_est} is vacuous. Thus, Theorem \ref{thm:main_mean_est} gives non-vacuous bounds only for $\delta \in (\delta_{min}^{(t)},1)$ where $\delta_{min}^{(t)} := \inf \{ \delta > 0 : \mathcal{B}(t,\delta) < G^2 \}$.
\end{remark}

{\color{black}
\subsection{Uniform over time bound}
{\color{black}
As a corollary of Theorem \ref{thm:main_mean_est}, we get the following bound that holds uniformly over all time. 
\begin{corollary}
There exists a universal constant $A > 0$ such that, when clipped SGD in Equation (\ref{eqn:sgd_update}) is run with the parameters in Theorem \ref{thm:main_mean_est},
\begin{align*}
\mathbb{P}\left[ \exists t \in \mathbb{N} :  \| \widehat{\theta}_t - \theta^* \| \geq A \max \left(\frac{\sigma^3}{\sqrt{t}} , \frac{\sigma \sqrt{\ln \left( \frac{t^3}{\delta}\right)}}{\sqrt{t}}\right) \right] \leq \delta,
\end{align*}
holds for every  $\delta \in (0,1)$.
\label{cor:dim_free}
\end{corollary}
}

The proof follows by taking a union bound over all $t \geq 1$, i.e., summing the per-time bound of the corollary following Theorem \ref{thm:main_mean_est} over $t \geq 1$ and noticing that $\sum_{t \geq 1}\frac{1}{t(t+1)}=1$. The bounds in Theorem \ref{thm:main_mean_est} and Corollary \ref{cor:dim_free} are \emph{dimension free}, i.e., the dimension $d$ does not appear in the bounds. Instead, the moment bound $\sigma$ plays the role of dimension. In particular, suppose that all distributions in the family $\mathcal{P}$ have covariance matrices bounded in the positive semi-definite sense by $\Sigma \in \mathbb{R}^{d \times d}$. In this case, one may take $\sigma^2 \leq \text{Trace}(\Sigma)$, and $\sigma^2$ plays the role of dimension.
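
For completeness, the union-bound step can be sketched as follows, where $\varepsilon_t(\delta)$ is shorthand, introduced only here, for the threshold $A \max(\cdot, \cdot)$ appearing in Corollary \ref{cor:dim_free}:
\begin{align*}
    \mathbb{P}\left[ \exists t \in \mathbb{N} : \| \widehat{\theta}_t - \theta^* \| \geq \varepsilon_t(\delta) \right]
    &\leq \sum_{t \geq 1} \mathbb{P}\left[ \| \widehat{\theta}_t - \theta^* \| \geq \varepsilon_t(\delta) \right] \\
    &\leq \sum_{t \geq 1} \frac{\delta}{t(t+1)}
    = \delta \sum_{t \geq 1} \left( \frac{1}{t} - \frac{1}{t+1} \right) = \delta.
\end{align*}
The last equality holds because the sum telescopes. The per-time failure probability $\frac{\delta}{t(t+1)}$ is what makes the total failure probability over all times equal to exactly $\delta$.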

In the special case when the samples $(X_t)_{t\geq 1}$ are i.i.d. with sub-gaussian distributions with mean $\theta^*$ and covariance matrix $\Sigma$, \citep{NIPS2011_e1d5be1c,maillard2019sequential, chowdhury2022bregman} show that for all $\delta \in (0,1)$,
\begin{multline}
    \mathbb{P}\bigg[ \exists t \in \mathbb{N} : \bigg\|\frac{1}{t}\sum_{s=1}^tX_s - \theta^* \bigg\| \geq \\ \sqrt{2 \lambda_{max}(\Sigma) \left( 1+ \frac{1}{t}\right) \ln \left( \frac{(t+1)^{{\color{black}d}}}{\delta} \right)} \bigg] \leq \delta,
    \label{eqn:maillard_them}
\end{multline}
holds, where $\lambda_{max}(\Sigma)$ is the largest eigenvalue of the covariance matrix $\Sigma$. Thus, for the special case of sub-gaussian distributions, Equation (\ref{eqn:maillard_them}) has a better dependence on time $t$ than our Corollary \ref{cor:dim_free}. The improved dependence on time arises because Equation (\ref{eqn:maillard_them}) is based on constructing a self-normalized martingale and using the martingale stopping theorem to obtain uniform over time bounds, while Corollary \ref{cor:dim_free} is based on a simple union bound.

However, the bound in Equation (\ref{eqn:maillard_them}) is unsatisfactory for high dimensional problems since it is not dimension free and depends on the scale of the problem through the term $d \lambda_{max}(\Sigma)$, which by definition is at least $\text{Trace}(\Sigma)$. In many high dimensional settings, $d \lambda_{max}(\Sigma)$ is much {larger} than $\text{Trace}(\Sigma)$, and thus algorithms and bounds depending explicitly on $d$ are undesirable \citep{wainwright2019high, lugosi2019mean}. It is an open problem whether a bound can be established that is both dimension free and has the improved time-dependence of the kind in Equation (\ref{eqn:maillard_them}), even for sub-gaussian random vectors. For heavy-tailed distributions, \citep{wang2022catoni, wang2023huber} establish confidence bounds with sharp dependence on time by extending the martingale recipe followed by \citep{maillard2019sequential}. However, it is unclear whether a direct extension of their approach to high dimensions can yield dimension free bounds. 
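
To make the gap between the two quantities concrete, write $\lambda_1 \geq \cdots \geq \lambda_d \geq 0$ for the eigenvalues of $\Sigma$ (this notation is ours). Then
\begin{align*}
\text{Trace}(\Sigma) = \sum_{i=1}^d \lambda_i \leq d \lambda_1 = d \lambda_{max}(\Sigma),
\end{align*}
with equality only when all eigenvalues coincide. As a purely illustrative example, not taken from our experiments, consider a spectrum with $\lambda_i = 1/i$. Then $\text{Trace}(\Sigma) = \sum_{i=1}^d \frac{1}{i} \leq 1 + \ln d$ while $d\lambda_{max}(\Sigma) = d$, so for $d = 32$ the trace is at most $1 + \ln 32 \approx 4.47$ whereas $d \lambda_{max}(\Sigma) = 32$.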

}

\section{Change-Point Detection Algorithm}

{\color{black}Our algorithm is described in Algorithm \ref{algo:learn_model} and is based on the following idea. A change point is detected in the time-interval $[r,t]$ if there exists $r < s < t$ such that the confidence interval around the estimated mean of the observations $X_{r:s}$ is separated from the confidence interval around the estimated mean of the observations $X_{s+1:t}$. Further, in order to accommodate multiple change-points, the algorithm \emph{restarts} after every change detection, similar to \citep{alami2020restarted}. 
%Formally, in the proof throughout and in Algorithm \ref{algo:learn_model}, for every $t \geq 1$, we denote by the indicator variable $\text{\textbf{Restart}}_t = \mathbf{1}(\mathcal{A}(X_{1:t} - \mathcal{A}(X_{1:t-1}) > 0)$, i.e., to denote whether a change-point was detected at time $t$.
%In order to guarantee a false-positive rate requires a careful selection of the confidence bounds around the mean estimates obtained from $X_{r:s}$ and $X_{s+1:t}$. 
%The central design is to construct a good estimate of the mean and its associated confidence bound for a given set of samples $X_{r:s}$ and $X_{s+1:t}$. 
It is known that the standard empirical mean is a poor estimator when the underlying distributions can potentially be heavy-tailed, as its confidence interval under only Assumption \ref{assumption_stochastic} is wide \citep{lugosi2019mean}. 
To attain better confidence intervals, we use the {\ttfamily clipped-SGD} estimator within Algorithm \ref{algo:learn_model}, which gives a confidence interval for the estimated mean for every failure probability $\delta\in (0,1)$ simultaneously. Having multiple confidence intervals is crucial, as we show that adaptively testing different intervals of time at different, carefully chosen confidence levels (Line $8$ of Algorithm \ref{algo:learn_model}) leads to the bounded FPR guarantee. 




}


% classical ideas in online change-point analysis \citep{page1954continuous}, \citep{veeravalli021survey}, \citep{xie2021sequential}, of computing a test statistic for any interval of time $[s,t]$. In the classical CUSUM algorithm \citep{page1954continuous}, the statistic for an interval is the sum of log-liklihood ratios of all the observed samples in that interval. The recent works of \citep{maillard2019sequential} and \citep{alami2020restarted} use a version of the generalized liklihood statistic. As our setting is non-parametric, we cannot compute liklihood ratios. Instead our test-statistic estimates the mean in an interval and deems a change-point in the mean between two intervals if the estimated mean is sufficiently different. 
% \\

% Formally 

% A change-point is detected in the interval $[r,t]$, if there exists a point $r < s < t$, such that the estimated mean of the observations in the interval $[r,s]$ is different from the estimated mean in the interval $[s+1,t]$. However, since the observations can potentially be heavy-tailed, blindly using the empirical mean is in-sufficient as a single outlier can skew the mean. Instead the mean of a sub-interval needs to be estimated in a robust fashion. In principle one could use several robust estimation algorithms such as \citep{diakonikolas2019sever} or \citep{diakonikolas2020outlier}. However, these algorithms although provide optimal statistical guarantees are offline and requires memory and computation that scales quadraticaly in the time horizon $t$. }

	
	\begin{algorithm*}[htb]
%\DontPrintSemicolon
		\caption{{ Online {\ttfamily Clipped-SGD} Change Point Detection}}
		\label{algo:learn_model}

		\begin{algorithmic}[1]
				\STATE \textbf{Input}: {  $(\eta_t)_{t \geq 1}$,  $\lambda > 0$, $\theta_0 \in \Theta$, $\delta \in (0,1)$ the FPR guarantee } 
				\STATE $r \gets 1$
				\STATE $\widehat{\theta}_{t,t-1} \gets \theta_0$, for all $t \geq 1$.
				\STATE Set {\ttfamily Num-change-points }$ \gets 0$
			    \FOR {each time $t = 1, 2, \cdots , $}
			   \STATE Receive sample $X_t$ \\
			   \STATE   $\widehat{\theta}_{s,t} \gets \prod_{\theta}(\widehat{\theta}_{s,t-1} - \eta_{t-s}\text{clip}(X_{t} - \widehat{\theta}_{s,t-1}, \lambda))$, for every $r \leq s \leq t$.
			    \IF {$\exists s \in (r,t)$ such that $\|\widehat{\theta}_{r:s} - \widehat{\theta}_{s+1:t}\|_2^2 > \mathcal{B}\left(s-r,\frac{\delta}{2(t-r)(t-r+1)}\right) + \mathcal{B}\left(t-s-1, \frac{\delta}{2(t-r)(t-r+1)}\right)$ \COMMENT {$\mathcal{B}(\cdot, \cdot)$ is defined in Equation (\ref{eqn:defn_B})}}
			    \STATE Set $\mathcal{A}_t$ $\gets 1$ \COMMENT {Change point detected}
			    \STATE $r \gets t+1$
			    \STATE Set {\ttfamily Num-change-points }$ \gets ${\ttfamily Num-change-points } $+1$ \COMMENT {Increment number of change-points detected}
			    %\ENDIF
			    \ELSE 
			    \STATE Set $\mathcal{A}_t$ $\gets 0$
			    \ENDIF
			 
			    \ENDFOR 
			 %   \begin{align*}
			 %       \widehat{\theta} \in \arg\min_{\theta \in \Theta} \min_{\substack{S \subset \mathcal{B}, \\ \text{s.t.} |S| > \frac{3}{4}|\mathcal{B}|}} \frac{1}{|S|}\sum_{Y \in S}\mathcal{L}(Y,\theta).
			 %   \end{align*}
	

\end{algorithmic}
	\end{algorithm*}

% \begin{align*}
%   \mathcal{T}(r,s,t) = \mathcal{O} \left( \frac{1}{s-r} + \frac{1}{t-s-1} + \sqrt{\frac{1}{s-r}} + \sqrt{\frac{1}{t-s-1}} \right).
% \end{align*}


%\section{Theoretical Guarantees}

\subsection{Connections to GLR}

Restating our algorithm, a change point is detected in a time-interval $[t_0,t]$ if 
\begin{align*}
    \exists s \in (t_0,t) \text{ s.t. } \| \widehat{\theta}_{t_0:s} - \widehat{\theta}_{s+1:t} \|^2 \geq \mathcal{C}(t_0, s, t, \delta),
\end{align*}
where the function $\mathcal{C}(\cdot)$ is given in Line $8$ of Algorithm \ref{algo:learn_model}. In the above re-statement, the estimates $\widehat{\theta}_{t_0:s}$ and $\widehat{\theta}_{s+1:t}$ are robust estimates of the mean based on the sets of observations $\{X_{t_0}, \cdots,X_s\}$ and $\{X_{s+1}, \cdots, X_t\}$ respectively. The {\ttfamily Improved-GLR} of \citep{maillard2019sequential} uses a detector that is structurally similar to the above equation, except that they {\em (i)} use the empirical mean, as they are dealing with sub-gaussian random variables, and {\em (ii)} use a function $\mathcal{C}(\cdot)$ derived from the Laplace method that gives confidence bounds with better dependence on time, but is not dimension free. In contrast, we use the robust mean estimator given by clipped-SGD, and our function $\mathcal{C}(\cdot)$ is derived from confidence guarantees that only require the existence of the second moment, make no other tail assumptions, and yield dimension free bounds. The cost, however, is that the confidence bound derived from clipped SGD has a weaker dependence on time than that obtained by the Laplace method \citep{maillard2019sequential}.
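
For concreteness, reading the condition in Line $8$ of Algorithm \ref{algo:learn_model} with $r = t_0$, the threshold can be written out as
\begin{align*}
\mathcal{C}(t_0, s, t, \delta) = \mathcal{B}\left(s-t_0,\frac{\delta}{2(t-t_0)(t-t_0+1)}\right) + \mathcal{B}\left(t-s-1, \frac{\delta}{2(t-t_0)(t-t_0+1)}\right),
\end{align*}
with $\mathcal{B}(\cdot,\cdot)$ as in Equation (\ref{eqn:defn_B}). This is only a restatement of the algorithm: the two terms are the confidence widths for the two sub-intervals, each evaluated at a failure probability that shrinks with the length $t - t_0$ of the window being tested.
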
\subsection{False-positive guarantee}
\label{sec:fpr_guarantee}

We will prove the following result on Algorithm \ref{algo:learn_model}. For a given process $\mathfrak{M}$ and every $r \in \mathbb{N}$, let the deterministic time $\tau_c^{(r)} := \inf\{ \tau_c : \tau_c > r \}$ denote the first change-point after time $r$. 

\begin{theorem}[False Positives]
%Let $ n \in \mathbb{N}$, and  $\theta \in \Theta$ be arbitrary. If $X_{t}, \stackrel{\text{ind}}{\sim} \mathbb{P}_t$, where for all $t < n$, $\mathbb{E}_{t}[X] = \theta$, then 
When Algorithm \ref{algo:learn_model} is run with parameters  $\lambda = 2G$, $\eta_t = \frac{2}{m(t+\gamma)}$ for $\gamma = \max \left(120 \lambda \sigma(\sigma+1),\frac{320\sigma^2}{m}+1 \right)$ and $\delta \in (0,1)$, 
\begin{align*}
    \sup_{\mathfrak{M}, r}\mathbb{P}[ \exists t \in [r, \tau_c^{(r)}), \text{ s.t. } \mathcal{A}_{t}  = 1 \vert \mathcal{A}_r=1] \leq \delta,
\end{align*}
holds almost-surely.
\label{thm:fpr_main}
\end{theorem}

The proof is in the Appendix in Section \ref{sec:proof_thm_fpr_main}. This result states that with probability at most $\delta$, a true change-point \emph{does not} lie between any two consecutive detections made by the algorithm. This theorem implies the following lemma. 

\begin{lemma}
Under the conditions of Theorem \ref{thm:fpr_main}, the FPR condition in Equation (\ref{eqn:fpr_definition}) holds. 
\label{lem:fpr_connection}
\end{lemma}
The proof is in the Appendix in Section \ref{sec:proof_of_fpr_connection}. We emphasize that the guarantee in Lemma \ref{lem:fpr_connection} is a \emph{worst-case guarantee}. In other words, no matter the underlying distribution, as long as Assumptions \ref{assumption:bounded} and \ref{assumption_stochastic} are met, Algorithm \ref{algo:learn_model} will not have more than a $\delta$ fraction of false-positives. 


\subsection{Worst-case Detection Delay Guarantee}

\begin{lemma}
If Algorithm \ref{algo:learn_model} is run with the parameters from Theorem \ref{thm:fpr_main}, then for every $n \in \mathbb{N}$, $\Delta > 0$ and $\delta' \in (0,1)$
\begin{multline}
   \mathcal{D}(n, \Delta, \delta') \geq \inf \bigg\{ d \in \mathbb{N} : \Delta^2 \geq \mathcal{B}\left( n-1, \frac{\delta'}{2} \right) + \\ \mathcal{B}\left( d, \frac{\delta'}{2} \right)  + \mathcal{B}\left( n-1, \frac{\delta}{2(n+d+1)(n+d)} \right) + \\ \mathcal{B}\left( d,  \frac{\delta}{2(n+d+1)(n+d)} \right) \bigg\},
   \label{eqn:delay_formula}
\end{multline}
where $\mathcal{D}(\cdot)$ and $\mathcal{B}(\cdot)$ are defined in Equations (\ref{eqn:detection_delay_defn}) and (\ref{eqn:defn_B}) respectively. 
\label{lem:detection_delay}
\end{lemma}


The proof is in the Appendix in Section \ref{sec:proof_delay}. Lemma \ref{lem:detection_delay} is an \emph{upper bound on the worst case delay}. In other words, for any pre- and post-change distributions whose means differ in norm by $\Delta$, Algorithm \ref{algo:learn_model} will detect this change within a delay of $\mathcal{D}(n,\Delta,\delta')$ with probability at least $1-\delta'$. 
%As in Remark \ref{remark:sub_opt_constants}, the constants we used to create Figure \ref{fig:heatmap} are given in Section \ref{sec:experiments}. In Figure \ref{fig:empirical_heat_map}, we plot the empirically observed detection delay for a sequence of $32$ dimensional Pareto distributed random vectors with shape parameter $2.01$. As can be seen in Figure \ref{fig:heatmap_joint}, the observed detection delay is much smaller than that indicated by Lemma $4.3$, which is a worst case over all distributions. 


% \begin{figure}
%     \centering
%     \includegraphics[width=0.85\linewidth]{plots/heatmap.pdf}
%     \caption{Heat-map $\mathcal{D}(n, \Delta, \delta')$ from Lemma \ref{lem:detection_delay} for fixed $\delta'=0.1$. The grey cells represent infinity.}
%     \label{fig:heatmap}
% \end{figure}

\begin{figure}
\centering
% \begin{subfigure}{0.32\linewidth}
% \includegraphics[width=0.99\linewidth]{figs/mean_estimation_pareto.pdf}
% %\caption{}
% \end{subfigure}
\begin{subfigure}{0.49\linewidth}
\centering
\includegraphics[width=0.99\linewidth]{plots/actual_heat_1.png}
\caption{}
\label{fig:heatmap}
\end{subfigure}
\begin{subfigure}{0.49\linewidth}
\centering
\includegraphics[width=0.93\linewidth]{plots/empirical_heat_1.png}
\caption{}
\label{fig:empirical_heat_map}
\end{subfigure}
\caption{Figure $(a)$ plots the heat-map of $\mathcal{D}(n, \Delta, \delta')$ from Lemma \ref{lem:detection_delay} for fixed $\delta'=0.1$. The white cells represent infinity. Figure $(b)$ plots the $90$th quantile ($\delta'=0.1$) of the observed delay for a Pareto distribution with $d=32$ over $30$ runs. As can be seen, the observed detection delay in $(b)$ is much smaller than the worst case delay in $(a)$.}
\label{fig:heatmap_joint}
\end{figure}

For many specific choices of pre- and post-change distribution families, however, we expect the observed detection delay to be much smaller than predicted by Lemma \ref{lem:detection_delay}, since the bound is worst-case over all distributions. In Figure \ref{fig:heatmap} we plot the bound in Lemma \ref{lem:detection_delay} for fixed $\delta'=\delta=0.1$ as $n$ and $\Delta$ vary. We use the constants given in Section \ref{sec:synthetic_simulations} to plot Figure \ref{fig:heatmap}. In Figure \ref{fig:empirical_heat_map}, we plot the empirically observed detection delay for a sequence of $32$ dimensional Pareto distributed random vectors with shape parameter $2.01$. As can be seen in Figure \ref{fig:heatmap_joint}, the observed detection delay is much smaller than that indicated by Lemma \ref{lem:detection_delay}. 


%For instance, Figure (\ref{fig:heatmap}) gives that $\mathcal{D}(400, 1,0.1) = +\infty$. However, we see in Simulations in \ref{sec:experiments} that with $400$ pre-change samples, a gap of $\Delta = 1$ is detectable with finite detection delay for both high-dimensional Pareto and standard normal distributions. 


\begin{remark}
%Lemma \ref{lem:detection_delay} gives a guarantee on the \emph{worst-case} detection delay, where the worst case is over all possible pre and post-change distributions that have a difference in mean magnitude of $\Delta$. 
In the special case when the observations are Bernoulli random variables, the {\ttfamily R-BOCPD} algorithm of \citep{alami2020restarted} gives a smaller detection delay than ours: our detection delay bound in Lemma \ref{lem:detection_delay} has additional poly-logarithmic factors of $\log(n/\delta)$ and sub-optimal constants compared to {\ttfamily R-BOCPD}. However, our bound holds for \emph{any} family of distributions, including high-dimensional and heavy-tailed ones, while {\ttfamily R-BOCPD} can only be applied to Bernoulli distributions.
\end{remark}



\begin{corollary}[Un-detectable Change]
If $\Delta \leq \mathcal{O} \left( \frac{\log \left( \frac{n}{\delta} \right)}{\sqrt{n}} \right)$, then $   \mathcal{D}(n, \Delta, \delta') = \infty$ for all $\delta' \in (0,1)$, i.e., the change is un-detectable by Algorithm \ref{algo:learn_model}.
\label{cor:undetectable}
\end{corollary}
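
A short sketch of why the infimum in Equation (\ref{eqn:delay_formula}) can be infinite is as follows. We assume here, as our reading of Equation (\ref{eqn:defn_B}) rather than a statement proved in this section, that $\mathcal{B}(t,\cdot) \geq 0$ and that $\mathcal{B}(t,\cdot)$ is non-increasing in its second argument, and that $d \geq 1$. Then $(n+d+1)(n+d) \geq (n+2)(n+1)$, and dropping the two non-negative terms involving $\mathcal{B}(d, \cdot)$ gives, for every $d \geq 1$,
\begin{align*}
\mathcal{B}\left( n-1, \tfrac{\delta'}{2} \right) + \mathcal{B}\left( d, \tfrac{\delta'}{2} \right) + \mathcal{B}\left( n-1, \tfrac{\delta}{2(n+d+1)(n+d)} \right) + \mathcal{B}\left( d, \tfrac{\delta}{2(n+d+1)(n+d)} \right) \\
\geq \mathcal{B}\left( n-1, \tfrac{\delta'}{2} \right) + \mathcal{B}\left( n-1, \tfrac{\delta}{2(n+2)(n+1)} \right).
\end{align*}
Hence, whenever $\Delta^2$ is strictly smaller than the right-hand side, no $d$ satisfies the condition in Equation (\ref{eqn:delay_formula}); the set is empty and its infimum is $+\infty$ by convention.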

\begin{remark}
%If $\Delta \leq \mathcal{O} \left( \frac{\log \left( \frac{n}{\delta} \right)}{\sqrt{n}} \right)$, then Lemma \ref{lem:detection_delay} gives that $   \mathcal{D}(n, \Delta, \delta) = \infty$, i.e., this indicates that the change is un-detectable by Algorithm \ref{algo:learn_model}.
The undetectable region consists of the white areas of Figure \ref{fig:heatmap}. However, since Lemma \ref{lem:detection_delay} is only an upper bound, the fact that $\mathcal{D}(n, \Delta, \delta') = \infty$ \emph{does not imply} that our algorithm cannot detect the change (cf. Figure \ref{fig:empirical_heat_map}).
\end{remark}

\begin{remark}
%Corollary \ref{cor:undetectable} is a statement about the performance of our Algorithm \ref{algo:learn_model} and not a global lower bound. 
In the case of sub-gaussian exponential families, \citep{maillard2019sequential} give a lower bound for changes that are not detectable by \emph{any} algorithm. When Algorithm \ref{algo:learn_model} is applied to sub-gaussian random variables from an exponential family, the detection-delay bound in Lemma \ref{lem:detection_delay} is sub-optimal by poly-logarithmic factors in $\log(n/\delta)$ compared to the lower bound. However, Algorithm \ref{algo:learn_model} and the delay bound in Lemma \ref{lem:detection_delay} hold for any class of distributions subject to Assumptions \ref{assumption_stochastic} and \ref{assumption:bounded}, while the bounds in \citep{maillard2019sequential} only apply to sub-gaussian observations from a known exponential family. 
\end{remark}

% \begin{figure}
%     \centering
%     \includegraphics[width=0.4\linewidth]{plots/detection_delay.pdf}
%     \caption{Plot of $ \mathcal{D}(n, \Delta, \delta')$ for fixed $\Delta=10, \delta=0.1$.}
%     \label{fig:detection_delay}
% \end{figure}



% \begin{figure}
%     \centering
%     \includegraphics[scale=0.2]{plots/delta_min.pdf}
%     \caption{Plot showing the variation of minimum detectable change with number of pre-change samples.}
%     \label{fig:delta_min}
% \end{figure}


% \begin{figure}
%     \centering
%     \includegraphics[width=0.6\linewidth]{plots/detection_delay.pdf}
%     \caption{Plot of $ \mathcal{D}(n, \Delta, \delta')$ for fixed $\Delta=10, \delta=0.1$.}
%     \label{fig:detection_delay}
% \end{figure}


\subsection{Change-point localization}

In practice, it is also crucial to identify the location where the change point occurred. In this section we describe how to modify Algorithm \ref{algo:learn_model} to also output an estimate of the location of the change, in addition to detecting its existence. Recall that for every $r \in \mathbb{N}$, $\tau_r^{(\mathcal{A})} \in \mathbb{N} \cup \{ \infty\}$ is the stopping time denoting the $r$th time Algorithm $\mathcal{A}$ detects a change point. We modify Algorithm \ref{algo:learn_model} to additionally output, for every $r \in \mathbb{N}$, a time interval $[s_{r;1}^{(\mathcal{A})}, s_{r;2}^{(\mathcal{A})}] \subseteq [\tau_{r-1}^{(\mathcal{A})},\tau_{r}^{(\mathcal{A})}]$ that contains a change-point $\tau_c$.

 In order to do so, we need an additional definition. For every $r < s < t$ and $\delta \in (0,1)$, let $\mathfrak{B}(r,s,t,\delta) \in \{0,1\}$ denote the indicator variable
\begin{equation}
    \mathfrak{B}(r,s,t,\delta)=  \mathbf{1} \bigg(\|\widehat{\theta}_{r:s} - \widehat{\theta}_{s+1:t}\|_2^2 > \mathfrak{B}_1 + \mathfrak{B}_2\bigg), \label{eqn:defn_mathfrak_B}
\end{equation}   
where $\mathfrak{B}_1= \mathcal{B}\left(s-r,\frac{\delta}{2(t-r)(t-r+1)}\right)$ and $\mathfrak{B}_2= \mathcal{B}\left(t-s-1, \frac{\delta}{2(t-r)(t-r+1)}\right) $.
% \begin{equation}
%     \mathfrak{B1}= \mathcal{B}\left(s-r,\frac{\delta}{2(t-r)(t-r+1)}\right) \nonumber
%     \end{equation}
%     \begin{equation}
%     \mathfrak{B2}= \mathcal{B}\left(t-s-1, \frac{\delta}{2(t-r)(t-r+1)}\right) \nonumber
%     \end{equation}
The estimates of the location of change in a time-interval $[r,t]$ are all those time instants $s \in [r,t]$ such that $\mathfrak{B}(r,s,t,\delta)=1$. Line $12$ of Algorithm \ref{algo:learn_model_local} in Section \ref{sec:localization} in the Appendix precisely defines the estimator. The empirical performance of this method is shown in Figure \ref{fig:localization}. We observe that this produces an accurate and sharp estimate of the change-point location in simulations.% A theoretical characterization of localization performance is an interesting future work.
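
As a sketch of how such a set of flagged instants can be summarized by an interval, one option is to report the smallest interval containing all of them. The symbol $\widehat{S}(r,t)$ below is our own notation introduced for illustration, and we assume the set is non-empty, which is the case whenever a change is detected at time $t$; the precise estimator is the one in Algorithm \ref{algo:learn_model_local}.
\begin{align*}
\widehat{S}(r,t) := \left\{ s \in (r,t) : \mathfrak{B}(r,s,t,\delta) = 1 \right\}, \qquad \left[ \min \widehat{S}(r,t), \; \max \widehat{S}(r,t) \right].
\end{align*}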




% In the supplementary materials in Section \ref{sec:localization}, we give Algorithm \ref{algo:learn_model_local} that is a variation of Algorithm \ref{algo:learn_model}, that at each time $t$, if \textbf{Restart}$_t = 1$, also outputs a time-interval as an estimate for the location of the true-change point. Recall that Algorithm \ref{algo:learn_model} detects a change-point in the interval $[r,t]$, whenever the condition in Line $8$ of \ref{algo:learn_model} evaluates to True. The modification in Algorithm \ref{algo:learn_model_local} outputs all the time points $s \subseteq [r,t]$ where the condition in Line $8$ of Algorithm \ref{algo:learn_model} evaluates as True. {\color{black} We see through experiments in Section \ref{sec:localization} in the Appendix, }

% to output the instant of change is to say whenever a change-point is detected, i.e., whenever the {\ttfamily IF} statement in Line $7$ evaluates to a {\ttfamily True}, then also output the time-window $[s_{min}, s_{max}]$ of the minimum and maximum $s$ that satisfies the condition in Line $7$. Concretely, for every $r < s < t$, denote by the boolean condition $\mathfrak{B}(r,s,t,\delta)$ 
% \begin{multline*}
%     \mathfrak{B}(r,s,t,\delta) = \mathbf{1} \bigg(\|\widehat{\theta}_{r:s} - \widehat{\theta}_{s+1:t}\|_2^2 > \\ \mathcal{B}\left(s-r,\frac{\delta}{2(t-r)(t-r+1)}\right) + \\ \mathcal{B}\left(t-s-1, \frac{\delta}{2(t-r)(t-r+1)}\right) \bigg).
% \end{multline*}
% If at time $t$, a change point is detected in the interval $[r,t]$, then output the estimate of the change-point location as $[\inf \{s : \mathfrak{B}(r,s,t,\delta)=1\}, \sup\{s, \mathfrak{B}(r,s,t,\delta)=1\}]$.
\section{Experiments}
\label{sec:experiments}

\begin{figure*}[ht!]
\centering
\begin{subfigure}{0.22\textwidth}
\includegraphics[width=0.99\linewidth]{plots/refined_plots/pareto_half_d1.pdf}
    \caption{Pareto $\Delta=0.5$}
    \label{fig:ff1}
\end{subfigure}
\begin{subfigure}{0.22\textwidth}
\includegraphics[width=0.99\linewidth]{plots/refined_plots/pareto_one_d1.pdf}
    \caption{Pareto $\Delta=1$}
    \label{fig:ff2}
\end{subfigure}
\begin{subfigure}{0.22\textwidth}
\includegraphics[width=0.99\linewidth]{plots/refined_plots/pareto_half_d32.pdf}
    \caption{Pareto $d=32,\Delta=0.5$}
    \label{fig:ff3}
\end{subfigure}
\begin{subfigure}{0.22\textwidth}
\includegraphics[width=0.99\linewidth]{plots/refined_plots/pareto_one_d32.pdf}
    \caption{Pareto $d=32, \Delta=1$}
    \label{fig:ff4}
\end{subfigure}
\begin{subfigure}{0.22\textwidth}
\includegraphics[width=0.99\linewidth]{plots/refined_plots/normal_half_d1.pdf}
    \caption{Normal $\Delta=0.5$}
    \label{fig:ff5}
\end{subfigure}
\begin{subfigure}{0.22\textwidth}
\includegraphics[width=0.99\linewidth]{plots/refined_plots/normal_one_d1.pdf}
    \caption{Normal $ \Delta=1$}
    \label{fig:ff6}
\end{subfigure}
\begin{subfigure}{0.22\textwidth}
\includegraphics[width=0.99\linewidth]{plots/refined_plots/normal_half_d32.pdf}
    \caption{Normal $d=32$, $\Delta=0.5$}
    \label{fig:ff7}
\end{subfigure}
\begin{subfigure}{0.22\textwidth}
\includegraphics[width=0.99\linewidth]{plots/refined_plots/normal_one_d32.pdf}
    \caption{Normal $d=32, \Delta=1$}
    \label{fig:ff8}
\end{subfigure}
\begin{subfigure}{0.22\textwidth}
\includegraphics[width=0.99\linewidth]{plots/refined_plots/pareto_one_d1_compare.pdf}
    \caption{Pareto $ \Delta=1$}
    \label{fig:ff9}
\end{subfigure}
\begin{subfigure}{0.22\textwidth}
\includegraphics[width=0.99\linewidth]{plots/refined_plots/pareto_one_d32_compare.pdf}
    \caption{Pareto $d=32, \Delta=1$}
    \label{fig:ff10}
\end{subfigure}
\begin{subfigure}{0.22\textwidth}
\includegraphics[width=0.99\linewidth]{plots/refined_plots/Bernouli_half.pdf}
    \caption{Bernoulli with $\Delta=0.5$}
    \label{fig:ff11}
\end{subfigure}
\begin{subfigure}{0.22\textwidth}
\includegraphics[width=0.99\linewidth]{plots/refined_plots/bernoulli_07_compare.pdf}
    \caption{Bernoulli with $\Delta=0.7$}
    \label{fig:ff12}
\end{subfigure}
\caption{Empirical performance of Algorithm \ref{algo:learn_model} in a variety of scenarios. Exact details of each plot are given in Section \ref{sec:synthetic_simulations}.}
\label{fig:all_figs} 
\end{figure*}


\begin{figure*}
\centering
\begin{subfigure}{0.22\textwidth}
\includegraphics[width=0.99\linewidth]{plots/refined_plots/pareto_one_d1_localization.pdf}
\caption{Pareto $\Delta=1$}
\label{fig:local1}
\end{subfigure}
\begin{subfigure}{0.22\textwidth}
\includegraphics[width=0.99\linewidth]{plots/refined_plots/pareto_one_d32_local.pdf}
\caption{Pareto $d=32, \Delta=1$}
\label{fig:local2}
\end{subfigure}
\begin{subfigure}{0.22\textwidth}
\includegraphics[width=0.99\linewidth]{plots/refined_plots/normal_one_d1_local.pdf}
\caption{Normal $\Delta=1$}
\label{fig:local3}
\end{subfigure}
\begin{subfigure}{0.22\textwidth}
\includegraphics[width=0.99\linewidth]{plots/refined_plots/bernoulli_07_local.pdf}
\caption{Bernoulli $\Delta=0.7$}
\label{fig:local4}
\end{subfigure}
\caption{Plots showing that Algorithm \ref{algo:learn_model_local} can detect and localize change-points across a variety of settings.}
\label{fig:localization}
\end{figure*}

In this section we give numerical evidence that Algorithm \ref{algo:learn_model} can be applied across a variety of settings. Line $8$ of Algorithm \ref{algo:learn_model} relies on confidence bounds for high-dimensional estimation in which the absolute constants are not optimized. This is an artifact of the proof techniques in robust estimation \citep{lugosi2019mean, vershynin2018high}. Thus, we modify the absolute constants used in Theorem \ref{thm:fpr_main} as follows. We use $\gamma = \max \left({\color{red}4} \lambda \sigma(\sigma+1),\frac{{\color{red}8}\sigma^2}{m}+1 \right)$, with the color red highlighting the changes from the definition in Theorem \ref{thm:fpr_main}. The constant $C_t$ is modified to $C_t  = \max\left(\frac{ {\color{red}0.5}\sigma^4}{G^2m^2\lambda^2}, \frac{ {\color{red}1} \lambda \sqrt{\ln \left( \frac{2t^2(t+1)}{\delta} \right)}}{\gamma^2 G} \right)$.
% \begin{align*}
% \end{align*}
In addition, we use the following definition of $\mathcal{B}(\cdot, \cdot)$ 
\begin{multline}
    \mathcal{B}(t, \delta) := {\color{black}C_t}\bigg[ \frac{{\color{black}\gamma^2} G^2}{t+1} +\left(\frac{{\color{red}2}  \sigma^2}{\lambda} +  {\color{red}1}  \sigma^2 \right) \frac{1}{2m^2(t+1)} \\+ \frac{{\color{red}2} \lambda^2 \ln \left( \frac{2t^2(t+1)}{\delta}\right)\sigma(\sigma+1) }{m (t+\gamma)\sqrt{t+1}}  \bigg],
    \label{eqn:defn_B_simulation}
\end{multline}
where $C_t$ and $\gamma$ are the modified values stated above. Further, in all simulations we take $\Theta = \mathbb{R}^d$ to be the whole space. 



\subsection{Synthetic Simulations}
\label{sec:synthetic_simulations}

Here, we demonstrate that Algorithm \ref{algo:learn_model} with the choice of hyper-parameters in Equation (\ref{eqn:defn_B_simulation}) is practical and can be applied across a variety of data generating distributions (heavy-tailed, high-dimensional, or both), while still obtaining bounded false-positive rates and a much lower detection delay than the conservative bound in Lemma \ref{lem:detection_delay} would indicate. 

\subsubsection{Setup}
\label{subsec:simulation_setup}
In Figure \ref{fig:all_figs}, we construct synthetic situations and introduce change-points with each change lasting $400$ time-units. In all experiments, we choose the family of distributions $\mathfrak{M}$ such that $\sigma=1$, $G = 1$2. At each time $t$, a sample is drawn from the appropriate distribution, detailed below, and presented to the change-point algorithm. The true change-points and the median detection times, along with the 95th percentile upper and lower confidence bands, are shown in Figure \ref{fig:all_figs}. These are estimated from $30$ independent runs for each setting in Figure \ref{fig:all_figs}.

\textbf{Heavy-tailed distribution:} In Figures \ref{fig:ff1}, \ref{fig:ff2} and \ref{fig:ff9}, the sample at every time-point is drawn from a Pareto distribution with shape parameter $2.01$. This implies that the third central moment of the distribution is infinite. The mean of the samples in the time-durations $t \in [0,400) \cup [800, 1200)$ is $0$ in all figures, and the mean at times $t \in [400, 800) \cup [1200,1600)$ is $\Delta = 0.5, 1, 1$ respectively in Figures \ref{fig:ff1}, \ref{fig:ff2} and \ref{fig:ff9}. In Figures \ref{fig:ff3}, \ref{fig:ff4} and \ref{fig:ff10}, the observation at time $t$ is a $32$ dimensional isotropic random vector whose norm has a Pareto distribution with shape parameter $2.01$. The mean vector at times $[0,400) \cup [800,1200)$ is $0 \in \mathbb{R}^{32}$ and at times $t \in [400,800) \cup [1200,1600)$ is $\frac{\Delta}{\sqrt{32}}[{1, \cdots, 1}] \in \mathbb{R}^{32}$, where $\Delta = 0.5, 1, 1$ respectively in Figures \ref{fig:ff3}, \ref{fig:ff4} and \ref{fig:ff10}.
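
To see why shape parameter $2.01$ is a borderline heavy-tailed case, recall the moments of a Pareto distribution. We assume here the standard (Type I) parametrization with scale $x_m > 0$ and shape $\alpha > 0$, i.e., density $\alpha x_m^{\alpha} x^{-(\alpha+1)}$ on $x \geq x_m$; the symbols $x_m$ and $\alpha$ are our notation, and how the samples are centered and scaled to obtain the stated means is not detailed here. For a sample $X$ from this distribution,
\begin{align*}
\mathbb{E}[X^k] = \begin{cases} \dfrac{\alpha x_m^k}{\alpha - k} & \text{if } k < \alpha, \\ \infty & \text{if } k \geq \alpha, \end{cases}
\qquad
\text{Var}(X) = \frac{\alpha x_m^2}{(\alpha-1)^2(\alpha-2)} \quad \text{for } \alpha > 2.
\end{align*}
With $\alpha = 2.01$, the first and second moments are finite but the third is not, and the variance equals $\frac{2.01}{(1.01)^2 (0.01)} x_m^2 \approx 197\, x_m^2$, which is finite but large relative to the scale.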


\textbf{Gaussian distribution:} In Figures \ref{fig:ff5} and \ref{fig:ff6}, the sample at every time-point is drawn from a unit variance Gaussian distribution. The mean of the samples in the time-durations $t \in [0,400) \cup [800, 1200)$ is $0$ in both figures, and the mean at times $t \in [400, 800) \cup [1200,1600)$ in Figures \ref{fig:ff5} and \ref{fig:ff6} is $\Delta = 0.5$ and $\Delta=1$ respectively. In Figures \ref{fig:ff7} and \ref{fig:ff8}, the observation at time $t$ is a $32$ dimensional isotropic Gaussian random vector with covariance on each axis being $1/\sqrt{32}$. The mean vector at times $[0,400) \cup [800,1200)$ is $0 \in \mathbb{R}^{32}$ and at times $t \in [400,800) \cup [1200,1600)$ is $\frac{\Delta}{\sqrt{32}}[{1, \cdots, 1}] \in \mathbb{R}^{32}$. 
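
In both high-dimensional settings, the scaling by $\frac{1}{\sqrt{32}}$ ensures that the change in mean has Euclidean norm exactly $\Delta$, so that the one-dimensional and $32$-dimensional experiments are comparable:
\begin{align*}
\left\| \frac{\Delta}{\sqrt{32}}[1, \cdots, 1] - 0 \right\|_2 = \frac{\Delta}{\sqrt{32}} \sqrt{\sum_{i=1}^{32} 1^2} = \frac{\Delta}{\sqrt{32}} \cdot \sqrt{32} = \Delta.
\end{align*}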


\textbf{Bernoulli distribution:} In Figures \ref{fig:ff11} and \ref{fig:ff12}, the data are $\{0,1\}$-valued Bernoulli random variables whose means at times $[0,400) \cup [800,1200)$ are $0.7$ and $0.85$ respectively in the two figures, and whose means at times $[400,800) \cup [1200,1600)$ are $0.3$ and $0.15$ respectively.

% we consider the sample at every time-point to be drawn from a $\{0,1\}$ valued Bernoulli distribution. In Figure \ref{fig:ff11}, the mean in during time intervals $[0,400), [400,800), [1600, 2400)$ and $[1200,1600)$ are respectively $0.9,0.1,0.8,0.2$. In Figure \ref{fig:ff12}, the mean in the time interval $[0,400) \cup [800,1200)$ is $0.7$ and the mean in interval $[400,800) \cup [1200,1600)$ is $0.2$.

\subsubsection{Baselines}

We consider two OCPD algorithms, the {\ttfamily Improved-GLR} of \citep{maillard2019sequential} and {\ttfamily R-BOCPD} of \citep{alami2020restarted}. We chose these two as baselines since they have been empirically demonstrated to be state of the art, and are the only other algorithms to possess finite sample, non-asymptotic FPR guarantees. The {\ttfamily Improved-GLR} can be applied to any distribution, although its theoretical guarantees only hold for sub-gaussian distributions. The {\ttfamily R-BOCPD} algorithm is only applicable to detecting change-points in binary data, and thus we only apply it to Bernoulli distributed data.


\subsubsection{Results}

 \textbf{We observe in Figure \ref{fig:all_figs} that our algorithm is the only one to attain bounded FPR across heavy-tailed, Gaussian, high dimensional and Bernoulli distributions.}
 
 We can observe from Figures \ref{fig:ff9} and \ref{fig:ff10} that under the Pareto distribution, the {\ttfamily Improved-GLR} algorithm has a large number of false-positive detections. Intuitively, this occurs because the {\ttfamily Improved-GLR} algorithm assumes sub-gaussian tails, and thus large deviations that are typical for the heavy-tailed Pareto distribution are mistaken for a change (see also Figure \ref{fig:sample_path}). In contrast, from Figures \ref{fig:ff1}, \ref{fig:ff2}, \ref{fig:ff3}, \ref{fig:ff4} and \ref{fig:ff10}, we see that our algorithm consistently attains bounded false-positive rates and finite detection delays across choices of $\Delta$ and dimension $d$. 

On Gaussian distributed data, Algorithm \ref{algo:learn_model} and the {\ttfamily Improved-GLR} obtain similar performance in terms of false-positive rate. However, the detection delay of {\ttfamily Improved-GLR} is much smaller than ours: the median detection time of our algorithm is larger than the 95th percentile detection time of {\ttfamily Improved-GLR}. On Bernoulli distributed data, all methods attain similar false-positive rates; however, the specialized {\ttfamily R-BOCPD} algorithm is superior in terms of detection delay to both ours and the {\ttfamily Improved-GLR}.


%In addition, we also empirically show that the delay-distribution in Figure \ref{fig:detection_delay} is conservative and in practice one obtains an improved detection delay.



% We show through experiments that across a variety of spectrum -- either heavy-tailed, or high-dimensional or both, Algorithm \ref{algo:learn_model} with parameters defined in Equation (\ref{eqn:defn_B_simulation}) produces good results. 

% 


% \textbf{One-dimensional Pareto Distributions}

% \textbf{One-dimensional normal distributions} Here we compare with GLR Algorithm of \citep{maillard2019sequential}.

% \textbf{Multi-dimensional normal distributions} Here we compare with GLR Algorithm of \citep{maillard2019sequential}.


% \textbf{One-dimensional alternating Pareto and Normal}

% \textbf{Bernoulli Distributions} Here we compare with GLR algorithm of \citep{maillard2019sequential} and the R-BOCPD algorithm of \citep{alami2020restarted}. 


{\color{black}
\begin{table*}[]
\small
    \centering
    \begin{tabular}{c|c|c||c|c|c}
         \textbf{Distribution} & $\mathbf{d}$ & $\mathbf{\Delta}$ & \textbf{Algorithm} \ref{algo:learn_model} & \textbf{Improved-GLR} \citep{maillard2019sequential}  & \textbf{R-BOCPD} \citep{alami2020restarted}  \\ \hline 
         \multirow{4}{*}{Normal} & $1$ &$1$  & $274 \pm 38$ & {$\mathbf{64\pm45}$} & \multirow{4}{*}{N/A} \\
         & $32$ &$1$ & $\mathbf{300 \pm 6}$ & $2400 \pm 0$ &  \\
         & $1$ & $0.5$ & $694 \pm 191$ & $\mathbf{356 \pm 150}$  &\\ 
         & $32$ & $0.5$  & $\mathbf{1427 \pm 14}$ & $2400 \pm 1$ & \\ \hline
         \multirow{4}{*}{Pareto} &$1$&$1$& $\mathbf{296 \pm 35}$ & $19913\pm 8143$ &\multirow{4}{*}{N/A} \\
         & $32$ & $1$  & $\mathbf{302 \pm 7}$ & $1616 \pm 921$ & \\
         & $1$ & $0.5$ & $\mathbf{868 \pm 365}$ & $1891 \pm 663$ & \\
         & $32$ & $0.5$ & $\mathbf{1431 \pm 14}$ & $1667 \pm 653$ & \\ \hline 
         \multirow{2}{*}{Bernoulli} & - & $0.7$ & $515 \pm 49$ & $181 \pm 23$ & $\mathbf{23 \pm 479}$ \\
         & - & $0.5$ & $1509 \pm 53$ & $1466 \pm 762$ & $\mathbf{63 \pm 380}$ \\ \hline
         
    \end{tabular}
    \caption{Quantitative summary of Figure \ref{fig:all_figs} in terms of regret, where lower is better. Entries report the median regret with a $95$\% confidence interval. Our method achieves lower regret across a variety of settings of distribution, dimension and change magnitude.}
    \label{tab:fig_table}
\end{table*}


In Table \ref{tab:fig_table}, we summarize Figure \ref{fig:all_figs} by measuring \emph{regret}. For any OCPD algorithm $\mathcal{A}$, we define a function $R^{(\mathcal{A})} : [T] \to \mathbb{N}$, where $R^{(\mathcal{A})}(t) = \sum_{s \leq t} \mathcal{A}_s$ is the total number of change-points detected up to time $t$. Similarly, for any $t \in [T]$, the ground-truth function $R^*(t) = \max \{ c : \tau_c \leq t \}$ is the number of true changes up to time $t$. The regret of algorithm $\mathcal{A}$ is defined as $\sum_{t=1}^T | R^{(\mathcal{A})}(t) - R^*(t)|$. This measure is non-negative and is $0$ if and only if the output of the algorithm matches the ground truth. In Table \ref{tab:fig_table}, we report the median regret along with its $95$\% confidence interval.
We observe in Table \ref{tab:fig_table} that our method achieves lower regret across a variety of situations, whether the data is heavy-tailed, light-tailed, high-dimensional or discrete.
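
As a hypothetical illustration of the regret definition (the numbers below are chosen for exposition and are not taken from our experiments), suppose $T=5$ and there is a single true change at $\tau_1 = 3$. Assuming the convention that $R^*(t) = 0$ before the first change, $R^*(t)$ takes the values $(0,0,1,1,1)$ for $t=1,\dots,5$. An algorithm that raises a single detection at time $4$ has $R^{(\mathcal{A})}(t) = (0,0,0,1,1)$, and hence
\[
\sum_{t=1}^{5} \left| R^{(\mathcal{A})}(t) - R^*(t) \right| = 0 + 0 + 1 + 0 + 0 = 1 ,
\]
which coincides with its detection delay. If the same algorithm additionally raised a false alarm at time $1$, then $R^{(\mathcal{A})}(t) = (1,1,1,2,2)$ and the regret would be $1 + 1 + 0 + 1 + 1 = 4$. In this toy example, regret therefore penalizes both late detections and false positives.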

% \begin{figure}
%     \centering
%     \includegraphics[width=0.75\linewidth]{plots/refined_plots/empirical_heat_map.pdf}
%     \caption{Plot of observed delay $ \mathcal{D}(n, \Delta, \delta')$ for Pareto distribution $d=32$. As can be seen, the observed delay is much smaller than the worst case delay shown in Figure \ref{fig:heatmap}.}
%     \label{fig:empirical_heat_map}
% \end{figure}

}


\subsubsection{Change-point localization}



In Figure \ref{fig:localization}, we demonstrate the sharpness of change-point localization (detailed in Algorithm \ref{algo:learn_model_local}). The setting in Figure \ref{fig:localization} is identical to that of Figure \ref{fig:all_figs}; the boundaries of the shaded region represent the $5$th percentile of the starting point and the $95$th percentile of the ending point of the change-location interval output in Line $12$ of Algorithm \ref{algo:learn_model_local}. The localization region is biased towards the right, which is expected since our algorithm is designed to minimize false positives even in the worst case. Mathematically proving that the localization is sharp is an interesting direction for future work.


\begin{figure}
    \centering
    \includegraphics[width=0.85\linewidth]{plots/refined_plots/well_log.pdf}
    \caption{Change-point detection performance of Algorithm \ref{algo:learn_model} and the {\ttfamily Improved-GLR} on the well-log dataset.}
    \label{fig:real_data}
\end{figure}


\subsection{Real Data}

In Figure \ref{fig:real_data}, we show the performance of Algorithm \ref{algo:learn_model} and the {\ttfamily Improved-GLR} on the well-log dataset \citep{oruanaidh1996numerical}. This dataset consists of $4050$ measurements, in the range $[6 \times 10^4, 10^5]$, of nuclear magnetic response taken during the drilling of a well. The data are used to interpret the geophysical structure of the rock surrounding the well, and the variations in mean reflect the stratification of the earth's crust. % \citep{adams2007bayesian, oruanaidh1996numerical, fearnhead2003line}. 
We process the data by dividing it by $10^{4.5}$ and run Algorithm \ref{algo:learn_model} with $G=10$, $\sigma=1$ and {\ttfamily Improved-GLR} with $\sigma=1$. The detected change-points are shown in Figure \ref{fig:real_data}. Figure \ref{fig:real_data} shows that the performance of Algorithm \ref{algo:learn_model} is at least as good as that of {\ttfamily Improved-GLR} in terms of false-positive detections.
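
For reference, under the stated range of the raw measurements, dividing by $10^{4.5}$ maps the data to
\[
\left[ \frac{6 \times 10^{4}}{10^{4.5}},\ \frac{10^{5}}{10^{4.5}} \right] = \left[ 6 \times 10^{-0.5},\ 10^{0.5} \right] \approx [1.90,\ 3.16] ,
\]
so the processed observations are of order one. We read this as making the data comparable in scale to the choice $\sigma = 1$; this is an interpretation of the preprocessing step and not a separately tuned quantity.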



% \clearpage



% \begin{figure}
%     \centering
%     \begin{subfigure}{0.92\linewidth}
%     \includegraphics[width=0.99\linewidth]{plots/pareto_gaussian.pdf}
%     \caption{Alternating Pareto with shape $2.01$ and Unit variance gaussian}
%     \end{subfigure}
%         \begin{subfigure}{0.92\linewidth}
%     \includegraphics[width=0.99\linewidth]{plots/pareto_0.1_1.pdf}
%     \caption{Pareto distributions with shape parameter $2.01$}
%     \end{subfigure}
%     \caption{Scalar observations with $\Delta = 0.1$ and $\delta=0.05$}
%     \label{fig:motivation}

% \end{figure}



% \begin{figure}
%     \centering
%     \begin{subfigure}{0.92\linewidth}
%     \includegraphics[width=0.99\linewidth]{plots/comparision_0.1.pdf}
%     \caption{$\Delta=0.1, \delta=0.05$}
%     \end{subfigure}
%         \begin{subfigure}{0.92\linewidth}
%     \includegraphics[width=0.99\linewidth]{plots/comparision_3_0.5.pdf}
%     \caption{$\Delta=0.5, \delta=0.05$}
%     \end{subfigure}
%             \begin{subfigure}{0.92\linewidth}
%     \includegraphics[width=0.99\linewidth]{plots/comparision_2.pdf}
%     \caption{$\Delta=1, \delta=0.05$}
%     \end{subfigure}
%     \caption{One-dimensional Gaussian distribution with unit variance.}
%     \label{fig:glr_comparision}

% \end{figure}

    
%     \begin{figure}
%     \centering
%     \includegraphics[scale=0.3]{plots/heavy_tail_compare.pdf}
%     \caption{Pareto distribution with shape $2.01$ and $\Delta=0.5, \delta=0.05$.}
%     \label{fig:heavy_tail_compare}
% \end{figure}


%     \begin{figure}
%     \centering
%     \includegraphics[scale=0.3]{plots/bernoulli_comparision.pdf}
%     \caption{Bernoulli distributions with means $0.9, 0.1, 0.8,0.2$ respectively each for $800$ time-steps.}
%     \label{fig:bernoulli_compare}
% \end{figure}



%     \begin{figure}
%     \centering
%     \includegraphics[scale=0.3]{plots/high_dim_gaussian.pdf}
%     \caption{Identity covariance matrix Gaussians on $d=32$ with $\Delta = 2, \delta=0.05$.}
%     \label{fig:high_dim_gaussian}
% \end{figure}

% \clearpage

% \begin{figure}
%     \centering
%     \includegraphics[scale=0.3]{plots/multiple_normal_1.pdf}
%     \caption{One Dimensional change-point with $\Delta = 0.1$ and $\delta=0.05$ and light-tailed normal random variable with unit variance.}
%     \label{fig:multiple_motivation}
% \end{figure}

% \begin{figure}
%     \centering
%     \includegraphics[scale=0.3]{plots/multiple_pareto_1.pdf}
%     \caption{One Dimensional change-point with $\Delta = 0.1$ and $\delta=0.05$ and Pareto distribution with shape $2.1$.}
%     \label{fig:multiple_motivation}
% \end{figure}

% \begin{figure}
%     \centering
%     \includegraphics[scale=0.3]{plots/multiple_pareto_2.pdf}
%     \caption{One Dimensional change-point with $\Delta = 0.1$ and $\delta=0.05$ and Pareto distribution with shape $2.01$.}
%     \label{fig:multiple_motivation}
% \end{figure}


% \begin{figure}
%     \centering
%     \includegraphics[scale=0.3]{plots/multiple_normal_pareto_1.pdf}
%     \caption{One Dimensional change-point with $\Delta = 0.1$ and $\delta=0.05$ with alternating Pareto distribution with shape $2.01$ and standard normal light-tailed distribution.}
%     \label{fig:multiple_motivation}
% \end{figure}



% \begin{figure}
%     \centering
%     \includegraphics[scale=0.2]{plots/pareto_normal_2.png}
%     \caption{One Dimensional change-point with $\Delta = 0.1$ and $\delta=0.05$ with alternating Pareto distribution with shape $2.01$ and standard normal light-tailed distribution. The last change-point is not detected in this sample-path.}
%     \label{fig:multiple_motivation}
% \end{figure}


% \begin{figure}
%     \centering
%     \includegraphics[scale=0.3]{plots/normal_pareto_3.pdf}
%     \caption{One Dimensional change-point with $\Delta = 0.1$ and $\delta=0.05$ with alternating Pareto distribution with shape $2.01$ and standard normal light-tailed distribution.}
%     \label{fig:multiple_motivation}
% \end{figure}



% \begin{figure}
%     \centering
%     \includegraphics[scale=0.3]{plots/high_dim_pareto_1.pdf}
%     \caption{$d=15$ dimensional change-point with $\Delta = 5$ and $\delta=0.05$ with Pareto distribution with shape $2.01$ in all directions. The Y-axis plots the norm of the observed vector and this thus non-zero.}
%     \label{fig:multiple_motivation}
% \end{figure}


% \begin{figure}
%     \centering
%     \includegraphics[scale=0.3]{plots/high_dim_pareto_2.pdf}
%     \caption{$d=15$ dimensional change-point with $\Delta = 2$ and $\delta=0.05$ with Pareto distribution with shape $2.01$ in all directions. The Y-axis plots the norm of the observed vector and this thus non-zero. There are two missed-detections in this sample path.}
%     \label{fig:multiple_motivation}
% \end{figure}
\section{Conclusions}

We introduced a new method, based on clipped-SGD, to detect change-points with guaranteed finite-sample FPR, without parametric or tail assumptions. The key technical contribution is an anytime online mean estimation algorithm that provides a confidence bound for the mean at all confidence levels simultaneously. We also give a finite-sample, high-probability bound on the detection delay as a function of the gap between the means and the number of pre-change observations. We further corroborate empirically that ours is the only algorithm to detect change-points with bounded FPR across multi-dimensional heavy-tailed, Gaussian and binary-valued data streams.

Our work opens several interesting directions for future investigation. Sharp confidence intervals for estimating the mean of a random vector whose variance need not exist were recently obtained in \citep{cherapanamjeri2022optimal}. Extending the tools therein to further relax the second-moment assumption we considered is a natural direction for future work. Another open question is whether the Laplace method used in \citep{maillard2019sequential} can be extended to the high-dimensional setting to obtain dimension-free confidence bounds. Further, we observe in simulations that our method attains `sharp' localization empirically. Understanding the three-way trade-off between sharpness of localization, FPR and detection delay is an important area of future work.
\clearpage
\balance
\bibliography{change-point-detect}

\onecolumn %% Turn this off if single column is desired for the supplement



\end{document}
