\section{Introduction}
\label{sec:intro}


Kernel density estimation~(KDE) is a well-known machine learning approach with wide applications in biology~\cite{fc17,crv+18}, physics~\cite{hallin2021classifying,c01} and law~\cite{chs20}. The KDE is defined as follows: given a kernel function $f(x,y)$ and a dataset $\{x_1, x_2, \cdots, x_n \}$, we would like to estimate 
% \begin{align*}
$\frac{1}{n}\sum_{i=1}^{n} f(x_i,y)$
% \end{align*}
for a query $y$. It is standard to assume the kernel $f$ to be positive semi-definite. From a statistics perspective, we regard KDE as estimating the density of a probability distribution provided by a mapping. 
Recently, there has been a growing trend in applying probabilistic data structures for KDE  \cite{cs17,biw19,srb+19,acss20,ckns20,cs20,kap22}. The general idea is to transform the kernel function into a distance measure and then apply similarity search data structures such as locality sensitive hashing and sketching. This co-design of data structure and KDE is of practical importance: (1) the computational efficiency is taken into consideration when we design KDE algorithms; (2) the capacity of traditional probabilistic data structures is extended from search to sampling. As a result, we obtain a series of KDE algorithms with both sample efficiency and running time efficiency.
However, current KDE data structures focus on static settings where the dataset is fixed and the queries are independent of each other. More practical settings should be taken into consideration. In some applications of KDE, the dataset changes dynamically. For instance, in time series modeling~\cite{mrl95,hl18}, KDE estimators should adapt to insertions and deletions in the dataset. In semi-supervised learning~\cite{whm+09,wlg19}, KDE data structures should handle updates of the kernel function. Moreover, in works that apply KDE in optimization, the data structures should be robust to adversarial queries. As a result, the dynamic maintenance of KDE data structures deserves more attention in machine learning research.

 

In this paper, we argue that there exists a practice-to-theory gap for the dynamic maintenance of KDE data structures. Although existing work~\cite{ciu21} supports insertion and deletion in KDE data structures, the impact of these operations on the quality of KDE is not well addressed. Moreover, the robustness of KDE data structures against adversaries has recently been raised as a concern. Thus, a formal theoretical analysis 
is required to discuss the robustness of KDE data structures in a dynamic setting.
We present a theoretical analysis of the efficient maintenance of KDE for dynamic datasets and adversarial queries. Specifically, we present the first data structure design that can quickly adapt to updated input data and is robust to adversarial queries. We call our data structure and the corresponding algorithms \textit{adaptive kernel density estimation.}  
Our data structure requires only subquadratic space, each update to the input data takes sublinear time, and each query finishes in sublinear time.

\paragraph{Notation}  

We use $\R$, $\R_+$, $\mathbb{N}_+$ to denote the set of real numbers, positive real numbers, and positive integers. 
For a set $X$, we use $|X|$ to denote its cardinality. 
Let $n \in \mathbb{N}_+$ and $r \in \R$. We define $[n] := \{1, 2, 3, \dots, n\}$ and $\lceil r \rceil$ to be the ceiling of $r$. Let $\R^n$ be the set of all $n$-dimensional vectors whose entries are all real numbers. $\|x\|_2$ represents the $\ell_2$ norm of $x$. $\Pr[\cdot]$ represents the probability, and $\E[\cdot]$ represents the expectation. We define $\exp_2(r):= 2^r$.

\subsection{Related Work}

 
\paragraph{Efficient Kernel Density Estimation}
The naive KDE procedure performs a linear scan over the data points, which is prohibitively expensive for large-scale datasets. Thus, it is of practical significance to develop efficient KDE algorithms. A line of traditional KDE algorithms, namely kernel merging~\cite{hs08,chm12}, clusters the dataset so that the KDE is approximated by a weighted combination of centroids. However, these algorithms do not scale to high-dimensional datasets. 
There is also a trend toward sampling-based KDE algorithms. The goal is to develop efficient procedures that approximate KDE with fewer data samples. Starting from random sampling~\cite{mf+17}, sampling procedures such as Herding~\cite{cws10} and $k$-centers~\cite{cs16} have been introduced in KDE. Some work also combines sampling with the coreset concept~\cite{pt20} and provides a KDE algorithm by sampling on an optimized subset of data points. Recently, there has been a growing interest in applying hash-based estimators (HBE)~\cite{cs17,biw19,srb+19,cs20,cgc+18,ss21} for KDE. The HBE uses Locality Sensitive Hashing~(LSH) functions. The collision probability of two vectors under an LSH function is monotonic in their distance. Using this feature, HBE performs efficient importance sampling with LSH functions and hash-table-type data structures. However, current HBEs are built for static settings and thus are not robust to incremental changes in the input data. As a result, their application in large-scale online learning is limited. Beyond the LSH-based KDE literature, there is also KDE work based on polynomial methods~\cite{acss20,as23,as25,as25_rank}. The dynamic type of KDE has also been considered in \cite{djs+22,bsz23}. \cite{dms23} presents both randomized and deterministic algorithms for approximating a symmetric KDE computation.

\paragraph{Adaptive Data Structure}
Recently, there has been a growing trend of applying data structures~\cite{s19,cmf+20,ssx21,xss21,xcl+21,sxz22,z22,sxyz22} to improve running time efficiency in machine learning. These adaptive data structures have also extended their success to many fields, such as optimization~\cite{lsz19,cls19,blss20,jswz21,qszz23} and differential privacy~\cite{hkm+22,csw+23,syyz23,ffl+25}. 
However, there exists a practice-to-theory gap between data structures and learning algorithms. Most data structures assume queries to be independent and provide theoretical guarantees based on this assumption. In contrast, the queries issued to data structures across the iterations of a learning algorithm are mutually dependent. As a result, the existing analysis framework for data structures cannot provide guarantees in optimization. To bridge this gap, quantization strategies~\cite{ssx21,xss21,sxz22,sxyz22} have been developed for adaptive queries in machine learning; they quantize each query to its nearest vertex on an $\epsilon$-net. The failure probability of the data structures can then be upper bounded by a standard $\epsilon$-net argument. Although quantization methods have demonstrated their success in machine learning, this direct combination does not fully exploit the power of both data structures and learning algorithms. In our work, we aim at a co-design of data structure and machine learning for efficiency improvements in adaptive KDE. 

 

\subsection{Problem Formulation}






In this work, we would like to study the following problem.

\begin{definition}[Dynamic KDE]\label{def:dynamic_KDE}
Let $f:\mathbb{R}^d\times \mathbb{R}^d \rightarrow [0,1]$ denote a kernel function. Let   $X=\{x_{i}\}_{i=1}^{n} \subset \mathbb{R}^{d}$ denote a dataset. Let $f_{\mathsf{KDE}}^{*}:=f(X,q):=\frac{1}{|X|}\sum_{x \in X} f(x,q)$ define the kernel density estimate of a query $q\in \R^d$ with respect to $X$. Our goal is to design a data structure that efficiently supports any sequence of the following operations:  
\begin{itemize}
    \item \textsc{Initialize}$(f:\mathbb{R}^d\times \mathbb{R}^d \rightarrow [0,1],X\subset\mathbb{R}^d, \epsilon \in (0,1),f_{\mathsf{KDE}}\in[0,1])$. The data structure takes kernel function $f$, data set $X=\{x_1, x_2, \dots, x_n\}$, accuracy parameter $\epsilon$ and a known quantity $f_{\mathsf{KDE}}$ satisfying $f_{\mathsf{KDE}} \geq f_{\mathsf{KDE}}^{*}$ as input for initialization.
    \item \textsc{Update}$(z \in \R^d, i \in [n])$. Replace the $i$'th data point of data set $X$ with $z$.
    \item \textsc{Query}$(q \in \R^d)$. Output $\tilde{d}\in\mathbb{R}$  
    such that $(1-\epsilon)f_{\mathsf{KDE}}^{*}\leq \tilde{d} \leq(1+\epsilon)f_{\mathsf{KDE}}^{*}$.
\end{itemize}
\end{definition}

We note that the \textsc{Query} procedure does not assume i.i.d.\ queries. Instead, we allow adaptive queries and provide theoretical guarantees for them.


 


\subsection{Our Result}\label{sec:our_result}
In this work, we provide theoretical guarantees for the dynamic KDE data structures defined in Definition~\ref{def:dynamic_KDE}. We summarize our main result as below:

\begin{theorem}[Main result]\label{thm:main_result}
Given a kernel function $f$ and a set of points $X \subset \R^d$. Let $\cost(f)$ be defined as in Definition~\ref{def:cost_K}. For any accuracy parameter $\epsilon \in (0,0.1)$, there is a data structure using space $O(\epsilon^{-2}n\cdot \cost(f))$  (Algorithms~\ref{alg:dynamic_KDE_initialize},~\ref{alg:dynamic_KDE_update} and~\ref{alg:dynamic_KDE_query}) for the Dynamic KDE Problem (Definition~\ref{def:dynamic_KDE}) with the following procedures:
\begin{itemize}
    \item \textsc{Initialize}$(f:\mathbb{R}^d\times \mathbb{R}^d \rightarrow [0,1], X\subset\mathbb{R}^d, \epsilon \in (0,1),f_{\mathsf{KDE}}\in[0,0.1])$. Given a kernel function $f$, 
    a dataset $X$, an accuracy parameter $\epsilon$ and a quantity $f_{\mathsf{KDE}}$ as input, the data structure \textsc{DynamicKDE} preprocesses in time 
    \begin{align}\label{eq:time_1}
    % \label{eq:time_1}
        & ~ O(\epsilon^{-2}n^{1+o(1)}\cost(f)\notag\\
        \cdot & ~ (\frac{1}{f_{\mathsf{KDE}}})^{o(1)}\log (1/ f_{\mathsf{KDE}}) \cdot \log^3 n)
    \end{align}
    \item \textsc{Update}$(z \in \R^d, i \in [n])$. Given a new data point $z\in \mathbb{R}^d$ and index $i \in [n]$, the \textsc{Update} operation takes $z$ and $i$ as input and updates the data structure in time 
    \begin{align}\label{eq:time}
        & ~ O(\epsilon^{-2}n^{o(1)}\cost(f) \notag\\
        \cdot & ~ (\frac{1}{f_{\mathsf{KDE}}})^{o(1)}\log (1/ f_{\mathsf{KDE}}) \cdot \log^3 n)
    \end{align}
    
    \item \textsc{Query}$(q \in \R^d)$. Given a query point $q\in \mathbb{R}^d$, the \textsc{Query} operation takes $q$ as input, approximately estimates the kernel density at $q$ within the time bound stated in Eq.~\eqref{eq:time},
    % \begin{align*}
    %     O(\epsilon^{-2}n^{o(1)}\cost(f)\cdot (\frac{1}{f_{\mathsf{KDE}}})^{o(1)}\log (1/ f_{\mathsf{KDE}}) \cdot \log^3 n).
    % \end{align*}
    and outputs $\tilde{d}$ such that
    % \begin{align*}
        $(1-\epsilon)f(X,q)\leq \tilde{d} \leq(1+\epsilon)f(X,q)$.
    % \end{align*}
\end{itemize}
\end{theorem}
 
We prove the main result in Lemma~\ref{lem:dynamic_KDE_initialize}, Lemma~\ref{lem:dynamic_KDE_update} and Lemma~\ref{lem:dynamic_KDE_query}.
 


\subsection{Technical Overview}
In this section, we give an overview of the techniques that lead to our main result.





{\bf Density Constraint.}
We impose an upper bound on the true kernel density for query $q$. We also introduce geometric level sets, so the number of data points that fall into each level is upper bounded.
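
As an illustrative sketch of why such a bound holds, consider one natural choice of geometric level sets for a fixed query $q$. The symbol $L_j$ and the thresholds $2^{-j}$ are notation chosen for this overview only, and the formal definitions used later may differ in constants. Define
\begin{align*}
    L_j := \{ x \in X : 2^{-j} < f(x,q) \leq 2^{-j+1} \}, \quad j \in \mathbb{N}_+ .
\end{align*}
If the true density satisfies $\frac{1}{n}\sum_{x \in X} f(x,q) \leq f_{\mathsf{KDE}}$, then, since $f$ is nonnegative and every point in $L_j$ contributes more than $2^{-j}$,
\begin{align*}
    |L_j| \cdot 2^{-j} \leq \sum_{x \in L_j} f(x,q) \leq \sum_{x \in X} f(x,q) \leq n \cdot f_{\mathsf{KDE}},
\end{align*}
which gives $|L_j| \leq 2^{j} \cdot n \cdot f_{\mathsf{KDE}}$.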

{\bf Importance Sampling.}
To approximate the kernel density efficiently, we adopt the importance sampling technique. We sample each data point with a probability that depends on its contribution to the estimate: the higher the contribution, the higher the probability of being sampled. Then, we can construct an unbiased estimator based on the sampled points and their sampling probabilities. The main problem is how to evaluate the contribution of each point to the KDE. We exploit the geometric properties of the kernel function and estimate the contribution of each point based on its distance from the query point. 
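
To make the estimator concrete, the following is a minimal sketch under the assumption that each $x_i$ is kept independently with some probability $p_i \in (0,1]$. The symbols $p_i$, $Z_i$ and $\hat{d}$ are illustrative and are not taken from the later sections. Let $Z_i \in \{0,1\}$ indicate whether $x_i$ is sampled, so that $\E[Z_i] = p_i$. Then
\begin{align*}
    \hat{d} := & ~ \frac{1}{n}\sum_{i=1}^{n} \frac{Z_i}{p_i} f(x_i,q), \\
    \E[\hat{d}] = & ~ \frac{1}{n}\sum_{i=1}^{n} \frac{\E[Z_i]}{p_i} f(x_i,q) = \frac{1}{n}\sum_{i=1}^{n} f(x_i,q),
\end{align*}
so $\hat{d}$ is unbiased. By independence, its variance is $\frac{1}{n^2}\sum_{i=1}^{n} \frac{1-p_i}{p_i} f(x_i,q)^2$, which stays small when points with large $f(x_i,q)$ receive large $p_i$.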

{\bf Locality Sensitive Hashing.}
One problem with importance sampling is that we have no access to the query point during preprocessing, so we cannot estimate in advance the contribution of each point to a future query. We make use of LSH to address this issue. To be specific, LSH preprocesses the data points and finds the near neighbors of a query with high probability. With this property, we can first sample all data points in several rounds with geometrically decaying sampling probability. We design an LSH for each round that satisfies the following property: given a query, the LSH recovers points whose contributions are proportional to the sampling probability in that round. Then we can find the sampled points and the proper sampling probability when a query arrives.
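
For reference, the standard notion of sensitivity behind this step can be stated as follows. The parameters $r$, $c$, $p_1$, $p_2$ and the family $\mathcal{H}$ are generic placeholders, and this sketch does not fix the specific family or parameters used by the data structure. A family $\mathcal{H}$ of hash functions is $(r, c \cdot r, p_1, p_2)$-sensitive, with $c > 1$ and $p_1 > p_2$, if for all $x, y \in \R^d$
\begin{align*}
    \|x-y\|_2 \leq r ~ \Longrightarrow & ~ \Pr_{h \sim \mathcal{H}}[h(x) = h(y)] \geq p_1, \\
    \|x-y\|_2 \geq c \cdot r ~ \Longrightarrow & ~ \Pr_{h \sim \mathcal{H}}[h(x) = h(y)] \leq p_2.
\end{align*}
When the kernel value $f(x,q)$ is a decreasing function of $\|x-q\|_2$, a distance threshold $r$ corresponds to a threshold on the contribution of $x$, which is what allows each round to target points of one contribution scale.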

{\bf Dynamic Update.}
The previous techniques, i.e., importance sampling and LSH, are independent of the coordinate values of the data points themselves, which motivates us to support updating data points dynamically. Since LSH is a hash-based structure, given a new data point $z \in \R^d$ and an index $i$ indicating the data point to be replaced, we search for the bucket into which $x_i$ was hashed, replace $x_i$ with the new data point $z$, and update the hash table. Such an update does not affect the data structure's ability to estimate the kernel density.

{\bf Robustness.}
To make our data structure robust to adaptive queries, we take the following steps. Starting from a constant success probability, we first repeat the estimation several times and take the median. This yields a correct answer with high probability for a fixed query point. Then we extend this success probability to the net points of a unit ball. Finally, we generalize to all query points in $\R^d$. Thus we obtain a data structure that is robust to adversarial queries.
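
The first two steps can be sketched as follows, with illustrative symbols $R$, $\delta$ and $N$ and unoptimized constants that need not match the formal analysis. Suppose that, for a fixed query, a single estimate lies within the $(1\pm\epsilon)$ range with probability at least $3/4$. The median of $R$ independent estimates can fall outside the range only if at least $R/2$ of them do, while the expected number of failures is at most $R/4$. By Hoeffding's inequality,
\begin{align*}
    \Pr[\text{median fails}] \leq \exp\Big(-\frac{2 (R/4)^2}{R}\Big) = \exp(-R/8).
\end{align*}
For a finite net $N$ and a failure probability $\delta \in (0,1)$, a union bound over the net points gives
\begin{align*}
    \Pr[\exists\, q \in N : \text{median fails on } q] \leq |N| \cdot \exp(-R/8) \leq \delta
\end{align*}
whenever $R \geq 8 \ln(|N|/\delta)$. Extending the guarantee from net points to arbitrary queries additionally requires that the kernel changes little between a query and its nearest net point, which this sketch assumes without proof.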


\paragraph{Roadmap} In Section~\ref{sec:preli}, we describe some basic definitions and lemmas that are frequently used. 
% In Section~\ref{sec:technical}, we list some technical claims for our results. 
In Section~\ref{sec:data}, we demonstrate our data structure in detail, including the algorithm and the running time analysis. 
% We study the correctness of our data structure in Section~\ref{sec:correctness}. 
We analyze the robustness of our data structure against an adversary in Section~\ref{sec:adversary}.
Finally, we draw our conclusion in Section~\ref{sec:conclusion}.