\section{Discussion} \label{app:discussion}

\paragraph{Implementation in Deep Neural Networks.}

Our data structure \textsc{DynamicKDE} can be integrated with deep neural networks in tasks that require efficient and adaptive kernel-based similarity or density estimation. One application is in the embedding or attention layers of neural architectures, where kernel similarity computations are needed. For example, \cite{zhdk23} reduces the softmax matrix in attention computation to a variant of KDE and uses an efficient KDE solver to approximate the attention computation in sub-quadratic time. Our dynamic maintenance of KDE data structures, with its robustness to adversarial queries, can also be used to study the attention computation problem in future work. To implement this in DNNs, we can follow these steps:

\begin{enumerate}
    \item \textbf{Preprocessing: } Use \textsc{Initialize} of the data structure \textsc{DynamicKDE} (Algorithm~\ref{alg:dynamic_KDE_initialize_pseudo}) to preprocess the data with the chosen kernel and build the \textsc{LSH} hash tables. This can be done on feature embeddings generated by earlier layers of the network.
    \item \textbf{Query Phase: } During inference or training (especially in attention-like mechanisms), use \textsc{Query} of \textsc{DynamicKDE} (Algorithm~\ref{alg:dynamic_KDE_query_pseudo}) to efficiently compute kernel densities for input query embeddings. The approximation guarantees are preserved due to importance sampling and \textsc{LSH} recovery.
    \item \textbf{Online/Incremental Learning: } When new data points are introduced (e.g., in continual learning or streaming settings), \textsc{Update} of \textsc{DynamicKDE} (Algorithm~\ref{alg:dynamic_KDE_update_pseudo}) integrates them in sublinear time, without reinitializing the entire data structure.
\end{enumerate}

\paragraph{Dynamically Evolving Data Distributions.}

We believe that our method can handle dynamically evolving data distributions, such as those studied in~\cite{xcw+24}, because it dynamically updates kernel density estimates in response to insertions and deletions in the dataset while handling adaptive or adversarial queries. \cite{xcw+24} analyzes evolving domain generalization (EDG), where the data distribution changes over time due to temporal shifts such as concept drift and covariate shift. It proposes Mutual Information-Based Sequential Autoencoders (MISTS), which explicitly separate dynamic and invariant features to adapt across evolving domains. Since our method is robust to dynamic changes in the dataset, it is applicable in scenarios where the data evolves over time. However, making this connection precise requires a more careful analysis of the details of both works, so we leave it as a future direction.

\paragraph{Justification of Assumptions.}

In our proof, we consider the worst-case scenario. Specifically, we apply the $\epsilon$-net technique, taking a union bound over all balls in the covering net to ensure robustness against adaptive, potentially adversarial queries. Unlike the analyses of many traditional data structures, we do not assume that queries are i.i.d. Instead, our framework is explicitly designed for adaptive queries, which are the essence of adversarial behavior in real-world applications such as interactive machine learning, data poisoning scenarios, and online optimization. The robustness guarantees are built up incrementally: from the success probability of a single query, to all net points on the unit ball via $\epsilon$-nets, and finally to all query points in the input space. This ensures that our results extend beyond static, non-adversarial settings and remain valid when queries are interdependent or chosen in response to previous answers.
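
As a schematic illustration of the union-bound step (the symbols $\mathcal{N}$, $\epsilon_0$, $\delta_0$, and $\delta$ are introduced here only for exposition and need not match the parameters used in the formal proofs), let $\mathcal{N}$ be an $\epsilon_0$-net of the unit ball in $\R^d$, and suppose that any single fixed query fails with probability at most $\delta_0$. A standard volume argument gives $|\mathcal{N}| \leq (1 + 2/\epsilon_0)^d$, so
\begin{align*}
    \Pr[\, \exists p \in \mathcal{N} : \text{the query at } p \text{ fails} \,] \leq |\mathcal{N}| \cdot \delta_0 \leq (1 + 2/\epsilon_0)^d \cdot \delta_0 .
\end{align*}
Hence it suffices to take $\delta_0 \leq \delta \cdot (1 + 2/\epsilon_0)^{-d}$, that is, $\log(1/\delta_0) \geq \log(1/\delta) + d \log(1 + 2/\epsilon_0)$, for all net points to succeed simultaneously with probability at least $1 - \delta$. Passing from net points to arbitrary queries is the separate final step described above.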

\paragraph{Practical Implications.} Kernel density estimation (KDE) has a direct connection to efficient attention computation in Transformers, as widely discussed in prior works~\cite{tby+19,zhdk23,as23,as24,as24_tensor,as25,as25_rank}. Recall that for the token sequence $X_\ell \in \R^{n \times d}$ at Transformer layer $\ell$ and weight matrices $Q, K \in \R^{d\times d}$, the attention weight matrix is given by
\begin{align*}
    \mathsf{Attn}(Q,K) := D^{-1} \exp(X_{\ell} Q K^\top X_{\ell}^\top),
\end{align*}
where $D:= \diag( \exp( X_{\ell} Q K^\top X_{\ell}^\top ){\bf 1}_n )$ and $\exp(A)_{i, j} = \exp(A_{i, j})$ for all matrices $A$. 

We can set $k_i:= (X_\ell K)_{i, *}$ and $q_i:= (X_\ell Q)_{i, *}$ for all $i \in [n]$, and $A_{i,j} := \exp(q_i^\top k_j)$ for all $i, j\in[n]$. Then $A$ is a kernel matrix whose entries are exponentiated inner products, and the only difference between $\mathsf{Attn}(Q,K) = D^{-1} A$ and the kernel computation of $A$ is the normalization matrix $D^{-1}$. 
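
To spell out the role of the normalization, the following entrywise form may help. It assumes the common convention that a KDE query averages the kernel over the $n$ data points, and it writes $f_{\exp}(k,q) := \exp(q^\top k)$, a notation introduced here only for illustration. Since $D_{i,i} = \sum_{l=1}^n A_{i,l}$, we have
\begin{align*}
    \mathsf{Attn}(Q,K)_{i,j} = \frac{A_{i,j}}{\sum_{l=1}^n A_{i,l}} = \frac{\exp(q_i^\top k_j)}{n \cdot \frac{1}{n} \sum_{l=1}^n f_{\exp}(k_l, q_i)} .
\end{align*}
Under this reading, each diagonal entry of $D$ is $n$ times a KDE query at $q_i$ over the dataset $\{k_1, \ldots, k_n\}$, so approximating $D$ amounts to answering $n$ such queries.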

To make the connection explicit, we can consider the Gaussian kernel
\begin{align*}
    f_{\mathrm{Gaussian}}(k,q):= \exp(- 0.5 \sigma^{-2}\| k - q\|_2^2). 
\end{align*}

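When $\| k \|_2 = 1$ and $\| q\|_2 = 1$, we have $\| k - q \|_2^2 = \| k \|_2^2 + \| q \|_2^2 - 2 q^\top k = 2 - 2 q^\top k$, so the Gaussian kernel can be simplified as 
\begin{align*}
    f_{\mathrm{Gaussian}}(k,q)= \exp( \sigma^{-2} (q^\top  k-1)) = e^{-\sigma^{-2}} \cdot \exp( \sigma^{-2} q^\top k ).
\end{align*}
For $\sigma = 1$, this equals $\exp(q^\top k)$ up to the constant factor $e^{-1}$, which cancels under the normalization $D^{-1}$. The Gaussian kernel therefore recovers the attention entries $A_{i,j} = \exp(q_i^\top k_j)$ up to scaling.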

Thus, the efficient KDE algorithm proposed in this paper may inspire further applications in efficient attention computation, including Transformer architectures~\cite{fa23,lls+25_pruning,cssz25,gswy25}, graph attention~\cite{vcc+18,flz+21,zha24,lls+25}, and attention-inspired regression~\cite{dls23,gsx23,syz24,dlms24}. These architectural advancements can further enhance computational efficiency across various fields, such as diffusion models~\cite{hwl+24,ssz+25,cgh+25,ghs+25_physical} and flow-based generative models~\cite{cgl+25,ccl+25_form,csy25_vlfm,gkl+25}.

