\section{Preliminaries}
\label{sec-def}

This section reviews the background on kernel density estimation (KDE), locality-sensitive hashing (LSH) kernels, and metric-based local differential privacy (mLDP), and then formally defines the problem studied in this work.

\paragraph{Kernel Density Estimation}
Let $\mathcal{D}$ denote a data set of $n$ data points in the $m$-dimensional Euclidean space $\mathbb{R}^m$.
A \emph{kernel} is defined as a function $k: \mathbb{R}^m \times \mathbb{R}^m \mapsto [0,1]$ that quantifies the similarity of two points in $\mathbb{R}^m$.
For any given query point $\bm{q} \in \mathbb{R}^m$, the \emph{kernel density estimation} (KDE) for a data set $\mathcal{D}$, represented as $\mathrm{KDE}_{\mathcal{D}}: \mathbb{R}^m \mapsto [0, 1]$, is defined by $\mathrm{KDE}_{\mathcal{D}}(\bm{q}) = \frac{1}{n} \sum_{\bm{x} \in \mathcal{D}}{k(\bm{x}, \bm{q})}$.
Our goal is to approximate $\mathrm{KDE}_{\mathcal{D}}(\bm{q})$ for any $\bm{q} \in \mathbb{R}^m$ through a randomized $(\alpha, \eta)$-approximation, defined as follows.
\begin{definition}[$(\alpha, \eta)$-Approximate KDE]\label{def-approx}
  Let $\alpha, \eta \in (0, 1)$. 
  Given a data set $\mathcal{D} \subset \mathbb{R}^m$, a query point $\bm{q} \in \mathbb{R}^m$, and a kernel function $k(\cdot, \cdot)$, $\widehat{\mathrm{KDE}}_{\mathcal{D}}(\bm{q})$ is an $(\alpha, \eta)$-approximation of $\mathrm{KDE}_{\mathcal{D}}(\bm{q})$ if $\Pr[|\widehat{\mathrm{KDE}}_{\mathcal{D}}(\bm{q}) - \mathrm{KDE}_{\mathcal{D}}(\bm{q})| \leq \alpha] \geq 1 - \eta$.
\end{definition}

Moreover, the approximation guarantee in Definition~\ref{def-approx} should be achieved with space and query-time complexities that are sublinear w.r.t.~$n$ and polynomial w.r.t.~$m$.
This requirement is critical for efficient processing of large, high-dimensional data sets.
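
As a toy illustration of Definition~\ref{def-approx}, with values that are hypothetical and chosen only for exposition, suppose that $n = 3$ and that the kernel values between a query $\bm{q}$ and the three data points are $0.9$, $0.5$, and $0.1$.
Then
\[
  \mathrm{KDE}_{\mathcal{D}}(\bm{q}) = \frac{1}{3} \left( 0.9 + 0.5 + 0.1 \right) = 0.5,
\]
and, for $\alpha = 0.05$ and $\eta = 0.1$, an $(\alpha, \eta)$-approximation must return a value in $[0.45, 0.55]$ with probability at least $0.9$.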

\paragraph{Locality-Sensitive Hashing Kernel}
Let $d: \mathbb{R}^m \times \mathbb{R}^m \mapsto \mathbb{R}_{\geq 0}$ be a distance function to measure the \emph{dissimilarity} between any two points in $\mathbb{R}^m$.
We call $d(\cdot, \cdot)$ a metric if it satisfies the axioms of non-negativity, identity of indiscernibles, symmetry, and triangle inequality.
An LSH family $\mathcal{H}$ w.r.t.~a metric distance $d(\cdot, \cdot)$ is a family of hash functions $h: \mathbb{R}^m \mapsto \mathbb{Z}$ such that for any two points $\bm{x}, \bm{x}' \in \mathbb{R}^m$, the probability that $h(\bm{x}) = h(\bm{x}')$ monotonically decreases with $d(\bm{x}, \bm{x}')$.
Formally, $\Pr_{h \in \mathcal{H}}[h(\bm{x}) = h(\bm{x}')] = f(d(\bm{x}, \bm{x}'))$, where $f(\cdot)$ is a monotonically decreasing collision probability function.
In many LSH families, this collision probability function forms a positive semidefinite radial kernel \citep{ColemanS20}, which is referred to as an \emph{LSH kernel}, i.e., $k(\bm{x}, \bm{x}') = f(d(\bm{x}, \bm{x}'))$.
Notable LSH schemes, including the Signed Random Projection (SRP) LSH for the angular distance \citep{Charikar02} and the $1$-stable and $2$-stable LSH schemes for the Manhattan ($l_1$-) and Euclidean ($l_2$-) distances, respectively \citep{DatarIIM04, HuangFFNW17}, all induce useful LSH kernels.

In this paper, we focus on the $l_2$-LSH kernel for the Euclidean distance.\footnote{Note that our results can be extended to other LSH kernels, as shown in Appendix~\ref{appendix-general-lsh}.}
Specifically, a hash function in the $2$-stable LSH family \citep{DatarIIM04} is defined as $h(\bm{x}) = \lfloor \frac{\bm{a} \cdot \bm{x} + b}{\omega} \rfloor$, where $\bm{a}$ is a vector drawn from the standard $m$-dimensional Gaussian distribution, $b$ is a scalar drawn from the uniform distribution $\mathcal{U}(0, \omega)$, and $\omega > 0$ is the bandwidth.
Let $d(\bm{x}, \bm{x}') = \|\bm{x} - \bm{x}'\|_2$.
The $l_2$-LSH kernel, i.e., the collision probability of this hash family, then has the closed form
\begin{multline}
\label{eq-l2kernel}
    k(\bm{x}, \bm{x}') = 1 - 2 \Psi\left(\frac{- \omega}{d(\bm{x}, \bm{x}')}\right)\\
    - \frac{2 d(\bm{x}, \bm{x}')}{\sqrt{2\pi}\omega}\left(1 - \exp\left(\frac{-\omega^2}{2d^2 (\bm{x}, \bm{x}')}\right) \right),
\end{multline}
where $\Psi(\cdot)$ is the cumulative distribution function (CDF) of the standard Gaussian distribution $\mathcal{N}(0, 1)$.
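
To give a sense of the shape of Equation~(\ref{eq-l2kernel}), consider the illustrative case $d(\bm{x}, \bm{x}') = \omega$, a choice made here only for exposition.
Using $\Psi(-1) \approx 0.1587$ and $e^{-1/2} \approx 0.6065$, we obtain
\[
  k(\bm{x}, \bm{x}') = 1 - 2\Psi(-1) - \frac{2}{\sqrt{2\pi}} \left( 1 - e^{-1/2} \right) \approx 1 - 0.3173 - 0.3139 \approx 0.369.
\]
Moreover, $k(\bm{x}, \bm{x}') \to 1$ as $d(\bm{x}, \bm{x}') \to 0^{+}$ and $k(\bm{x}, \bm{x}') \to 0$ as $d(\bm{x}, \bm{x}') \to \infty$, which is consistent with a collision probability that decreases monotonically with the distance.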

\paragraph{Metric-based Local Differential Privacy}
Let $\mathcal{V}$ and $\mathcal{Y}$ be the input and output domains of a randomized mechanism $\mathcal{M}$.
Given a parameter $\varepsilon > 0$, the mechanism $\mathcal{M}: \mathcal{V} \mapsto \mathcal{Y}$ satisfies $\varepsilon$-local differential privacy ($\varepsilon$-LDP) if for every pair of inputs $v, v' \in \mathcal{V}$ and measurable output $Y \subset \mathcal{Y}$, $\Pr[\mathcal{M}(v) \in Y] \leq e^{\varepsilon} \cdot \Pr[\mathcal{M}(v') \in Y]$.
This ensures that an adversary who observes the output of $\mathcal{M}$ cannot reliably distinguish between the inputs $v$ and $v'$, and hence cannot confidently infer the original data.
A lower value of $\varepsilon$ indicates a stronger level of privacy, as the probability that the output can be used to infer its corresponding input is reduced.

The generalized randomized response (GRR) mechanism \citep{KairouzBR16} satisfies $\varepsilon$-LDP for any finite discrete domain.
For ease of exposition, let the input and output domains of the GRR mechanism both be the index set $\mathcal{V} = \mathcal{Y} = \{1, 2, \ldots, R\}$.
For any input value $v \in \mathcal{V}$, the output $\mathcal{M}_{\mathrm{GRR}}(v) \in \mathcal{Y}$ is a random variable sampled as follows:
\begin{equation}\label{eq-grr}
  \Pr[\mathcal{M}_{\mathrm{GRR}}(v) = i] =
  \begin{cases}
    \frac{e^{\varepsilon}}{e^{\varepsilon}+R-1} & \text{if } i = v,\\
    \frac{1}{e^{\varepsilon}+R-1} & \text{otherwise}.\\
  \end{cases}
\end{equation}
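
As a quick sanity check of Equation~(\ref{eq-grr}), take the illustrative values $\varepsilon = \ln 3$ and $R = 4$, which are chosen only for exposition.
Then $e^{\varepsilon} = 3$, so the true value is reported with probability $\frac{3}{3 + 3} = \frac{1}{2}$ and each of the other three values with probability $\frac{1}{3 + 3} = \frac{1}{6}$.
These probabilities sum to $\frac{1}{2} + 3 \cdot \frac{1}{6} = 1$, and the ratio between the probabilities of any output under two different inputs is at most
\[
  \frac{1/2}{1/6} = 3 = e^{\varepsilon},
\]
which matches the $\varepsilon$-LDP bound.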

We consider a metric-based variant of LDP (mLDP), also known as local $d_{\chi}$-privacy \citep{ChatzikokolakisABP13, Alvim0PP18}, which relaxes the privacy requirements by allowing two data points to become more distinguishable as their distance increases.
Formally, a randomized mechanism $\mathcal{M}: \mathcal{V} \mapsto \mathcal{Y}$ satisfies local $d_{\chi}$-privacy if for any pair of inputs $v, v'$ and any measurable output $Y \subset \mathcal{Y}$,
\[\Pr[\mathcal{M}(v) \in Y] \leq e^{d_{\chi}(v, v')} \cdot \Pr[\mathcal{M}(v') \in Y],\]
where $d_{\chi}(\cdot, \cdot)$ is a distance function in the metric space $\chi$.

We aim to devise a randomized mechanism that provides mLDP for KDE using LSH kernels; it takes a data point in $\mathbb{R}^m$ as input and outputs an integer array.
As an LSH scheme only approximately preserves the original distance, we introduce a new metric distance, $d_{\chi}(\cdot, \cdot)$, into the mLDP definition by allowing extra errors w.r.t.~the original $d(\cdot, \cdot)$, i.e., $d_{\chi}(\bm{x}, \bm{x}') = \mu \cdot d(\bm{x}, \bm{x}') + \delta$, where $\mu, \delta > 0$.
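Note that $d_{\chi}(\cdot, \cdot)$ inherits non-negativity, symmetry, and the triangle inequality from $d(\cdot, \cdot)$ whenever $d(\cdot, \cdot)$ is a metric; since $d_{\chi}(\bm{x}, \bm{x}) = \delta > 0$, however, the identity of indiscernibles holds only up to the additive slack $\delta$.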
Furthermore, given the probabilistic nature of LSH schemes, our privacy guarantee is inherently satisfied with a certain probability when LSH functions are chosen randomly, which aligns with the concept of probabilistic differential privacy \citep{MachanavajjhalaKAGV08}.
Formally,
\begin{definition}[$(d_{\chi}, \eta)$-mLDP]\label{def-mldp}
  For any $\mu, \delta > 0$ and $\eta \in (0, 1)$, a randomized mechanism $\mathcal{M} : \mathbb{R}^m \mapsto \mathbb{Z}^L$ provides $(d_{\chi}, \eta)$-mLDP iff, for any two inputs $\bm{x}, \bm{x}' \in \mathbb{R}^m$ and any output $\bm{y} \in \mathbb{Z}^L$, $\Pr\Big[\tfrac{\Pr[\mathcal{M}(\bm{x}) = \bm{y}]}{\Pr[\mathcal{M}(\bm{x}') = \bm{y}]} \leq \exp( d_{\chi}(\bm{x}, \bm{x}'))\Big] \geq 1 - \eta$, where the outer probability is taken over the random choice of the LSH functions.
\end{definition}
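
To illustrate how the bound in Definition~\ref{def-mldp} scales with the distance, take the hypothetical parameters $\mu = 2$ and $\delta = 0.3$, which are chosen only for exposition.
For two inputs with $d(\bm{x}, \bm{x}') = 0.1$, we have $d_{\chi}(\bm{x}, \bm{x}') = 2 \cdot 0.1 + 0.3 = 0.5$, so the ratio of output probabilities is bounded by $e^{0.5} \approx 1.65$.
For two inputs with $d(\bm{x}, \bm{x}') = 2$, we have $d_{\chi}(\bm{x}, \bm{x}') = 2 \cdot 2 + 0.3 = 4.3$, and the bound loosens to $e^{4.3} \approx 73.7$.
Nearby points thus remain nearly indistinguishable, whereas distant points are allowed to be told apart much more easily.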

On the basis of all the concepts above, we formally define the problem studied in this paper.
\begin{definition}[Approximate KDE under mLDP]\label{def-kde-mldp}
  For a data set $\mathcal{D} \subset \mathbb{R}^m$ and the $l_2$-LSH kernel $k(\cdot, \cdot)$ with a bandwidth $\omega > 0$, build a sketch $\mathcal{S}_{\mathcal{D}}$ to offer an $(\alpha, \eta)$-approximate $\widehat{\mathrm{KDE}}_{\mathcal{D}}(\bm{q})$ for any $\bm{q} \in \mathbb{R}^m$ under $(d_{\chi}, \eta)$-mLDP.\footnote{The two $\eta$'s in the privacy and approximation bounds can take different values but are kept the same in our analysis for simplicity.}
\end{definition}
