\section{Introduction}

In today's digital era, vast amounts of user-generated data are collected daily by service providers for analysis.
Such data are often of a sensitive nature because analytical results on them, even in the form of aggregated statistics or predictions, can leak substantial information about individual users \citep{DworkR14, HuSSDYZ22}.
Therefore, while \emph{effectiveness} and \emph{efficiency} remain fundamental for data analysis tasks, \emph{privacy} has also emerged as a crucial concern.

In the realm of privacy-preserving data analysis, local differential privacy (LDP) \citep{DuchiJW13} offers a theoretically rigorous definition of privacy that allows an untrusted server to collect and analyze user data with provable privacy guarantees.
In particular, LDP is the local model of differential privacy (DP) \citep{DworkMNS06}, the de facto standard of privacy preservation, wherein users perturb data on their own devices before sending them to the server.
Informally, a mechanism is locally differentially private if its output distributions are nearly indistinguishable for any pair of input values.
This guarantees a level of protection against information leakage from any individual record, as measured by the privacy parameter $\varepsilon > 0$.
A smaller value of $\varepsilon$ implies a lower level of distinguishability and thus stronger privacy protection.
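For reference, the standard formulation can be sketched as follows, where $\mathcal{M}$ is notation assumed here for a randomized mechanism: $\mathcal{M}$ satisfies $\varepsilon$-LDP if, for any pair of inputs $\bm{x}, \bm{x}'$ and any set $S$ of possible outputs,
\begin{displaymath}
  \Pr[\mathcal{M}(\bm{x}) \in S] \leq e^{\varepsilon} \cdot \Pr[\mathcal{M}(\bm{x}') \in S].
\end{displaymath}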
Despite providing provably strong privacy guarantees, LDP is known to incur considerable losses in the quality of analytical results \citep{CormodeJKLSW18, XiangD0Z20}.
Consequently, it has become a prominent challenge to accurately analyze large-scale data sets while maintaining local privacy \citep{CormodeJKLSW18, DuchiJW13, KairouzBR16, WangXYZHSS019, LiWLLS20}.

In this paper, we study, in a local privacy setting, the kernel density estimation (KDE) problem \citep{Parzen62}, which is a cornerstone of numerous machine learning applications, including clustering \citep{HinneburgK03}, anomaly detection \citep{HuGLWDM20}, and visualization \citep{ChanLUXC21, ChanUCX22}.
KDE is an unsupervised technique for estimating a probability density function from data points, offering insights into the underlying data distribution and patterns.
Given a data set $\mathcal{D} \subseteq \mathbb{R}^{m}$ of $n$ points in an $m$-dimensional space and a kernel function $k: \mathbb{R}^{m} \times \mathbb{R}^{m} \to [0, 1]$, the kernel density at a query point $\bm{q} \in \mathbb{R}^{m}$ is defined as
\begin{displaymath}
  \mathrm{KDE}_{\mathcal{D}}(\bm{q}) = \frac{1}{n} \sum_{\bm{x} \in \mathcal{D}} k(\bm{x}, \bm{q}).
\end{displaymath}
Exact computation of $\mathrm{KDE}_{\mathcal{D}}(\bm{q})$, which requires $O(nm)$ time and space, is costly for large, high-dimensional data sets.

Recent advances have focused on developing methods for approximate KDE that are more efficient in terms of time and memory \citep{MuandetFSS17, CharikarS17, SiminelakisRBCL19, PhillipsT18, BackursIW19, ColemanS20, LeiWL0ZGD21}.
Particularly noteworthy is the exploration of the intrinsic connection between KDE and locality-sensitive hashing (LSH) \citep{CharikarS17, SiminelakisRBCL19, BackursIW19}.
This connection has led to several sketch-based methods \citep{ColemanS20, LeiWL0ZGD21} that estimate the density of LSH kernels in sublinear time and space while also providing approximation guarantees.
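As a brief sketch of this connection, with $\mathcal{H}$ as notation assumed here for an LSH family, an LSH kernel can be written as the collision probability of a hash function $h$ drawn at random from $\mathcal{H}$:
\begin{displaymath}
  k(\bm{x}, \bm{q}) = \Pr_{h \sim \mathcal{H}}[h(\bm{x}) = h(\bm{q})].
\end{displaymath}
Under this view, the kernel density at $\bm{q}$ is the average collision probability between $\bm{q}$ and the points in $\mathcal{D}$, which is why it can be estimated from counts of hash collisions.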
Moreover, the \emph{mergeability} of the sketches significantly improves the applicability of sketch-based methods, especially in distributed and streaming models \citep{ColemanS20, LeiWL0ZGD21}.
This enables the construction of a sketch for the entire data set $\mathcal{D}$ by seamlessly combining the sketches created from its disjoint subsets.
Although sketch-based KDE methods do not require storing original data, they do not inherently guarantee differential privacy. 
This is because an adversarial server can potentially recover users' data from the hash values they send \citep{FernandesKM21}.
In addition, existing differentially private sketches for KDE \citep{ColemanS21, WagnerNM23} are tailored for the centralized setting, where user data must first be gathered by a \emph{trusted} server, which then constructs a perturbed sketch from the original data to answer KDE queries privately.
As a result, these methods are not adaptable to the LDP setting, where user data must be privatized prior to collection.

\subsection{Main Contributions}

To fill this gap, we first attempt to approximate KDE subject to LDP constraints.
We observe that the conventional LDP notion, which demands indistinguishability between any pair of inputs, is too stringent for the KDE problem.
It often leads to prohibitive errors in the KDE query results, even when a substantial privacy budget is allocated.
To strike a balance between maintaining an adequate level of local privacy and preserving high-utility KDE results, we opt for a more relaxed metric-based variant of LDP, known as \emph{local $d_{\chi}$-privacy} \citep{ChatzikokolakisABP13, Alvim0PP18}.
Specifically, metric-based LDP (mLDP) quantifies the distinguishability between any two data points $\bm{x}, \bm{x}'$ in relation to their distance $d_{\chi}(\bm{x}, \bm{x}')$ within a given metric space $\chi$. 
A point $\bm{x}$ becomes more distinguishable from another point $\bm{x}'$ as $d_{\chi}(\bm{x}, \bm{x}')$ increases and vice versa.
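In the standard formulation of $d_{\chi}$-privacy, sketched here with $\mathcal{M}$ as assumed notation for a randomized mechanism (the precise definition adopted later in the paper may differ in its details), this means that for any pair of points $\bm{x}, \bm{x}'$ and any set $S$ of possible outputs,
\begin{displaymath}
  \Pr[\mathcal{M}(\bm{x}) \in S] \leq e^{\varepsilon \cdot d_{\chi}(\bm{x}, \bm{x}')} \cdot \Pr[\mathcal{M}(\bm{x}') \in S],
\end{displaymath}
so the effective privacy parameter scales with the distance between the two points.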
In this way, mLDP allows the server to collect approximate information from users while protecting the exact values of individual data points.
This characteristic of mLDP is particularly compatible with the nature of KDE because many kernel classes, including common LSH kernels, are defined directly on metric distances.
For example, the $l_1$- and $l_2$-LSH kernels are derived from the Manhattan and Euclidean distances \citep{ColemanS20}, while the angular kernel corresponds to the angular distance \citep{LeiWL0ZGD21}.
Furthermore, mLDP has a distinct advantage over traditional LDP in preserving data distribution, which is essential for high-utility KDE results.

To provide mLDP, we design a general framework that augments sketch-based KDE methods by introducing the generalized randomized response (GRR) mechanism \citep{KairouzBR16} for users to perturb hash values before sending them to the server.
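For reference, in its standard form, GRR over a domain of $r \geq 2$ values with privacy parameter $\varepsilon$ reports the true value $v$ with a fixed probability and every other value $v'$ uniformly otherwise; the symbols $r$, $v$, and $v'$ are assumptions made here for illustration, and how the framework instantiates them may differ:
\begin{displaymath}
  \Pr[\mathrm{GRR}(v) = v] = \frac{e^{\varepsilon}}{e^{\varepsilon} + r - 1},
  \qquad
  \Pr[\mathrm{GRR}(v) = v'] = \frac{1}{e^{\varepsilon} + r - 1} \quad (v' \neq v).
\end{displaymath}
The ratio between these two probabilities is exactly $e^{\varepsilon}$, which is what bounds the distinguishability of any two inputs.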
Building on this framework, we introduce an unbiased estimator that enables the server to accurately answer KDE queries using the sketch built by aggregating the perturbed hash values from users.
Our theoretical analysis shows that the user-level mechanism for computing the hash values provides mLDP with high probability.
Moreover, any KDE result provided by the server, calculated in sublinear time and space, has bounded additive errors, also with high probability.
To the best of our knowledge, this is the first method for approximate KDE under mLDP.
The main contributions are summarized as follows.
\begin{itemize}
  \item We formally define the problem of approximate kernel density estimation (KDE) under metric-based local differential privacy (mLDP). (Section~\ref{sec-def})
  \item We propose a novel \textsc{mLDP-KDE} framework and analyze its privacy guarantee, approximation bound, and complexity theoretically. (Section~\ref{sec-alg})
  \item We conduct extensive experiments on five real-world and synthetic data sets to evaluate the performance of the \textsc{mLDP-KDE} framework.
  The results confirm its superiority over existing methods for KDE under LDP and mLDP by achieving significantly better privacy-utility trade-offs and demonstrating better scalability on large, high-dimensional data sets. (Section~\ref{sec-exp})
\end{itemize}
