%\documentclass{uai2022} % for initial submission
\documentclass[accepted]{uai2022} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2022} % ptmx math instead of Computer
                                         % Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2022} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.



%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
% \usepackage{natbib} % has a nice set of citation styles and commands
%     \bibliographystyle{plainnat}
%     \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\usepackage[dvipsnames]{xcolor}

\usepackage{amssymb,amsfonts,amsmath,amsthm} %ams
\usepackage{enumerate,enumitem,tikz,graphicx,mathrsfs,eucal,verbatim, bbm, derivative}
\usepackage{caption}
\usepackage{subcaption}
\usepackage{wrapfig}
\usepackage{graphbox} 
\usepackage{comment}
\usepackage{algorithm}
\usepackage{algpseudocode}
\usepackage{nicefrac}
\usepackage{array}
\usepackage{hyperref}

\newcommand{\swap}[3][-]{#3#1#2} % just an example


\usepackage[strict]{changepage}
\usepackage{manfnt}
\usepackage{multicol}
 \usepackage[
    backend=biber,
    style=authoryear,
  ]{biblatex}
\addbibresource{polianskii_188.bib}

\let\svthefootnote\thefootnote
\newcommand\freefootnote[1]{%
  \let\thefootnote\relax%
  \footnotetext{#1}%
  \let\thefootnote\svthefootnote%
}

\theoremstyle{plain}% default
\newtheorem{thm}{Theorem}[section]
\newtheorem{prb}[thm]{Problem}
\newtheorem{prop}[thm]{Proposition}
\newtheorem*{cor}{Corollary}
\theoremstyle{definition}
\newtheorem{defn}{Definition}[section]

\renewcommand{\algorithmicrequire}{\textbf{Input:}}
\renewcommand{\algorithmicensure}{\textbf{Output:}}


\newcommand{\blue}{\color{blue}}
\newcommand{\green}{\color{green!50!black}}  
\newcommand{\red}{\color{red}}  

\newcommand{\centered}[1]{\begin{tabular}{l} #1 \end{tabular}}
\newcommand{\kw}[1]{{\bfseries #1}}
% \newcommand{\la}{\langle}
% \newcommand{\ra}{\rangle}
\newcommand*\rfrac[2]{{}^{#1}\!/_{#2}}
\newcommand{\inner}[2]{\left\langle #1 , #2 \right\rangle}

\newcommand{\ssize}{|\Sigma|} % r
\newcommand{\ssizeopt}{} % $r$

\DeclareMathOperator{\vol}{\textnormal{Vol}}
\DeclareMathOperator{\diam}{diam}




\title{Voronoi Density Estimator for High-Dimensional Data: \\ Computation, Compactification and Convergence}

% The standard author block has changed for UAI 2022 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors 
% {\href{mailto:<vpol@kth.se>?Subject=Your UAI 2022 paper}

\author[1]{{\href{mailto:<vpol@kth.se>?Subject=Your UAI 2022 paper}{Vladislav Polianskii*}}}
\author[1]{{\href{mailto:<glma@kth.se>?Subject=Your UAI 2022 paper}{Giovanni Luca Marchetti*}}}
\author[1]{Alexander Kravberg}
\author[ ]{Anastasiia Varava}
\author[1]{Florian~T.~Pokorny}
\author[1]{Danica Kragic}
% Add affiliations after the authors
\affil[1]{%
    School of Electrical Engineering and Computer Science, Royal Institute of Technology (KTH)\\
    Stockholm, Sweden
}




\begin{document}
\maketitle
    
\begin{abstract}
      The Voronoi Density Estimator (VDE) is an established density estimation technique that adapts to the local geometry of data. However, its applicability has so far been limited to problems in two and three dimensions. This is because Voronoi cells rapidly increase in complexity as the dimension grows, making the necessary explicit computations infeasible. We define a variant of the VDE, called the Compactified Voronoi Density Estimator (CVDE), that is suitable for higher dimensions. We propose computationally efficient algorithms for numerical approximation of the CVDE and formally prove convergence of the estimated density to the original one. We implement and empirically validate the CVDE through a comparison with the Kernel Density Estimator (KDE). Our results indicate that the CVDE outperforms the KDE on sound and image data. 
\end{abstract}

\section{INTRODUCTION}\label{intro}
\freefootnote{*Equal contribution.}
%{\let\thefootnote\relax\freefootnote{{*Equal contribution.}}}
%\footnote[0]{*Equal contribution.}
Given a discrete set of data sampled from an unknown probability distribution, the aim of density estimation is to recover the underlying Probability Density Function (PDF) (\cite{pointpatterns, scottdensity}). Non-parametric methods achieve this by directly computing the PDF through a closed formula, avoiding a potentially expensive search for optimal parameters.

One of the most common non-parametric density estimation techniques is the Kernel Density Estimator (KDE; \cite{kdebook}). The resulting PDF is a convolution between a fixed kernel and the discrete distribution of samples. In the case of a Gaussian kernel, this corresponds to a mixture density with a Gaussian distribution centered at each sample. Another popular density estimator, more commonly used for visualization purposes, is given by histograms (\cite{histograms}), which depend on a prior tessellation of the ambient space (typically, a grid). The estimate is piece-wise constant and is given by the number of samples falling in each cell, normalized by the volume of the cell. 

\begin{figure}[t]
\centering
\includegraphics[width=.8\columnwidth]{3d.png}
\caption{Graph of a density estimated by the CVDE, with the Voronoi tessellation underneath.}\label{firstpagefigure}
\end{figure}

A common limitation of the aforementioned methods is a bias towards a fixed local geometry. Namely, estimates through KDE near a sample are governed by the level sets of the chosen kernel. In the Gaussian case, such level sets are ellipsoids of high estimated probability. Histograms suffer from an analogous bias towards the geometry of the cells of the tessellation (i.e., the bins of the histogram), on which the estimated PDF is constant. The issue of geometrical bias manifests severely when considering real-world high-dimensional data. Indeed, one cannot expect to approximate the rich local geometries of complex data with a simple fixed one. Both estimators also come with hyperparameters that control the scale of the local geometries and require tuning: the bandwidth for KDE and the diameter of the cells for histograms. 

The \emph{Voronoi Density Estimator} (VDE) has been suggested to tackle the challenges discussed above (\cite{ord}). By considering the Voronoi tessellation generated by data (\cite{voronoibook}), the estimated PDF is piece-wise constant on the cells and proportional to their inverse volume. The Voronoi tessellation adapts local polytopes so that each datapoint is equally likely to be the closest when sampling from the resulting PDF. This has enabled successful application of the VDE to geometrically articulated real-world distributions in lower dimensions (\cite{voronoineuronal, voronoiphotons, voronoiastronomy}).


The goal of the present work is to extend the VDE to high-dimensional scenarios. Although the VDE constitutes a promising candidate due to its local adaptivity, the following aspects have to be addressed: 

%\begin{itemize}
    %\item 
\noindent
\textbf{Computation}. The Voronoi cells are arbitrary convex polytopes, so their volume is challenging to compute explicitly; this calls for fast approximate computations. 
  %  \item \emph{Smoothing}. The estimated PDF notoriously exhibits high variation. Although smoothing techniques exist, they have to be computationally adapted to high dimensions. 

\noindent
\textbf{Compactification}. Data is often concentrated around low-dimensional submanifolds, which leaves most of the ambient space empty and makes several Voronoi cells unbounded, i.e., of infinite volume (see Figure \ref{parabola}). One still needs to produce a finite estimate on those cells, a process we refer to as `compactification'. 
    
    %The standard solution of restricting the estimator to a bounded region is inadequate since ...

    
%\end{itemize}
    We propose solutions to the problems above. First, we present efficient algorithmic procedures for volume computation and sampling from the estimated density. We formulate the cell volumes as integrals over a sphere, which can then be approximated by Monte Carlo methods. Furthermore, we propose a sampling procedure for the distribution estimated by the VDE. This consists of randomly traversing the Voronoi cells via a `hit-and-run' Markov chain (\cite{hitnrun}).
     The proposed algorithms are highly parallelizable, allowing efficient computations on the GPU.% Our experiments show that our method outperforms Kernel Density Estimator on high-dimensional data. 
     %
     %{\color{red} We evaluate our algorithm on ..., compare it with ..., and show that...} %A number of algorithmical improvements and pre-computations enable a sampling computational cost which does not depend on dimensionality of the data. 
    
    
    In order to compactify the cells, we place a finite measure on each of them by means of a fixed kernel (typically, a Gaussian one), leading to an altered version of the VDE which we refer to as \emph{Compactified Voronoi Density Estimator} (CVDE). Figure \ref{firstpagefigure} shows an example of an estimate by the CVDE on a simple two-dimensional dataset. All the computational and sampling procedures naturally extend to the CVDE.  

A further contribution of the present work is a theoretical proof of \textbf{convergence} for the CVDE. Assuming the original density has support in the whole ambient space, we show that the PDF estimated by the CVDE converges (with respect to an appropriate notion for random measures) to the ground-truth one as the number of datapoints increases. The convergence holds without any continuity assumptions on either the ground-truth PDF or the kernel, and does not require the kernel bandwidth to vanish asymptotically. This is in contrast with the convergence properties of the KDE. Due to the aforementioned local geometric bias of the KDE, the bandwidth has to decrease at an appropriate rate in order to compensate for the local influence of the kernel and guarantee convergence to the underlying distribution (\cite{devroye1979l1, jiang2017uniform}). 

Finally, we implement the CVDE in C\texttt{++} and parallelize computations via the OpenCL framework. Our code, which includes a Python interface, is 
%included in the supplementary material. 
publicly available at \\ \url{https://github.com/vlpolyansky/cvde}.

\section{COMPACTIFIED VORONOI DENSITY ESTIMATOR}\label{method}
This section presents the compactification of Voronoi cells and the Compactified Voronoi Density Estimator (CVDE). We begin by defining Voronoi tessellations in a general setting (see \cite{voronoibook} for a comprehensive treatment). Suppose that $(X, d)$ is a connected metric space and $P \subseteq X$ is a finite collection of distinct points referred to as \emph{generators}.

\begin{defn}
The \emph{Voronoi cell}\footnote{Sometimes referred to as \emph{Dirichlet cell}.} of $p \in P$ is defined as
\begin{equation}
C(p) = \{ x\in X \ | \ \forall q \in P \  d(x,q) \geq d(x,p) \}.
\end{equation}


\end{defn}

The Voronoi cells intersect at their boundaries and cover the ambient space $X$. The collection $\{C(p) \}_{p \in P}$ is called the \emph{Voronoi tessellation} generated by $P$. For a point $x \in X$ not on the boundary of any cell, we write $C(x)$ for the unique cell containing it. When $X=\mathbb{R}^n$ with the Euclidean distance, the Voronoi cells are convex $n$-dimensional polytopes, which are possibly unbounded. 

Assume now that $X$ is equipped with a finite Borel measure denoted by $\vol$. As an additional technical condition, we require that the boundaries of the Voronoi cells have vanishing measure.

\begin{figure*}[tbh!]
    \centering
    \begin{subfigure}[b]{.35\linewidth}
        \centering
        \includegraphics[width=\linewidth]{sample_vde.png}
        \subcaption*{VDE with bounding square $A$}
    \end{subfigure}
    \begin{subfigure}[b]{.35\linewidth}
        \centering
        \includegraphics[width=\linewidth]{sample_cvde.png}
        \subcaption*{CVDE with Gaussian kernel}
    \end{subfigure}
    \caption{Comparison between VDE and CVDE for generators in the plane. A darker color represents higher estimated density.}
    \label{fig:samples}
\end{figure*}

\begin{defn}
The \emph{Voronoi Density Estimator} (VDE) at a point $x \in X$ is defined almost everywhere as 
\begin{equation}
\widetilde{f}(x) = \frac{1}{|P| \vol(C(x))}
\end{equation}
where $|\cdot|$ denotes cardinality.
\end{defn}

The function $\widetilde{f}$ defines a locally constant PDF on $X$ and thus a probability measure $\widetilde{f} \vol$. With respect to this distribution, the cells are equally likely, and the restriction to each cell coincides with the normalization of $\vol$ on that cell. 
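
As a quick consistency check (a worked step using only the definitions above, and assuming that every cell has positive measure and that the cell boundaries have vanishing measure), $\widetilde{f}$ indeed integrates to one:
\begin{equation*}
\int_X \widetilde{f} \, \textnormal{d}\vol = \sum_{p \in P} \int_{C(p)} \frac{1}{|P| \vol(C(p))} \, \textnormal{d}\vol = \sum_{p \in P} \frac{\vol(C(p))}{|P| \vol(C(p))} = 1.
\end{equation*}
Each summand equals $\frac{1}{|P|}$, which is the probability mass assigned to a single cell.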

We focus on the case where $X=\mathbb{R}^n$ equipped with the Euclidean distance. One major issue for the choice of $\vol$ is that the standard Lebesgue measure does not satisfy the finiteness requirement. A common solution in the literature is to restrict the measure to a fixed bounded region $A \subseteq \mathbb{R}^n$ containing $P$ (\cite{voronoiresample, voronoiplanar}), which is equivalent to setting $X=A$ as the ambient space. However, this is often unsuitable for high-dimensional data. Under the manifold hypothesis (\cite{manifoldhyp}), data are concentrated around a submanifold with high codimension, which implies that most of $\mathbb{R}^n$ falls outside the support. Moreover, the cells of the points lying at the boundary of the convex hull of data, which constitute the majority of cells for such submanifolds, are unbounded (see Figure \ref{parabola}). Estimating the density as uniform on such cells, after intersecting them with the bounded region $A$, thus becomes unreasonable and relies heavily on the a priori choice of $A$. 

 \begin{figure}[h!]
\centering
\includegraphics[width=.5\columnwidth]{sample_manifold.png}

\caption{Voronoi tessellation for generators distributed on a submanifold (a parabola). In this case, all the Voronoi cells are unbounded and the VDE is strongly biased by the choice of the bounding region $A$. }\label{parabola}
\end{figure}

We instead take a different route. The idea is to make the measure of each cell finite (`compactify') by considering a \emph{local} distribution with mode at the corresponding generator in $P$. In general terms, we fix a positive kernel $K: \ \mathbb{R}^n \times \mathbb{R}^n \rightarrow \mathbb{R}_{\geq 0} $ which is at least integrable in the second variable and define the following: 

\begin{defn}
The \emph{Compactified Voronoi Density Estimator} (CVDE) at a point $x \in \mathbb{R}^n$ is defined almost everywhere as
\begin{equation}\label{defcvde}
 f(x) =   \frac{K(p,x)}{|P| \vol_p(C(x))}
\end{equation}
where $\vol_p(C(x))=  \int_{C(x)} K(p,y) \ \textnormal{d}y$ and $p$ is the generator of $C(x)$, i.e., the generator $p \in P$ closest to $x$. 
\end{defn}
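
The CVDE is again a probability density. The following short computation is a sketch of this fact, assuming $0 < \vol_p(C(p)) < \infty$ for every $p \in P$ and that the cell boundaries have vanishing Lebesgue measure:
\begin{equation*}
\int_{\mathbb{R}^n} f(x) \, \textnormal{d}x = \sum_{p \in P} \frac{1}{|P| \vol_p(C(p))} \int_{C(p)} K(p,x) \, \textnormal{d}x = \sum_{p \in P} \frac{1}{|P|} = 1.
\end{equation*}
As for the VDE, each cell carries mass $\frac{1}{|P|}$, but within a cell the density follows the kernel instead of being constant.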

In practice, a commonly considered kernel is the Gaussian one
\begin{equation}\label{gaussiankernel}
K(p,x) = e^{-\frac{\| p-x\|^2}{2h^2}}
\end{equation}
where $h \in \mathbb{R}_{>0}$ is a hyperparameter referred to as the `bandwidth'. More generally, with a slight abuse of notation, a kernel can be constructed from an arbitrary integrable map $K \in L^1(\mathbb{R}^n)$:

\begin{equation}\label{convolutional}
K(p,x) = K\left(\frac{p-x}{h}\right).
\end{equation}

Note that the VDE with a bounding region $A$ corresponds to the particular case of the CVDE with the characteristic function of $A$ as kernel, i.e., $K(p,x) = \chi_A(x)$. Figure \ref{fig:samples} shows a comparison between the VDE and the Gaussian CVDE on a simple two-dimensional dataset. 

It is worth briefly comparing the CVDE to the Kernel Density Estimator (KDE). Recall that the KDE with kernel $K$ (which is assumed to integrate to $1$ in the second variable) is given by $\frac{1}{|P|}\sum_{p \in P} K(p,x)$. The kernel is aggregated over all the generators, which can oversmooth the estimate. In contrast, the CVDE $f(x)$ involves $K$ evaluated at the closest generator alone. Furthermore, assume that all the cells have the same local volume, i.e., $\vol_p(C(p))=1$ for all $p \in P$, and that $K$ decreases monotonically with the distance, i.e., $K(p,x) \leq K(p', x)$ when $d(p,x) \geq d(p',x)$. Then the CVDE reduces to  
\begin{equation}
    f(x) = \frac{1}{|P|}\max_{p \in P } K(p,x)
\end{equation} 
which is a variant of the KDE where the sum gets replaced by a maximum. Such distributions are sometimes referred to as  `max-mixtures' (\cite{maxmixture}). An empirical comparison with KDE is presented in our experimental section (Section \ref{expkde}). 


 \begin{figure}[h!]
\centering
\includegraphics[width=.55\columnwidth]{integration.png}

\caption{An illustration of the directional radius involved in volume estimation and sampling.}\label{integfigure}
\end{figure}




\section{ALGORITHMIC PROCEDURES}

The CVDE presents a number of computational challenges in high dimensions ($n \gg 3$) due to the increasing geometric complexity of Voronoi tessellations. We propose to deploy raycasting methods on polytopes, which reduce the problem to one-dimensional subspaces. In the context of Voronoi tessellations, raycasting has been used to explore the boundaries of the cells in (\cite{mitchell2018spoke}), which led to a US patent (\cite{ebeida2019generating}), as well as in (\cite{polianskii2020voronoi}). We utilize these techniques for volume computation and point sampling, and improve the time complexity through pre-computations and parallelization.

We first introduce an algebraic quantity necessary for the subsequent methods. Consider an arbitrary versor (i.e., a unit vector) $\sigma \in \mathbb{R}^n$ and a point $z \in \mathbb{R}^n$. Define $l_z(\sigma)$ as the maximum $t$ such that $z + t\sigma$ is contained in $C(z)$, and set $l_z(\sigma)=\infty$ if no such maximum exists. We refer to this value as the \emph{directional radius} originating at $z$ in the direction $\sigma$ (see Figure \ref{integfigure}). The directional radius can be expressed via a closed and computable formula. Denote by $p$ the generator closest to $z$ and, for $q \in P \setminus \{p\}$, set
\begin{equation}\label{eq:lqz}
    l^q_z(\sigma) = \frac{\| q-z\|^2 - \| p - z\|^2}{2 \langle \sigma, q-p \rangle }. 
\end{equation}
As shown in (\cite{polianskii2019voronoi}), the directional radius is given by
\begin{equation}\label{eq:lz}
    l_z(\sigma) = \min_{q \not = p, \ l^q_z(\sigma) \geq 0} l^q_z(\sigma)
\end{equation}
with $l_z(\sigma) =\infty$ if $l^q_z(\sigma)$ is negative for all $q$.
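
For convenience, Equation \ref{eq:lqz} can be recovered by a short computation, sketched here. The ray $z + t\sigma$ meets the bisecting hyperplane between $p$ and $q$ when it is equidistant from both generators, i.e., when $\| z + t\sigma - p\|^2 = \| z + t\sigma - q\|^2$. Expanding both sides and cancelling the common term $t^2\|\sigma\|^2$ gives
\begin{align*}
\| z - p\|^2 + 2t \langle \sigma, z - p \rangle &= \| z - q\|^2 + 2t \langle \sigma, z - q \rangle, \\
2t \langle \sigma, q - p \rangle &= \| q - z\|^2 - \| p - z\|^2,
\end{align*}
and solving for $t$ yields $l^q_z(\sigma)$. Since $z \in C(p)$, the numerator is non-negative, so the sign of $l^q_z(\sigma)$ is determined by $\langle \sigma, q - p \rangle$: the ray moves towards the bisector only when this inner product is positive.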





\subsection{Volume Estimation and Sampling}
We now present a way to efficiently compute the (local) volumes $\vol_p$ via spherical integration. Such an approach to integration over high-dimensional Voronoi tessellations has previously been explored in (\cite{winovich2019rigorous}) and (\cite{polianskii2019voronoi}).




Assume that the kernel is as in Equation \ref{convolutional} for a continuous $K$. By a change of variables into spherical coordinates centered at $p$ and due to convexity of $C(p)$, the volumes can be rewritten as an integral over the unit sphere $\mathbb{S}^{n-1} \subseteq \mathbb{R}^n$:
\begin{equation}\label{sphericalintegral}
\vol_p = \int_{\mathbb{S}^{n-1}}  \int_{[0, l_p(\sigma)]} K(t\sigma)t^{n-1}   \textnormal{d}t \textnormal{d}\sigma
\end{equation}
where $l_p(\sigma)$ is the directional radius of the cell originating from its generator ($z=p$). The spherical integral can be computed via Monte Carlo approximation by sampling a finite set of \ssizeopt versors $ \Sigma_p \subseteq \mathbb{S}^{n-1}$ uniformly and estimating the empirical average
\begin{equation}\label{sphericalequation}
    \frac{2 \pi^{\frac{n}{2}}}{|\Sigma_p | \Gamma(\frac{n}{2})} \sum_{\sigma \in \Sigma_p}  \int_{[0, l_p(\sigma)]} K(t\sigma)t^{n-1}   \textnormal{d}t
\end{equation}
where $\Gamma$ denotes Euler's Gamma function. In the case of the Gaussian kernel (Equation \ref{gaussiankernel}), substituting $u = \frac{t^2}{2h^2}$ in the radial integral and bringing the constant $\vol (\mathbb{S}^{n-1}) = \frac{2 \pi^{\frac{n}{2}}}{\Gamma(\frac{n}{2})}$ under the summation simplifies the summand to  $
    (2\pi h^2)^{\frac{n}{2}} \, \overline{\gamma}\left(\frac{n}{2}, \ \frac{l_p(\sigma)^2}{2h^2} \right) $, 
where $\overline{\gamma}$ denotes the regularized lower incomplete Gamma function $\overline{\gamma}(a, z) = \frac{1}{\Gamma(a)}\int_0^z{t^{a-1}e^{-t} \, \textnormal{d}t}$.
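
The Gaussian case can be worked out explicitly; the following is a sketch in which the variable $u$ and the shorthand $l = l_p(\sigma)$ are introduced only for the derivation. With $K(t\sigma) = e^{-\frac{t^2}{2h^2}}$ and $u = \frac{t^2}{2h^2}$, one has $t^{n-1} \, \textnormal{d}t = 2^{\frac{n}{2}-1} h^n u^{\frac{n}{2}-1} \, \textnormal{d}u$, hence
\begin{equation*}
\int_0^{l} e^{-\frac{t^2}{2h^2}} t^{n-1} \, \textnormal{d}t = 2^{\frac{n}{2}-1} h^n \int_0^{\frac{l^2}{2h^2}} u^{\frac{n}{2}-1} e^{-u} \, \textnormal{d}u = 2^{\frac{n}{2}-1} h^n \, \Gamma\left(\tfrac{n}{2}\right) \overline{\gamma}\left(\tfrac{n}{2}, \ \tfrac{l^2}{2h^2}\right).
\end{equation*}
Multiplying by $\vol (\mathbb{S}^{n-1}) = \frac{2 \pi^{\frac{n}{2}}}{\Gamma(\frac{n}{2})}$ cancels $\Gamma(\frac{n}{2})$ and gives $(2\pi h^2)^{\frac{n}{2}} \, \overline{\gamma}\left(\frac{n}{2}, \ \frac{l^2}{2h^2}\right)$. For an unbounded direction, $l = \infty$ and $\overline{\gamma} = 1$, so the summand equals the full Gaussian integral $(2\pi h^2)^{\frac{n}{2}}$.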

Next, we propose a sampling procedure for the CVDE which is a version of \emph{hit-and-run} sampling for distributions on higher-dimensional polytopes (\cite{hitnrun}). It consists of first choosing a generator $p = z^{(0)} \in P$ uniformly. Then, one traverses the cell $C(p)$ by constructing a Markov chain $\{z^{(i)} \}$ in the following way. A random versor $\sigma^{(i+1)} \in \mathbb{S}^{n-1}$ is sampled uniformly and the next point $z^{(i+1)}$ is sampled from $\frac{1}{\vol_p}K(p, \cdot)$ restricted to the segment $\{z^{(i)} + t\sigma^{(i+1)} \ | \ t \in [-l_{z^{(i)}}(-\sigma^{(i+1)}), \ l_{z^{(i)}}(\sigma^{(i+1)})]\}$. As shown by \cite{hitnrun}, the Markov chain converges w.r.t. the total variation distance to the underlying distribution $\frac{1}{\vol_p}K(p, \cdot)$ over $C(p)$. In practice, one terminates the sampling process after a number $I$ of steps, returning the last point $z^{(I)}$. Figure \ref{figurehitnrun} shows an instance of hit-and-run on a simple two-dimensional dataset.
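
For the Gaussian kernel (Equation \ref{gaussiankernel}), the one-dimensional sampling step takes an explicit form. The following is a sketch, with the shorthand $\mu$ introduced here for illustration. Writing $z = z^{(i)}$ and $\sigma = \sigma^{(i+1)}$, and using $\|\sigma\| = 1$,
\begin{equation*}
K(p, z + t\sigma) = e^{-\frac{\| z - p\|^2 + 2t\langle \sigma, z - p \rangle + t^2}{2h^2}} \ \propto \ e^{-\frac{(t - \mu)^2}{2h^2}}, \qquad \mu = \langle \sigma, p - z \rangle,
\end{equation*}
where the proportionality is with respect to $t$. Under these assumptions, $t$ follows a normal distribution with mean $\mu$ and standard deviation $h$, truncated to the interval $[-l_{z}(-\sigma), \ l_{z}(\sigma)]$, which can be sampled, for instance, by inverting the normal cumulative distribution function on that interval.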



\subsection{Computational Complexity}
The computational optimizations deserve a separate discussion. As seen from Equations \ref{eq:lz} and \ref{eq:lqz}, the natural way of estimating the directional radius $l_z(\sigma)$ for given $z\in \mathbb{R}^n$ and $\sigma \in \mathbb{S}^{n-1}$ would require $O(n|P|)$ numerical operations. This would bring the overall computational cost to $O(n\max_p|\Sigma_p||P|^2)$ for the spherical integrals and to $O(n|P|I)$ for a sampling run with $I$ hit-and-run steps. 

In order to optimize the algorithms, we first rewrite Equation \ref{eq:lqz} as 
\begin{equation}\label{eq:lqz-rewritten}
    l^q_z(\sigma) = \frac{\inner{q}{q} - \inner{p}{p} - 2\inner{z}{q} + 2\inner{z}{p}}{2\inner{\sigma}{q} - 2\inner{\sigma}{p}}. 
\end{equation}
 In spherical integration, we deploy the same set of \ssizeopt versors $\Sigma=\Sigma_p \subset \mathbb{S}^{n-1}$ for all the generators. This allows us to pre-compute $\inner{q}{p}$ and $\inner{\sigma}{p}$ for all $p, q \in P, \sigma \in \Sigma$, achieving a total computational complexity of $O(n|P|^2 + n|\Sigma||P| + |\Sigma||P|^2)$.
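
To give a sense of scale, consider the purely illustrative sizes $n = 784$, $|P| = 10^4$ and $|\Sigma| = 10^3$ (hypothetical values chosen for this example, not taken from the experiments; constants hidden by the $O$-notation are ignored). The operation counts then compare as
\begin{align*}
n|\Sigma||P|^2 &= 784 \cdot 10^3 \cdot 10^8 \approx 7.8 \cdot 10^{13}, \\
n|P|^2 + n|\Sigma||P| + |\Sigma||P|^2 &= 7.84 \cdot 10^{10} + 7.84 \cdot 10^{9} + 10^{11} \approx 1.9 \cdot 10^{11},
\end{align*}
that is, a reduction by a factor of roughly $400$ in this setting. The dimension $n$ no longer multiplies the dominant term $|\Sigma||P|^2$.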

For the sampling procedure, we similarly fix a prior finite set $\Sigma$ of all available versors. This does not affect the convergence property of the hit-and-run Markov chain, provided that $\Sigma$ linearly spans $\mathbb{R}^n$ (\cite{belisle1993hit}). While $\langle \sigma, p \rangle$ and $\inner{q}{p}$ can be pre-computed in $O(n|P|^2 + n\ssize|P|)$ time, the terms involving $z$ in Equation \ref{eq:lqz-rewritten} require more care. To that end, recall that the $i$-th step of the hit-and-run Markov chain is given by $z^{(i)} = z^{(i-1)} + t^{(i-1)} \sigma^{(i-1)}$ for appropriately sampled $t^{(i-1)}, \sigma^{(i-1)}$. For each $p \in P$, the term $\langle z, p \rangle$ can then be updated inductively in $O(1)$ as $\inner{z^{(i)}}{p} = \inner{z^{(i-1)}}{p} + t^{(i-1)}\inner{\sigma^{(i-1)}}{p}$. Summing up, the cost of a hit-and-run Markov chain run reduces to $O((\ssize + |P|)I)$, which does not depend on the space dimensionality $n$ multiplicatively. 

Algorithms \ref{alg:sph-int} and \ref{alg:hitnrun} provide a more detailed description of volume computation and of point sampling via the hit-and-run procedure, respectively, including the discussed optimizations. Note that the loop iterations in both algorithms are independent and involve only elementary algebraic operations. This allows us to exploit GPU parallelism, which significantly improves computational performance.




\begin{algorithm}[t]
    \caption{$\vol_p$ computation with Gaussian kernel}
    \label{alg:sph-int}
    \begin{algorithmic}        
    \footnotesize
    % \Procedure{ComputeVolumes}{$P, \Sigma$}
        \Require $P \subset \mathbb{R}^n$ set of generators \\
        $\Sigma \subset \mathbb{S}^{n-1}$ set of \ssizeopt versors
        \Ensure $\vol_p$ for all $p\in P$
        \State Compute $\inner{q}{p}$ \kw{for all} $(q, p) \in P \times P$
        \State Compute $\inner{\sigma}{p}$ \kw{for all} $(\sigma, p) \in \Sigma \times P$
        \ForAll{$p \in P$}
            \State Initialize $\vol_p \leftarrow 0$
            \ForAll{$\sigma \in \Sigma$}
                \State Initialize $l_p(\sigma) \leftarrow \infty$
                \ForAll{$q \in P \setminus \{p\}$}
                    \State $l^q_p(\sigma) \leftarrow \frac{\inner{q}{q} - 2\inner{q}{p} + \inner{p}{p}}{2\inner{\sigma}{q} - 2\inner{\sigma}{p}}$
                    \If{$l^q_p(\sigma) > 0$}
                        \State $l_p(\sigma) \leftarrow \min\{ l_p(\sigma), l^q_p(\sigma) \}$
                    \EndIf
                \EndFor
                % \State $\vol_p \leftarrow \vol_p + \frac{\left(2\pi h^2\right)^{\frac{n}{2}}}{\ssize} \overline{\gamma}\left(\frac{n}{2}, \ l_p(\sigma)\right)$
                \State $\vol_p \leftarrow \vol_p + \ssize^{-1}{\left(2\pi h^2\right)^{\frac{n}{2}}}\ \overline{\gamma}\left(\frac{n}{2}, \ \frac{l_p(\sigma)^2}{2h^2}\right)$
            \EndFor
        \EndFor
    \end{algorithmic}
\end{algorithm}

\begin{algorithm}[t]
    \caption{CVDE sampling}
    \label{alg:hitnrun}
    \begin{algorithmic}
    \footnotesize
        % \Procedure{ComputeVolumes}{$P, \Sigma$}
        \Require $P \subset \mathbb{R}^n$ set of generators \\
        $\Sigma \subset \mathbb{S}^{n-1}$ set of \ssizeopt versors \\
        $m$ desired number of samples \\
        $I$ number of hit-and-run steps
        \Ensure $Z = Z^{(I)} \subset \mathbb{R}^n$ samples from CVDE
        \State Initialize $Z^{(0)} \sim \text{Uni}^{m}(P)$ 
        \State Compute $\langle p, p \rangle$ \kw{for all} $p \in P$
        \State Compute $\langle z, p \rangle$ \kw{for all} $(z, p) \in Z^{(0)} \times P$
        \State Compute $\langle \sigma, p \rangle$ \kw{for all} $(\sigma, p) \in \Sigma \times P$
        \For{$i=1$ \kw{to} $I$}
            \ForAll{$z  \in Z^{(i-1)}$}
                \State Sample $\sigma \sim \text{Uni}(\Sigma)$; set $p \leftarrow z^{(0)}$
                \State Initialize $l_z(-\sigma) \leftarrow \infty$, $l_z(\sigma) \leftarrow \infty$
                \ForAll{$q \in P \setminus \{p\}$}
                    \State $l^q_z(\sigma) \leftarrow \frac{\inner{q}{q} - \inner{p}{p} - 2\inner{z}{q} + 2\inner{z}{p}}{2\inner{\sigma}{q} - 2\inner{\sigma}{p}}$
                    \If{$l^q_z(\sigma) > 0$}
                        \State $l_z(\sigma) \leftarrow \min\{ l_z(\sigma), l^q_z(\sigma)\}$
                    \Else
                        \State $l_z(-\sigma) \leftarrow \min\{l_z(-\sigma), -l^q_z(\sigma)\}$
                    \EndIf
                \EndFor
                \State Sample $t \in [-l_{z}(-\sigma), \ l_{z}(\sigma)]$ with density $\propto K(p, z + t\sigma)$
                \State Add $z + t \sigma$ to $Z^{(i)}$
                \State Update $\inner{z}{q} \leftarrow \inner{z}{q} + t\inner{\sigma}{q}$ \kw{for all} $q \in P$
            \EndFor
        \EndFor
    \end{algorithmic}
\end{algorithm}


 \begin{figure}[h!]
\centering
\includegraphics[width=.8\columnwidth]{hitnrun_image.png}
\caption{An illustration of the hit-and-run sampling procedure, with a trajectory of length $I=4$ for each generator. The sampled points are displayed in orange.}\label{figurehitnrun}
\end{figure}






% Algorithms \ref{alg:sph-int} and \ref{alg:hitnrun} provide a more detailed description of volume computation and point sampling via the hit-and-run procedure respectively. Note that the loops in both algorithms are independent and involve elementary algebraic operations. This allows to utilize GPU capabilities, which also significantly boosts the computation time.



\section{THEORETICAL PROPERTIES}
\subsection{Convergence}
We now discuss the convergence of the CVDE when the set $P$ of generators is sampled from an underlying distribution. Suppose that $\mathbb{P} = \rho \textnormal{d}x$ is an absolutely continuous probability measure on $\mathbb{R}^n$ defined by a density $\rho \in L^1(\mathbb{R}^n)$. When $P$ is sampled from $\mathbb{P}$, the CVDE can be considered as (the density of) a random probability measure. We denote by $\mathbb{P}_m$ this random measure when the number of generators is $m$, i.e., $\mathbb{P}_m = f \textnormal{d}x$ for $P \sim \rho^m$.

The following is our main theoretical result. It guarantees that $\mathbb{P}_m$ converges to $\mathbb{P}$ with respect to a canonical notion of convergence for random measures, assuming $\rho$ has full support. 

\begin{thm}\label{thmbody}
Suppose that $\rho$ has full support, i.e., its support is the whole of $\mathbb{R}^n$. For any $K \in L^1(\mathbb{R}^n \times \mathbb{R}^n)$ the sequence of random probability measures $\mathbb{P}_m$ converges to $\mathbb{P}$ in distribution w.r.t. $x$ and in probability w.r.t. $P$. Namely, for any measurable set $E \subseteq \mathbb{R}^n$ the sequence $\mathbb{P}_m(E)$ of random variables converges in probability to the constant $\mathbb{P}(E)$.
\end{thm}

\begin{proof}
We outline here an idea of the proof and refer to the Appendix for full details. For a measurable set $E$, $\mathbb{P}_m(E)$ is equal to
\begin{equation}
\label{eqproof}
\frac{1}{m}| P \cap E | + \textnormal{residue}
\end{equation}
    where the residue is bounded by (twice) the relative number $R$ of generators whose Voronoi cell intersects the boundary $\partial E$ of $E$. The variable $\frac{1}{m}| P \cap E |$ tends to $\mathbb{P}(E)$ in probability by the law of large numbers. 
    
    We then proceed to show that the boundary term $R$ tends to $0$ in probability. To this end, we first prove that the diameters of the Voronoi cells intersecting $E$ tend uniformly to $0$, which in turn requires a preliminary result constraining such cells to a neighbourhood of $E$ (which is assumed to be bounded). Given that, we conclude that $R$ tends to $\mathbb{P}(\partial E)$ by the law of large numbers. By the Portmanteau Lemma (\cite{van2000asymptotic}), we can assume that $\mathbb{P}(\partial E) = 0$ (and that $E$ is bounded), which concludes the proof.  
\end{proof}
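
To make the decomposition in Equation \ref{eqproof} more concrete, the following sketch assumes that, by the definition of the CVDE (Equation \ref{defcvde}), each Voronoi cell $C(p)$ carries mass $\frac{1}{m}$ under $\mathbb{P}_m$; the precise constants are those of the Appendix. Under this assumption,
\begin{equation*}
\mathbb{P}_m(E) = \frac{1}{m} \sum_{p \in P} \frac{\vol_p(C(p) \cap E)}{\vol_p(C(p))}.
\end{equation*}
The summand equals $1$ when $C(p) \subseteq E$ and $0$ when $C(p) \cap E = \emptyset$, and it lies in $[0,1]$ otherwise. Since Voronoi cells are convex, and hence connected, a cell that meets both $E$ and its complement intersects $\partial E$. Each such cell changes the sum by at most $1$ with respect to the count $| P \cap E |$, so the difference $\mathbb{P}_m(E) - \frac{1}{m}| P \cap E |$ is controlled by the relative number of cells intersecting $\partial E$.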

Note that the above result holds for any (integrable) kernel, even a discontinuous one. The kernel is fixed, and there is no need for a possible bandwidth (Equation \ref{convolutional}) to vanish asymptotically. This is in contrast with the KDE, which requires $h$ to tend to $0$ at an appropriate rate in order to obtain convergence to $\rho$ (\cite{devroye1979l1, jiang2017uniform}). This is because of the local geometric bias inherent to the KDE, as discussed in Section \ref{intro}. In order to obtain convergence, such bias has to be corrected with a vanishing bandwidth that annihilates the local geometry of the kernel. 
 
We remark that the assumption on the support of $\rho$ in Theorem \ref{thmbody} is satisfied in the presence of noise, which is realistic in practical scenarios. Assuming that data exhibit, say, Gaussian noise, the actual underlying distribution is of full support even when the ideal one is concentrated on a submanifold of $\mathbb{R}^n$.  

\subsection{Bandwidth Asymptotics}\label{banddiscussion}
Consider a kernel in the form of Equation \ref{convolutional}. The asymptotics with respect to $h$ (with a fixed set of generators $P$) can be easily deduced:

\begin{prop}
For a continuous $K: \ \mathbb{R}^n \rightarrow \mathbb{R}_{\geq 0}$, the following hold: 

     (i) As $h$ tends to $0$, $f$ converges in distribution to the empirical measure $\frac{1}{|P|}\sum_{p \in P}\delta_p$, where $\delta_p$ denotes the Dirac delta centered at $p$, i.e., the probability measure concentrated on the singleton $\{ p\}$.
    
      (ii) Consider the restriction of the kernel to a bounded region $A$ (i.e., its product with $\chi_A$). As $h$ tends to $+ \infty$, $f$ converges in distribution to the VDE $\widetilde{f}$. 
\end{prop}
\begin{proof}
For the first statement, note that $\frac{1}{h^n}K(\frac{x}{h})$ tends to $K(0) \delta_0$ in distribution by the general theory of approximations of the identity. Since $\lim_{h \to 0} \vol_p(C(p)) = K(0)$ as well for every $p$, the claim follows from the definition of the CVDE (Equation \ref{defcvde}). As for the second part, observe that $K(x,p)$ tends to $K(0)$ by continuity of $K$ and thus $f(x)$ tends to $\widetilde{f}(x)$ for almost every $x$. To conclude, pointwise convergence of PDFs implies convergence in distribution (Scheffé's Lemma). 
\end{proof}

The asymptotics for small bandwidth are the same as for the KDE. For bandwidth tending to infinity, however, the KDE tends to the uniform distribution over $A$, while the CVDE still gives reasonable estimates in the form of its non-compactified version. 





\section{RELATED WORK}\label{relwork}

\textbf{Non-parametric Density Estimation}. The first traces of systematic density estimation date back to the introduction of histograms (\cite{pearsonhistograms}). Those have been subsequently considered with a variety of cell geometries such as rectangles, triangles (\cite{scottrecrangles}) and hexagons (\cite{hexagons}). The choice of geometry constitutes the main source of bias for the histogram-based density estimator.

Arguably, the most popular density estimator is the KDE, first discussed by \cite{rosenblattkde} and \cite{parzenkde}. Numerous extensions have followed, for example, to the multivariate case (\cite{izemanmulti, silvermanmulti}), bandwidth selection methods (\cite{band1, band2}) and algorithms for adaptive bandwidths (\cite{wang2007bandwidth, van2017variable}). The latter aim to partially correct the local geometric bias of the KDE, which is in line with the present work. However, adapting the bandwidth alone provides only a partial solution, since it merely enables different scales of the same local geometry. Among applications, the KDE has been deployed to estimate traffic incidents (\cite{kdetraffic}), archeological data (\cite{kdearcheo}) and wind speed (\cite{kdewind}), to name a few. 



\textbf{VDE and its Applications}. The VDE was originally introduced by \cite{ord} under the name ``ideal estimator'' because of its local geometric adaptivity. Subsequent works have discussed regularisation (\cite{voronoiresample}) and lower-dimensional aspects (\cite{voronoiplanar}). The VDE has seen applications to a variety of real-world densities such as neurons in the brain (\cite{voronoineuronal}), photons (\cite{voronoiphotons}) and stars in a galaxy (\cite{voronoiastronomy}). Although promising, the VDE has been previously limited to low-dimensional problems. 

% \textbf{Numerical Methods for Polytopes}

% {\blue Need to add the following somewhere, either here or in the introduction. The methodology traces back to different roots of ray sampling methods (citations). A particular example of Voronoi density is this: \cite{winovich2019rigorous}, which uses VoroSpokes (patent: ...)}

\textbf{Theoretical Convergence}. Convergence of the VDE has been previously considered in the literature, usually in the language of Poisson point processes. For a uniform underlying distribution, pointwise convergence of the averaged estimated density (i.e., unbiasedness: $\lim_{m \to \infty} \mathbb{E}_{P \sim \rho^m}[\widetilde{f}(x)] = \rho(x)$ for almost all $x$) has been proven by \cite{stationarypoisson}. For non-uniform distributions, the same convergence has been shown by \cite{voronoiresample} under strong continuity assumptions on the density, which allow a reduction to the uniform case. Our theoretical result is based on a different, non-averaged notion of convergence and holds for the more general CVDE with no continuity assumptions. 


\section{EXPERIMENTS}


    
\begin{figure}[t]
    \setlength{\tabcolsep}{-5pt}
    \renewcommand{\arraystretch}{0}
    \centering
    % \scalebox{1}{
    \begin{tabular}{p{0.2\linewidth}cc}
        & \textbf{$n=2$} & \textbf{$n=10$} \\
        {\footnotesize Original} & \includegraphics[align=c, width=.35\linewidth]{bbox_original_2d.png} & \includegraphics[align=c, width=.35\linewidth]{bbox_original_10d.png}\\[0pt]
        
        {\footnotesize VDE} & \includegraphics[align=c, width=.35\linewidth]{bbox_vde_2d.png} & \includegraphics[align=c, width=.35\linewidth]{bbox_vde_10d.png}\\[0pt]

        {\footnotesize CVDE} & \includegraphics[align=c, width=.35\linewidth]{bbox_cvde_2d.png} & \includegraphics[align=c, width=.35\linewidth]{bbox_cvde_10d.png}
    \end{tabular}
    % }
    \caption{Visual comparison between samples from the CVDE and the VDE estimating an $n$-dimensional Gaussian for $n=2,10$. In the $10$-dimensional case, points are projected onto a plane. In high dimensions, the VDE appears biased towards a uniform distribution. This is because of the abundance of unbounded cells, over which the estimated density is constant. }
    \label{fig:vde-vs-cvde}
\end{figure}


\begin{figure}[t]
    \centering
    \begin{subfigure}[b]{.55\linewidth}
        \centering
        \includegraphics[width=\linewidth]{si_g10.png}
        \subcaption*{$10$-dimensional Gaussian}
    \end{subfigure} 
      \begin{subfigure}[b]{.55\linewidth}
        \centering
        \includegraphics[width=\linewidth]{si_mnist.png}
        \subcaption*{MNIST}
    \end{subfigure}
    \begin{picture}(0,0)
    \put(-85,132){{\tiny number of versors}}
      \put(-135,160){\rotatebox{90}{\tiny avg. log-likelihood}}
        \put(-85,16){{\tiny number of versors}}
      \put(-135,44){\rotatebox{90}{\tiny avg. log-likelihood}}
    \end{picture}
    \caption{Stabilisation of the Monte Carlo spherical integral. The plots display the average log-likelihood of the estimated density on the training set as the number of sampled versors increases. For each of the 2 datasets, 10 experimental runs are shown. }
    \label{fig:si_convergence}
\end{figure}


\subsection{Dataset Description}\label{datadescription}
In our experiments, we evaluate the CVDE on datasets of different nature: simple \emph{synthetic distributions} of Gaussian type, \emph{image data} in pixel-space, and \emph{sound data} in a frequency space. The datasets we deploy are the following:

%\begin{itemize}
\noindent
{\bf  Gaussians and Gaussian Mixtures:} for synthetic experiments we generate two types of datasets, each containing $1000$ training and $1000$ test points. The first one consists of samples from an $n$-dimensional standard Gaussian distribution. The second one is sampled from a Gaussian mixture density $\rho = \frac{1}{2}(\rho_1 + \rho_2)$. Here, $\rho_1, \rho_2 $ are Gaussian distributions with means $\mu_1 = (-0.5, 0 , \cdots, 0)$, $\mu_2 = (0.5, 0 , \cdots, 0) $ and standard deviations $\sigma_1 = 0.1$, $\sigma_2 = 100$ respectively. 
    
 \noindent
{\bf  MNIST} (\cite{deng2012mnist}): the dataset  consists of $28 \times 28$ grayscale images of handwritten digits which are normalised in order to lie in $[0,1]^{28 \times 28}$. For each experimental run, we sample half of the $60000$ training datapoints in order to evaluate the variance of the estimation. The test set size is $10000$. 
     
     
 \noindent
{\bf  Anuran Calls} (\cite{Dua:2019}): the dataset consists of $7195$ calls from $10$ species of frogs which are represented by $21$ normalised mel-frequency cepstral coefficients in $[0,1]^{21}$. We retain $10 \%$ of data for testing and again sample half of the training data at each experimental run. 
%\end{itemize}


\begin{figure*}[t!]
    \centering
    \begin{subfigure}[b]{.3\linewidth}
        \centering
        \includegraphics[width=\linewidth]{vskde_twogaussians.png}
        \subcaption*{Gaussian Mixture}
    \end{subfigure}
    \begin{subfigure}[b]{.3\linewidth}
        \centering


 \includegraphics[width=\linewidth]{vskde_mnist.png}
        \subcaption*{MNIST}
    \end{subfigure}
    \begin{subfigure}[b]{.3\linewidth}
        \centering
        \includegraphics[width=\linewidth]{vskde_frogs.png}
        \subcaption*{Anuran Calls}
    \end{subfigure}
    \begin{subfigure}[b]{.05\linewidth}
        \centering
        \includegraphics[width=\linewidth]{vskde_labels.png}
        \subcaption*{}
    \end{subfigure}

    \caption{Empirical comparisons between the \colorbox{YellowOrange!40}{CVDE}, the \colorbox{Green!30}{KDE} and the KDE with adaptive bandwidth (\colorbox{RoyalBlue!30}{AdaKDE}). The plots display the average log-likelihood over the test set as the bandwidth varies. The shadowed region represents standard deviation (with respect to sampling of the dataset) on $5$ experimental runs. }
    \label{fig:vskde}
\end{figure*}

\subsection{Comparison with VDE} 

In this section, we empirically evaluate the necessity of compactification for high-dimensional data. To this end, we visually compare samples from the CVDE (with Gaussian kernel) and from the VDE. The VDE is implemented with a bounding hypercube $A = [-\frac{7}{2}, \frac{7}{2}]^n$ as described in Section \ref{method}. 


We consider the Gaussian dataset in $n=2$ and $n=10$ dimensions. For both estimators, $1000$ points are sampled via hit-and-run (with trajectories of length $I=1000$) from the estimated density. The bandwidth for the CVDE is chosen following Scott's rule (\cite{scottdensity}) and amounts to $h=0.33$ in two dimensions and to $h=0.66$ in ten dimensions.
 
The results are presented in Figure \ref{fig:vde-vs-cvde}. In two dimensions, both estimators produce samples that are visually close to the ground-truth distribution. However, in ten dimensions the sampling quality of the VDE drastically decreases, while the CVDE still produces a satisfactory result. In the provided examples, more than $85\%$ of points sampled from the VDE belong to the Voronoi cells intersecting the boundary of $A$. Since the VDE is uniform within each cell, the estimation and the consequent sampling are biased by the choice of the bounding region $A$, especially in high dimensions.    




\subsection{Convergence of the Spherical Integral}

We now empirically estimate the number of Monte Carlo samples required for spherical integration (Equation \ref{sphericalequation}). To this end, we visualize how the approximation of the volumes in the CVDE (with Gaussian kernel) changes as the number $|\Sigma|$ of versors increases. We consider two datasets: the 10-dimensional Gaussian one and MNIST. Each plot in Figure \ref{fig:si_convergence} displays 10 curves, each corresponding to one experimental run. What is shown is the average log-likelihood of the estimated density on the training set, which corresponds up to an additive constant to the average negative logarithmic volume $- \frac{1}{|P|}\sum_{p \in P} \log \vol_p(C(p))$ of the Voronoi cells. The bandwidth is again chosen according to Scott's rule for the Gaussian dataset while it is set to $1$ for MNIST. Evidently, all the curves are stable at $|\Sigma|=5000$ sampled versors, which we fix as a parameter in later experiments.
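
The correspondence between the training log-likelihood and the cell volumes can be spelled out as follows. The sketch assumes that, at each generator, the CVDE takes the value $f(p) = \frac{K(p,p)}{|P| \vol_p(C(p))}$ (Equation \ref{defcvde}) and that the kernel has the form of Equation \ref{convolutional}, so that $K(p,p)$ is a constant $c$ independent of $p$ for a fixed bandwidth. Then
\begin{equation*}
\frac{1}{|P|}\sum_{p \in P} \log f(p) = \log \frac{c}{|P|} - \frac{1}{|P|}\sum_{p \in P} \log \vol_p(C(p)).
\end{equation*}
The first term does not depend on the sampled versors, so the fluctuations of the curves reflect only the Monte Carlo error in the estimated volumes.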


% {\blue
% \subsection{Testing different kernels}
% {\red Likely to be left out, but easy to implement quickly for e.g. laplace distribution.}

% An experiment could be: verify a claim that "the shape of the kernel does not matter as much as the bandwidth."
% }

% {\bf Convergence of hit-and-run.}{\blue
% example: fix dimensionality, vary bandwidth, vary cell shape (e.g. elongate the cube along some dimension(s)), check sampling quality somehow (e.g. against rejection sampling from the cell, which may be very slow but correct). can look at precision recall curves again. 
% }

% \begin{figure}[!tbh]
%     \centering
%     % \begin{subfigure}[b]{.3\linewidth}
%         \centering
%         \includegraphics[width=.7\linewidth]{images/hr_frogs.png}
%         \subcaption{Frogs}
%     % \end{subfigure}
%     % \begin{subfigure}[b]{.3\linewidth}
%     %     \centering
%     %     \includegraphics[width=\linewidth]{images/hr_frogs_2.png}
%     %     \subcaption{Also frogs (WD / sigma)}
%     % \end{subfigure}
%     \caption{Convergence of the hit-and-run sampling on AnuranCalls. }
%     \label{fig:hr_convergence}
% \end{figure}



\subsection{Comparison with KDE}\label{expkde}  
We now compare the CVDE with the KDE (both with Gaussian kernel) on the synthetic and real-world data described in Section \ref{datadescription}. However, the distribution of high-dimensional real-world data is too sparse in the original ambient space to allow for a meaningful comparison. We consequently pre-process the MNIST and the Anuran Calls datasets via Principal Component Analysis (PCA) and orthogonally project them to the $10$-dimensional subspace with largest variance. We set the dimension of the synthetic Gaussian mixture to $10$ as well. 

We compare the CVDE with the standard KDE as well as the KDE with local, adaptive bandwidths (AdaKDE) described in \cite{wang2007bandwidth}. In the AdaKDE the bandwidth $h_p$ depends on $p \in P$ and is smaller when data is denser around $p$. Specifically, denote by $\hat{f}(p)$ the standard KDE estimate with a global bandwidth $h$. Then $h_p = h \lambda_p$ where $\lambda_p =   (g / \hat{f}(p))^{\frac{1}{2}}$ and $g = \prod_{q \in P}\hat{f}(q)^{\frac{1}{|P|}}$. 
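
As a sanity check on this definition (a direct consequence of the formulas above, not a statement taken from \cite{wang2007bandwidth}), note that $g$ is the geometric mean of the pilot estimates, so that
\begin{equation*}
\log g = \frac{1}{|P|}\sum_{q \in P} \log \hat{f}(q), \qquad \prod_{p \in P} \lambda_p^{\frac{1}{|P|}} = \left( \frac{g}{\prod_{p \in P} \hat{f}(p)^{\frac{1}{|P|}}} \right)^{\frac{1}{2}} = 1.
\end{equation*}
The local bandwidths $h_p$ thus have geometric mean $h$: they are larger than $h$ where $\hat{f}(p) < g$ and smaller than $h$ where $\hat{f}(p) > g$.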

We score the estimators via the average log-likelihood on a test set $P_{\textnormal{test}}$, i.e., $\frac{1}{|P_{\textnormal{test}}|} \sum_{p \in P_{\textnormal{test}}} \log f(p)$. This score measures the adherence of the estimated density to the ground-truth one and penalizes overfitting thanks to the use of a held-out test set. 

The results are displayed in Figure \ref{fig:vskde} with the bandwidth varying for all the estimators. For AdaKDE we vary the global bandwidth for $\hat{f}$. Sampling of training and test data is repeated for $5$ experimental runs, from which mean and standard deviation of the score are displayed. 

When each estimator is considered with its best bandwidth, the CVDE outperforms the baselines. This shows that the local geometric adaptivity of the CVDE leads to density estimates that are closer to the ground-truth distribution. Moreover, the CVDE displays remarkably better scores as the bandwidth increases. This is consistent with the discussion in Section \ref{banddiscussion} as the CVDE has more informative asymptotics than the KDE for large $h$. On the real-world datasets (MNIST and Anuran Calls), the adaptive bandwidth does not drastically improve the performance of KDE. On the synthetic data, the AdaKDE is instead competitive with the CVDE. This indicates that the local adaptivity of the AdaKDE is enough to capture simple densities such as a Gaussian mixture. However, for more complex distributions the AdaKDE still suffers from the bias due to the Gaussian kernel (albeit with a local bandwidth) as mentioned in Section \ref{relwork}. The CVDE instead effectively adapts to the local geometry of data via the Voronoi tessellation. 


\section{CONCLUSIONS AND FUTURE WORK}
In this work, we defined an extension of the Voronoi Density Estimator suitable for high-dimensional data, providing efficient methods for approximate computation and sampling. Additionally, we proved convergence to the underlying data density. 

A promising line of future research lies in exploring both theory and applications of the VDE and CVDE to metric spaces beyond the Euclidean one, in particular higher-dimensional Riemannian manifolds. Spheres, for example, naturally appear in the context of normalised data, while complex projective spaces of arbitrary dimension arise as Kendall shape spaces on the plane (\cite{directional}). 

\section{ACKNOWLEDGEMENTS}
This work was supported by the Swedish Research Council, the Knut and Alice
Wallenberg Foundation and the European Research Council (ERC-BIRD-884807).

\newpage
%\bibliography{references}

\printbibliography

% \appendix
% \input{sections/appendix}
\end{document}
