% \documentclass{uai2022} % for initial submission
\documentclass[accepted]{uai2022} 

%% Choose your variant of English; be consistent
\usepackage[american]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % typesetting numbers and units
\usepackage{booktabs} % good-looking tables
\usepackage{tikz} % drawings and diagrams

%% Self-defined macros (and packages)
\usepackage{soul}
\usepackage{graphicx}
\usepackage{caption}
\usepackage{subcaption}
\usepackage{xcolor}
\usepackage{amsmath}
\newcommand{\mathcolorbox}[2]{\colorbox{#1}{$\displaystyle #2$}}
% \newcommand{\omar}[1]{\textcolor{blue}{#1}}
% \newcommand{\alex}[1]{\textcolor{green}{#1}}
% \newcommand{\aapo}[1]{\textcolor{red}{#1}} 

% \usepackage[backref=page]{hyperref}
\input{math_commands.tex}  % optional math commands from github.com/goodfeli/dlbook_notation


\usepackage{amsthm}
\newtheoremstyle{mythmstyle}% name
{\topsep}%Space above
{-2\topsep}%Space below
{\itshape}%Body font
{0pt}%Indent amount
{\bfseries}% Theorem head font
{}%Punctuation after theorem head
{ }%Space after theorem head
{}% theorem head specification

\theoremstyle{mythmstyle}
\newtheorem{theorem}{Theorem}
\newtheorem{conjecture}{Conjecture}
\newcommand\numberthis{\addtocounter{equation}{1}\tag{\theequation}}
% \newcommand{\sectionbreak}{\clearpage}


\title{The Optimal Noise in Noise-Contrastive Learning Is \\ Not What You Think}

% Add authors
\author[1]{Omar Chehab}
\author[1]{Alexandre Gramfort}
\author[2]{Aapo Hyv\"arinen}

% Add affiliations
\affil[1]{%
    Université Paris-Saclay, Inria, CEA, 
    Palaiseau, France
}
\affil[2]{%
    Department of Computer Science, University of Helsinki, Finland 
}
  
\begin{document}
\maketitle

\begin{abstract}
    Learning a parametric model of a data distribution is a well-known statistical problem that has seen renewed interest as it is brought to scale in deep learning. Framing the problem as a self-supervised task, where data samples are discriminated from noise samples, is at the core of state-of-the-art methods, beginning with Noise-Contrastive Estimation (NCE). Yet, such contrastive learning requires a good noise distribution, which is hard to specify; domain-specific heuristics are therefore widely used. While a comprehensive theory is missing, it is widely assumed that the optimal noise should in practice be made equal to the data, both in distribution and proportion; this setting underlies Generative Adversarial Networks (GANs) in particular. Here, we empirically and theoretically challenge this assumption on the optimal noise. We show that deviating from this assumption can actually lead to better statistical estimators, in terms of asymptotic variance. In particular, the optimal noise distribution differs from the data distribution and even belongs to a different family.
\end{abstract}

\section{Introduction}
\label{sec:intro}

Learning a parametric model of a data distribution is at the core of statistics and machine learning. Once a model is learnt, it can be used to generate new data, to evaluate the likelihood of existing data, or be introspected for meaningful structure such as conditional dependencies between its features. Among an arsenal of statistical methods developed for this problem, Maximum-Likelihood Estimation (MLE) has stood out as the go-to method: given data samples, it evaluates a model's likelihood to have generated them and retains the best fit. However, MLE is limited by the fact that the parametric model has to be properly normalized, which may not be computationally feasible.

In recent years, an alternative has emerged in the form of Noise-Contrastive Estimation (NCE) \citep{gutmann2012nce}: given data samples, it generates noise samples and trains a discriminator to learn the data distribution by contrast. Its supervised formulation, as a binary prediction task, is simple to understand and easy to implement. In fact, NCE can be seen as one of the first and most fundamental methods of \textit{self-supervised} learning, which has seen a huge increase of interest recently~\citep{vanoord2018cpc, chen2020simclr}.

Crucially, NCE can handle unnormalized, i.e.\ energy-based, models. It has shown remarkable success in Natural Language Processing \citep{mnih2012nce, mikolov2013nce} and has spearheaded an entire family of contrastive methods \citep{pihlaja2010nce, gutmann2011bregman,menon2016nce,ma2018contrastive,goodfellow2014gan,vanoord2018cpc}. 

While MLE is known to be optimal in the asymptotic limit of infinite samples, NCE is a popular choice in practice due to its computational advantage. In fact, NCE outperforms Monte Carlo Maximum Likelihood (MLE-MC) \citep{rioudurand2018nce}, an estimation procedure where normalization is performed by importance sampling. 

Nevertheless, NCE's performance depends on two hyperparameters: the choice of noise distribution and the noise-data ratio (or, equivalently, the proportion of noise samples)~\citep{gutmann2012nce}. A natural question follows: what are the optimal noise distribution and noise-data ratio for learning the data distribution? 
% \textbf{Choosing the noise}
There are many heuristics for choosing the noise distribution and ratio in the NCE setting. Conventional wisdom in the related setting of GANs and variants \citep{goodfellow2014gan, gao2020fce} is to set both the proportion and the distribution of noise to be equal to those of the data. The underlying assumption is a game-theoretic notion of optimality: the task of discriminating data from noise is hardest, and therefore most ``rewarding'', when noise and data are virtually indistinguishable. The noise would then be optimal when the discriminator is no longer able to distinguish noise samples from data samples. 

However, such an adversarial form of training where a noise generator aims to fool the discriminator suffers from instability and mode-collapse \citep{dieng2019gan, lorraine2021complexmomentum}.
Furthermore, while the above assumptions (optimal noise equals data) have been supported by numerous empirical successes,
it is not clear whether such a choice of noise (distribution and ratio) achieves optimality from a statistical estimation viewpoint. Since NCE is fundamentally motivated by parameter estimation, the optimization of hyperparameters should logically be based on that same framework.

In this work, we propose a principled approach for choosing the optimal noise distribution and ratio while challenging, both theoretically and empirically, the current practice.
In particular, we make the following claims that challenge conventional wisdom:
%
\begin{enumerate}
    \item The optimal noise distribution is not the data distribution; in fact, it belongs to a very different family than the model family.
    \item The optimal noise proportion is generally not 50\%; the optimal noise-data ratio is not one.
\end{enumerate}
%
The paper is organized as follows. First, we present NCE and related works in Section~\ref{sec:background}, as well as the theoretical framework of asymptotic MSE that we use to optimize the NCE estimator. We start Section~\ref{sec:plan} by empirically showing that the optimal noise distribution is not the data distribution. Our main theoretical results describing the optimal noise distribution are in Section~\ref{sec:nonparametric}.
Specifically, we analytically provide the optimal noise for NCE in two interesting limits,
and numerically verify how close to optimal that noise remains outside these limits. We further show empirically that the optimal noise proportion is not 50\% either. Finally, we discuss the limitations of this work in Section~\ref{sec:discussion} and conclude in Section~\ref{sec:conclusion}. 


\paragraph{Notation} We denote with $p_d$ a data distribution, 
$p_n$ a noise distribution, and $(p_{\vtheta})_{\vtheta \in \Theta}$ 
a parametric family of distributions assumed to contain 
the data distribution $p_d = p_{\vtheta^*}$. 
All distributions are assumed normalized: to simplify the analysis, the NCE estimator does not treat the normalizing constant as a parameter to be estimated. In this setup, NCE can be fairly compared to MLE, and the Cramer-Rao bound is well-defined and applicable.
The logistic function is denoted by  $\sigma(x)$. 
We will denote by $\nu$ the ratio between the number of noise samples and data samples: $\nu = T_n / T_d$.
The notation $\langle \vx, \vy \rangle_\mA := \langle \vx, \mA \vy \rangle$ refers to the inner product with metric $\mA$. The induced norm is $\| \vx \|_\mA := \| \mA^{\frac{1}{2}} \vx \|$.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Background}
\label{sec:background}

\subsection{Definition of NCE}

Noise-Contrastive Estimation consists in 
approximating a data distribution $p_d$ by training a
discriminator $D(\vx)$ to distinguish data samples 
$(\vx_i)_{i  \in [1, T_d]} \sim p_d$ 
from noise samples $(\vx_i)_{i  \in [1, T_n]} \sim p_n$~\citep{gutmann2012nce}. This defines a binary task where $Y=1$ is the data label and $Y=0$ is the noise label. The discriminator is optimal when it equals the (Bayes) posterior 
\begin{align} % multiline
    D(\vx) = P(Y=1 \mid \vx) 
    = 
    \frac{p_d(\vx)}{p_d(\vx) + \nu p_n(\vx)}
    = 
    \sigma \left(\log \frac{p_d(\vx)}{\nu p_n(\vx)}\right)
\end{align} 
i.e.\ when it learns 
the density ratio $\frac{p_d}{p_n}$ \citep{gutmann2012nce, mohamed2016implicitgen}. 
The basic idea of NCE is that replacing the data distribution in this ratio by $p_\vtheta$
and optimizing the discriminator with respect to $\vtheta$
yields a useful estimator $\hat{\vtheta}_{\mathrm{NCE}}$, because at the optimum the model density must equal the data density. 
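
As a short worked check of the posterior above (a sketch, assuming the class priors are the sample proportions $P(Y=1) = \frac{T_d}{T_d + T_n} = \frac{1}{1+\nu}$ and $P(Y=0) = \frac{\nu}{1+\nu}$), Bayes' rule gives
\begin{align*}
    P(Y=1 \mid \vx) 
    &= \frac{\frac{1}{1+\nu} \, p_d(\vx)}{\frac{1}{1+\nu} \, p_d(\vx) + \frac{\nu}{1+\nu} \, p_n(\vx)} \\
    &= \frac{p_d(\vx)}{p_d(\vx) + \nu p_n(\vx)} \\
    &= \sigma \Big( \log \frac{p_d(\vx)}{\nu p_n(\vx)} \Big) ,
\end{align*}
where the last step uses $\sigma(\log r) = \frac{r}{1 + r}$. Consequently, $1 - D(\vx) = \frac{\nu p_n(\vx)}{p_d(\vx) + \nu p_n(\vx)}$, which is the weight that enters the integrals of Eq.~\ref{eq:asympmsenceintegrals}.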

Importantly, there is no need for the model to be normalized; the normalization constant (partition function) can be input as an extra parameter, in stark contrast to MLE.

\subsection{Asymptotic analysis}

We consider here a very well-known framework to analyze the statistical performance of an estimator. Fundamentally, we are interested in the Mean-Squared Error (MSE), generally defined as
\[
\mathbb{E}_{\vtheta}[(\hat{\vtheta} - \vtheta)^2] = \mathrm{Var}_{\vtheta}(\hat{\vtheta})+ \mathrm{Bias}_{\vtheta}(\hat{\vtheta}, \vtheta)^{2} .
\]
It can mainly be analyzed in the asymptotic regime, with the number of data points $T_d$ being very large. For (asymptotically) unbiased estimators, the 
estimator's statistical performance is in fact completely characterized by 
its asymptotic variance (or rather, covariance matrix) because the bias squared is of a lower order for such estimators. The asymptotic variance is
classically defined as
\begin{equation} \label{eq:asvar}
    \mSigma = \lim_{T_d\rightarrow \infty} T_d \, \mathbb{E}_{\vtheta}[(\hat{\vtheta} - \mathbb{E}_{\vtheta}[\hat{\vtheta}])(\hat{\vtheta} - \mathbb{E}_{\vtheta}[\hat{\vtheta}])^\top]
\end{equation} 
where the estimator is evaluated for each sample size $T_d$ separately.
Thus, we use the asymptotic variance to compute an asymptotic approximation of the total Mean-Squared Error which we define as
%
\begin{equation}
\mathrm{MSE} = \frac{1}{T_d} \mathrm{tr}(\mSigma) 
% = \frac{1 + \nu}{T} \mathrm{tr}(\mSigma)
\enspace .
\label{eq:mse}
\end{equation}
%
In the following, we talk about MSE to avoid any confusion regarding the role of bias: we emphasize that the MSE is given by the asymptotic variance since the bias squared is of a lower order (for consistent estimators, and under some technical constraints). Furthermore, the MSE is always defined in the asymptotic sense as in Eqs.~(\ref{eq:asvar}) and (\ref{eq:mse}).

When considering normalized distributions, classical statistical theory tells us that the best attainable $\mathrm{MSE}$ 
(among unbiased estimators) is the Cramer-Rao bound, achieved by 
Maximum-Likelihood Estimation (MLE). This provides a useful baseline, and implies that $\mathrm{MSE}_{\mathrm{NCE}} \geq \mathrm{MSE}_{\mathrm{MLE}}$ necessarily.

In contrast to a classical statistical framework, however, we consider here the case where the bottleneck of the estimator is the computation, while data samples are abundant. This is the case in many modern machine learning applications. The computation can be taken proportional to the total number of data points used, real data and noise samples together, which we denote by $T = T_d + T_n$. Still, the same asymptotic analysis framework can be used.

An asymptotic analysis of NCE has been carried out 
by~\citet{gutmann2012nce}.
The MSE of NCE depends on three design choices (hyperparameters) 
of the experiment:
\begin{itemize}
    \item the noise distribution $p_n$ 
    \item  the noise-data ratio $\nu = T_n/T_d$, from which the noise proportion can be equivalently calculated
    \item the total number of samples $T= T_d + T_n$, corresponding here to the computational budget
    \end{itemize}
    Building on Theorem 3 of~\citet{gutmann2012nce}, we can write $\mathrm{MSE}_{\mathrm{NCE}}$ as a function of $T$ (not $T_d$) to enforce a finite computational budget, giving
%
\begin{align}
\begin{split}
    \mathrm{MSE}_{\mathrm{NCE}}(T, \nu, p_n) & = \\
    \frac{\nu + 1}{T} \mathrm{tr}
    (
        \mI^{-1} & - \frac{\nu + 1}{\nu} (\mI^{-1} \vm \vm^\top \mI^{-1})
    )
    \label{eq:asympmsence}
\end{split}
\end{align}
%
where $\vm$ and $\mI$ are a generalized score mean and covariance, in which the integrand is weighted by the term $(1 - D(\vx))$ involving the optimal discriminator $D(\vx)$:
%
\begin{align*}
    \vm & = 
    \int \vg(\vx) (1 - D(\vx)) p_d(\vx)d\vx \\
    \mI & = 
    \int \vg(\vx)\vg(\vx)^\top (1 - D(\vx)) p_d(\vx)d\vx
    \numberthis
    \label{eq:asympmsenceintegrals}
\end{align*}
%
The (Fisher) score vector is the gradient (or derivative in one dimension) of the log of the data distribution with respect to its parameter, $\vg(\vx) = \nabla_{\vtheta} \log p_{\vtheta}(\vx)|_{\vtheta=\vtheta^*}$. Its unweighted mean (without the discriminator weight term) is zero and its unweighted covariance is the Fisher Information matrix, written as $\mI_F = \int \vg(\vx)\vg(\vx)^\top p_d(\vx)d\vx$ for the rest of the paper. 

Bridging the gap in statistical efficiency between NCE and MLE therefore amounts to optimizing Eq.~\ref{eq:asympmsence} with respect to the three hyperparameters.

\subsection{Previous work}

Despite some early results, choosing the best noise distribution to reduce the variance of the NCE estimator remains largely unexplored. \citet{gutmann2012nce} and \citet{pihlaja2010nce} remark that setting $p_n = p_d$ yields an MSE $(1 + \frac{1}{\nu})$ times higher than the Cramer-Rao bound. Therefore, with an infinite budget $T \rightarrow \infty$, taking all samples from noise, $\nu \rightarrow \infty$, brings $\mathrm{MSE}_{\mathrm{NCE}}$ down to the Cramer-Rao bound.
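
The factor $(1 + \frac{1}{\nu})$ can be recovered from Eq.~\ref{eq:asympmsence} in a few lines. The following is a sketch, assuming that the Cramer-Rao bound is evaluated with the same number $T_d = T / (1 + \nu)$ of data samples and writing $\mI_F$ for the Fisher Information matrix. For $p_n = p_d$, Bayes' rule makes the optimal discriminator constant, $D(\vx) = \frac{1}{1+\nu}$, so that
\begin{align*}
    \vm &= \tfrac{\nu}{1+\nu} \int \vg(\vx) p_d(\vx) d\vx = 0 , \\
    \mI &= \tfrac{\nu}{1+\nu} \, \mI_F , \\
    \mathrm{MSE}_{\mathrm{NCE}} 
    &= \tfrac{\nu+1}{T} \cdot \tfrac{\nu+1}{\nu} \, \mathrm{tr}(\mI_F^{-1}) \\
    &= \big( 1 + \tfrac{1}{\nu} \big) \tfrac{1}{T_d} \, \mathrm{tr}(\mI_F^{-1}) ,
\end{align*}
where $\frac{1}{T_d} \mathrm{tr}(\mI_F^{-1})$ is the Cramer-Rao bound for $T_d$ data samples. Written in terms of the budget, the same expression reads $\frac{(\nu+1)^2}{\nu T} \mathrm{tr}(\mI_F^{-1})$, which for this particular choice of noise is minimized at $\nu = 1$.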

Motivated by the same goal of improving the statistical efficiency of NCE,  \citet{pihlaja2010nce, gutmann2011bregman} and \citet{uehara2018nce} have looked at reducing the variance of NCE. 
They relax the original NCE objective by writing it as an M-divergence between the distributions $p_d$ and $p_\theta$ \citep{uehara2018nce} or as a Bregman-divergence between the density ratios $\frac{p_d}{\nu p_n}$ and $\frac{p_\theta}{\nu p_n}$. Choosing a divergence boils down to the use of specific non-linearities, which when chosen for the Jensen-Shannon f-divergence leads to the NCE estimator.
\citet{pihlaja2010nce} numerically explore which non-linearities lead to the lowest MSE, but they explore estimators different from NCE.

More recently, \citet{uehara2018nce} show that the asymptotic variance of NCE can be further reduced by using the MLE estimate of the noise parameters obtained from the noise samples, as opposed to the true noise distribution. A similar idea underlies Flow-Contrastive Estimation \citep{gao2020fce}. While this is useful in practice, it does not address the question of finding the optimal noise distribution.

When the noise distribution is fixed, it remains to optimize the noise-data ratio $\nu$ and samples budget $T$. The effect of the samples budget on the NCE estimator is clear: it scales as $\mathrm{MSE}_{\mathrm{NCE}} \propto \frac{1}{T}$.
Consequently and remarkably, the optimal noise distribution and noise-data ratio
actually do not depend on the budget $T$. As for the noise-data ratio $\nu$, while \citet{gutmann2012nce} and \citet{pihlaja2010nce} report that NCE reaches the Cramer-Rao bound when both $\nu$ and $T$ tend to infinity, this is of limited practical use given finite computational resources $T$. In the finite-sample regime, \citet{pihlaja2010nce} offer numerical results touching on this matter, although they assume that the noise prior is 50\%, which greatly simplifies the problem as the MSE then becomes linearly dependent on $\nu$.

\section{Optimizing noise in NCE}
\label{sec:plan}

In this work we aim to directly optimize the MSE of the original NCE estimator with respect to the noise distribution and noise-data ratio.
Analytical optimization of the $\mathrm{MSE}_{\mathrm{NCE}}$ with respect to the noise distribution $p_n$ or ratio $\nu$ is a difficult task: both terms appear nonlinearly within the integrands. Even in the simple case where the data follows a one-dimensional Gaussian distribution parameterized by variance, as specified in Section 2 of the Supplementary Material, the resulting expression is intractable. This motivates the need for numerical methods.

In the following, we pursue two different strategies for finding the optimal $p_n$. First, $p_n$ can be chosen within the same parametric family as the data distribution (we use the same parametric model for simplicity), as in Section~\ref{sec:parametric}; this leads to a simple one-dimensional optimization problem (e.g.\ optimizing a Gaussian mean or variance $\theta$). Second, one can relax this assumption and use more flexible ``non-parametric'' methods, such as a histogram-based expression for $p_n$, as in Sections~\ref{sec:nonparametric} and~\ref{ssec:experiments}. In the latter case, assuming the bins of the histogram are fixed, one has in practice a higher-dimensional optimization problem with one weight per histogram bin to estimate.

\subsection{Optimization within the same parametric family}
\label{sec:parametric}

We use here simple data distributions to illustrate the difficulty of finding the optimal distribution. We work with families of a single scalar parameter to make sure that the numerical calculations can be performed exactly.

The data distributions considered from now on are picked among three generative models with a scalar parameter:
\begin{enumerate}
    \item[(a)] a univariate Gaussian parameterized by its mean and whose variance is fixed to 1,
    
    \item[(b)] a univariate zero-mean Gaussian parameterized by its variance,

    \item[(c)] a two-dimensional zero-mean Gaussian parameterized by correlation, i.e.\ the off-diagonal entries of the covariance matrix. The variables are taken standardized.
\end{enumerate}

While the Gaussian distribution is simple, it is ubiquitous in the generative modeling literature and remains a popular choice 
in state-of-the-art deep learning algorithms, such as Variational Auto-Encoders (VAEs). Yet, to our knowledge, it remains completely unknown to date how to design the optimal noise to infer the parameters of a Gaussian using NCE.

Assuming the same parametric distribution for the noise as for the data,
Figure~\ref{fig:noiseval_vs_dataval} presents the optimal noise parameter as a function of the data parameter. Details on numerical methods are explained below. For the three models above and setting $\nu=1$, one can observe that the noise parameter systematically differs from the data parameter. They are equal only in the very special case of estimating correlation (case c) for uncorrelated variables.
This means that the optimal noise distribution is not equal to the data distribution, even when the noise and the data are restricted to be in the same parametric family of distributions. 

Looking more closely, one can notice that the relationship between the optimal noise parameter and the data parameter highly depends on the estimation problem. For model (a), the optimal noise mean is (randomly) above or below the data mean, 
while at constant distance (cf.\ the two local minima of the MSE landscape shown in Section 1 of the Supplementary Material). For model (b), the optimal noise variance is obtained from the data variance by a scaling of $3.84$. This linear relationship is consistent with the symmetry of the problem with respect to the variance parameter. Interestingly, for model (c), the optimal noise parameter exhibits a nonlinear relationship to the data parameter: for a very low positive correlation between variables the noise should be negatively correlated, whereas when data variables are strongly correlated, the noise should also be positively correlated. 

\begin{figure}[!t]
\centering
\includegraphics[width=\columnwidth]{img/noiseval_vs_dataval.pdf}
\caption{Relationship between the (optimal) noise parameter and the data parameter.
(top left) Optimal noise mean in model (a) as a function of the data mean. Note that the noise parameter has two symmetric local minima, given by the individual points, which are joined by a manually drawn line. (top right) Optimal noise variance in model (b) as a function of the data variance. (bottom left) Optimal noise correlation in model (c) as a function of the data correlation.}
\label{fig:noiseval_vs_dataval}
\end{figure} 

Having established how different the optimal parametric noise can be, a question naturally follows: what does the optimal, unconstrained noise distribution look like?


\subsection{Theory}
\label{sec:nonparametric}

While the analytical optimization of the noise model is intractable, it is possible to study some limit cases, and by means of Taylor expansions, obtain analytical results which hopefully shed some light on the general behavior of the estimator even far away from those limits.

In what follows, we study an analytical expression for the optimal noise distribution in three limit cases: (i) when the noise distribution is an (infinitesimal) perturbation of the data distribution $\frac{p_d}{p_n} \approx 1$; as well as when the noise proportion (ratio) is chosen so that training uses either (ii) all noise samples $\nu \rightarrow \infty$ or (iii) all data samples $\nu \rightarrow 0$. The following theorem, which covers limits (i) and (ii), is proven in Section 4 of the Supplementary Material.
%
\begin{theorem} \label{th:one}
    In either of the following two limits: 
    \begin{enumerate}
        \item[(i)] the noise distribution is an (infinitesimal) perturbation of the data distribution $\frac{p_d}{p_n}(\vx) = 1 + \eps(\vx)$;
        \item[(ii)] the limit of all noise samples $\nu \rightarrow \infty$;
    \end{enumerate}
    the noise distribution minimizing asymptotic MSE is
    % distributions match. They are given by:
%    
    \begin{align}
        p_n^{\mathrm{opt}}(\vx) 
        \propto 
        p_d(\vx) 
        % \|\mI_F^{-1} \vg(\vx)\|
        \| \vg(\vx) \|_{\mI_F^{-2}}
        \label{eq:allnoisebestmse} \enspace .
    \end{align}
\label{th:allnoisebestmse}
\end{theorem}
%


Interestingly, this is the same as the optimal noise derived by \citet{pihlaja2010nce} for another, related estimator (Monte Carlo MLE with Importance Sampling). 
For example, in the case of estimating Gaussian variance:
$p^{\mathrm{opt}}_n(x) \propto \frac{1}{\sqrt{2\pi\theta}}e^{-\frac{x^2}{2\theta}}|x^2 - \theta|$
which is highly \textit{non-Gaussian unlike the data distribution}. Similar derivations can be easily done for the cases of Gaussian mean or correlation.
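
For illustration, the scores for models (a) and (b) follow directly from the Gaussian log-density. This is a worked sketch; for a scalar parameter the Fisher information only rescales the norm in Eq.~\ref{eq:allnoisebestmse}, so it is absorbed in the proportionality constant.
\begin{align*}
    \text{(a)} \quad 
    & \log p_\theta(x) = -\tfrac{1}{2}(x - \theta)^2 + \mathrm{const} , \\
    & g(x) = x - \theta , \\
    & p_n^{\mathrm{opt}}(x) \propto e^{-\frac{(x - \theta)^2}{2}} \, |x - \theta| ; \\
    \text{(b)} \quad 
    & \log p_\theta(x) = -\tfrac{1}{2} \log(2 \pi \theta) - \tfrac{x^2}{2\theta} , \\
    & g(x) = \frac{x^2 - \theta}{2 \theta^2} , \\
    & p_n^{\mathrm{opt}}(x) \propto e^{-\frac{x^2}{2\theta}} \, |x^2 - \theta| .
\end{align*}
In both cases the density vanishes where the score is zero (at $x = \theta$ for (a), and at $x = \pm\sqrt{\theta}$ for (b)), so the optimal noise places its mass away from the points that carry no information on the parameter.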

In Section 4 of the Supplementary Material, we further derive a general formula for the gap between the MSE for the typical case $p_n = p_d$ and the optimal case $p_n = p_n^{\mathrm{opt}}$. It is given by 
%
\begin{equation}
\Delta \mathrm{MSE} 
= 
\frac{1}{T} 
\mathrm{Var}_{x \sim p_d}
(
% \|\mI_F^{-1}\vg(\mathbf{x})\|
\| \vg(\vx) \|_{\mI_F^{-2}}
) \enspace .
\end{equation}
%
This quantity seems to be positive for any reasonable distribution, which implies (in the all-noise limit) that the optimal noise cannot be the data distribution $p_d$.  Furthermore, we can compute the gap to efficiency in the all-noise limit, i.e.\ between $p_n = p_n^{\mathrm{opt}}$ and the Cramer-Rao lower bound 
%
$
\Delta_{\text{opt}} \mathrm{MSE} = 
\frac{1}{T} \mathbb{E}_{x \sim p_d}(
% \|\mI_F^{-1}\vg(\mathbf{x})\|
\| \vg(\vx) \|_{\mI_F^{-2}}
)^2 .$
%
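
The two gaps can be related through the elementary identity $\mathrm{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$. The following is a short check, writing $X = \| \vg(\vx) \|_{\mI_F^{-2}} = \| \mI_F^{-1} \vg(\vx) \|$ with $\vx \sim p_d$ and reading the last expression above as the squared expectation $(\mathbb{E}[X])^2$:
\begin{align*}
    \mathbb{E}[X^2] 
    &= \mathbb{E}\big[ \vg(\vx)^\top \mI_F^{-2} \vg(\vx) \big] \\
    &= \mathrm{tr}\big( \mI_F^{-2} \, \mathbb{E}[\vg(\vx) \vg(\vx)^\top] \big) 
    = \mathrm{tr}(\mI_F^{-1}) , \\
    \Delta \mathrm{MSE} + \Delta_{\text{opt}} \mathrm{MSE} 
    &= \frac{1}{T} \mathrm{tr}(\mI_F^{-1}) .
\end{align*}
Under this reading, $\Delta \mathrm{MSE}$ vanishes only if $\| \mI_F^{-1} \vg(\vx) \|$ is constant $p_d$-almost surely. For the Gaussian mean of model (a), $X = |x - \theta|$ is not constant, so the gap is strictly positive.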

In the third case, the limit of all data, we have the following conjecture: 
\begin{conjecture} \label{conj:one}
    In case (iii), the limit of all data samples $\nu \rightarrow 0$, the optimal noise distribution concentrates all its mass on the set of those $\boldsymbol{\xi}$ which are given by
%
    \begin{align*}
        \arg\max_{\boldsymbol{\xi}} \,
        &p_d(\boldsymbol{\xi}) 
        \mathrm{tr} \bigg(
        (\vg(\boldsymbol{\xi})\vg(\boldsymbol{\xi})^\top)^{-1}
        \bigg)^{-1} \\
        \mathrm{s.t.} & \quad \vg(\boldsymbol{\xi}) = \mathrm{constant}
        \numberthis \enspace .
        \label{eq:alldatabestmse}
    \end{align*}
%
\label{th:alldatabestmse}
\end{conjecture}
%
This is typically a degenerate distribution since it is concentrated on a low-dimensional manifold, in the sense of a Dirac delta. For a scalar parameter, the function whose maxima are sought is simply $p_d(\boldsymbol{\xi}) 
\| \vg(\boldsymbol{\xi}) \|^2$. 
An informal proof of this conjecture is given in Section 4 of the Supplementary Material. The ``proof'' is not quite rigorous due to the singularity of the optimal ``density'', which is why we label this as a conjecture only. 
Indeed, this closed-form formula (Eq.~\ref{eq:alldatabestmse}) was obtained using a Taylor expansion up to the first order. This formula is well-defined in one dimension but is challenging in higher dimensions as it involves the inversion of a rank-one matrix, which we accomplish by regularization (provided at the end of Section 4 of the Supplementary Material). 
While this is in apparent contradiction to having the noise distribution's support include the data distribution's, this result can be understood as a first-order approximation of what one should do with few noise data points available.

Specifically, in the case of estimating a Gaussian mean (for unit variance), the maximization in the first line of Eq.~\ref{eq:alldatabestmse} yields two candidates for $p_n^{\mathrm{opt}}(x)$ to concentrate its mass on: $\delta_{-\sqrt{2}}$ and $ \delta_{\sqrt{2}}$. Moreover, the second line of Eq.~\ref{eq:alldatabestmse} predicts how the probability mass should be distributed between the two candidates: because they have different scores $g(-\sqrt{2}) \neq g(\sqrt{2})$, they are two distinct global minima. This is consistent with the two minima observed for the Gaussian mean in Figure~\ref{fig:noiseval_vs_dataval} (top-left). Similarly, when estimating a Gaussian variance, the maximization in the first line of Eq.~\ref{eq:alldatabestmse} yields candidates $\delta_{-\sqrt{5}}$ and  $\delta_{\sqrt{5}}$ for $p_n^{\mathrm{opt}}(x)$. In this case, however, both candidates have the same score $g(-\sqrt{5}) = g(\sqrt{5})$. The theory above does not say anything about how the probability mass should be distributed between these two points: it can be 50-50 or all on just one point. A possible solution is $p_n^{\mathrm{opt}}(x) =
\frac{1}{2}(\delta_{-\sqrt{5}} + \delta_{\sqrt{5}})$ as observed in Figure~\ref{fig:optimalnoisevar}.
Throughout, the optimal noise distributions are highly \textit{non-Gaussian unlike the data distribution}. 
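
These candidate points can be checked by hand. As a sketch, assume $\theta^* = 0$ for the Gaussian mean and $\theta^* = 1$ for the Gaussian variance (illustrative values, chosen here because they reproduce the points quoted above), and use the scalar objective $p_d(\xi) g(\xi)^2$. Up to positive constants,
\begin{align*}
    \text{mean:} \quad 
    & p_d(\xi) g(\xi)^2 \propto \xi^2 e^{-\frac{\xi^2}{2}} , \\
    & \tfrac{d}{d\xi} \big[ \xi^2 e^{-\frac{\xi^2}{2}} \big] 
    = \xi (2 - \xi^2) e^{-\frac{\xi^2}{2}} , \\
    \text{variance:} \quad 
    & p_d(\xi) g(\xi)^2 \propto (\xi^2 - 1)^2 e^{-\frac{\xi^2}{2}} , \\
    & \tfrac{d}{d\xi} \big[ (\xi^2 - 1)^2 e^{-\frac{\xi^2}{2}} \big] 
    = \xi (\xi^2 - 1)(5 - \xi^2) e^{-\frac{\xi^2}{2}} .
\end{align*}
For the mean, the derivative vanishes at $\xi = 0$ (a minimum) and at $\xi = \pm\sqrt{2}$ (the maxima), with scores $g(\pm\sqrt{2}) = \pm\sqrt{2}$. For the variance, the critical points are $\xi = 0$, $\xi = \pm 1$ and $\xi = \pm\sqrt{5}$, where the objective takes the values $1$, $0$ and $16 e^{-5/2} \approx 1.31$ respectively; the maxima are therefore at $\xi = \pm\sqrt{5}$, with equal scores $g(\pm\sqrt{5}) = \frac{5 - 1}{2} = 2$.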

\begin{figure*}[!ht]
\centering
\begin{subfigure}[t]{0.48\textwidth}
    \centering
    \includegraphics[width=0.9\linewidth]{img/optimal_noise_variance.pdf} 
    \caption{Optimal noise for model (b) (Gaussian variance).}
    \label{fig:optimalnoisevar}
\end{subfigure}
\hfill
\begin{subfigure}[t]{0.48\textwidth}
    \centering
    \includegraphics[width=0.9\linewidth]{img/optimal_noise_mean.pdf} 
    \caption{Optimal noise for model (a) (Gaussian mean).}
    \label{fig:optimalnoisemean}
\end{subfigure}
\caption{Histogram-based optimal noise distributions. Each row gives a different $\nu$ or noise proportion. The pink bars give the numerical approximations. The theoretical approximation of optimal noise is given by the dashed lines: the all-noise limit in the bottom panel, and the all-data limit in the top panel. In the top panel, the optimal noise is given by single points (Dirac masses) which are chosen symmetric for the purposes of illustration, but as explained in the text, they are two global minima in the case of Gaussian mean estimation, whereas when estimating the variance, any distribution of probability on those two points is equally optimal.}
\end{figure*}

So far, we have obtained the optimal noise which minimizes the (asymptotic) estimation error $\mathbb{E} \big[ \| \hat{\theta}_T - \theta^* \|^2 \big] = \frac{1}{T_d} \mathrm{tr}(\Sigma)$ of NCE for the data \textit{parameter}. However, sometimes estimating the parameter is only a means for estimating the data \textit{distribution} --- not an end in itself. We therefore consider the (asymptotic) estimation error induced by the NCE estimator $\hat{\theta}_T$ in the distribution space, measured by the Kullback-Leibler divergence, whose expectation is well known to equal
%
\begin{equation}
    \mathbb{E}\big[ \mathcal{D}_{\mathrm{KL}}(p_d, p_{\hat{\theta}_T}) \big]
    =
    \frac{1}{2T_d} \mathrm{tr}(\Sigma I_F)
    \label{eq:avgklerror}
\end{equation}
%
(shown in Section 5 of the Supplementary Material). We are thus able to obtain the optimal noise for estimating the data \textit{distribution} in cases (i), (ii) and (iii).
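
The origin of Eq.~\ref{eq:avgklerror} can be sketched with a second-order Taylor expansion. This is a heuristic outline under standard regularity assumptions, and the supplementary proof may proceed differently. Since the divergence and its gradient both vanish at $\theta = \theta^*$, and its Hessian there is the Fisher Information matrix,
\begin{align*}
    \mathcal{D}_{\mathrm{KL}}(p_d, p_{\theta}) 
    &\approx \tfrac{1}{2} (\theta - \theta^*)^\top I_F \, (\theta - \theta^*) , \\
    \mathbb{E}\big[ \mathcal{D}_{\mathrm{KL}}(p_d, p_{\hat{\theta}_T}) \big]
    &\approx \tfrac{1}{2} \mathrm{tr}\Big( I_F \, \mathbb{E}\big[ (\hat{\theta}_T - \theta^*)(\hat{\theta}_T - \theta^*)^\top \big] \Big) \\
    &\approx \tfrac{1}{2 T_d} \mathrm{tr}(\Sigma I_F) ,
\end{align*}
where the last step replaces the covariance of the estimator by its asymptotic approximation $\frac{1}{T_d} \Sigma$.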

\begin{theorem} \label{th:two}
    In the two limit cases of Theorem~\ref{th:one},
    the noise distribution minimizing the expected Kullback-Leibler divergence is given by
%    
    \begin{align}
        p_n^{\mathrm{opt}}(\vx) \propto p_d(\vx) 
        % \|\mI_F^{-\frac{1}{2}} \vg(\vx)\|
        \| \vg(\vx) \|_{\mI_F^{-1}}
        \label{eq:allnoisebestkl} \enspace .
    \end{align}
\label{th:allnoisebestkl}
\end{theorem}
%
In the third case, the limit of all data, we have the following conjecture: 
\begin{conjecture}
    In the limit of Conjecture~\ref{conj:one},
    the noise distribution minimizing the expected Kullback-Leibler divergence concentrates all its mass on the set of those $\boldsymbol{\xi}$ which are given by
%
    \begin{align*}
        \arg\max_{\boldsymbol{\xi}} \,
        &p_d(\boldsymbol{\xi}) 
        \mathrm{tr} \bigg(
        (\vg(\boldsymbol{\xi})\vg(\boldsymbol{\xi})^\top)^{-\frac{1}{2}}
        \bigg)^{-1} \numberthis \\
        \mathrm{s.t.} & \quad \vg(\boldsymbol{\xi}) = \mathrm{constant}
        \enspace .
        \label{eq:alldatabestkl}
    \end{align*}
%
\label{th:alldatabestkl}
\end{conjecture}
%
These optimal noise distributions resemble those from Theorem~\ref{th:allnoisebestmse} and Conjecture~\ref{th:alldatabestmse}: only the exponent on the Fisher Information matrix changes. This is predictable, as the new cost function $
%\mathbb{E}\big[ \mathcal{D}_{\mathrm{KL}}(p_d, p_{\hat{\theta}_T}) \big] =
\frac{1}{2T_d} \mathrm{tr}(\Sigma I_F)$ is obtained by scaling with the Fisher Information matrix. More specifically, when the data parameter is scalar, the optimal noises from Theorems~\ref{th:allnoisebestmse} and \ref{th:allnoisebestkl} coincide, as the Fisher Information becomes a multiplicative constant; those from Conjectures~\ref{th:alldatabestmse} and \ref{th:alldatabestkl} do not coincide but are rather similar. The scope of this paper is to investigate the already rich case of a one-dimensional parameter, hence the following focuses on the optimal noise distributions from Theorem~\ref{th:allnoisebestmse} and Conjecture~\ref{th:alldatabestmse}.
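
The scalar case can be made explicit. The following is a worked check, in which the value $\theta^* = 0$ for the unit-variance Gaussian mean is an illustrative assumption. With the norm convention $\| \vx \|_\mA = \| \mA^{\frac{1}{2}} \vx \|$ and a scalar Fisher information $I_F > 0$,
\begin{align*}
    \| g(x) \|_{I_F^{-2}} = \frac{|g(x)|}{I_F} ,
    \qquad
    \| g(x) \|_{I_F^{-1}} = \frac{|g(x)|}{\sqrt{I_F}} ,
\end{align*}
so both theorems give $p_n^{\mathrm{opt}}(x) \propto p_d(x) |g(x)|$ after normalization. In the all-data limit, the two objectives reduce to $p_d(\xi) g(\xi)^2$ and $p_d(\xi) |g(\xi)|$ respectively. For the Gaussian mean these are proportional to $\xi^2 e^{-\xi^2 / 2}$, maximized at $\xi = \pm\sqrt{2}$, and to $|\xi| e^{-\xi^2 / 2}$, maximized at $\xi = \pm 1$: the two solutions differ but have the same two-point structure.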

\subsection{Experiments}
\label{ssec:experiments}

We now turn to experiments to validate the theory above. Specifically, we verify our formulae for the optimal noise distribution in the all-data~(Eq.~\ref{eq:alldatabestmse}) and all-noise~(Eq.~\ref{eq:allnoisebestmse}) limits, by numerically minimizing the MSE~(Eq.~\ref{eq:asympmsence}). Outside these limits, we show that our formulae are competitive against a parametric approach, and that the general-case optimal noise is an interpolation between both limits. We first describe the numerical strategies.

\paragraph{Numerical Methods} 
The integrals from Eq.~\ref{eq:asympmsenceintegrals} involved in evaluating the asymptotic MSE
can be approximated using numerical integration (quadrature) or Monte-Carlo simulations. While both approaches lead to comparable results, quadrature is significantly faster and more precise, especially in low dimensions. However, the Monte-Carlo approach yields an estimate that is fully differentiable with respect to the parameters of $p_n$.

To tackle the one-dimensional parametric problem, we simply used quadrature to evaluate the objective over a dense grid and then selected the minimum. This appeared to be the most computationally efficient strategy, and it allows for visualizing the MSE landscape reported in Section 1 of the Supplementary Material. In the multi-dimensional non-parametric case,
the histogram's weights can be optimized by first-order methods using automatic differentiation.
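
A possible parameterization of such a histogram is given below as an illustrative sketch, not as a description of the released code. The bins $B_1, \dots, B_K$ of common width $\Delta$, the logits $\alpha_k$ and the weights $w_k$ are symbols introduced here for illustration only:
%
\begin{align*}
    p_n(x) = \sum_{k=1}^{K} \frac{w_k}{\Delta} \, \mathbf{1}[x \in B_k]
    \enspace , \qquad
    w_k = \frac{e^{\alpha_k}}{\sum_{j=1}^{K} e^{\alpha_j}}
    \enspace .
\end{align*}
%
The softmax keeps the weights positive and summing to one, so that $p_n$ integrates to one for any value of the logits, and first-order updates on $\alpha_1, \dots, \alpha_K$ require no projection step.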

In the following experiments, the optimization strategy consists of obtaining the gradients of the Monte-Carlo estimate using PyTorch~\citep{pytorch} and plugging them into a non-linear conjugate gradient scheme implemented in SciPy~\citep{scipy}. We chose the conjugate-gradient algorithm as it is deterministic (no residual asymptotic error as with SGD) and as it offered fast convergence. None of the experiments below required more than 100 conjugate-gradient iterations. 
Note that for numerical precision, we had to set PyTorch's default to 64-bit floating-point precision.
Our code is available at \href{https://github.com/l-omar-chehab/nce-noise-variance}{https://github.com/l-omar-chehab/nce-noise-variance}.

\paragraph{Results} Figure~\ref{fig:optimalnoisevar} shows the optimal histogram-based noise distribution for estimating the variance of a zero-mean Gaussian, together with our theoretical predictions (Theorem~\ref{th:allnoisebestmse} and Conjecture~\ref{th:alldatabestmse}).
We can see that our theoretical predictions in the all-data and all-noise limits match numerical results. It is apparent in  Figure~\ref{fig:optimalnoisevar} that the optimal noise places its mass where the data distribution is high, and where it varies most when $\theta^*$ changes.
Furthermore, the noise distribution in the all-data limit has higher mass concentration, which also matches our predictions. Interestingly, in a case not covered by our hypotheses, namely when there are as many noise samples as data samples (a noise proportion of 50\%, or $\nu = 1$), the optimal noise in Figure~\ref{fig:optimalnoisevar} (middle) is qualitatively not very different from the limit cases of all data or all noise samples.
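
The shape of the all-noise prediction can be made concrete with a short sketch. It assumes a zero-mean Gaussian parameterized by its variance $\theta$, and that Theorem~\ref{th:allnoisebestmse} reduces to $p_n^{\mathrm{opt}} \propto p_d \, |g|$ for a scalar parameter. The score is then
%
\begin{align*}
    g(x) = \frac{\partial}{\partial \theta} \log p_\theta(x) = \frac{x^2 - \theta}{2 \theta^2}
    \enspace , \qquad \text{so that} \qquad
    p_n^{\mathrm{opt}}(x) \propto e^{-\frac{x^2}{2 \theta}} \, | x^2 - \theta |
    \enspace .
\end{align*}
%
This density vanishes at $x = \pm \sqrt{\theta}$, where the data density is insensitive to first order to a change in $\theta$, and it has local maxima at $x = 0$ and $x = \pm \sqrt{3 \theta}$, where the data density is both non-negligible and sensitive to $\theta$.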

Figure~\ref{fig:optimalnoisemean} gives the same results for the estimation of a Gaussian's mean. The conclusions are similar; in this case, the optimal distributions in the two limits resemble each other even more. It is important here to take into account the indeterminacy in how probability mass is distributed over the two Diracs, which is consistent with the initial experiments in Figure~\ref{fig:noiseval_vs_dataval} as well as the MSE landscape included in Section 1 of the Supplementary Material.
%
Figure~\ref{fig:optimalnoisemean} illustrates a complex phenomenon occurring in a setup as simple as Gaussian mean estimation.
Our conjecture in Eq.~\ref{eq:alldatabestmse} predicts the equivalent optimal noises seen in our experiments, in Figure 1 (top-left) and Figure 2.b., where the noise concentrates its mass on either point of the set $\{-\sqrt{2}, \sqrt{2}\}$. Indeed, Eq.~\ref{eq:alldatabestmse} shows that any noise which concentrates its mass on a set of points where the score is constant is (equally) optimal. So despite its approximate nature, Eq.~\ref{eq:alldatabestmse} is able to explain what we observed empirically: in the all-data limit, there can be many equivalent optimal noises.
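
As a worked check of the set $\{-\sqrt{2}, \sqrt{2}\}$, consider the following sketch. It assumes a unit-variance Gaussian with true mean $\theta^* = 0$, and that for a scalar parameter the objective of Eq.~\ref{eq:alldatabestmse} reduces to $p_d(\xi) \, g(\xi)^2$; both are assumptions made for illustration. The score with respect to the mean is then $g(\xi) = \xi$, and
%
\begin{align*}
    \frac{d}{d\xi} \Big( \xi^2 e^{-\xi^2 / 2} \Big)
    = \xi \, (2 - \xi^2) \, e^{-\xi^2 / 2}
    = 0
    \iff
    \xi \in \{ -\sqrt{2}, 0, \sqrt{2} \}
    \enspace .
\end{align*}
%
The objective vanishes at $\xi = 0$ and takes the same positive value $2 e^{-1} / \sqrt{2 \pi}$ at $\xi = \pm \sqrt{2}$, so both points are maximizers and the objective alone does not favor one over the other.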

\begin{figure}[!t]
\centering
\includegraphics[width=\columnwidth]{img/optimal_noise_correlation.png}
\caption{Optimal noise for a 2D Gaussian parameterized by correlation. 2D Gaussians with correlation 0 (top) and 0.3 (bottom three) are considered. The left panel is the data density, the right panel is the optimal histogram-based noise density.
The theoretical approximation of the optimal noise is given by the black level lines: the case of Theorem~1 in the bottom panel, and that of Conjecture~1 in the second panel. Here, the optimal noise in the latter limit is given by a softmax relaxation with temperature 0.01. It places its mass symmetrically on the two single points (Dirac masses), but as explained in the text, any distribution of probability on those two points could be equally optimal.} 
\label{fig:optimalnoisecorr}
\end{figure} 

Figure~\ref{fig:optimalnoisecorr} shows the numerically estimated optimal noise distribution for model (c) using a Gaussian correlation parameter. Here, the distributions are perhaps even more surprising than in previous figures. This can be partly understood from the extremely nonlinear dependence of the optimal noise parameter on the data parameter shown in Fig.~\ref{fig:noiseval_vs_dataval}.

\begin{figure}[!t]
\centering
\includegraphics[width=\columnwidth]{img/mse_vs_noiseprop.pdf}
\caption{Asymptotic MSE vs. noise proportion. Top panel: model (a) with parameter mean; Middle panel: model (b) with parameter variance; Bottom panel: model (c) with parameter correlation. The parameter in ``parametric noise'' is the optimal parameter for $\nu=1$, i.e.\ when half the samples are noise and half are data. The ``optimal noise'' is the approximation given by Theorem~\ref{th:one}.
}
\label{fig:msevsnoiseprop}
\end{figure}

We next ask: how robust to $\nu$ is the analytical noise we derived in these limiting cases?
Figure~\ref{fig:msevsnoiseprop} shows the asymptotic MSE achieved by two noise models across a range of noise proportions. The first noise model is the optimal noise in the parametric family containing the data distribution, $p_n = p_{\theta}$, optimized for $\nu=1$, while the second noise model is the optimal analytical noise $p_n^{\mathrm{opt}}$ derived in the all-noise limit~(Eq.~\ref{eq:allnoisebestmse}). Both are compared to the Cram\'er-Rao lower bound. For all models (a), (b) and (c), the optimal analytical noise $p_n^{\mathrm{opt}}$ (red curve) is empirically useful even far away from the all-noise limit, and across the entire range of noise proportions. In fact, $p_n = p_n^{\mathrm{opt}}$ empirically seems a better choice than using the data distribution $p_n = p_d$, and is (quasi) uniformly equal to or better than a parametric noise $p_n = p_\theta$ optimized for $\nu=1$.

\begin{figure}[!ht]
\centering
\includegraphics[width=0.9\columnwidth]{img/noiseprop_vs_noiseval.pdf}
\caption{Optimal noise proportion against the noise parameter. Top panel for model (a), Gaussian mean; Middle panel for model (b), Gaussian variance; Bottom panel for model (c), Gaussian correlation.}
\label{fig:optimalpropvsval}
\end{figure} 

\subsection{Optimizing Noise Proportion}
\label{sec:noiseprop}

Next, we consider optimization of the noise proportion. It is often heuristically assumed that having 50\% noise, i.e.\ $\nu=1$, is optimal. On the other hand, \citet{pihlaja2010nce} provided a general analysis, although it did not quite answer this question from a practical viewpoint, nor is it compatible with the basic NCE framework of this paper.

In the special case where $p_d = p_n$, we can actually show (see Section 3 of the Supplementary Material) that the optimal noise proportion is $50\%$. This is obtained for a fixed computational budget $T$, as the noise proportion varies between 0 and 1. When this constraint on the budget is relaxed, the optimal noise proportion is $\nu \rightarrow \infty$ as in Corollary 7 and Figure 4.d. of~\cite{gutmann2012nce}.
The converse of the theoretical result above does not hold: an optimal noise proportion of $50\%$ does \textit{not}
ensure that the noise distribution equals the data's, as shown by counter-examples in Figures~\ref{fig:noiseval_vs_dataval} and \ref{fig:optimalpropvsval}.
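
The first statement can be sketched in a few lines. The sketch assumes a normalized model whose score has zero mean under $p_d$, in which case the standard NCE asymptotic covariance for $p_n = p_d$ is taken to reduce to $\Sigma = \frac{1 + \nu}{\nu} \mI_F^{-1}$; this form is an assumption stated here without proof, and the full argument is in the Supplementary Material. Writing $T_n$ for the number of noise samples (a symbol introduced for illustration), with $\nu = T_n / T_d$ and a fixed budget $T = T_d + T_n$, we have $T_d = T / (1 + \nu)$ and
%
\begin{align*}
    \frac{1}{T_d} \mathrm{tr}(\Sigma)
    = \frac{(1 + \nu)^2}{\nu \, T} \, \mathrm{tr}(\mI_F^{-1})
    = \frac{1}{T} \Big( \nu + 2 + \frac{1}{\nu} \Big) \, \mathrm{tr}(\mI_F^{-1})
    \enspace .
\end{align*}
%
The derivative of $\nu + 2 + 1 / \nu$ is $1 - 1 / \nu^2$, which vanishes at $\nu = 1$, i.e.\ a noise proportion of $50\%$. If $T_d$ is held fixed instead, the remaining factor $(1 + \nu) / \nu$ decreases monotonically towards $1$ as $\nu \rightarrow \infty$.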

However, in the general case $p_n \neq p_d$, the optimal proportion is not 50\%. We can again look at Figure~\ref{fig:msevsnoiseprop}, which shows the MSE as a function of the noise proportion for simple one-parameter families.
The MSE is not, in general, minimized at 50\% for the noise distributions considered here. 
In fact, the parameter of the noise distribution is here optimized for a proportion of 50\%, so the results are biased towards finding that proportion optimal; even so, it is not the optimum in most cases.

A closer look at this phenomenon is given by Figure~\ref{fig:optimalpropvsval}, which shows the optimal noise proportion as a function of a Gaussian's parameter (mean, variance, or correlation). We see that while the optimal proportion is 50\% when the noise uses the data parameter, it is in general smaller.


\section{Discussion}
\label{sec:discussion}

We have shown that choosing an optimal noise means choosing a noise distribution that is \textit{different} from the data's. An interesting question is what implications this has for GANs, which iteratively guide the noise distribution to \textit{match} the data's. Both NCE and GANs in fact solve the binary task of discriminating data from noise. While the optimal discriminator for the binary task recovers the density ratio between data and noise, GANs parameterize the entire ratio (as well as the noise distribution), while NCE only parameterizes the ratio numerator. 
Hence they do not learn the same object, though GANs do claim inspiration from NCE~\citep{goodfellow2014gan}. Moreover, of course, the goals of the two methods are completely different: GANs do not perform estimation of  parameters of a statistical model but focus on the generation of data.
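
To make this comparison concrete, recall the standard form of the Bayes-optimal discriminator when noise samples are $\nu$ times as numerous as data samples; the symbols $D^*$ and $r$ are introduced here for illustration:
%
\begin{align*}
    D^*(\vx) = \frac{p_d(\vx)}{p_d(\vx) + \nu \, p_n(\vx)} = \frac{r(\vx)}{r(\vx) + \nu}
    \enspace , \qquad
    r(\vx) = \frac{p_d(\vx)}{p_n(\vx)}
    \enspace .
\end{align*}
%
In NCE, the discriminator is modeled as $p_{\vtheta} / (p_{\vtheta} + \nu \, p_n)$ with $p_n$ fixed and known, so that only the numerator of the ratio $r$ is learned.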

Nevertheless, GAN updates \textit{have} inspired the choice of NCE noise as in Flow-Contrastive Estimation (FCE) by \citet{gao2020fce}, which parameterizes both the numerator and the denominator of the discriminator ratio, providing a bridge between NCE and GANs. Results on FCE by \citet{gao2020fce} empirically demonstrate that the choice of noise matters: NCE is made quicker by iterative noise updates \textit{\`a la} GAN, presumably because setting the noise distribution equal to the data's reduces asymptotic variance compared to choosing a generic noise distribution such as the best-matching Gaussian. Noise updates based on the optimal noise in this paper could perhaps accelerate convergence even further, avoiding the numerical difficulties of an adversarial game while still increasing the statistical efficiency.

However, using the optimal noise distributions we present in Section~\ref{sec:nonparametric} can be numerically challenging, especially when the parametric model $p_{\vtheta}$ is higher-dimensional and unnormalized (e.g. $\theta$ is a dense covariance matrix along with the normalization term as a parameter). Evaluating an optimal noise involves the Fisher score (and therefore access to the very data distribution we seek to estimate) and a Monte-Carlo method may be needed for sampling. We hope that these questions can be resolved in practice by having a relatively simple noise model which is still more statistically efficient than alternatives typically used with NCE, and whose choice is guided by our optimality results.

\section{Conclusion}
\label{sec:conclusion}

We studied the choice of optimal design parameters in Noise-Contrastive Estimation. These are essentially the noise distribution and the proportion of noise. We assumed that the total number of data points (real data + noise) is fixed due to computational resources, and sought to optimize those two hyperparameters. It is easy to show empirically that, in stark contrast to what is often assumed, the optimal noise distribution is not the same as the data distribution, thus extending the analysis by \citet{pihlaja2010nce}. Our main theoretical results derive the optimal noise distribution in limit cases where either almost all samples to be classified are noise, or almost all samples are real data, or the noise distribution is an (infinitesimal) perturbation of the data distribution. The optimal noise distributions in two of these cases are different but share the property of emphasizing parts of the data space where the Fisher score function changes rapidly. We hope these results will improve the performance of NCE in demanding applications.

\subsubsection*{Acknowledgements}

Numerical experiments were made possible thanks to the 
scientific Python ecosystem: 
Matplotlib~\citep{matplotlib}, 
Scikit-learn~\citep{scikit-learn}, 
NumPy~\citep{numpy}, 
SciPy~\citep{scipy} and PyTorch~\citep{pytorch}.

We would like to thank our reviewers whose detailed comments have helped improve this paper.

This work was supported by the French ANR-20-CHIA-0016 to Alexandre Gramfort. Aapo Hyv\"arinen was supported by funding from the Academy of Finland and a Fellowship from CIFAR.

\bibliography{chehab_427}

\newpage

% \appendix
% \onecolumn
% \setcounter{conjecture}{0}
% \setcounter{theorem}{0}
% \input{chehab_427-supp}

\end{document}
