% \documentclass{uai2022} % for initial submission
\documentclass[accepted]{uai2022} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2022} % ptmx math instead of Computer
                                         % Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2022} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

% For theorems and such
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{mathtools}
\usepackage{amsthm}
\usepackage{bm}

\usepackage{tabularx}

% if you use cleveref..
\usepackage[capitalize,noabbrev]{cleveref}

\usepackage{microtype}
\usepackage{graphicx}
\usepackage{subfigure}
\usepackage{booktabs} % for professional tables

% hyperref makes hyperlinks in the resulting PDF.
% If your build breaks (sometimes temporarily if a hyperlink spans a page)
% please comment out the following usepackage line and replace
% \usepackage{icml2022} with \usepackage[nohyperref]{icml2022} above.
\usepackage{hyperref}
\usepackage{array}
\newcolumntype{H}{>{\setbox0=\hbox\bgroup}c<{\egroup}@{}}

\usepackage{siunitx}
\robustify\bfseries
\sisetup{table-parse-only,detect-weight=true,detect-inline-weight=text}

% Attempt to make hyperref and algorithmic work together better:
\newcommand{\theHalgorithm}{\arabic{algorithm}}

\theoremstyle{plain}
\newtheorem{theorem}{Theorem}[section]
\newtheorem{proposition}[theorem]{Proposition}
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{corollary}[theorem]{Corollary}
\theoremstyle{definition}
\newtheorem{definition}[theorem]{Definition}
\newtheorem{claim}[theorem]{Claim}
\newtheorem{assumption}[theorem]{Assumption}
\theoremstyle{remark}
\newtheorem{remark}[theorem]{Remark}

% Todonotes is useful during development; simply uncomment the next line
%    and comment out the line below the next line to turn off comments
%\usepackage[disable,textsize=tiny]{todonotes}
\usepackage[textsize=tiny]{todonotes}
\usepackage{multirow}

\newcommand{\itodo}[1]{{\color{red} #1}}
\newcommand{\todoi}[1]{{\color{red} #1}}
\newcommand{\ours}[0]{\textsc{AdaCat}}
\newcommand{\AdaCat}[3]{\ours_{#1}(#2,#3)}

\newcommand{\swap}[3][-]{#3#1#2} % just an example

\newcommand{\nllmetric}[2]{\num[round-mode=places,round-precision=2]{#1} $\pm$ \num[round-mode=places,round-precision=2]{#2}}
\newcommand{\nllmetricb}[2]{{\bfseries \nllmetric{#1}{#2}}}
\newcommand{\mcdmetric}[2]{\num[round-mode=places,round-precision=2]{#1} $\pm$ \num[round-mode=places,round-precision=2]{#2}}
\newcommand{\mcdmetricb}[2]{{\bfseries \mcdmetric{#1}{#2}}}

\title{\ours{}: Adaptive Categorical Discretization for Autoregressive Models}

\author[1]{\href{mailto:<qcli@berkeley.edu>}{Qiyang Li}{}}
\author[1]{\href{mailto:<ajayj@berkeley.edu>}{Ajay Jain}{}}
\author[1]{\href{mailto:<pabbeel@berkeley.edu>}{Pieter Abbeel}{}}
\affil[1]{
University of California Berkeley\\
Berkeley, CA, USA
}
\begin{document}
\maketitle

\begin{abstract}
Autoregressive generative models can estimate complex continuous data distributions, like trajectory rollouts in an RL environment, image intensities, and audio. Most state-of-the-art models discretize continuous data into several bins and use categorical distributions over the bins to approximate the continuous data distribution. The advantage is that categorical distributions can easily express multiple modes and are straightforward to optimize. However, such an approximation cannot express sharp changes in density without using significantly more bins, making it parameter inefficient.
We propose an efficient, expressive, multimodal parameterization called Adaptive Categorical Discretization (\ours). \ours{} discretizes each dimension of an autoregressive model adaptively, which allows the model to allocate density to fine intervals of interest, improving parameter efficiency. \ours{} generalizes both categoricals and quantile-based regression. \ours{} is a simple add-on to any discretization-based distribution estimator. In experiments, \ours{} improves density estimation for real-world tabular data, images, audio, and trajectories, and improves planning in model-based offline RL.
\end{abstract}

\section{Introduction}
\label{sec:intro}

\begin{figure}
    \centering
    \includegraphics[width=\linewidth]{figures/twospirals_color}
    \caption{In the 2-D two-spirals dataset, an autoregressive model parameterizes $p(x^1)$, the marginal distribution over the first dimension, and $p(x^2 | x^1)$, a conditional distribution over the second. Uniform discretization (middle) divides their 1-D support into 16 equal-sized intervals and parameterizes each conditional with a categorical. However, it fits the continuous samples poorly. In contrast, parameterizing $p(x^t | x^{<t})$ with {\color{blue} \ours{}} closely approximates the target distribution with the same number of bins.}
    \label{fig:twospirals}
\end{figure}

\begin{figure}[t]
    \centering
    \includegraphics[width=\linewidth]{figures/adacat_1d_stacked}
    \caption{\ours{} learns how to discretize the support of continuous distributions for expressive, parameter-efficient density estimation, and generalizes other discretization strategies like categoricals with equal bin widths (top). The flexibility afforded by adaptive discretization allows closer approximations of target densities, such as a mixture of two Gaussians with different scales (bottom).}
    \label{fig:1d_example}
\end{figure}

Deep generative models estimate complex, high-dimensional distributions from samples. Autoregressive models like NADE \citep{pmlr-v15-larochelle11a, uria2016neural}, PixelRNN \citep{van2016pixel} and GPT \citep{radford2018improving} express a joint distribution by decomposing it into a product of simpler one-dimensional conditionals. Each of these conditionals $p(x^t \vert x^1, x^2, \ldots, x^{t-1})$ is parameterized by a neural network mapping from a subset of observed variables to logits over the next dimension. For discrete data like language tokens, the conditional takes the form of a categorical distribution. Categorical distributions are relatively easy to optimize, flexible and can easily express multimodal distributions as each bin's logit is independently predicted.

Ordinal and continuous data such as image intensities ranging from 0 to 1 have a natural ordering between possible values of each dimension $x^t$. The categorical does not exploit this ordering, instead separately predicting each bin. Categorical distributions also scale poorly when encoding highly precise data like agent trajectories, tabular datasets and audio \citep{oord2016wavenet}. Auditory quality degrades if the waveform is quantized to fewer than 8--16 bits (256--65k intensity levels). Control applications often need high precision as well. Unfortunately, categorical likelihood degrades rapidly at high quantization levels. The uniformly discretized model in Figure~\ref{fig:twospirals} has a negative log-likelihood (NLL) of $-0.85$ with 16 bins, while our adaptively discretized approach achieves an NLL of $-1.02$ with the same architecture and number of bins (lower is better). We note that the NLL can be negative as it is computed on the continuous density by treating the discretized distribution as a mixture of uniform distributions. Halving the width of a particular bin in the categorical distribution would require doubling the number of parameters in the final layer of the network.

Past work tries to improve the efficiency of categoricals with hierarchical softmax \citep{morin2005hierarchical} or quantile-based discretization \citep{janner2021sequence}. Heuristic, hand-engineered discretizations like the $\mu$-law \citep{oord2016wavenet} reduce quantization error and improve perceptual quality. As an alternative, a single Gaussian, Gaussian mixtures \citep{bishop1994mixture} or logistic mixtures \citep{Salimans2017PixeCNN} are frequently used for parameter efficient conditionals, but can be hard to optimize, especially as the number of mixture components increases.

In this work, we propose a parameterization of 1-D conditionals that is parameter efficient, expressive and multimodal. We propose Adaptive Categorical Discretization (\ours{}). Based on the observation that high precision is often only required in a small subset of a distribution's support, \ours{} is a distribution parameterized by a vector of interval masses \textit{and} interval widths. \ours{} is depicted in Figure~\ref{fig:1d_example}. In contrast to categoricals with equal bin widths, variable bin widths allow the network to localize mass precisely without increasing precision elsewhere. Compared to non-uniform but fixed discretizations like quantiles, \ours{} parameters are adaptive: they are predicted by a neural network conditioned on prior dimensions, which is important as the best discretization for a particular conditional differs from the best for the marginal.

We also propose an analytic target smoothing strategy to ease optimization, and draw connections between target smoothing and dequantization~\citep{uria2016neural} and score matching~\citep{vincent2011connection}. In experiments, \ours{} with target smoothing scales better to few parameters or is competitive with strong baselines on image density estimation, offline reinforcement learning, tabular data and audio.\footnote{The code for reproducing the experiments in this paper is available at \href{https://github.com/ColinQiyangLi/AdaCat}{\texttt{github.com/ColinQiyangLi/AdaCat}}. Website: \href{https://colinqiyangli.github.io/adacat}{\texttt{colinqiyangli.github.io/adacat}}.}

\section{Adaptive Categorical Discretization}
\subsection{\ours{} Distribution}
The \ours{} distribution is a particular subfamily of mixtures of uniform distributions where each mixture component has non-overlapping support. A standard \ours{} distribution $\AdaCat{k}{w}{h}$ has $k$ components with a support over $[0, 1)$. It is parameterized by two vectors in the $k$-dimensional simplex: $w, h \in \Delta^{k-1}$. Thus, $w$ and $h$ are normalized. $w$ is additionally constrained to be non-zero in all of its elements. The probability density function (PDF) of an \ours{} distribution is defined as,
\begin{align}
    f_{w, h, k}(x) = \sum_{i=1}^k \left\{ \mathbb{I}_{\left[c_i 
    \leq x < c_i + w_i \right]} \frac{h_i}{w_i} \right\} \label{eq:pdf}
\end{align}
where $c_i = \sum_{j=1}^{i-1} w_j$ is the prefix sum of the dimensions of parameter $w$ and $\mathbb{I}_{[\cdot]}$ is the indicator function.

Intuitively, $w_i$ captures the size of each discretized bin (support of each mixture component), $h_i$ captures the probability mass assigned to each bin, and $\frac{h_i}{w_i}$ is the density contributed by each bin.
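
As a small worked example (the numbers are ours and purely illustrative), take $k = 3$ with $w = (0.5, 0.25, 0.25)$ and $h = (0.2, 0.6, 0.2)$. The prefix sums are $c = (0, 0.5, 0.75)$, so the density defined in \eqref{eq:pdf} is
\begin{align*}
    f_{w, h, k}(x) =
    \begin{cases}
        0.2 / 0.5 = 0.4, & 0 \leq x < 0.5, \\
        0.6 / 0.25 = 2.4, & 0.5 \leq x < 0.75, \\
        0.2 / 0.25 = 0.8, & 0.75 \leq x < 1,
    \end{cases}
\end{align*}
which integrates to $0.5 \cdot 0.4 + 0.25 \cdot 2.4 + 0.25 \cdot 0.8 = 1$. The narrow middle bin carries most of the mass, so the density is high there without refining the other bins.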


\subsection{Relationship with Uniform and Quantile Discretization}

\paragraph{Connection to Uniform Discretization}

Generative models over ordinal data like PixelRNNs~\citep{van2016pixel} commonly divide the support of 1-D distributions into equal-width bins, and share the same bins across all dimensions of the data. This allows neural networks to parameterize the distribution with a simple classification head that predicts a categorical distribution over bins. \ours{} generalizes 1-D categorical distributions with uniformly discretized support. If $w$ is set to be $w_i = \frac{1}{k}, \forall i$, the distribution is reduced to a categorical distribution over $\{0, \frac{1}{k}, \frac{2}{k}, \cdots, \frac{k-1}{k}\}$ augmented with a uniform noise of magnitude $\frac{1}{k}$. Figure~\ref{fig:1d_example} shows how \ours{} is more expressive than a uniformly discretized categorical, allowing bin widths to vary and more closely approximating the modes of a mixture of two Gaussians.
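
To make the reduction explicit, substituting $w_i = \frac{1}{k}$ into \eqref{eq:pdf} gives $c_i = \frac{i-1}{k}$ and
\begin{align*}
    f_{w, h, k}(x) = k \, h_i \quad \text{for } \tfrac{i-1}{k} \leq x < \tfrac{i}{k},
\end{align*}
so that $\log f_{w, h, k}(x) = \log h_i + \log k$. Up to the constant $\log k$, the log-likelihood is then the usual categorical cross-entropy over $k$ classes.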

\paragraph{Connection to Quantile-based Discretization}
\ours{} also generalizes quantile-based discretization, which discretizes a distribution's support by binning data into groups with equal numbers of observed data points. If $h$ is set to be $h_i = \frac{1}{k}, \forall i$, \textit{i.e.}, the same mass in every bin, the bin boundaries given by the prefix sums of $w$ can be interpreted as the $k$-quantiles of the distribution.
This strategy is employed by generative models like the Trajectory Transformer~\citep{janner2021sequence}, which pre-computes and fixes the bin widths $w$ separately for each dimension to achieve equal mass $\frac{1}{k}$ per bin of the marginal distributions of the training set, then predicts mass $h$ with a neural network based on observed dimensions.
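
Spelled out, with $h_i = \frac{1}{k}$ the density \eqref{eq:pdf} and the mass accumulated up to the right edge of bin $i$ become
\begin{align*}
    f_{w, h, k}(x) &= \frac{1}{k \, w_i} \quad \text{for } c_i \leq x < c_i + w_i, \\
    \int_0^{c_i + w_i} f_{w, h, k}(x) \, dx &= \frac{i}{k},
\end{align*}
so each bin boundary $c_i + w_i$ is the $\frac{i}{k}$-quantile of the distribution, and narrow bins correspond to regions of high density.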

\subsection{Autoregressive Parameterization}
In problems with dimension greater than 1, we use deep autoregressive models to factorize the joint density $f(x)$ into multiple 1-D conditional \ours{} distributions:
\begin{align*}
    p_\theta(x) &= \prod_{t=1}^m p_\theta(x^t | x^{<t})
    = \prod_{t=1}^m f_{w^t, h^t, k}(x^t)
\end{align*}

For each dimension, conditioned on observed or generated values of prior dimensions, the neural net $g_\theta$ outputs two unconstrained parameters $\{\phi^t, \psi^t\} = g_\theta(x^{<t})$, where $\phi^t, \psi^t \in \mathbb{R}^{k}$. The predicted $\phi$ and $\psi$ represent the unnormalized log values for $h$ and $w$ for each dimension. These parameters are normalized independently using a softmax to satisfy the normalization and positivity constraints:
\begin{equation}
\begin{aligned}
    w^t_i = \frac{\exp(\psi^t_i)}{\sum_{j=1}^k \left[\exp(\psi^t_j)\right]}, \quad
    h^t_i = \frac{\exp(\phi^t_i)}{\sum_{j=1}^k \left[\exp(\phi^t_j)\right]} 
\end{aligned}
\end{equation}
where $\phi^t = \phi_\theta(x^{<t}), \psi^t = \psi_\theta(x^{<t})$.
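
In terms of the unconstrained outputs, the log-density of a point $x^t$ that falls in bin $i$ follows directly from the two softmaxes. Writing $\operatorname{LSE}(v) = \log \sum_{j=1}^k \exp(v_j)$, which is our shorthand rather than notation used elsewhere in the paper,
\begin{align*}
    \log f_{w^t, h^t, k}(x^t) &= \log h^t_i - \log w^t_i \\
    &= \left[ \phi^t_i - \operatorname{LSE}(\phi^t) \right] - \left[ \psi^t_i - \operatorname{LSE}(\psi^t) \right].
\end{align*}
For instance, with $k = 3$ the logits $\psi^t = (0, 0, \log 2)$ yield widths $w^t = (\tfrac{1}{4}, \tfrac{1}{4}, \tfrac{1}{2})$.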

Unlike uniform, quantile-based, or heuristic discretization strategies, our autoregressive model can adaptively choose how to discretize each dimension's conditional distribution based on observations. Adaptivity improves expressiveness, since density can be precisely localized in regions of interest, and the discretization can vary across data dimensions. This is especially important for problems where the optimal discretization is not known \textit{a priori}. In the 2-D dataset shown in Figure~\ref{fig:twospirals}, fixed discretizations poorly express the inherent multimodality in the data, while \ours{}'s adaptivity allows the network to shift modes of $p_\theta(x^2 | x^1)$ for different values of the first dimension, $x^1$.

\begin{figure}
    \centering
    \includegraphics[width=0.5\textwidth]{figures/adacat_1d_9999.png}
    \caption{\textbf{1-D Toy Density Modeling}: \ours{} optimized with the non-smoothed objective ({\color{red}{red}}) suffers from \textbf{bin collapse}. The non-smoothed objective shrinks the size of most bins until they are degenerate with small support in order to increase density at the modes. In contrast, with the smoothed objective ({\color{blue}{blue}}), \ours{} closely approximates the target. The transparent curves show an evolution of the learned density at different optimization iterations, with more transparent ones being earlier in the optimization. The code for reproducing this figure and an animated version are available in the supplement.}
    \label{fig:collapse}
\end{figure}


\section{Optimizing with Analytic Target Smoothing}
Autoregressive models with \ours{} conditionals can be estimated by minimizing the Kullback–Leibler (KL)-divergence between the target distribution and the learned density $D_{\text{KL}}(p_\text{data}(x) \;\|\; p_\theta(x))$. The KL reduces to the empirical log-likelihood objective below, where $x_1, \cdots, x_n$ are sampled from the data distribution $p_\text{data}$:
\begin{align}
    \hat{\mathcal{L}}_\text{ll} &= \frac{1}{n} \sum_{d=1}^n \sum_{t=1}^m \log p_\theta(x^t_d | x^{<t}_d) \label{eq:empirical_ll}
\end{align}
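The reduction can be seen by expanding the divergence,
\begin{align*}
    D_{\text{KL}}(p_\text{data} \;\|\; p_\theta) = \mathbb{E}_{p_\text{data}} \left[ \log p_\text{data}(x) \right] - \mathbb{E}_{p_\text{data}} \left[ \log p_\theta(x) \right],
\end{align*}
where the first term does not depend on $\theta$. Minimizing the KL divergence is therefore equivalent to maximizing the expected log-likelihood, and $\hat{\mathcal{L}}_\text{ll}$ is its Monte Carlo estimate on $n$ samples with $p_\theta$ factorized autoregressively.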

However, due to the discontinuous nature of the \ours{} density function, we observed that na{\"i}vely maximizing the empirical log-likelihood often leaves the model trapped in poor local optima.
This phenomenon is best illustrated in 1-D, as in Figure \ref{fig:collapse}. The density in red is estimated with maximum likelihood $\hat{\mathcal{L}}_\text{ll}$~\eqref{eq:empirical_ll}, but the bin widths shrink over the course of optimization and reach small values. Density is overestimated in between modes and underestimated in regions where a single uniform mixture component needs to cover a large interval.

We provide one possible explanation for bin collapse. Rewriting the log-likelihood with \ours{}'s PDF \eqref{eq:pdf} based on a summation over bins $i=1$ to $k$,
\begin{align}
    \hat{\mathcal{L}}_\text{ll} &= \frac{1}{n} \sum_{d=1}^n \sum_{t=1}^m \log \underbrace{\sum_{i=1}^k \left\{ \mathbb{I}_{\left[c_i^t \leq x_d^t < c_i^t + w_i^t \right]} \frac{h_i^t}{w_i^t} \right\}}_{f_{w^t,h^t,k}(x_d^t)} \label{eq:empircal_ll_indicator}
\end{align}
Due to the constraint that mixture components are non-overlapping, only a single term of the inner summation is non-zero in (\ref{eq:empircal_ll_indicator}). The loss separates into terms for $h$ and $w$,
\begin{align}
    \hat{\mathcal{L}}_\text{ll} &= \frac{1}{n} \sum_{d=1}^n \sum_{t=1}^m \sum_{i=1}^k \mathbb{I}_{\left[c_i^t \leq x_d^t < c_i^t + w_i^t \right]} \left\{ \log h_i^t - \log {w_i^t} \right\} \label{eq:empircal_ll_indicator_simple}
\end{align}

Maximizing this objective for data point $x_d$ pushes for higher density $\frac{h_i^t}{w_i^t}$ when $x_d^t$ lies in bin $i$ of the conditional $p_\theta(\cdot | x_d^{<t})$. This density is increased by either increasing the log mass $\log h_i^t$ or decreasing the log bin width $\log w_i^t$. For uniform and heuristic discretizers, $\log w_i^t$ is fixed. For \ours{}, however, updating the bin width $w_i^t$ with finite step sizes can make the data point $x_d^t$ move out of the current bin discontinuously, which can result in biased gradient estimates (see Section C.1 of the supplement).

The gradient $\frac{d}{d w_i^t} \hat{\mathcal{L}}_\text{ll}$ is also constant for any value of a sample within bin $i$ as the density is piece-wise constant, so the gradient encouraging bin collapse does not attenuate as the sample approaches a bin boundary. Once a bin is updated to exclude a particular data point, only the normalization of $w^t$ encourages the bin to grow to include the data point again, but we empirically find that this is not enough to prevent collapse. Instead, optimization could shrink the new bin $w_{i+1}^t$ or $w_{i-1}^t$, repeating until a majority of the mixture components collapse to support a small fraction of the overall interval.  
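
This can be checked directly under the softmax parameterization of $w^t$. For a sample in bin $i$, the width-dependent part of \eqref{eq:empircal_ll_indicator_simple} is $-\log w_i^t$, and
\begin{align*}
    \frac{\partial}{\partial \psi_j^t} \left( -\log w_i^t \right) = w_j^t - \mathbb{I}_{[i = j]},
\end{align*}
which does not depend on where $x_d^t$ lies inside the bin. For $j = i$ the derivative is $w_i^t - 1 < 0$, so a gradient ascent step lowers $\psi_i^t$ and shrinks the occupied bin, while the logits of all other bins increase. This is only a sketch that holds the bin assignment fixed; it ignores the dependence of the bin boundaries $c_i^t$ on $\psi^t$.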

Luckily, this issue can be largely alleviated by using a smoothed objective:
\begin{align}
    \hat{\mathcal{L}}_{s} &= \frac{1}{n} \sum_{d=1}^n \sum_{t=1}^m \mathbb{E}_{\zeta(\tilde{x} | x_d^t)} \left[ \log p_\theta(\tilde{x} | x^{<t}_d) \right] \label{eq:expected_target_smoothed} \\
    &= \frac{1}{n} \sum_{d=1}^n \sum_{t=1}^m \left[ \int_{\tilde{x}} \zeta(\tilde{x} | x^t_d) \log p_\theta(\tilde{x} | x^{<t}_d) d\tilde{x}\right] \label{eq:target_smoothed}
\end{align}
where $\zeta(\tilde{x} | x)$ is any smoothing density function that is centered around $x$. This smoothed objective can be interpreted as the log-likelihood objective under a smoothed data distribution (obtained by applying the smoothing function on top of the data). We discuss this in more detail in the supplement (Section A). In practice, we find that both uniform and Gaussian distributions with mean $x$ effectively prevent the bins from collapsing, and use $\zeta(\cdot | x)=\text{Unif}[x - \frac{\lambda}{2}, x + \frac{\lambda}{2})$ or $\zeta(\cdot | x) = \mathcal{N}(x, \lambda^2)$ in all experiments, truncating at the boundaries of the support $[0, 1)$ of $x$. By optimizing $\hat{\mathcal{L}}_s$ with uniform target smoothing, the density in blue in Figure~\ref{fig:collapse} converges to a close approximation of the target density.

The smoothed objective might seem intractable with an integral in the inner summation. Fortunately, the form of the conditional $\log p_\theta(\tilde{x} | x^{<t}_d)$ with \ours{}'s simple density function allows us to evaluate the integral analytically as long as the smoothing density has an analytic cumulative distribution function (CDF). If $F(\cdot)$ is the CDF of $\zeta$, then the integral can be analytically computed as:
\begin{equation}
\begin{aligned}
    \int_{x} & \zeta(x) \log f_{w, h, k}(x) dx \\ 
    &= \sum_{j=1}^k \left[ (F(c_j + w_j) - F(c_j)) (\log h_j - \log w_j) \right]
\end{aligned}
\end{equation}
where we recall that $c$ is the prefix sum of $w$ as defined previously. Only the bins that intersect with the support of the smoothing density function contribute to this objective. 
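
As an illustration with made-up numbers, consider uniform smoothing around the sample $0.5$ with $\lambda = 0.2$, so that $\zeta = \text{Unif}[0.4, 0.6)$ with CDF $F(z) = \min\{1, \max\{0, (z - 0.4) / 0.2\}\}$. Take $k = 3$, $w = (0.5, 0.25, 0.25)$ and $h = (0.2, 0.6, 0.2)$, where bin $j$ spans $[c_j, c_j + w_j)$. The smoothing density places mass $0.5$ in the first bin $[0, 0.5)$, mass $0.5$ in the second bin $[0.5, 0.75)$ and none in the third, so
\begin{align*}
    \int_{x} \zeta(x) \log f_{w, h, k}(x) dx
    &= 0.5 \log \tfrac{0.2}{0.5} + 0.5 \log \tfrac{0.6}{0.25} \\
    &= 0.5 \log 0.96 \approx -0.02,
\end{align*}
using natural logarithms. Because the CDF differences that weight each bin vary continuously with the bin boundaries, a small change in $w$ changes the objective smoothly rather than abruptly reassigning the sample to a neighboring bin.
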
\subsection{Relationship with denoising score matching}

Energy-based models and denoising autoencoders trained by denoising score matching (DSM; \citealp{vincent2011connection}) minimize a reconstruction objective:
\begin{equation*}
    \mathcal{L}_\text{DSM}(x) = \mathbb{E}_{\zeta(\tilde{x} | x)} \left\| x - \hat{x}_\theta(\tilde{x}) \right\|^2_2
\end{equation*}
Assuming the observation model $p_\theta(\cdot | \tilde{x}) = \mathcal{N}(\hat{x}_\theta(\tilde{x}), I)$ is a Gaussian with identity covariance, the objective can be written, up to a scale factor and an additive constant, as
\begin{equation*}
    \mathcal{L}_\text{DSM}(x) = -\mathbb{E}_{\zeta(\tilde{x} | x)} \left[\log p_\theta(x | \tilde{x})\right]
\end{equation*}
resembling~\eqref{eq:expected_target_smoothed}. However, our target smoothed loss is designed to regularize the output conditional distribution, so perturbations are employed on the output space, not the input space, and our generative model is conditioned on clean, unperturbed observations. Recent works introduce multi-scale perturbations~\citep{DBLP:journals/corr/abs-1907-05600}, and denoising diffusion probabilistic models reweight a related variational bound for this class of models to improve sample quality~\citep{ho2020denoising}.
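
The step between the two forms of $\mathcal{L}_\text{DSM}$ above uses the log-density of an identity-covariance Gaussian in $m$ dimensions,
\begin{align*}
    -\log \mathcal{N}(x; \hat{x}_\theta(\tilde{x}), I) = \tfrac{1}{2} \left\| x - \hat{x}_\theta(\tilde{x}) \right\|^2_2 + \tfrac{m}{2} \log 2\pi,
\end{align*}
so the squared reconstruction error and the negative log-likelihood agree up to a factor of $\frac{1}{2}$ and a constant that does not depend on $\theta$.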

\subsection{Relationship with dequantization}

Other continuous density estimators like normalizing flows trained on discrete data suffer from degenerate solutions if trained na{\"i}vely via maximum likelihood~\citep[Sec.~3.1]{ho2019flow}. Flows suffer from a different failure case than non-smoothed \ours{}. Continuous estimators, \textit{e.g.,} a mixture of Dirac $\delta$ functions, can arbitrarily increase density on discrete training data as the empirical distribution is supported on a set with measure zero. Dequantization avoids the problem by adding continuous noise to observed discrete samples~\citep{theis2015note, hoogeboom2021learning}. As an example, the following dequantized objective gives a lower bound of the log-likelihood of discrete data sample $x$:
\begin{align}
    \log p^\text{DQ}_\theta(x) &= \mathbb{E}_{\zeta(\tilde{x} | x)} \log p_\theta(\tilde{x}) \label{eq:dequant} \\
    &= \int_x^{x+\lambda} \zeta(\tilde{x} | x) \log p_\theta (\tilde{x}) d \tilde{x} \nonumber \\
    &\leq \log \int_x^{x + \lambda} \zeta(\tilde{x} | x) p_\theta(\tilde{x}) d\tilde{x} = \log P_\theta(x), \nonumber
\end{align}
where $\lambda$ is chosen such that the intervals $[x, x + \lambda)$ for different discrete samples $x$ do not overlap with each other.
While~\eqref{eq:dequant} closely resembles \eqref{eq:target_smoothed}, it differs subtly in that dequantization perturbs all dimensions of the data $x$, not just the 1-D target, and that the integral is estimated with a stochastic sample from $\zeta$ rather than computed analytically. We observe bin collapse even on continuous data like the mixtures in Figures~\ref{fig:1d_example} and~\ref{fig:collapse}, and find that single-sample estimates of the expectation do not prevent collapse. These findings suggest that analytic target smoothing addresses the discontinuity in the model conditional rather than a property of the data.

\section{Evaluation}
\label{sec:evaluation}

In experiments, we evaluate the performance of autoregressive density estimators with adaptive categorical conditional distributions for several data modalities. We evaluate on standard benchmarks for real-world tabular data (Section~\ref{sec:eval:tabular}), image generation (\ref{sec:eval:image}), speech synthesis (\ref{sec:eval:speech}) and offline reinforcement learning (\ref{sec:eval:offlinerl}). \ours{} outperforms uniform discretization strategies in all settings, and is competitive with hand-engineered conditional distributions. Beyond density estimation, our results suggest that \ours{} can improve downstream task performance, including speech quality and control.

\subsection{Tabular data modeling}
\label{sec:eval:tabular}



We compare the performance of autoregressive models with \ours{} and uniform parameterizations on real-world tabular density estimation benchmarks, the UCI datasets of \citet{Dua:2019}. State-of-the-art results on these benchmarks are also included for reference. We use a 4-layer feed-forward network to predict the \ours{} parameters for each dimension of the data (\textit{e.g.}, we use 6 MLPs for the POWER dataset since it has 6-dimensional data). Each network has $500$ hidden units for all datasets except for GAS, where we use $1000$ hidden units.

We also use a Fourier encoding of the input inspired by \citet{tancik2020fourier, kingma2021variational} to allow a shallow model to capture high-frequency variations in the input. Specifically, we augment each input element $x^t$ with $b$ pairs of additional features: $\{\sin(2^j x^t), \cos(2^j x^t)\}_{j=0}^{b-1}$. We choose the feature count $b=32$ for GAS and POWER, $b=8$ for MINIBOONE, and $b=4$ for HEPMASS.
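
Written out, and assuming as the word ``augment'' suggests that the raw value is kept alongside the new features, each scalar input is mapped to
\begin{align*}
    \gamma(x^t) = \big( x^t, \; & \sin(2^0 x^t), \cos(2^0 x^t), \; \ldots, \\
    & \sin(2^{b-1} x^t), \cos(2^{b-1} x^t) \big) \in \mathbb{R}^{2b+1},
\end{align*}
where $\gamma$ is our notation. With $b = 4$ each input element becomes a $9$-dimensional vector, and with $b = 32$ a $65$-dimensional one.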

The uniform baseline uses the same architecture except the widths of the bins are forced to be uniform. We search over the number of bins in $\{100, 200, 300, 500, 1000\}$ for both the uniform baseline and \ours{} and select the best to report in the table. All models are trained for 400 epochs using Adam~\citep{kingma2014adam} with a learning rate of $0.0003$ that is halved every 100 epochs. We use truncated Gaussian target smoothing for \ours{} with $\lambda = 0.00001$ for POWER, and $\lambda = 0.0001$ for all other datasets. See more details in the supplement (Section B).
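
Under the assumption that the halving is applied as a step schedule at epochs 100, 200 and 300, the learning rate at epoch $e \in \{0, \ldots, 399\}$ is
\begin{align*}
    \eta(e) = 3 \times 10^{-4} \cdot 2^{-\lfloor e / 100 \rfloor},
\end{align*}
that is, $3 \times 10^{-4}$, $1.5 \times 10^{-4}$, $7.5 \times 10^{-5}$ and $3.75 \times 10^{-5}$ over the four blocks of 100 epochs. Here $\eta$ is our notation.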

Table~\ref{table:tabular} reports results. Overall, \ours{} consistently outperforms the uniform baseline across all datasets, reducing the NLL by 1.9, 5.0, 3.2 and 2.8 nats on POWER, GAS, HEPMASS and MINIBOONE, respectively. Our approach also obtains competitive performance with the state-of-the-art normalizing flow models on GAS.

\begin{table*}[ht!]
\centering
\small
\begin{tabular*}{0.9\textwidth}{lrrrr}
\toprule
\multicolumn{1}{c}{\bf Method} & \multicolumn{1}{c}{\bf POWER (m=6)}  & \multicolumn{1}{c}{\bf GAS (m=8)} & \multicolumn{1}{c}{\bf HEPMASS (m=21)}   & \multicolumn{1}{c}{\bf MINIBOONE (m=43)} \\ 
\midrule

MADE~\citep{germain2015made}    & $3.08$ & $-3.56$ & $20.98$ & $15.59$ \\
MAF~\citep{papamakarios2017masked}    & $-0.24 $& $-10.08$ & $17.70$ & $11.75$ \\
NAF-DDSF~\citep{huang2018neural} & $-0.62$ & $-11.96$ & $15.09$ & $8.86$  \\ 
TAN~\citep{oliva2018transformation}    & $-0.48$ & $-11.19$ & $15.12$ & $11.01$  \\
FFJORD~\citep{grathwohl2018ffjord}    & $-0.46$ & $-8.59$ & $14.92$ & $10.43$  \\
Block NAF~\citep{de2020block} & $-0.61$ & $-12.06$ & $14.71$ & $8.95 $  \\
DDEs~\citep{bigdeli2020learning}      & $-0.97$ & $-9.73$ & $11.3$ & $6.94$ \\
nMDMA~\citep{gilboa2021marginalizable}     & $-1.78$ & $-8.43$ & $18.0$ & $18.6$  \\
\midrule
% Uniform Discretization & $1.32$ & $-6.43$ &   $21.26$ & $18.31$ \\
% \ours{} & $-0.59$ & $-11.25$ & $18.19$ & $14.27$ \\
Uniform Discretization & $1.34$ & $-6.29$ &   $21.37$ & $16.93$ \\
\ours{} & $-0.56$ & $-11.27$ & $18.17$ & $14.14$ \\
\bottomrule
\end{tabular*}
% \vspace{.1cm}
\caption{
\textbf{(Tabular Data)} Test negative log-likelihood for density estimation on UCI datasets~\citep{Dua:2019}. We followed the same data pre-processing pipeline as in \citet{papamakarios2017masked}. \ours{} achieves competitive performance on GAS and consistently outperforms the uniform baseline.}
\label{table:tabular}
\end{table*}

\subsection{Image density estimation}
\label{sec:eval:image}


\begin{figure}[t]
    \centering
    \includegraphics[width=0.5\textwidth]{figures/shorter.png}
    \caption{Test bits per dimension (bpd) on the MNIST image generation task with different output parameter counts. The parameter count is in log scale. The MNIST data is discrete with $2^8 = 256$ possible values for the intensity at each pixel.}
    \label{fig:mnist}
\end{figure}

\begin{table}[ht!]
\centering
\small
% \begin{tabular}{0.59\textwidth}{crrrr}
\begin{tabular}{ccccc}
\toprule
{\bf \multirow{2}{*}{Parameters}} & \multicolumn{1}{c}{ \multirow{2}{*}{\bf Uniform}}  & \multirow{2}{14mm}{\bf Adaptive\\Quantile} & \multicolumn{1}{c}{\bf \multirow{2}{*}{ DMoL}}   & \multicolumn{1}{c}{\bf \multirow{2}{*}{ AdaCat}} \\\\
\midrule
512 & \multicolumn{1}{c}{N/A} & \multicolumn{1}{c}{$\times$} & $0.761$ & $\mathbf{0.561}$ \\
256 & $\mathbf{0.561}$ &  \multicolumn{1}{c}{$\times$} &  $0.698$ & $0.573$  \\ 
216 & $0.838$ &           \multicolumn{1}{c}{$\times$} &  $0.704$ & $\mathbf{0.615}$  \\ 
180 & $1.061$  &          \multicolumn{1}{c}{$\times$} & $0.684$ & $\mathbf{0.629}$  \\ 
152 & $1.299$  &          \multicolumn{1}{c}{$\times$} & $0.776$ & $\mathbf{0.612}$  \\ 
128 & $1.490$  &          \multicolumn{1}{c}{$\times$} & $0.700$ & $\mathbf{0.608}$   \\ 
64  & $2.453$  &          \multicolumn{1}{c}{$\times$} & $0.720$ & $\mathbf{0.695}$  \\ 
32  & $3.392$  & $1.276$ & $\mathbf{0.715}$ &$0.793$   \\
\midrule
Best & $\mathbf{0.561}$ & $1.276$ & $0.715$ & $\mathbf{0.561}$  \\
\bottomrule
\end{tabular}
% \vspace{.1cm}
\caption{
\textbf{(Image Generation)} Test negative log-likelihood in bits per dimension (bpd) on the MNIST image generation task with different output parameter counts. \ours{} outperforms other baselines on most parameter counts. The adaptive quantile baseline diverges with a parameter count higher than 32, indicated by $\times$.}
\label{table:image}
\end{table}


Table~\ref{table:image} compares the performance of \ours{} against existing parameterizations on the grayscale MNIST~\citep{lecun-mnisthandwrittendigit-2010} image generation task in terms of negative log-likelihood measured in bits/dimension. The autoregressive architecture we use for this task is a GPT-like Transformer decoder~\citep{DBLP:journals/corr/VaswaniSPUJGKP17} with 4 layers and 4 heads.\footnote{We use the implementation, training pipeline, and the default training hyperparameters from \href{https://github.com/karpathy/minGPT}{\texttt{github.com/karpathy/minGPT}}, and treat an image as a token sequence with a vocabulary size of 256. We also use a smaller batch size of $20$. See more details in the supplement (Section C.2).} Since the image data is discrete, instead of dequantizing and smoothing the target, we directly minimize the cross entropy loss in the original discrete space. We compute the log probability of the $i^{\text{th}}$ discrete target by analytically computing the total probability mass assigned to $\left[\frac{i}{256}, \frac{i+1}{256}\right]$ in our continuous distribution, \textit{i.e.}, the density integrated over the interval. This corresponds to mapping our continuous distribution from $[0, 1]$ to the 256 discrete values uniformly such that the $i^{\text{th}}$ discrete value is mapped from $\left[\frac{i}{256}, \frac{i+1}{256}\right]$.
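
Concretely, since the \ours{} density is piecewise constant, the probability assigned to the $i^{\text{th}}$ intensity level can be written as a sum of overlaps between its interval and the bins,
\begin{align*}
    P_\theta(i) = \sum_{j=1}^k \frac{h_j}{w_j} \left| \, [c_j, c_j + w_j) \cap \left[\tfrac{i}{256}, \tfrac{i+1}{256}\right) \right|,
\end{align*}
where $|\cdot|$ denotes interval length. The bits-per-dimension metric is then the average of $-\log_2 P_\theta(i)$ over pixels. This is our restatement of the procedure described above, not a formula taken from the released code.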

The results are grouped according to the number of parameters used to express the intensity distribution of each pixel, allowing us to examine the parameter efficiency of each approach. For the uniform baseline, we evenly divide the $[0, 1]$ intensity interval into $k$ bins and use $k$ parameters to model the probability assigned to each bin (as unnormalized log probabilities). DMoL uses $3k$ parameters for a $k$-component mixture model (\textit{e.g.}, a 256-parameter budget for DMoL corresponds to an $86$-component mixture model). \ours{} uses $2k$ parameters for a $k$-component mixture model, using $k$ bins of variable size. We examine parameter counts ranging from $32$ to $512$.
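
To make the budget matching concrete, the relation between an output parameter budget and the number of bins or components can be sketched as follows. The symbol $B$ for the budget is introduced here for illustration, and the rounding applied to the DMoL component count is an assumption:
\begin{equation*}
k_{\text{uniform}} = B, \qquad k_{\text{AdaCat}} = \frac{B}{2}, \qquad k_{\text{DMoL}} \approx \frac{B}{3}.
\end{equation*}
For example, a budget of $B = 512$ gives $512$ uniform bins, $256$ adaptive bins, and about $171$ DMoL components (since $512 / 3 \approx 170.7$).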

Overall, \ours{} performs better than the baselines at most parameter counts. It underperforms the uniform discretization only at 256 parameters. We note that the MNIST dataset is discrete with 256 classes, which means that the uniform discretization with 256 bins has the optimal bin division. Therefore, we do not expect \ours{} to outperform uniform at this parameter count, because \ours{} effectively has only half as many bins available. DMoL achieves the best performance with 32 parameters, but scales poorly with the number of components and underperforms \ours{} in most settings with more than 32 parameters.

We also experiment with an adaptive quantile baseline, in which the probability mass assigned to each bin, rather than the bin width, is held equal. However, we found the adaptive quantile baseline to be very unstable to train. We only report its result at a parameter count of 32 because models with higher parameter counts often diverge early in training, which results in inconsistent performance across runs. The adaptive quantile baseline outperforms uniform with 32 parameters, yet is still much less expressive than \ours{}.
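
The difference between the two fixed-structure baselines can be sketched as follows, assuming a density that is constant within each bin. The symbols $\pi_j$ and $w_j$ are illustrative notation introduced here. With $k$ bins on $[0, 1]$ and $x$ falling in bin $j$,
\begin{align*}
p_{\text{uniform}}(x) &= \frac{\pi_j}{1/k} = k\,\pi_j, \\
p_{\text{quantile}}(x) &= \frac{1/k}{w_j} = \frac{1}{k\,w_j},
\end{align*}
where $\pi_j$ is the learned mass of bin $j$ under the uniform baseline ($\sum_j \pi_j = 1$) and $w_j$ is the learned width of bin $j$ under the quantile baseline ($\sum_j w_j = 1$). The uniform baseline fixes the widths and learns the masses, whereas the quantile baseline fixes the masses and learns the widths, so its density is unbounded as $w_j \to 0$.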


\subsection{Audio density estimation and vocoding}
\label{sec:eval:speech}


\begin{table*}[t]
\centering
\small
\begin{tabular*}{0.85\textwidth}{lccrrr}
\toprule
\multicolumn{1}{l}{\bf Conditional dist.} & {\bf Transform} & \multicolumn{1}{c}{\bf Parameters} & \multicolumn{1}{c}{\bf NLL (raw) $\downarrow$} & \multicolumn{1}{c}{\bf NLL ($\mu$-law $256$) $\downarrow$} & \multicolumn{1}{c}{\bf MCD $\downarrow$} \\ 
\midrule
Gaussian & -- & 2 & \nllmetric{-8.386890751457226}{0.1121239574410288} & \nllmetric{-4.779327632510675}{0.08437377059570372} & \mcdmetric{3.083703872341475}{0.01579009490156828} \\ \midrule
Uniform Categorical & -- & 30  & \nllmetric{-3.7837350635530314}{0.010705543565271626} & \nllmetric{0.6272370998469731}{0.006552882891747916} & \multicolumn{1}{c}{--} \\  % train pabrtxs2
Uniform Categorical & $\mu$-law & 30 &  \nllmetric{-7.709024366355303}{0.06800552693385746} & \nllmetric{-3.322211777553805}{0.031624705329819584} & \mcdmetric{17.00305793055417}{0.4902891843852458} \\
DMoL, 10 components & -- & 30 & \nllmetric{-8.451060023837371}{0.11388617978833497} & \nllmetric{-4.843496629718581}{0.08622791005737283} & \mcdmetric{2.9996378886449144}{0.011966555044538363}  \\
% Adaptive Cat. ({\ours{}}) & -- & 30 & \nllmetric{-8.209699709202628}{0.09829672819043035} & \nllmetric{-3.822886845228889}{0.06249182033763499} & \\  % train dgx gpu1, smooth 0.0001 for 500k iterations
Adaptive Cat. ({\ours{}}) & -- & 30 & \nllmetric{-8.3042517806495}{0.157773444248523} & \nllmetric{-3.8780257901636523}{0.09628723880795718} & \mcdmetric{4.871829278014099}{0.04359576058266944} \\ \midrule
Uniform Categorical & -- & 256  & \nllmetric{-6.280967525604935}{0.03958890865536694} & \nllmetric{-0.845202480244617}{0.03605127026862487} & \multicolumn{1}{c}{--} \\
Uniform Categorical & $\mu$-law & 256  & \nllmetric{-8.762136326514831}{0.10367519776154016} & \nllmetric{-4.375323221765382}{0.06852243303319218} & \mcdmetric{3.2515686684526766}{0.012450773812721505} \\  
DMoL, 85 components & -- & \;255$^\dagger$ & \nllmetric{-8.459562158165426}{0.11408211792220034} & \nllmetric{-4.851999211201526}{0.08649329465336626} & \mcdmetric{3.006010470652975}{0.012368450954111971} \\
Adaptive Cat. (\ours{}) & -- & 256  & \nllmetric{-8.372055390407677}{0.10170032493677568} & \nllmetric{-3.985242836002713}{0.06596573186130596} & \mcdmetric{3.0249459386503577}{0.033011902425867996} \\  \midrule 
Uniform Categorical & -- & 512  & \nllmetric{-6.921092730733378}{0.051468628163777556} & \nllmetric{-1.4435485591365007}{0.04916729075389461} & \multicolumn{1}{c}{--} \\
Uniform Categorical & $\mu$-law & 512  & \nllmetric{-8.121395148324059}{0.1038044757986994} & \nllmetric{-3.734582249953789}{0.06809553349170203} & \mcdmetric{1.9890040254194699}{0.011102717415837168}  \\ 
DMoL, 171 components & -- & \;513$^\dagger$ & \nllmetric{-8.455742079524459}{0.11420041558368886} & \nllmetric{-4.8481790293709714}{0.0865949545536215} &  \mcdmetric{3.0742626112835905}{0.013312985026852051} \\ 
Adaptive Cat. (\ours{}) & -- & 512 & \nllmetric{-8.325184339430495}{0.11141588590696805} & \nllmetric{-3.9383714410602257}{0.07544256772644131} & \mcdmetric{2.285745259392688}{0.016842716007925206} \\
\bottomrule
\end{tabular*}
\caption{
\textbf{(Audio vocoding)} Continuous negative log-likelihood (NLL, in bits/dim) and waveform vocoding MCD error for WaveNet with different parameterizations of conditional distributions on the LJSpeech dataset. $^\dagger$Discretized Mixture of Logistics (DMoL) requires 3 parameters per mixture component (weight, mean and log scale), so output parameters are approximately matched to baselines.}
\label{table:audio}
\end{table*}

Neural vocoders synthesize human-like speech expressed as a waveform, conditioned on phonetic or frequency-spectrum-based features. Speech waveforms are long sequences, as audio must be sampled at a high rate for fidelity, typically 16--24~kHz. Thus, efficient generative models are essential for practical applications, and the output layer of the network can account for a significant fraction of the compute. WaveNet~\citep{oord2016wavenet} is a popular autoregressive vocoder that synthesizes waveforms conditioned on a Mel-spectrogram representation of the amplitude of audio frequencies across multiple bands. Despite the conditioning information, vocoding is challenging as WaveNet needs to reconstruct the phase of the audio frequencies. This is done by estimating the distribution of audio in a dataset via maximum likelihood.

We train WaveNet with the standard dilated CNN architecture on the open LJSpeech dataset~\citep{ljspeech17} of wav files using an open-source implementation. A ground-truth Mel-spectrogram is extracted and used for conditioning WaveNet. For baselines, we use different parameterizations of the conditional distribution: a uniformly discretized categorical, a categorical discretized by a hand-engineered $\mu$-law strategy that sets bin widths logarithmically with intensity, and a discretized mixture of logistics. All models use 24 convolutional layers and are optimized for 500k iterations with Adam, using a batch size of 8. The learning rate is initially 0.001 and is halved every 200k iterations. An exponential moving average of model parameters is used for testing. Audio is sampled at 22,050~Hz and windowed into blocks of 1024 samples.
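
As a worked illustration of the learning rate schedule, assume the decay is applied as a step function of the iteration count $t$ (the description above does not state whether the decay is stepwise or smooth, and $\eta$ is notation introduced here). Then
\begin{equation*}
\eta(t) = 10^{-3} \cdot 2^{-\lfloor t / 200{,}000 \rfloor},
\end{equation*}
which gives $10^{-3}$ for $t < 200$k, $5 \times 10^{-4}$ for $200$k $\le t < 400$k, and $2.5 \times 10^{-4}$ for the remaining iterations up to $500$k.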

We evaluate the continuous negative log-likelihood (NLL) of test waveforms.
NLL is measured in bits/dimension with the waveform scaled to a $[-1, 1]$ amplitude.
We also measure the NLL of $\mu$-law transformed data with $\mu=256$, which is also scaled to $[-1, 1]$.
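
For reference, the standard $\mu$-law companding function is sketched below. Whether the implementation used in these experiments matches this exact form is an assumption, and $f$ is notation introduced for illustration. For a waveform sample $x \in [-1, 1]$,
\begin{align*}
f(x) &= \operatorname{sign}(x)\,\frac{\ln(1 + \mu |x|)}{\ln(1 + \mu)}, \\
f'(x) &= \frac{\mu}{(1 + \mu |x|)\,\ln(1 + \mu)},
\end{align*}
which maps $[-1, 1]$ onto itself. A bin of width $\Delta$ in the transformed space covers a width of roughly $\Delta / f'(x) = \Delta\,(1 + \mu |x|) \ln(1 + \mu) / \mu$ in the raw space, so uniform bins after the transform are narrow near zero amplitude and wider at large amplitudes.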

Following \citet{chen2020wavegrad}, we use the Mel Cepstral Distance (MCD) to objectively quantify the perceptual similarity between our synthesized audio and reference audio, based on an aligned mean squared error metric~\citep{407206}.
Samples from the uniformly discretized model without the $\mu$-law transform led to numerical instabilities in the open-source library used to compute the MCD metric and have clear auditory artifacts, so the MCD metric is omitted for that model.

Table~\ref{table:audio} shows results grouped by the number of parameters output by WaveNet for each conditional. Note that \ours{} has half the bins of categorical baselines at the same parameter count due to using two parameter vectors, $w$ and $h$. Across all settings, \ours{} achieves significantly better negative log-likelihood than uniform discretization: $4.52$ bpd lower NLL with 30 output parameters, $2.09$ bpd with 256 parameters, and $1.41$ bpd with 512 parameters.

\ours{} is also competitive with hand-engineered quantization in the $\mu$-law intensity space, despite not having prior knowledge about humans' logarithmic perception of sound intensity, and without DMoL's instabilities. Still, well-tuned $\mu$-law and DMoL strategies perform well, so the main advantage of \ours{} in the audio domain is capturing most of their performance without human-provided inductive bias. \ours{} and the $\mu$-law transform are complementary, and could be used in concert by learning an adaptive discretization of the heuristically transformed interval.

\subsection{Model-based offline reinforcement learning}
\label{sec:eval:offlinerl}

\ours{}'s parameterization can also be adopted in the dynamics model of a model-based planner for reinforcement learning tasks. We tested our parameterization with Trajectory Transformer~\citep{janner2021sequence}, a recent work in model-based offline RL that uses a Transformer-based architecture to learn the dynamics model of an environment from an offline dataset, and then plans online with the learned model to produce actions for RL agents. The original Trajectory Transformer architecture discretizes each dimension of states and actions into tokens and uses a one-hot embedding to encode them, similar to how a language model handles its vocabulary. Since our discretization is adaptive and context-dependent, a bin index alone does not identify a fixed interval, so continuous inputs are more informative.

Thus, we minimally modify the architecture by replacing the one-hot embedding layer with a linear layer that takes in a scalar input and outputs its embedding. This modification arguably loses some capacity since it has far fewer parameters than the original architecture. Yet, as we show in our experiments, the gain from the flexibility of our parameterization outweighs the potential capacity reduction. We also reduce the number of bins by a factor of 2 to match the parameter count of the output layer, since \ours{} requires $2\times$ more parameters per bin than uniform and non-adaptive quantile-based discretization. We use uniform smoothing with a smoothing coefficient of $\lambda = 0.001$ for the target smoothing objective. We also keep the planning hyperparameters the same as in the original work for a fair comparison (except for one hyperparameter on action sampling). The supplement (Section C.3) details the other minor differences between our training and planning procedures and the original ones.
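
The change to the input layer can be sketched as follows. The symbols $V$ (number of tokens), $d$ (embedding dimension), $E$, $u$ and $b$ are illustrative, and the presence of a bias term is an assumption. The original one-hot embedding of token $i$ looks up a row of a table, whereas a scalar input $x$ is embedded by a linear map:
\begin{align*}
e_{\text{one-hot}}(i) &= E_{i,:}, & E &\in \mathbb{R}^{V \times d}, \\
e_{\text{linear}}(x) &= x\,u + b, & u, b &\in \mathbb{R}^{d}.
\end{align*}
The lookup table has $Vd$ parameters while the linear layer has only $2d$, which is the source of the capacity reduction referred to above.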

Table~\ref{table:d4rl} shows the performance of the RL agent on three D4RL datasets~\citep{fu2020d4rl} under \ours{}'s parameterization and the two discrete parameterizations used by \citet{janner2021sequence}. \ours{} performs better than or on par with the uniform and quantile parameterizations used in the original paper, improving the normalized score by 2.9\% and 5.2\%, respectively, when the relative improvement is averaged over the three tasks. This demonstrates its effectiveness in accurately modeling continuous data, and the downstream benefits of more expressive discretization.
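
The two averages can be reproduced from the mean scores in Table~\ref{table:d4rl} under the assumption that they are means of per-task relative improvements (an interpretation inferred from the reported numbers; $\Delta$ is notation introduced here, with tasks ordered HalfCheetah, Hopper, Walker2d):
\begin{align*}
\Delta_{\text{uniform}} &= \tfrac{1}{3}\left(\tfrac{47.8 - 44.0}{44.0} + \tfrac{69.2 - 67.4}{67.4} + \tfrac{79.3 - 81.3}{81.3}\right) \\
&\approx \tfrac{1}{3}\left(0.08636 + 0.02671 - 0.02460\right) \approx 2.9\%, \\
\Delta_{\text{quantile}} &= \tfrac{1}{3}\left(\tfrac{47.8 - 46.9}{46.9} + \tfrac{69.2 - 61.1}{61.1} + \tfrac{79.3 - 79.0}{79.0}\right) \\
&\approx \tfrac{1}{3}\left(0.01919 + 0.13257 + 0.00380\right) \approx 5.2\%.
\end{align*}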



\newcommand{\returnmetric}[2]{$#1$ \scriptsize{\raisebox{1pt}{$\pm #2$}}}
\newcommand{\returnmetricb}[2]{{$\mathbf{#1}$ \scriptsize{\raisebox{1pt}{$\pm \mathbf{#2}$}}}}

\begin{table}[t]
\centering
\small
\begin{tabular*}{0.48\textwidth}{@{\extracolsep{\fill}}lrrr}
\toprule \multicolumn{1}{c}{\bf Dataset} & \multicolumn{1}{c}{\bf Uniform} & \multicolumn{1}{c}{\bf Quantile} & \multicolumn{1}{c}{\bf AdaCat}  \\
\midrule
HalfCheetah-Medium & \returnmetric{44.0}{0.31} & \returnmetric{46.9}{0.4} & \returnmetricb{47.8}{0.22} \\ 
Hopper-Medium & \returnmetric{67.4}{2.9} & \returnmetric{61.1}{3.6} & \returnmetricb{69.2}{4.5} \\
Walker2d-Medium &  \returnmetricb{81.3}{2.1}& \returnmetric{79.0}{2.8}  & \returnmetric{79.3}{0.8} \\ 
\bottomrule
\end{tabular*}
\caption{
\textbf{(Offline reinforcement learning)} Normalized scores on three D4RL locomotion (\texttt{v2}) tasks~\citep{fu2020d4rl} using Trajectory Transformer~\citep{janner2021sequence} with three different discretization methods. \ours{} parameterization performs on par with or better than the uniform and quantile methods used in the original paper. Both mean and standard error over 15 random seeds (5 independently trained Transformers and 3 trajectories per Transformer) are reported, following the protocol in the original paper.}
\label{table:d4rl}
\end{table}

\section{Related work}

\paragraph{Adaptive discretization}
The idea of adaptive discretization has found wide application in fields such as reinforcement learning~\citep{chow1991optimal}, finite element analysis~\citep{liszka1980finite} and computer graphics~\citep{jevans1988adaptive}. We bring this powerful idea into density modeling by introducing \ours{}. Unlike most existing works on discretization, which rely on heuristics and prior knowledge of the data domain~\citep{tang2020discretizing, ghasemipour2021emaq}, \ours{} can be jointly optimized with the rest of the network parameters and learns to discretize adaptively. AdaBins~\citep{bhat2021adabins} is closest to our work. Though the idea is similar, AdaBins differs from \ours{} in several important aspects. AdaBins parameterizes the bin widths directly, while \ours{} parameterizes the unnormalized log bin widths. AdaBins uses an additional regularization loss that encourages the bin centers to be close to the data, while \ours{} does not need any additional regularization. AdaBins is primarily used for depth prediction in vision, whereas our work focuses on generative modeling across multiple domains.

\paragraph{Efficient softmax} Language models often have vocabulary sizes of 10k--100k tokens. Computing the softmax normalizer for such a large vocabulary can be expensive, motivating more efficient softmax variants, surveyed by \citet{ruder2016wordembeddingspart2}. Hierarchical softmax~\citep{morin2005hierarchical} is one variant that groups tokens in a tree structure, and does not rely on data being ordinal. However, unlike \ours{}, hierarchical softmax groups a fixed, discrete vocabulary, rather than supporting continuous intervals.

\paragraph{Image density estimation}
Until recently~\citep{kingma2021variational}, autoregressive models were the state of the art on image density estimation benchmarks, and they are still widely used. Order-agnostic models~\citep{uria2014deep} improve the flexibility of autoregressive models in downstream tasks like inpainting and outpainting~\citep{jain2020locally}, but do not change the form of the conditional distribution.

\paragraph{Audio synthesis} Likelihood-based models are popular in text-to-speech (TTS). Efficiency is a key concern, motivating parameter-efficient conditional distributions. \citet{paine2016fast} cache intermediate WaveNet activations for improved speed, and \citet{oord2017parallel} distill WaveNet into a parallel flow. Tacotron~\citep{wang2017tacotron} performed end-to-end speech synthesis from text with a WaveNet vocoder, so our work can be applied to TTS systems. Other approaches include flow-based models, diffusion~\citep{kong2021diffwave, chen2020wavegrad} and GANs~\citep{kumar2019melgan}.


\section{Limitations}
One fundamental limitation of \ours{} is that it can never model more modes than the number of bins. One possible solution is to add flow steps~\citep{DBLP:journals/corr/DinhSB16, rezende2015variational} to transform the space and run the autoregressive model with \ours{} in the transformed space. It is possible that using \ours{} in a transformed or latent space can lead to a more parameter-efficient way of modeling continuous distributions.

Another limitation of \ours{} is that the parameterized density is discontinuous, with more discontinuities as the number of bins increases. We largely resolved this issue using target smoothing. It is possible to use a non-uniform density within each bin, as in spline parameterizations~\citep{durkan2019neural}, but our preliminary experiments suggested that these are challenging to optimize in practice.

\section{Conclusion}
Likelihood-based autoregressive generative models can estimate complex distributions of high-dimensional real-world data, but often struggle with efficiency and can be difficult to expressively parameterize on continuous data. In this paper, we presented \ours{}, a flexible, efficient, multimodal parameterization of 1-D conditionals with applications in autoregressive models. We demonstrated the effectiveness of \ours{} on a diverse set of tasks: image generation, tabular density estimation, audio vocoding and model-based planning. \ours{} improves density estimation and downstream task performance over uniform discretization, and is competitive with other hand-engineered discretizers.

\begin{acknowledgements}
We would like to thank Michael Janner, Xinyang Geng and Hao Liu for helpful discussions in the early stages of this work. We also thank Sergey Levine for insightful feedback on the offline RL experiments. We thank Katie Kang, Dibya Ghosh, and other members of the RAIL lab for feedback on paper drafts. Qiyang Li is supported by a Berkeley Fellowship.
\end{acknowledgements}

\bibliography{li_714}

\end{document}
