Abstract: We study the problem of predicting numeric labels that are constrained to the integers or to a subrange of the integers. For example, the number of up-votes on social media posts, or the number of bicycles available at a public rental station. While it is possible to model these as continuous values and to apply traditional regression, this approach changes the underlying distribution on the labels from discrete to continuous. Discrete distributions have certain benefits, which leads us to ask whether such integer labels can be modeled directly by a discrete distribution whose parameters are predicted from the features of a given instance. Moreover, we focus on the use case of output distributions of neural networks, which adds the requirement that the _parameters_ of the distribution be continuous so that backpropagation and gradient descent may be used to learn the weights of the network. We investigate several options for such distributions, some existing and some novel, and test them on a range of tasks, including tabular learning, sequential prediction, and image generation. We find that overall the best performance comes from two distributions: _Bitwise_, which represents the target integer in bits and places a Bernoulli distribution on each, and a discrete analogue of the Laplace distribution, which uses a distribution with exponentially decaying tails around a continuous mean.
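To make the Bitwise idea concrete, here is a minimal sketch (not the paper's implementation) of its negative log-likelihood: the target integer is decomposed into bits, and each bit is scored under an independent Bernoulli whose success probability the network would predict. The function name `bitwise_nll` and the LSB-first bit ordering are illustrative assumptions.

```python
import math

def bitwise_nll(target, bit_probs):
    """Negative log-likelihood of integer `target` under independent
    Bernoulli distributions on each of its bits.

    bit_probs[i] is the predicted probability that bit i (LSB first)
    of the target is 1. In a network, these would come from sigmoid
    outputs, so they are continuous and differentiable parameters.
    """
    nll = 0.0
    for i, p in enumerate(bit_probs):
        bit = (target >> i) & 1          # extract bit i of the target
        nll -= math.log(p if bit else 1.0 - p)
    return nll
```

With uniform bit probabilities of 0.5 the loss reduces to `n * log(2)` for `n` bits, which is the expected entropy of a uniform distribution over `2**n` values.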
Submission Type: Long submission (more than 12 pages of main content)
Changes Since Last Submission: We have uploaded a substantially revised version of the manuscript. We will respond to each review in detail below. Here is an overview of the most substantial changes.
* We have rewritten the introduction and removed the motivating example of MIDI music. This example remains one of the tasks we evaluate on, but we now motivate the research more generally from different angles.
* We have checked every mathematical derivation and proof in the paper. This has led to various minor corrections, and to three more substantial changes:
* The derivatives of the discretized Laplace were incorrect. Since the correct derivatives are complicated and can easily be computed by automatic differentiation, we have simply removed them.
* The section that described the behavior of the integer regularization term $\log(\gamma^c + \gamma^f)$ was incorrect, since the regularization actually works in the opposite direction. We now explain the true effect of this term in detail. Since this does not add much to the understanding of Dalap (but does explain a counter-intuitive aspect of the functional form of the loss), we moved this section to the appendix.
* We rewrote the proof of Proposition 4 to be more precise in its application of limits.
* We repeated all experiments 10 times with different seeds and report standard errors, except for the image generation task, where this is not feasible. These repetitions are finished for the tabular and MIDI data and are included in the revision.
* We included the Poisson distribution as a baseline in the tabular and MIDI tasks.
* We added a section to the appendix which provides descriptive statistics and target value histograms for all datasets.
* We rewrote the results and discussion sections to remove any claims that are not borne out by the evaluation.
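Several of the corrections above concern the discretized Laplace (Dalap). As a rough sketch, under the assumption that the discretization assigns each integer the Laplace probability mass on the unit interval around it (a standard construction, not necessarily the paper's exact definition), the probability mass function can be computed from CDF differences; the names `laplace_cdf` and `dlaplace_pmf` are illustrative.

```python
import math

def laplace_cdf(x, mu, b):
    """CDF of a continuous Laplace distribution with location mu and scale b."""
    if x < mu:
        return 0.5 * math.exp((x - mu) / b)
    return 1.0 - 0.5 * math.exp(-(x - mu) / b)

def dlaplace_pmf(k, mu, b):
    """Probability of integer k under the Laplace discretized onto unit bins:
    the mass on [k - 0.5, k + 0.5]. Note that mu may be any real number,
    so a network can predict it as a continuous parameter."""
    return laplace_cdf(k + 0.5, mu, b) - laplace_cdf(k - 0.5, mu, b)
```

Since the PMF is a difference of smooth functions of `mu` and `b`, its gradients are available by automatic differentiation, which is consistent with removing the hand-derived derivatives from the paper.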
Assigned Action Editor: ~David_Rügamer1
Submission Number: 7463