Abstract: The cross-entropy and related terms from information theory (e.g. entropy, Kullback–Leibler divergence) are used throughout artificial intelligence and machine learning. This includes many of the major successes, both current and historic, where they commonly appear as the natural objective of an optimisation procedure for learning model parameters, or distributions over them. This paper presents a novel derivation of the differential cross-entropy between two 1D probability density functions represented as piecewise linear functions. Implementation challenges are resolved and experimental validation is presented, including a rigorous analysis of accuracy and a demonstration of using the presented result as the objective of a neural network. Previously, the cross-entropy would have to be approximated via numerical integration, or an equivalent scheme, for which calculating gradients is impractical. Machine learning models with high parameter counts are optimised primarily with gradients, so if piecewise linear density representations are to be used then the presented analytic solution is essential. This paper contributes the theory necessary for the practical optimisation of information-theoretic objectives when dealing directly with piecewise linear distributions. Removing this limitation expands the design space for future algorithms.
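To illustrate the kind of quantity the abstract describes, the sketch below computes the differential cross-entropy H(p, q) = -∫ p(x) ln q(x) dx between two 1D densities that are piecewise linear over shared bin edges, using a generic closed form for the per-segment integral ∫₀¹ (a + bt) ln(c + dt) dt, and checks it against brute-force numerical integration. This is an illustrative sketch only: the function and variable names (`cross_entropy_pwl`, `edges`, `p`, `q`) are hypothetical and the segment formula is a standard textbook antiderivative, not necessarily the paper's exact derivation or the orogram API.

```python
# Hedged sketch: analytic cross-entropy between two piecewise linear densities
# sharing the same bin edges, validated against dense numerical integration.
import numpy as np

def _segment_integral(a, b, c, d):
    """Closed form of I = integral_0^1 (a + b t) ln(c + d t) dt, for c > 0, c + d >= 0.

    Via u = c + d t:  I = (F(c + d) - F(c)) / d  with
    F(u) = (a - b c / d) (u ln u - u) + (b / d) (u^2 ln u / 2 - u^2 / 4).
    When d == 0 the integral reduces to (a + b / 2) ln c.
    """
    if abs(d) < 1e-12:
        return (a + 0.5 * b) * np.log(c)
    def F(u):
        # u ln u is taken as 0 in the limit u -> 0+.
        ulnu = u * np.log(max(u, 1e-300)) if u > 0 else 0.0
        return (a - b * c / d) * (ulnu - u) + (b / d) * (0.5 * u * ulnu - 0.25 * u * u)
    return (F(c + d) - F(c)) / d

def cross_entropy_pwl(edges, p, q):
    """H(p, q) for densities given by values p, q at shared bin edges and
    interpolated linearly in between (both assumed to integrate to 1)."""
    h = 0.0
    for i in range(len(edges) - 1):
        delta = edges[i + 1] - edges[i]
        a, b = p[i], p[i + 1] - p[i]   # p on this segment: a + b t, t in [0, 1]
        c, d = q[i], q[i + 1] - q[i]   # q on this segment: c + d t
        h -= delta * _segment_integral(a, b, c, d)
    return h

# Validation against numerical integration (the approach the paper replaces).
edges = np.array([0.0, 1.0, 2.0, 4.0])
p = np.array([0.1, 0.6, 0.3, 0.05]); p /= np.trapz(p, edges)
q = np.array([0.3, 0.4, 0.2, 0.10]); q /= np.trapz(q, edges)
x = np.linspace(edges[0], edges[-1], 200001)
px, qx = np.interp(x, edges, p), np.interp(x, edges, q)
print(cross_entropy_pwl(edges, p, q))   # analytic, differentiable in closed form
print(-np.trapz(px * np.log(qx), x))    # numerical reference, should agree closely
```

The point of the analytic form, as the abstract notes, is that each segment contribution is a simple differentiable expression in the density values, so gradients with respect to the piecewise linear representation can be computed exactly rather than through a quadrature scheme.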
Submission Length: Regular submission (no more than 12 pages of main content)
Code: https://github.com/thaines/orogram
Supplementary Material: zip
Assigned Action Editor: ~Brian_Kulis1
Submission Number: 2041