Abstract: Rate distortion theory is concerned with optimally encoding signals from a given signal class $$\mathcal{S}$$ using a budget of R bits, as $$R \rightarrow \infty$$. We say that $$\mathcal{S}$$ can be compressed at rate s if we can achieve an error of at most $$\mathcal{O}(R^{-s})$$ for encoding the given signal class; the supremal compression rate is denoted by $$s^*(\mathcal{S})$$. Given a fixed coding scheme, there are usually some elements of $$\mathcal{S}$$ that are compressed at a higher rate than $$s^*(\mathcal{S})$$ by the given coding scheme; in this paper, we study the size of this set of signals. We show that for certain "nice" signal classes $$\mathcal{S}$$, a phase transition occurs: We construct a probability measure $$\mathbb{P}$$ on $$\mathcal{S}$$ such that for every coding scheme $$\mathcal{C}$$ and any $$s > s^*(\mathcal{S})$$, the set of signals encoded with error $$\mathcal{O}(R^{-s})$$ by $$\mathcal{C}$$ forms a $$\mathbb{P}$$-null set. In particular, our results apply to all unit balls in Besov and Sobolev spaces that embed compactly into $$L^2(\varOmega)$$ for a bounded Lipschitz domain $$\varOmega$$. As an application, we show that several existing sharpness results concerning function approximation using deep neural networks are in fact generically sharp. In addition, we provide quantitative and non-asymptotic bounds on the probability that a random $$f \in \mathcal{S}$$ can be encoded to within accuracy $$\varepsilon$$ using R bits. This result is subsequently applied to the problem of approximately representing $$f \in \mathcal{S}$$ to within accuracy $$\varepsilon$$ by a (quantized) neural network with at most W nonzero weights. We show that for any $$s > s^*(\mathcal{S})$$ there are constants c, C such that, no matter what kind of "learning" procedure is used to produce such a network, the probability of success is bounded from above by $$\min \big\{1, 2^{C \cdot W \lceil \log_2(1+W) \rceil^2 - c \cdot \varepsilon^{-1/s}} \big\}$$.
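To give a feel for how the final bound behaves, the following Python sketch evaluates $$\min \big\{1, 2^{C \cdot W \lceil \log_2(1+W) \rceil^2 - c \cdot \varepsilon^{-1/s}} \big\}$$ for illustrative inputs. The constants c, C, the weight count W, and the exponent s used here are hypothetical placeholders (the paper's constants depend on the signal class and on s and are not specified in the abstract); the point is only to show that the bound is vacuous until $$\varepsilon^{-1/s}$$ outgrows $$W \lceil \log_2(1+W) \rceil^2$$, after which it collapses rapidly.

```python
import math

def success_probability_upper_bound(W, eps, s, c=1.0, C=1.0):
    """Evaluate min{1, 2^(C*W*ceil(log2(1+W))^2 - c*eps^(-1/s))}.

    W   : number of nonzero (quantized) network weights
    eps : target approximation accuracy
    s   : a rate exceeding the optimal compression rate s*(S)
    c, C: placeholder constants chosen purely for illustration
    """
    exponent = C * W * math.ceil(math.log2(1 + W)) ** 2 - c * eps ** (-1.0 / s)
    if exponent >= 0:
        return 1.0  # the bound is vacuous (>= 1) in this regime
    return 2.0 ** exponent  # underflows to 0.0 once the exponent is very negative

# Hypothetical example: W = 1000 weights, s = 0.5; the bound drops from 1
# to essentially 0 as eps shrinks and eps^(-1/s) dominates W * log^2(1+W).
for eps in (1e-1, 1e-2, 1e-3):
    print(eps, success_probability_upper_bound(W=1000, eps=eps, s=0.5))
```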