Keywords: Synthetic Data, V-Information, Information Restoration, Small-Scale Experiments, Model Capacity, CNNs, LLMs, Inductive Biases, Alignment
TL;DR: A Data Processing Inequality (DPI) paradox: synthetic data boosts learners not by adding new information, but by making existing information usable (in the V-Information sense) via a synthesizer. Our CIFAR-10 experiment (a Conv-AE restoring permuted images for a CNN) confirms this.
Abstract: This paper investigates synthetic data generation as a mechanism for restoring or reformatting task-relevant
information that is obscured or unusable for a specific, computationally bounded learner. We conduct a small-scale,
controlled experiment on CIFAR-10, involving pixel permutation to corrupt data, a Convolutional Autoencoder
(Conv-AE) synthesizer for information restoration, and a downstream CNN learner. Framing the setup through V-Information,
which quantifies the information accessible to such a bounded learner, our empirical results demonstrate that while
permutation drastically reduces usable V-Information, the synthesizer partially restores it, leading to significant
performance recovery. We further explore how model capacities interact with this process, finding that learner capacity is beneficial
only when usable information is present. This highlights computation’s role in making latent information
accessible, a principle highly relevant to current synthetic data practices in capabilities and alignment of foundation models.
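The corruption step described in the abstract can be sketched as follows. This is a minimal illustrative sketch, not the authors' released code: the function names, shapes, and seed are assumptions. It shows why a fixed pixel permutation is the right probe for V-Information: as a bijection it preserves all Shannon information, yet it destroys the spatial locality a CNN's inductive bias depends on, so the information becomes unusable for that bounded learner.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, C = 32, 32, 3  # CIFAR-10 image dimensions

# One fixed permutation of the H*W pixel positions, shared across all images.
perm = rng.permutation(H * W)

def permute_pixels(img: np.ndarray) -> np.ndarray:
    """Scramble spatial structure while preserving pixel values.

    A bijection loses nothing in the Shannon sense, but it destroys the
    locality that convolutions exploit, collapsing the V-Information
    available to a CNN learner.
    """
    flat = img.reshape(H * W, C)
    return flat[perm].reshape(H, W, C)

def unpermute_pixels(img: np.ndarray) -> np.ndarray:
    """Exact inverse map; an ideal synthesizer (e.g. the Conv-AE)
    would approximate this restoration from data."""
    inv = np.argsort(perm)
    flat = img.reshape(H * W, C)
    return flat[inv].reshape(H, W, C)

# Sanity check: corruption is invertible, so no Shannon information is lost.
x = rng.random((H, W, C)).astype(np.float32)
assert np.allclose(unpermute_pixels(permute_pixels(x)), x)
```

Under this framing, V-Information I_V(X -> Y) = H_V(Y) - H_V(Y | X) would be estimated empirically, e.g. via the CNN's predictive log-loss on clean, permuted, and synthesizer-restored inputs.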
Code: ipynb
Submission Number: 87