Restoring Task-Relevant Information in Synthetic Data: A Small-Scale V-Information View

Published: 10 Jun 2025, Last Modified: 15 Jul 2025 · MOSS@ICML2025 · CC BY 4.0
Keywords: Synthetic Data, V-Information, Information Restoration, Small-Scale Experiments, Model Capacity, CNNs, LLMs, Inductive Biases, Alignment
TL;DR: The DPI paradox resolved: synthetic data boosts learners not by adding new information but by making existing information usable (in the V-information sense) via a synthesizer. Our CIFAR-10 experiment (a Conv-AE restoring information for a CNN) confirms this.
Abstract: This paper investigates synthetic data generation as a mechanism for restoring or reformatting task-relevant information that is obscured from, or unusable by, a specific computationally bounded learner. We conduct a small-scale, controlled experiment on CIFAR-10, using pixel permutation to corrupt the data, a Convolutional Autoencoder (Conv-AE) synthesizer to restore information, and a downstream CNN learner. Framing the problem through V-information, which quantifies the information accessible to such a learner, we show empirically that while permutation drastically reduces usable V-information, the synthesizer partially restores it, yielding significant performance recovery. We further explore how model capacities interact with this process, finding that learner capacity helps only when usable information is present. This highlights computation's role in making latent information accessible, a principle highly relevant to current synthetic data practices in the capabilities and alignment of foundation models.
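The corruption and measurement described in the abstract can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's code: `permute_pixels` applies one fixed pixel permutation to every image (the corruption step), and `v_information` shows the standard empirical form of V-information as the gap between the learner family's marginal and conditional predictive cross-entropies; the function names and the NumPy-only setup are assumptions for this sketch.

```python
import numpy as np

def permute_pixels(images, seed=0):
    """Corrupt images with one fixed pixel permutation (hypothetical setup).

    The same permutation is shared across all images, so the information
    content is preserved but its spatial layout is destroyed for a CNN.
    """
    rng = np.random.default_rng(seed)
    n, h, w, c = images.shape
    perm = rng.permutation(h * w)          # one fixed permutation of pixel positions
    flat = images.reshape(n, h * w, c)
    return flat[:, perm, :].reshape(n, h, w, c)

def v_information(ce_marginal, ce_conditional):
    """Empirical V-information: H_V(Y) - H_V(Y | X).

    ce_marginal: cross-entropy of the best predictor in family V that
                 ignores X; ce_conditional: cross-entropy of the best
                 predictor in V that uses X. The difference is the amount
                 of label information that V can actually extract from X.
    """
    return ce_marginal - ce_conditional
```

Because the permutation is bijective, a sufficiently powerful learner family loses nothing (mutual information is invariant), but a CNN's locality bias makes the permuted layout unusable, which is exactly the gap V-information captures.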
Code: ipynb
Submission Number: 87