Learning to reconstruct from saturated data: audio declipping and high dynamic range imaging

TMLR Paper5678 Authors

19 Aug 2025 (modified: 30 Oct 2025) · Under review for TMLR · CC BY 4.0
Abstract: Learning-based methods are now ubiquitous for solving inverse problems, but their deployment in real-world applications is often hindered by the lack of ground-truth references for training. Recent self-supervised learning strategies offer a promising alternative that avoids the need for ground truth. However, most existing methods are limited to linear inverse problems. This work extends self-supervised learning to the non-linear problem of recovering audio and images from clipped measurements, by assuming that the signal distribution is approximately invariant to changes in amplitude. We provide sufficient conditions for learning to reconstruct from saturated signals alone, together with a self-supervised loss that can be used to train reconstruction networks. Experiments on both audio and image data show that the proposed approach is almost as effective as fully supervised approaches, despite relying solely on clipped measurements for training.
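For intuition, the saturation model underlying both applications can be sketched as a hard-clipping operator. The minimal example below is illustrative only (PyTorch and the threshold name `mu` are our choices, not taken from the paper); it shows the non-linearity that distinguishes this problem from linear inverse problems:

```python
import torch

def clip(x: torch.Tensor, mu: float = 1.0) -> torch.Tensor:
    """Hard clipping (saturation): entries whose magnitude exceeds mu
    are saturated to +/- mu."""
    return torch.clamp(x, min=-mu, max=mu)

x = torch.tensor([0.2, -0.7, 1.5, -2.0])
print(clip(x))      # tensor([ 0.2000, -0.7000,  1.0000, -1.0000])
print(clip(2 * x))  # differs from 2 * clip(x): clipping and rescaling do not commute
```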
Submission Length: Long submission (more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=gwDNM3b353
Changes Since Last Submission: In the first submission, the revised version was uploaded on 08/06/2025, two weeks and five days after the rebuttal period began (20/05/2025), i.e., five days late. Reviewer _MqeL_ read it and updated their recommendation positively: "*I thank the authors for their detailed responses to my points and their significant revisions to the manuscript. The added experiments and the additional explanatory text is very helpful. I have changed my recommendation accordingly.*" Reviewer _9DsW_ likely did not, leading to an early rejection despite many concerns having been addressed. Below we provide a detailed answer to the concerns raised by reviewer _9DsW_ that motivated the rejection.

> The main theoretical result presented in the manuscript is trivial after authors reformulated proposition 3.

Proposition 3 is not the main result; that role belongs to Theorem 1. Although its proof is simple, Proposition 3 shows that scale invariance is **necessary** for learning from clipped data, a fact we confirm experimentally on audio and image signals. For this reason, we believe it is both relevant and necessary to state it explicitly.

> Theorem 1 is also within expectation, [...] the approximate isometry of Gaussian matrix has classic Johnson-Lindenstrauss lemma that serves as the prototype to the presented result, regardless of the inclusion of the thresholding operator.

We respectfully disagree with the assertion that the theoretical results are trivial or merely a reformulation of classical results such as the Johnson–Lindenstrauss (JL) lemma. While the analysis in Theorem 1 is indeed inspired by the theory of approximate isometries for Gaussian matrices, our setting introduces non-trivial elements that go beyond the standard JL framework. In particular, the operator $\eta$ includes a non-linear thresholding/clipping operation that breaks the linear structure typically assumed in classical random embedding results such as the JL lemma or the restricted isometry property of certain random matrices. This requires adapting the standard tools and yields results that do not follow immediately from the classical ones. Therefore, Theorem 1 is not a direct corollary of the JL lemma, and we believe that extending classical results to accommodate this non-linearity represents a meaningful contribution. To the best of our knowledge, the closest previous result comes from Foucart et al. (2016, Proposition 6), which is limited to the set of $k$-sparse signals.

> As another reviewer pointed out, there is no essential connection between the presented theorem and the practical reconstruction algorithm (the application of the invariance penalty).

We would like to stress that the invariance assumption is key: it ensures that $\eta$ is injective on $\mathcal{X}$ with high probability, and it directly motivates the loss function used to train our network.

> The measure consistency loss does not make much sense [...] there are infinite many solutions to this problem.

The reviewer is correct, and this is precisely the point we make in the manuscript (see the explanation after eq. (8) and the experimental results in Table 1). The consistency loss is designed to enforce alignment between the reconstruction and the observed measurements, even though, in general, infinitely many reconstructions may be consistent with the data. The proposed addition of an equivariance loss is essential to resolve this issue: it further constrains the solution space by leveraging the underlying structure of the signal set, thereby reducing the ambiguity in the reconstruction.
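To make the interplay between the two terms concrete, here is a minimal sketch of how a measurement-consistency term and a scale-equivariance term could be combined when training a reconstruction network on clipped data alone. The function names, the squared-error form, the scale-sampling range, and the weight `lam` are illustrative assumptions, not the exact formulation of eq. (8) or of the loss used in the manuscript:

```python
import torch

def clip(x, mu=1.0):
    # Hard clipping (saturation) operator at threshold mu.
    return torch.clamp(x, min=-mu, max=mu)

def measurement_consistency(f, y, mu=1.0):
    # The re-clipped reconstruction should match the observed clipped
    # measurements y; alone, infinitely many reconstructions satisfy this.
    x_hat = f(y)
    return torch.mean((clip(x_hat, mu) - y) ** 2)

def scale_equivariance(f, y, mu=1.0):
    # Rescale the current reconstruction, clip it to simulate a new
    # measurement, and ask the network to recover the rescaled signal.
    x_hat = f(y)
    s = torch.empty(1).uniform_(0.5, 2.0)  # random amplitude change
    y_scaled = clip(s * x_hat, mu)
    return torch.mean((f(y_scaled) - s * x_hat) ** 2)

def self_supervised_loss(f, y, mu=1.0, lam=1.0):
    # Weighted combination of the two self-supervised terms.
    return measurement_consistency(f, y, mu) + lam * scale_equivariance(f, y, mu)
```

Here `f` stands for the reconstruction network; how the two terms are weighted, and whether gradients should flow through both occurrences of `f` in the equivariance term, are design choices that the manuscript's experiments are better placed to settle.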
> Authors still fail to formulate the problem they try to solve as a clear optimisation problem.

We want to clarify that the problem of identifying the *signal set* (rather than a single signal) cannot be naturally formulated as an optimization problem. The objective of our work is not to numerically approximate a solution, but rather to provide a precise mathematical characterization of that set. Specifically, the goal is to characterize the set $\mathcal{X}$ uniquely, given the observed, clipped set $\mathcal{Y}=\eta(\mathcal{X})$, where $\eta$ is the clipping operator. Our approach is to show that the conic extension of $\mathcal{Y} \cap \mathbb{B}_\mu$ -- which is unique -- serves as the desired set $\mathcal{X}$ (a schematic restatement is given below, after the response to the Action Editor).

Comments by the Action Editor:

> This reviewer and MqeL suggest the paper can be resubmitted if it is significantly modified following their recommendations [...].

We believe that the revised manuscript takes the reviewers' recommendations into account, making sure that the link between the theoretical results and the practical losses is clear. All modifications are in red. We show that identifying the model from clipped measurement data alone is possible if the model is scale invariant, and show through a series of experiments that the addition of a scale-equivariance loss enables learning from clipped data alone. Finally, we have included all experimental details that were missing in the first submission.
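For reference, a schematic restatement of the characterization invoked above, using the notation of the rebuttal; the precise definitions of $\mathbb{B}_\mu$ and of the conic extension, and the exact assumptions on $\mathcal{X}$, are those stated in the manuscript:

$$
\mathcal{Y} \;=\; \eta(\mathcal{X}),
\qquad
\mathcal{X} \;=\; \operatorname{cone}\bigl(\mathcal{Y} \cap \mathbb{B}_\mu\bigr)
\;=\; \bigl\{\, \alpha\, y \;:\; \alpha \ge 0,\ y \in \mathcal{Y} \cap \mathbb{B}_\mu \,\bigr\},
$$

where the last expression reads the conic extension as the set of all non-negative rescalings; under the scale-invariance assumption on $\mathcal{X}$, this set is unique and recovers the signal set from the clipped observations alone.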
Assigned Action Editor: ~Fernando_Perez-Cruz1
Submission Number: 5678