Abstract: Adapting pre-trained foundation models to novel sensor modalities is a fundamental challenge. These models are pre-trained on large RGB datasets that typically lack exposure to the imaging characteristics of other modalities. Physical acquisition effects, such as photon statistics and sensor-specific noise, produce appearance shifts that are underrepresented in pre-training and can degrade transfer performance. We propose a noise-aware adaptation framework that conditions model adaptation on sensor-specific acquisition statistics. Central to our approach is a lightweight Noise Adapter that modulates pre-trained visual features using summary statistics of the sensor’s outputs, decoupling acquisition-induced appearance variation from semantics and improving robustness in low-label regimes. We instantiate this idea as a case study on single-photon LiDAR depth images by designing a Noise Adapter that leverages summary statistics computed from raw single-photon histograms for few-shot classification. We also present an exploratory analysis showing how learned modulation patterns correspond to noise-induced feature shifts, providing insight into the adapter’s role in feature robustness. Experiments on both synthetic and real single-photon datasets show that our method improves accuracy, with an average gain of 3\% over the strongest baseline. These results suggest that explicitly conditioning adaptation on physical acquisition factors is a practical and promising strategy that may generalize to other non-standard modalities. The code is available at~\url{https://github.com/ZiTingW/noise_adapter}.
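The core mechanism described above — a small network that maps sensor noise statistics to a feature-wise modulation of frozen backbone features — can be illustrated with a minimal numpy sketch. This is a generic FiLM-style conditioning example, not the paper's actual architecture; all dimensions, weight shapes, and the choice of a sigmoid gate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def noise_adapter(features, stats, W1, b1, W2, b2):
    """Sketch of noise-conditioned modulation: map per-image noise summary
    statistics to a per-channel (scale, shift) applied to frozen features.
    Hypothetical parameterization, not the paper's exact design."""
    h = np.tanh(stats @ W1 + b1)          # hidden layer over noise descriptors
    gamma_beta = h @ W2 + b2              # predict per-channel scale and shift
    d = features.shape[-1]
    gamma, beta = gamma_beta[..., :d], gamma_beta[..., d:]
    gate = 1.0 / (1.0 + np.exp(-gamma))   # sigmoid keeps the modulation bounded
    return features * gate + beta

# Toy dimensions (assumed): a 4-dim noise summary per image (e.g. photon
# count, background level, ...), 16-dim backbone features, hidden width 8.
s_dim, f_dim, hid = 4, 16, 8
W1 = rng.normal(scale=0.1, size=(s_dim, hid)); b1 = np.zeros(hid)
W2 = rng.normal(scale=0.1, size=(hid, 2 * f_dim)); b2 = np.zeros(2 * f_dim)

feats = rng.normal(size=(2, f_dim))   # frozen features for a batch of 2 images
stats = rng.normal(size=(2, s_dim))   # per-image acquisition statistics
out = noise_adapter(feats, stats, W1, b1, W2, b2)
print(out.shape)  # (2, 16)
```

In this sketch only the adapter's small weight matrices would be trained, which is what makes such conditioning practical in the low-label regimes the abstract targets.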
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: The paper has been de-anonymized, and a link to the code repository has been added for the camera-ready version.
------
We revised the manuscript to address reviewer feedback: we reframed the core contribution to emphasize adapting pre-trained foundation models to new sensor modalities by conditioning on acquisition variability, clarified terminology, and expanded the methodological descriptions (rationale for the gating/modulation and choice of noise descriptors). Experimentally, we added results on stronger backbones (DINOv2), comparisons to image-level augmentation baselines, and additional SPAD visualizations; figures, tables, and appendices have been updated accordingly.
Video: https://www.youtube.com/watch?v=GyYcgXmoHpQ&t=90s
Code: https://github.com/ZiTingW/noise_adapter
Supplementary Material: zip
Assigned Action Editor: ~Peilin_Zhao2
Submission Number: 5249