Abstract: Single-photon LiDAR enables high-resolution depth imaging under extreme photon-limited conditions, making it attractive for low-light and long-range 3D perception. Beyond depth reconstruction, however, semantic understanding from single-photon images remains challenging due to limited data and sensitivity to noise-induced appearance variation. In this work, we present a noise-aware adaptation framework that transfers large-scale vision-language models, such as CLIP, from natural RGB images to the novel modality of single-photon depth images for few-shot classification. We introduce a lightweight Noise Adapter that modulates CLIP visual features using summary statistics derived from raw single-photon histograms. This design helps decouple imaging noise from semantics, enabling more robust prediction under varying noise levels. Furthermore, we leverage the learned modulation pattern to guide feature-level augmentation, simulating noise-induced feature changes and improving generalization in the low-data regime. To the best of our knowledge, this is the first work to explicitly integrate noise awareness into pre-trained model adaptation for single-photon images. Experiments on both synthetic and real single-photon datasets show that our method improves classification accuracy, outperforming the best baseline by 3\% on average. These results highlight the importance of modeling physical noise in photon-limited imaging and demonstrate the potential of leveraging vision models pre-trained on conventional modalities to improve performance on single-photon depth data with limited supervision.
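To illustrate the kind of conditioning the abstract describes, the following is a minimal sketch of a FiLM-style adapter that modulates frozen CLIP image features with per-image noise statistics. It is an assumption-based illustration, not the paper's implementation: the class name NoiseAdapter, the choice of statistics, dimensions, and the residual scale-and-shift form are all hypothetical.

```python
# Illustrative sketch only; names, dimensions, and statistics are assumed, not from the paper.
import torch
import torch.nn as nn

class NoiseAdapter(nn.Module):
    """Modulates frozen CLIP visual features with per-image noise statistics."""
    def __init__(self, feat_dim: int = 512, stats_dim: int = 4, hidden_dim: int = 64):
        super().__init__()
        # Small MLP mapping histogram summary statistics (e.g. total photon count,
        # background level, an SNR estimate) to a per-channel scale and shift.
        self.mlp = nn.Sequential(
            nn.Linear(stats_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 2 * feat_dim),
        )

    def forward(self, clip_feat: torch.Tensor, noise_stats: torch.Tensor) -> torch.Tensor:
        gamma, beta = self.mlp(noise_stats).chunk(2, dim=-1)
        # Residual modulation keeps the adapted feature close to the CLIP prior.
        return clip_feat * (1 + gamma) + beta

# Usage: adapted features would then be matched against CLIP text embeddings
# for few-shot classification.
adapter = NoiseAdapter()
clip_feat = torch.randn(8, 512)    # frozen CLIP image embeddings (batch of 8)
noise_stats = torch.randn(8, 4)    # summary statistics from raw SPAD histograms
adapted = adapter(clip_feat, noise_stats)
```

Under this reading, the same learned scale-and-shift could also be perturbed to generate feature-level augmentations that mimic noise-induced variation, as the abstract suggests.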
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: We revised the manuscript to address reviewer feedback: we reframed the core contribution to emphasize adapting pre-trained foundation models to new sensor modalities by conditioning on acquisition variability, clarified terminology, and expanded the methodological descriptions (rationale for the gating/modulation and choice of noise descriptors). Experimentally, we added results on stronger backbones (DINOv2), comparisons to image-level augmentation baselines, and additional SPAD visualizations; figures, tables, and appendices have been updated accordingly.
Assigned Action Editor: ~Peilin_Zhao2
Submission Number: 5249