Noise-Aware Adaptation of Vision Language Models for Single-photon Image Understanding

TMLR Paper5249 Authors

30 Jun 2025 (modified: 11 Jul 2025) · Under review for TMLR · CC BY 4.0
Abstract: Single-photon LiDAR enables high-resolution depth imaging under extremely photon-limited conditions, making it attractive for low-light and long-range 3D perception. Beyond depth reconstruction, semantic understanding of single-photon images remains challenging due to limited data and sensitivity to noise-induced appearance variation. In this work, we present a noise-aware adaptation framework that transfers large-scale vision-language models, such as CLIP, from natural RGB images to the novel modality of single-photon depth images for few-shot classification. We introduce a lightweight Noise Adapter that modulates CLIP visual features using summary statistics derived from raw single-photon histograms. This design helps decouple imaging noise from semantics, enabling more robust prediction under varying noise levels. Furthermore, we leverage the learned modulation pattern to guide feature-level augmentation, simulating the feature changes caused by noise and improving generalization in the low-data regime. To the best of our knowledge, this is the first work to explicitly integrate noise awareness into pre-trained model adaptation for single-photon images. Experiments on both synthetic and real single-photon datasets show that our method improves accuracy over existing approaches, with an average gain of 3% over the best baseline. These results highlight the importance of modeling physical noise in photon-limited imaging and demonstrate the potential of leveraging vision models pre-trained on conventional modalities to improve performance on single-photon depth data with limited supervision.
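To make the abstract's core mechanism concrete, here is a minimal, hypothetical PyTorch sketch of a FiLM-style Noise Adapter consistent with the description above: a small MLP maps summary statistics of the raw single-photon histogram to per-channel scale and shift parameters that modulate frozen CLIP image embeddings, and the same pathway is reused for feature-level augmentation by perturbing the statistics. All names, dimensions, and the particular choice of statistics (mean photon count, variance, signal-to-background ratio, saturation rate) are illustrative assumptions, not the authors' actual implementation.

```python
# Hypothetical sketch of the Noise Adapter described in the abstract.
# Assumed shapes: feat_dim matches the CLIP embedding width; stat_dim is
# the number of histogram summary statistics (all chosen for illustration).
import torch
import torch.nn as nn


class NoiseAdapter(nn.Module):
    """Modulates frozen CLIP visual features with histogram statistics."""

    def __init__(self, feat_dim: int = 512, stat_dim: int = 4, hidden: int = 64):
        super().__init__()
        # Small MLP: summary statistics -> per-channel (gamma, beta).
        self.mlp = nn.Sequential(
            nn.Linear(stat_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2 * feat_dim),
        )

    def forward(self, feats: torch.Tensor, stats: torch.Tensor) -> torch.Tensor:
        # feats: (B, feat_dim) CLIP image embeddings; stats: (B, stat_dim).
        gamma, beta = self.mlp(stats).chunk(2, dim=-1)
        # Residual (FiLM-style) modulation stays near identity at init,
        # so the pretrained features are not disturbed early in training.
        return feats * (1.0 + gamma) + beta


adapter = NoiseAdapter()
feats = torch.randn(8, 512)   # placeholder CLIP embeddings
stats = torch.rand(8, 4)      # placeholder histogram statistics

# Feature-level augmentation as suggested in the abstract: perturb the
# statistics to simulate different noise levels and reuse the learned
# modulation to generate noise-varied features in the few-shot regime.
aug_feats = adapter(feats, stats + 0.1 * torch.randn_like(stats))
```

In this sketch the residual form keeps the adapter close to the identity at initialization, a common choice when adapting frozen pretrained features; the paper's actual modulation and statistics may differ.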
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Peilin_Zhao2
Submission Number: 5249