Abstract: Image segmentation under low-light conditions is essential in real-world applications such as autonomous driving and video surveillance. The recent Segment Anything Model (SAM) exhibits strong segmentation capability across a variety of vision applications, yet its performance can degrade severely under low-light conditions. Meanwhile, multimodal information has been exploited to help models build a more comprehensive understanding of low-light scenes by providing complementary cues (e.g., depth). In this work, we therefore present a pioneering attempt to elevate a unimodal vision foundation model (i.e., SAM) into a multimodal one by efficiently integrating additional depth information under low-light conditions. To this end, we propose a novel method, Depth Perception SAM (DPSAM), built on the SAM framework. Specifically, we design a modality encoder to extract depth information and Depth Perception Layers (DPLs) for mutual feature refinement between RGB and depth features. The DPLs employ a cross-modal attention mechanism in which RGB and depth features mutually query effective information from each other for subsequent refinement. DPLs can thus effectively leverage the complementary information in depth to enrich the RGB representations and obtain comprehensive multimodal visual representations for segmenting anything in the dark. As a result, DPSAM largely preserves SAM's innate expertise in RGB image segmentation while further leveraging the strength of depth for enhanced segment-anything capability, especially in cases that are likely to fail with RGB alone (e.g., low light or complex textures). As demonstrated by extensive experiments on four RGBD benchmark datasets, DPSAM clearly improves segment-anything performance in the dark, e.g., by +12.90% mIoU and +16.23% mIoU on LLRGBD and DeLiVER, respectively.
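To make the mutual refinement idea concrete, the following is a minimal PyTorch sketch of a cross-modal attention block in the spirit of the DPLs described above: RGB tokens attend to depth tokens and vice versa, and each result is added back residually. The class name, feature dimensions, and the use of torch.nn.MultiheadAttention are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: names, dimensions, and layer choices are assumptions,
# not the DPSAM implementation described in the paper.
import torch
import torch.nn as nn


class CrossModalRefinement(nn.Module):
    """Mutually refines RGB and depth token sequences via cross-attention."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        # RGB tokens query the depth tokens, and depth tokens query the RGB tokens.
        self.rgb_from_depth = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.depth_from_rgb = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_rgb = nn.LayerNorm(dim)
        self.norm_depth = nn.LayerNorm(dim)

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor):
        # rgb, depth: (batch, num_tokens, dim) feature sequences, e.g., from the
        # SAM image encoder and a depth modality encoder, respectively.
        rgb_n, depth_n = self.norm_rgb(rgb), self.norm_depth(depth)
        # Each modality pulls in complementary cues from the other modality,
        # then the refined features are merged back through residual connections.
        rgb_refined, _ = self.rgb_from_depth(rgb_n, depth_n, depth_n)
        depth_refined, _ = self.depth_from_rgb(depth_n, rgb_n, rgb_n)
        return rgb + rgb_refined, depth + depth_refined


if __name__ == "__main__":
    layer = CrossModalRefinement(dim=256, num_heads=8)
    rgb_feat = torch.randn(2, 64 * 64, 256)    # flattened 64x64 feature map (example size)
    depth_feat = torch.randn(2, 64 * 64, 256)
    rgb_out, depth_out = layer(rgb_feat, depth_feat)
    print(rgb_out.shape, depth_out.shape)       # torch.Size([2, 4096, 256]) each
```

In this sketch the residual additions keep the original RGB pathway intact, which mirrors the stated goal of preserving SAM's RGB expertise while enriching it with depth cues.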