Abstract: Recognizing human actions under low illumination is challenging due to limited high-quality data and weak recognition backbones. To this end, we propose Modality Fusion Dark-to-Light (MFDL), a two-stage framework that simultaneously enhances the visibility of poorly lit videos and strengthens recognition performance. The first stage is a pixel-wise diffusion model that converts dark frames into their brightened counterparts. Specifically, we apply ControlNet to produce preliminarily recovered frames that serve as additional conditioning priors. To ensure temporal consistency between neighboring frames, we further integrate a dedicated spatial-temporal attention mechanism into the diffusion model during sampling. Subsequently, we equip the action recognition backbone with custom modality fusion modules that promote both the interaction between the light and dark modalities and the spatial-temporal consistency of the recognition architecture. Extensive experiments on the ARID and Dark48 benchmarks validate the effectiveness of our approach, both quantitatively and qualitatively. In particular, MFDL achieves state-of-the-art results on both datasets and exhibits a 9.06% improvement over the baseline on Dark48.
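To give a rough sense of the kind of light/dark modality fusion the abstract describes, the PyTorch sketch below implements a cross-attention block in which dark-stream tokens query the tokens of the enhanced (brightened) stream. The module name `CrossModalFusion`, the token layout, and all dimensions are illustrative assumptions for this sketch, not the authors' published implementation.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Hypothetical sketch of a light/dark modality fusion block.

    Dark-stream tokens attend to enhanced-stream tokens via
    cross-attention; names, sizes, and residual layout are assumptions.
    """
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.norm_dark = nn.LayerNorm(dim)
        self.norm_light = nn.LayerNorm(dim)
        self.norm_mlp = nn.LayerNorm(dim)
        # Dark tokens act as queries; enhanced-frame tokens as keys/values.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, dark_tokens, light_tokens):
        # dark_tokens, light_tokens: (batch, seq_len, dim) token sequences,
        # e.g. flattened spatial-temporal patch features from each stream.
        q = self.norm_dark(dark_tokens)
        kv = self.norm_light(light_tokens)
        fused, _ = self.cross_attn(q, kv, kv)
        x = dark_tokens + fused             # residual fusion of the two modalities
        x = x + self.mlp(self.norm_mlp(x))  # standard transformer MLP block
        return x

if __name__ == "__main__":
    b, t, n, d = 2, 8, 196, 256  # batch, frames, tokens per frame, channels
    dark = torch.randn(b, t * n, d)
    light = torch.randn(b, t * n, d)
    out = CrossModalFusion(dim=d)(dark, light)
    print(out.shape)  # torch.Size([2, 1568, 256])
```

In this sketch, fusing over the flattened frame-by-token sequence lets attention span both space and time, loosely mirroring the spatial-temporal consistency goal stated in the abstract; the paper itself does not specify these internals.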
External IDs: dblp:conf/icassp/WangXW25