Abstract: Highlights
• We propose a novel multi-modal crowd counting model to address information fusion and scale variation problems.
• The model uses the three-stream fusion encoder with IIM to fuse modality-paired and modality-specific features.
• The model adaptively integrates multi-scale features by SDAM to emphasize discriminative scale information.
• Our method outperforms its counterparts and performs consistently well in the daytime and nighttime.