MIANet: Bridging the Gap in Crowd Density Estimation With Thermal and RGB Interaction

Published: 01 Jan 2025, Last Modified: 15 May 2025 · IEEE Trans. Intell. Transp. Syst. 2025 · CC BY-SA 4.0
Abstract: Video surveillance and crowd analysis are essential for urban public safety, particularly for accurate crowd counting and density estimation. Existing methods rely primarily on the RGB modality, which limits their effectiveness in complex environments. With advances in thermal sensors, some studies have combined RGB and thermal images to improve crowd counting accuracy. However, existing approaches risk introducing redundant information during modality interaction and exacerbating the influence of non-uniform crowd density during modality fusion. Further advancements are therefore needed to bridge the gap in crowd density estimation through RGB and thermal image interaction. To adequately capture information from both RGB and thermal crowd images and alleviate these difficulties, we propose a Modality Interaction Attention Network (MIANet). Specifically, the Modality Interaction Attention (MIA) module consists of two Multi-Scale Attention (MSA) modules and a Channel Direction Attention (CDA) module, which together remove redundant information and amplify modality-specific attributes. The MSA incorporates multi-scale kernel factors, enabling it to handle the non-uniform crowd density found within a single image. To combine modality-specific attributes, tri-level MIA modules are connected to the front-end network in a stacked manner. Polished fusion features are then extracted by the Grid Block, which combines features level by level. We conducted in-depth experiments on two real-world datasets. The results show that MIANet outperforms state-of-the-art baselines and its own variants across a variety of error metrics, highlighting the effectiveness of MIANet and each of its essential modules for crowd density estimation. Code is available on GitHub.
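For readers who want a concrete picture of the described interaction, below is a minimal PyTorch sketch of how one MIA level (two MSAs, one per modality, followed by a CDA over the fused features) might be wired up. All class names, kernel sizes, and the reduction ratio are illustrative assumptions based only on the abstract, not the authors' actual implementation; consult the released code for the real architecture.

```python
# Hypothetical sketch of one MIA level as described in the abstract.
# Names (MultiScaleAttention, ChannelDirectionAttention, MIA), kernel sizes,
# and the reduction ratio are assumptions, not the paper's implementation.
import torch
import torch.nn as nn


class MultiScaleAttention(nn.Module):
    """Spatial attention with multi-scale kernel factors (kernel sizes assumed)."""
    def __init__(self, channels, kernel_sizes=(3, 5, 7)):
        super().__init__()
        # One depthwise branch per kernel size to capture crowds at different scales.
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels)
            for k in kernel_sizes
        )
        self.fuse = nn.Conv2d(channels * len(kernel_sizes), channels, 1)

    def forward(self, x):
        multi = torch.cat([branch(x) for branch in self.branches], dim=1)
        attn = torch.sigmoid(self.fuse(multi))
        return x * attn  # re-weight features, suppressing redundant responses


class ChannelDirectionAttention(nn.Module):
    """Channel-wise attention over the fused modalities (squeeze-excite style)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.mlp(self.pool(x))


class MIA(nn.Module):
    """One MIA level: per-modality MSA, then CDA on the concatenated features."""
    def __init__(self, channels):
        super().__init__()
        self.msa_rgb = MultiScaleAttention(channels)
        self.msa_thermal = MultiScaleAttention(channels)
        self.cda = ChannelDirectionAttention(2 * channels)
        self.proj = nn.Conv2d(2 * channels, channels, 1)  # back to per-level width

    def forward(self, f_rgb, f_thermal):
        fused = torch.cat([self.msa_rgb(f_rgb), self.msa_thermal(f_thermal)], dim=1)
        return self.proj(self.cda(fused))


if __name__ == "__main__":
    rgb = torch.randn(1, 64, 96, 128)       # RGB feature map from the front end
    thermal = torch.randn(1, 64, 96, 128)   # thermal feature map, same shape
    print(MIA(64)(rgb, thermal).shape)      # torch.Size([1, 64, 96, 128])
```

In the paper's design, three such levels would be stacked on the front-end network, with the Grid Block combining their outputs level by level; that fusion stage is omitted here since the abstract gives no structural detail for it.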