Depthwise-Attentive Hierarchical Cross-Modal Knowledge Distillation Network for Rail Surface Defect Detection
Abstract: Accurate detection of surface defects on railway tracks is critical for safe railway operation. Most existing models rely solely on Red–Green–Blue (RGB) images, limiting their ability to capture structural information. Incorporating depth features provides richer spatial cues, significantly improving detection accuracy. However, current Red–Green–Blue and Depth (RGB-D) dual-stream models suffer from high computational complexity and hardware dependencies, making them impractical for real-world deployment. To address these limitations, we propose DAHNet, an asymmetric knowledge distillation model with a teacher–student architecture. DAHNet-T serves as the teacher network, taking RGB-D inputs and integrating a cross-modal attention feature enhancement (CAFE) module to capture contextual information, along with a depth feature interaction block (DFIB) for efficient cross-modal fusion. DAHNet-S is the student network, a lightweight single-stream RGB model employing depthwise separable convolutions to reduce computation. We introduce a multi-level distillation strategy with dynamic temperature scaling to balance coarse-grained and fine-grained knowledge transfer, while incorporating contrastive learning and structural loss to improve pixel-level accuracy. Extensive experiments on the NEU RSDDS-AUG dataset demonstrate that our distilled model DAHNet-KD outperforms state-of-the-art methods. Compared to DAHNet-T, the number of parameters is reduced from 87.72 MParams to 13.97 MParams, and the computational cost decreases from 19.79 GFLOPs to 5.41 GFLOPs. The proposed model achieves superior performance across various evaluation metrics and also generalizes well on other public datasets. Therefore, the model provides a lightweight and high-accuracy solution for deployment on mobile devices in real-world industrial scenarios.
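The abstract's "dynamic temperature scaling" for knowledge transfer builds on standard temperature-scaled distillation. As a hedged illustration only (the paper's actual multi-level loss is not given here), a minimal sketch of the classic Hinton-style distillation term, where the temperature `T` softens both teacher and student distributions before comparing them:

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature-scaled softmax: larger T produces a softer distribution,
    # exposing the teacher's "dark knowledge" about non-target classes.
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max()          # numerical stability
    e = np.exp(z)
    return e / e.sum()

def kd_loss(teacher_logits, student_logits, T=4.0):
    # KL(teacher_soft || student_soft), scaled by T^2 so gradient
    # magnitudes stay comparable across temperature settings.
    # A "dynamic" scheme (as in the abstract) would vary T during
    # training; here T is a fixed illustrative hyperparameter.
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return float(T * T * np.sum(p * (np.log(p) - np.log(q))))
```

The loss is zero when student and teacher logits agree and positive otherwise; in a full pipeline this term would be combined with the task loss (and, per the abstract, contrastive and structural losses) at multiple feature levels.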
DOI: 10.1109/JIOT.2026.3650951