M3Net: Efficient Time-Frequency Integration Network with Mirror Attention for Audio Classification on Edge
Abstract: Audio classification plays a crucial role in fields such as human-machine interaction and intelligent robotics. However, high-performance audio classification systems typically demand substantial computational and storage resources, which poses significant challenges when deploying them to resource-constrained edge devices that urgently need such capabilities. To strike a new balance between model complexity and performance, we introduce a novel multi-view method for extracting and utilizing separated time-frequency features, realized in the proposed Mini Mirror Multi-View Network (M3Net) as the Mirror Attention mechanism. Through a reversible spatial transformation of spectral features, M3Net efficiently leverages robust local and global features in both the time and frequency domains with a small parameter budget. Experiments based on Mel-spectrograms, without data augmentation or pre-training, show that M3Net achieves classification accuracy above 97% on the UrbanSound8K and SpeechCommandsV2 datasets with only 0.03 million parameters. The contribution of each functional component of M3Net is verified and explained in the ablation experiments.
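Below is a minimal PyTorch sketch of the mirror-attention idea as described in the abstract, assuming that the "reversible spatial transformation" is a transpose of the time and frequency axes so a single lightweight attention block can attend along both views. All names here (`AxisAttention`, `MirrorAttention`, the depthwise-conv gating, and the additive fusion) are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class AxisAttention(nn.Module):
    """Lightweight gating along the last axis of a (B, C, F, T) tensor.

    Context is pooled over the other spatial axis, then a depthwise 1-D
    convolution produces per-step attention weights. (Illustrative design,
    assumed rather than taken from the paper.)
    """

    def __init__(self, channels: int, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=kernel_size // 2, groups=channels)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        ctx = x.mean(dim=2)                          # pool over axis 2 -> (B, C, L)
        w = self.sigmoid(self.conv(ctx)).unsqueeze(2)  # (B, C, 1, L) gate
        return x * w                                 # reweight along last axis


class MirrorAttention(nn.Module):
    """Apply the same axis attention to the spectrogram and to its mirror
    (time/frequency transpose), then fuse the two views additively."""

    def __init__(self, channels: int):
        super().__init__()
        self.axis_attn = AxisAttention(channels)     # shared across both views

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, freq, time) Mel-spectrogram features.
        time_view = self.axis_attn(x)                     # attend along time
        freq_view = self.axis_attn(x.transpose(2, 3))     # mirror, attend along freq
        return time_view + freq_view.transpose(2, 3)      # undo mirror, fuse


if __name__ == "__main__":
    feats = torch.randn(2, 16, 64, 101)  # e.g. 64 Mel bins, 101 frames
    print(MirrorAttention(16)(feats).shape)  # torch.Size([2, 16, 64, 101])
```

Because the transpose is exactly invertible and the gating block is shared between the two views, this style of module keeps the parameter count small while still modeling both time-axis and frequency-axis structure, which is consistent with the abstract's stated goal.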