Abstract: This paper introduces a gradient-based method for reducing the dimensionality of acoustic features in supervised deep learning models for speech emotion recognition (SER). The method identifies the acoustic features on which a trained network relies most heavily, so that the network can be simplified and retrained accordingly. This substantially reduces test time, making real-time SER feasible on embedded systems whose speech processing units are resource-constrained. The proposed method is evaluated on four convolutional neural network (CNN)-based deep learning models; one of the best results shows a 56.96% reduction in test time at the cost of a 3.81% drop in test accuracy. Compared with three mainstream dimensionality reduction techniques across various target dimensions, the method outperforms them in most scenarios.
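To make the idea of gradient-based feature selection concrete, the following is a minimal, dependency-free sketch, not the paper's implementation. It assumes a softmax-linear classifier as a stand-in for the CNN, and it scores each input feature by the mean absolute gradient of the cross-entropy loss with respect to that feature. The scoring rule, the toy weights, and the names `feature_saliency` and `top_k_features` are all illustrative assumptions; the abstract does not specify how the gradients are aggregated.

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def feature_saliency(W, b, X, y):
    """Mean |dL/dx_j| over samples for a softmax-linear classifier.

    W is indexed as W[feature][class], b holds one bias per class,
    X is a list of feature vectors, and y holds integer class labels.
    This model is an assumed stand-in for the trained network.
    """
    n_feat, n_cls = len(W), len(b)
    totals = [0.0] * n_feat
    for x, label in zip(X, y):
        logits = [b[c] + sum(x[j] * W[j][c] for j in range(n_feat))
                  for c in range(n_cls)]
        p = softmax(logits)
        # Cross-entropy gradient with respect to the logits: p - onehot(label).
        g = [p[c] - (1.0 if c == label else 0.0) for c in range(n_cls)]
        # Chain rule back to the inputs: dL/dx_j = sum_c g_c * W[j][c].
        for j in range(n_feat):
            totals[j] += abs(sum(g[c] * W[j][c] for c in range(n_cls)))
    return [t / len(X) for t in totals]


def top_k_features(saliency, k):
    """Indices of the k highest-scoring features, in ascending index order."""
    ranked = sorted(range(len(saliency)), key=lambda j: saliency[j], reverse=True)
    return sorted(ranked[:k])


# Toy model (hypothetical values): features 1 and 3 have zero weights,
# so the classifier ignores them and their gradients vanish.
W = [[1.5, -1.5], [0.0, 0.0], [-2.0, 2.0], [0.0, 0.0]]
b = [0.0, 0.0]
X = [[0.2, 1.0, -0.4, 0.3], [-0.5, 0.1, 0.9, -1.2], [1.1, -0.7, 0.3, 0.8]]
y = [0, 1, 0]

saliency = feature_saliency(W, b, X, y)
keep = top_k_features(saliency, k=2)
# Reduced inputs on which a smaller network would then be retrained.
X_reduced = [[x[j] for j in keep] for x in X]
```

In this toy setting the two ignored features receive a score of exactly zero and are dropped, leaving a two-dimensional input. With a real CNN the gradients would come from automatic differentiation rather than a hand-derived formula, and the retained dimension `k` would be chosen by weighing test time against accuracy.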