Abstract: Audio splicing is a typical form of audio forgery that allows arbitrary tampering with original audio content, disrupting both time-domain continuity and frequency-domain structure. However, most existing methods focus exclusively on frequency-domain information and seldom exploit the splicing cues embedded in the time domain. As a result, they fail to capture the cues present in both domains and overlook the intrinsic correlation and complementarity between them. To address this limitation, this paper proposes a heterogeneous dual-branch convolutional network, named HDBC. Specifically, we employ a multi-scale convolutional module to fuse dynamic frequency-domain features, namely log spectrograms and Constant-Q Transform (CQT) spectrograms, thereby constructing a complementary time-frequency representation. Subsequently, heterogeneous dual-branch modeling is performed separately on the time domain and the frequency domain, and the outputs of the two branches are fused by an adaptive weighting mechanism for final classification. Additionally, a discrepancy-sensitive consistency loss is introduced to guide the dual-branch module in minimizing prediction discrepancies between the time and frequency domains. Experiments show that HDBC achieves 93.85% accuracy on a TIMIT splicing dataset and 99.84% accuracy on the ASV2015_EVA_S10 dataset, outperforming the comparative methods. Noise robustness analysis further confirms its stable performance in complex environments.
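The abstract does not give the exact form of the adaptive weighting mechanism or of the discrepancy-sensitive consistency loss, so the following is only an illustrative sketch under stated assumptions, not the HDBC implementation. It assumes that each branch emits class logits, that the fusion weights are a softmax over two hypothetical gate scores (`gate_scores`), and that the prediction discrepancy is measured with a symmetric KL divergence. All function and parameter names are invented for this example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(time_logits, freq_logits, gate_scores):
    """Weighted average of the two branches' class probabilities.

    Assumption: the adaptive weights are softmax(gate_scores), where
    gate_scores stands in for whatever the network learns or predicts.
    """
    w_time, w_freq = softmax(gate_scores)
    p_time = softmax(time_logits)
    p_freq = softmax(freq_logits)
    return [w_time * pt + w_freq * pf for pt, pf in zip(p_time, p_freq)]


def symmetric_kl(p, q, eps=1e-12):
    """Symmetric KL divergence, used here as a stand-in consistency penalty.

    Assumption: the paper's discrepancy-sensitive loss may differ from this.
    """
    kl_pq = sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))
    kl_qp = sum(qi * math.log((qi + eps) / (pi + eps)) for pi, qi in zip(p, q))
    return 0.5 * (kl_pq + kl_qp)


# Hypothetical logits for the classes (genuine, spliced) from each branch.
time_logits = [2.0, -1.0]
freq_logits = [0.5, 0.3]

fused = fuse(time_logits, freq_logits, gate_scores=[0.0, 0.0])
consistency = symmetric_kl(softmax(time_logits), softmax(freq_logits))
```

With equal gate scores the fused prediction is the plain average of the two branches. The consistency term is zero when both branches agree exactly and grows as their predicted distributions diverge, which is the behavior a consistency loss of this kind is meant to penalize.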
External IDs: dblp:conf/mmm/FengTCL26