Abstract: Rapid progress in speech synthesis has heightened concerns about speech spoofing. Partially spoofed speech is particularly hard to detect because the utterance still contains genuine speech segments. Previous work on partially spoofed speech detection has followed two main directions. One trains directly at the segment level, but segment-level annotations are difficult to obtain in practice. The other trains at the utterance level, which often produces misleading gradients because a single utterance-level label covers both genuine and spoofed segments. This paper therefore proposes an approach for detecting partially spoofed speech when only utterance-level labels are available. The approach combines power pooling, which is effective for localization; auto-softmax pooling, which handles segments of varying complexity; and max pooling, which extracts the most prominent features. We also examine local and global correlations within utterances and introduce a conformer block to enhance the speech features and improve detection performance. Experiments on the PartialSpoof and ASVspoof2019LA datasets show that the proposed method delivers effective detection performance at both the utterance and segment levels.
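The three pooling functions named in the abstract can be illustrated with a minimal sketch. The abstract gives no formulas, so the definitions below are assumptions based on forms commonly used in weakly labelled audio detection: power pooling weights each segment score by the score raised to an exponent `n`, and auto-softmax pooling weights each score by a softmax with scale `alpha`. The function names, the parameters `n` and `alpha`, and the example scores are all hypothetical, and the sketch does not show how the paper combines the three functions.

```python
import math


def max_pool(scores):
    """Utterance score = the single most prominent segment score."""
    return max(scores)


def power_pool(scores, n=1.0):
    """Assumed form: weighted mean with weights scores**n.

    n = 0 reduces to mean pooling; larger n moves toward max pooling.
    """
    weights = [s ** n for s in scores]
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(s * w for s, w in zip(scores, weights)) / total


def auto_softmax_pool(scores, alpha=1.0):
    """Assumed form: weighted mean with softmax(alpha * scores) weights.

    alpha = 0 reduces to mean pooling; large alpha approaches max pooling.
    In a trained model alpha would typically be a learnable parameter.
    """
    shift = max(alpha * s for s in scores)  # subtract max for numerical stability
    weights = [math.exp(alpha * s - shift) for s in scores]
    total = sum(weights)
    return sum(s * w for s, w in zip(scores, weights)) / total


# Hypothetical per-segment spoof scores for one utterance.
segment_scores = [0.1, 0.9]
utterance_scores = {
    "max": max_pool(segment_scores),
    "power": power_pool(segment_scores, n=1.0),
    "auto_softmax": auto_softmax_pool(segment_scores, alpha=1.0),
}
```

Under these assumed definitions, each function maps segment-level scores to one utterance-level score, which is what lets a model be trained from utterance-level labels while still producing segment-level predictions.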