SSLCT: A Convolutional Transformer for Synthetic Speech Localization

Kratika Bhagtani, Amit Kumar Singh Yadav, Paolo Bestagini, Edward J. Delp

Published: 2024, Last Modified: 17 Nov 2025, MIPR 2024, CC BY-SA 4.0

Abstract: Deep learning methods can now generate high-quality synthetic speech that is perceptually indistinguishable from real speech. As synthetic speech can be used for nefarious purposes, speech forensics methods to detect fully synthetic speech have been developed. Speech editing tools can also create partially synthetic speech, in which only a part of the speech signal is synthetic. Detecting these short synthetic segments within a speech signal requires specialized methods to determine the temporal location of the synthetic speech. In this paper, we propose the Synthetic Speech Localization Convolutional Transformer (SSLCT), a neural network and transformer method for synthetic speech localization. SSLCT can temporally localize synthetic speech segments as small as 20 milliseconds. We demonstrate that SSLCT achieves less than 10% Equal Error Rate (EER), which is an improvement over several existing methods.
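
To make the two quantities in the abstract concrete, the following is a minimal sketch of how localization at 20 ms resolution and EER could be computed. It is not the authors' implementation or evaluation protocol, which this page does not describe. The function names `frame_labels` and `equal_error_rate`, the millisecond span format, and the convention that a higher score means "more likely synthetic" are all assumptions made for illustration.

```python
import numpy as np


def frame_labels(duration_ms, synthetic_spans_ms, frame_ms=20):
    """Hypothetical helper: mark each frame 1 if it overlaps a synthetic span.

    Times are integer milliseconds to avoid floating-point rounding issues.
    """
    n_frames = duration_ms // frame_ms
    labels = np.zeros(n_frames, dtype=int)
    for start_ms, end_ms in synthetic_spans_ms:
        first = start_ms // frame_ms        # first frame touched by the span
        last = -(-end_ms // frame_ms)       # ceiling division: one past the last frame
        labels[first:last] = 1
    return labels


def equal_error_rate(scores, labels):
    """Approximate EER: the error rate at the threshold where the
    false positive rate and false negative rate are closest.

    Assumes higher scores mean "synthetic" (label 1) and that both
    classes are present in `labels`.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    pos = labels == 1
    neg = labels == 0
    # Candidate thresholds: every observed score, plus one above all scores.
    thresholds = np.concatenate([np.unique(scores), [np.inf]])
    best_gap, eer = np.inf, None
    for t in thresholds:
        pred = scores >= t
        fpr = np.mean(pred[neg])    # real frames flagged as synthetic
        fnr = np.mean(~pred[pos])   # synthetic frames that were missed
        gap = abs(fpr - fnr)
        if gap < best_gap:
            best_gap, eer = gap, (fpr + fnr) / 2
    return eer


# Illustrative usage with made-up numbers (not results from the paper).
labels = frame_labels(duration_ms=200, synthetic_spans_ms=[(40, 80)])
scores = np.where(labels == 1, 0.9, 0.1)  # a perfect hypothetical detector
print(labels, equal_error_rate(scores, labels))
```

Under these assumptions, a 200 ms signal yields ten frame-level decisions, and the EER is computed over those frames rather than over whole utterances. Real evaluations typically interpolate between thresholds on a ROC curve; the nearest-threshold approximation here is chosen only for brevity.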