Speaker Inference Detection Using Only Text

Published: 2025 · Last Modified: 04 Jan 2026 · ICICS (3) 2025 · CC BY-SA 4.0
Abstract: Audio obtained from Internet of Things (IoT) devices can inadvertently disclose personally identifiable information (PII), particularly when combined with related text data. Accordingly, developing robust tools to detect privacy leakage in audio models such as Contrastive Language-Audio Pretraining (CLAP) is imperative. Existing membership inference attacks (MIAs) require audio inputs, which jeopardize voiceprint security and entail costly shadow-model training. To overcome these limitations, we propose SIDG, a speaker-level inference detector based exclusively on gibberish text. Our approach generates random text sequences guaranteed to be absent from the training corpus, extracts their feature representations via CLAP, and trains anomaly detectors on these representations. At inference, each test text's feature vector is evaluated by the anomaly detector to determine membership status: "anomalous" indicates that the speaker was present in the training set, whereas "normal" indicates a non-member. Furthermore, when real speaker audio is available, SIDG can integrate it to further enhance detection accuracy. Extensive experiments on multiple datasets demonstrate that SIDG outperforms baseline methods that rely solely on text data. Our source code and datasets are available via an anonymous link.
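The abstract's pipeline (gibberish generation, feature extraction, anomaly-detector training, membership decision) can be sketched as follows. This is a minimal illustration, not the paper's implementation: a hashed-bigram random projection (`embed_text`) stands in for CLAP's text encoder, scikit-learn's `IsolationForest` stands in for the unspecified anomaly detector, and all names and parameters are hypothetical.

```python
# Sketch of the SIDG workflow described in the abstract, under assumptions:
# a toy random-projection "encoder" replaces CLAP's text branch, and
# IsolationForest serves as the anomaly detector. Illustrative only.
import random
import string

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
PROJ = rng.standard_normal((128, 64))  # stand-in encoder weights


def gibberish(n_chars: int = 32) -> str:
    """Random character sequence, effectively guaranteed absent from
    any real training corpus (step 1 of the abstract)."""
    return "".join(random.choices(string.ascii_lowercase + " ", k=n_chars))


def embed_text(text: str) -> np.ndarray:
    """Hypothetical text encoder: hashed character bigrams followed by a
    linear projection. A real system would call CLAP's text encoder here."""
    counts = np.zeros(128)
    for a, b in zip(text, text[1:]):
        counts[hash((a, b)) % 128] += 1.0
    return counts @ PROJ


# Step 2: extract feature representations of gibberish texts.
reference = np.stack([embed_text(gibberish()) for _ in range(200)])

# Step 3: fit an anomaly detector on those representations.
detector = IsolationForest(random_state=0).fit(reference)

# Step 4: at inference, score a test text's feature vector.
# Per the abstract's rule, "anomalous" (-1) would flag a member speaker
# and "normal" (+1) a non-member.
test_vec = embed_text(gibberish()).reshape(1, -1)
label = int(detector.predict(test_vec)[0])  # +1 normal, -1 anomalous
print("member" if label == -1 else "non-member")
```

In the paper's setting, the feature vectors would come from CLAP embeddings of text tied to a specific speaker, so an anomalous score reflects that model's memorization of that speaker rather than the toy statistics shown here.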