Spoken Named Entity Localization as a Dense Prediction task: End-to-end Frame-Wise Entity Detection

ICLR 2026 Conference Submission 24066 Authors

20 Sept 2025 (modified: 03 Dec 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Speech, Named Entity Recognition, Audio, Entity Detection
TL;DR: DEnSNEL is a lightweight end-to-end model that directly detects and localizes spoken named entities at the audio-frame level for precise, privacy-preserving redaction.
Abstract: Precise temporal localization of named entities in speech is crucial for privacy-preserving audio processing. However, prevailing cascaded pipelines propagate transcription errors, while end-to-end models lack the temporal granularity required for reliable frame-level detection. To address these limitations, we introduce DEnSNEL (Dense End-to-end Spoken Named Entity Localizer), the first end-to-end model to perform direct, frame-level spoken named entity localization without intermediate character or text representations. We reformulate the task as dense, frame-wise binary classification ("entity" vs. "non-entity"), employing a lightweight encoder-classifier architecture. To improve boundary delineation, DEnSNEL incorporates a learnable complex filter bank to capture phonetic information and employs a boundary-focused loss that explicitly optimizes span precision. On the SLUE Phase 2 benchmark, DEnSNEL outperforms state-of-the-art methods in frame-level spoken named entity localization while requiring substantially fewer parameters. With its lightweight architecture and precise frame-level entity detection, DEnSNEL offers a practical and efficient solution for real-world privacy-sensitive speech applications. Our code and models will be released publicly.
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 24066
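
As a rough illustration of the dense, frame-wise formulation described in the abstract, the sketch below implements a per-frame binary entity classifier with a boundary-weighted binary cross-entropy term in PyTorch. The encoder (a small Conv1d stack), all layer sizes, and the boundary-weighting scheme are illustrative assumptions only; they are not the DEnSNEL architecture, its learnable complex filter bank, or the paper's boundary-focused loss.

```python
# Minimal sketch of dense frame-wise spoken-entity detection.
# All architectural and loss details are assumptions, not the paper's method.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FrameWiseEntityClassifier(nn.Module):
    """Lightweight encoder-classifier labeling every frame as entity / non-entity."""

    def __init__(self, n_feats: int = 80, hidden: int = 128):
        super().__init__()
        # Hypothetical lightweight encoder: two 1-D convolutions over time.
        self.encoder = nn.Sequential(
            nn.Conv1d(n_feats, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        # Dense prediction head: one logit per frame.
        self.head = nn.Conv1d(hidden, 1, kernel_size=1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, n_feats, frames) -> logits: (batch, frames)
        return self.head(self.encoder(feats)).squeeze(1)


def boundary_weighted_bce(logits: torch.Tensor, labels: torch.Tensor,
                          boundary_weight: float = 2.0) -> torch.Tensor:
    """Binary cross-entropy with extra weight on frames next to label changes.

    A plausible stand-in for a boundary-focused objective, not the paper's loss.
    """
    bce = F.binary_cross_entropy_with_logits(logits, labels, reduction="none")
    # Frames where the label flips between consecutive frames mark span boundaries.
    change = (labels[:, 1:] != labels[:, :-1]).float()
    weights = torch.ones_like(labels)
    weights[:, 1:] += boundary_weight * change
    weights[:, :-1] += boundary_weight * change
    return (weights * bce).mean()


if __name__ == "__main__":
    model = FrameWiseEntityClassifier()
    feats = torch.randn(2, 80, 300)   # e.g. 80-dim filter-bank features, 300 frames
    labels = torch.zeros(2, 300)
    labels[:, 100:150] = 1.0          # a hypothetical entity span
    loss = boundary_weighted_bce(model(feats), labels)
    loss.backward()
    print(f"loss = {loss.item():.4f}")
```

In this sketch, frames adjacent to a label transition receive extra weight, which is one simple way to push the model toward sharper span boundaries; the paper's actual boundary-focused loss may differ.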