Spoken Named Entity Localization as a Dense Prediction task: End-to-end Frame-Wise Entity Detection

ICLR 2026 Conference Submission 24066 Authors

20 Sept 2025 (modified: 03 Dec 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Speech, Named Entity Recognition, Audio, Entity Detection
TL;DR: DEnSNEL is a lightweight end-to-end model that directly detects and localizes spoken named entities at the audio-frame level for precise, privacy-preserving redaction.
Abstract: Precise temporal localization of named entities in speech is crucial for privacy-preserving audio processing. However, prevailing cascaded pipelines propagate transcription errors, while end-to-end models lack the temporal granularity required for reliable frame-level detection. To address these limitations, we introduce DEnSNEL (Dense End-to-end Spoken Named Entity Localizer), the first end-to-end model to perform direct, frame-level spoken named entity localization without intermediate character or text representations. We reformulate the task as dense, frame-wise binary classification ("entity" vs. "non-entity"), employing a lightweight encoder-classifier architecture. To improve boundary delineation, DEnSNEL incorporates a learnable complex filter bank to capture phonetic information and employs a boundary-focused loss that explicitly optimizes span precision. On the SLUE Phase 2 benchmark, DEnSNEL outperforms state-of-the-art methods in frame-level spoken named entity localization while requiring substantially fewer parameters. With its lightweight architecture and precise frame-level entity detection, DEnSNEL offers a practical and efficient solution for real-world privacy-sensitive speech applications. Our code and models will be released publicly.
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 24066
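
As a rough illustration of the dense, frame-wise formulation described in the abstract, the sketch below implements a per-frame binary entity classifier with a boundary-weighted binary cross-entropy term in PyTorch. The encoder (a small Conv1d stack), all layer sizes, and the boundary-weighting scheme are illustrative assumptions only; they are not the DEnSNEL architecture, its learnable complex filter bank, or the paper's boundary-focused loss.

```python
# Minimal sketch of dense frame-wise spoken-entity detection.
# All architectural and loss details are assumptions, not the paper's method.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FrameWiseEntityClassifier(nn.Module):
    """Lightweight encoder-classifier labeling every frame as entity / non-entity."""

    def __init__(self, n_feats: int = 80, hidden: int = 128):
        super().__init__()
        # Hypothetical lightweight encoder: two 1-D convolutions over time.
        self.encoder = nn.Sequential(
            nn.Conv1d(n_feats, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        # Dense prediction head: one logit per frame.
        self.head = nn.Conv1d(hidden, 1, kernel_size=1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, n_feats, frames) -> logits: (batch, frames)
        return self.head(self.encoder(feats)).squeeze(1)


def boundary_weighted_bce(logits: torch.Tensor, labels: torch.Tensor,
                          boundary_weight: float = 2.0) -> torch.Tensor:
    """Binary cross-entropy with extra weight on frames next to label changes.

    A plausible stand-in for a boundary-focused objective, not the paper's loss.
    """
    bce = F.binary_cross_entropy_with_logits(logits, labels, reduction="none")
    # Frames where the label flips between consecutive frames mark span boundaries.
    change = (labels[:, 1:] != labels[:, :-1]).float()
    weights = torch.ones_like(labels)
    weights[:, 1:] += boundary_weight * change
    weights[:, :-1] += boundary_weight * change
    return (weights * bce).mean()


if __name__ == "__main__":
    model = FrameWiseEntityClassifier()
    feats = torch.randn(2, 80, 300)   # e.g. 80-dim filter-bank features, 300 frames
    labels = torch.zeros(2, 300)
    labels[:, 100:150] = 1.0          # a hypothetical entity span
    loss = boundary_weighted_bce(model(feats), labels)
    loss.backward()
    print(f"loss = {loss.item():.4f}")
```

In this sketch, frames adjacent to a label transition receive extra weight, which is one simple way to push the model toward sharper span boundaries; the paper's actual boundary-focused loss may differ.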