SegINR: Segment-Wise Implicit Neural Representation for Sequence Alignment in Neural Text-to-Speech

Minchan Kim, Myeonghun Jeong, Joun Yeop Lee, Nam Soo Kim

Published: 01 Jan 2025, Last Modified: 09 Jan 2026, IEEE Signal Processing Letters (CC BY-SA 4.0)
Abstract: We present SegINR, a novel approach to neural text-to-speech (TTS) that eliminates the need for both an auxiliary duration predictor and autoregressive (AR) sequence modeling for alignment. SegINR simplifies the TTS pipeline by converting text sequences directly into frame-level features: each encoded text embedding is transformed into a segment of frame-level features, with length regulation handled by a conditional implicit neural representation (INR). This method, termed segment-wise INR (SegINR), captures the temporal dynamics within each segment while autonomously determining segment boundaries, which lowers computational cost. Integrated into a two-stage TTS framework, SegINR is employed for semantic token prediction. Experiments in zero-shot adaptive TTS scenarios show that SegINR outperforms conventional methods in speech quality while remaining computationally efficient.
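The core idea of the abstract, a conditional INR that maps a frame index within a segment (conditioned on a text embedding) to a frame-level feature, and that decides its own segment boundary, can be sketched as follows. This is a minimal illustration with untrained random weights; the layer sizes, the sigmoid stop-flag mechanism, and all function names are assumptions for exposition, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

EMB_DIM, HID_DIM, FEAT_DIM = 8, 16, 4  # illustrative sizes, not from the paper
# Input to the INR: a scalar frame position t plus the text embedding z.
W1 = rng.standard_normal((EMB_DIM + 1, HID_DIM)) * 0.5
# Output: a frame-level feature vector plus one stop logit.
W2 = rng.standard_normal((HID_DIM, FEAT_DIM + 1)) * 0.5

def inr_step(t, z):
    """Evaluate the conditional INR at frame position t for embedding z.

    Returns a frame-level feature vector and a stop probability.
    """
    x = np.concatenate(([t], z))
    h = np.tanh(x @ W1)
    out = h @ W2
    feat, stop_logit = out[:-1], out[-1]
    return feat, 1.0 / (1.0 + np.exp(-stop_logit))

def decode_segment(z, max_frames=32, stop_thresh=0.5):
    """Generate frames for one text unit until the stop flag fires,
    so the segment length (boundary) is defined by the model itself
    rather than by an external duration predictor."""
    frames = []
    for t in range(max_frames):
        feat, p_stop = inr_step(t / max_frames, z)
        frames.append(feat)
        if p_stop > stop_thresh:
            break
    return np.stack(frames)

z = rng.standard_normal(EMB_DIM)  # stands in for one encoded text embedding
seg = decode_segment(z)
print(seg.shape)  # (num_frames, FEAT_DIM); num_frames chosen by the model
```

Because each segment is decoded independently given its embedding, segments can in principle be evaluated in parallel across text positions, which is one plausible source of the computational efficiency the abstract claims.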