Learnable Fractional Superlets with a Spectro-Temporal Emotion Encoder for Speech Emotion Recognition

ICLR 2026 Conference Submission 21289 Authors

19 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: Speech Emotion Recognition, Time–Frequency Analysis, Learnable Fractional Superlets, Spectro-Temporal Encoding, Representation Learning, End-to-End Neural Networks
TL;DR: We propose the Learnable Fractional Superlet Transform (LFST), a principled differentiable time–frequency representation integrated with a Spectro-Temporal Emotion Encoder (STEE), enabling end-to-end speech emotion recognition from raw waveforms.
Abstract: Speech emotion recognition (SER) hinges on front-ends that expose informative time–frequency (TF) structure from raw speech. Classical short-time Fourier and wavelet transforms impose fixed resolution trade-offs, while prior "superlet" variants rely on integer orders and hand-tuned hyperparameters. We revisit TF analysis from first principles and formulate a learnable continuum of superlet transforms. Starting from DC-corrected analytic Morlet wavelets, we define superlets as multiplicative ensembles of wavelet responses and realize learnable fractional orders via softmax-normalized weights over discrete orders, computed as a log-domain geometric mean. We establish admissibility (zero mean) and continuity in order and frequency, and characterize approximate analyticity by bounding negative-frequency leakage as a function of an effective cycle parameter. Building on these results, we introduce the Learnable Fractional Superlet Transform (LFST), a fully differentiable front-end that jointly optimizes (i) a monotone, log-spaced frequency grid, (ii) frequency-dependent base cycles, and (iii) learnable fractional-order weights, all trained end-to-end. LFST further includes a learnable asymmetric hard-thresholding (LAHT) module that promotes sparse, denoised TF activations while preserving transients; we provide sufficient conditions for boundedness and stability under mild cycle and grid constraints. To exploit LFST for SER, we design the Spectro-Temporal Emotion Encoder (STEE), which consumes two-channel TF maps (magnitude $S$ and phase congruency $\kappa$) through a compact multi-scale stack with residual temporal and depthwise-frequency blocks, Adaptive FiLM gating, axial (time-axis) self-attention, global attentive pooling, and a lightweight classifier. The full LFST+STEE system is trained in a standard train-validate-test regime using focal loss with optional class rebalancing, and is evaluated on IEMOCAP, EMO-DB, and the private NSPL-CRISE dataset under standard protocols. By unifying a principled, learnable TF transform with a compact encoder, LFST+STEE replaces ad hoc front-ends with a mathematically grounded alternative that is differentiable, stable, and adaptable to data, enabling systematic ablations over frequency grids, cycle schedules, and fractional orders within a single end-to-end model. The source code for this paper is shared in this anonymous repository: https://anonymous.4open.science/r/LFST-for-SER-C5D2.
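To make the fractional-order construction concrete, the sketch below illustrates the log-domain geometric mean the abstract describes: a stack of integer-order superlet magnitude maps is mixed with softmax-normalized learnable weights, yielding a differentiable, effectively fractional order. The class name, tensor shapes, and initialization here are illustrative assumptions, not the paper's implementation; only the softmax-weighted log-domain mixing follows the abstract.

```python
import torch
import torch.nn as nn

class FractionalOrderMixer(nn.Module):
    """Hypothetical sketch of LFST's fractional-order step.

    Takes a stack of integer-order superlet magnitude maps R_k (each R_k
    being a multiplicative ensemble of k Morlet responses) and mixes them
    with softmax-normalized learnable weights in the log domain, i.e. a
    weighted geometric mean. Because the weights can interpolate between
    discrete orders, the effective order is fractional, and every step is
    differentiable for end-to-end training.
    """

    def __init__(self, num_orders: int):
        super().__init__()
        # Unnormalized logits over discrete orders 1..K; softmax keeps the
        # weights positive and summing to one, preserving the
        # geometric-mean interpretation.
        self.logits = nn.Parameter(torch.zeros(num_orders))

    def forward(self, order_mags: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
        # order_mags: (K, freq, time) nonnegative integer-order magnitudes.
        w = torch.softmax(self.logits, dim=0)                          # (K,)
        log_mix = torch.einsum("k,kft->ft", w, torch.log(order_mags + eps))
        return torch.exp(log_mix)                                      # fractional-order map


# Minimal usage with stand-in magnitudes (shapes are illustrative).
mixer = FractionalOrderMixer(num_orders=4)
R = torch.rand(4, 64, 200) + 0.1   # K=4 orders, 64 freq bins, 200 frames
S = mixer(R)                       # -> (64, 200)
```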
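The abstract does not spell out how LAHT is parameterized, and a literal hard threshold is not differentiable. The sketch below is one plausible reading under stated assumptions: a smooth sigmoid gate stands in for the hard cut, and two learnable per-frequency thresholds in the log-magnitude domain supply the asymmetry, so large-magnitude transients pass through unattenuated while small activations are pushed toward zero. The class name, threshold parameterization, and sharpness constant are assumptions for illustration, not the paper's design.

```python
import torch
import torch.nn as nn

class LAHT(nn.Module):
    """One plausible reading of the LAHT module (illustrative assumption).

    A hard threshold has zero gradient almost everywhere, so this sketch
    substitutes a smooth sigmoid gate. Two learnable per-frequency
    thresholds make the gate asymmetric: the lower one sets where
    suppression begins, the upper one where activations pass unattenuated,
    so sharp transients that exceed both are preserved.
    """

    def __init__(self, num_freqs: int, sharpness: float = 10.0):
        super().__init__()
        self.low = nn.Parameter(torch.full((num_freqs, 1), -2.0))  # log-mag threshold
        self.high = nn.Parameter(torch.full((num_freqs, 1), 0.0))
        self.sharpness = sharpness  # larger -> closer to a true hard cut

    def forward(self, S: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
        # S: (freq, time) nonnegative TF magnitudes (e.g. an LFST output).
        log_s = torch.log(S + eps)
        # Product of two sigmoids ramps from ~0 below `low` to ~1 above
        # `high`, keeping the whole operation differentiable.
        gate = torch.sigmoid(self.sharpness * (log_s - self.low)) * \
               torch.sigmoid(self.sharpness * (log_s - self.high))
        return S * gate  # sparse, denoised map
```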
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 21289