Statistically Guided Near-End Speech Intelligibility Improvement Through Voice Transformation and Transfer Learning
Abstract: Recent work improved speech intelligibility through an optimal trapezoidal transformation function that converted normal speech to Lombard speech via formant shifting. Although it performed well, the optimization converged slowly, and aggressive formant shifts in unvoiced frames introduced artifacts into the modified signal. Transfer learning was therefore used to rapidly adapt the optimized parameters to a target language, bypassing re-optimization for each new language; transfer across noise types, however, remained unaddressed. This work proposes a Gaussian transformation function that performs statistically guided normal-to-Lombard conversion. With fewer parameters to optimize, it converges faster than its predecessor while performing on par with it and generating fewer artifacts during voice modification. The proposed approach also improves transfer-learning performance by mitigating the directional nature of transfer under language mismatch. In addition, we enable transfer learning across noise types using comparative estimates of the noise magnitude spectra, which was not feasible earlier; simultaneous transfer of parameters across languages and noises thus becomes possible via the proposed Gaussian transformation function. We further examine the statistical difference between the formant shifts produced by the Gaussian transformation function and its predecessor, and their effect on intelligibility improvement. All experiments were conducted on exhaustive combinations of three languages, four noise types, and three SNR levels.
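To give a rough sense of what a Gaussian transformation function for formant shifting could look like, the sketch below applies a Gaussian-shaped frequency shift centered on a chosen frequency, so that formants near the center are shifted most and distant frequencies are left nearly unchanged. This is only an illustrative sketch: the function form, the parameter names (`amp_hz`, `center_hz`, `width_hz`), and all values are assumptions, not the paper's optimized parameterization.

```python
import numpy as np

def gaussian_formant_shift(freqs_hz, amp_hz=300.0, center_hz=1000.0, width_hz=600.0):
    """Shift input frequencies by a Gaussian-shaped amount:
        shift(f) = amp_hz * exp(-(f - center_hz)^2 / (2 * width_hz^2))
    Frequencies near center_hz receive the largest shift (amp_hz at the
    center itself); frequencies far from it are left almost untouched.
    All parameter values are illustrative, not taken from the paper."""
    freqs_hz = np.asarray(freqs_hz, dtype=float)
    shift = amp_hz * np.exp(-((freqs_hz - center_hz) ** 2) / (2.0 * width_hz ** 2))
    return freqs_hz + shift

# Example: shift three candidate formant frequencies (Hz).
formants = np.array([500.0, 1500.0, 2500.0])
shifted = gaussian_formant_shift(formants)
```

Because the shift is governed by only three parameters (amplitude, center, and width), an optimizer has a much smaller search space than with a piecewise trapezoidal mapping, which is consistent with the faster convergence the abstract describes.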