Towards End-to-End Speech Articulation and Spoken Language Analysis Using Deep Learning

Tobias Weise, Kubilay Can Demir, Paula Andrea Pérez-Toro, Tomás Arias-Vergara, Andreas Maier, Elmar Nöth, Maria Schuster, Björn Heismann, Seung Hee Yang

Published: 2025, Last Modified: 20 May 2025Hum. Centric Intell. Syst. 2025EveryoneRevisionsBibTeXCC BY-SA 4.0

Abstract: This study presents a speech and spoken language analysis framework, leveraging a robust, end-to-end deep learning model developed in our prior work. The framework represents a foundational step towards a comprehensive solution for analyzing speech articulation and spoken language. Unlike traditional approaches that rely on separate specialized models, our architecture integrates multiple prediction tasks into a single multi-task learning setup: nine articulatory trajectories, a phoneme sequence, and phoneme alignment. While conceptually distinct, these outputs share a strong underlying relation: phonemes, as the fundamental building blocks of language, emerge from specific articulatory configurations, and phoneme alignment provides crucial temporal structure. We bridge the gap between abstract linguistic representations and their physical realizations by integrating phoneme recognition, articulatory trajectory prediction, and phoneme alignment within a single deep learning framework. Phonemes, as abstract speech units, manifest as concrete articulatory gestures, which can be precisely captured through EMA and analyzed using deep learning methods. This integration lays the foundation for diverse applications, including intelligibility assessment and therapeutic feedback. Extensive experiments validate the model’s capabilities and demonstrate its potential in real-world contexts. These include evaluations of articulatory and phoneme-related metrics, intelligibility estimation using phoneme error rates, and open vocabulary keyword spotting. A case study on stroke-related datasets highlights how the framework provides detailed articulatory feedback and supports therapy progress tracking. While not a complete solution, this work shows that an integrated, end-to-end deep learning approach can effectively address multiple facets of speech analysis. Ultimately, it serves as a foundation for developing scalable and robust frameworks to tackle challenges in speech and language processing.