A Dataset for the Prediction of Spanish Language Fluency by Quantification of Linguistic Components with Artificial Intelligence

Brock P Wilson, Anthony Rios, Amanda Fernandez, Michael Rushforth, Samuel Ang, John Quarles, Mario A Flores

Published: 13 Dec 2024, Last Modified: 28 Oct 2025, License: CC BY-SA 4.0

Abstract: The native language of an individual is the language acquired naturally in early childhood, typically from their family and immediate community. Native language acquisition intrinsically involves several linguistic components, such as morphology, pragmatics, syntax, and semantics, that help individuals develop their language skills. In this work, we aim to predict Spanish language fluency by quantifying these linguistic components with an artificial intelligence (AI) pipeline. The pipeline includes a novel Spanish-language question-answer dataset, automatic question text generation, data augmentation, preprocessing with Natural Language Processing (NLP) techniques, and a Transformer model that integrates the components to quantify them and predict fluency. We found that our model predicts language fluency with high accuracy using the morphology, syntax, and pragmatics components, with higher scores for syntax. These results suggest that AI can be used to verify whether an individual is fluent in a particular language.

External IDs: doi:10.1145/3711542.3711552