Spanish Dialect Classification: A Comparative Study of Linguistically Tailored Features, Unigrams and BERT Embeddings
Keywords: discriminating between similar languages, automatic dialect classification, Spanish, dialect-specific characteristics, statistical machine learning models, transformer models
Abstract: The task of automatic dialect classification is typically tackled using traditional machine-learning models with bag-of-words unigram features. We explore two alternative methods for distinguishing dialects across 20 Spanish-speaking countries:
(i) Support vector machine and decision tree models trained on dialectal features tailored to the Spanish dialects, combined with standard unigrams.
(ii) A pre-trained BERT model fine-tuned on the task.
Results show that the tailored features generally do not improve the performance of the traditional models, but they provide a salient way of representing dialects in a content-agnostic manner. The BERT model outperforms the traditional models, but only by a small margin, and at the cost of explainability and interpretability.
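To make the first approach concrete, the following minimal sketch shows one way tailored dialect cues might be combined with bag-of-words unigrams in a single sparse feature vector. The cue patterns (`has_vosotros`, `has_voseo`, `has_ustedes`) and the helper names `tokenize` and `featurize` are hypothetical and chosen only for illustration; the abstract does not list the features actually used.

```python
import re
from collections import Counter

# Hypothetical dialect cues, chosen for illustration only; they are
# NOT the feature set used in the paper.
TAILORED_PATTERNS = {
    "has_vosotros": re.compile(r"\bvosotr[oa]s\b"),
    "has_voseo": re.compile(r"\bvos\b"),
    "has_ustedes": re.compile(r"\bustedes\b"),
}


def tokenize(text):
    """Lowercase the text and split it into runs of Unicode letters."""
    return re.findall(r"[^\W\d_]+", text.lower())


def featurize(text):
    """Return one sparse feature dict: unigram counts plus binary cues."""
    # Standard bag-of-words unigram counts, prefixed to avoid name clashes.
    features = Counter(f"uni={tok}" for tok in tokenize(text))
    # Binary indicators for the (assumed) dialect-specific cues.
    lowered = text.lower()
    for name, pattern in TAILORED_PATTERNS.items():
        features[f"dial={name}"] = int(bool(pattern.search(lowered)))
    return dict(features)
```

Feature dicts of this shape could then be vectorized (for example with scikit-learn's `DictVectorizer`) and passed to a support vector machine or decision tree. Keeping only the `dial=` keys would give a content-agnostic representation of the kind the abstract describes.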
Archival Status: Archival
Paper Length: Short Paper (up to 4 pages of content)
Submission Number: 112