Monolingual and Cross-Lingual Text Classification

Jurgita Kapočiūtė-Dzikienė, Daiga Deksne, Inguna Skadiņa, Raivis Skadiņš, Askars Salimbajevs

Published: 01 Jan 2025, Last Modified: 26 May 2026, Source: Crossref, License: CC BY-SA 4.0

Abstract: This research explores text classification in two Natural Language Processing (NLP) tasks: intent detection and multi-label topic classification. For intent detection, we used two datasets with 41 and 37 intents across eight languages, applying four training-testing strategies. We evaluated two trainable methods (a Feed Forward Neural Network (FFNN) on Language Agnostic BERT Sentence Embeddings (LaBSE) and LaBSE fine-tuning), one generative method (Davinci fine-tuning), and four memory-based approaches (LaBSE and ADA vectorizers, each with greedy and majority voting). With the best strategy and approach for each language, all tested languages outperformed the English accuracies of 0.84 and 0.90 achieved with the monolingual strategy, except for one Latvian dataset. Davinci fine-tuning excelled on one dataset, while LaBSE-based memory approaches yielded strong results on the other, particularly benefiting well-supported languages, though Lithuanian and Latvian required more data and model adjustments. Multi-label topic classification of Latvian and Russian Automatic Speech Recognition (ASR) transcripts faced significant challenges due to long, ambiguous texts and inter-annotator disagreements. The optimized single multi-class model underperformed, which led us to develop separate binary models that achieved high accuracy, surpassing 0.96 for Latvian. Our research highlights the need for customized text classification solutions; ours were scientifically evaluated and validated through acceptance testing with clients of the company Tilde, demonstrating their operational effectiveness.

External IDs: doi:10.1007/978-3-031-88486-3_3