Class-based Prediction Errors to Categorize Text with Out-of-vocabulary Words

Joan Serrà; Ilias Leontiadis; Dimitris Spathis; Gianluca Stringhini; Jeremy Blackburn

11 Jul 2025 (modified: 17 Feb 2017), ICLR 2017, Readers: Everyone

Abstract: Common approaches to text categorization essentially rely either on n-gram counts or on word embeddings. This poses significant difficulties in highly dynamic or quickly interacting environments, where the appearance of new words and/or varied misspellings is the norm. To better deal with these issues, we propose to use the error signal of class-based language models as input to text classification algorithms. In particular, we train a next-character prediction model for each class, and then exploit the error of these class-based models to inform a neural network classifier. This way, we shift from the 'ability to describe' seen documents to the 'ability to predict' unseen content. Preliminary studies using out-of-vocabulary splits from abusive tweet data show promising results, outperforming competitive text categorization strategies by 4-11%.
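The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: a smoothed character-bigram counter stands in for the next-character prediction model, the final neural network classifier is omitted, and every name here (`CharBigramModel`, `error_features`, the toy training strings, the smoothing alphabet size of 256) is a hypothetical choice made for illustration only.

```python
import math
from collections import Counter, defaultdict


class CharBigramModel:
    """Toy next-character predictor with add-one smoothing.

    Assumption: this stands in for the per-class prediction model;
    the abstract does not specify a bigram counter.
    """

    def __init__(self, alphabet_size=256):
        self.alphabet_size = alphabet_size  # assumed smoothing vocabulary size
        self.counts = defaultdict(Counter)  # previous char -> next-char counts

    def fit(self, texts):
        for text in texts:
            padded = "^" + text  # "^" marks the start of a document
            for prev, nxt in zip(padded, padded[1:]):
                self.counts[prev][nxt] += 1
        return self

    def mean_error(self, text):
        """Average negative log-probability of each character given the previous one."""
        padded = "^" + text
        total = 0.0
        for prev, nxt in zip(padded, padded[1:]):
            seen = self.counts.get(prev, Counter())
            prob = (seen[nxt] + 1) / (sum(seen.values()) + self.alphabet_size)
            total += -math.log(prob)
        return total / max(len(text), 1)


def error_features(models, text):
    """One prediction-error value per class, in sorted label order."""
    return [models[label].mean_error(text) for label in sorted(models)]


# Hypothetical toy data: one model is trained per class.
train = {
    "abusive": ["you are an idiot", "shut up you idiot", "stupid idiot loser"],
    "benign": ["have a nice day", "see you at lunch", "thanks for the help"],
}
models = {label: CharBigramModel().fit(texts) for label, texts in train.items()}

# A misspelled word that appears in no training document.
features = error_features(models, "idiooot")
print(dict(zip(sorted(models), features)))
```

In this toy setup the misspelled word is predicted with lower error by the model trained on the abusive class, even though the exact token was never seen. A downstream classifier would take such per-class error vectors as input in place of n-gram counts or word embeddings.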

TL;DR: Using class-based prediction errors is a promising strategy to classify text with out-of-vocabulary words

Keywords: Natural language processing, Applications

Conflicts: telefonica.com, auth.gr, ucl.ac.uk
