Learning from Few Samples: Lexical Substitution with Word Embeddings for Short Text Classification

Published: 2019 (JCDL 2019). License: CC BY-SA 4.0
Abstract: Text classification helps to categorize large numbers of documents in digital libraries. Classification results depend heavily on the quality of the labeled training data. In practice, the process of manually annotating documents is a hidden cost that is often overlooked. We propose a general preprocessing method for scenarios in which training data is scarce. It clusters semantically similar terms by combining a semantic distance measure with a probabilistic model of any task-specific term distributions. Preprocessing the training data with our method increases the mean classification performance of all tested classification approaches on text classification tasks with 500 or 1000 training samples; the largest observed increase is 15%. When more training samples are available, we report significant improvements in most scenarios as well.
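To make the idea concrete, the following is a minimal, illustrative sketch of the embedding-similarity half of such a preprocessing step: terms whose word-embedding cosine similarity exceeds a threshold are merged into one cluster, and documents are rewritten by substituting each term with its cluster representative. The toy vectors, threshold, and function names are assumptions for illustration; the paper's method additionally incorporates a probabilistic model of task-specific term distributions, which is not shown here.

```python
import numpy as np

# Toy word vectors standing in for pretrained embeddings (illustrative
# only; a real pipeline would load embeddings trained on a large corpus).
EMB = {
    "movie":   np.array([0.90, 0.10, 0.00]),
    "film":    np.array([0.88, 0.12, 0.05]),
    "actor":   np.array([0.10, 0.90, 0.00]),
    "actress": np.array([0.12, 0.88, 0.03]),
    "budget":  np.array([0.00, 0.10, 0.95]),
}

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def cluster_terms(vocab, threshold=0.95):
    """Greedily assign each term to the first existing cluster whose
    representative is closer than the threshold; otherwise the term
    starts a new cluster with itself as representative."""
    reps = {}  # term -> cluster representative
    for term in vocab:
        for rep in set(reps.values()):
            if cosine(EMB[term], EMB[rep]) >= threshold:
                reps[term] = rep
                break
        else:
            reps[term] = term
    return reps

def substitute(tokens, reps):
    """Rewrite a tokenized document, replacing each known term with its
    cluster representative (unknown terms pass through unchanged)."""
    return [reps.get(t, t) for t in tokens]

reps = cluster_terms(list(EMB))
print(substitute(["film", "budget", "actress"], reps))
# With these toy vectors, "film" maps to "movie" and "actress" to "actor".
```

By collapsing near-synonyms into a single feature before training, the classifier sees fewer, denser features, which is what makes the approach attractive when only a few hundred labeled samples are available.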