TopEx: Exploration of Text Spaces for Directed Data Augmentation

ACL ARR 2024 August Submission265 Authors

15 Aug 2024 (modified: 25 Aug 2024), ACL ARR 2024 August Submission, CC BY 4.0
Abstract: Scarce labeled data is a common problem in machine learning that is usually tackled with large amounts of human annotation work. Synthetic data augmentation can help alleviate this problem, but it is usually opaque how exactly newly generated points change the data distribution, which data points contribute to increased performance, and what the overall effect on the dataset is. In this paper, we propose an interpretability and dataset analysis method for text classification that first examines the output space obtained by passing the existing data through a model and then identifies areas of that output space in which the model fails to classify correctly. We map the model outputs to an examinable continuous space and apply different clustering algorithms to identify clusters of data points that either are not well represented in the data space or are too difficult to learn. We automatically label these clusters using topic modeling and pass the labels to an LLM to generate synthetic data points, filling the gaps in our data space. Our method reliably improves language model accuracy by up to 2% on four representative multi-class text classification problems while adding less than one percent of synthetic data to the training pool.
Paper Type: Long
Research Area: Machine Learning for NLP
Research Area Keywords: data augmentation, data-efficient training, self-supervised learning, hardness of samples, data influence, topic modeling, NLP in resource-constrained settings, automatic evaluation
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches low compute settings-efficiency, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 265
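
The pipeline outlined in the abstract (find misclassified points in the model's output space, cluster them, label each cluster, and prompt an LLM for new examples) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not the authors' implementation: the function names are hypothetical, a tiny k-means stands in for the unspecified clustering algorithms, word-frequency keywords stand in for topic modeling, and the prompt wording is invented.

```python
from collections import Counter


def misclassified(probs, labels):
    """Return indices whose argmax over the model output differs from the gold label."""
    return [
        i for i, (p, y) in enumerate(zip(probs, labels))
        if max(range(len(p)), key=p.__getitem__) != y
    ]


def kmeans(points, k, iters=20):
    """Tiny deterministic k-means (stand-in clustering): the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        for i, p in enumerate(points):
            assign[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [points[i] for i in range(len(points)) if assign[i] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def cluster_keywords(texts, top_n=3):
    """Stand-in for topic modeling: the most frequent words longer than three characters."""
    counts = Counter(w for t in texts for w in t.lower().split() if len(w) > 3)
    return [w for w, _ in counts.most_common(top_n)]


def build_prompt(keywords, label_name, n=5):
    """Assemble a hypothetical LLM prompt asking for n synthetic examples of one class."""
    return f"Write {n} short texts of class '{label_name}' about: {', '.join(keywords)}."
```

In use, one would pass the model's class probabilities to `misclassified`, cluster the corresponding output vectors with `kmeans`, derive keywords per cluster from the underlying texts, and send the result of `build_prompt` to an LLM of choice. The generated texts would then be added to the training pool.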