CLIMB: Imbalanced Data Modelling Using Contrastive Learning with Limited Labels

Published: 01 Jan 2024, Last Modified: 17 Apr 2025 · WISE (4) 2024 · CC BY-SA 4.0
Abstract: Machine learning classifiers typically rely on the assumption of balanced training datasets, with sufficient examples per class to facilitate effective model learning. However, this assumption often fails to hold. Consider a common scenario in which the positive class has only a few labelled instances compared to thousands in the negative class. This class imbalance, coupled with limited labelled data, poses a significant challenge for machine learning algorithms, especially in an ever-growing data landscape. The challenge is further amplified for short-text datasets, which inherently provide less information for computational models to leverage. While techniques such as data sampling and fine-tuning pre-trained language models exist to address these limitations, our analysis reveals that they are inconsistent in achieving reliable performance. We propose a novel model that leverages contrastive learning within a two-stage approach to overcome these challenges. Our framework first performs unsupervised fine-tuning of a language model to learn representations of short texts, and then fine-tunes on the few available labels, augmented with GPT-generated text, using a novel contrastive learning algorithm designed to model short texts and handle class imbalance simultaneously. Our experimental results demonstrate that the proposed method significantly outperforms established baseline models.
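To make the second, supervised stage more concrete, the sketch below shows one common way a label-aware contrastive objective can be written: anchors are pulled toward other examples sharing their label (real or GPT-generated) and pushed away from the rest. This is a minimal illustration under our own assumptions, not the authors' released algorithm; the encoder, the `temperature` value, and the batch construction are placeholders.

```python
import torch
import torch.nn.functional as F


def supervised_contrastive_loss(embeddings: torch.Tensor,
                                labels: torch.Tensor,
                                temperature: float = 0.1) -> torch.Tensor:
    """Pull together embeddings that share a label, push apart the rest.

    embeddings: (N, D) pooled outputs of a fine-tuned language model on
                short texts (labelled examples plus GPT-generated ones).
    labels:     (N,) integer class labels.
    """
    embeddings = F.normalize(embeddings, dim=1)
    sim = embeddings @ embeddings.T / temperature            # (N, N) similarities
    # Exclude self-similarity so an example is never its own positive.
    self_mask = torch.eye(len(labels), dtype=torch.bool, device=sim.device)
    sim = sim.masked_fill(self_mask, float("-inf"))
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)

    # Positives are other examples carrying the same label.
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    pos_counts = pos_mask.sum(dim=1).clamp(min=1)            # avoid division by zero
    loss_per_example = -(log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1)) / pos_counts

    # Average only over anchors that actually have a positive; with very few
    # minority-class labels, some anchors in a batch may have none.
    has_pos = pos_mask.any(dim=1)
    return loss_per_example[has_pos].mean()


if __name__ == "__main__":
    # Stand-in embeddings and labels purely for demonstration.
    torch.manual_seed(0)
    emb = torch.randn(8, 32)
    lab = torch.tensor([0, 0, 1, 1, 1, 0, 1, 0])
    print(supervised_contrastive_loss(emb, lab).item())
```

In a batch where the minority class is represented by only a handful of labelled or generated texts, this formulation still yields a gradient for those examples as long as at least one other same-class instance is present, which is one way a contrastive objective can remain usable under class imbalance.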