Keywords: Oversampling, SMOTE, LLM, tabular data
TL;DR: LSH (LLM-SMOTE Hybrid) combines the creativity of LLMs with the efficiency of SMOTE, delivering a robust and practical oversampling method that shines in highly imbalanced and extreme few-/zero-shot scenarios.
Abstract: Class imbalance poses a persistent challenge in machine learning, as classifiers often underperform on the minority class when trained on skewed data. Oversampling is a common remedy; methods such as the Synthetic Minority Oversampling Technique (SMOTE) are efficient but have limited representational power, since they rely solely on existing data points. Recent approaches that employ large language models (LLMs) for oversampling overcome this limitation by generating diverse synthetic samples informed by contextual knowledge. However, LLM-only methods are computationally expensive and often impractical at scale. To bridge this gap, we propose LLM-SMOTE Hybrid (LSH), a method that integrates the strengths of both paradigms. In LSH, an LLM acts as a Scout that generates contextually meaningful seed samples for the minority class, while SMOTE serves as the Surveyor that efficiently expands these seeds into new samples. This design reduces reliance on repeated LLM calls while preserving diversity and scalability. Extensive experiments on 60 imbalanced tabular datasets, across multiple classifiers and resampling strategies, show that LSH consistently outperforms both SMOTE and LLM-only oversampling on highly imbalanced datasets, and that it is particularly effective in few-shot and zero-shot scenarios where SMOTE fails. Robustness analysis further shows that LSH generalizes stably, with lower variance than competing methods. Finally, LSH offers a practical trade-off, achieving performance competitive with LLM-based methods at substantially lower computational cost. These findings position LSH as an efficient, robust, and broadly applicable oversampling strategy for imbalanced learning problems.
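To make the Scout/Surveyor division of labor concrete, the sketch below shows one plausible way to wire the two stages together, assuming the imbalanced-learn SMOTE implementation. `generate_llm_seeds` is a hypothetical stand-in for the paper's LLM Scout; it is mocked here with jittered copies of real minority rows so the snippet runs end to end, and it is not the authors' actual prompt pipeline.

```python
# Minimal sketch of the LSH idea: an LLM "Scout" contributes a few seed
# samples for the minority class, then SMOTE (the "Surveyor") interpolates
# around real + seed points to reach class parity without further LLM calls.
import numpy as np
from imblearn.over_sampling import SMOTE


def generate_llm_seeds(X_min: np.ndarray, n_seeds: int, rng) -> np.ndarray:
    """Hypothetical placeholder for the LLM Scout. In the paper an LLM would
    synthesize contextually meaningful minority-class rows; here we mock it
    with Gaussian-jittered copies of existing minority samples."""
    idx = rng.integers(0, len(X_min), size=n_seeds)
    return X_min[idx] + rng.normal(scale=0.05, size=(n_seeds, X_min.shape[1]))


def lsh_oversample(X, y, minority_label, n_seeds=10, random_state=0):
    rng = np.random.default_rng(random_state)
    X_min = X[y == minority_label]

    # Scout step: a small batch of LLM-generated seeds augments the minority class.
    seeds = generate_llm_seeds(X_min, n_seeds, rng)
    X_aug = np.vstack([X, seeds])
    y_aug = np.concatenate([y, np.full(n_seeds, minority_label)])

    # Surveyor step: SMOTE expands the seeded minority class to parity.
    # k_neighbors must stay below the minority count after seeding.
    k = min(5, int((y_aug == minority_label).sum()) - 1)
    return SMOTE(k_neighbors=k, random_state=random_state).fit_resample(X_aug, y_aug)


if __name__ == "__main__":
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=500, weights=[0.95, 0.05], random_state=0)
    X_res, y_res = lsh_oversample(X, y, minority_label=1)
    print(np.bincount(y_res))  # classes balanced after Scout + Surveyor steps
```

Because only the seed generation touches the LLM, the per-dataset LLM cost is fixed at `n_seeds` calls regardless of how many synthetic samples SMOTE ultimately produces, which is the efficiency trade-off the abstract describes.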
Primary Area: generative models
Submission Number: 20786