Semantics-Aware Generative Latent Data Augmentation for Learning in Low-Resource Domains
Keywords: generative model, data augmentation, speech emotion recognition, long-tail image recognition
TL;DR: We perform generative data augmentation in foundation model–induced latent space, enabling efficient and effective augmentation for low-resource domains.
Abstract: In many real-world applications, labeled data is costly to collect and often heavily imbalanced, causing models to overfit to dominant classes. While foundation models (FMs) with lightweight adapters offer strong representations, they remain vulnerable to label imbalance. Loss re-weighting and gradient adjustment provide partial relief, but the core issue is insufficient data variability for underrepresented classes. Data augmentation (DA) addresses this, yet both input-space manipulations and generative approaches struggle to guarantee diversity matching the unseen data distribution due to the high dimensionality of raw data spaces.
A more principled alternative is to perform DA in a structured, low-dimensional latent space, where task-relevant semantics are more compactly encoded and inter-class relationships are better preserved. Prior methods rely on simple statistical manipulations under convexity assumptions, while recent generative approaches remain underexplored. Key open questions concern how generative models interact with pretrained FMs, what conditioning strategies are most effective, and how feature-space abstraction level affects augmentation quality.
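For illustration, the kind of simple statistical manipulation used by prior latent-space methods can be sketched as convex interpolation between same-class feature vectors. This is a minimal sketch of that baseline idea, not of GeLDA; the function name `convex_latent_augment`, the toy 4-dimensional features, and the uniform mixing weight are all assumptions made for the example.

```python
import random

def convex_latent_augment(features, num_new, rng):
    """Return num_new convex combinations of randomly paired latent vectors.

    `features` is a list of equal-length vectors from ONE class (hypothetical
    stand-ins for foundation-model embeddings).
    """
    synthetic = []
    for _ in range(num_new):
        a, b = rng.choice(features), rng.choice(features)
        lam = rng.random()  # mixing weight in [0, 1)
        synthetic.append([lam * x + (1.0 - lam) * y for x, y in zip(a, b)])
    return synthetic

rng = random.Random(0)
# Hypothetical tail class with only 5 embeddings of dimension 4.
tail_features = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(5)]
synthetic = convex_latent_augment(tail_features, num_new=20, rng=rng)
```

Every synthetic vector stays inside the convex hull of the observed features, which is why such methods cannot add variability beyond what the few available samples already span.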
To address these questions, we propose GeLDA (Generative Latent Data Augmentation), a semantics-aware framework that synthesizes samples in an FM-learned latent space using diffusion models conditioned on augmented label information. By leveraging both pretrained and fine-tuned FMs, GeLDA ensures augmentation operates in a well-structured, task-relevant space. Crucially, augmenting the label and subdomain conditioning signals enables GeLDA to capture relationships between high-resource and low-resource classes, making it especially powerful in severely data-scarce settings.
We validate GeLDA on two realistic scenarios: zero-shot speech emotion recognition (SER) for low-resource languages, and long-tailed image classification. On SER, GeLDA improves unweighted average recall by 6.2% using Whisper-large as the FM, with a compact 21M-parameter diffusion model trained on just 83 hours of data. On ImageNet Long Tail, GeLDA achieves 74.7% tail-class accuracy, a new state of the art, while preserving middle- and head-class performance. These results demonstrate GeLDA's broad applicability across modalities and task configurations, offering a principled solution to the persistent challenges of low-resource and label-imbalanced learning.
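For readers unfamiliar with the SER metric, unweighted average recall (UAR) is the mean of per-class recalls, so each class counts equally regardless of its size. The sketch below uses hypothetical toy labels (not data from the paper) to show how UAR differs from plain accuracy under class imbalance.

```python
from collections import defaultdict

def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls; each class contributes equally."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)

# Hypothetical imbalanced toy example: 8 "neutral" vs 2 "angry" utterances.
y_true = ["neutral"] * 8 + ["angry"] * 2
y_pred = ["neutral"] * 8 + ["neutral", "angry"]  # one "angry" sample missed

uar = unweighted_average_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

Here the per-class recalls are 1.0 and 0.5, so UAR is 0.75 while accuracy is 0.9; the missed minority-class sample is penalized more heavily by UAR.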
Submission Number: 109