Submission Type: Short paper (4 pages)
Keywords: Tabular data augmentation, Large language models, LLM-augmented data generation, Rock type classification
TL;DR: LLMs can encode domain-specific scientific constraints via policy generation, yielding data augmentation strategies that outperform traditional geometric methods while using fewer synthetic samples.
Abstract: Traditional tabular augmentation methods such as SMOTE and Gaussian sampling treat features as generic vectors, disregarding the domain-specific constraints often present in scientific tabular data. This work introduces a domain-aware augmentation approach that leverages Large Language Models (LLMs) to encode scientific knowledge through policy generation. The effectiveness of this approach is demonstrated using a case study on geochemical compositions, where data must satisfy closure constraints and exhibit intrinsic correlations that geometric interpolation methods fail to preserve. Evaluated on an imbalanced geochemical rock classification dataset, the LLM-based augmentation achieves 95.74% accuracy and a 0.9544 macro-F1 score, outperforming SMOTE, Gaussian sampling, and no-augmentation baselines while requiring fewer synthetic samples.
Published Venue And Year: AI for Tabular Data workshop at EurIPS 2025
Submission Number: 17