Keywords: Text condensation, data distillation, coherence, optimization, understanding and reasoning
TL;DR: We propose a model-agnostic text condensation framework that preserves textual coherence to enable efficient use across diverse language models.
Abstract: Data condensation has emerged as a promising technique for improving training efficiency. However, producing a small synthetic text set that retains its utility for language models remains challenging. Existing approaches are typically model-specific and often focus only on generating readable text, restricting their applicability to text understanding tasks (e.g., classification). In this work, we propose a model-agnostic text condensation framework with coherence awareness. Our method synthesizes a compact set of representative texts by optimizing in the semantic embedding space while enforcing coherence constraints when converting the results back into the input space. This model-agnostic design allows the condensed data to be used for training or adapting a wide range of models without retraining the condensation pipeline. Experiments on diverse language understanding and reasoning benchmarks show that our method outperforms state-of-the-art text condensation techniques, achieving competitive results on classification tasks and significant gains on GSM8K when used for in-context learning or fine-tuning. Our work highlights the importance of preserving textual coherence in dataset condensation and opens new avenues for efficient, reusable data preparation across models.
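For intuition, here is a minimal sketch of the kind of pipeline the abstract describes: learn a few synthetic embeddings that match simple distributional statistics of the real corpus embeddings, then map each one back into the input space in a way that guarantees coherent text. The moment-matching loss, the nearest-real-sentence decoding, and the helper names (`condense`, `decode_coherently`) are illustrative assumptions, not the paper's actual objective or decoder.

```python
# Hypothetical sketch of embedding-space text condensation. The abstract
# does not specify the losses or the decoder; the choices below are
# stand-ins to make the two-stage structure concrete.
import torch
import torch.nn.functional as F

def condense(real_emb: torch.Tensor, m: int, steps: int = 500, lr: float = 0.1):
    """Learn m synthetic embeddings whose first and second moments
    match those of the real embeddings (a simple distribution-matching
    objective, assumed here for illustration)."""
    d = real_emb.size(1)
    syn = torch.randn(m, d, requires_grad=True)
    opt = torch.optim.Adam([syn], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        # Mean (first-moment) matching between synthetic and real sets.
        loss = F.mse_loss(syn.mean(0), real_emb.mean(0))
        # Second-moment matching as a crude stand-in for richer objectives.
        loss = loss + F.mse_loss(syn.T @ syn / m,
                                 real_emb.T @ real_emb / len(real_emb))
        loss.backward()
        opt.step()
    return syn.detach()

def decode_coherently(syn: torch.Tensor, real_emb: torch.Tensor, texts: list[str]):
    """Map each synthetic embedding back to the input space by retrieving
    the nearest real sentence -- the crudest coherence constraint, since
    real sentences are coherent by construction."""
    sims = F.normalize(syn, dim=1) @ F.normalize(real_emb, dim=1).T
    return [texts[i] for i in sims.argmax(dim=1).tolist()]

# Usage: real_emb is an (N, d) tensor from any frozen sentence encoder,
# texts the corresponding N strings; the returned condensed texts can then
# be fed to any downstream model, which is what makes the design model-agnostic.
```

Nearest-neighbor retrieval is only a proxy; a learned decoder with an explicit coherence penalty would be the natural generalization of the "coherence constraints" the abstract mentions.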
Supplementary Material: zip
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 10096