Quantifying Tolerance to Errors in Synthetic Data: An Atomic-level Operand vs. Operator Perturbation Study
Keywords: Synthetic Data; Error Tolerance; Large Language Models
Abstract: Synthetic data generation has become a cornerstone for advancing large language models, yet the absence of a quantitative analysis of error tolerance remains a critical bottleneck. Current filtering strategies therefore swing between two extremes: they are either overly aggressive, risking the exclusion of potentially valuable samples, or overly permissive, failing to eliminate erroneous ones. To bridge this gap, we introduce **A**tomic **T**ree **O**peration **M**odeling (**ATOM**), a framework that decomposes data into functional units ($y=f(x)$) to precisely distinguish *benign Operand perturbations* from *fatal Operator perturbations*. Our experiments reveal a **double dissociation**: models are robust to operand noise but collapse under operator disruption. By prioritizing operator correctness over strict operand precision, our ATOM-synthesized data significantly outperforms rigorous baselines (e.g., a +3.3% gain over LIMA), validating structural diversity as the decisive factor in synthetic data quality.
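The operand-vs-operator distinction in the abstract can be made concrete with a toy sketch (hypothetical illustration, not the authors' implementation): an atomic unit $y=f(x)$ is modeled as an operator name plus its operands, and the two perturbation types act on different parts of that unit. All function and variable names below (`evaluate`, `perturb_operand`, `perturb_operator`, `OPS`) are invented for this example.

```python
import random

# Hypothetical sketch of an atomic unit y = f(x): a pair of
# (operator name, operand list). Not the paper's actual framework.
OPS = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}

def evaluate(op, operands):
    """Apply operator `op` to its operands: y = f(x)."""
    return OPS[op](*operands)

def perturb_operand(op, operands, noise=1):
    """Benign perturbation: jitter one operand value; f is unchanged."""
    i = random.randrange(len(operands))
    new = list(operands)
    new[i] += noise
    return op, new

def perturb_operator(op, operands):
    """Fatal perturbation: replace f itself; operands are unchanged."""
    other = [name for name in OPS if name != op]
    return random.choice(other), list(operands)

# Example unit: y = add(2, 5) = 7
op, xs = "add", [2, 5]
print(evaluate(op, xs))                      # 7
print(evaluate(*perturb_operand(op, xs)))    # 8: value shifts, structure intact
print(evaluate(*perturb_operator(op, xs)))   # 10: structure itself is wrong
```

Under this framing, filtering that tolerates operand noise but rejects operator corruption would keep the second sample above and discard the third.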
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: data influence; robustness; data shortcuts/artifacts
Contribution Types: Model analysis & interpretability, Data resources, Data analysis
Languages Studied: English
Submission Number: 8208