Quantifying Tolerance to Errors in Synthetic Data: An Atomic-level Operand vs. Operator Perturbation Study
Keywords: Synthetic Data; Error Tolerance; Large Language Models
Abstract: Synthetic data generation has become a cornerstone for advancing large language models, yet the absence of a quantitative analysis of error tolerance remains a critical bottleneck. Current filtering strategies therefore swing between two extremes: they are either overly aggressive, risking the exclusion of potentially valuable samples, or overly permissive, failing to eliminate erroneous ones. To bridge this gap, we introduce **A**tomic **T**ree **O**peration **M**odeling (**ATOM**), a framework that decomposes data into functional units ($y=f(x)$) to precisely distinguish *benign Operand perturbations* from *fatal Operator perturbations*. Our experiments reveal a **double dissociation**: models are robust to operand noise but collapse under operator disruption. By prioritizing operator correctness over strict operand precision, our ATOM-synthesized data significantly outperforms rigorous baselines (e.g., a +3.3% gain over LIMA), validating structural diversity as the decisive factor in synthetic data quality.
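The operand-vs-operator distinction in the abstract can be made concrete with a toy sketch (hypothetical illustration, not the authors' implementation): an atomic unit $y=f(x)$ is modeled as an operator name plus its operands, and the two perturbation types act on different parts of that unit. All function and variable names below (`evaluate`, `perturb_operand`, `perturb_operator`, `OPS`) are invented for this example.

```python
import random

# Hypothetical sketch of an atomic unit y = f(x): a pair of
# (operator name, operand list). Not the paper's actual framework.
OPS = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}

def evaluate(op, operands):
    """Apply operator `op` to its operands: y = f(x)."""
    return OPS[op](*operands)

def perturb_operand(op, operands, noise=1):
    """Benign perturbation: jitter one operand value; f is unchanged."""
    i = random.randrange(len(operands))
    new = list(operands)
    new[i] += noise
    return op, new

def perturb_operator(op, operands):
    """Fatal perturbation: replace f itself; operands are unchanged."""
    other = [name for name in OPS if name != op]
    return random.choice(other), list(operands)

# Example unit: y = add(2, 5) = 7
op, xs = "add", [2, 5]
print(evaluate(op, xs))                      # 7
print(evaluate(*perturb_operand(op, xs)))    # 8: value shifts, structure intact
print(evaluate(*perturb_operator(op, xs)))   # 10: structure itself is wrong
```

Under this framing, filtering that tolerates operand noise but rejects operator corruption would keep the second sample above and discard the third.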
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: data influence; robustness; data shortcuts/artifacts
Contribution Types: Model analysis & interpretability, Data resources, Data analysis
Languages Studied: English
Submission Number: 8208