Keywords: Augmentation, Hard Negative Mining
Abstract: Machine learning systems deployed in the wild must operate reliably despite unreliable inputs, whether arising from distribution shifts, adversarial manipulation, or strategic behavior by users. Content moderation is a prime example: violators deliberately exploit euphemisms, obfuscations, or benign co-occurrence patterns to evade detection, creating unreliable supervision signals for classifiers. We present a span-aware augmentation framework that generates high-quality counterfactual hard negatives to improve robustness under such conditions. Our pipeline combines (i) multi-LLM agreement to extract causal violation spans, (ii) policy-guided rewrites of those spans into compliant alternatives, and (iii) validation via re-inference to ensure that only genuine label-flipping counterfactuals are retained. Across real-world ad moderation and toxic comment datasets, this approach consistently reduces spurious correlations and improves robustness to adversarial triggers, with PR-AUC gains of up to +6.3 points. We further show that augmentation benefits peak at task-dependent augmentation ratios, underscoring the importance of balancing original and augmented data in reliable learning. These findings highlight span-aware counterfactual augmentation as a practical path toward reliable ML from strategically manipulated and unreliable text data.
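The following is a minimal, illustrative sketch of the three-stage pipeline described in the abstract, not the authors' implementation. All function names, model handles (`llm_annotators`, `rewriter_llm`, `classifier`), and the agreement threshold are hypothetical placeholders introduced only to make the control flow concrete.

```python
from typing import List, Optional

def extract_violation_span(text: str, llm_annotators: List, min_agreement: float = 0.66) -> Optional[str]:
    """Stage (i): ask several LLMs for the causal violation span and keep a span
    only when a sufficient fraction of annotators agree on it (hypothetical API)."""
    spans = [llm.mark_violation_span(text) for llm in llm_annotators]
    for span in set(spans):
        if span and spans.count(span) / len(spans) >= min_agreement:
            return span
    return None

def rewrite_span(text: str, span: str, rewriter_llm) -> str:
    """Stage (ii): policy-guided rewrite of the violating span into a compliant
    alternative, leaving the surrounding text untouched (hypothetical API)."""
    compliant = rewriter_llm.rewrite_to_compliant(span)
    return text.replace(span, compliant)

def make_hard_negative(text: str, llm_annotators: List, rewriter_llm, classifier) -> Optional[str]:
    """Stage (iii): keep the rewrite only if re-inference confirms the label
    actually flipped from 'violation' to 'compliant'."""
    span = extract_violation_span(text, llm_annotators)
    if span is None:
        return None
    candidate = rewrite_span(text, span, rewriter_llm)
    if classifier.predict(candidate) == "compliant":  # genuine label-flipping counterfactual
        return candidate
    return None
```

Under these assumptions, only candidates that pass the re-inference check are added to training as counterfactual hard negatives.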
Submission Number: 81