Improving OOD Robustness via Background-Aware Test-Time Augmentation in Black-Box and Resource-Constrained Settings
Abstract: Deep learning models for text classification typically achieve strong performance on in-distribution (ID) data but often fail to generalize to out-of-distribution (OOD) inputs. This degradation frequently arises because models rely on spurious background cues (e.g., specific syntax or register) learned during training, which become unreliable when the domain changes. While recent Test-Time Augmentation (TTA) approaches have enabled robustness in black-box settings, they often rely on unconstrained rewriting strategies. For instance, standard In-Context Rewriting (ICR) instructs Large Language Models (LLMs) to modify input details to match ID exemplars, creating a high risk of semantic drift and label flipping, particularly when using smaller, resource-constrained LLMs. In this work, we propose a Background-Aware TTA framework that strictly disentangles style from semantics. Unlike prior methods that encourage broad paraphrasing, we utilize a semantic-constrained alignment strategy that enables small, efficient LLMs to transform specific background attributes, such as tone and sentence structure, to match in-distribution priors while explicitly enforcing the preservation of original meaning. This approach mitigates OOD degradation by neutralizing spurious background shifts, allowing frozen black-box models to process inputs in their native distribution without risking semantic corruption. Empirical evaluations across multiple text classification benchmarks demonstrate that our targeted alignment strategy outperforms unconstrained augmentation baselines. By generating higher-fidelity augmentations, our method achieves superior OOD robustness with reduced computational overhead, establishing a viable path for deploying robust models in resource-limited black-box environments.
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Yu_Meng1
Submission Number: 7077