DT4LM: Differential Testing for Reliable Language Model Updates in Classification Tasks

Xinyue Zuo, Yan Xiao, Xiaochun Cao, Wenya Wang, Jin Song Dong

Published: 2025, Last Modified: 22 Jan 2026. IEEE Trans. Software Eng. 2025. License: CC BY-SA 4.0

Abstract: In Natural Language Processing (NLP), Language Models (LMs) are frequently updated to enhance performance. However, these updates can introduce unintended regressions, that is, cases where the updated model fails on inputs correctly handled by its predecessor, which poses challenges for backward compatibility and model reliability. Prior studies focus mainly on accuracy improvement and pay limited attention to regressions. The few that acknowledge regressions address them with predefined test sets, which capture only a small subset of cases. To overcome these limitations, we propose DT4LM, a novel Differential Testing framework for Reliable Language Model Updates in Classification Tasks. DT4LM systematically generates differential inputs that expose weaknesses in updated models by comparing their behavior with that of their previous versions. By pioneering the application of differential testing to language model update issues, DT4LM generates higher-quality test inputs than adversarial testing, which operates on a single model. Our framework introduces a novel goal function that dynamically adapts to model behavior, guiding the search for regression-revealing inputs with greater effectiveness and efficiency than a static one. These differential inputs are then leveraged in adversarial training to improve model robustness against regressions, facilitating more reliable model updates. Extensive experiments across multiple datasets and model architectures validate DT4LM’s effectiveness. It generates differential inputs with success rates ranging from 21.22% to 97.11%, surpassing the baseline by an average of 111.46%. Differential inputs are consistently of higher quality than adversarial inputs, as demonstrated across six automated metrics and confirmed by human evaluation of semantic fidelity. Additionally, adversarial training with these differential inputs improves model robustness by 64.24% on average, without compromising clean accuracy.
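
To make the notion of a regression concrete, the following is a minimal Python sketch of the check implied by the definition in the abstract: an input counts as a regression when the previous model classifies it correctly and the updated model does not. It is not DT4LM's input-generation procedure or its goal function, neither of which is specified here. The `Classifier` interface, the `find_regressions` helper, and both toy models are hypothetical names introduced only for illustration.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface (an assumption, not from the paper): text -> predicted label.
Classifier = Callable[[str], int]


def find_regressions(
    old_model: Classifier,
    new_model: Classifier,
    inputs: Sequence[str],
    labels: Sequence[int],
) -> List[Tuple[str, int]]:
    """Return (text, label) pairs the old model gets right and the new model gets wrong."""
    regressions = []
    for text, label in zip(inputs, labels):
        if old_model(text) == label and new_model(text) != label:
            regressions.append((text, label))
    return regressions


# Toy stand-ins for two versions of a sentiment classifier (1 = positive, 0 = negative).
def old_model(text: str) -> int:
    return 1 if "good" in text else 0


def new_model(text: str) -> int:
    # Recognises "great" after the update, but no longer reacts to "good".
    return 1 if "great" in text else 0


texts = ["a good film", "a great film", "a dull film"]
labels = [1, 1, 0]

regressions = find_regressions(old_model, new_model, texts, labels)
print(regressions)  # [('a good film', 1)]
```

In this toy setting the update fixes "a great film" yet breaks "a good film", so overall accuracy is unchanged while one regression is introduced. This is the kind of case that an accuracy-only comparison would miss.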

External IDs: dblp:journals/tse/ZuoXCWD25