Deep Learning Framework Testing via Model Mutation: How Far Are We?

Yanzhou Mu, Rong Wang, Juan Zhai, Chunrong Fang, Xiang Chen, Peiran Yang, Zhixiang Cao, Ruixiang Qian, Shaoyu Yang, Zhenyu Chen

Published: 01 Jan 2026, Last Modified: 13 Mar 2026 | IEEE Transactions on Software Engineering | License: CC BY-SA 4.0
Abstract: Deep Learning (DL) frameworks are fundamental to the development, deployment, and execution of DL systems, and defects in them can have severe consequences. Ensuring the quality of DL frameworks has therefore become a pressing challenge. Among the various testing techniques, model mutation has emerged as a widely adopted approach. Such methods generate mutants by applying mutation operators to DL models (e.g., structural changes or parameter edits) and then check for inconsistencies, crashes, or abnormal behaviors across different frameworks or hardware. Despite their effectiveness, existing methods suffer from three limitations. First, they mainly reuse operators designed for model testing, raising doubts about their ability to expose framework-level defects. Second, they insufficiently consider mutation constraints, such as mutation type, position, and order, which directly affect the defect detection ability of the generated mutants. Third, they rely on a limited detection range and narrow test oracles, focusing on functional correctness in model inference while overlooking efficiency and resource-usage defects, as well as defects that developers care about in other stages such as model training or deployment. These limitations result in weak alignment with the critical defects that developers are most concerned about in practice. Motivated by these observations, this study conducts a comprehensive investigation into the effectiveness of existing mutation-based testing methods. We first collect and classify defect reports from PyTorch and MindSpore according to developers’ priority tags, building a taxonomy of seven categories and 19 sub-categories of high-priority (HP) defects. We then map the defects reported by five state-of-the-art methods into this taxonomy to evaluate their detection abilities.
To explain these limitations, we further analyze how three key factors (mutation type, mutation position, and mutation order) affect the generated mutants. Based on the experimental results, we summarize ten findings, which cover developers’ priorities in fixing framework defects, the defect detection ability of existing methods, and the influence of mutation factors on the generated mutants. Furthermore, we reveal four limitations of existing methods together with their root causes, and propose four targeted optimization strategies. We then apply these strategies to COMET and uncover six new defects spanning four types, including two previously unreported categories. Overall, our study identifies 38 unique framework defects, of which 30 are confirmed by developers and 12 have been fixed, demonstrating the practical value of our findings.
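For illustration only, the mutate-then-compare workflow described in the abstract can be sketched in plain Python. This is not the implementation of COMET or of any method evaluated in the study. The toy layer format, the two stand-in "backends", and the operator names `scale_weights` and `insert_relu` are all hypothetical, and `backend_buggy` carries a deliberately seeded defect so that the differential oracle has something to find.

```python
import math

# Toy model format (an assumption for this sketch): a list of layers, where each
# layer is either ("dense", weights, bias) or ("relu",).

def _run(model, x, relu):
    """Evaluate the toy model on input vector x with the given ReLU function."""
    for layer in model:
        if layer[0] == "dense":
            _, weights, bias = layer
            x = [sum(w * v for w, v in zip(row, x)) + b
                 for row, b in zip(weights, bias)]
        else:
            x = [relu(v) for v in x]
    return x

def backend_reference(model, x):
    """Hypothetical backend with a correct ReLU."""
    return _run(model, x, lambda v: max(0.0, v))

def backend_buggy(model, x):
    """Hypothetical backend with a seeded defect: ReLU implemented as abs()."""
    return _run(model, x, abs)

def scale_weights(model, position, factor=-1.0):
    """Parameter mutation: scale the weights of the dense layer at `position`."""
    if model[position][0] != "dense":
        raise ValueError("scale_weights needs a dense layer")
    _, weights, bias = model[position]
    mutated = [[factor * w for w in row] for row in weights]
    return model[:position] + [("dense", mutated, bias)] + model[position + 1:]

def insert_relu(model, position):
    """Structural mutation: insert an extra ReLU layer after `position`."""
    return model[:position + 1] + [("relu",)] + model[position + 1:]

def apply_plan(model, plan):
    """Apply (operator, position) pairs in order; type, position, and order all matter."""
    for operator, position in plan:
        model = operator(model, position)
    return model

def consistent(model, x, backends, tol=1e-9):
    """Differential oracle: every backend must agree with the first one."""
    outputs = [backend(model, x) for backend in backends]
    reference = outputs[0]
    return all(math.isclose(a, b, rel_tol=tol, abs_tol=tol)
               for other in outputs[1:]
               for a, b in zip(reference, other))

SEED = [("dense", [[1.0, 2.0], [0.5, 0.5]], [0.0, 0.0]),
        ("relu",),
        ("dense", [[1.0, 1.0]], [0.0])]
X = [1.0, 1.0]
BACKENDS = [backend_reference, backend_buggy]

print("seed consistent:", consistent(SEED, X, BACKENDS))
mutant = apply_plan(SEED, [(scale_weights, 0)])
print("mutant consistent:", consistent(mutant, X, BACKENDS))
```

In this sketch the seed model never feeds a negative value into ReLU, so the seeded defect stays hidden; only the weight mutation at position 0 drives the activations negative and makes the two backends disagree. That is one small example of how mutation type and position can decide whether a mutant exposes a defect at all.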