BDTest: A Diversity-Oriented Test Case Generation Framework for Deep Neural Networks in 6G-IOT

Wendian Luo, Shengxin Dai, Cheng Dai, Bing Guo, Sherif Moussa, Mubarak Alrashoud

Published: 2026, Last Modified: 30 Mar 2026 · IEEE Internet of Things Journal, 2026 · CC BY-SA 4.0
Abstract: The widespread integration of artificial intelligence (AI) in sixth-generation Internet of Things (6G-IoT) applications introduces significant challenges for ensuring the trustworthiness and dependability of AI models. The "black-box" characteristic of many deep neural networks (DNNs) creates a notable obstacle to confirming their safety in intricate, ever-changing environments. Consequently, extensive testing is needed, requiring the gathering and labeling of a large number of test cases, a process that is both time-intensive and resource-consuming. Previous studies have adopted neuron coverage (NC) criteria to steer test case generation for DNNs; however, these are white-box measures that require access to internal model states, which limits their practicality. Conversely, black-box metrics, which focus only on model outputs, offer a more feasible approach. Among these, black-box diversity metrics evaluate model robustness by generating diverse test cases, eliminating the need for internal model details. This article presents a diversity-oriented test case generation framework, called BDTest. BDTest enhances test adequacy through five stages: 1) mapping feature vectors extracted from an initial set of seed images onto a low-dimensional manifold using UMAP; 2) detecting sparse regions using DBSCAN; 3) sampling key points from these regions via Latin hypercube sampling (LHS); 4) reconstructing latent features and generating new images through ICA and GAN inversion; and 5) measuring the diversity of the generated set using metrics such as the log-determinant (LD). Experiments demonstrate that BDTest significantly improves test set diversity and error detection performance, achieving error rates of 59.36%, 59.76%, and 67.03% on VGG19, DenseNet121, and MobileNetV2, respectively, outperforming DeepXplore by an average of 12.43% and DLFuzz by 9.95% across all tested models.
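Two of the abstract's stages can be illustrated with a short, library-free sketch: stage 3's Latin hypercube sampling (one stratified draw per dimension, so samples spread evenly over a sparse region) and stage 5's log-determinant diversity score (the log-det of the Gram matrix of a feature set, which grows as the set becomes more spread out). This is a minimal numpy illustration of the general techniques, not the paper's actual implementation; the function names, the cosine-normalized Gram matrix, and the jitter term are all assumptions for the sketch.

```python
import numpy as np

def latin_hypercube(n_samples, n_dims, rng):
    """Latin hypercube sampling: split each dimension into n_samples equal
    strata and draw exactly one point per stratum, then shuffle the strata
    order independently per dimension (illustrative sketch)."""
    u = rng.random((n_samples, n_dims))                     # position inside each stratum
    strata = (np.arange(n_samples)[:, None] + u) / n_samples
    for d in range(n_dims):
        rng.shuffle(strata[:, d])                           # decouple dimensions
    return strata

def log_det_diversity(features, eps=1e-6):
    """Diversity of a feature set as the log-determinant of its (cosine)
    Gram matrix; a more spread-out set yields a larger value.
    The eps jitter keeps the matrix positive definite (assumption)."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    gram = f @ f.T + eps * np.eye(len(f))
    sign, logdet = np.linalg.slogdet(gram)
    return logdet

rng = np.random.default_rng(0)
points = latin_hypercube(8, 2, rng)      # 8 samples covering a 2-D latent region

diverse = rng.random((8, 16))            # spread-out feature vectors
clustered = np.tile(rng.random((1, 16)), (8, 1)) + 0.01 * rng.random((8, 16))
print(log_det_diversity(diverse), log_det_diversity(clustered))
```

Because every stratum in every dimension receives exactly one sample, LHS covers a sparse region far more evenly than plain uniform sampling at the same budget, and the log-determinant score then rewards generated sets whose features are mutually dissimilar rather than near-duplicates.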
When retrained with the generated test cases, the model showed improved accuracy on the original test set, along with a significant accuracy gain on the natural adversarial test set.