The impact of imbalanced datasets on Deep Neural Network predictions: A case study in scramjet performance

TMLR Paper5565 Authors

06 Aug 2025 (modified: 29 Aug 2025) | Under review for TMLR | CC BY 4.0
Abstract: Robust aerodynamic predictions for hypersonic vehicles increasingly rely on existing deep‑learning tools. However, imbalanced datasets — often resulting from limited experimental data or insufficient coverage of operational conditions — can compromise model reliability and introduce bias into predictions. This work offers an application‑centered account of how a feed‑forward multilayer perceptron (PyTorch implementation) behaves when trained on (i) a data‑rich yet operationally imbalanced set of scramjet simulations and (ii) a deliberately balanced counterpart generated with a conventional metaheuristic (MH) sampling scheme, but with a lower sample count. Without altering network architecture, loss function, or optimizer, we expose a clear trade‑off: the imbalanced model achieves a 14% lower root mean square error (RMSE) but produces thrust predictions that violate first‑principles trends, whereas the balanced model sacrifices a small amount of numerical accuracy to maintain physical coherence across Mach–altitude space. These results illuminate both the strength (high statistical accuracy) and the weakness (loss of physical fidelity under bias) of off‑the‑shelf deep neural networks (DNNs) when data coverage is uneven. The findings serve as a cautionary example for practitioners who might otherwise deploy such models uncritically, and underscore the methodological importance of rigorous dataset diagnostics — rather than chasing novel algorithms — for reliable AI adoption in aerospace design.
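The dataset diagnostics that the abstract calls for could start with a simple coverage check over Mach–altitude space. The following is a minimal sketch, not the authors' procedure: the function name `coverage_report`, the 5×5 binning, and the input arrays are all hypothetical choices made for illustration.

```python
import numpy as np

def coverage_report(mach, altitude, bins=(5, 5)):
    """Count samples per Mach-altitude cell and summarize imbalance.

    Hypothetical helper: the bin layout and the metrics below are
    illustrative assumptions, not taken from the paper.
    """
    # 2-D histogram: rows index Mach bins, columns index altitude bins.
    counts, _, _ = np.histogram2d(mach, altitude, bins=bins)
    nonempty = counts[counts > 0]
    return {
        "counts": counts,
        # Fraction of cells with no samples at all.
        "empty_fraction": float(np.mean(counts == 0)),
        # Ratio of the most populated to the least populated non-empty cell.
        "imbalance_ratio": float(nonempty.max() / nonempty.min()),
    }
```

An imbalance ratio near 1 with no empty cells would suggest even coverage under this binning; a large ratio or many empty cells would flag regions where a network's predictions rest on few or no samples.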
Submission Length: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Florian_Tobias_Schaefer1
Submission Number: 5565