Improving generalizability of ML-enabled software through domain specification

Hamed Barzamini, Mona Rahimi, Murtuza Shahzad, Hamed Alhoori

Published: 2022, Last Modified: 14 Jul 2023, Venue: CAIN 2022

Abstract: While conventional software components implement pre-defined specifications, Machine Learning (ML)-enabled Software Components (MLSC) learn the domain specifications from training samples. The MLSC's data-driven, inductive reasoning therefore relies heavily on the quality of the training datasets, which are often collected arbitrarily and in an ad hoc manner. This random collection of samples leads to a significant gap between the actual specification of a real-world concept and the picture of that concept that a dataset presents. The gap reduces MLSC generalizability, particularly in perceptual tasks, where understanding the environment is an important factor in accurate prediction. To fill the gap between the conceptualization of a targeted domain's concept and its visualization in the MLSC dataset, we propose exploiting a semantic specification of the concept to identify the concept's missing variants in the data. We first semantically specify the targeted domain's hard-to-specify concepts and then refer to the derived specifications to evaluate the diversity and relative completeness of the datasets collected for the MLSC. Systematically augmenting training datasets with respect to the semantics of the domain improves the quality of an arbitrarily collected dataset and potentially yields more reliable models. As a proof of concept, we automatically acquired existing semantic knowledge to specify the automotive domain concept "pedestrian." After we augmented state-of-the-art pedestrian datasets accordingly, our evaluations showed that semantic augmentation outperforms brute-force machine learning in satisfying the MLSC accuracy requirements.
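The idea of checking a dataset against a semantic specification to find missing variants can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract says the specification is acquired automatically from existing semantic knowledge, whereas the `SPEC` dictionary below is a hand-written stand-in, and its attribute names and values, the `SAMPLES` annotations, and the helper functions `missing_variants` and `coverage` are all hypothetical.

```python
from itertools import product

# Hypothetical specification of the concept "pedestrian". The attributes and
# values are illustrative assumptions, not taken from the paper.
SPEC = {
    "age_group": ["adult", "child"],
    "mobility": ["walking", "wheelchair"],
}

# Hypothetical per-image annotations from an arbitrarily collected dataset.
SAMPLES = [
    {"age_group": "adult", "mobility": "walking"},
    {"age_group": "child", "mobility": "walking"},
    {"age_group": "adult", "mobility": "walking"},
]


def missing_variants(spec, samples):
    """Return the attribute combinations the spec allows but no sample shows."""
    keys = sorted(spec)
    expected = set(product(*(spec[k] for k in keys)))
    observed = {tuple(sample.get(k) for k in keys) for sample in samples}
    return [dict(zip(keys, combo)) for combo in sorted(expected - observed)]


def coverage(spec, samples):
    """Fraction of spec-defined variants that appear at least once."""
    total = 1
    for values in spec.values():
        total *= len(values)
    return 1 - len(missing_variants(spec, samples)) / total


if __name__ == "__main__":
    for variant in missing_variants(SPEC, SAMPLES):
        print("missing:", variant)
    print("coverage:", coverage(SPEC, SAMPLES))
```

In this toy setting, the variants reported as missing are the ones a targeted augmentation step would aim to collect, in contrast to adding more samples at random.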
