Software Metrics Selection: The Case of Python

Zamira Kholmatova, Georgy Andryushchenko, Dinislam Gabitov, Firas Jolha, Andrey Palaev, Ninel Yunusova

Published: 01 Jan 2025, Last Modified: 19 May 2025, FICC (3) 2025, CC BY-SA 4.0

Abstract: Software metrics play a crucial role in analyzing software quality, detecting code that needs refactoring, and identifying bugs. Although software engineering long faced challenges in collecting metrics due to limited tooling, the growth of open-source development and insights gained from software production have given rise to large-scale collection of software metrics. However, analyzing a huge number of metrics can lead to issues such as collinearity, sparsity, and noise, all of which negatively impact statistical analysis and machine learning models. Moreover, the sheer volume of metrics can overwhelm developers and make software systems harder to understand. To address the problem of too many metrics, researchers have proposed various solutions such as heuristic algorithms and statistical tests. These approaches require supervised learning algorithms, thus introducing the problem of automatic labeling. In this paper, we propose an approach for identifying minimal subsets of software engineering metrics that effectively explain the structural properties of Python repositories. The proposed approach employs two optimization techniques, Particle Swarm Optimization (PSO) and Genetic Algorithms (GA), with the Sammon error as the fitness function. To run the experiments, we collected a large set of metrics from open-source Python repositories. We validated our methodology on class-level and method-level metrics. The results demonstrate that both PSO and GA can be successfully used to identify optimal subsets of metrics. Moreover, we present the subsets of optimal metrics obtained through both techniques.
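To make the fitness function concrete, the following is a minimal sketch of how a Sammon error could score a candidate metric subset. It is not the authors' implementation: the names `pairwise_distances` and `sammon_error`, the binary `mask` encoding of a subset, and the use of Euclidean distance are all assumptions made for illustration. The abstract does not say how metrics are scaled or whether subset size is penalized, so neither is modeled here.

```python
import numpy as np


def pairwise_distances(X):
    """Euclidean distances between all row pairs (i < j) of X."""
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    upper = np.triu_indices(X.shape[0], k=1)
    return dist[upper]


def sammon_error(X, mask, eps=1e-12):
    """Sammon stress between the full metric space and a metric subset.

    X    : (n_samples, n_metrics) array, one row per class or method
    mask : boolean vector of length n_metrics (hypothetical subset encoding)
    """
    mask = np.asarray(mask, dtype=bool)
    if not mask.any():
        # An empty subset preserves no structure; treat it as the worst case.
        return float("inf")
    d_full = pairwise_distances(X)
    d_sub = pairwise_distances(X[:, mask])
    # Skip coincident points to avoid dividing by zero.
    keep = d_full > eps
    d_full, d_sub = d_full[keep], d_sub[keep]
    if d_full.size == 0:
        return 0.0
    return float(((d_full - d_sub) ** 2 / d_full).sum() / d_full.sum())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(30, 5))  # synthetic stand-in for collected metrics
    print(sammon_error(X, [True, True, True, False, False]))
```

Under these assumptions, a PSO or GA would search over binary masks and minimize this value, so that a small subset of metrics preserves the pairwise distance structure of the full metric set without requiring labels.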