Diversity in software engineering research

Harold Valdivia Garcia, Meiyappan Nagappan

Published: 2016, Last Modified: 22 Jul 2025. Perspectives on Data Science for Software Engineering 2016. CC BY-SA 4.0

Abstract: With the popularity and availability of open source software (OSS) projects, software engineering (SE) researchers have made many advances in understanding how software is developed. As in any other scientific field, however, it is desirable in SE research to produce results, techniques, and tools that apply to a large number of software projects, ideally all of them. The ideal approach would be to randomly select a statistically significant sample of software projects. In practice, past SE studies have evaluated hypotheses on small samples of deliberately chosen OSS projects. More recently, an increasing number of SE researchers have started examining their hypotheses on larger datasets, which are deliberately chosen as well. The aim of these large-scale studies is to increase the generality of the results. However, generality may not be achieved if the sample of projects chosen for evaluation is homogeneous and not diverse with respect to the entire population of SE projects. In this chapter, we present the initial work on diversity and representativeness in SE research. We first define what we mean by diversity and representativeness. Then we present (a) a way to assess the quality of a given sample of projects with respect to diversity and representativeness and (b) a selection technique that allows one to tailor a sample with high diversity and representativeness.
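To make points (a) and (b) concrete, the following is a minimal illustrative sketch and not the chapter's actual method. It assumes a hypothetical scoring rule: a population project counts as "covered" when some sampled project lies within a fixed tolerance of it on every dimension, and a sample's score is the fraction of the population covered. The dimension names (`kloc`, `developers`), the tolerances, and all function names are assumptions made for illustration only.

```python
from typing import Dict, List, Sequence, Set

# A project is described by numeric dimensions (hypothetical names and values).
Project = Dict[str, float]


def is_similar(a: Project, b: Project, tolerances: Dict[str, float]) -> bool:
    """Assumed similarity rule: within the tolerance on every dimension."""
    return all(abs(a[dim] - b[dim]) <= tol for dim, tol in tolerances.items())


def covered_indices(population: Sequence[Project],
                    sample: Sequence[Project],
                    tolerances: Dict[str, float]) -> Set[int]:
    """Indices of population projects similar to at least one sampled project."""
    return {
        i for i, project in enumerate(population)
        if any(is_similar(project, s, tolerances) for s in sample)
    }


def coverage(population: Sequence[Project],
             sample: Sequence[Project],
             tolerances: Dict[str, float]) -> float:
    """(a) Score a sample: fraction of the population it covers."""
    if not population:
        return 0.0
    return len(covered_indices(population, sample, tolerances)) / len(population)


def greedy_select(population: Sequence[Project],
                  k: int,
                  tolerances: Dict[str, float]) -> List[int]:
    """(b) Pick up to k projects, each time adding the one that covers
    the most not-yet-covered projects. Ties go to the lowest index."""
    chosen: List[int] = []
    covered: Set[int] = set()
    for _ in range(k):
        best_index, best_gain = None, 0
        for i, candidate in enumerate(population):
            if i in chosen:
                continue
            gain = sum(
                1 for j, project in enumerate(population)
                if j not in covered and is_similar(project, candidate, tolerances)
            )
            if gain > best_gain:
                best_index, best_gain = i, gain
        if best_index is None:  # nothing left to gain
            break
        chosen.append(best_index)
        covered |= covered_indices(population, [population[best_index]], tolerances)
    return chosen


# Hypothetical population: three small, two medium, one very large project.
population = [
    {"kloc": 10, "developers": 3},
    {"kloc": 12, "developers": 4},
    {"kloc": 11, "developers": 3},
    {"kloc": 500, "developers": 40},
    {"kloc": 520, "developers": 42},
    {"kloc": 5000, "developers": 300},
]
tolerances = {"kloc": 50, "developers": 5}

homogeneous = population[:2]  # two small projects only
picked = greedy_select(population, 2, tolerances)
print(coverage(population, homogeneous, tolerances))
print(picked, coverage(population, [population[i] for i in picked], tolerances))
```

In this toy data, a sample of two similar small projects covers only half of the population, while a greedily tailored sample of the same size covers five of the six projects. Fixed absolute tolerances are a simplification; a real study would need a similarity rule suited to each dimension.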