Abstract: Infant and youth mortality has declined steadily over the years. However, many issues related to sociodemographic factors persist. In Portugal, while mortality forecasts are regularly disclosed to the general public by specialised public entities, very few studies have focused on the determinants of mortality, and none have taken advantage of the modelling capabilities of Machine Learning (ML) techniques. This work uses real-world data and a selection of these techniques to identify the main determinants of infant and youth mortality in Portugal. The data comprised 178 databases from various authorities, encompassing economic, demographic, environmental, health, education, and mortality variables at the municipal level. No individual-level data was available. Two approaches were proposed. In the first, the problem was framed as a regression problem, with the number of deaths as the target variable and the potential determinants as the predictors. Simple regression models were used, mainly for their interpretability. A neural network was also employed to enable a comparison between linear and nonlinear models. Feature elimination and feature selection methods were devised to ascertain which variables were the most relevant. These include a feature selection method custom-designed for the problem at hand, which proved particularly effective, as it led to performance improvements for every model used in this work. The second approach used the K-means clustering technique to determine which of the previously selected variables led to better clusters when combined with the number of deaths and with the mortality rate. To this end, the silhouette method was chosen as the evaluation metric. The best regression model achieved an R² of 0.846. The foreign population with legal status of residence in the parents’ place of residence and the average monthly earnings of employees were shown to be the features with the greatest impact on mortality.
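The second approach described above (clustering each candidate variable together with the number of deaths and scoring the result with the silhouette method) can be illustrated with a minimal sketch. It is not the authors' pipeline: the feature names `feature_a` and `feature_b`, the eight municipal values, the choice of two clusters, the min-max scaling, and the deterministic centroid seeding are all assumptions made purely for illustration, and only the number of deaths (not the mortality rate) is used as the target.

```python
import math

def scale(column):
    """Min-max scale a column to [0, 1]; assumes it is not constant."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]

def kmeans(points, k, iters=50):
    """Plain Lloyd's algorithm with a deterministic (assumed) seeding."""
    step = len(points) // k
    centroids = [points[i * step] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels

def silhouette(points, labels):
    """Mean silhouette coefficient; points in singleton clusters score 0."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    total = 0.0
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != l)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)

# Hypothetical municipal-level values, invented for this example.
deaths = [1, 2, 1, 2, 9, 10, 9, 10]
features = {
    "feature_a": [10, 11, 10, 12, 50, 52, 51, 49],  # tracks deaths closely
    "feature_b": [30, 5, 48, 17, 25, 41, 8, 36],    # unrelated to deaths
}

scores = {}
for name, values in features.items():
    # Cluster on (feature, deaths) pairs and score the partition.
    pairs = list(zip(scale(values), scale(deaths)))
    scores[name] = silhouette(pairs, kmeans(pairs, k=2))

for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: silhouette = {score:.3f}")
```

Under these assumptions, a variable that forms tight, well-separated groups together with the target receives a higher silhouette score than one that does not, which is the ranking criterion the abstract describes.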