Enhancing Intra-Continental Biogeographical Ancestry Prediction Through a Machine Learning Marker Selection Method

Theresa Maurer, Lennart Purucker, Frank Hutter, Peter Pfaffelhuber, Carola Sophia Heinzel

Published: 10 Nov 2025, Last Modified: 21 Jan 2026, Crossref, CC BY-SA 4.0
Abstract: While classifiers such as TabPFN (Hollmann et al., 2025) and SNIPPER (Phillips et al., 2007a) achieve strong intercontinental performance (Heinzel et al., 2025), their accuracy in classifying individuals within Europe remains low. One major factor contributing to this limitation is the set of genetic markers used for classification. Marker panels such as the VISAGE Enhanced Tool (Xavier et al., 2022) are commonly employed in forensic genetics because they contain ancestry-informative markers (AIMs) that distinguish very well between major continental populations. However, these panels are often not optimized for fine-scale differentiation within continents, where genetic variation is more subtle and population structure is rather continuous.

We apply machine learning to select informative markers for intra-European classification, using data from Consortium et al. (2015). Compared with the VISAGE Enhanced Tool and allele frequency–based approaches (Phillips et al., 2007b; Kosoy et al., 2009; Nassir et al., 2009; Kidd et al., 2014; Phillips et al., 2014a), our marker sets achieve substantially higher accuracy within Europe: For four European populations, accuracy improves from 68.2% (VISAGE, 104 markers) to 73.7% (100 new markers) and 82.3% (200 new markers). For five populations, accuracy rises from 56.1% (VISAGE) to 64.5% (100 new markers).

Our results show that tailored marker selection markedly improves intra-continental classification. While optimized here for Europe, the method can be applied to any region with sufficient training data.
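As a point of reference for the allele frequency-based baselines mentioned in the abstract, the following minimal sketch ranks markers by the largest pairwise difference in allele frequency between populations and keeps the top k. It is an illustration of that general family of approaches only, not the authors' machine learning selection method. The function names (`allele_frequencies`, `rank_markers_by_delta`), the 0/1/2 genotype encoding, and the toy data are all assumptions introduced here.

```python
from itertools import combinations


def allele_frequencies(genotypes_by_pop):
    """Per-population frequency of the counted allele at each marker.

    genotypes_by_pop maps a population label to a list of individuals;
    each individual is a list of allele counts (0, 1 or 2) per marker.
    This encoding is an assumption made for the sketch.
    """
    freqs = {}
    for pop, individuals in genotypes_by_pop.items():
        n_markers = len(individuals[0])
        n_alleles = 2 * len(individuals)  # diploid: two alleles per person
        freqs[pop] = [
            sum(ind[m] for ind in individuals) / n_alleles
            for m in range(n_markers)
        ]
    return freqs


def rank_markers_by_delta(genotypes_by_pop, k):
    """Return indices of the k markers with the largest maximum pairwise
    absolute allele-frequency difference between populations."""
    freqs = allele_frequencies(genotypes_by_pop)
    n_markers = len(next(iter(freqs.values())))
    scores = []
    for m in range(n_markers):
        delta = max(
            abs(freqs[a][m] - freqs[b][m])
            for a, b in combinations(freqs, 2)
        )
        scores.append((delta, m))
    # Largest difference first; ties broken by marker index.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [m for _, m in scores[:k]]


# Hypothetical toy data: two populations, three individuals, three markers.
toy = {
    "pop_A": [[0, 2, 1], [0, 2, 1], [1, 2, 0]],
    "pop_B": [[0, 0, 1], [1, 0, 1], [0, 1, 1]],
}
selected = rank_markers_by_delta(toy, k=2)
print(selected)
```

A ranking of this kind scores each marker in isolation. A learned selection procedure, as described in the abstract, can in principle account for how markers work together in a classifier, which is one plausible reason the two approaches yield different panels.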