Generating Investigative Leads from Forensic DNA Data: Mapping Y-STR Profiles to Ancestral Haplogroups

Published: 13 Dec 2025, Last Modified: 16 Jan 2026 | AILaw26 | CC BY-NC-SA 4.0
Keywords: XGBoost Machine Learning Model, SHAP Values Feature Importance, Forensic Y-STR DNA Analysis, Haplogroup Ancestral Lineage Typing, Law Enforcement, Police Investigation Lead Generation
Paper Type: Full papers
TL;DR: This project aims to use machine learning models to predict Y-SNP haplogroups from forensic Y-STR DNA data, generating investigative leads for law enforcement.
Abstract: Genetic markers, particularly Y-chromosome short tandem repeats (Y-STRs), play a critical role in forensic investigations. Because Y-STRs are inherited strictly along the paternal line, they can help differentiate male lineages. However, their forensic value depends heavily on the availability of reference profiles in population databases. When no corresponding entry exists, a generated Y-STR profile cannot be used for direct identification. To address this gap, this study investigates the features critical to developing a machine learning framework for predicting Y-chromosome single nucleotide polymorphism (Y-SNP) haplogroups from standard Y-STR profiles. Haplogroup prediction provides information on paternal lineage ancestry, enabling the generation of intelligence leads useful for police investigations. After a comprehensive evaluation of multiple supervised classifiers on a dataset of 4,064 Y-STR profiles, the optimized XGBoost (Extreme Gradient Boosting) classifier was selected for its superior predictive performance, achieving the highest overall accuracy of 96.98% and a macro F1-score of 0.9810. Critically, the framework employs stratified sampling and class weighting to mitigate inherent demographic data imbalance, ensuring fairness across underrepresented ancestral groups. Furthermore, the integration of SHAP (SHapley Additive exPlanations) provides the model interpretability needed to support ethical and legal requirements for deployment in police investigations, thus advancing the paradigm of trustworthy AI in law enforcement.
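The abstract names stratified sampling, class weighting, and the macro F1-score without showing how they interact on imbalanced haplogroup labels. The following is a minimal, dependency-free sketch of those three ideas, not the authors' implementation: the function names, the 80/15/5 label mix, and the 20% test fraction are all assumptions made for illustration, and no XGBoost model is trained here. The weighting rule shown (n_samples / (n_classes * class_count)) is the common "balanced" heuristic, which may differ from the scheme used in the paper.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), keeping each class's share similar in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        # Hold out at least one sample per class, unless the class has only one.
        n_test = max(1, round(len(idxs) * test_fraction)) if len(idxs) > 1 else 0
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return sorted(train_idx), sorted(test_idx)


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {y: n / (k * c) for y, c in counts.items()}


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    scores = []
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical, made-up label distribution (not taken from the paper's dataset).
labels = ["R1a"] * 80 + ["O2"] * 15 + ["Q"] * 5
train_idx, test_idx = stratified_split(labels, test_fraction=0.2, seed=0)

# Per-sample weights for the training set; rare haplogroups get larger weights.
weights = balanced_class_weights([labels[i] for i in train_idx])
sample_weights = [weights[labels[i]] for i in train_idx]
```

In a full pipeline, per-sample weights like these would typically be passed to the classifier's training call. The macro F1 function also shows why accuracy alone can mislead on imbalanced classes: a model that always predicts the majority class on `["a", "a", "b"]` is two-thirds accurate but scores only 0.4 macro F1.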
Poster PDF: pdf
Submission Number: 54