An extended de Bruijn graph for feature engineering over biological sequential data

Mert Onur Cakiroglu, Hasan Kurban, Parichit Sharma, Muhammed Oguzhan Kulekci, Elham Khorasani Buxton, Maryam Raeeszadeh-Sarmazdeh, Mehmet Dalkilic

Published: 19 Jul 2024, Last Modified: 13 Dec 2024, Machine Learning: Science & Technology, CC BY 4.0

Abstract: In this study, we introduce a novel de Bruijn graph (dBG) based framework for feature engineering in biological sequential data such as proteins. This framework simplifies feature extraction by dynamically generating high-quality, interpretable features for traditional AI (TAI) algorithms. Our framework accounts for amino acid substitutions by efficiently adjusting the edge weights in the dBG using a secondary trie structure. We extract motifs from the dBG by traversing the heavy edges, and then incorporate alignment algorithms such as BLAST and Smith–Waterman to generate features for TAI algorithms. Empirical validation on TIMP (tissue inhibitors of matrix metalloproteinase) data demonstrates significant accuracy improvements over a robust baseline, state-of-the-art PLMs, and the popular GLAM2 tool. Furthermore, our framework successfully identified Glycine- and Arginine-rich motifs with high coverage, highlighting its potential for general pattern discovery.
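To make the motif-extraction idea concrete, the following is a minimal sketch of a weighted de Bruijn graph and a greedy walk along its heavy edges. It is an illustration under stated assumptions, not the authors' implementation: the function names (`build_dbg`, `heavy_motif`), the parameters `k`, `min_weight`, and `max_len`, and the toy sequences are all hypothetical. Edge weights here are plain k-mer counts; the trie-based substitution adjustment and the alignment-based feature generation described in the abstract are not covered.

```python
from collections import defaultdict


def build_dbg(sequences, k):
    """Build a weighted de Bruijn graph.

    Nodes are (k-1)-mers; each k-mer adds 1 to the weight of the edge
    from its prefix to its suffix. (Assumption: raw counts as weights.)
    """
    graph = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            graph[kmer[:-1]][kmer[1:]] += 1
    return graph


def heavy_motif(graph, min_weight=2, max_len=20):
    """Greedy forward walk over heavy edges.

    Starts at the first heaviest edge (in insertion order) and keeps
    following the heaviest unvisited outgoing edge whose weight is at
    least min_weight. Returns the spelled-out motif, or "" if no edge
    is heavy enough.
    """
    edges = [(w, u, v) for u, nbrs in graph.items() for v, w in nbrs.items()]
    if not edges:
        return ""
    w, u, v = max(edges, key=lambda e: e[0])
    if w < min_weight:
        return ""
    motif, node, visited = u + v[-1], v, {u, v}
    while len(motif) < max_len:
        candidates = [
            (wt, nxt)
            for nxt, wt in graph.get(node, {}).items()
            if wt >= min_weight and nxt not in visited
        ]
        if not candidates:
            break
        _, node = max(candidates, key=lambda c: c[0])
        visited.add(node)
        motif += node[-1]  # each step appends one residue
    return motif


# Toy sequences (made up for illustration) sharing the segment "GRGD".
toy_sequences = ["MAGRGDK", "LSGRGDT", "PQGRGDE"]
toy_graph = build_dbg(toy_sequences, k=3)
print(heavy_motif(toy_graph, min_weight=2))
```

The walk is forward-only and greedy, so ties between equally heavy edges are resolved by insertion order; a fuller treatment would also extend the path backward and score alternative paths.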