Abstract: In this study, we introduce a novel de Bruijn graph (dBG) based framework for feature engineering
in biological sequential data such as proteins. This framework simplifies feature extraction by
dynamically generating high-quality, interpretable features for traditional AI (TAI) algorithms.
Our framework accounts for amino acid substitutions by efficiently adjusting the edge weights in
the dBG using a secondary trie structure. We extract motifs from the dBG by traversing the heavy
edges, and then incorporate alignment algorithms like BLAST and Smith–Waterman to generate
features for TAI algorithms. Empirical validation on TIMP (tissue inhibitors of matrix
metalloproteinase) data demonstrates significant accuracy improvements over a robust baseline,
state-of-the-art PLMs, and the popular GLAM2 tool. Furthermore, our
framework successfully identified glycine- and arginine-rich motifs with high coverage,
highlighting its potential in general pattern discovery.
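
To make the core idea concrete, the following is a minimal sketch, not the authors' implementation, of building a weighted dBG from protein sequences and reading a motif off its heavy edges. The function names `build_dbg` and `heaviest_path`, the `min_weight` threshold, and the greedy forward-only traversal are assumptions made for illustration. The substitution-aware weight adjustment via a secondary trie and the downstream alignment step are not reproduced here.

```python
from collections import defaultdict


def build_dbg(sequences, k):
    """Build a weighted de Bruijn graph.

    Nodes are (k-1)-mers; each k-mer contributes one unit of weight to the
    edge from its prefix to its suffix. (Hypothetical helper, exact counts
    only: no substitution-aware weighting.)
    """
    edges = defaultdict(int)
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            edges[(kmer[:-1], kmer[1:])] += 1
    return dict(edges)


def heaviest_path(edges, min_weight):
    """Greedily follow heavy edges starting from the heaviest one.

    Assumed traversal rule: extend forward along the heaviest unvisited
    outgoing edge whose weight is at least min_weight.
    """
    if not edges:
        return ""
    # Adjacency list restricted to edges that pass the weight threshold.
    out = defaultdict(list)
    for (u, v), w in edges.items():
        if w >= min_weight:
            out[u].append((w, v))

    (u, v), w = max(edges.items(), key=lambda item: item[1])
    if w < min_weight:
        return ""

    motif = u + v[-1]
    visited = {(u, v)}
    node = v
    while True:
        candidates = [(w, nxt) for w, nxt in out[node]
                      if (node, nxt) not in visited]
        if not candidates:
            break
        _, nxt = max(candidates)
        visited.add((node, nxt))
        motif += nxt[-1]  # each edge adds one residue to the motif
        node = nxt
    return motif


# Toy sequences sharing a glycine/arginine-rich core (made-up data).
toy = ["AAGRGRGA", "CCGRGRGT", "TTGRGRGC"]
graph = build_dbg(toy, k=3)
print(heaviest_path(graph, min_weight=3))
```

On this toy input the shared core dominates the edge weights, so the traversal stays inside it and ignores the flanking residues that occur only once. A fuller version would also extend backward from the seed edge and merge overlapping paths.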