Decision Tree Induction with Dynamic Feature Generation: A Framework for Interpretable DNA Sequence Analysis

Nicolas Huynh; Krzysztof Kacprzyk; Ryan M Sheridan; David L. Bentley; Mihaela van der Schaar

Published: 06 Mar 2025, Last Modified: 21 Jul 2025, Venue: ICLR 2025 Workshop LMRL, License: CC BY 4.0

Track: Full Paper Track

Keywords: DNA sequences, decision tree, interpretable

Abstract: The analysis of DNA sequences has become increasingly critical in numerous fields, from evolutionary biology to understanding gene regulation and disease mechanisms. While machine learning approaches to DNA sequence classification, particularly deep neural networks, achieve remarkable performance, they typically operate as black boxes, severely limiting their utility for scientific discovery and biological insight. Decision trees offer a promising direction for interpretable DNA sequence analysis, yet they suffer from a fundamental limitation: considering individual raw features in isolation at each split limits their expressivity, which results in prohibitive tree depths that hinder both interpretability and generalization performance. We address this challenge by introducing $\texttt{DEFT}$, a novel framework that adaptively generates high-level sequence features during tree construction. $\texttt{DEFT}$ leverages large language models to propose biologically informed features tailored to the local sequence distributions at each node and to iteratively refine them with a reflection mechanism. Through a comprehensive case study on RNA polymerase II pausing prediction, we demonstrate that $\texttt{DEFT}$ discovers human-interpretable sequence features that are highly predictive of pausing, providing insights into this complex phenomenon.

Attendance: Nicolas Huynh

Submission Number: 37
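
Illustrative sketch: The abstract describes generating candidate features at each node from the local sequence distribution, rather than splitting on raw positions. The Python sketch below illustrates only that control flow and is not the authors' implementation. The function `propose_features` is a hypothetical stand-in for the LLM proposer (here it simply enumerates k-mers present in the node's sequences), the Gini criterion and the toy data are assumptions, and the reflection step is omitted.

```python
from collections import Counter
from typing import Callable, List, Tuple

# A feature is a human-readable name plus a boolean test on a sequence.
Feature = Tuple[str, Callable[[str], bool]]


def propose_features(seqs: List[str], labels: List[int], k: int = 3) -> List[Feature]:
    """Hypothetical stand-in for the LLM proposer (assumption, not from the paper).

    Returns one "contains k-mer" feature per k-mer seen in the node's sequences.
    `labels` is unused here but is passed so a real proposer could condition on it.
    """
    kmers = sorted({s[i:i + k] for s in seqs for i in range(len(s) - k + 1)})
    return [(f"contains {m}", lambda s, m=m: m in s) for m in kmers]


def gini(labels: List[int]) -> float:
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def build_tree(seqs: List[str], labels: List[int], depth: int = 0, max_depth: int = 2) -> dict:
    """Grow a tree, asking the proposer for fresh features at every node."""
    majority = Counter(labels).most_common(1)[0][0]
    if depth == max_depth or len(set(labels)) == 1:
        return {"leaf": majority}

    best = None  # (weighted impurity, name, predicate, mask)
    for name, pred in propose_features(seqs, labels):
        mask = [pred(s) for s in seqs]
        yes = [y for y, m in zip(labels, mask) if m]
        no = [y for y, m in zip(labels, mask) if not m]
        if not yes or not no:
            continue  # feature does not split this node
        score = (len(yes) * gini(yes) + len(no) * gini(no)) / len(labels)
        if best is None or score < best[0]:
            best = (score, name, pred, mask)

    if best is None:
        return {"leaf": majority}

    _, name, pred, mask = best
    yes_s = [s for s, m in zip(seqs, mask) if m]
    yes_y = [y for y, m in zip(labels, mask) if m]
    no_s = [s for s, m in zip(seqs, mask) if not m]
    no_y = [y for y, m in zip(labels, mask) if not m]
    return {
        "feature": name,
        "test": pred,
        "yes": build_tree(yes_s, yes_y, depth + 1, max_depth),
        "no": build_tree(no_s, no_y, depth + 1, max_depth),
    }


def predict(tree: dict, seq: str) -> int:
    """Follow feature tests from the root down to a leaf label."""
    while "leaf" not in tree:
        tree = tree["yes"] if tree["test"](seq) else tree["no"]
    return tree["leaf"]


# Toy data (made up for illustration): label 1 if the sequence carries an ATA motif.
seqs = ["ACGTATA", "TTTATAC", "GGGCCCA", "CCCGGGT"]
labels = [1, 1, 0, 0]
tree = build_tree(seqs, labels)
print(tree["feature"])
```

Because the proposer is called inside `build_tree`, each node sees only the sequences routed to it, which is the property the abstract attributes to node-local feature generation. Swapping the k-mer enumerator for a language-model call would leave the rest of the loop unchanged.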
