BounDr.E: Predicting Drug-likeness via Biomedical Knowledge Alignment and EM-like One-Class Boundary Optimization
TL;DR: A drug-likeness prediction framework based on EM-like one-class boundary optimization, with multi-modal alignment of a biomedical knowledge graph and the structural space.
Abstract: The advent of generative AI now enables large-scale $\textit{de novo}$ design of molecules, but identifying viable drug candidates among them remains an open problem. Existing drug-likeness prediction methods often rely on ambiguous negative sets or purely structural features, limiting their ability to accurately distinguish drugs from non-drugs. In this work, we introduce BounDr.E: a novel modeling of drug-likeness as a compact space surrounding approved drugs through a dynamic one-class boundary approach. Specifically, we enrich the chemical space through biomedical knowledge alignment, and then iteratively tighten the drug-like boundary by pushing non-drug-like compounds outside via an Expectation-Maximization (EM)-like process. Empirically, BounDr.E achieves a 10\% F1-score improvement over the previous state-of-the-art and demonstrates robust cross-dataset performance, including zero-shot toxic compound filtering. Additionally, we showcase its effectiveness through comprehensive case studies in large-scale $\textit{in silico}$ screening. Our code and constructed benchmark data under various schemes are provided at: https://github.com/eugenebang/boundr_e.
Lay Summary: The rapid advancement of generative models has enabled the creation of large libraries of de novo molecules, yet assessing which of these are truly drug-like remains an unresolved challenge. Traditional rules and property-based filters offer only coarse approximations, and most learning-based models lack integration of biological context, relying heavily on molecular structure alone. Furthermore, the highly scattered nature of approved drugs in chemical space makes it difficult to define a boundary that captures drug-likeness without overgeneralization.
To address this, we propose BounDr.E, a deep one-class boundary learner that defines drug-likeness as a compact, data-driven region around approved drugs, without relying on negative samples. Our method iteratively refines this region via an Expectation-Maximization-like optimization and embeds molecules into a unified space that integrates both structural and biomedical knowledge through multi-modal mixup.
Empirical results show strong and consistent performance across time-based, scaffold-based, and cross-dataset evaluations, as well as in zero-shot toxic compound filtering. These findings suggest that BounDr.E provides a robust and biologically grounded framework for drug-likeness prediction, offering a scalable solution for prioritizing AI-generated compounds in early-stage drug discovery.
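The idea of iteratively tightening a one-class boundary can be illustrated with a toy sketch. This is not the BounDr.E implementation: it assumes fixed 2-D points in place of learned embeddings, a hypersphere as the boundary, and a simple geometric update instead of training an encoder. The names `em_like_tighten`, `boundary_radius`, and the parameters `lr`, `margin`, and `q` are hypothetical and chosen only for illustration.

```python
import math

def dist(p, c):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, c)))

def centroid(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def boundary_radius(drugs, center, q=0.9):
    """Smallest radius that covers a fraction q of the drug points."""
    d = sorted(dist(p, center) for p in drugs)
    return d[min(len(d) - 1, math.ceil(q * len(d)) - 1)]

def em_like_tighten(drugs, others, steps=10, lr=0.2, margin=0.1, q=0.9):
    """Toy EM-like loop (an assumption, not the paper's algorithm)."""
    drugs = [list(p) for p in drugs]
    others = [list(p) for p in others]
    for _ in range(steps):
        # E-like step: with embeddings fixed, re-estimate the boundary.
        center = centroid(drugs)
        radius = boundary_radius(drugs, center, q)
        # M-like step: with the boundary fixed, move the embeddings.
        # Drugs are pulled toward the center, which compacts the region.
        for p in drugs:
            for i in range(len(p)):
                p[i] += lr * (center[i] - p[i])
        # Non-drug points that fall inside the boundary are pushed outward.
        for p in others:
            d = dist(p, center)
            if 0.0 < d <= radius:
                step = lr * (radius + margin - d) / d
                for i in range(len(p)):
                    p[i] += step * (p[i] - center[i])
    center = centroid(drugs)
    return center, boundary_radius(drugs, center, q), others

# Hypothetical toy data: "approved drugs" near the origin, other compounds around them.
DRUGS = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]
OTHERS = [(0.5, 0.5), (2.0, 2.0), (-0.3, 0.2)]

center, radius, moved = em_like_tighten(DRUGS, OTHERS)
inside = [dist(p, center) <= radius for p in moved]
```

In this toy setting the boundary shrinks at every iteration while the non-drug points that started inside it end up outside, which mirrors the intuition in the summary above. In the actual method the update would act on model parameters rather than on the points directly.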
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Primary Area: Applications->Health / Medicine
Keywords: Drug-likeness, Expectation-Maximization, Multi-modal learning, AI Drug Discovery
Submission Number: 15634