Scalable Diversity-Aware Feature Scoring for Biomedical Big Data via Hypercube-Based Density Estimation
Confirmation: I have read and agree with the IEEE BHI 2025 conference submission's policy on behalf of myself and my co-authors.
Keywords: Cheminformatics, Molecular Diversity, Virtual Screening, MapReduce Algorithms, Apache Spark, Cloud Computing.
Abstract: Efficiently quantifying molecular diversity is essential
for high-throughput virtual screening in early-stage drug
discovery and cheminformatics pipelines. However, classical diversity
metrics that rely on pairwise distance computations are
computationally prohibitive at the scale of modern molecular
libraries containing hundreds of millions of compounds. This paper
presents a fast, scalable diversity scoring framework based on
hypercube partitioning and MapReduce, implemented in Apache
Spark and designed for ultra-large descriptor spaces. Each
molecule is embedded in a normalized high-dimensional descriptor
space, assigned to a discrete hypercube, and scored inversely
to its local cell occupancy, which approximates structural novelty without
pairwise distance computations. The method achieves linear
runtime scaling and stable memory usage across cloud clusters,
validated up to 200 million molecules. We demonstrate integration
with diversity-constrained compound selection, where our
score functions as a penalty term in bioactivity optimization.
While motivated by cheminformatics, the framework generalizes
to other biomedical domains, including genomic feature selection
and high-dimensional clustering in computational biology. This
work provides a cloud-ready, domain-agnostic diversity scoring
method for scalable screening applications in cheminformatics
and biomedicine.
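
The scoring idea described in the abstract can be illustrated with a minimal single-machine sketch. It is not the authors' Spark implementation; the names `cell_of` and `diversity_scores`, the parameter `bins_per_dim`, the equal-width binning, and the exact score `1 / occupancy` are assumptions chosen for illustration. Descriptors are assumed to be already normalized to [0, 1].

```python
from collections import Counter
from typing import List, Sequence, Tuple

Cell = Tuple[int, ...]


def cell_of(descriptor: Sequence[float], bins_per_dim: int) -> Cell:
    """Map a descriptor in [0, 1]^d to the index of its hypercube cell.

    Equal-width binning is an assumption of this sketch.
    """
    # min(...) keeps the boundary value 1.0 inside the last bin.
    return tuple(min(int(x * bins_per_dim), bins_per_dim - 1) for x in descriptor)


def diversity_scores(
    descriptors: Sequence[Sequence[float]], bins_per_dim: int = 4
) -> List[float]:
    """Score each molecule by the inverse occupancy of its cell.

    No pairwise distances are computed: one pass assigns cells,
    one pass counts them, one pass looks up the counts.
    """
    cells = [cell_of(d, bins_per_dim) for d in descriptors]  # "map" step
    occupancy = Counter(cells)                               # "reduce" step
    return [1.0 / occupancy[c] for c in cells]               # hypothetical score


# Toy data: three near-duplicates in one cell and one isolated molecule.
molecules = [
    (0.10, 0.20),
    (0.12, 0.22),
    (0.15, 0.24),
    (0.90, 0.85),
]
scores = diversity_scores(molecules, bins_per_dim=4)
for mol, score in zip(molecules, scores):
    print(mol, round(score, 3))
```

In this toy run the three clustered molecules share a cell and receive a lower score than the isolated one. In a distributed setting the same two passes would plausibly correspond to emitting `(cell, 1)` pairs and summing them per key, for example with PySpark's `reduceByKey`, though the abstract does not specify how the counting is implemented.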
Track: 7. General Track
Registration Id: 42NXW6ND396
Submission Number: 329