Scalable Diversity-Aware Feature Scoring for Biomedical Big Data via Hypercube-Based Density Estimation

Published: 19 Aug 2025, Last Modified: 12 Oct 2025, Venue: BHI 2025, License: CC BY 4.0
Confirmation: I have read and agree with the IEEE BHI 2025 conference submission's policy on behalf of myself and my co-authors.
Keywords: Cheminformatics, Molecular Diversity, Virtual Screening, MapReduce Algorithms, Apache Spark and Cloud Computing.
Abstract: Efficiently quantifying molecular diversity is essential for high-throughput virtual screening in early-stage drug discovery and cheminformatics pipelines. However, classical diversity metrics that rely on pairwise distance computations are computationally prohibitive at the scale of modern molecular libraries containing hundreds of millions of compounds. This paper presents a fast, scalable diversity scoring framework based on hypercube partitioning and MapReduce, implemented in Apache Spark and designed for ultra-large descriptor spaces. Each molecule is embedded in a normalized high-dimensional descriptor space, assigned to a discrete hypercube, and given a score inversely related to the occupancy of its cell, which approximates structural novelty without pairwise distance computations. The method achieves linear runtime scaling and stable memory usage across cloud clusters, and is validated on up to 200 million molecules. We demonstrate integration with diversity-constrained compound selection, where our score functions as a penalty term in bioactivity optimization. While motivated by cheminformatics, the framework generalizes to other biomedical domains, including genomic feature selection and high-dimensional clustering in computational biology. This work provides a cloud-ready, domain-agnostic diversity scoring method for scalable screening applications in cheminformatics and biomedicine.
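The scoring idea in the abstract can be illustrated with a minimal single-machine sketch. It is not the authors' Spark implementation: the function names `cell_index` and `diversity_scores`, the uniform `bins` per dimension, and the exact score `1 / occupancy` are assumptions made for illustration, since the abstract only states that the score is inverse to local cell occupancy. Descriptors are assumed to be already normalized to [0, 1].

```python
from collections import Counter
from typing import List, Sequence, Tuple


def cell_index(x: Sequence[float], bins: int) -> Tuple[int, ...]:
    """Map a descriptor vector in [0, 1]^d to the index of its hypercube cell."""
    # min(..., bins - 1) keeps the boundary value 1.0 inside the last cell.
    return tuple(min(int(v * bins), bins - 1) for v in x)


def diversity_scores(points: Sequence[Sequence[float]], bins: int = 4) -> List[float]:
    """Score each point by the inverse occupancy of its cell (assumed form)."""
    # "Map" step: key every molecule by its hypercube cell.
    cells = [cell_index(p, bins) for p in points]
    # "Reduce" step: count how many molecules fall into each cell.
    occupancy = Counter(cells)
    # Sparse cells give high scores; no pairwise distances are computed.
    return [1.0 / occupancy[c] for c in cells]


# Hypothetical 2-D descriptors: two near-duplicates and one outlier.
descriptors = [
    (0.10, 0.12),
    (0.15, 0.20),
    (0.90, 0.85),
]
scores = diversity_scores(descriptors, bins=4)
print(scores)  # [0.5, 0.5, 1.0]
```

Each molecule is touched a constant number of times, which is consistent with the linear runtime the abstract reports. In a Spark setting, the two steps would plausibly correspond to mapping each molecule to a `(cell, 1)` pair and aggregating with `reduceByKey`, although the abstract does not confirm these details.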
Track: 7. General Track
Registration Id: 42NXW6ND396
Submission Number: 329