Comprehensive Benchmark for Tailored Small Molecule-Binding Aptamer Design

20 Sept 2025 (modified: 11 Feb 2026). Submitted to ICLR 2026. License: CC BY 4.0
Keywords: aptamer, small molecule, binding, prediction, benchmark
TL;DR: We introduce a unified benchmark for aptamer–small molecule interactions, showing that aptamer sequence diversity is well covered while ligand representation remains the main challenge for predictive modeling and practical applications.
Abstract: Despite their growing role as recognition elements in diagnostics, therapeutics, and biosensing, aptamers remain overlooked by computational design tools compared to antibodies and protein binders. Current pipelines are fragmented and predominantly protein-focused, leaving small-molecule aptamer discovery underexplored. A key bottleneck has been the absence of a unified benchmark dataset that would allow systematic evaluation of predictive and generative models. To address this gap, we introduce the first comprehensive benchmark for aptamer–small molecule interactions, integrating eight curated sources into 6,413 annotated pairs covering 1,686 unique aptamers (DNA and RNA) and 1,041 chemically diverse ligands. More than 30% of the entries include quantitative binding affinities, enabling regression as well as binary classification. To demonstrate the utility of this resource, we report baseline results for shallow and deep learning models under multiple splitting protocols. Our analysis reveals two central findings. First, the diversity and coverage of aptamer sequences are sufficiently broad to support robust modeling, indicating that limitations in current approaches do not stem from the receptor side. Second, the primary bottleneck emerges from the ligand space: a comparatively small number of molecules spans a highly heterogeneous chemical landscape, substantially constraining model transferability to unseen targets. Since practical aptamer design ultimately requires generalization to novel small molecules, these observations underscore a fundamental representation challenge on the ligand side. By providing a standardized corpus, rigorous evaluation protocols, and reproducible baselines, our benchmark establishes a foundation for systematic progress in aptamer–small molecule prediction.
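To make the idea of evaluating on unseen ligands concrete, the following is a minimal sketch of a ligand-held-out split. It is an illustration only and not necessarily one of the splitting protocols used in the paper. The function name `ligand_held_out_split`, the record layout (aptamer sequence, ligand SMILES, binary label), and the toy records are all assumptions introduced here.

```python
import random
from collections import defaultdict


def ligand_held_out_split(pairs, test_fraction=0.2, seed=0):
    """Split records so that no ligand occurs in both train and test.

    `pairs` is a list of (aptamer_sequence, ligand_smiles, label) tuples.
    This layout is hypothetical and not taken from the benchmark itself.
    """
    # Group all records by their ligand identifier.
    by_ligand = defaultdict(list)
    for record in pairs:
        by_ligand[record[1]].append(record)

    # Shuffle the ligands (not the records) deterministically.
    ligands = sorted(by_ligand)
    random.Random(seed).shuffle(ligands)

    # Hold out a fraction of the ligands, at least one.
    n_test = max(1, round(len(ligands) * test_fraction))
    test = [r for lig in ligands[:n_test] for r in by_ligand[lig]]
    train = [r for lig in ligands[n_test:] for r in by_ligand[lig]]
    return train, test


# Toy records, invented for illustration only (not benchmark data).
pairs = [
    ("ACGTACGTAC", "CCO", 1),
    ("GGCATTACGA", "CCO", 0),
    ("ACGTACGTAC", "c1ccccc1", 0),
    ("UUGACCGAUA", "CC(=O)O", 1),
    ("GGCATTACGA", "CN", 1),
    ("UUGACCGAUA", "O=C=O", 0),
]

train, test = ligand_held_out_split(pairs, test_fraction=0.4, seed=1)
train_ligands = {r[1] for r in train}
test_ligands = {r[1] for r in test}
print("ligand overlap:", train_ligands & test_ligands)
```

Under such a split, every test ligand is new to the model, which is the setting the abstract identifies as the hard one. A random split over pairs would instead let the same ligand appear on both sides.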
Primary Area: datasets and benchmarks
Submission Number: 24877