DELBERT-2: Pretrained Fingerprint Language Models for DEL Protein Binder Prediction

Bing Xu Hu; Sun Sun; Shaik salman basha; Anita Layton; Helen Hong Chen

Published: 28 May 2026; Last Modified: 28 May 2026; Venue: ICML 2026 FM4LS Workshop Poster; License: CC BY 4.0

Keywords: Molecular Foundation Models, Chemical Fingerprints, Protein Binder Prediction

Abstract: We present DELBERT-2, a pretrained fingerprint language model that converts sparse molecular fingerprints into unified token sequences, using a ModernBERT encoder trained with masked language modelling on 2.5M molecules from the AIRCHECK DEL corpus. We evaluate on six targets (WDR91, WDR12, SETDB1, PLCZ1, LRRK2, DCAF7) under three out-of-distribution (OOD) protocols: hierarchical cluster splits probe chemical novelty, library splits test cross-library transfer, and building-block splits assess compositional generalization. DELBERT-2 consistently improves PR-AUC and NDCG@1000 relative to LightGBM ensemble baselines and transformers trained from scratch, with the largest gains in the most stringent OOD regimes. DELBERT-2 achieves an enrichment factor at top-100 of 13.28 ± 3.72 vs. 12.36 ± 3.75 without pretraining (p = 0.040), a 7.4% relative improvement and roughly 13x enrichment over random selection. These results demonstrate that fingerprint-centric self-supervised learning improves hit prioritization under distribution shift, enabling practical DEL virtual screening.
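For readers unfamiliar with the enrichment-factor metric quoted in the abstract, the following is a minimal sketch of how an enrichment factor at top-k is commonly computed: the hit rate among the k highest-scored compounds divided by the hit rate in the whole screened set. The abstract does not state the exact formula or averaging procedure used, so this definition, the function name `enrichment_factor`, and the toy data are assumptions for illustration only.

```python
def enrichment_factor(scores, labels, k=100):
    """Enrichment factor at top-k (common definition, assumed here).

    scores: predicted binding scores, higher means more likely a binder.
    labels: 1 for a true binder, 0 otherwise (same order as scores).
    Returns (hit rate in top-k) / (hit rate in the full set).
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    n = len(scores)
    total_hits = sum(labels)
    if n == 0 or total_hits == 0 or k <= 0:
        raise ValueError("need a non-empty set, at least one hit, and k > 0")
    k = min(k, n)  # cannot select more compounds than exist
    # Rank compound indices by descending score and keep the top k.
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    top_hits = sum(labels[i] for i in ranked[:k])
    return (top_hits / k) / (total_hits / n)


def relative_improvement(new, baseline):
    """Relative gain of `new` over `baseline`, as a fraction."""
    return (new - baseline) / baseline


# Toy example (hypothetical data): both binders are ranked in the top 2 of 10.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
toy_labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
toy_ef = enrichment_factor(toy_scores, toy_labels, k=2)

# Relative improvement implied by the two means reported in the abstract.
reported_gain = relative_improvement(13.28, 12.36)
```

Under this definition a random ranking has an expected enrichment factor of 1, which is why a value near 13 reads as roughly 13x enrichment over random selection, and the two reported means differ by about 7.4% in relative terms.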

Submission Number: 44
