SafeBench-Seq: A Homology-Clustered, CPU-Only Baseline for Protein Hazard Screening with Physicochemical/Composition Features and Cluster-Aware Confidence Intervals
Keywords: Protein hazard screening, Biosecurity screening, Sequence-level classification, Homology-clustered evaluation, CPU-only baseline
TL;DR: SafeBench-Seq is a CPU-only, reproducible baseline for protein hazard screening that uses homology-clustered splits, showing random splits overstate robustness and can mislead biosecurity decisions.
Abstract: Foundation models for protein design raise concrete biosecurity risks, yet the community lacks a simple, reproducible baseline for sequence-level hazard screening that is explicitly evaluated under homology control and runs on commodity CPUs. We introduce SafeBench-Seq, a metadata-only, reproducible benchmark and baseline classifier built entirely from public data (SafeProtein hazards and UniProt benigns) and interpretable features (global physicochemical descriptors and amino-acid composition). To approximate “never-before-seen” threats, we homology-cluster the combined dataset at 40% identity and perform cluster-level holdouts (no cluster overlap between train/test). We report discrimination (AUROC/AUPRC) and screening-operating points (TPR@1% FPR; FPR@95% TPR) with 95% bootstrap confidence intervals (n=200), and we provide calibrated probabilities via CalibratedClassifierCV (isotonic for Logistic Regression / Random Forest; Platt sigmoid for Linear SVM). We quantify probability quality using Brier score, Expected Calibration Error (ECE; 15 bins), and reliability diagrams. Shortcut susceptibility is probed via composition-preserving residue shuffles and length-/composition-only ablations. Empirically, random splits substantially overestimate robustness relative to homology-clustered evaluation; calibrated linear models exhibit comparatively good calibration, while tree ensembles retain slightly higher Brier/ECE. SafeBench-Seq is CPU-only, reproducible in Colab, and releases metadata only (accessions, cluster IDs, split labels), enabling rigorous evaluation without distributing hazardous sequences. Code & metadata: Anonymous GitHub repo (link redacted for double-blind review; will be provided in the camera-ready version).
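The evaluation protocol described in the abstract can be illustrated with a minimal sketch. It is not the released code: the function names (cluster_holdout_split, tpr_at_fpr, cluster_bootstrap_ci) are hypothetical, cluster IDs are assumed to be precomputed, and resampling whole clusters is only one plausible reading of "cluster-aware" confidence intervals. The percentile interval is deliberately crude, and the sketch uses only the Python standard library.

```python
import random


def cluster_holdout_split(cluster_ids, test_fraction=0.2, seed=0):
    """Assign whole clusters to train or test so no cluster spans both."""
    clusters = sorted(set(cluster_ids))
    rng = random.Random(seed)
    rng.shuffle(clusters)
    n_test = max(1, round(test_fraction * len(clusters)))
    test_clusters = set(clusters[:n_test])
    train_idx = [i for i, c in enumerate(cluster_ids) if c not in test_clusters]
    test_idx = [i for i, c in enumerate(cluster_ids) if c in test_clusters]
    return train_idx, test_idx


def tpr_at_fpr(y_true, scores, max_fpr=0.01):
    """Highest TPR reachable by a score threshold whose FPR is <= max_fpr."""
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        return float("nan")  # metric is undefined without both classes
    pairs = sorted(zip(scores, y_true), key=lambda p: -p[0])
    tp = fp = 0
    best = 0.0
    i = 0
    while i < len(pairs):
        j = i
        # Tied scores share one threshold, so they are counted together.
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if pairs[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        if fp / n_neg > max_fpr:
            break  # lowering the threshold further only raises the FPR
        best = tp / n_pos
        i = j
    return best


def cluster_bootstrap_ci(y_true, scores, cluster_ids, metric, n_boot=200, seed=0):
    """Percentile 95% interval from resampling clusters (an assumption)."""
    rng = random.Random(seed)
    members = {}
    for i, c in enumerate(cluster_ids):
        members.setdefault(c, []).append(i)
    clusters = sorted(members)
    stats = []
    for _ in range(n_boot):
        # Draw clusters with replacement and keep all of their sequences.
        drawn = rng.choices(clusters, k=len(clusters))
        idx = [i for c in drawn for i in members[c]]
        value = metric([y_true[i] for i in idx], [scores[i] for i in idx])
        if value == value:  # skip NaN from single-class resamples
            stats.append(value)
    if not stats:
        return float("nan"), float("nan")
    stats.sort()
    lower = stats[int(0.025 * (len(stats) - 1))]
    upper = stats[int(0.975 * (len(stats) - 1))]
    return lower, upper
```

In use, a classifier would be fit on the train indices from cluster_holdout_split and cluster_bootstrap_ci would be called on the held-out labels, scores, and cluster IDs with tpr_at_fpr as the metric. Resampling clusters rather than individual sequences keeps homologous sequences together, which avoids treating near-duplicates as independent evidence.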
Submission Number: 43