Keywords: Machine learning, peptides, foundation model, drug discovery, pharmacokinetics, biology, SMILES
TL;DR: We introduce PeptideMTR, a SMILES-based foundation model for therapeutic peptides that combines masked language modeling with regression to physicochemical descriptors, yielding improved representations for peptide drug discovery.
Abstract: Foundation models for molecular science have significantly advanced small-molecule and protein modeling, yet models able to encode therapeutic peptides remain scarce. Existing chemical language models often operate with short context windows, while protein language models are limited to canonical amino acids and struggle with non-natural residues, modifications, and cyclizations. We present PeptideMTR, a SMILES-based foundation model with multimodal pretraining via descriptor alignment. PeptideMTR couples masked language modeling with an auxiliary regression objective over RDKit-derived physicochemical descriptors, aligning symbolic sequence representations with continuous chemical properties. Our contributions are threefold: (i) a k-mer tokenizer tailored to chemically coherent fragments and peptide motifs, (ii) a dual-objective pretraining scheme that unifies symbolic and numeric modalities, and (iii) an empirical study of how scaling from 32M to 337M parameters affects peptide permeability and aggregation prediction. PeptideMTR consistently outperforms fingerprint baselines and MLM-only pretraining, demonstrating that multimodal pretraining yields richer peptide representations.
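To make the dual-objective pretraining concrete, below is a minimal sketch of how masked language modeling can be combined with regression to RDKit-derived descriptors. The names `encoder`, `mlm_head`, and `descriptor_head`, the mean pooling, the descriptor set, and the weighting `lam` are all illustrative assumptions; the abstract does not specify PeptideMTR's actual architecture or descriptor choices.

```python
import torch
import torch.nn.functional as F
from rdkit import Chem
from rdkit.Chem import Descriptors


def rdkit_descriptors(smiles: str) -> torch.Tensor:
    """Compute a small, illustrative set of RDKit physicochemical descriptors."""
    mol = Chem.MolFromSmiles(smiles)
    return torch.tensor(
        [
            Descriptors.MolWt(mol),        # molecular weight
            Descriptors.MolLogP(mol),      # Crippen logP
            Descriptors.TPSA(mol),         # topological polar surface area
            Descriptors.NumHDonors(mol),   # hydrogen-bond donor count
        ],
        dtype=torch.float32,
    )


def pretraining_loss(encoder, mlm_head, descriptor_head,
                     masked_ids, attention_mask, mlm_labels,
                     descriptor_targets, lam=1.0):
    """MLM cross-entropy plus a lam-weighted descriptor-regression MSE term."""
    hidden = encoder(masked_ids, attention_mask)            # (B, T, d) hidden states
    mlm_logits = mlm_head(hidden)                           # (B, T, vocab) token logits
    loss_mlm = F.cross_entropy(
        mlm_logits.view(-1, mlm_logits.size(-1)),
        mlm_labels.view(-1),
        ignore_index=-100,                                  # score only masked positions
    )
    pooled = hidden.mean(dim=1)                             # simple mean pooling over tokens
    loss_reg = F.mse_loss(descriptor_head(pooled), descriptor_targets)
    return loss_mlm + lam * loss_reg                        # joint symbolic + numeric objective
```

In this sketch, `descriptor_targets` would be precomputed with `rdkit_descriptors` over the unmasked SMILES, so the regression head learns to recover continuous chemical properties from the masked sequence representation.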
Submission Number: 58