FluorCode: Predicting Fluorescent Protein Photophysical Properties with LoRA-Fine-Tuned Protein Language Models
Track: Track 1: Original Research/Position/Education/Attention Track
Keywords: Protein, Protein language model, LoRA fine-tuning, Fluorescent protein, Photophysical property prediction
TL;DR: A fine-tuned PLM for photophysical property prediction shows robust generalization across fluorescent protein families.
Abstract: Predicting fluorescent protein (FP) photophysical properties from sequence remains challenging, especially for novel FPs, because existing datasets have limited diversity. Using FluorCode, a protein language model fine-tuned for photophysical property prediction, we compare alignment-based one-hot features against LoRA-fine-tuned ESM2-650M with chromophore-aware attention pooling. Under sequence-identity-clustered cross-validation at 50\%, the accuracy of the alignment-based baseline degrades sharply: excitation MAE increases from 12.7 to 35.6\,nm and brightness prediction collapses to noise ($R = 0.13$), whereas the LoRA-ESM2 MLP model remains robust (excitation MAE: 8.9 $\rightarrow$ 11.3\,nm; brightness $R = 0.96$). Adding Pocket-3D chromophore-anchored descriptors from 913 chromophore-grafted structures yields no consistent gain over the sequence model, indicating that explicit structural information in this hand-crafted form does not improve on LoRA-ESM2 under our current setup. We further show that the standard random cross-validation protocol used in prior work overestimates performance by placing near-identical FP variants in both training and test folds, an inflation previously documented in protein machine learning but not yet addressed in fluorescent protein prediction.
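The difference between random and sequence-identity-clustered cross-validation can be illustrated with a minimal sketch. It is not the authors' pipeline: the greedy representative-based clustering, the helper names (`identity`, `greedy_clusters`, `clustered_folds`), and the toy sequences are all assumptions made for illustration, and the abstract does not say which clustering tool was used. The sketch also assumes pre-aligned, equal-length sequences. The key property it shows is that every cluster is assigned to exactly one fold, so near-identical variants never appear in both training and test data.

```python
from collections import defaultdict


def identity(a, b):
    """Fraction of identical positions between two equal-length aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_clusters(seqs, threshold=0.5):
    """Assign each sequence to the first cluster whose representative
    shares at least `threshold` identity with it (hypothetical helper).
    Only identity to representatives is checked, not all pairs."""
    reps = []    # index of the representative sequence of each cluster
    labels = []  # cluster label per input sequence
    for i, s in enumerate(seqs):
        for c, r in enumerate(reps):
            if identity(s, seqs[r]) >= threshold:
                labels.append(c)
                break
        else:
            reps.append(i)
            labels.append(len(reps) - 1)
    return labels


def clustered_folds(labels, k):
    """Place whole clusters into k folds, largest cluster first,
    always filling the currently smallest fold."""
    members = defaultdict(list)
    for idx, c in enumerate(labels):
        members[c].append(idx)
    folds = [[] for _ in range(k)]
    for c in sorted(members, key=lambda c: -len(members[c])):
        smallest = min(range(k), key=lambda f: len(folds[f]))
        folds[smallest].extend(members[c])
    return folds


# Toy aligned sequences (made up for illustration, not real FP sequences).
seqs = ["AAAAAAAA", "AAAAAAAC", "AAAACCCC", "GGGGGGGG", "GGGGGGGT", "TTTTTTTT"]
labels = greedy_clusters(seqs, threshold=0.5)
folds = clustered_folds(labels, k=2)

for f, test_idx in enumerate(folds):
    train_idx = [i for g, fold in enumerate(folds) if g != f for i in fold]
    # No cluster label is shared between the training and test indices.
    assert not {labels[i] for i in test_idx} & {labels[i] for i in train_idx}
    print(f"fold {f}: test={test_idx} train={train_idx}")
```

Under a random split, by contrast, members of the same cluster would typically be scattered across folds, which is the leakage the abstract identifies as the source of overestimated performance. In practice a dedicated tool such as MMseqs2 or CD-HIT would replace the toy clustering step.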
Submission Number: 150