Inference-Time Toxicity Mitigation in Protein Language Models via Logit-Diff Amplification
Keywords: protein language models, ProGen2, dual-use, biorisk, biosecurity, toxicity elicitation, taxonomic finetuning, inference-time mitigation, logit-diff amplification, LDA, decoding control, model-diff steering, toxicity classification, ToxDL2, ESM-2 embeddings, ESMFold, pLDDT, Fréchet ESM Distance, distributional shift, activation steering, representation engineering
TL;DR: Taxonomic finetuning elicits toxic protein generation (10-65% rates). We adapt Logit Diff Amplification for inference-time mitigation, reducing toxicity while preserving biological quality—unlike activation steering which degrades sequences.
Abstract: Protein language models (PLMs) are becoming practical tools for de novo protein design, yet their dual-use potential raises safety concerns. We show that domain adaptation to specific taxonomic groups can unintentionally elicit toxic protein generation, even when toxicity is not the training objective. To address this, we adapt Logit Diff Amplification (LDA) as an inference-time control mechanism for PLMs. LDA modifies token probabilities by amplifying the logit difference between a baseline model and a toxicity-finetuned model, requiring no retraining. Across four taxonomic groups, LDA consistently reduces predicted toxicity rate (measured via ToxDL2) below the taxon-finetuned baseline while preserving biological plausibility. We evaluate quality using Fréchet ESM Distance and predicted foldability (pLDDT), finding that LDA maintains distributional similarity to natural proteins and structural viability---unlike activation-based steering methods that tend to degrade sequence properties. Our results demonstrate that LDA provides a practical safety knob for protein generators that mitigates elicited toxicity while retaining generative quality.
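The abstract describes LDA as combining per-token logits from a baseline model and a toxicity-finetuned model at decode time. A minimal sketch of that combination is below; the function names, the toy logit values, and the sign convention (steering away from the finetuned model by amplifying the base-minus-finetuned difference) are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def lda_logits(base_logits, finetuned_logits, alpha=1.0):
    """Assumed LDA rule: amplify the (base - finetuned) logit gap
    on top of the base logits, steering generation away from
    tokens the toxicity-finetuned model prefers. No retraining."""
    return base_logits + alpha * (base_logits - finetuned_logits)

def softmax(x):
    z = x - x.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Toy 4-token amino-acid vocabulary; token 2 is the one the
# toxicity-finetuned model upweights relative to the base model.
base = np.array([2.0, 1.0, 0.5, 0.2])
toxic = np.array([2.0, 1.0, 3.0, 0.2])

p_base = softmax(base)
p_lda = softmax(lda_logits(base, toxic, alpha=1.0))
# p_lda assigns lower probability to token 2 than p_base does,
# while tokens the two models agree on are left relatively intact.
```

In a real decoding loop the two models would be run in parallel on the same prefix at each step, with `alpha` acting as the "safety knob" the abstract mentions: larger values push harder against the finetuned model's toxic preferences, at some cost to similarity with the base distribution.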
Presenter: ~Manuel_Fernández_Burda1
Format: Yes, the presenting author will definitely attend in person because they are attending ICLR for other complementary reasons.
Funding: Yes, the presenting author of this submission falls under ICLR’s funding aims, and funding would significantly impact their ability to attend the workshop in person.
Submission Number: 60