Paraphrasing Away Malicious Tokens: Improving LLM Finetuning Safety by Filtering Spurious Correlation

Marcel Mateos Salles; Praney Goyal; Pradyut Sekhsaria; Hai Huang; Randall Balestriero

Paraphrasing Away Malicious Tokens: Improving LLM Finetuning Safety by Filtering Spurious Correlation

Marcel Mateos Salles, Praney Goyal, Pradyut Sekhsaria, Hai Huang, Randall Balestriero

Published: 24 Sept 2025, Last Modified: 24 Sept 2025NeurIPS 2025 LLM Evaluation Workshop PosterEveryoneRevisionsBibTeXCC BY 4.0

Keywords: LLMs, Data, SSTI, Spurious Correlation, Risks, Reliability

Abstract: Large Language Models (LLMs) are increasingly adapted to classification-style tasks through Low-Rank Adaptation (LoRA). While LoRA provides strong performance at low cost, we find it introduces a major security vulnerability: susceptibility to Seamless Spurious Token Injection (SSTI). In SSTI, even a single token spuriously correlated with downstream labels can dominate model predictions, either through accidental data artifacts or intentional dataset poisoning. We conduct comprehensive experiments across three model families (Meta LLaMA-3, Apple OpenELM, and Snowflake Arctic) and four diverse datasets (IMDB, Financial Classification, CommonSenseQA, and Bias in Bios), and evaluate the impact of using LLMs for paraphrasing as a defense mechanism. Our findings reveal: (1) minimal injection—just one token per prompt—is sufficient to steer model outputs; and (2) paraphrasing serves as a partial defense against easy SSTI. Together, our results expose a critical tradeoff between efficiency and robustness in LoRA finetuning, raising new concerns for both data quality and model security.

Submission Number: 40

Loading