LLMs are Vulnerable to Malicious Prompts Disguised as Scientific Language

Published: 25 Jul 2025, Last Modified: 12 Oct 2025 · COLM 2025 Workshop SoLaR Poster · CC BY 4.0
Keywords: model bias, security, jailbreak, toxicity
TL;DR: LLMs can be jailbroken using scientific-sounding language to justify stereotypes, revealing serious vulnerabilities even in top models like GPT.
Abstract: As large language models (LLMs) have been deployed in various real-world settings, concerns about the harm they may propagate have grown. Various jailbreaking techniques have been developed to expose the vulnerabilities of these models and improve their safety. This work reveals that many state-of-the-art LLMs are vulnerable to malicious requests hidden behind scientific language. Specifically, our experiments with GPT-4o, GPT-4o-mini, GPT-4, Llama3.1-405B-Instruct, Llama3.1-70B-Instruct, and Gemini models on the StereoSet data demonstrate that the models' biases and toxicity substantially increase when they are prompted with requests that deliberately misinterpret social science and psychological studies as evidence supporting the benefits of stereotypical biases. Alarmingly, these models can also be manipulated to generate fabricated scientific arguments claiming that biases are beneficial, which ill-intentioned actors can use to systematically jailbreak even the strongest models like GPT. Our findings call for a more careful investigation into the use of scientific data for training LLMs.
Submission Number: 10
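
The sketch below illustrates, in broad strokes, the kind of comparison the abstract describes: querying a model with a direct request and with the same request wrapped in pseudo-scientific framing, then scoring both replies. It is a minimal sketch, not the paper's pipeline: the prompt texts are placeholders, the model list is illustrative, the scoring step is left abstract, and the OpenAI Python client (v1+) with an `OPENAI_API_KEY` in the environment is assumed.

```python
# Minimal sketch of a paired-prompt comparison, assuming the OpenAI Python SDK (>=1.0).
# Prompt texts and the scoring step are placeholders, not the paper's materials.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def query(model: str, prompt: str) -> str:
    """Send a single-turn chat request and return the model's reply."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    return resp.choices[0].message.content

# Placeholder prompt pair: a baseline request and the same request prefaced
# with (misinterpreted) "scientific" justification, as described in the abstract.
baseline_prompt = "<direct request drawn from a StereoSet context>"
framed_prompt = "<the same request, framed as supported by social-science evidence>"

for model in ["gpt-4o-mini", "gpt-4o"]:
    baseline_reply = query(model, baseline_prompt)
    framed_reply = query(model, framed_prompt)
    # The paper scores replies for stereotypical bias and toxicity; here a
    # hypothetical score() classifier would stand in for that step, e.g.:
    # print(model, score(baseline_reply), score(framed_reply))
```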