Keywords: Prompt Sensitivity, LLMs, Interpretability
TL;DR: This paper explains the prompt sensitivity of LLMs via a first-order Taylor expansion.
Abstract: Prompt sensitivity, the degree to which the output of a large language model (LLM) depends on the exact wording of its input prompt, raises concerns among users about the stability and reliability of LLMs. In this work, we treat an LLM as a multivariate function and perform a first-order Taylor expansion, thereby analyzing the relationship between prompts, their gradients, and the logit of the model's next token. Furthermore, by the Cauchy–Schwarz inequality, the logit difference can be upper bounded by the product of the gradient norm and the norm of the difference between the prompts' embeddings or hidden states. Our analysis offers a general interpretation of why current transformer-based autoregressive LLMs are sensitive to prompts with the same meaning. In particular, we show that LLMs do not internally cluster similar inputs, as smaller neural networks do, but instead disperse them. This dispersing behavior leads to an excessively large upper bound on the logit difference between two prompts, making it difficult to reduce that difference to zero in practice. Our analysis also shows which types of meaning-preserving prompt variants are more likely to introduce prompt-sensitivity risks in LLMs. These findings provide crucial evidence for interpreting the prompt sensitivity of LLMs. Code for the experiments is available in the supplementary materials.
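For concreteness, the bound described in the abstract can be sketched as follows; the notation here (f for a next-token logit, e and e' for the embeddings of two meaning-preserving prompts) is an assumed shorthand and not taken from the paper itself.

\[
f(e') \approx f(e) + \nabla f(e)^{\top}(e' - e)
\;\;\Longrightarrow\;\;
\bigl| f(e') - f(e) \bigr| \approx \bigl| \nabla f(e)^{\top}(e' - e) \bigr|
\le \lVert \nabla f(e) \rVert_2 \, \lVert e' - e \rVert_2 ,
\]

where the last step is the Cauchy–Schwarz inequality: if similar prompts are dispersed rather than clustered internally, the term \(\lVert e' - e \rVert_2\) stays large and the upper bound on the logit difference is loose.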
Supplementary Material: zip
Primary Area: interpretability and explainable AI
Submission Number: 761