Microsaccade-Inspired Probing: Positional Encoding Perturbations Reveal LLM Misbehaviors

ICLR 2026 Conference Submission 13974 Authors

18 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: LLM, NLP, Mechanistic Interpretability, ML, Causality Analysis
TL;DR: Microsaccade-inspired probes use tiny position perturbations to reveal hidden signals of LLM misbehavior. Without fine-tuning, they detect factual, safety, toxicity, and backdoor failures across models.
Abstract: We draw inspiration from microsaccades, tiny involuntary eye movements that reveal hidden dynamics of human perception, to propose an analogous probing method for large language models (LLMs). Just as microsaccades expose subtle but informative shifts in vision, we show that lightweight positional encoding perturbations elicit latent signals that reliably indicate model misbehavior. Our method requires no fine-tuning or task-specific supervision, yet detects failures across diverse settings including factuality, safety, toxicity, and backdoor attacks. Experiments on multiple state-of-the-art LLMs demonstrate that these perturbation-based probes consistently surface misbehaviors while remaining computationally efficient. These findings suggest that pretrained LLMs already encode the internal evidence needed to flag their own failures, and that microsaccade-inspired interventions provide a pathway for detecting and mitigating undesirable behaviors.
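
Illustrative example: the probing idea in the abstract can be pictured as comparing a model's output distribution before and after a small jitter of its position ids. The sketch below is a minimal, assumption-laden illustration, not the authors' procedure: the model choice (`gpt2` as a stand-in), the jitter scheme, and the KL-divergence score are all hypothetical choices made only for this example.

```python
# Minimal illustrative sketch (not the paper's exact method): jitter the
# position ids fed to a causal LM and score how far the next-token
# distribution moves. Model, jitter size, and scoring rule are assumptions.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumed stand-in; the paper evaluates larger LLMs
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

@torch.no_grad()
def perturbation_score(text: str, jitter: int = 1, n_trials: int = 8) -> float:
    """Mean KL divergence between clean and position-jittered next-token
    distributions; a larger score is read as a stronger perturbation signal."""
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    seq_len = input_ids.shape[1]
    base_pos = torch.arange(seq_len).unsqueeze(0)

    # Clean forward pass with the default positions.
    clean = model(input_ids, position_ids=base_pos).logits[:, -1].log_softmax(-1)
    scores = []
    for _ in range(n_trials):
        # Tiny random shift of each token's position, clamped to valid range.
        noise = torch.randint(-jitter, jitter + 1, (1, seq_len))
        jittered = (base_pos + noise).clamp(0, seq_len - 1)
        pert = model(input_ids, position_ids=jittered).logits[:, -1].log_softmax(-1)
        # KL(clean || perturbed), both given as log-probabilities.
        scores.append(F.kl_div(pert, clean, reduction="sum", log_target=True).item())
    return sum(scores) / len(scores)

print(perturbation_score("The capital of Australia is Sydney."))
```

In this toy setup, the score is just a scalar sensitivity measure; how such signals are thresholded or combined to flag factuality, safety, toxicity, or backdoor failures is described in the paper itself, not reproduced here.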
Supplementary Material: zip
Primary Area: causal reasoning
Submission Number: 13974