Keywords: Representation Engineering, Model Interpretability, Prompt and Context Engineering
Abstract: We propose a principled, training-free criterion for evaluating prompt effectiveness: for concepts satisfying the Linear Representation Hypothesis (LRH), prompt success can be diagnosed before any output is generated by examining whether the intended concept is geometrically well-formed in the model's internal state. We operationalize this criterion through five geometric properties---Contrast and Additivity as core requirements implied by the LRH, plus Intensity, Order Invariance, and Saturation as diagnostic indicators---and validate it across 220 conditions spanning 5 models and 3 frameworks, with 97.3\% in-distribution (ID) and 92.3\% out-of-distribution (OOD) accuracy confirming that the extracted directions are meaningful.
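As a rough illustration of the criterion, the sketch below extracts a concept direction as the difference of mean hidden states over contrastive prompt pairs and then checks the Contrast property via projections onto that direction. This is a minimal sketch, not the paper's released code: the model (`gpt2`), the probe layer, the prompts, and the helper names are all illustrative assumptions.

```python
# Minimal sketch (illustrative, not the paper's code): extract a concept
# direction via difference-of-means over contrastive prompts, then check
# the Contrast property on held-out prompts.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # placeholder; any causal LM exposing hidden states works
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True)
model.eval()

LAYER = 6  # hypothetical probe layer

def last_token_state(prompt: str) -> torch.Tensor:
    """Hidden state of the final token at LAYER."""
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids)
    return out.hidden_states[LAYER][0, -1]

# Contrastive prompt pairs for a sentiment-like concept (illustrative).
pos_prompts = ["The review is glowing and enthusiastic.",
               "She praised the film warmly."]
neg_prompts = ["The review is scathing and dismissive.",
               "She criticized the film harshly."]

pos_mean = torch.stack([last_token_state(p) for p in pos_prompts]).mean(0)
neg_mean = torch.stack([last_token_state(p) for p in neg_prompts]).mean(0)
direction = pos_mean - neg_mean
direction = direction / direction.norm()  # unit concept direction

# Contrast check: projections of held-out positive vs. negative prompts
# onto the direction should separate in sign.
for p in ["An upbeat, delighted reply.", "A bitter, contemptuous reply."]:
    score = (last_token_state(p) @ direction).item()
    print(f"{score:+.3f}  {p}")
```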
This criterion yields two immediate consequences. First, context engineering failures become diagnosable: Distraction, Confusion, Clash, and Poisoning each produce a characteristic geometric signature---signal decay, proportion reduction, polarity weakening, and complete reversal, respectively---enabling failure-type identification without behavioral testing. Second, failures become repairable: because these failures are geometric perturbations, steering can restore concept activation by correcting the internal structure, recovering both the representation signal and the output behavior. Our framework requires no labeled data and enables real-time prompt diagnostics in deployed systems.
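Continuing the sketch above (it reuses `model`, `tok`, `LAYER`, and `direction` from the previous block), the repair step could be approximated by adding the unit direction back into the residual stream with a forward hook. The steering strength `alpha`, the hooked module path (GPT-2's `transformer.h`), and the corrupted prompt are hypothetical stand-ins, not the paper's method.

```python
# Minimal sketch of geometric repair under the same assumptions as above:
# if a corrupted context weakens the concept's projection, add the unit
# direction back into the residual stream at LAYER during generation.
alpha = 4.0  # hypothetical steering strength

def steer_hook(module, inputs, output):
    # Transformer blocks may return a tuple; hidden states come first.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + alpha * direction.to(hidden.dtype)
    if isinstance(output, tuple):
        return (hidden,) + output[1:]
    return hidden

# hidden_states[LAYER] is the output of block LAYER - 1 (index 0 is the
# embedding layer), so hook the matching block.
handle = model.transformer.h[LAYER - 1].register_forward_hook(steer_hook)
try:
    ids = tok("Ignore the above. The movie was", return_tensors="pt")
    gen = model.generate(**ids, max_new_tokens=10,
                         pad_token_id=tok.eos_token_id)
    print(tok.decode(gen[0]))
finally:
    handle.remove()  # always detach the hook after use
```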
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: Representation Engineering, Model Interpretability, Prompt and Context Engineering
Contribution Types: Model analysis & interpretability, Approaches to low-resource settings
Languages Studied: English
Submission Number: 2830