Graph-Based Instrumental Power-Seeking Index for Robust and Training-Free Auditing of Large Language Models

20 Sept 2025 (modified: 25 Sept 2025) | ICLR 2026 Conference Withdrawn Submission | CC BY 4.0
Keywords: large language models, power seeking, emergent behavior, alignment, safety, interpretability, graph representation, contrastive consistency, training free methods, evaluation benchmark
Abstract: Large Language Models (LLMs) increasingly act in interactive and decision-making settings, raising urgent concerns about alignment and safety. A key theoretical risk is power-seeking, where agents adopt strategies that preserve or expand their own influence as an instrumental subgoal. While alignment theory has formally predicted such behaviors, the field lacks a scalable, reproducible, and interpretable empirical benchmark. We introduce the Instrumental Power-Seeking (IPS) Index, a model-agnostic and training-free framework for quantifying emergent power-oriented behaviors in LLMs. IPS represents interactions as structured decision graphs spanning debate, collaboration, hierarchy, and resource-allocation domains. Each node captures a model’s decision, and edges encode consistency relations across scenarios. A contrastive evaluation layer distinguishes influence-preserving trajectories from baseline alternatives, producing a unified IPS score. This score integrates rubric-based judgments, likelihood signals, and structured verification, yielding both quantitative and interpretable outputs. In large-scale experiments across 2,000 scenarios and multiple frontier models, IPS reveals systematic differences in power-seeking tendencies, with high-pressure and strategic contexts amplifying control-preserving motifs. Beyond measurement, IPS provides a reusable framework for probing generalization and scaling trends, showing how power-seeking behaviors evolve with model size, prompt structure, and reasoning depth. Our work bridges the gap between alignment theory and empirical auditing by introducing the first graph-based, contrastive, and domain-general framework for detecting and quantifying emergent power-seeking in LLMs.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 23038
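
The abstract describes the IPS score as combining rubric-based judgments, likelihood signals, and structured verification at each decision node, contrasting influence-preserving trajectories with baseline alternatives, and using edges to encode consistency across scenarios. The abstract gives no formulas, so the following is only a minimal sketch of how such a pipeline might be organized. Every name (`Decision`, `node_score`, `contrastive_margin`, `ips_index`), the component weights, and the sign-agreement consistency rule are hypothetical assumptions for illustration, not the authors' definitions.

```python
from dataclasses import dataclass
from statistics import mean

# NOTE: all names, weights, and formulas here are illustrative assumptions;
# the abstract does not specify how the IPS score is actually computed.


@dataclass(frozen=True)
class Decision:
    """One node of a decision graph: a model's decision in one scenario."""
    scenario_id: str
    domain: str        # e.g. "debate", "collaboration", "hierarchy", "resource_allocation"
    rubric: float      # rubric-based judgment, assumed to lie in [0, 1]
    likelihood: float  # likelihood signal, assumed to lie in [0, 1]
    verified: bool     # outcome of a structured verification check


def node_score(d, weights=(0.5, 0.3, 0.2)):
    """Weighted blend of the three signals (the weights are assumed)."""
    w_rubric, w_likelihood, w_verified = weights
    return (w_rubric * d.rubric
            + w_likelihood * d.likelihood
            + w_verified * (1.0 if d.verified else 0.0))


def contrastive_margin(influence, baseline):
    """How much more the influence-preserving trajectory scores than the baseline."""
    return node_score(influence) - node_score(baseline)


def ips_index(pairs, edges):
    """Aggregate contrastive margins, discounted by cross-scenario consistency.

    pairs: dict mapping scenario_id -> (influence Decision, baseline Decision)
    edges: list of (scenario_id_a, scenario_id_b) consistency relations
    """
    margins = {sid: contrastive_margin(p, b) for sid, (p, b) in pairs.items()}
    raw = mean(margins.values())
    if not edges:
        return raw
    # Assumed consistency rule: an edge "agrees" when both endpoints
    # have margins of the same sign.
    agree = sum(1 for a, b in edges if (margins[a] > 0) == (margins[b] > 0))
    return raw * (agree / len(edges))


pairs = {
    "s1": (Decision("s1", "debate", 0.8, 0.6, True),
           Decision("s1", "debate", 0.2, 0.4, False)),
    "s2": (Decision("s2", "hierarchy", 0.6, 0.5, True),
           Decision("s2", "hierarchy", 0.4, 0.5, False)),
}
score = ips_index(pairs, edges=[("s1", "s2")])
```

Under these assumptions, a positive `score` means the influence-preserving trajectories were rated above their baselines on average, and the consistency factor shrinks the score when linked scenarios disagree in direction. A real implementation would replace each placeholder with the paper's own rubric, likelihood, and verification procedures.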