Keywords: Embodied AI, Foundation Models, Information Theory, Planning Horizon, Vision-Language-Action Models, Planning Information Bottleneck
TL;DR: We introduce the Planning Information Bottleneck to derive the first theoretical limits on the maximum reliable planning horizon and optimal subgoal density for embodied foundation models.
Abstract: Foundation models, including Large Language Models (LLMs), Vision--Language Models (VLMs), and Vision--Language--Action Models (VLAs), have demonstrated impressive capabilities in grounding language instructions to embodied actions. Yet a systematic, theoretically grounded explanation of why these systems fail so consistently as the task horizon grows has remained elusive. We close this gap by introducing the Planning Information Bottleneck (PIB), a scalar $B \ge 0$ (in bits) that measures the task-relevant information irrecoverably lost when a VLM compresses a physical observation into its internal representation.
From this quantity we derive four rigorous results. (i) \textbf{Semantic Horizon Theorem} (Thm. 3.5): the maximum planning horizon at which a VLM-guided agent can succeed with probability $1 - \epsilon$ is $H_{sem} \approx \epsilon \log |A| / (B - 1)$, providing the first closed-form horizon bound for embodied foundation models. (ii) \textbf{Optimal Subgoal Count} (Thm. 3.9): the bottleneck-minimizing number of language subgoals is $K^* = \left( \frac{\gamma B_0 H^\gamma}{B_{spec}} \right)^{1/(\gamma+1)}$, where $\gamma$ captures the super-linearity of reasoning difficulty with horizon. (iii) \textbf{Adaptive Replanning Criterion} (Thm. 3.12): an agent should replan when semantic drift $D_t \ge \sqrt{\frac{C_{replan}}{r_{max}(H - t)}}$, yielding a threshold that grows as the remaining horizon $H - t$ shrinks toward the task deadline. (iv) \textbf{Calibration--Bottleneck Duality} (Thm. 3.13): a VLA is Bayesian-calibrated if and only if its policy action entropy equals $B$.
We validate all four theorems across five VLMs (LLaVA-1.6, GPT-4V, Gemini-1.5-Pro, InternVL2, OpenVLA) and four benchmarks (ALFRED, RLBench, Habitat, MetaWorld), finding theoretical horizon predictions within 8.7\% of empirical measurements ($r = 0.991, p < 10^{-16}$). We further introduce PIB-AUC, a new evaluation axis for embodied benchmarks that predicts long-horizon failure two to five times better than existing VQA-based scores. Code, data, and all experimental artefacts will be released upon acceptance.
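The three closed-form expressions quoted in the abstract can be evaluated directly, which may help when reading them. The following is a minimal sketch and not the authors' released code: the function and parameter names are hypothetical, the logarithm is assumed to be base 2 because $B$ is stated in bits, and the horizon formula is only evaluated for $B > 1$, where the expression is positive and finite.

```python
import math


def semantic_horizon(eps: float, num_actions: int, bottleneck_bits: float) -> float:
    """Evaluate H_sem ~ eps * log|A| / (B - 1) as written in the abstract.

    Assumption: log is base 2 (B is given in bits); the abstract does not say.
    """
    if bottleneck_bits <= 1:
        raise ValueError("expression is only positive and finite for B > 1")
    return eps * math.log2(num_actions) / (bottleneck_bits - 1)


def optimal_subgoal_count(gamma: float, b0: float, horizon: float, b_spec: float) -> float:
    """Evaluate K* = (gamma * B_0 * H^gamma / B_spec)^(1 / (gamma + 1))."""
    return (gamma * b0 * horizon**gamma / b_spec) ** (1.0 / (gamma + 1.0))


def replan_threshold(c_replan: float, r_max: float, horizon: int, t: int) -> float:
    """Evaluate the drift threshold sqrt(C_replan / (r_max * (H - t)))."""
    if t >= horizon:
        raise ValueError("threshold is undefined once t reaches the horizon H")
    return math.sqrt(c_replan / (r_max * (horizon - t)))


if __name__ == "__main__":
    # Illustrative inputs only; none of these values come from the paper.
    print(semantic_horizon(eps=0.1, num_actions=16, bottleneck_bits=3.0))
    print(optimal_subgoal_count(gamma=1.0, b0=2.0, horizon=8.0, b_spec=1.0))
    # The threshold increases as fewer steps remain before the deadline.
    print([round(replan_threshold(1.0, 1.0, 10, t), 3) for t in (0, 6, 9)])
```

With these definitions, a larger bottleneck $B$ shortens the predicted horizon, and the replanning threshold rises monotonically as $t$ approaches $H$, matching the form of the expressions above.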
Submission Number: 40