CLAMP: Constrained Language-Action Multimodal Planning
Keywords: embodied agents, constrained generation, vision-language models, multimodal planning
Abstract: Generating valid, semantically grounded action sequences from visual observations is a central challenge for embodied agents: plans must satisfy both format constraints (syntactically valid actions) and semantic constraints (physically feasible state transitions). Constrained generation methods and vision-language-action models have advanced these goals in parallel, but no unified framework enforces both constraint types in the visual POMDP setting. Ctrl-G achieves strong constrained generation for text-only embodied AI with a two-layer DFA–HMM architecture, yet it degrades in multimodal settings because visual token spans can induce spurious DFA transitions, the semantic HMM assumes a fully observed discrete state, and symbolic grounding cannot refer to visual entities. We introduce CLAMP (Constrained Language-Action Multimodal Planning), which extends Ctrl-G to vision-language agents via three modules. (1) A Visual DFA freezes automaton transitions during visual token spans, shielding format constraints from image patch tokens. (2) A Belief-Conditioned HMM replaces the oracle state with a lightweight GRU belief estimator trained on automatically annotated trajectories, converting hard semantic constraints into soft, belief-weighted ones. (3) A Visual Bridge maps VLM visual references (bounding boxes, patch indices) to canonical environment entity names via CLIP- and DINO-based vocabulary matching. CLAMP preserves the original two-layer factorization and its memory efficiency over flat product-space formulations. On VAGEN (5 tasks) and MolmoSpaces (8 tasks), CLAMP matches or outperforms GRPO-trained reinforcement learning baselines on task success rate, format-violation rate, and constraint-violation rate, without modifying VLM weights or requiring environment interaction during training. These results show that structured constraint guidance can generalize from text-only to multimodal embodied agents, providing a training-free alternative to RL fine-tuning for grounded action generation.
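To make the Visual DFA and Belief-Conditioned HMM mechanisms concrete, the following is a minimal Python sketch under assumed toy names (VISUAL_TOKENS, FormatDFA, feasible); it illustrates the two ideas as described in the abstract and is not the paper's actual implementation.

# Minimal sketch (not the paper's implementation) of two mechanisms the
# abstract describes. All names here -- VISUAL_TOKENS, the toy transition
# table, `feasible` -- are illustrative assumptions, not CLAMP's API.

from dataclasses import dataclass

VISUAL_TOKENS = {"<img_patch>", "<box>"}  # assumed markers for visual spans


@dataclass
class FormatDFA:
    transitions: dict  # (state, token) -> next state
    accepting: set

    def step(self, state, token):
        # Visual DFA: hold the automaton state fixed across visual token
        # spans so image patch tokens cannot induce spurious transitions.
        if token in VISUAL_TOKENS:
            return state
        return self.transitions.get((state, token))  # None => reject


def soft_constraint_weight(belief, action, feasible):
    """Belief-weighted semantic constraint: rather than a hard 0/1 check
    against an oracle state, marginalize a per-state feasibility indicator
    under the belief b(s) produced by the belief estimator."""
    return sum(p * float(feasible(s, action)) for s, p in belief.items())


# Toy usage: the visual token between "pick" and "place" leaves the DFA
# state untouched, so the format constraint still holds.
dfa = FormatDFA(transitions={(0, "pick"): 1, (1, "place"): 0}, accepting={0})
state = 0
for tok in ["pick", "<img_patch>", "place"]:
    state = dfa.step(state, tok)
assert state in dfa.accepting

# Toy usage: "place" is infeasible when nothing is held, so its weight is
# the belief mass on states where it is feasible (here 0.8, not 0 or 1).
belief = {"holding": 0.8, "empty": 0.2}
feasible = lambda s, a: not (s == "empty" and a == "place")
assert abs(soft_constraint_weight(belief, "place", feasible) - 0.8) < 1e-9

The two pieces compose in the obvious way: the DFA masks out tokens that would break the action format, while the belief-weighted score softly downweights actions that are unlikely to be feasible under the estimated state.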
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 48