EvoWorld: A World-Model-Centric Framework for Continuous Self-Evolution of Modular Embodied Skills

Published: 27 May 2026, Last Modified: 01 Jun 2026
Venue: FMEA @ CVPR 2026 Poster
License: CC BY 4.0
Keywords: World Models, Memory, Continual Skill Learning, Expert Routing
TL;DR: Decouple physical dynamics from skills: a world model imagines, a router picks the right VLA expert, and a VLM diagnoses failures to drive continuous, module-specific self-evolution.
Abstract: Current progress in physical intelligence largely relies on scaling monolithic Vision-Language-Action (VLA) models, yet real-world policy data remain fragmented across scenes and tasks. This mismatch limits transfer, exacerbates catastrophic forgetting, and impedes continual improvement. A modular design that shares dynamics while specializing skills is therefore a promising paradigm. We introduce EvoWorld (EvoW), a world-model-centric framework for skill orchestration and iterative self-evolution. In EvoW, VLAs form an expandable library of pluggable experts. A high-level router selects experts conditioned on scene and task, while an action-conditioned video world model serves as the cognitive core for rollout-based planning. The world model provides counterfactual rollouts to score candidate experts, and the selected experts then execute in the grounded scene to generate trajectories for verification. A vision-language evaluator delivers semantic scores and diagnostic tags, enabling targeted updates to the world model, router memory, or specific experts rather than global retraining. This closes an automated loop that jointly improves grounding, routing, and skills without manual task engineering. Experiments show that EvoW enables automated task-to-policy synthesis with competitive success rates in our evaluated settings and improvement trends across self-evolution iterations, while producing valid and diverse trajectories that support evaluation and skill refinement.
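The loop described in the abstract (route, imagine, execute, diagnose, update one module) can be illustrated with a toy sketch. This is not the authors' implementation: every name below (`EvoLoop`, `Diagnosis`, `imagine`, `evaluate`, the tag strings) is a hypothetical stand-in, the world model and evaluator are reduced to plain callables, and "updates" are only logged rather than trained.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Diagnosis:
    score: float  # semantic score from the evaluator (assumed range [0, 1])
    tag: str      # hypothetical tags: "ok", "router", "expert", "world_model"


@dataclass
class EvoLoop:
    # Expert library: name -> policy mapping a task string to a trajectory.
    experts: Dict[str, Callable[[str], List[str]]]
    # Stand-in for the world model: imagined score of an expert on a task.
    imagine: Callable[[str, str], float]
    # Stand-in for the vision-language evaluator.
    evaluate: Callable[[str, List[str]], Diagnosis]
    # Router memory: task -> candidate expert names that worked before.
    router_memory: Dict[str, List[str]] = field(default_factory=dict)
    # Record of which module each failure was attributed to.
    update_log: List[Tuple[str, str]] = field(default_factory=list)

    def select_expert(self, task: str) -> str:
        # The router proposes candidates; imagined rollouts rank them.
        candidates = self.router_memory.get(task, list(self.experts))
        return max(candidates, key=lambda name: self.imagine(task, name))

    def step(self, task: str) -> Diagnosis:
        name = self.select_expert(task)
        trajectory = self.experts[name](task)  # execution in the scene
        diagnosis = self.evaluate(task, trajectory)
        if diagnosis.tag == "ok":
            self.router_memory[task] = [name]  # remember the working choice
        elif diagnosis.tag == "expert":
            self.update_log.append(("expert", name))  # refine this expert only
        elif diagnosis.tag == "router":
            self.update_log.append(("router", task))  # fix routing for this task
        else:
            self.update_log.append(("world_model", task))  # imagination was off
        return diagnosis
```

The point of the sketch is the dispatch in `step`: a diagnostic tag sends the update to a single module, which is the alternative to global retraining that the abstract describes.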
Submission Number: 42