Graph-Aware Diffusion Policy for Fault-Tolerant Agentic AI Service Migration in Edge Computing Power Networks
Abstract: Edge computing power networks face growing demand to support compute-intensive Agentic AI Services, which are composed of interdependent functions represented as Directed Acyclic Graphs (DAGs). However, dynamic resource volatility and potential node failures make reliable task execution difficult. Existing solutions, often reactive heuristics or GAN-based models, struggle to anticipate risks and overlook DAG dependencies. This paper presents GADP, a Graph-Aware Diffusion Policy framework for proactive, fault-tolerant DAG workload migration in large-scale edge computing systems. GADP integrates three key modules: a Transformer-GAT fault predictor that estimates failure probability and failure type; a DAG encoder that learns structure-preserving task embeddings via multi-round attention; and a diffusion policy generator that refines placement strategies through conditional denoising. Experiments on dynamic simulations with real workload traces show that GADP achieves 99.6% fault detection accuracy and a 95.4% diagnosis F1 score, reduces SLO violations by over 60%, and consumes less energy than all baselines. These results demonstrate GADP's robustness and effectiveness for anticipatory migration under volatile edge conditions.
External IDs: doi:10.1109/tnse.2025.3627391