Abstract:
Reinforcement learning (RL) holds great promise for network control problems, yet its deployment in real-world systems remains limited due to the instability and unpredictability of RL policies during training. To address this challenge, we propose a two-phase conservative RL framework that combines domain expertise from classical network optimization with modern deep RL techniques. Our key idea is to initialize the learning process with a stable base policy derived from expert knowledge, and then apply conservative fine-tuning under a Kullback–Leibler (KL) divergence constraint to safely explore improved behaviors. We apply this framework to the problem of covert multi-hop routing, where the objective is to maximize data throughput while minimizing detectability by adversaries. In Phase I, we construct a reliable base policy by imitating the back-pressure algorithm, which guarantees throughput-optimal behavior and stable queue dynamics. Phase II fine-tunes this policy to improve covert performance, as measured by the Detection Error Probability (DEP), while preserving training-time stability. Empirical evaluations on a grid network show that our method enables more reliable learning than pure RL: although a pure RL baseline (e.g., PPO) can sometimes achieve higher covert performance, it frequently suffers from large queues and collapsed throughput during training. In our experiments, the conservative framework reduces the worst-case training-time queue length by over 99% while maintaining comparable covert communication performance.
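As a rough sketch, the conservative fine-tuning described above can be viewed as KL-constrained policy optimization (the notation here is illustrative and may differ from the paper's exact formulation):

$$
\max_{\theta} \; \mathbb{E}_{\pi_\theta}\!\left[\sum_{t} r_t\right]
\quad \text{s.t.} \quad
\mathbb{E}_{s}\!\left[ D_{\mathrm{KL}}\!\left( \pi_\theta(\cdot \mid s) \,\|\, \pi_{\mathrm{base}}(\cdot \mid s) \right) \right] \le \delta,
$$

where $\pi_{\mathrm{base}}$ is the Phase I policy obtained by imitating back-pressure routing, $r_t$ rewards covertness (e.g., via the DEP) and throughput, and $\delta$ bounds how far the fine-tuned policy may drift from the stable base.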