Abstract: Robust reinforcement learning (RL) is commonly formulated as a min-max optimization problem to account for epistemic uncertainty in transition dynamics.
While theoretically appealing, such formulations are computationally demanding and often induce overly conservative policies.
We study an alternative approach in which transition dynamics are sampled from an uncertainty set and robustness is achieved through explicit control of policy complexity.
In the neural tangent kernel regime, we show that training with uniformly sampled dynamics induces a bias-variance tradeoff, with lower-rank policy representations exhibiting reduced sensitivity to epistemic perturbations.
Within the framework of entropy-regularized RL, we formulate robust learning as a bi-level optimization problem that balances expressiveness and robustness through low-rank policy representations, which yields an adaptive rank-selection mechanism that navigates this tradeoff during training.
We establish policy convergence and demonstrate empirically on MuJoCo continuous-control benchmarks that the proposed method provides a scalable and computationally efficient alternative to traditional robust RL, achieving improved robustness without the overhead of adversarial inner-loop optimization.
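As a rough illustration of the two ingredients named in the abstract (dynamics sampled uniformly from an uncertainty set, and a low-rank policy representation), the sketch below may help fix ideas. It is not the authors' implementation: the box-shaped uncertainty set, the SVD-based truncation, and the names `sample_dynamics` and `truncate_rank` are all assumptions made for this example.

```python
import numpy as np


def sample_dynamics(nominal, radius, rng):
    """Draw dynamics parameters uniformly from a box around nominal values.

    Assumption: the uncertainty set is an axis-aligned box of half-width
    `radius`; the source does not specify the shape of the set.
    """
    nominal = np.asarray(nominal, dtype=float)
    return rng.uniform(nominal - radius, nominal + radius)


def truncate_rank(W, rank):
    """Return the best rank-`rank` approximation of W via truncated SVD.

    Assumption: a weight matrix stands in for the policy representation,
    and SVD truncation stands in for whatever rank control the paper uses.
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    # Keep only the leading `rank` singular directions.
    return (U[:, :rank] * s[:rank]) @ Vt[:rank, :]


rng = np.random.default_rng(0)

# Hypothetical policy weight matrix (6 outputs, 8 features).
W = rng.normal(size=(6, 8))
W_low = truncate_rank(W, rank=2)

# Hypothetical dynamics parameters, e.g. a mass and a friction coefficient.
theta = sample_dynamics([1.0, 0.5], radius=0.1, rng=rng)
```

In a training loop built on this idea, a fresh `theta` would be drawn for each rollout and the rank would be adjusted over time; how that adjustment is made is exactly what the paper's rank-selection mechanism addresses and is not reproduced here.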
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Shaofeng_Zou1
Submission Number: 7743