- TL;DR: DCEM learns latent domains for optimization problems and helps bridge the gap between model-based and model-free RL --- we create a differentiable controller and fine-tune parts of it with PPO
- Abstract: We study the Cross-Entropy Method (CEM) for the non-convex optimization of a continuous and parameterized objective function and introduce a differentiable variant (DCEM) that enables us to differentiate the output of CEM with respect to the objective function's parameters. In the machine learning setting this brings CEM inside of the end-to-end learning pipeline in cases this has otherwise been impossible. We show applications in a synthetic energy-based structured prediction task and in non-convex continuous control. In the control setting we show on the simulated cheetah and walker tasks that we can embed their optimal action sequences with DCEM and then use policy optimization to fine-tune components of the controller as a step towards combining model-based and model-free RL.
- Keywords: machine learning, differentiable optimization, control, reinforcement learning