Abstract: While on-policy algorithms are known for their stability, they often demand a substantial number of samples. In contrast, off-policy algorithms, which leverage past experiences, are considered sample-efficient but tend to exhibit instability. Can we develop an algorithm that harnesses the benefits of off-policy data while maintaining stable learning? In this paper, we introduce an actor-critic learning framework that harmonizes two data sources for both evaluation and control, facilitating rapid learning and flexible integration with on-policy algorithms. The framework incorporates variance reduction mechanisms, including a unified advantage estimator (UAE) and a residual baseline, improving the efficacy of both on- and off-policy learning. Our empirical results show substantial gains in sample efficiency for on-policy algorithms, effectively bridging the gap to off-policy approaches and demonstrating the promise of our approach as a novel learning paradigm.
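To make the high-level idea of combining the two data sources concrete, the sketch below shows a generic actor-critic update that mixes an on-policy batch with an importance-weighted off-policy (replay) batch. This is only an illustrative sketch of the general paradigm, not the paper's UAE or residual baseline; names such as `off_policy_weight` and the fake batches are assumptions introduced purely for illustration.

```python
# Minimal, generic sketch: one gradient step that blends an on-policy batch
# with an off-policy (replay) batch in an actor-critic loss.
# NOTE: this is not the paper's method; it only illustrates the general idea.
import torch
import torch.nn as nn

obs_dim, act_dim = 8, 4
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))
value = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(
    list(policy.parameters()) + list(value.parameters()), lr=3e-4
)

def actor_critic_loss(obs, actions, returns, behavior_logp=None):
    """Policy-gradient + value loss; importance-weight the off-policy batch."""
    dist = torch.distributions.Categorical(logits=policy(obs))
    logp = dist.log_prob(actions)
    values = value(obs).squeeze(-1)
    advantages = (returns - values).detach()          # baseline-subtracted advantage
    if behavior_logp is not None:                     # off-policy: correct with an IS ratio
        ratio = torch.exp(logp.detach() - behavior_logp).clamp(max=10.0)
        policy_loss = -(ratio * logp * advantages).mean()
    else:                                             # on-policy: plain policy gradient
        policy_loss = -(logp * advantages).mean()
    value_loss = (returns - values).pow(2).mean()
    return policy_loss + 0.5 * value_loss

# Fake batches standing in for fresh rollouts and replay samples (illustrative only).
on_obs, on_act, on_ret = torch.randn(32, obs_dim), torch.randint(act_dim, (32,)), torch.randn(32)
off_obs, off_act = torch.randn(32, obs_dim), torch.randint(act_dim, (32,))
off_ret, off_blogp = torch.randn(32), torch.full((32,), -1.4)

off_policy_weight = 0.5  # hypothetical knob balancing the two data sources
loss = (actor_critic_loss(on_obs, on_act, on_ret)
        + off_policy_weight * actor_critic_loss(off_obs, off_act, off_ret, off_blogp))
optimizer.zero_grad()
loss.backward()
optimizer.step()
```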
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: * We further clarified in the introduction how our contribution enhances sample efficiency.
* We reiterated the motivation for distributional learning in section 5.1.
* We added a footnote indicating the general applicability of DPO's bi-level policy evaluation procedure in section 5.1.
Assigned Action Editor: ~Pablo_Samuel_Castro1
Submission Number: 1615