Abstract: Deep reinforcement learning has successfully enabled agents to learn complex behaviors for many tasks. However, a key limitation of current learning approaches is sample inefficiency, which limits the performance of the learning agent. This paper considers how agents can learn more efficiently from teachers' advice. In particular, we consider the setting with multiple sub-optimal teachers, rather than a single near-optimal teacher. We propose a flexible two-level actor-critic algorithm in which the high-level network learns to choose the best teacher for the current situation while the low-level network learns the control policy.
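The high-level teacher-selection idea can be sketched in miniature. The snippet below is a hypothetical illustration, not the paper's algorithm: it treats teacher choice as a bandit-style problem and learns softmax preferences over teachers with a REINFORCE-style update, standing in for the high-level network; the low-level control policy and the actual environment are abstracted away as noisy per-teacher rewards (`teacher_means` is an assumed stand-in).

```python
import math
import random

def softmax(prefs):
    """Convert preference scores into a probability distribution."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def train_selector(teacher_means, steps=5000, lr=0.1, seed=0):
    """Learn which teacher to follow via a REINFORCE-style gradient.

    teacher_means: assumed average return obtained when following each
    teacher; in the paper's setting this feedback would come from the
    low-level learner interacting with the environment.
    """
    rng = random.Random(seed)
    k = len(teacher_means)
    prefs = [0.0] * k          # preference scores, one per teacher
    baseline = 0.0             # running-average reward baseline
    for t in range(steps):
        probs = softmax(prefs)
        # High-level choice: sample a teacher from the current policy.
        a = rng.choices(range(k), weights=probs)[0]
        # Noisy payoff from following that teacher this round.
        reward = rng.gauss(teacher_means[a], 0.1)
        baseline += (reward - baseline) / (t + 1)
        advantage = reward - baseline
        # Softmax policy-gradient update on the preferences.
        for i in range(k):
            grad = (1.0 if i == a else 0.0) - probs[i]
            prefs[i] += lr * advantage * grad
    return prefs
```

With three sub-optimal teachers of differing quality, the learned preferences concentrate on the best one, mirroring the role of the high-level network in the two-level scheme.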
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: Changes made since the last submission (on Nov 15):
1) Add ablation experiments comparing TL-AC with Random and UCB1, results shown in Figure 11.
2) Add reference lines to Figures 5, 6, and 7.
3) Update the plotting frequency of the A2C baseline in Figure 4 and Figure 7.
4) Modify the example used in the Introduction (page 2, 2nd paragraph).
We made several minor changes to the submission:
1) Merge the Background and Related Work sections into one.
2) Update the algorithm pseudocode and remove the use of buffer B to avoid confusion.
3) Fix typos and notations.
4) Add reference lines alongside Figure 3 and Figure 4.
Assigned Action Editor: ~Marcello_Restelli1
License: Creative Commons Attribution 4.0 International (CC BY 4.0)
Submission Number: 1169