Abstract: In curriculum learning, teaching involves the cooperative, planned selection of sequences of data to facilitate efficient and effective learning.
One-off cooperative selection of data has been mathematically formalized as entropy-regularized optimal transport and the limiting behavior of myopic sequential interactions has been analyzed, both yielding theoretical and practical guarantees.
We recast sequential cooperation with curriculum planning in a reinforcement learning framework and analyze performance mathematically and by simulation.
We prove that infinite-length plans are equivalent to not planning under certain assumptions on the method of planning, and we isolate instances where monotonicity, and hence convergence in the limit, holds, as well as cases where it does not. We also demonstrate through simulations that argmax data selection is the same across planning horizons, and we show problem-dependent sensitivity of learning to the teacher's planning horizon. Thus, we find that planning ahead yields efficiency at the cost of effectiveness. This failure of alignment is illustrated in particular with grid-world examples in which the teacher must attempt to steer the learner away from a particular location in order to reach the desired grid square. We conclude with implications and directions for efficient and effective curricula.
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: We have made the changes outlined in the comments to the reviewers, with the following exception: we have not included simulations varying the value of $A$ in $R(\theta)=A\cdot d_{L^1}(\theta,\delta_{h_{\text{bad}}})-d_{L^1}(\theta,\delta_{h_{\text{true}}})$, because for all values tested ($A\in\{0.8,1,10,100,1000\}$) the optimal policy yielded identical results, and including the figures would have made the paper longer than 12 pages. We can include these simulations in the supplementary text if desired.
We have fixed the axis labels in Figure 2; we had accidentally neglected to adjust their size. We have also attached an updated copy of the supplemental materials, including a plot of the reward $R(\theta)=A\cdot d_{L^1}(\theta,\delta_{h_{\text{bad}}})-d_{L^1}(\theta,\delta_{h_{\text{true}}})$.
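As a brief illustrative evaluation of this reward (our own sanity check, not a result or figure from the submission): because the $L^1$ distance between two distinct point masses is $2$, the reward ranges from its maximum when the learner's belief concentrates on $h_{\text{true}}$ to its minimum when it concentrates on $h_{\text{bad}}$,
$$R(\delta_{h_{\text{true}}}) = A\cdot d_{L^1}(\delta_{h_{\text{true}}},\delta_{h_{\text{bad}}}) - 0 = 2A, \qquad R(\delta_{h_{\text{bad}}}) = A\cdot 0 - d_{L^1}(\delta_{h_{\text{bad}}},\delta_{h_{\text{true}}}) = -2.$$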
Assigned Action Editor: ~Amir-massoud_Farahmand1
Submission Number: 208