Keywords: exploration, exploitation, children, reinforcement learning
TL;DR: In a pure exploration vs. pure exploitation task, children adapt their behavior to different probability bias levels by changing how much they exploit the most-recently-observed arm while maintaining high exploration across all bias levels.
Abstract: In reinforcement learning, agents often need to decide between selecting actions that are familiar and have previously yielded positive results (exploitation), and seeking new information that could allow them to uncover more effective actions (exploration). Understanding the specific heuristics and strategies that humans employ to solve this problem over the course of development remains an open question in cognitive science and AI. In this study, we develop an "observe or bet" task that separates "pure exploration" from "pure exploitation." Participants can either observe an instance of an outcome and receive no reward, or bet on an action that is eventually rewarding but offers no immediate feedback. We collected data from 56 five-to-seven-year-old children, each of whom completed the task at one of three probability levels. We compared children's performance against both approximate solutions to the partially observable Markov decision process (POMDP) and meta-RL models that were meta-trained on the same decision-making task across the different probability levels. We found that children observe significantly more than both classes of algorithms. We then quantified how children's policies differ across probability levels by fitting probabilistic programming models and by calculating the likelihood of the children's actions under the task-driven model. The fitted parameters of the behavioral model, together with the direction of the deviation from the neural network policies, demonstrate that children adapt primarily by changing the frequency with which they bet on the door for which they have less evidence. This suggests both that children model the causal structure of the environment and that they produce a "hedging behavior" that reduces variance in overall rewards and would be impossible to detect in standard bandit tasks. The results shed light on how children reason about reward and information, providing a developmental benchmark that can help shape our understanding of both human behavior and RL neural network models.
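To make the task structure concrete, below is a minimal sketch of one plausible reading of the observe-or-bet environment: observing reveals the outcome but pays nothing, while betting accrues reward silently, revealed only at the end. The class and parameter names (`ObserveOrBetEnv`, `bias`, `n_trials`) are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the "observe or bet" task; names are illustrative.
import random

class ObserveOrBetEnv:
    """Two-door observe-or-bet task: observing gives feedback but no reward;
    betting accrues hidden reward with no immediate feedback."""

    def __init__(self, bias=0.8, n_trials=50, seed=None):
        self.rng = random.Random(seed)
        self.bias = bias                      # probability the good door pays off
        self.good_door = self.rng.randint(0, 1)
        self.n_trials = n_trials
        self.t = 0
        self.banked_reward = 0                # revealed only when the game ends

    def step(self, action):
        """action is ('observe', None) or ('bet', door) with door in {0, 1}."""
        assert self.t < self.n_trials, "episode is over"
        self.t += 1
        # Sample which door is correct on this trial.
        outcome = self.good_door if self.rng.random() < self.bias \
            else 1 - self.good_door
        kind, door = action
        if kind == 'observe':
            return outcome                    # feedback, but no reward
        self.banked_reward += 1 if door == outcome else -1
        return None                           # reward accrues silently
```

Under this reading, an agent trades trials spent observing (gathering evidence about which door is good) against trials spent betting (converting that evidence into reward), which is what separates pure exploration from pure exploitation.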
Submission Number: 27