The Courage to Stop: Overcoming Sunk Cost Fallacy in Deep Reinforcement Learning

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We observe the sunk cost fallacy in deep RL, which can hurt sample efficiency and performance, and propose a simple optimization that unlocks agents' ability to stop episodes on their own, mitigating this issue.
Abstract: Off-policy deep reinforcement learning (RL) agents typically leverage replay buffers to reuse past experiences during learning. This can improve sample efficiency when the collected data is informative and aligned with the learning objectives; when it is not, it "pollutes" the replay buffer with data that can exacerbate optimization challenges, in addition to wasting environment interactions on redundant sampling. We argue that sampling these uninformative and wasteful transitions can be avoided by addressing the **sunk cost fallacy**, which, in the context of deep RL, is the tendency to continue an episode until termination. To address this, we propose the *learn to stop* (**LEAST**) mechanism, which uses statistics based on $Q$-values and gradients to guide early episode termination, helping agents recognize when to cut unproductive episodes short. We demonstrate that our method improves learning efficiency for a variety of RL algorithms, evaluated on both the MuJoCo and DeepMind Control Suite benchmarks.
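To make the idea concrete, here is a minimal, illustrative sketch of how a stopping rule driven by $Q$-value and gradient statistics might sit inside an off-policy training loop. The class name `EarlyStopMonitor`, the window size, and the quantile/scale thresholds are assumptions made for illustration only; they are not the paper's actual LEAST criterion.

```python
# Illustrative sketch only -- NOT the paper's LEAST implementation.
# Assumes the training loop can supply, per step, the critic's Q-value for the
# current state-action pair and the norm of the most recent critic gradient.

from collections import deque
import numpy as np


class EarlyStopMonitor:
    """Tracks running Q-value and gradient statistics and flags when an
    episode looks unproductive enough to terminate early (hypothetical rule)."""

    def __init__(self, window=100, q_quantile=0.1, grad_scale=2.0):
        self.q_history = deque(maxlen=window)      # recent Q-values across steps
        self.grad_history = deque(maxlen=window)   # recent critic gradient norms
        self.q_quantile = q_quantile               # "low Q" cutoff quantile
        self.grad_scale = grad_scale               # gradient spike factor

    def update(self, q_value, grad_norm):
        self.q_history.append(q_value)
        self.grad_history.append(grad_norm)

    def should_stop(self, q_value, grad_norm):
        # Wait until enough statistics have accumulated.
        if len(self.q_history) < self.q_history.maxlen:
            return False
        q_low = np.quantile(self.q_history, self.q_quantile)
        grad_typical = np.median(self.grad_history)
        # Hypothetical rule: stop when the current state looks unpromising
        # (Q below the low quantile) while gradients suggest the new data
        # mostly adds optimization noise rather than useful signal.
        return q_value < q_low and grad_norm > self.grad_scale * grad_typical
```

In a training loop, one would call `update()` every environment step and reset the episode whenever `should_stop()` returns `True`, instead of always running to the environment's natural termination.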
Lay Summary: In this paper, we observe the sunk cost fallacy in mainstream deep RL algorithms. It manifests as agents blindly executing each episode to completion, incurring significant interaction costs and potentially polluting the data distribution in the buffer, which limits their learning potential. This observation highlights an overlooked aspect within the community: the inability of existing algorithms to stop before entering suboptimal trajectories may be a hidden factor limiting their performance. To address this issue, we propose a direct optimization approach called **LEAST** that quantifies the current situation and controls when sampling terminates. **LEAST** effectively mitigates the sunk cost fallacy without requiring additional networks and improves learning efficiency across diverse scenarios. We hope the problem presented in this paper inspires future research on optimizing sampling and learning efficiency.
Primary Area: Reinforcement Learning->Deep RL
Keywords: Deep Reinforcement Learning, Sunk Cost Bias, Sample Efficiency, Continuous Action Space
Submission Number: 9923