A Simple Scaling Model for Bootstrapped DQN

TMLR Paper6517 Authors

15 Nov 2025 (modified: 04 Feb 2026). Under review for TMLR. License: CC BY 4.0

Abstract: We present a large-scale empirical study of Bootstrapped DQN (BDQN) and Randomized-Prior BDQN (RP-BDQN) in DeepSea, an environment designed to isolate and parameterize exploration difficulty. Our primary contribution is a simple scaling model that accurately captures the probability of reward discovery as a function of task hardness and ensemble size. The model is parameterized by a method-dependent effectiveness factor, $\psi$. Under this framework, RP-BDQN is substantially more effective ($\psi \approx 0.87$) than BDQN ($\psi \approx 0.80$), which enables it to solve more challenging tasks. Our analysis indicates that this advantage stems from RP-BDQN's sustained ensemble diversity, which mitigates the posterior collapse observed in BDQN. Notably, the model succeeds despite assuming member independence, which suggests that complex ensemble interactions may be a secondary factor in overall performance. Furthermore, we show how systematic deviations from this simple model can be used to diagnose subtler dynamics such as cooperation and diversity saturation. These results offer practical guidance for ensemble configuration and provide a methodological framework for future studies of deep exploration.
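The abstract does not state the functional form of the scaling model, so the following is only an illustrative sketch of what an independence-based model of this kind could look like. It assumes, hypothetically, that a single ensemble member discovers the reward with probability $\psi^{N}$ for a task of hardness $N$, and that $K$ independent members give $P = 1 - (1 - \psi^{N})^{K}$. The function name, the form $\psi^{N}$, and the example hardness values are assumptions made for illustration and are not taken from the paper; only the two $\psi$ values come from the abstract.

```python
def discovery_probability(psi: float, hardness: int, ensemble_size: int) -> float:
    """Hypothetical independent-member scaling model (an assumption, not the paper's).

    Each member is assumed to find the reward with probability psi ** hardness,
    and the ensemble succeeds if at least one of its members does.
    """
    if not 0.0 <= psi <= 1.0:
        raise ValueError("psi must lie in [0, 1]")
    if hardness < 0 or ensemble_size < 0:
        raise ValueError("hardness and ensemble_size must be non-negative")
    p_member = psi ** hardness                      # assumed per-member success rate
    return 1.0 - (1.0 - p_member) ** ensemble_size  # at least one member succeeds


if __name__ == "__main__":
    # psi values are the ones reported in the abstract; hardness and K are made up.
    for name, psi in [("BDQN", 0.80), ("RP-BDQN", 0.87)]:
        for hardness in (10, 20, 30):
            for k in (1, 10, 20):
                p = discovery_probability(psi, hardness, k)
                print(f"{name:8s} hardness={hardness:2d} K={k:2d} P={p:.3f}")
```

Under these assumptions, a modest gap in $\psi$ compounds with hardness, which is one way a difference such as 0.87 versus 0.80 could translate into solving noticeably harder tasks at the same ensemble size.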

Submission Type: Regular submission (no more than 12 pages of main content)

Changes Since Last Submission: The revised manuscript significantly expands the discussion and analysis to address transferability and practical constraints. The authors added Section 5.1 to contextualize how their "hardness" parameter and scaling law structure function as a theoretical baseline for unstructured or stochastic environments, acknowledging the limitations of the grid-world setting. A new Section 4.3 and an updated Figure 4 now explicitly analyze computational trade-offs, demonstrating that efficiency (discovery per unit of compute) peaks at $K=7$ and recommending "split" ensembles over single large ones to mitigate diminishing returns. Furthermore, Section 4.2 was refined to mechanistically link model residuals to specific phenomena: it attributes the "cooperative" boost at small $K$ to the shared replay buffer and the "saturation" at large $K$ to the pigeonhole principle. These insights are now synthesized into actionable advice in the newly added Section 5.2 (Prescriptive Guide).

Assigned Action Editor: ~Zheng_Wen1

Submission Number: 6517
