% \documentclass{uai2025} % for initial submission
\documentclass[accepted]{uai2025} % after acceptance, for a revised version; 
% also before submission to see how the non-anonymous paper would look like 
                        
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2025} % ptmx math instead of Computer
                                         % Modern (has noticeable issues)
% \documentclass[mathfont=newtx]{uai2025} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams
\usepackage{subfigure}
\usepackage{amssymb}
\usepackage{algorithm}
\usepackage{algorithmic}
%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

\title{Full Network Capacity Framework \\ for Sample-Efficient Deep Reinforcement Learning}

% The standard author block has changed for UAI 2025 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author{Wentao Yang}
\author{Xinyue Liu}
\author{Yunlong Gao}
\author{Wenxin Liang}
\author{\\Linlin Zong}
\author{Guanglu Wang}
\author{Xianchao Zhang\thanks{Corresponding author}}
% Add affiliations after the authors
\affil{%
    School of Software\\
    Dalian University of Technology\\
    Dalian, Liaoning, China
}
  
  \begin{document}
\maketitle



\begin{abstract}
In deep reinforcement learning (DRL), the presence of dormant neurons leads to a significant reduction in network capacity, which results in sub-optimal performance and limited sample efficiency. Existing training techniques, especially those relying on periodic resetting (PR), exacerbate this issue.
We propose the Full Network Capacity (FNC) framework based on PR, which consists of two novel modules: Dormant Neuron Reactivation (DNR) and Stable Policy Update (SPU). 
DNR continuously reactivates dormant neurons, thereby enhancing network capacity. 
SPU mitigates perturbation from DNR and PR and stabilizes the Q-values for the actor, ensuring smooth training and reliable policy updates.
Our experimental evaluations on the Atari 100K and DMControl 100K benchmarks demonstrate the remarkable sample efficiency of FNC. 
On Atari 100K, FNC achieves a superhuman IQM HNS of 107.3\%, outperforming the previous state-of-the-art method BBF by 13.3 percentage points. 
On DMControl 100K, FNC excels in 5 out of 6 tasks in terms of episodic return and attains the highest median and mean aggregated scores.
FNC not only maximizes network capacity but also provides a practical solution for real-world applications where data collection is costly and time-consuming. 
Our implementation is publicly accessible at \url{https://github.com/tlyy/FNC}.
\end{abstract}


\section{Introduction}
Deep reinforcement learning (DRL) has emerged as a pivotal approach in the realm of artificial intelligence, especially when dealing with sequential decision-making problems in environments fraught with uncertainty. 
Consider the domain of autonomous aerial vehicles (AAVs). 
The AAVs operate in complex and unpredictable atmospheres, where factors like sudden gusts, rapidly changing weather conditions, and unforeseen obstacles pose significant challenges. 
Each flight path decision they make is subject to a high degree of uncertainty, and traditional DRL algorithms often struggle to handle these uncertainties efficiently with limited data.
Similarly, in financial trading, market conditions fluctuate constantly due to a myriad of factors such as geopolitical events, economic policies, and investor sentiment.
Traders using DRL-based strategies need to make decisions in this highly uncertain environment.
However, the large number of samples required by traditional DRL methods for effective learning can be an obstacle, as market data is often expensive to obtain and rapidly changing.

To enhance the practicality of DRL, numerous studies have concentrated on improving the sample efficiency of DRL agents \citep{CURL,DrQ,SPR,SR-SPR,PlayVirtual,BBF}. 
Recent research \citep{reset} has revealed a significant problem: network over-fitting on early interaction samples. 
This over-fitting problem is particularly prominent in uncertain environments, as the limited initial data may not accurately represent all possible scenarios. 
Consequently, the learned policies become less reliable, increasing the uncertainty in decision-making.
To counter this over-fitting problem, periodic resetting (PR) of network parameters has been proposed \citep{reset}. 
The BBF method \citep{BBF}, which incorporates PR, has achieved state-of-the-art performance on the Atari 100K \citep{SimPLe} benchmark. 
Nevertheless, as the size of the network continues to increase, further improvements become challenging, accompanied by a substantial increase in computing and storage costs.

\begin{figure*}[htbp]
    \centering
    \subfigure[The dormant neuron ratio]{\includegraphics[width=0.37\linewidth]{fig/Dormant Kangaroo.pdf}}
    \subfigure[The episodic return]{\includegraphics[width=0.37\linewidth]{fig/performance curve.pdf}}
    \caption{The dormant neuron ratio (a) and episodic return (b) during training for a baseline, baseline with periodic resetting (PR), and FNC. 
    A high dormant neuron ratio indicates low network capacity. 
    Although PR improves final performance, it causes a higher dormant neuron ratio than the baseline. 
    FNC fixes the unsteady capacity of PR, rapidly reducing the dormant neuron ratio close to 0 and achieving full network capacity for better performance.}
    \label{fig:dormant_neurons}
\end{figure*}

Moreover, recent studies \citep{ReDo} have revealed a neuron dormancy phenomenon in DRL. 
A large number of neurons remain inactive during training, especially when PR is applied. 
As depicted in Figure \ref{fig:dormant_neurons}, the dormant neuron ratio spikes after each PR operation and remains high throughout the training process. 
This under-utilization of network capacity leads to sub-optimal performance. 
When neurons are dormant, the network fails to fully utilize its potential, reducing its ability to capture complex environment patterns, which is especially crucial in uncertain scenarios.

To overcome these challenges in DRL under limited samples, we propose the Full Network Capacity (FNC) framework based on PR.
FNC consists of two novel modules: Dormant Neuron Reactivation (DNR) and Stable Policy Update (SPU).
DNR continuously locates and reactivates dormant neurons.
It ensures that the network operates at its full capacity, i.e., it rapidly reduces the dormant neuron ratio close to 0, as shown in Figure \ref{fig:dormant_neurons}. 
However, the parameter perturbations introduced by DNR and by the original PR can destabilize the Q-network. 
Therefore, SPU adopts a momentum Q-network to smooth the perturbations from DNR and PR and to evaluate the actor policy.

We evaluate FNC on two standard benchmarks for DRL under limited samples: the Atari 100K benchmark \citep{SimPLe} and the DMControl 100K benchmark \citep{PlaNet}. 
These benchmarks, although artificial, mimic the uncertainty present in real-world scenarios. 
In the Atari 100K benchmark, FNC achieves superhuman sample efficiency, outperforming the previous state-of-the-art method BBF. 
In the DMControl 100K benchmark, FNC also shows remarkable results, leading in most tasks.

Our contributions can be summarized as follows:
\begin{itemize}
    \item We identify that neuron dormancy reduces network capacity and performance, especially when training with PR under limited samples, which is a critical problem when dealing with uncertainty in DRL.
    \item We propose the FNC framework with two novel modules to maximize network capacity and stabilize training, reducing uncertainty in the learned policies.
    \item We demonstrate that FNC achieves state-of-the-art performance on the Atari 100K and DMControl 100K benchmarks with limited computational resources, providing a practical solution for DRL in uncertain, sample-constrained settings.
\end{itemize}



\section{Related Work}
\subsection{Sample Efficiency in DRL}
Sample efficiency is a critical aspect of DRL, as it determines the ability of agents to learn effectively with a limited number of interaction samples. 
High sample complexity has long been a significant hurdle in the practical application of DRL agents, especially in real-world scenarios where interactions can be costly, time-consuming, or even dangerous. 

Numerous techniques have been proposed to enhance sample efficiency:
\begin{itemize}
\item \textbf{Experience Replay}: Experience replay \citep{DQN2013} stores past experiences $(s, a, r, s')$ in a replay buffer. 
During training, these experiences are randomly sampled and reused. 
The replay ratio (RR) \citep{SR-SPR}, defined as the ratio of learning updates to new experiences, plays a crucial role in optimizing performance on limited data. 
For instance, the original DQN algorithm uses an RR of 0.25. 
However, more recent and efficient agents often utilize higher ratios, allowing them to learn more from the available data.
\item \textbf{Data Augmentation}: Techniques like DrQ \citep{DrQ} and RAD \citep{RAD} introduce data augmentation methods to DQN \citep{DQN2013, DQN2015} and SAC \citep{SAC, SAC2}. 
These methods, such as random shift and intensity adjustments, increase the diversity of the training data. 
By artificially creating more varied input data, the agent can learn more generalizable patterns, leading to improved performance.
\item \textbf{Self-Supervised Learning}: SPR \citep{SPR} builds on the Rainbow \citep{Rainbow} algorithm and incorporates a self-supervised temporal consistency loss based on BYOL \citep{BYOL}, along with data augmentation. 
The self-supervised loss helps the agent learn about the underlying structure of the environment without relying solely on the rewards provided.
\item \textbf{Model-Based Methods}: Algorithms such as SimPLe \citep{SimPLe} and EfficientZero \citep{EfficientZero} focus on learning the environment dynamics. 
By building a model of how the environment behaves, these methods can make more informed decisions and require fewer real-world interactions. 
\end{itemize}

\subsection{Network Capacity in DRL}
Network capacity, which is related to the number of active neurons in a network, is another key factor in DRL. 
A higher network capacity improves the agent's ability to model complex situations. 
Two main ways to enhance network capacity are enlarging the network size and increasing the active neuron ratio in a fixed-size network.

\begin{itemize}
\item \textbf{Scaling Up Networks}: \citet{Large} showed how increasing the number of layers and neurons in a network can improve its representational power.
They also highlighted the challenges, such as increased computational requirements and over-fitting. 
\citet{BBF} achieved sample-efficient performance by scaling the neural networks used for value estimation, combined with other design choices that enabled this scaling. 
However, increasing the network size brings additional computational and storage costs.
\item \textbf{Neuron Dormancy}: Recent studies \citep{ReDo, CBP, CBP-Nat} have uncovered the neuron dormancy phenomenon in neural networks: a significant number of neurons remain inactive during training and do not contribute to the network's output. 
This under-utilization of network capacity wastes computational resources and limits the agent's learning ability.
\end{itemize}

Most previous methods have focused on enhancing network capacity by enlarging the network size, which inevitably brings additional computational and storage burdens. 
In contrast, increasing the active neuron ratio in a fixed-size network has been relatively under-explored.
Our approach belongs to the latter, enhancing network capacity without introducing additional burdens.



\section{Problem Formulation}
Deep Reinforcement Learning (DRL) trains an agent to make optimal sequential decisions within an environment.
The goal is to maximize cumulative rewards. 
The overall procedure is formalized through the concept of a Markov Decision Process (MDP) \citep{MDP}, represented by the tuple $M = (S, A, P, R, \rho_0, \gamma)$.
The set $S$ encompasses all possible states $s$ that the agent can occupy within the environment. 
The set $A$ consists of all the actions $a$ that the agent can execute. 
Given a state $s$ and an action $a$, the transition probability function $P(s'|s, a)$ quantifies the likelihood of the environment transitioning from the current state $s$ to a new state $s'$. 
The reward function $R(s, a)$ assigns a scalar value $r$ to the agent when it takes action $a$ in state $s$. 
$\rho_0(s)$ represents the probability distribution over the initial states.
The discount factor $\gamma \in [0, 1]$ determines the relative importance of future rewards compared to immediate rewards.

The goal of DRL is to discover a policy $\pi_\phi (a|s)$ that maximizes the expected cumulative reward, expressed as:
\begin{equation}
J(\pi)=\mathbb{E}[\sum_{t = 0}^{\infty}\gamma^t R(s_t,a_t)|\pi].
\end{equation}

Two crucial functions in DRL algorithms are the state-value function $V^\pi(s)$ and the action-value function $Q^\pi(s, a)$:
\begin{equation}
V^\pi(s)=\mathbb{E}[\sum_{t = 0}^{\infty}\gamma^t R(s_t,a_t)|s_0 = s,\pi].
\end{equation}
\begin{equation}
Q^\pi(s,a)=\mathbb{E}[\sum_{t = 0}^{\infty}\gamma^t R(s_t,a_t)|s_0 = s,a_0 = a,\pi].
\end{equation}
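
For reference, and assuming the standard MDP setting defined above, these definitions imply the following one-step relations, which temporal-difference methods exploit when bootstrapping from the next state:
\begin{gather*}
V^\pi(s)=\mathbb{E}_{a\sim\pi(\cdot|s)}\left[Q^\pi(s,a)\right],\\
Q^\pi(s,a)=R(s,a)+\gamma\,\mathbb{E}_{s'\sim P(\cdot|s,a)}\left[V^\pi(s')\right].
\end{gather*}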

Common DRL algorithms include Q-learning \citep{Q-learning} and Actor-Critic \citep{AC}. Q-learning updates the action-value function $Q(s,a)$ as follows:
\begin{equation}
Q(s,a)\leftarrow Q(s,a)+\alpha \left(R(s,a)+\gamma \max_{a'} Q(s',a') - Q(s,a)\right),
\end{equation}
where $\alpha$ is the learning rate.

In the Actor-Critic (AC) approach, the actor is responsible for updating the policy, while the critic evaluates the state or action values. 
The critic learns the state-value function $V(s)$ or the action-value function $Q(s,a)$ and updates it using the temporal difference (TD) error:
\begin{equation}
\theta \leftarrow \theta+\alpha_C \delta_t \nabla_{\theta}V(s_t), 
\end{equation}
where 
\begin{equation}
\delta_t = R(s_t,a_t)+\gamma V(s_{t + 1})-V(s_t), 
\end{equation}
and $\alpha_C$ is the critic's learning rate. 
The actor then updates the policy using an advantage-based policy gradient:
\begin{equation}
\phi \leftarrow \phi+\alpha_A \delta_t \nabla_{\phi}\log\pi(a_t|s_t;\phi),
\end{equation}
where $\alpha_A$ is the actor's learning rate.
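
As a small numerical illustration (the values are hypothetical and chosen only for arithmetic convenience), take a reward of $1$, $\gamma = 0.99$, $V(s_{t+1}) = 2.0$, and $V(s_t) = 2.5$. 
The TD error is then
\begin{equation*}
\delta_t = 1 + 0.99 \times 2.0 - 2.5 = 0.48 .
\end{equation*}
Since $\delta_t > 0$, the critic update moves $V(s_t)$ upward, and the actor update increases $\log\pi(a_t|s_t;\phi)$, making action $a_t$ more likely in state $s_t$. 
A negative $\delta_t$ would have the opposite effect on both updates.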



\begin{figure*}[htb!]
  \centering
  \includegraphics[width=0.9\linewidth]{fig/fnc.pdf}
  \caption{
    Details of the Full Network Capacity (FNC) training framework.
    FNC introduces two new mechanisms into each training update: Dormant Neuron Reactivation (DNR) to activate the dormant neurons and Stable Policy Update (SPU) to smooth the perturbation from DNR and periodic resetting.
  }
  \label{scheme}
\end{figure*}



\section{Full Network Capacity Training Framework}
\subsection{Overview}
The Full Network Capacity (FNC) framework fully exploits network capacity when applying periodic resetting (PR) under limited samples. 
FNC incorporates two novel modules into the PR framework: Dormant Neuron Reactivation (DNR) and Stable Policy Update (SPU), as shown in Figure \ref{scheme}. 
These modules work in tandem to enhance network capacity and ensure stable training, thereby improving the sample efficiency of DRL agents.

\subsection{Dormant Neuron Reactivation}
The Dormant Neuron Reactivation (DNR) module addresses dormant neurons by detecting and reactivating these neurons.

\begin{algorithm}[htb]
\caption{Dormant Neuron Reactivation (DNR)}
\label{alg:dnrm}
\begin{algorithmic}[1]
\STATE \textbf{Input:} Online Q network parameters \( \theta_o \), DNR weight \( \beta \), dormant threshold \(\delta\).
\STATE \textbf{Output:} Reactivated parameters \( \theta_{\text{rea}} \)
\FOR{each layer \( l \)}
    \FOR{each unit \( i \)}
        \STATE Compute the normalized activation score: \\ \[ a^{l,i} = \frac{\mathbb{E}_{x\in D}|h_i^l(x)|}{\frac{1}{H^l}\sum_{k=1}^{H^l}\mathbb{E}_{x\in D}|h_k^l(x)|}
         \]
        \IF{ \( a^{l,i} \le \delta \) }
            \STATE Reactivate the neuron parameters:\\ \[ \theta^{l,i}_{\text{rea}} = \beta \cdot \theta^{l,i}_o + (1 - \beta) \cdot \theta^{l,i}_{\text{init}} \]
        \ENDIF
    \ENDFOR
\ENDFOR
\STATE \textbf{return} \( \theta_{\text{rea}} \)
\end{algorithmic}
\end{algorithm}

We first identify dormant neurons. The normalized activation score \(a^{l,i}\) of neuron \(i\) in layer \(l\) is calculated as:
\begin{equation}
a^{l,i} = \frac{\mathbb{E}_{x\in D}|h_i^l(x)|}{\frac{1}{H^l}\sum_{k=1}^{H^l}\mathbb{E}_{x\in D}|h_k^l(x)|},
\end{equation}
where \(D\) is the input distribution, \(h_i^l(x)\) is the neuron's activation for input \(x\in D\), and \(H^l\) is the number of neurons in layer \(l\). Neurons with \(a^{l,i}\leq\delta\) form the dormant neuron index set \(I\):
\begin{equation}
I = \{i \mid a^{l,i} \le \delta\}.
\end{equation}
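
To illustrate the detection rule with hypothetical numbers, consider a layer with \(H^l = 4\) neurons whose mean absolute activations are \(0.0\), \(0.2\), \(0.6\), and \(1.2\). 
The layer average is \((0.0+0.2+0.6+1.2)/4 = 0.5\), so the normalized scores are
\[
(a^{l,1}, a^{l,2}, a^{l,3}, a^{l,4}) = (0,\; 0.4,\; 1.2,\; 2.4).
\]
With an assumed threshold \(\delta = 0.1\), only the first neuron is flagged as dormant. 
Because each score is divided by the layer average, the scores of a layer always average to \(1\), which keeps a single threshold meaningful across layers with different activation scales.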

After locating the dormant neurons, we reactivate them using shrink and perturb operations.
To prevent the network from converging to sharp minima, we shrink the parameters of dormant neurons. 
Given online parameters \(\theta_o\), the shrunk parameters are:
\[
\theta^{l,i}_{\text{shrink}} = \beta \cdot \theta^{l,i}_o
\]
where \(\beta\in(0, 1)\) is the shrink weight.
To encourage exploration in the parameter space, we add a fraction of the initial parameters to the shrunk parameters:
\[
\theta^{l,i}_{\text{perturb}} = (1 - \beta) \cdot \theta^{l,i}_{\text{init}}
\]

The reactivated parameters combine the two operations:
\[
\theta^{l,i}_{\text{rea}} = \theta^{l,i}_{\text{shrink}} + \theta^{l,i}_{\text{perturb}}
\]
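Equivalently, the combined update interpolates between the current and the initial parameters:
\[
\theta^{l,i}_{\text{rea}} = \theta^{l,i}_{\text{init}} + \beta \left(\theta^{l,i}_o - \theta^{l,i}_{\text{init}}\right),
\]
so a value of \(\beta\) close to \(1\) keeps a dormant neuron near its current weights, whereas a value close to \(0\) moves it back toward its initialization. 
As a hypothetical scalar example, \(\beta = 0.8\), \(\theta_o = 0.5\), and \(\theta_{\text{init}} = -0.1\) give \(0.8 \times 0.5 + 0.2 \times (-0.1) = 0.40 - 0.02 = 0.38\).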
The DNR process is summarized in Algorithm \ref{alg:dnrm}.
We focus on the dormancy of the critic (the Q-network), as an empirical study \citep{AdpRR} shows that critic dormancy has a more pronounced impact on sample efficiency.
Moreover, directly applying DNR to the actor's dormant neurons introduces too much instability into the policy.

\subsection{Stable Policy Update}
The perturbation introduced by the DNR and periodic resetting can disrupt the stability of policy updates. 
The Stable Policy Update (SPU) module mitigates this issue by using a momentum Q-network for policy updates. 
The process is described in Algorithm \ref{alg:spum}.
We do not use a delayed hard-copy update for this network, because the policy is derived from the momentum Q-network:
under a delayed copy, the policy would stay fixed within each copy interval, which makes it unsuitable for collecting new data.

\begin{algorithm}[htb]
\caption{Stable Policy Update (SPU)}
\label{alg:spum}
\begin{algorithmic}[1]
\STATE \textbf{Input:} Online Q-network parameters \( \theta_o \), Momentum Q-network parameters \(\theta_m\), momentum parameter \( \tau \)
\STATE \textbf{Output:} Actor policy parameters \( \phi \)
\STATE Initialize \( \theta_m = \theta_o \)
\FOR{each training step}
    \STATE Update \( \theta_m \) using the momentum update rule:\\
    \[\theta_m = \tau \cdot \theta_o + (1 - \tau) \cdot \theta_m\]
    \STATE Use \( \theta_m \) to compute Q-values for actor policy parameters $\phi$ updates
\ENDFOR
\STATE \textbf{return} \( \phi \)
\end{algorithmic}
\end{algorithm}
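
The smoothing effect of the momentum update can be made explicit by unrolling it. 
Writing \(\theta_o^{(k)}\) and \(\theta_m^{(k)}\) for the parameters after training step \(k\) (a superscript notation introduced here only for illustration), the update rule in Algorithm \ref{alg:spum} gives
\begin{equation*}
\theta_m^{(t)} = (1-\tau)^{t}\,\theta_m^{(0)} + \tau\sum_{k=1}^{t}(1-\tau)^{t-k}\,\theta_o^{(k)},
\end{equation*}
so the momentum parameters are an exponentially weighted average of past online parameters. 
In particular, if DNR or PR changes the online parameters by \(\Delta\) at one step, the momentum parameters change by only \(\tau\Delta\) at that step; for an assumed value such as \(\tau = 0.01\), this is \(1\%\) of the perturbation.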

In the discrete control setting, although the deep Q-network (DQN) does not have an explicit actor in the traditional sense, the Q-network can be regarded as the actor for action generation. 
In this case, the parameters of the actor and the Q-network are identical. 
SPU treats the momentum Q-network in DQN as the actor, and the actual policy is:
\begin{equation}
a = \operatorname*{arg\,max}_{a'} Q_{\theta_m}(s, a'),
\end{equation}
which contrasts with previous approaches that typically employ the online Q-network.

In the continuous control setting, the actor policy is explicitly defined with parameters $\phi$. 
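SPU optimizes the actor by maximizing the following objective, evaluated with the momentum critic:
\begin{equation}
J_\pi(\phi) = \mathbb{E}_{a\sim\pi_\phi(\cdot|s)}\left[Q_{\theta_m}(s, a) - \alpha \log\pi_\phi(a|s)\right],
\end{equation}
where $\alpha$ here denotes the entropy temperature of SAC rather than a learning rate.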
Since DNR-induced perturbations only occur in the online Q-network during training, and periodic resetting (PR) also affects the network periodically, SPU plays a crucial role in preventing drastic changes caused by both PR and DNR. 
It generates a stable Q-value for policy updates, which is also beneficial in preventing the emergence of dormant neurons, as previously noted in related research \citep{ReDo}.
Furthermore, because the momentum Q-network lags behind the online Q-network, it tends to share the same dormant neurons.
Once these neurons are detected and reactivated in the online Q-network, SPU prevents the corresponding neurons of the momentum Q-network from remaining completely inactive. 


\section{Experiment}
We conduct a comprehensive evaluation of the proposed Full Network Capacity (FNC) framework on two standard benchmarks for DRL under limited samples. 
The primary objectives are to thoroughly assess the performance, network capacity, and distinct advantages of FNC over existing methods.

\subsection{Experimental Setup}
\subsubsection{Benchmark Selection Rationale}
The two benchmarks employed in our study are the Atari 100K benchmark \citep{SimPLe} and the DMControl 100K benchmark \citep{PlaNet}. 
The Atari 100K benchmark is a rich source of diverse vision-based control tasks, consisting of 26 games with low-dimensional discrete actions. 
They test an agent's ability to perceive visual cues, make quick and accurate action selections, and adapt to different game mechanics.
The DMControl 100K benchmark focuses on six control tasks with high-dimensional continuous actions.
These tasks evaluate an agent's proficiency in handling continuous control problems, understanding complex dynamics, and making fine-grained decisions in dynamic environments.




\begin{table*}[htb]
\caption{
    Final scores and aggregate metrics for FNC and competing methods \citep{SPR, rliable, SR-SPR, BBF} across the 26 Atari 100K games. 
    Scores are averaged across five runs per game for FNC. We report the standard error for game scores and the 95\% bootstrap confidence interval for the aggregate metrics of our method FNC.
}
\begin{center}
\scalebox{1}{
\begin{tabular}{lccccccc}
\hline
Game           & Human   & Random  & DrQ($\epsilon$)    & SPR     & SR-SPR   & BBF & FNC     \\ 
\hline
Alien          & 7127.7  & 227.8   & 865.2      & 841.9   & 1107.8   & 1121.7     & \textbf{1250.3 $\pm$ 76.0}  \\
Amidar         & 1719.5  & 5.8     & 137.8      & 179.7   & 203.4    & \textbf{236.6}   & 173.7 $\pm$ 25.2   \\
Assault        & 742.0   & 222.4   & 579.6      & 565.6   & 1088.9   & 2004.5     & \textbf{2521.4 $\pm$ 305.7}  \\
Asterix        & 8503.3  & 210.0   & 763.6      & 962.5   & 903.1    & 3169.8     & \textbf{4410.7 $\pm$ 562.9}  \\
Bank Heist     & 753.1   & 14.2    & 232.9      & 345.4   & 531.7    & 768.8      & \textbf{781.3 $\pm$ 169.5}  \\
Battle Zone    & 37187.5 & 2360.0  & 10165.3    & 14834.1 & 17671.0  & 23681.4    & \textbf{23338.0 $\pm$ 2408.1} \\
Boxing         & 12.1    & 0.1     & 9.0        & 35.7    & 45.8     & 77.4       & \textbf{79.8 $\pm$ 6.3}    \\
Breakout       & 30.5    & 1.7     & 19.8       & 19.6    & 25.5     & 331.1      & \textbf{374.2 $\pm$ 8.2}   \\
ChopperCommand & 7387.8  & 811.0   & 844.6      & 946.3   & 2362.1   & \textbf{4251.6}  & 2802.2 $\pm$ 995.8 \\
CrazyClimber   & 35829.4 & 10780.5 & 21539.0    & 36700.5 & 45544.1  & 60864.5    & \textbf{63323.2 $\pm$ 9785.8} \\
DemonAttack    & 1971.0  & 152.1   & 1321.5     & 517.6   & 2814.4   & 18298.4    & \textbf{20798.0 $\pm$ 4223.8} \\
Freeway        & 29.6    & 0.0     & 20.3       & 19.3    & 25.4     & 23.1       & \textbf{27.1 $\pm$ 1.4}    \\
Frostbite      & 4334.7  & 65.2    & 1014.2     & 1170.7  & \textbf{2584.8} & 2023.1    & 1377.4 $\pm$ 705.2  \\
Gopher         & 2412.5  & 257.6   & 621.6      & 660.6   & 712.4    & 1209.4    & \textbf{1629.7 $\pm$ 285.9} \\
Hero           & 30826.4 & 1027.0  & 4167.9     & 5858.6  & \textbf{8524.0}   & 5741.8  & 5604.6 $\pm$ 624.1 \\
Jamesbond      & 302.8   & 29.0    & 349.1      & 366.5   & 389.1    & \textbf{1124.6}  & 1058.7 $\pm$ 172.8 \\
Kangaroo       & 3035.0  & 52.0    & 1088.4     & 3617.4  & 3631.7   & 5032.1  & \textbf{8202.0 $\pm$ 1830.8 }\\
Krull          & 2665.5  & 1598.0  & 4402.1     & 3681.6  & 5911.8   & 8069.8  & \textbf{8075.1 $\pm$ 59.0 }\\
KungFuMaster   & 22736.3 & 258.5   & 11467.4    & 14783.2 & 18649.4  & 16616.9 & \textbf{21508.6 $\pm$ 4703.6}\\
Ms Pacman      & 6951.6  & 307.3   & 1218.1     & 1318.4  & 1574.1    & \textbf{2217.8}  & 1994.9 $\pm$ 206.8 \\
Pong           & 14.6    & -20.7   & -9.1       & -5.4    & 2.9      & \textbf{13.7}    & 10.2  $\pm$ 6.1  \\
PrivateEye     & 69571.3 & 24.9    & 3.5        & 86.0    & \textbf{97.9}   & 39.1    & 54.0 $\pm$ 46.0  \\
Qbert          & 13455.0 & 163.9   & 1810.7     & 866.3   & \textbf{4044.1}    & 3245.3  & 2897.7 $\pm$ 761.7 \\
RoadRunner     & 7845.0  & 11.5    & 11211.4    & 12213.1 & 13463.4   & 26419.0 & \textbf{30723.0 $\pm$ 2142.3} \\
Seaquest       & 42054.7 & 68.4    & 352.3      & 558.1   & 819.0      & \textbf{988.6}   & 835.6 $\pm$ 182.2  \\
UpNDown        & 11693.2 & 533.4   & 4324.5     & 10859.2 & \textbf{112450.3}  & 15122.7 & 17093.7 $\pm$ 3847.0\\
\hline
IQM HNS
($\uparrow$)   & 100.0\% & 0.0\%   & 28.0\%   & 33.7\%  & 63.1\%   & 94.0\%  & \textbf{107.3\%}[96.2\%,120.0\%] \\
OG HNS
($\downarrow$) & 0.0\%   & 100.0\% & 63.1\%   & 57.7\%  & 43.3\%   & 37.7\%  & \textbf{36.7\%}[34.2\%,39.4\%]  \\
Median HNS
($\uparrow$)   & 100.0\% & 0.0\%   & 31.3\%   & 39.6\%  & 68.5\%   & 75.5\%  & \textbf{89.4\%}[68.3\%,98.1\%]  \\
Mean HNS
($\uparrow$)   & 100.0\% & 0.0\%   & 46.5\%   & 61.6\%  & 127.2\%   & 217.5\%  & \textbf{240.1\%}[221.2\%,257.8\%] \\
\hline
\end{tabular}
}
\end{center}
\label{tab:atari_results}
\end{table*}

\subsubsection{Implementation Details}
In the discrete action settings, our implementation is grounded in the SPR algorithm framework \citep{SPR}. 
We inherit a similar architecture and incorporate random shifts and intensity data augmentation techniques \citep{DrQ}. 
To ensure a fair and reliable comparison, we meticulously follow the same architecture parameters and hyperparameters as those used in the BBF method \citep{BBF}. 
This includes the utilization of the Impala residual network as the encoder and expanding the network width by 4 times to enhance its representational capability.

In continuous action settings, we build on the DrQ framework \citep{DrQ}, a modified version of SAC \citep{SAC} with integrated data augmentation capabilities. 
It allows the framework to handle pixel-based input.
All hyperparameters, network architectures, and implementation choices are detailed in Appendix~\ref{app:implementation}. 



\begin{figure*}[htb]
  \centering
  \subfigure[Freeway]{\includegraphics[width=.22\linewidth]{fig/Ratio Freeway.pdf}}
  \subfigure[Gopher]{\includegraphics[width=.22\linewidth]{fig/Ratio Gopher.pdf}}
  \subfigure[Kangaroo]{\includegraphics[width=.22\linewidth]{fig/Ratio Kangaroo.pdf}}
  \subfigure[KungFuMaster]{\includegraphics[width=.22\linewidth]{fig/Ratio KungFuMaster.pdf}}
  \caption{
    The active neuron ratio during training for baseline with periodic resetting (PR) and FNC.
    The active neuron ratio of the baseline with PR is relatively low, especially right after each PR.
    FNC quickly recovers the full network capacity by increasing the active neuron ratio close to 100\% and maintains it even when PR is executed.
  }
  \label{Capacity}
\end{figure*}

\begin{figure*}[htb]
  \centering
  \subfigure[Freeway]{\includegraphics[width=.22\linewidth]{fig/Rank Freeway.pdf}}
  \subfigure[Gopher]{\includegraphics[width=.22\linewidth]{fig/Rank Gopher.pdf}}
  \subfigure[Kangaroo]{\includegraphics[width=.22\linewidth]{fig/Rank Kangaroo.pdf}}
  \subfigure[KungFuMaster]{\includegraphics[width=.22\linewidth]{fig/Rank KungFuMaster.pdf}}
  \caption{
    The effective rank \citep{ERank} during training for baseline with periodic resetting (PR) and FNC.
    FNC enhances the effective rank, enabling a better representation and expressivity on all selected games.
  }
  \label{Rank}
\end{figure*}



\begin{figure*}[htb]
  \centering
  \subfigure[Freeway]{\includegraphics[width=.22\linewidth]{fig/performance curve21.pdf}}
  \subfigure[Gopher]{\includegraphics[width=.22\linewidth]{fig/performance curve22.pdf}}
  \subfigure[Kangaroo]{\includegraphics[width=.22\linewidth]{fig/performance curve23.pdf}}
  \subfigure[KungFuMaster]{\includegraphics[width=.22\linewidth]{fig/performance curve24.pdf}}
  \caption{
    The episodic return during training for baseline with periodic resetting (PR) and FNC.
    FNC outperforms the baseline with periodic resetting on all selected games.
  }
  \label{performance curve}
\end{figure*}



\subsection{FNC improves agent performance}
We gauge the agent's performance by measuring the final score after training with limited interaction samples. 
In DRL with a fixed sample budget, performance is intrinsically linked to sample efficiency. 
A high-performance agent is a strong indicator of high sample efficiency, as the agent can learn effectively with fewer samples.

On the Atari 100K benchmark, we collect the final scores of the agents across all 26 tasks. 
To standardize the comparison, we calculate the human-normalized score (HNS) for each game using the following formula:
\begin{equation}
    \mathrm{HNS}=\frac{S_A - S_R}{S_H - S_R},
\end{equation}
where $S_A$ represents the score achieved by the agent, $S_R$ is the score obtained by random play, and $S_H$ is the score achieved by an expert human player. 
This normalization allows for a direct comparison of the agent's performance relative to human performance.
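For example, using the scores reported in Table \ref{tab:atari_results} (rounded table entries, so the result is only illustrative), FNC on Freeway gives
\begin{equation*}
\mathrm{HNS} = \frac{27.1 - 0.0}{29.6 - 0.0} \approx 0.916,
\end{equation*}
i.e., about $91.6\%$ of human performance, whereas on Boxing $(79.8 - 0.1)/(12.1 - 0.1) \approx 6.64$, which is well above the human level of $1$.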

Subsequently, we adopt the inter-quartile mean (IQM), optimality gap (OG), median, and mean metrics from the rliable \citep{rliable} framework to aggregate the HNS values across the 26 games. 
The IQM metric represents the average score of the middle 50\% of the runs combined across all games and seeds.
It reduces the impact of extreme values, providing a more stable and representative assessment. 
A higher IQM, mean, and median score indicates a better overall performance, while a lower OG, which quantifies the performance gap between the agent and the human, is more favorable.
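
To make the aggregate metrics concrete, consider eight hypothetical HNS values (not taken from our experiments), sorted in increasing order: $0.1, 0.3, 0.5, 0.8, 1.0, 1.4, 2.5, 6.0$. 
Assuming the usual rliable conventions, in which the IQM averages the middle half of the sorted values and the OG averages the shortfall below a target score of $1$, we obtain
\begin{align*}
\text{IQM} &= \tfrac{1}{4}(0.5+0.8+1.0+1.4) = 0.925,\\
\text{Median} &= \tfrac{1}{2}(0.8+1.0) = 0.9,\\
\text{Mean} &= \tfrac{1}{8}(12.6) = 1.575,\\
\text{OG} &= \tfrac{1}{8}(0.9+0.7+0.5+0.2) = 0.2875.
\end{align*}
The single outlier $6.0$ inflates the mean but does not enter the IQM, which illustrates why the IQM is less sensitive to extreme values.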

As presented in Table \ref{tab:atari_results}, FNC showcases remarkable performance. 
It outperforms human performance in 11 games and surpasses previous model-free methods in 15 out of 26 games in terms of the final score. 
FNC achieves the highest IQM HNS of 107.3\%, Median HNS of 89.4\%, Mean HNS of 240.1\%, and the lowest OG HNS of 36.7\%. 
Notably, FNC outperforms the previous state-of-the-art method BBF by 13.3 percentage points in the IQM score (107.3\% versus 94.0\%) under the same replay ratio of 2. 

On the DMControl 100K benchmark, we calculate the final scores of the six tasks and aggregate them using the median and mean metrics. 
As shown in Table~\ref{dm performance} in Appendix~\ref{dm section}, FNC achieves the best final score on 5 of the 6 tasks and attains the best median and mean scores. 
These results, on both the Atari 100K and DMControl 100K benchmarks, demonstrate the effectiveness of FNC in improving sample efficiency.

\subsection{FNC improves network capacity}
To evaluate the network capacity, we monitor the active neuron ratio. 
Since the same network architecture is used across all variants, the active neuron ratio is a reliable measure of how effectively the network's capacity is used. 
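
As a sketch of how such a ratio can be computed, assume the commonly used normalized-activation criterion for dormancy. 
The symbols $h_i^{\ell}$, $s_i^{\ell}$, $H^{\ell}$, and $\mathcal{D}$ below are notation introduced only for this illustration. 
For neuron $i$ in layer $\ell$ with $H^{\ell}$ neurons and activation $h_i^{\ell}(x)$ on an input $x$ drawn from a batch $\mathcal{D}$, define the score
\begin{equation*}
    s_i^{\ell} = \frac{\mathbb{E}_{x \in \mathcal{D}} \lvert h_i^{\ell}(x) \rvert}{\frac{1}{H^{\ell}} \sum_{k=1}^{H^{\ell}} \mathbb{E}_{x \in \mathcal{D}} \lvert h_k^{\ell}(x) \rvert}.
\end{equation*}
A neuron is counted as dormant if $s_i^{\ell} \le \delta$, and the active neuron ratio is one minus the fraction of dormant neurons:
\begin{equation*}
    1 - \frac{\sum_{\ell} \bigl\lvert \{ i : s_i^{\ell} \le \delta \} \bigr\rvert}{\sum_{\ell} H^{\ell}}.
\end{equation*}
Under this assumption, with $\delta = 0$ and ReLU activations, a neuron is dormant only if it outputs zero on every input in the batch.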

We compare the active neuron ratio of the baseline with periodic resetting and that of FNC during training on four selected Atari games.
As illustrated in Figure \ref{Capacity}, the active neuron ratio of the baseline with periodic resetting remains relatively low, especially immediately after each resetting event. 
This low ratio indicates that a significant portion of the network's potential is left untapped, leading to under-utilization of the network capacity. 
In contrast, FNC rapidly reduces the dormant neuron ratio to nearly 0, so that almost all neurons are active. 
Even when periodic resetting is executed, FNC manages to maintain a high active neuron ratio, suggesting that it can effectively utilize nearly the entire capacity of the network.

The full utilization of the network capacity endows the network with a higher effective rank \citep{ERank}, which contributes to better representation and expressivity, as shown in Figure \ref{Rank}. 
Consequently, the performance of FNC is higher than that of the baseline with PR during the training procedure, as shown in Figure \ref{performance curve}.


\subsection{FNC reduces the training computation cost}
In the online setting, the agent interacts with the environment and trains the policy simultaneously.  
Each interaction is contingent upon the completion of the update procedure. 
Therefore, reducing the cost of updating is vital to minimize the expenditures and risks associated with interactions in real-world scenarios.

To evaluate the computation cost, we record the training runtime. 
Figure~\ref{IQM and Runtime} compares the IQM HNS and runtime of FNC with those of relevant model-free \citep{DER, DrQ, SPR, SR-SPR, BBF} and model-based \citep{IRIS, EfficientZero} algorithms. 
FNC takes approximately the same time as BBF with a replay ratio of 2 to complete training on a task. 
However, its performance rivals that of BBF with a replay ratio of 8, which requires four times the computational resources. 
FNC achieves superhuman sample efficiency with a low replay ratio of 2 for the first time, demonstrating its cost-effectiveness in achieving high-performance results.
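
The factor of four follows from a simple update count, assuming that the replay ratio denotes the number of gradient updates per environment step and that each update has roughly the same cost in both settings. 
Over the 100K environment steps of the benchmark,
\begin{align*}
    \text{replay ratio } 2 &: \quad 100\text{K} \times 2 = 200\text{K updates}, \\
    \text{replay ratio } 8 &: \quad 100\text{K} \times 8 = 800\text{K updates},
\end{align*}
so the higher replay ratio performs $800\text{K} / 200\text{K} = 4$ times as many updates.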

\begin{figure}[htb!]
  \centering
  \includegraphics[width=0.9\linewidth]{fig/Runtime.pdf}
  \caption{
    Computational cost versus performance.
    Performance is measured by IQM HNS across 26 games, and cost by the total number of GPU hours spent per environment.
    FNC attains higher performance at a lower cost.
  }
  \label{IQM and Runtime}
\end{figure}



\subsection{Ablation study}
To gain a deeper understanding of the contribution of each component of FNC, we conduct a systematic ablation study. 
In this study, we remove one component at a time and observe the impact on the performance of the framework. 
The results are presented in Figure \ref{fig:ablation}.

\begin{figure}[htb!]
  \centering
  \includegraphics[width=0.8\linewidth]{fig/ablation.pdf}
  \caption{
    Ablation study results show the impact of removing different components of FNC on the Atari 100K benchmark.
  }
  \label{fig:ablation}
\end{figure}

When we remove Dormant Neuron Reactivation (FNC-DNR), we observe a notable decline in the IQM score. 
This decline clearly indicates that the activation of dormant neurons in each update step is crucial to maintaining high network expressivity. 
Without DNR, the network fails to fully exploit its capacity, leading to sub-optimal performance.

Removing the Stable Policy Update (FNC-SPU) also results in a significant decrease in the IQM score, which drops even below that of the baseline that only applies PR (FNC-DNR-SPU). 
This indicates that directly introducing DNR into the PR framework without SPU causes excessive parameter perturbation and unstable training.
This finding highlights the importance of SPU in stabilizing the training process. 

Hyperparameters $\beta$ and $\delta$ were tuned by grid search on four Atari games (Freeway, Gopher, Kangaroo, KungFuMaster). 
The results are reported in Appendix~\ref{app:hp search}.
We set $\beta=0.5$ and $\delta=0.0$, which maximizes reactivation while avoiding over-perturbation and achieves the best mean HNS. 
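
For example, the mean HNS of the selected setting ($\beta=0.5$, $\delta=0.0$) is the unweighted average of its four per-game scores listed in Appendix~\ref{app:hp search}:
\begin{equation*}
    \frac{97.3 + 83.3 + 154.3 + 93.0}{4}\,\% = \frac{427.9}{4}\,\% \approx 107.0\%.
\end{equation*}
This is the highest mean among the tested settings; the next best setting ($\delta = 0.1$) reaches $103.1\%$.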



\section{Conclusion}
In this study, we have introduced the Full Network Capacity (FNC) framework to address dormant neurons and to exploit network capacity in deep reinforcement learning (DRL) under limited samples. 
The FNC framework, with its two novel modules, Dormant Neuron Reactivation (DNR) and Stable Policy Update (SPU), has achieved state-of-the-art performance on the Atari 100K and DMControl 100K benchmarks with limited computational resources.

Our work has not only provided a practical solution to the sample-efficiency problem in DRL but also opened up new research directions. 
We believe that the insights gained from this study will inspire further research in the area of sample-efficient reinforcement learning, leading to the development of more advanced and efficient DRL algorithms. 
These advancements could have far-reaching implications in various real-world applications with uncertainty, such as robotics, autonomous vehicles, and financial trading, where data collection is often costly and time-consuming.

\section{Acknowledgement}
This work was supported by the National Natural Science Foundation of China (No. 62476040).


% References

\bibliography{reference}


\newpage

\onecolumn


\title{Full Network Capacity Framework \\ for Sample-Efficient Deep Reinforcement Learning\\(Supplementary Material)}
\maketitle

\appendix

\section{DMControl 100K benchmark performance table}
\label{dm section}
\begin{table*}[htb]
\caption{
    Scores (mean and standard deviation) and aggregate metrics for FNC and competing methods \citep{Dreamer, CURL, DrQ, PlayVirtual} across the 6 DMControl 100K tasks. 
    We run FNC with five random seeds per task.
    The scores of the other methods are taken from \citet{PlayVirtual}.
}
\begin{center}
\begin{tabular}{lccccc}
\hline
Game                & Dreamer       & CURL          & DrQ           & PlayVirtual    & FNC     \\ 
\hline
Ball In Cup Catch   & 246 $\pm$ 174 & 769 $\pm$ 43  & 913 $\pm$ 53  & 926 $\pm$ 31   & \textbf{962 $\pm$ 6}\\
Cartpole Swingup    & 326 $\pm$ 27  & 582 $\pm$ 146 & 759 $\pm$ 92  & 816 $\pm$ 36   & \textbf{850 $\pm$ 17}\\
Cheetah Run         & 235 $\pm$ 137 & 299 $\pm$ 48  & 344 $\pm$ 67  & 474 $\pm$ 50   & \textbf{475 $\pm$ 40}\\
Finger Spin         & 341 $\pm$ 70  & 767 $\pm$ 56  & 901 $\pm$ 104 & \textbf{915 $\pm$ 49}   & 799 $\pm$ 191\\
Reacher Easy        & 314 $\pm$ 155 & 538 $\pm$ 233 & 601 $\pm$ 213 & 785 $\pm$ 142  & \textbf{936 $\pm$ 87}\\
Walker Walk         & 277 $\pm$ 12  & 403 $\pm$ 24  & 612 $\pm$ 164 & 460 $\pm$ 173  & \textbf{767 $\pm$ 92}\\
\hline
Median Score & 295.5  & 560.0  & 685.5  & 800.5  & \textbf{824.5}  \\
Mean Score & 289.8  & 559.7  & 688.3  & 729.3  & \textbf{798.2} \\     
\hline
\end{tabular}
\end{center}
\label{dm performance}
\end{table*}
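
The aggregate rows of Table~\ref{dm performance} are computed from the per-task mean scores. 
As a check for FNC, sorting its six task scores gives $475, 767, 799, 850, 936, 962$, so that
\begin{align*}
    \text{Median} &= \frac{799 + 850}{2} = 824.5, \\
    \text{Mean}   &= \frac{962 + 850 + 475 + 799 + 936 + 767}{6} = \frac{4789}{6} \approx 798.2.
\end{align*}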


\section{Hyperparameter Ablation Study}
\label{app:hp search}

\begin{table*}[htb]
\caption{
    Hyperparameter selection for $\beta$ and $\delta$ in DNR with the human-normalized scores.
    We run each variant with 3 random seeds per game and report the HNS.
}
\begin{center}
\begin{tabular}{lccccc}
\hline
                & Freeway       & Gopher          & Kangaroo           & KungFuMaster    & Mean HNS     \\ 
\hline
$\beta=0$       & 84.7\% & 55.0\%  & 110.3\%   & 101.3\%    & 87.8\% \\
$\beta=0.25$    & 95.7\% & 74.7\%  & 97.7\%    & 73.0\%     & 85.3\%\\
$\beta=0.5$     & 97.3\% & 83.3\%  & 154.3\%   & 93.0\%     & 107.0\%\\
$\beta=0.75$    & 88.3\% & 44.0\%  & 125.7\%   & 101.3\%    & 89.8\%\\
$\beta=1$       & 85.7\% & 34.3\%  & 125.3\%   & 87.3\%     & 83.2\%\\
\hline
$\delta=0$      & 97.3\% & 83.3\%  & 154.3\%   & 93.0\%     & 107.0\%\\
$\delta=0.1$    & 87.3\% & 76.0\%  & 141.7\%   & 107.3\%    & 103.1\%\\
$\delta=0.2$    & 76.3\% & 77.3\%  & 153.7\%   & 87.3\%     & 98.6\%\\ 
\hline
\end{tabular}
\end{center}
\label{hp search}
\end{table*}


\section{Experiment Settings}
\label{app:implementation}
We use an open-source JAX implementation of BBF from \url{https://github.com/google-research/google-research/tree/master/bigger_better_faster},
and a JAX implementation of the DrQ algorithm from \url{https://github.com/evgenii-nikishin/rl_with_resets/tree/main}.
All experiments are performed on one RTX 3080 Ti GPU and require less than 12 GB of GPU memory.
The runtime is about 3--4 hours for one seed on one game.

The experiments use five random seeds to evaluate performance.
We largely reuse the hyperparameters from previous methods \citep{BBF, DrQ}, and report the hyperparameter settings used in the DMControl 100K experiments in Table~\ref{dm_hp} and in the Atari 100K experiments in Table~\ref{hp}.




\begin{table}[htb]
\caption{
    Hyperparameters for FNC in DMControl 100K benchmark.
    The ones introduced by this work are at the bottom of the table.
}
\label{dm_hp}
\begin{center}
\scalebox{1}{
\begin{tabular}{lr}
\hline
Parameter                       & Setting                               \\
\hline
Grey-scaling                    & True                                  \\
Observation down-sampling       & $64 \times 64$                        \\
Frames stacked                  & 3                                     \\
Action repetitions              &                                       \\
\hspace{2em}Cartpole Swingup    & 8                                     \\
\hspace{2em}Reacher Easy        & 4                                     \\
\hspace{2em}Cheetah Run         & 4                                     \\
\hspace{2em}Finger Spin         & 2                                     \\
\hspace{2em}Ball In Cup Catch   & 4                                     \\
\hspace{2em}Walker Walk         & 2                                     \\
Memory size                     & 100000                                \\
Seed steps                      & 1000                                  \\
Discount factor                 & 0.99                                  \\
Minibatch size                  & 512                                   \\
Optimizer                       & Adam                                  \\
\hspace{2em}Learning rate        & 0.0003                                \\
\hspace{2em}First moment decay   & 0.9                                   \\
\hspace{2em}Second moment decay  & 0.999                                 \\
\hspace{2em}$\epsilon$           & 0.00015                               \\
Critic update frequency         & 2                                     \\
Critic Q-function soft-update rate & 0.005                              \\
Actor update frequency          & 2                                     \\
Actor log std bounds            & $[-10, 2]$                            \\
Init temperature                & 0.1                                   \\
Data augmentation               & Shifts ($\pm4$ pixels), \\ & Intensity     \\
Reset interval                  & 20000                                 \\
Layers getting hard reset       & Final 3                               \\
\hline
Dormant threshold $\delta$    & 0.0                                   \\
DNR weight $\beta$        & 0.8                                   \\
\hline
\end{tabular}}
\end{center}
\end{table}




\begin{table}[htb]

\caption{
    Hyperparameters for FNC in Atari 100K benchmark.
    The ones introduced by this work are at the bottom of the table.
}
\label{hp}
\begin{center}
\scalebox{1}{
\begin{tabular}{lr}
\hline
Parameter                       & Setting                               \\
\hline
Grey-scaling                    & True                                  \\
Observation down-sampling       & $84 \times 84$                        \\
Frames stacked                  & 4                                     \\
Action repetitions              & 4                                     \\
Reward clipping                 & $[-1, 1]$                             \\
Terminal on loss of life        & True                                  \\
Max frames per episode          & 108k                                  \\
Update                          & Distributional Q                      \\
Dueling                         & True                                  \\
Support of Q-distribution       & 51                                    \\
Discount factor                 & 0.97$\rightarrow$ 0.997               \\
Minibatch size                  & 32                                    \\
Optimizer                       & AdamW                                 \\
\hspace{2em}Learning rate        & 0.0001                                \\
\hspace{2em}First moment decay   & 0.9                                   \\
\hspace{2em}Second moment decay  & 0.999                                 \\
\hspace{2em}$\epsilon$           & 0.00015                               \\
\hspace{2em}Weight decay         & 0.1                                  \\
Max gradient norm               & 10                                    \\
Priority exponent               & 0.5                                   \\
Priority correction             & 0.4$\rightarrow$ 1                    \\
Training steps                  & 100k                                  \\
Evaluation episodes             & 100                                   \\
Memory size                     & Unbounded                             \\
Min replay size for sampling    & 2000                                  \\
Replay period every             & 1 step                                \\
Updates per step                & 2                                     \\
Multi-step return length        & 10 $\rightarrow$ 3                    \\
Encoder                         & Impala ResNet               \\
Hidden units                    & 2048                                  \\
Non-linearity                   & ReLU                                  \\
Target network                  &                                       \\
\hspace{2em}Update period       & 1                                     \\
\hspace{2em}EMA coefficient $\tau$ & 0.005                              \\
$\lambda$ (SPR loss coefficient)& 5                                     \\
$K$ (SPR prediction depth)      & 5                                     \\
Data augmentation               & Shifts ($\pm4$ pixels), \\
&Intensity     \\
Action selection                & Target network                        \\
Reset interval                  & 20000                                 \\
Cycle steps                     & 5000                                  \\
Layers getting hard reset       & Final 2                               \\
Shrink and Perturb              & 0.5                                   \\
\hline
Dormant threshold $\delta$    & 0.0                                   \\
DNR weight $\beta$        & 0.5                                   \\
\hline
\end{tabular}}
\end{center}
\end{table}







\end{document}
