\section{Simulation Experiments}\label{app:sim}
We perform simulations on the following environments.
\begin{enumerate}
    \item \texttt{Continuous RiverSwim}: This environment models an agent who is swimming in a river~\citep{strehl2008analysis}.~Though the original MDP is discrete, we use a continuous version of it.~The state denotes the location of the agent in the river in a single dimension, and the action captures the movement of the agent. The state and action spaces are $[0,6]$ and $[0,1]$, respectively. The state of the system evolves as follows:
    \begin{align*}
        s_{t+1} =
        \begin{cases}
             \min\{\max\{0, s_t - \frac{1}{2}(1 + \frac{w_t}{2})\}, 6\} &\mbox{ w.p. } \frac{2(1-a_t)}{5}\\
             s_t &\mbox{ w.p. } 0.2\\
             \min\{\max\{0, s_t + \frac{1}{2}(1 + \frac{w_t}{2})\}, 6\} &\mbox{ w.p. } \frac{2(1+a_t)}{5},
        \end{cases}
    \end{align*}
    where $\{w_t\}$ is a zero-mean i.i.d. Gaussian sequence.~The reward function is given by
    \nal{
r(s,a) = 0.005(((s-6)/6)^4 + ((a-1)/2)^4) + 0.5((s/6)^4 + ((a+1)/2)^4).
    }
    \item \texttt{Truncated LQ System}: The state of an LQ~\citep{abbasi2011regret} system evolves as follows: \nal{s_{t+1} = A s_t + B a_t + w_t,} where $A, B$ are matrices of appropriate dimensions, and $w_t$ is i.i.d. Gaussian noise. The reward at time $t$ is $- s_t^\top P s_t - a_t^\top Q a_t$.~We clip the state vector since our framework allows only compact state-action spaces. More specifically, we ensure that the state value for each coordinate lies within the interval $[c_{\ell},c_u]$, and restrict the action space to be $[-1,1]^{d_\cA}$. Hence, the $i$-th coordinate of the state process evolves as \nal{s_{t+1}(i) = \max{\{\min{\{(A s_t + B a_t + w_t)(i), c_u\}}, c_\ell\}}.} We have used the following two sets of system parameters:
    \begin{enumerate}
        \item \texttt{Truncated LQ-$1$}: 
            \begin{align*}
                A = \begin{bmatrix}
                    -0.2 & -0.07\\
                    0.6 & 0.07
                \end{bmatrix}, \quad
                &B = \begin{bmatrix}
                    0.07 & 0.09\\
                    -0.03 & -0.1
                \end{bmatrix},
            \end{align*}
            $P = 0.4~I_{2}$\footnote{$I_n$ denotes the identity matrix of size $n \times n$.}, $Q = 0.6~I_{2}$, and the mean and standard deviation of $w_t$ are $0$ and $0.05$, respectively.~We set $c_u = -c_\ell = 4$.
        \item \texttt{Truncated LQ-$2$}: 
            \begin{align*}
                A = \begin{bmatrix}
                    -0.2 & -0.07\\
                    0.6 & 0.07
                \end{bmatrix}, \quad
                &B = \begin{bmatrix}
                    0.1 & -0.01 & 0.12 & 0.08\\
                    0.02 & -0.1 & 0.3 & 0.001
                \end{bmatrix}.
            \end{align*}
            The values of $P$, $Q$, $c_u$, $c_\ell$ and the mean and standard deviation of $w_t$ are the same as in \texttt{Truncated LQ-$1$}, with the identity matrix in $Q$ resized to $I_4$ to match the four-dimensional action.
    \end{enumerate}
        
    \item \texttt{Non-linear System}: We consider a non-linear system~\citep{kakade2020information} where the state evolves as \nal{s_{t+1}(i) = \max{\{\min{\{(A f(s_t) + B g(a_t) + w_t)(i), c_u\}}, c_\ell\}},} where $f$ and $g$ are (possibly non-linear) feature maps, $A, B$ are matrices of appropriate dimensions, and $w_t$ is a noise sequence. This system can be viewed as a generalization of the LQ control system in which the dynamics are linear in the feature vectors of the state and the action.~The reward is a function of the state and the action. We set $A$, $B$, $P$, $Q$, $c_u$ and $c_\ell$ to the same values as in \texttt{Truncated LQ-$1$}. We set 
    \begin{align*}
        f(s)(i) = 0.5 s(i) + 0.5 s(i)^2,~\mbox{for } i \in \flbr{1, 2}, \mbox{ and }
        g(a) = a^2,
    \end{align*}
    where $v(i)$ denotes the $i$-th element of a vector $v$. As in the \texttt{Truncated LQ System}, the action space is $[-1,1]^{d_\cA}$.
\end{enumerate}
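
The transition rules above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used for the reported experiments: the function names, the default \texttt{noise\_std} for \texttt{Continuous RiverSwim} (the text does not specify it), the reuse of the LQ noise level for the non-linear system, and the elementwise reading of $g(a) = a^2$ are all assumptions made for illustration.

```python
import random

S_MAX = 6.0  # RiverSwim state space is [0, 6]


def riverswim_probs(a):
    """Probabilities of moving left, staying, and moving right for a in [0, 1]."""
    return 2.0 * (1.0 - a) / 5.0, 0.2, 2.0 * (1.0 + a) / 5.0


def riverswim_step(s, a, rng, noise_std=0.1):
    """Sample s_{t+1}. noise_std is an ASSUMED value; the text does not give it."""
    p_left, p_stay, _ = riverswim_probs(a)
    stride = 0.5 * (1.0 + rng.gauss(0.0, noise_std) / 2.0)
    u = rng.random()
    if u < p_left:
        nxt = s - stride
    elif u < p_left + p_stay:
        nxt = s
    else:
        nxt = s + stride
    return min(max(0.0, nxt), S_MAX)  # clip to [0, 6]


def riverswim_reward(s, a):
    """Reward r(s, a) exactly as written in the text."""
    return (0.005 * (((s - 6) / 6) ** 4 + ((a - 1) / 2) ** 4)
            + 0.5 * ((s / 6) ** 4 + ((a + 1) / 2) ** 4))


def matvec(M, v):
    """Matrix-vector product for matrices stored as lists of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def clipped_step(s, a, A, B, rng, f=None, g=None,
                 noise_std=0.05, c_l=-4.0, c_u=4.0):
    """Coordinate-wise clipped step: clip(A f(s) + B g(a) + w, c_l, c_u).

    With f = g = None this is the truncated LQ update; passing feature maps
    gives the non-linear system. Noise is drawn independently per coordinate.
    """
    fs = f(s) if f is not None else s
    ga = g(a) if g is not None else a
    drift = [x + y for x, y in zip(matvec(A, fs), matvec(B, ga))]
    return [max(min(d + rng.gauss(0.0, noise_std), c_u), c_l) for d in drift]


# Parameters of Truncated LQ-1 from the text.
A_LQ1 = [[-0.2, -0.07], [0.6, 0.07]]
B_LQ1 = [[0.07, 0.09], [-0.03, -0.1]]


def f_feature(s):
    """f(s)(i) = 0.5 s(i) + 0.5 s(i)^2."""
    return [0.5 * x + 0.5 * x ** 2 for x in s]


def g_feature(a):
    """g(a) = a^2, read here as an elementwise square (ASSUMPTION)."""
    return [x ** 2 for x in a]
```

For example, \texttt{clipped\_step(s, a, A\_LQ1, B\_LQ1, random.Random(0))} draws one step of \texttt{Truncated LQ-$1$}, and adding \texttt{f=f\_feature, g=g\_feature} switches to the non-linear system.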

\subsection{Choosing Hyperparameters}
Since $L_r$~(Assumption~\ref{assum:lip}), $c_a$~\eqref{def:ca}, $C_\eta$, and $C_H$~\eqref{def:CH} may not be known, we provide estimates or appropriate upper-bounds to~\algo~in lieu of these parameters.~Our theoretical upper-bounds on regret continue to hold with these parameters replaced by the chosen upper-bounds.~In addition, we pass $\delta$ and $\gamma$ as hyperparameters to~\algo.~A brief description of these quantities is as follows:
\begin{enumerate}
    \item $L_r$: We assume the knowledge of an upper-bound on $L_r$, the Lipschitz constant for the reward function~(Assumption~\ref{assum:lip}).
    \item $c_a$: \algo~activates a cell $\zeta$ if $N_t(\zeta) \geq \frac{c_a \log{\br{\frac{T}{\delta}}}}{\diamc{\zeta}^{d_\cS+2}}$~\eqref{Nmin}, and deactivates $\zeta$ if $N_t(\zeta) \geq \frac{c_a 2^{d_\cS+2} \log{\br{\frac{T}{\delta}}}}{\diamc{\zeta}^{d_\cS+2}}$~\eqref{Nmax}.
    \item $C_\eta$: Recall from Section~\ref{sec:algo} that if $\zeta$ is an active cell at time $t$, then its confidence radius $\eta_t(\zeta)$ satisfies $\eta_t(\zeta) \leq C_\eta~\diamc{\zeta}$, where $C_\eta = 3(1 + L_p) + C_p$.~To avoid computing $\eta_t(\zeta)$, we use $C_\eta~\diamc{\zeta}$ as a substitute for $\eta_t(\zeta)$, and treat $C_\eta$ as a hyperparameter of ZoRL.
    \item $C_H$: $C_H$ is the multiplicative constant associated with the episode duration that satisfies \eqref{def:CH}.
    \item $\delta$: $\delta \in (0,1)$ is the confidence parameter, which enters the activation and deactivation thresholds through the $\log{\br{\frac{T}{\delta}}}$ term.
    \item $\gamma$: $\gamma > 0$ is the accuracy parameter for the \epe~subroutine that \algo~uses to compute the proxy diameter of the policy chosen in an episode.
\end{enumerate}
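
As a small numerical illustration of the roles of $c_a$ and $C_\eta$, the sketch below evaluates the two visit-count thresholds and the surrogate confidence radius for a single cell. The function names and the horizon $T = 10^5$ used in the example are hypothetical choices for illustration; \texttt{d\_s} stands for the quantity $d_\cS$ that appears in the exponent.

```python
import math


def activation_threshold(diam, c_a, horizon, delta, d_s):
    """c_a * log(T / delta) / diam^(d_s + 2): visit count at which a cell activates."""
    return c_a * math.log(horizon / delta) / diam ** (d_s + 2)


def deactivation_threshold(diam, c_a, horizon, delta, d_s):
    """Same expression scaled by 2^(d_s + 2): visit count at which a cell deactivates."""
    return 2 ** (d_s + 2) * activation_threshold(diam, c_a, horizon, delta, d_s)


def surrogate_radius(diam, c_eta):
    """C_eta * diam, used in place of the confidence radius eta_t."""
    return c_eta * diam


# Example with c_a = 0.2 and delta = 0.1; T = 10**5 is an ASSUMED horizon.
n_min = activation_threshold(0.5, 0.2, 10**5, 0.1, 2)
n_max = deactivation_threshold(0.5, 0.2, 10**5, 0.1, 2)
```

Because $2^{d_\cS+2}/\diamc{\zeta}^{d_\cS+2} = 1/(\diamc{\zeta}/2)^{d_\cS+2}$, the deactivation threshold of a cell coincides with the activation threshold of a cell of half its diameter, which the two functions above reproduce numerically.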
The values of the following three hyperparameters are kept unchanged across all four experiments: $L_r = 0.001$, $\delta = 0.1$ and $\gamma = 0.05$. The values of the remaining hyperparameters are reported in Table~\ref{tab:hyp_param}.

\begin{table}
    \centering
    \begin{tabular}{|c|c|c|c|}
        \hline
        Experiment & $c_a$ & $C_\eta$ & $C_H$\\
        \hline
        Truncated LQ-$1$ & 0.2 & 1 & 0.1\\
        \hline
        Truncated LQ-$2$ & 0.1 & 1 & 0.001\\
        \hline
        Continuous RiverSwim & 0.1 & 1 & 0.001\\
        \hline
        Non-linear System & 1 & 5 & 0.1\\
        \hline
    \end{tabular}
    \caption{Hyperparameters of ZoRL used in each experiment.}
    \label{tab:hyp_param}
\end{table}

\subsection{Comparison with PZRL-MF and PZRL-MB}
\begin{figure}[ht]
    \centering
    \begin{subfigure}[b]{0.49\textwidth}
        \centering
        \includegraphics[width=\textwidth]{figures/RiverSwim_cumul_reward_pzrl.pdf}
        \caption{Continuous RiverSwim}
        \label{fig:rswim}
    \end{subfigure}
    \hfill
    \begin{subfigure}[b]{0.49\textwidth}
        \centering
        \includegraphics[width=\textwidth]{figures/GridWorld_cumul_reward.pdf} 
        \caption{Continuous GridWorld}         
        \label{fig:gworld}
    \end{subfigure}
    \caption{Cumulative reward of \algo, PZRL-MF, and PZRL-MB, averaged over $50$ runs.}
    \label{fig:pzrl_compare}
\end{figure}
\citet{kar2024policy} allow the agent to play policies from a parametric class.~The latest version of~\citet{kar2024policy} proposes two new algorithms, PZRL-MF and PZRL-MB\footnote{These replace the PZRL-H algorithm, which was proposed in an earlier version of the same paper.}. Owing to time constraints, we could not compare PZRL-MF and PZRL-MB with~\algo~on all four environments discussed in Section~\ref{sec:sim}. Instead, we compare PZRL-MB and PZRL-MF with \algo~on the following two environments: (i) \texttt{Continuous RiverSwim} and (ii) \texttt{Continuous GridWorld}.

\texttt{Continuous RiverSwim}:~This environment is described above. For PZRL-MF and PZRL-MB, we use the following two policy parameterization schemes, where $\phi(\cdot\,;w)$ denotes the policy with parameter $w$.
\begin{enumerate}
    \item $\phi(s;w) = w \cdot s$, $w \in [-1,1]$.
    \item $\phi(s;w) = w(1) + w(2)s^2$, $w = (w(1),w(2)) \in [-1,1]^2$.
\end{enumerate}
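
For concreteness, the two parameterizations can be written as plain functions of the scalar state. This is only an illustrative sketch: the function names are hypothetical, and the final clipping helper is an assumption, since the text does not say how values of $\phi(s;w)$ outside the action space $[0,1]$ are handled.

```python
def linear_policy(s, w):
    """Scheme 1: phi(s; w) = w * s, with scalar w in [-1, 1]."""
    return w * s


def quadratic_policy(s, w):
    """Scheme 2: phi(s; w) = w(1) + w(2) * s^2, with w in [-1, 1]^2."""
    w1, w2 = w
    return w1 + w2 * s ** 2


def clip_action(a, low=0.0, high=1.0):
    """ASSUMPTION: project a raw policy output onto the action space [0, 1]."""
    return min(max(low, a), high)
```

With this convention, an action would be obtained as \texttt{clip\_action(quadratic\_policy(s, w))}; other ways of mapping $\phi(s;w)$ into $[0,1]$ are equally possible.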

\texttt{Continuous GridWorld}: In the GridWorld environment~\citep{sutton2018reinforcement}, the agent moves around a compact space that contains a designated reward-yielding region; the agent earns a reward of $1$ whenever it is inside this region, and no reward otherwise.~We design a continuous version of this environment in which the reward-yielding region is a circle of radius $0.1$ centered at $(0.8,0.8)$. The state space is $[0,1]^2$ and the action space is $[0,2\pi]$. The state of the system evolves as follows:
\begin{align*}
    y_{t+1} &= s_t + \beta \begin{bmatrix} \cos{a_t}\\ \sin{a_t} \end{bmatrix} + w_t, \mbox{ and}\\
    s_t(i) &= (0\vee y_t(i)) \wedge 1, \mbox{ for } i=1,2,~\forall t \in \{0\}\cup \bN,
\end{align*}
where $w_t$ is zero-mean i.i.d. Gaussian noise with standard deviation $0.1$, and $\beta > 0$ is the step size, which we set to $\beta = 0.2$. For this environment, we parameterize the policies as $\phi(s;w) = w(1) + s(1) w(2) + s(2) w(3)$, where $w \in [0,1]^3$ and $s \in [0,1]^2$.
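
The GridWorld dynamics and reward can be sketched as follows. This is a minimal illustration rather than the experimental code: the function names are hypothetical, and drawing the noise independently for each coordinate is an assumption, as the text only gives the standard deviation of $w_t$.

```python
import math
import random


def gridworld_step(s, a, rng, beta=0.2, noise_std=0.1):
    """Move a distance beta in direction a (radians), add noise, clip to [0, 1]^2."""
    # ASSUMPTION: the two noise coordinates are independent N(0, noise_std^2).
    y = (s[0] + beta * math.cos(a) + rng.gauss(0.0, noise_std),
         s[1] + beta * math.sin(a) + rng.gauss(0.0, noise_std))
    return tuple(min(max(0.0, yi), 1.0) for yi in y)


def gridworld_reward(s, center=(0.8, 0.8), radius=0.1):
    """Reward 1 inside the circle of the given radius around center, else 0."""
    inside = math.hypot(s[0] - center[0], s[1] - center[1]) <= radius
    return 1.0 if inside else 0.0
```

Calling \texttt{gridworld\_step(s, a, random.Random(0))} repeatedly, with \texttt{a} supplied by a policy, generates one trajectory.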

We plot the cumulative rewards earned by PZRL-MF, PZRL-MB, and \algo, averaged over $50$ runs, for both environments in Figure~\ref{fig:pzrl_compare}.
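
One simple way to form such averaged curves is sketched below, assuming each run is stored as a list of per-step rewards of equal length; the function name and data layout are hypothetical and not taken from the experimental code.

```python
from itertools import accumulate


def mean_cumulative_reward(reward_runs):
    """Average the cumulative-reward curves of several runs, step by step.

    reward_runs: list of runs, each a list of per-step rewards (equal lengths).
    """
    curves = [list(accumulate(run)) for run in reward_runs]
    n_runs = len(curves)
    return [sum(values) / n_runs for values in zip(*curves)]
```

The returned list has one entry per time step and can be plotted directly against the step index.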

\textbf{Computing resources.} We conducted the experiments on an $11$th Gen Intel Core i7 CPU ($2.5$\,GHz) with $16$\,GB of RAM, using Python~$3$ and the PyTorch library.