% \documentclass{uai2024} % for initial submission
\documentclass[accepted]{uai2024} % after acceptance, for a revised version; 
% also before submission to see how the non-anonymous paper would look like 
                        
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2024} % ptmx math instead of Computer
                                         % Modern (has noticeable issues)
% \documentclass[mathfont=newtx]{uai2024} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

% Recommended, but optional, packages for figures and better typesetting:
\usepackage{microtype}
\usepackage{graphicx}
% \usepackage{subfigure}
\usepackage{booktabs} % for professional tables
\usepackage{algorithm}
\usepackage{algorithmic}
% \usepackage[hidelinks,colorlinks=true,linkcolor=red,citecolor=blue,urlcolor=black]{hyperref}
\usepackage{hyperref}

\hypersetup{
  colorlinks   = true, %Colours links instead of ugly boxes
  urlcolor     = black, %Colour for external hyperlinks
  linkcolor    = red, %Colour of internal links
  citecolor   = blue %Colour of citations
}

\usepackage{natbib} % has a nice set of citation styles and commands


% For theorems and such
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{mathtools}
\usepackage{amsthm}

% if you use cleveref..
\usepackage[capitalize,noabbrev]{cleveref}


% additional pkgs
\usepackage{subcaption}
\usepackage{caption}
\usepackage[normalem]{ulem}
\usepackage{xurl}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% THEOREMS
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\theoremstyle{plain}
\newtheorem{theorem}{Theorem}[section]
\newtheorem{proposition}[theorem]{Proposition}
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{corollary}[theorem]{Corollary}
\theoremstyle{definition}
\newtheorem{definition}[theorem]{Definition}
\newtheorem{assumption}[theorem]{Assumption}
\theoremstyle{remark}
\newtheorem{remark}[theorem]{Remark}
% Todonotes is useful during development; simply uncomment the next line
%    and comment out the line below the next line to turn off comments
%\usepackage[disable,textsize=tiny]{todonotes}
\usepackage[textsize=tiny]{todonotes}
\input{Definitions}

\title{Offline Reward Perturbation Boosts Distributional Shift in Online RL}


% It is OKAY to include author information, even for blind
% submissions: the style file will automatically remove it for you
% unless you've provided the [accepted] option to the icml2024
% package.

% List of affiliations: The first argument should be a (short)
% identifier you will use later to specify author affiliations
% Academic affiliations should list Department, University, City, Region, Country
% Industry affiliations should list Company, City, Region, Country

% You can specify symbols, otherwise they are numbered in order.
% Ideally, you should not use this facility. Affiliations will be numbered
% in order of appearance and this is the preferred way.
% \author[*1]{\href{mailto:<zyu32@uic.edu>?Subject=Your UAI 2024 paper}{Zishun Yu}{}}
\author[*1]{{Zishun Yu}{}}
\author[*1]{\href{mailto:<skang98@uic.edu>?Subject=Your UAI 2024 paper}{Siteng Kang}{}}
\author[1]{{Xinhua Zhang}{}}
% Add affiliations after the authors
\affil[1]{%
    Department of Computer Science\\
    University of Illinois Chicago\\
    Chicago, IL, USA
}

\begin{document}
\maketitle
\def\thefootnote{*}\footnotetext{Equal contribution. Correspondence to S.K.} 
\begin{abstract}
% 
Offline-to-online reinforcement learning has recently been shown to be effective in reducing the online sample complexity by first training on offline-collected data.
However, this additional data source may also invite new poisoning attacks that target offline training.
In this work, we reveal such vulnerabilities in {\it critic-regularized} offline RL
by proposing a novel data poisoning attack method, which is stealthy in the sense that the performance during the offline training remains intact, but the online fine-tuning stage will suffer a significant performance drop.
Our method leverages the techniques from bi-level optimization to promote the over-estimation/distribution shift under offline-to-online reinforcement learning.
Experiments on four environments confirm that our attack satisfies the new stealthiness requirement,
and that it is effective with only a small budget and without white-box access to the victim model.
%
\end{abstract}



\section{Introduction}
\label{sec:intro}

Offline reinforcement learning (RL) has recently opened up new opportunities of leveraging offline batch data to improve the RL algorithms,
significantly reducing the online sample complexity of interacting with the environment \citep{Levine2020Offline}.
It is particularly valuable for applications where directly deploying an automated policy can be dangerous, expensive, or unethical,
for example, educational assistants, autonomous driving, and healthcare.

However, due to the limited coverage of offline data or the suboptimality of the demonstrator \citep{Fu2020d4rl},
a purely offline trained model is generally not effective when deployed online,
and a common practice is to fine-tune it via additional online interactions,
whose sample complexity is expected to be reduced thanks to the initialization from offline training~\citep{xie2021policy, nakamoto2024cal}.

Interestingly, such a direct offline-to-online (O2O) transfer is often plagued by a catastrophic performance drop at the start of online fine-tuning,
which poses safety challenges for real systems such as driving and therapy.
This is primarily due to the distributional shift of the state
\citep{fujimoto2019offpolicy,kumar2019stabilizing,fu2019disgnosing,kumar20discor}:
the $Q$-value is poorly estimated, and often over-estimated,
for the state-action pairs lying outside the offline distribution~\citep{farahmand2010error,munos2005error}.

Existing literature~\citep[e.g.,][]{Kumar2020CQL,Kostrikov2022IQL, lee2022offline, Yu2023Actor, nakamoto2024cal} shows that improved O2O RL methods can effectively control the negative effect caused by the distribution shift, hence leading to improved online sample efficiency. 
Typical O2O solutions include endowing the offline $Q$-function approximation with conservatism~\citep{Kumar2020CQL, nakamoto2024cal}, or regularizing the divergence between the learned policy and the behavior policy~\citep{nair2020awac}, 
to avoid catastrophic distribution shift caused by false value over-estimation.
In addition, distribution correction~\citep{lee2022offline}, critic reconstruction~\citep{Yu2023Actor}, and ensemble methods~\citep{zhang2022policy, wang2023train} also show effective O2O transfers.

Many other O2O methods have emerged recently \citep[etc.]{wagenmaker2023leveraging, chen2023dcac, mark2023offline, lei2023uni}. Among the aforementioned works, surgery on the $Q$-function is one of the most prevalent principles to address O2O. Because O2O heavily depends on a ``well-behaved'' $Q$-function, it is also vulnerable to an adversary who manipulates the $Q$-function maliciously.

The key question we investigate in this paper is:

\vspace{-0.5em}
\begin{quote}
    Are the O2O algorithms robust to reward poisoning on the offline batch data?
\end{quote}
\vspace{-0.5em}

Since offline data often comes from crowd-sourcing or other third parties,
it may carry malicious poisons that catastrophically damage the online fine-tuning while remaining stealthy by keeping the offline performance competitive.

In general, a poisoning attack is performed on the training data, 
such that the models trained on it will perform poorly in the test scenarios.
In O2O RL, the attacker may alter the state, action, or reward of the offline data.
In this paper, we focus on poisoning of rewards,
and aim to achieve two objectives:
\begin{itemize}\vspace{-0.5em}
    \item \textbf{Effectiveness}: 
    after offline training on the poisoned data, 
    the agent will suffer a catastrophic performance drop at the beginning of the online fine-tuning,
    compared with its 
    performance at the end of the offline training.
    \item \textbf{Stealthiness}: during the offline training,
    the performance as measured by interacting with the environment (but not using it to update the model) should be similar to that achieved by an agent trained on clean data.
    This is in addition to the standard $\ell_p$ norm constraints on the magnitude of reward modification.
    \vspace{-0.5em}
\end{itemize}

These definitions of stealthiness and effectiveness are particularly realistic. As O2O RL is a two-phase learning scheme, attacks that undermine the offline performance pose less risk to the system because the victim can detect the low performance of the offline model. However, an attack that is stealthy offline but effective online could be more surprising and harmful. Therefore, understanding such a vulnerability of O2O RL is an essential step towards robust O2O transfer.

Our contribution is to achieve these goals,
revealing the vulnerability of O2O RL to data poisoning attacks. 
Our innovations can be summarized as follows:
\begin{itemize}
\vspace{-0.5em}
    \item We propose the first poisoning attack on O2O RL that promotes the $Q$-function over-estimation and hence distributional shift.
    \item We achieve the poisoning through an efficient bi-level optimization technique.
    \item Our approach requires no access to the victim agent or the online environment.
    \vspace{-0.5em}
\end{itemize}

We apply our poisoner to Frozen Lake~\citep{openaigym} and three locomotion environments from D4RL \citep{Fu2020d4rl}.
The stealthiness is clearly verified,
and our poisoner is shown to be more effective in compromising online fine-tuning performance than the baselines.



\section{Related Work}
\label{sec:related}
The vulnerability to various types of attacks has been well studied in the supervised learning field. 
Evasion attack \citep{Goodfellow2014Adv} assumes the attacker can manipulate testing inputs after the victim model is trained.
Data poisoning attack, on the other hand, is performed on the training inputs.
The attacker may insert \citep{Chen2017TargetedBA} or modify the training inputs \citep{Biggio2012PoisoningAA,Shafahi2018PoisonFT} to undermine the performance of the trained victim model.

\vspace{-0.5em}
\paragraph{Attacks in Online RL} 

Reward poisoning has been extensively studied in bandit~\citep{ma2018data, bogunovic2021stochastic, garcelon2020adversarial, guan2020robust, jun2018adversarial, liu2019data, lu2021stochastic, hajiesmaili2020adversarial, zuo2020near} and online RL~\citep{banihashem2022admissible, huang2019deceptive, liu2024efficient, Rakhsha2021Policy, Rakhsha2021Reward, sun2020vulnerability, zhang2020adaptive} settings.

\vspace{-0.5em}
\paragraph{Attacks in Offline RL} 

Reward poisoning in batch/offline RL~\citep{Ma2019Policy, rangi2022understanding, zhang2008value, zhang2009policy, Rakhsha2021Reward, Rakhsha2021Policy} is perhaps more relevant to our work, in contrast to online learning where the data collection procedure is also polluted by the attacked policy. In addition, \cite{Gong2022BackdoorORL} proposed the first backdoor attack in offline RL by altering the training observations, and \cite{Wu2023Reward} designed a data poisoning attack specifically on multi-agent RL.

\vspace{-0.5em}
\paragraph{Defenses in RL} To address the vulnerabilities raised in the literature, various defenses against adversarial attacks on RL have been proposed~\citep[e.g.,][]{zhang2009policy, banihashem2021defense, lykouris2021corruption, rangi2022saving}.

However, existing attacks in online RL require access to the online environment and are therefore infeasible in many practical scenarios. 
On the other hand, offline RL attacks lead to poor performance during validation and can be detected before online fine-tuning. 
To the best of our knowledge, the stealthiness notion, where the impact on performance is not noticeable offline but occurs online, has not been explored in the current literature. Hence, 
none of the existing attack (or defense) methods can be directly applied to O2O RL settings to achieve our objectives.



\section{Problem Setup}
\label{setup}

In this section, we set up the three participants in the O2O poisoning problem: the environment, the victim agent, and the attacker.


\subsection{Preliminary}
\label{sec:prelim}
% sac setup
We formulate the RL process via the standard Markov Decision Process (MDP) 
$\mathcal{M}=(\mathcal{S},\mathcal{A},\mathbb{P}, R ,\gamma, \mu_0)$. 
Here $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, 
$\mathbb{P}: \mathcal{S}\times\mathcal{A}\times\mathcal{S}\rightarrow[0,1]$ is the transition function, $R:\mathcal{S}\times\mathcal{A}\rightarrow \mathbb{R}$ is the reward function, 
$\gamma \in [0,1)$ is the discount factor, 
and $\mu_0: \mathcal{S}\rightarrow [0,1]$ is the initial state distribution.

For the \textbf{victim agent}, we define its policy 
$\pi(a|s)$ as the probability of taking action $a$ at state $s$.
The agent's goal is to find the optimal policy that maximizes the expected return 
$\pi^* = \argmax_\pi J(\pi)$, where $ J(\pi) :=\mathbb{E}_\pi[\sum_{t=0}^{\infty}\gamma^t r_t | \mathcal{M}]$.

In the offline RL setting, there is a batch of transitions $\Dcal =\{(s, a, r, s')\}$, referred to as the offline dataset, collected by applying an unknown behavior policy in the environment. The offline agent aims to learn a high-return policy $\pi$ given $\Dcal$, although the expected return $J(\pi)$ may vary depending on the quality of $\Dcal$. O2O RL appends a subsequent online fine-tuning stage that continues the training of $\pi$ and $Q$ (if applicable) using new online interactions along with (optionally) the pre-collected offline data.

In the O2O literature, it has been shown that offline conservative $Q$-learning \citep[CQL,][]{Kumar2020CQL} followed by an off-policy algorithm, often soft actor critic \citep[SAC,][]{haarnoja2018soft}, for online fine-tuning is a strong yet simple baseline~\citep{lee2022offline, Yu2023Actor}. Intuitively, it is effective because CQL provides a good $Q$-function initialization that suppresses $Q$-values for out-of-distribution (OOD) actions, avoiding poor online exploration caused by false over-estimation. Using an off-policy algorithm online then allows faster learning, as the $Q$-function is freed from conservative constraints/penalties. 

As CQL+SAC has served as a common baseline in the O2O literature~\citep{lee2022offline, nakamoto2024cal, Yu2023Actor}, we use the same CQL+SAC scheme as our victim O2O agent for \textit{continuous} action experiments, including the MuJoCo~\citep{mujoco} locomotion tasks. For \textit{discrete} action environments such as Frozen Lake, we use Double DQN~\citep{van2016deep} as the online algorithm.

\paragraph{Soft Actor-Critic} 
SAC is an actor-critic algorithm based on the maximum entropy framework. Akin to canonical actor-critic, 
it includes actor update and critic update, 
as shown in \eqref{eq:sac-actor} and \eqref{eq:sac-critic}, respectively. 
In particular, we employ SAC-v2~\citep{haarnoja2018soft}, an alternative implementation that automatically adjusts the entropy of the policy via the Lagrangian dual formulation, 
where the Lagrangian multiplier is often called the temperature $\alpha$
and is updated by following the gradient of \eqref{eq:sac-temp} with respect to $\alpha$.
%
\begin{align}
\begin{split}\label{eq:sac-critic}
    & \Lcal^{\text{SAC}}_Q(\psi, \Dcal) := \!\! \expunder{(s, a, r, s') \sim \Dcal} \! \left[ \left( Q_{\psi}(s, a) - y(r, s') \right)^2 \right] \!\!\!\!\! \\
    & y(r, s') \! := \! r \! + \! \gamma \!\!\expunder{a'\sim\pi_\theta(s')}
        \! [Q_{\bar{\psi}}(s', a') \! - \! \alpha \log \! \pi_\theta(a'|s')] \\
\end{split} \\
    & \Lcal^{\text{SAC}}_\pi(\theta, \Dcal) \! := \! \expunder{s \sim \Dcal}\expunder{a\sim\pi_\theta(s)}[  \alpha \! \log \! \pi_\theta(a|s) \! - \! Q_\psi(s, a)   ] \label{eq:sac-actor} \\
    & \Lcal^{\text{SAC}}_{\text{temp}}(\alpha, \Dcal) \! := \! -  \alpha \expunder{s \sim \Dcal}\expunder{a\sim\pi_\theta(s)} [\log\pi_\theta(a|s) + \Bar{\Hcal}]. \label{eq:sac-temp}
\end{align}
%
Here the expectation  $\mathbb{E}_{a\sim\pi_\theta(s)}[\cdot]$ could be directly evaluated for discrete action spaces and be stochastically approximated for continuous action spaces.

The actor update~\eqref{eq:sac-actor} aims to maximize the $Q$-values, and hence the cumulative rewards, alongside the policy's entropy. The critic update~\eqref{eq:sac-critic} aims to find a better soft $Q$-function approximation by minimizing the squared temporal difference error, where $\bar{\psi}$ denotes the target network, a commonly used trick to stabilize RL training. It is often updated using Polyak averaging (an exponential moving average), $\bar{\psi} \leftarrow \tau \psi + (1-\tau) \bar{\psi}$, where $\tau \in (0,1)$ is a hyper-parameter that controls how fast the target network $\bar{\psi}$ evolves towards the current $Q$-network $\psi$. The temperature update~\eqref{eq:sac-temp} automatically tunes $\alpha > 0$ to ensure that the entropy of the policy is lower bounded by a target entropy $\Bar{\Hcal}$.
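
As a short worked illustration of the Polyak update (the step index $k$ and the value $\tau = 0.005$ below are chosen purely for illustration and are not taken from our experimental configuration), let $\psi_i$ denote the critic parameters after the $i$-th gradient step. Unrolling the recursion over $k$ steps gives
\begin{equation*}
    \bar{\psi}_k = (1-\tau)^k \, \bar{\psi}_0 + \tau \sum_{i=1}^{k} (1-\tau)^{k-i} \, \psi_i ,
\end{equation*}
so the target network is an exponentially weighted average of past critics whose weights sum to one. The weight on the initialization decays as $(1-\tau)^k$; for instance, with $\tau = 0.005$ it drops to about $0.37$ after $k = 200$ steps, which suggests an effective averaging horizon of roughly $1/\tau$ gradient steps.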


\paragraph{Conservative $Q$-Learning} 

CQL is a popular choice for offline and O2O RL that combats the distribution shift issue.
The central idea is to regularize the $Q$-values of actions that are not observed in the offline dataset. Such regularization avoids over-estimating OOD actions that may have a low return in the real environment. We illustrate such a conservative estimation in the toy example in Figure~\ref{fig:toy-example}. Specifically, we consider a commonly used variant of CQL, namely CQL($\mathcal{H}$), whose regularizer is given in \eqref{eq:cql-critic} along with the squared loss. In addition, CQL($\mathcal{H}$) follows \eqref{eq:cql-actor} to update the policy for continuous action spaces, 
while in discrete space the policy is induced greedily from $Q_\psi$.
%
\begin{align}
\begin{split}\label{eq:cql-critic}
    & \Lcal^{\text{CQL}}_Q(\psi, \Dcal) := \expunder{(s, a, r, s') \sim \Dcal}  \left[ \left( Q_{\psi}(s, a) - y(r, s') \right)^2 \right] \\
    & + \lambda \underbrace{\! \expunder{(s, a)\sim\Dcal} \! \left[  \log \! \textstyle\sum\nolimits_u \! \exp(Q_\psi(s, u)) \! -  \!  Q_\psi(s, a) \right]}_{\text{\normalsize $=: \mathcal{R}^{\text{CQL}}(Q_\psi, \Dcal)$}
    }, \! \\
    %
        & \text{discrete: }
        y(r, s') \! := \! r \! + \! \gamma Q_{\bar{\psi}}(s'\! , \arg\!\max\nolimits_{a'}\! Q_\psi(s', a')) \! \\
        & \text{continuous: } 
        y(r, s') \! := \! r \! + \! \gamma \!\!\expunder{a'\sim\pi_\theta(s')}
        \! [Q_{\bar{\psi}}(s', a') ] \\
\end{split} \\
\label{eq:cql-actor}
    & \Lcal^{\text{CQL}}_\pi(\theta, \Dcal) := - \expunder{s \sim \Dcal}\expunder{a\sim\pi_\theta(s)}[  Q_\psi(s, a)   ]  .
\end{align}
%
Here $\mathcal{R}^{\text{CQL}}$ is a conservative regularizer. As in SAC, the expectation $\mathbb{E}_{a\sim\pi(s)}\! [\cdot]$ and the log-sum-exp $\log \! \textstyle\sum\nolimits_a \! \exp  Q(s, a)$ are tractable for discrete action spaces and can be stochastically approximated for continuous spaces.
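
For intuition in the discrete case, the per-sample regularizer can be rewritten as a negative log-softmax. The identity below is a standard property of the log-sum-exp, and the two-action numbers are made up for illustration only:
\begin{align*}
    \log \sum\nolimits_u \exp(Q_\psi(s,u)) - Q_\psi(s,a)
    &= - \log \frac{\exp(Q_\psi(s,a))}{\sum_u \exp(Q_\psi(s,u))} \;\geq\; 0 , \\
    \frac{\partial}{\partial Q_\psi(s,u)} \Big[ \log \sum\nolimits_{u'} \exp(Q_\psi(s,u')) - Q_\psi(s,a) \Big]
    &= \frac{\exp(Q_\psi(s,u))}{\sum_{u'} \exp(Q_\psi(s,u'))} - \mathbf{1}\{u = a\} .
\end{align*}
Minimizing the regularizer therefore pushes down the $Q$-value of every action in proportion to its softmax weight, while pushing up the $Q$-value of the dataset action $a$. For example, with two actions, $Q_\psi(s,a) = 1$ for the dataset action and $Q_\psi(s,u) = 0$ for the other, the regularizer equals $\log(1 + e^{-1}) \approx 0.31$, and its derivatives are approximately $-0.27$ with respect to $Q_\psi(s,a)$ and $+0.27$ with respect to $Q_\psi(s,u)$.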

Algorithm~\ref{algo:o2o} is an example of the O2O protocol with CQL used for offline training and SAC for online fine-tuning. For our additional experiments in Section~\ref{sec:more-offline-rl}, one could replace the offline/online algorithms with the corresponding alternatives.

\begin{algorithm}[!t]    
    \caption{O2O protocol: offline (CQL) + online (SAC)}
    \label{algo:o2o}
    \begin{algorithmic}[1]
    \STATE {\bfseries Input:} offline dataset $\Dcal = \{(s, a, r, s')\}$
    \STATE {\color{gray} // offline training phase with CQL.}
    \STATE initialize CQL parameters $\theta, \psi, \Bar{\psi}$
    \FOR{\text{number of offline iterations}}
        \STATE sample mini-batch from offline dataset $\Dcal$
        \STATE update $\psi$, $\theta$ with \eqref{eq:cql-critic}, \eqref{eq:cql-actor} respectively
        \STATE $\Bar{\psi} \leftarrow \tau\psi + (1-\tau) \Bar{\psi}$
    \ENDFOR
    \STATE {\color{gray} // online training phase with SAC.}
    \STATE load parameters $\theta, \psi, \Bar{\psi}$ for SAC
    \STATE initialize temperature $\alpha$ for SAC
    \FOR{\text{number of online iterations}}
        \STATE {\color{gray} // environmental step}
            \STATE $a \sim \pi_\theta(a|s), r \sim R(s, a), s' \sim \mathbb{P}(s'|s, a)$
            \STATE $\Dcal\leftarrow \Dcal \ \cup \ \{(s, a, r, s')\}$
        \STATE {\color{gray} // gradient step}
            \STATE sample mini-batch from online buffer $\Dcal$
            \STATE update $\psi$, $\theta$, $\alpha$ with \eqref{eq:sac-critic}, \eqref{eq:sac-actor}, \eqref{eq:sac-temp} respectively
            \STATE $\Bar{\psi} \leftarrow \tau\psi + (1-\tau) \Bar{\psi}$
    \ENDFOR
    \STATE {\bfseries Output:} network parameters $\psi, \theta$
    \end{algorithmic}
\end{algorithm}


\subsection{Motivation}
\label{sec:teaser}

\paragraph{Distribution Shift} 
% 
It is arguably well known in the O2O literature~\citep{nair2020awac, lee2022offline, Yu2023Actor, nakamoto2024cal} that (dramatic) distribution shifts caused by over-estimated $Q$-values for OOD states/actions lead to catastrophic performance drops during O2O transfer. This serves as the key motivation for our attacking algorithm, which we elaborate on next.

At offline training time, the target value for Bellman backups of critic update in \eqref{eq:sac-critic} uses actions $a'$ sampled from the learned policy $\pi_{\theta}$,
while the $Q$ function was trained only on actions produced by the offline data under the behavior policy (the expectation over $\Dcal$ in \eqref{eq:sac-critic}).
As a result, the offline learned $Q$ function typically over-estimates the value of $Q(s,a)$ for an OOD action $a$,
i.e., when $a$ is never applied at state $s$ in the offline dataset.
A similar issue also plagues the actor update in \eqref{eq:sac-actor},
where $Q_\psi$ is evaluated on $a \sim \pi_\theta(s)$.

During online fine-tuning, the agent has a chance to update over-estimated OOD actions due to, for example, $\epsilon$-greedy exploration and encountering OOD states. The bootstrap error resulting from over-estimation could wipe out the offline learned policy that previously performed well.

\begin{figure}[!t]
    \centering
    \begin{subfigure}[b]{\columnwidth} % b for bottom alignment
        \includegraphics[width=\textwidth]{UAI_2024/figures/toy_offline.pdf}
        \caption{Offline phase: Let $Q$, $\hat{Q}$ and $\tilde{Q}$ be the ground truth $Q$-function, the CQL approximation without being poisoned, and a uniformly poisoned $Q$-function, respectively. The bar plot shows the number of observed data points for the corresponding action. It can be observed that $\hat{Q}$ well approximates in-distribution actions $\{-1, 0, 1\}$ and under-estimates OOD actions as expected. The poisoned $\tilde{Q}$ is stealthy, as a uniform increase does not change the policy but breaks the conservatism, which would lead to poor online performance.}
        \label{fig:toy-sub1}
    \end{subfigure}
    \vskip 1em
    \begin{subfigure}[b]{\columnwidth} % b for bottom alignment
        \includegraphics[width=\textwidth]{UAI_2024/figures/toy_online.pdf}
        \caption{Online phase: Suppose we initialize an online agent with the offline poisoned $\tilde{Q}$. After some online interactions, which are clean as the poisoning is only applied to offline data, one could observe that the majority of interactions are in-distribution because they have higher $\tilde{Q}$-values at the beginning. However, as many clean samples are collected for in-distribution actions, their $\tilde{Q}$ estimates converge to the ground truth. 
        As a result, the OOD actions become dominant due to their higher $\tilde{Q}$-values, which were updated less frequently, hence promoting online distributional shift.
        }
        \label{fig:toy-sub2}
    \end{subfigure}
    
    \caption{A toy bandit example, with Gaussian-like reward function and eleven actions, to demonstrate the intuition that maximizing $Q$-values (uniformly) can achieve both stealthiness and effectiveness.}
    \label{fig:toy-example}
    \vspace{-1em}
\end{figure}

\paragraph{A Toy Example}  We now provide a toy bandit example in Figure~\ref{fig:toy-example} to further demonstrate our motivation. The key idea of this example is that uniformly lifting the $Q$-values can achieve both stealthiness and effectiveness, because a uniform over-estimation would not change the policy in the offline phase, as demonstrated in Figure~\ref{fig:toy-sub1}; and it will promote online distributional shift, as shown in Figure~\ref{fig:toy-sub2}.

While the toy example simply assumes that the $Q$-function can be directly manipulated to achieve a uniform over-estimation,
this is infeasible in a poisoning attack setting. 
In Section~\ref{sec:attack}, we show that one could achieve it by formulating it as a bi-level optimization. 
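
To make the intuition of the toy example concrete, consider an idealized sketch in which the poisoned critic is an exact uniform lift, $\tilde{Q}(a) = \hat{Q}(a) + c$ for some constant $c > 0$. The constant $c$, the set $\mathcal{A}_{\text{in}}$ of in-distribution actions, and the exactness of the lift are simplifying assumptions introduced only for this illustration. Offline, the greedy action is unchanged:
\begin{equation*}
    \argmax_a \tilde{Q}(a) = \argmax_a \big( \hat{Q}(a) + c \big) = \argmax_a \hat{Q}(a) .
\end{equation*}
Online, suppose that clean interactions drive the estimates of the in-distribution actions back to the ground truth $Q(a)$, while a rarely visited OOD action $a'$ retains its poisoned value $\hat{Q}(a') + c$. The OOD action is then preferred by a greedy agent whenever
\begin{equation*}
    \hat{Q}(a') + c > \max_{a \in \mathcal{A}_{\text{in}}} Q(a)
    \quad \Longleftrightarrow \quad
    c > \max_{a \in \mathcal{A}_{\text{in}}} Q(a) - \hat{Q}(a') ,
\end{equation*}
so a sufficiently large lift makes OOD actions dominate once the in-distribution values have been corrected, which is the shift depicted in Figure~\ref{fig:toy-sub2}.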



\section{The Attack Algorithm}
\label{sec:attack}

We investigate the vulnerability of O2O RL under data poisoning during offline training. 
Since the attacker is not allowed to perform any attack during the online fine-tuning phase, 
the victim will eventually recover from any offline attack given infinite online training resources. 
Thus, we set the attacker's goal to be such that the victim model, when fine-tuned online,
suffers as much performance drop---both in magnitude and duration---at the \textit{initial} phase as possible.

% pseudo code for our ift update
\begin{algorithm}[!t]    
    \caption{Update $\delta_r$ with IFT}
    \label{algo:ift}
    \begin{algorithmic}[1]
    \STATE {\bfseries Input:} offline dataset $\Dcal = \{(s,a,r,s')\}$, 
       poison $\delta_r$, surrogate critic parameters $\psi$, step size $\eta$
    \STATE$v_1 \! \gets \! \frac{\partial \mathcal{L}_{\delta_r}}{\partial\psi}|_{\delta_r,\psi}$, where $\Lcal_{\delta_r}$ is the outer objective in \eqref{eq:attack_obj}
    %
    \STATE $v_2 \gets \text{InverseHVP}(v_1,\frac{\partial\mathcal{L}_Q(\psi,\Dcal)}{\partial\psi})$ with $\Lcal_Q$ from \eqref{eq:cql-critic}.
    %
    \STATE 
    $v_3 \gets \frac{\partial^2 \mathcal{L}_Q(\psi,\Dcal)}{\partial\delta_r \partial\psi}v_2$.
    In PyTorch, it can be implemented by
    $\mathtt{v_3} = \mathtt{grad}(\frac{\partial\mathcal{L}_Q(\psi,\Dcal)}{\partial\psi},\delta_r, \mathtt{grad\_outputs}=v_2)$
    \STATE {\bfseries Output:} Updated $\delta_r =\delta_r
    + \eta v_3$ as \eqref{eq:attack_obj} is maximization
    \end{algorithmic}
\end{algorithm}


\subsection{The Threat Model}
\label{subsec:threat}

Following the standard poisoning attack protocol, 
we assume that the victim has no access to clean demonstrations during offline training.
Key to our threat model is the requirement that the \textbf{victim model must retain good ``online performance'' when offline training concludes},
because otherwise the attack would be detected and the model would be precluded from online fine-tuning.
Here the ``online performance'' is evaluated by hypothetically applying the policy to an online environment,
but without updating the policy (as opposed to online training).
In reality, the agent may have a very limited budget to run such evaluations, for example, running it only once before launching into online training.
However, given that offline policy evaluation is notorious for its high variance, 
we define the performance of offline training in this way, 
noting that the value of such evaluation is \textit{not} used by either the agent’s RL algorithm or the attacker’s poisoning algorithm.

Although reward, state, and action are all feasible targets of poisoning on the offline batch data,
we restrict our attention to reward because it is a single scalar and carries less structure than states and actions,
hence allowing more stealthy poisoning.
The attacker is not allowed to access 
the victim model,
such as its policy network or value functions. 
Following common practice such as Witches' Brew \citep{Geiping2021witchesBrew} and continual input-aware poisoning \citep{Kang2023CIAP},
the attacker may internally train a \textit{surrogate} RL agent and query it to construct the poisons.

In addition to the aforementioned stealthiness constraints,
we also impose the standard $\ell_p$ norm constraints on the reward perturbations.
For example, an $\ell_0$ norm constraint specifies how many offline transitions can be perturbed,
and an $\ell_1$ norm constraint bounds the total or average amount of perturbation.
For a vector $\xvec$, its $\ell_1$ norm is $\nbr{\xvec}_1 := \sum_i \abr{x_i}$,
and its $\ell_\infty$ norm is $\nbr{\xvec}_\infty := \max_i \abr{x_i}$.
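
As a small numerical illustration (the vector below is made up and does not come from our experiments), take $\xvec = (0.5, -0.2, 0, 0.3)$ as the perturbation on four candidate transitions. Then
\begin{equation*}
    \nbr{\xvec}_1 = 0.5 + 0.2 + 0 + 0.3 = 1.0 ,
    \qquad
    \nbr{\xvec}_1 / 4 = 0.25 ,
    \qquad
    \nbr{\xvec}_\infty = 0.5 ,
\end{equation*}
and its $\ell_0$ norm (the number of nonzero entries) is $3$. A budget on the average $\ell_1$ norm thus limits the mean magnitude of the perturbation, while the $\ell_\infty$ budget caps the largest change to any single reward.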


\subsection{The Poisoning Algorithm}

Due to the stealthiness requirement,
the poisoning algorithms for offline RL such as \citet{Gong2022BackdoorORL} cannot serve our purpose, as they would lead to poor online performance for the offline trained model.
%
Our inspiration originates from the distribution shift phenomenon,
which shows that over-estimation of the $Q$-function will lead to poor online performance, 
while keeping the performance during offline training competitive.
Thus, we seek to poison the reward by promoting the resulting $Q$-values at OOD actions,
thereby maximally exacerbating the over-estimation problem.

Specifically, we first randomly sample $q$\% of the offline transitions $\mathcal{C}^p := \{(s, a, r^p, s')\}$ as candidate transitions to be poisoned. 
Then we perturb the rewards on these transitions to construct a poisoned buffer $\Dcal^p=\{(s, a, r^p+\delta_{r}, s')\}$. 
Finally, we combine it with the remaining clean transitions to construct the poisoned training set $\Dcal^t := \Dcal^p \ \cup \ (\Dcal\setminus \mathcal{C}^p)$.

Let $\delta_r$  be a vector whose components correspond to the reward perturbation on each transition in $\mathcal{C}^p$.
Then our poisoner conceptually solves the following constrained bi-level optimization for $\delta_r$:
%
\begin{align}
\label{eq:attack_obj}
\max_{\delta_r} \quad & \expunder{s \sim \Dcal} \expunder{a\sim \mu} \underbrace{[Q_{\psi^*}(s, a)]}_{\text{over-estimation}} 
- \beta \underbrace{\mathcal{R}(Q_{\psi^*}, \mathcal{D}^t)}_\text{extra stealthiness} \\
\label{eq:attack_L1_constr}
    \text{s.t.} \quad & \nbr{\delta_r}_1 / \abr{\Dcal^p} \leq \epsilon_1
    \quad \text{and} \quad 
    \nbr{\delta_r}_\infty \leq \epsilon_\infty\\
\label{eq:attack_inner_opt}    
      & \psi^* \leftarrow \text{\tt (surrogate)\!-victim-RL}(\Dcal^t).
\end{align}
where $\mu$ is a distribution over $\Acal$, $\mathcal{R}$ is a critic regularizer, and {\tt (surrogate)\!-victim-RL} is an offline RL algorithm, run on either the victim or a surrogate model. Ideally, we use a uniform $\mu$ to promote uniform over-estimation for stealthiness. The regularizer $\mathcal{R}$ aims to further ensure stealthiness, as exact uniform over-estimation might not always be achievable due to, e.g., optimization error or a continuous action space. 

Note in the first term of the outer objective,
we do not require $a$ to be from the offline data,
i.e., it does not have to be what was taken at state $s$.
This exactly serves our purpose of simulating OOD actions,
and promoting their $Q$ values.
It is similar in spirit to the log-sum-exp term in \eqref{eq:cql-critic}.
For the Frozen Lake task, whose action space is discrete and finite, it is straightforward to apply a uniform $\mu$, 
while for locomotion tasks with bounded continuous action spaces, the expectation over $a$ can be efficiently approximated with samples.

The regularizer $\mathcal{R}$ can typically be chosen from the constraints derived for offline RL algorithms, for example the commonly used KL divergence~\citep{wu2019behavior}, uncertainty quantification~\citep{bai2022PBRL}, or the CQL regularizer $\mathcal{R}^\text{CQL}_Q$, since its purpose, like that of offline RL regularizers, is to improve offline performance (ensuring stealthiness). 
In practice, we use the CQL regularizer, as it can be implemented for both discrete and continuous action spaces. 

To summarize, our poisoner solves
%
\begin{align}
\label{eq:attack_obj_final}
\max_{\delta_r} \quad &\! \expunder{s \sim \Dcal}  \expunder{a\sim \mathcal{U}(\Acal)} [Q_{\psi^*}(s, a)]
- \beta \mathcal{R}^\text{CQL}(Q_{\psi^*}, \mathcal{D}^t) \\
 \text{s.t.} \quad &\eqref{eq:attack_L1_constr} \text{ and } \eqref{eq:attack_inner_opt}.
\end{align}
where $\mathcal{U}$ stands for the uniform distribution. The intuition behind $\mathcal{R}^\text{CQL}$ is to keep the (poisoned) $Q$-function from deviating from the dataset actions, a common technique in offline RL. This is achieved by using $\mathcal{R}^\text{CQL}$ to push up the $Q$ values of the dataset actions.


\subsection{Solving the Bi-level Optimization}

To solve \eqref{eq:attack_obj_final},
a key quantity needed is the derivative of the outer objective with respect to $\delta_r$,
which in turn needs the derivative of $Q_{\psi^*}$ with respect to $\delta_r$.
This is challenging because their dependence is through an offline RL algorithm.
The fundamental mathematical solution is the implicit function theorem (IFT),
based on which a number of techniques with improved computational and spatial complexity
have been widely used in previous works on hyper-parameter tuning \citep{Bengio2000Gradient,Maclaurin2015Gradient,Shaban2019Truncated,Lorraine2020Opt}. 
Here, we utilize these techniques in a similar way as described in Algorithm~\ref{algo:ift}, 
where instead of tuning the hyper-parameter, 
we update $\delta_r$. 
In particular, we follow \citet{Lorraine2020Opt} and approximate the inverse Hessian-vector product (inverse HVP) using a Neumann series approximation.
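For concreteness, the underlying computation can be sketched in generic notation; the symbols $F$, $\Lcal_{\mathrm{in}}$, $H$, $\alpha$, $K$, and $v$ are introduced here for illustration only. Let $F(\delta_r, \psi)$ denote the outer objective in \eqref{eq:attack_obj_final} and $\Lcal_{\mathrm{in}}(\psi, \delta_r)$ the inner training loss, and assume that $\nabla_\psi \Lcal_{\mathrm{in}}(\psi^*, \delta_r) = 0$ and that $H := \nabla^2_{\psi\psi} \Lcal_{\mathrm{in}}$ is invertible at $\psi^*$. The IFT and the truncated Neumann series then give
\begin{align*}
\frac{\mathrm{d} F}{\mathrm{d} \delta_r} &= \nabla_{\delta_r} F - \big(\nabla^2_{\psi \delta_r} \Lcal_{\mathrm{in}}\big)^{\top} H^{-1} \nabla_{\psi} F, \\
H^{-1} v &\approx \alpha \sum_{j=0}^{K} (I - \alpha H)^{j} v,
\end{align*}
where all derivatives are evaluated at $(\psi^*, \delta_r)$, $v := \nabla_\psi F$, $\alpha > 0$ is a step size for which the series converges (e.g., $\|I - \alpha H\| < 1$), and $K$ is the truncation length. Each term of the sum needs only Hessian-vector products, so $H$ is never formed explicitly. As a further illustration, if the inner loss contains a squared TD term $\frac{1}{|\Dcal^t|}\sum_i \big(Q_\psi(s_i, a_i) - y_i\big)^2$ whose target $y_i$ includes $r_i + \delta_{r,i}$ and is otherwise held fixed (an assumption about the surrogate's critic loss), then the column of $\nabla^2_{\psi \delta_r} \Lcal_{\mathrm{in}}$ associated with a poisoned transition $i$ is $-\frac{2}{|\Dcal^t|} \nabla_\psi Q_\psi(s_i, a_i)$.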

\begin{algorithm}[t]    
    \caption{O2OP: Poison Generation via Surrogate Model}
    \label{algo:attack}
    \begin{algorithmic}[1]
    \STATE {\bfseries Input:} clean offline dataset $\Dcal = \{(s,a,r,s')\}$
    \STATE randomly pick a set of transitions $\mathcal{C}^p$ for poisoning
    \STATE initialize surrogate CQL model parameters $\theta, \psi, \Bar{\psi}$
    \STATE initialize poisoned dataset $\Dcal^p \! = \! \{(s, a, r^p + \delta_{r}, s')\}$
    \STATE combine clean and poisoned dataset into the training dataset $\Dcal^t = \Dcal^p \cup (\Dcal \setminus \mathcal{C}^p)$
    \FOR{\text{step = 1 ... number of offline steps}}
        \STATE sample a mini-batch from $\Dcal^t$
            \STATE {\color{gray} // {\tt surrogate-victim-RL} steps}
            \STATE update $\psi, \Bar{\psi}, \theta$ according to \eqref{eq:cql-critic} and \eqref{eq:cql-actor}
            \STATE {\color{gray} // IFT steps for poison update}
            \IF{  $ \text{step} \bmod \text{IFT\_freq.} ==  0$  }
            \STATE {\color{gray} // access to only surrogate model $\psi$}
            \STATE {update $\delta_r$ via Algorithm \ref{algo:ift} using $\psi$}
            \ENDIF    
    \ENDFOR
    \STATE {\color{gray} // output poisoned $\Dcal^t$ for subsequent victim training}
    \STATE {\bfseries Output:} $\Dcal^t$, which applies reward perturbation $\delta_r$
    \end{algorithmic}
\end{algorithm}

Equipped with the gradient with respect to $\delta_r$,
we could simply perform gradient-based updates such as Adam.
However, this is very expensive because IFT-style algorithms require solving the inner offline RL problem to optimality.
For computational efficiency, 
we only run offline RL for a few steps in each iteration,
and use the suboptimal $\psi$ to update $\delta_r$ via Algorithm~\ref{algo:ift}.
The entire procedure is summarized in Algorithm~\ref{algo:attack},
illustrating how the attacker generates the poison $\delta_r$ and hence the poisoned dataset $\Dcal^t$. The victim algorithm, which need not be the same as the surrogate algorithm (CQL), is then trained on $\Dcal^t$.
We refer to the overall procedure as the \textbf{O2O poisoner (O2OP)}.

It is noteworthy that the attacker does not require accessing the victim agent's model,
neither the policy nor the value functions.
Instead, it trains its own surrogate agent based on which the poison is constructed.
Surrogate models are quite commonly used \citep{Geiping2021witchesBrew,Kang2023CIAP, Souri2022Sleeper, Cherepanova2021Lowkey, Goldblum2023survey},
but their effectiveness here is far from trivial because RL is well known for its high variance.
With different seeds and different sampled mini-batches,
the surrogate agent can be quite different from the real agent,
making it nontrivial for the learned poison to remain effective.



\section{Empirical Evaluation}
\label{sec:exp}

\begin{algorithm}[t]    
    \caption{Baselines: \tt poison-uniform/wb}
    \label{algo:baselines}
    \begin{algorithmic}[1]
    \STATE {\bfseries Input:} clean offline dataset $\Dcal = \{(s,a,r,s')\}$
    \STATE randomly pick a set of transitions $\mathcal{C}^p$ for poisoning
    \STATE initialize {\tt victim} model parameters $\theta, \psi, \Bar{\psi}$
    \STATE initialize poisoned dataset $\Dcal^p \! = \! \{(s, a, r^p + \delta_{r}, s')\}$
    \STATE {\color{gray} // fixed perturbation for {\tt poison-uniform}}
    \STATE {\bf if} {\tt poison-uniform}  {\bf then}  $\delta_{r} \leftarrow \epsilon_1$  {\bf end if}
    \STATE obtain the training dataset $\Dcal^t = \Dcal^p \cup (\Dcal \setminus \mathcal{C}^p)$
    \FOR{\text{step = 1 ... number of offline steps}}
        \STATE sample a mini-batch from $\Dcal^t$
            \STATE {\color{gray} // {\tt victim-RL} steps}
            \STATE update $\psi, \Bar{\psi}, \theta$ according to \eqref{eq:cql-critic} and \eqref{eq:cql-actor}
            \STATE {\color{gray} // simultaneously poisoning for {\tt poison-wb}}
            \IF{ {\tt poison-wb} {\bf and}  $ \text{step} \! \bmod \! \text{IFT\_freq.} \! == \! 0 \!$ } 
            \STATE {\color{gray} // {\tt poison-wb} accesses {\tt victim-RL}}
            \STATE {update $\delta_r$ via Algorithm \ref{algo:ift} using $\psi$}
            \ENDIF    
        % \ENDFOR
    \ENDFOR
    \end{algorithmic}
\end{algorithm}

We now empirically verify that our proposed poisoner O2OP fulfills the aforementioned objectives.
We tested on Frozen Lake~\citep{openaigym} and on the Hopper, HalfCheetah, and Walker2d environments from the D4RL benchmark \citep{Fu2020d4rl}. 
In this section, we use CQL for offline training, and SAC or DDQN for online fine-tuning in continuous or discrete tasks, respectively.
Following the common protocol, 
we repeated experiments on each environment with 5 seeds, 
and then plotted the mean return from the 5 trials.

\paragraph{Baseline Comparators}

Since no existing algorithm addresses our task,
we adopted a uniform poisoner which sets all components of $\delta_r$ to $\epsilon_1$.
%
To study the effectiveness of using surrogate models,
we also compared with an attacker which has white-box access to the victim model.
These two methods will be referred to as {\tt poison-uniform} and {\tt poison-wb}, respectively.
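As a quick check of how {\tt poison-uniform} relates to the budget in \eqref{eq:attack_L1_constr}, assume $\epsilon_1 \geq 0$ and that every component of $\delta_r$ equals $\epsilon_1$. Then
\begin{align*}
\frac{\nbr{\delta_r}_1}{\abr{\Dcal^p}} = \frac{\abr{\Dcal^p} \, \epsilon_1}{\abr{\Dcal^p}} = \epsilon_1
\quad \text{and} \quad
\nbr{\delta_r}_\infty = \epsilon_1,
\end{align*}
so the uniform poisoner uses the $\ell_1$ budget exactly and is feasible whenever $\epsilon_1 \leq \epsilon_\infty$. This holds, for example, for $\epsilon_1 = 0.1$ with $\epsilon_\infty = 1$ and for $\epsilon_1 = 4$ with $\epsilon_\infty = 5$.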

\paragraph{Environments} Frozen Lake is a discrete text environment. 
The environment consists of a 4-by-4 or 8-by-8 grid, with a goal state and several holes (terminal states). 
The agent receives a reward of $1$ for reaching the goal state and a reward of $0$ in all other states.
It should aim to reach the goal state without falling into a hole.
Locomotion tasks are simulated robotics environments, where the reward is measured by the forward travel distance while staying ``stable''. D4RL provides datasets at different skill levels for each locomotion task, depending on the average return of the behavior policy that collected the dataset. We use the ``medium'' datasets in our experiments.
%
Figure~\ref{fig:env_illu} visualizes a typical 4-by-4 Frozen Lake, as well as the locomotion environments. 


\subsection{Discrete Environment: Frozen Lake}

\begin{figure}[t]
\centering
\begin{subfigure}{0.18\textwidth}
  \centering
  \includegraphics[width=\textwidth]{UAI_2024/figures/frozen_lake.jpg}
\end{subfigure}
\begin{subfigure}{0.18\textwidth}
  \centering
  \includegraphics[width=\textwidth]{UAI_2024/figures/half_cheetah.jpg}
\end{subfigure}

\vspace{0.01\textwidth} % Space between rows

\begin{subfigure}{0.18\textwidth}
  \centering
  \includegraphics[width=\textwidth]{UAI_2024/figures/hopper.jpg}
\end{subfigure}
\begin{subfigure}{0.18\textwidth}
  \centering
  \includegraphics[width=\textwidth]{UAI_2024/figures/walker2d.jpg}
\end{subfigure}
% 
\caption{Visualizations of Frozen Lake, HalfCheetah, Hopper and Walker2d, respectively.$^*$}
\label{fig:env_illu}
\end{figure}
% 
\footnotetext{Figures borrowed from \citet{openaigym}.}

\begin{figure}[t]
\includegraphics[width=0.5\textwidth]{UAI_2024/figures/O2OP_frozenlake_01.png}
\caption{O2O return in offline phase (left) and online phase (right) for Frozen Lake with $\epsilon_1 = 0.1$}%
\label{fig:frozen_01}%
\vspace{1em}
\includegraphics[width=0.5\textwidth]{UAI_2024/figures/O2OP_frozenlake_002.png}
\caption{Frozen Lake with $\epsilon_1 = 0.02$}%
\label{fig:frozen_002}%
\end{figure}

We trained an offline discrete CQL agent for 100 epochs, 
with 500 steps in each epoch.
The online 
agent was trained for 50 epochs on the clean online environment, with a buffer carried over from the offline phase.
For this environment, we included all offline transitions $\Dcal$ in our candidate set $\mathcal{C}^p$,
and tested with $\epsilon_1 \in \{0.1, 0.02\}$ and $\epsilon_\infty = 1$ from \eqref{eq:attack_L1_constr}.
O2OP first generated $\delta_r$ from a surrogate model as described in Algorithm~\ref{algo:attack},
and used it to poison a new victim which was trained by CQL with a different initialization and mini-batch sampling seed.

Figure~\ref{fig:frozen_01} shows the average online return during the offline training (left) and online fine-tuning (right), 
both at $\epsilon_1 = 0.1$.
All poisoned victims perform similarly to the clean trained agent during the offline phase, 
fulfilling the stealthiness objective.
However, our O2OP drove down the online return from 0.65 to 0.3, 
which is only slightly higher than that of the white-box poisoner (0.25).
In contrast, the online return of the uniform baseline stayed above 0.45.
We also aggregated the average returns over all offline or online steps by taking their mean. 
This is provided in the legend.

We further reduced our budget to $\epsilon_1 = 0.02$ in Figure~\ref{fig:frozen_002}.
Here, the stealthiness remains satisfied offline.
During online fine-tuning,
the victim agent poisoned by {\tt poison-uniform} has a minimum average return above 0.5, 
while our O2OP drives it below 0.35,
which is almost the same as the white-box attacker.
This confirms the effectiveness of our O2OP.

\begin{figure}[!t]
\includegraphics[width=0.5\textwidth]{UAI_2024/figures/O2OP_hopper_cql.png}
\caption{O2O return in offline phase (left) and online phase (right) for Hopper with 2\% poison.}%
\label{fig:hopper_cql}%
\vspace{1em}
\includegraphics[width=0.5\textwidth]{UAI_2024/figures/O2OP_halfcheetah_cql.png}
\caption{O2O return in offline phase (left) and online phase (right) for HalfCheetah with 2\% poison.}%
\label{fig:halfcheetah_cql}%
\vspace{1em}
\includegraphics[width=0.5\textwidth]{UAI_2024/figures/O2OP_walker2d_cql.png}
\caption{O2O return in offline phase (left) and online phase (right) for Walker2d with 2\% poison.}%
\label{fig:walker2d_cql}%
\end{figure}


\subsection{Continuous Environments}

We next move on to illustrate the attack effectiveness in a \textit{continuous} space.
The continuous CQL agents were trained for 600 epochs offline, 
with 500 gradient steps per epoch.
The online continuous SAC agents were trained for an additional 100 epochs. 
We reduced the poison ratio to $2$\% (i.e., $q=2$) to make the attack more realistic.

\paragraph{Hopper}

As the hopper-medium dataset has rewards ranging in $(0,6)$, 
we increased our poison's $\ell_1$ norm budget to $\epsilon_1 = 4$.
To improve stealthiness, 
we enforced the constraint $\|\delta_r\|_{\infty}\leq \epsilon_\infty = 5$.
Accordingly, the same choices were made on both baselines {\tt poison-uniform/wb}.
Despite the slightly high values of $\epsilon_1$ and $\epsilon_\infty$,
we only poison 2\% of the transitions, 
which is consistent with poisoning or backdoor attacks in supervised learning.

Figure \ref{fig:hopper_cql} shows that, 
analogously to Frozen Lake, 
all three poisoners perform similarly to the clean unpoisoned case in terms of the offline performance, 
which again confirms the stealthiness of O2OP.
During online fine-tuning, however, 
O2OP achieves a performance drop from 3000 to 2600 (when online iteration is around 46000),
while the white-box version can further slash it to 2000.
In contrast, {\tt poison-uniform} can hardly degrade the online return, if at all.
This shows that O2OP remains effective in this continuous space with a small poison ratio.

\paragraph{HalfCheetah}

The reward in halfcheetah-medium lies between $-3$ and $9$, 
with the mean around $5$.
We again poisoned only 2\% of the transitions,
and set $\epsilon_1 = 4$ and $\epsilon_\infty = 5$.
Similarly to Hopper, Figure \ref{fig:halfcheetah_cql} shows our O2OP effectively created a return drop during the online fine-tuning,
while {\tt poison-uniform} is again nearly harmless to the victim at the same ratio and budget.
The offline stealthiness is evidenced once more as the four methods achieve similar offline returns.

\paragraph{Walker2d}

The walker2d-medium dataset has a reward range similar to that of halfcheetah-medium,
and we thus used identical settings.
As shown in Figure \ref{fig:walker2d_cql}, 
the poisoned offline return remains comparable to the clean offline return, i.e., stealthy. 
Although the online return seems less stable than in the previous experiments, 
O2OP managed to curtail the return from $3800$ to nearly $2500$ at its lowest, 
while the {\tt clean} and {\tt poison-uniform} baselines produce returns fluctuating between $3200$ and $4000$.



\section{Ablation Studies}

We further experiment with our method using different $\ell_p$ budgets, alternative victim algorithms, different model architectures, and under defense strategies.


\subsection{Impact of $\ell_p$ Budget}

We also tested different budgets $\epsilon_1$ and $\epsilon_\infty$ on Hopper.
As Figure \ref{fig:hopper_budget} shows, 
different budgets do not affect the offline return too much.
On the other hand, the amount of online performance drop does vary significantly with the budgets.
In general, a larger poison budget leads to a greater drop.

\begin{figure}[t]
\includegraphics[width=0.49\textwidth]{UAI_2024/figures/hopper_compare_budgets.png}
\caption{O2O return in offline phase (left) and online phase (right)  on Hopper with varying $\epsilon_1$ and $\epsilon_\infty$ budgets.}%
\label{fig:hopper_budget}%
\end{figure}


\subsection{Alternative Choice of Victims}\label{sec:more-offline-rl}

In addition, we apply our attack to different choices of offline victims. 
Specifically, we consider two critic-regularized offline algorithms, BRAC~\citep{wu2019behavior} and PBRL~\citep{bai2022PBRL}, both of which regularize the critic updates to avoid over-estimation, akin to CQL. The corresponding updates of BRAC and PBRL are listed below:
%
\begin{align}
\begin{split}\label{eq:brac-critic}
    & \Lcal^{\text{BRAC}}_Q(\psi, \Dcal) := \!\! \expunder{(s, a, r, s') \sim \Dcal} \! \left[ \left( Q_{\psi}(s, a) - y(r, s') \right)^2 \right] \!\!\!\!\! \\
    & y(r, s') \! := \! r \! + \! \gamma \!\!\expunder{a'\sim\pi_\theta(s')}
        \! [Q_{\bar{\psi}}(s', a') \! - \! \alpha D_{s'}(\pi_\theta|\pi_b)] \\
\end{split} \\ 
    & \Lcal^{\text{BRAC}}_\pi(\theta, \Dcal) \! := \! \expunder{s \sim \Dcal}\!\expunder{a\sim\pi_\theta(s)}[  \alpha D_{s}(\pi_\theta|\pi_b) \! - \! Q_\psi(s, a)   ] \label{eq:brac-actor} 
\end{align}
%
\begin{align}
\begin{split}\label{eq:pbrl-critic}
    & \Lcal^{\text{PBRL}}_Q(\psi, \Dcal) := \!\! \expunder{(s, a, r, s') \sim \Dcal} \! \left[ \left( Q_{\psi}(s, a) - y(r, s') \right)^2 \right] \!\!\!\!\! \\
    & + \expunder{s \sim \Dcal} \expunder{a \sim \pi_\theta} \left[ ( Q_{\bar{\psi}}(s, a) - \alpha \mathcal{E}_{\bar{\psi}}(s, a) - Q_{\psi}(s, a) )^2 \right]\\
    & y(r, s') \! := \! r \! + \! \gamma \!\!\expunder{a'\sim\pi_\theta(s')}
        \! [Q_{\bar{\psi}}(s', a') \! - \! \alpha \mathcal{E}_{\bar{\psi}}(s', a')] \\
\end{split} \\
    & \Lcal^{\text{PBRL}}_\pi(\theta, \Dcal) \! := - \expunder{s \sim \Dcal} \expunder{a\sim\pi_\theta(s)}[ Q_\psi(s, a)   ] \label{eq:pbrl-actor} 
\end{align}
%
where $D_{s}(\pi_\theta|\pi_b):= D(\pi_\theta(\cdot|s)|\pi_b(\cdot|s))$ is a (sample-based approximation of a) divergence between the learned policy $\pi_\theta$ and a reference/behavior policy $\pi_b$ (optionally learned by behavior cloning), and $\mathcal{E}_{\bar{\psi}}(s, a):= \mathrm{std}_i(Q_{\bar{\psi}}^{(i)}(s, a))$ is an uncertainty quantifier given by the standard deviation over an ensemble of $Q$-functions indexed by $i$.

\begin{figure}[!t]
\includegraphics[width=0.48\textwidth]{UAI_2024/figures/different_algorithms.pdf}
\caption{Ablation on different victim algorithms with Frozen Lake: we additionally test BRAC and PBRL as offline victim algorithms.}%
\label{fig:frozen_diff_algo}%
\vspace{1em}
\includegraphics[width=0.48\textwidth]{UAI_2024/figures/different_surrogate.pdf}
\caption{Ablation on different surrogate models with Frozen Lake: CQL remains the offline victim algorithm, but BRAC and PBRL are used as surrogate to learn $\delta_r$.}%
\label{fig:frozen_diff_surrogate}%
\end{figure}

We conducted additional experiments with BRAC+DQN and PBRL+DQN in Frozen Lake to validate the effectiveness of the attack beyond CQL as the victim (where BRAC or PBRL is used for both victim and surrogate). Figure~\ref{fig:frozen_diff_algo} shows that the proposed attack remains effective for different O2O RL choices.


\subsection{Alternative Choice of Surrogate}

To further test O2OP's effectiveness when the surrogate and victim models are different,
we now use BRAC and PBRL as surrogate models, and keep CQL as the victim. Figure~\ref{fig:frozen_diff_surrogate} shows that our O2OP remains effective with different surrogate models.

\begin{figure}[!t]
\includegraphics[width=0.48\textwidth]{UAI_2024/figures/different_network.png}
\caption{O2O return in offline phase (left) and online phase (right) for Frozen Lake when the surrogate model has a different network architecture.}%
\label{fig:frozen_diff_network}%
\vspace{1em}
\includegraphics[width=0.48\textwidth]{UAI_2024/figures/potential_defense.png}
\caption{O2O return in offline phase (left) and online phase (right) for Frozen Lake under potential defense.}%
\label{fig:frozen_defense}%
\end{figure}


\subsection{Impact of Network Architecture}

To further demonstrate the effectiveness of O2OP when the surrogate and victim models have different network architectures, we use the same victim architecture (two hidden layers of size 256 each) for the clean, O2OP-same-network, and O2OP-different-network experiments. 
% 
O2OP-same-network means the surrogate model has the same architecture as the victim model, while O2OP-different-network uses a network with layers of sizes $\{32, 64, 128\}$ to generate $\delta_r$. 
%
Figure~\ref{fig:frozen_diff_network} demonstrates that O2OP remains effective even with different surrogate model architectures.

\subsection{Assessing O2OP under Defense}

We next study how well our attack holds up against defense algorithms.
To this end, we added two simple defense strategies: (i) using a one-class SVM, an unsupervised outlier detection method, to filter and remedy the data; (ii) applying a uniform decrease, equal to the mean of the learned $\delta_r$, to all poisoned rewards. Note that the second defender is a ``strong'' one in the sense that it leverages knowledge (the mean of $\delta_r$) that is not typically available to defenders. Nonetheless, we observe that these defenses were not effective, as shown in Figure~\ref{fig:frozen_defense}.
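To make the second defense concrete, consider a small worked calculation, where the index $i$ over poisoned transitions and the symbols $\bar{\delta}_r$ and $\tilde{\delta}_{r,i}$ are notation introduced here for illustration. The defender replaces each poisoned reward $r^p_i + \delta_{r,i}$ by $r^p_i + \delta_{r,i} - \bar{\delta}_r$ with $\bar{\delta}_r := \frac{1}{\abr{\Dcal^p}} \sum_i \delta_{r,i}$, so the perturbation that remains is
\begin{align*}
\tilde{\delta}_{r,i} = \delta_{r,i} - \bar{\delta}_r,
\qquad
\sum_{i} \tilde{\delta}_{r,i} = 0.
\end{align*}
A defense of this form therefore cancels a constant perturbation exactly (as in {\tt poison-uniform}, where $\delta_{r,i} = \bar{\delta}_r$ for all $i$), but leaves the non-uniform part of a learned $\delta_r$ untouched.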



\section{Further Details}

\paragraph{Regularizer $\mathcal{R}^\text{CQL}$} For continuous action space, we follow the implementation of d3rlpy~\citep{d3rlpy}. For discrete action space, we first observe that $\mathcal{R}^\text{CQL}$ is equivalent to a cross-entropy loss (or negative log-likelihood):
% 
\begin{align}
    & \mathcal{R}^\text{CQL}(Q, \!\Dcal) \! := \!\!\!\! \expunder{(s, a)\sim\Dcal} \!\! \left[  \log \! \textstyle\sum_u \! \exp(Q(s, u)) \! - \! Q(s, a) \right] \!\! \\
    &= \! - \!\!\! \expunder{(s, a)\sim\Dcal} \!\! \left[ \log \! \frac{\exp Q(s, a)}{\sum_{u} \! \exp Q(s, u)} \! \right] 
    \! = \! - \!\!\! \expunder{(s, a)\sim\Dcal} [\log\!\pi_Q(a|s)]
\end{align}
%
We then use label smoothing with $\epsilon=0.1$ for a smoother regularization, as different actions $a$ may appear in the same state $s$, unlike in standard classification problems.
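Assuming the standard form of label smoothing (the exact variant is not specified here, and the symbol $\mathcal{R}^\text{CQL}_\epsilon$ is introduced only for illustration), the one-hot target on the dataset action is mixed with a uniform distribution over $\Acal$, which gives
\begin{align*}
\mathcal{R}^\text{CQL}_\epsilon(Q, \Dcal) := - \expunder{(s, a)\sim\Dcal} \Big[ & (1-\epsilon) \log \pi_Q(a|s) \\
& + \frac{\epsilon}{|\Acal|} \sum_{u \in \Acal} \log \pi_Q(u|s) \Big],
\end{align*}
with $\pi_Q(u|s) = \exp Q(s, u) / \sum_{u'} \exp Q(s, u')$ as above. Setting $\epsilon = 0$ recovers $\mathcal{R}^\text{CQL}$.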

\vspace{-0.5em}
\paragraph{Offline Dataset Collection} D4RL does not have a dataset for Frozen Lake. Instead, we collect an offline dataset ourselves by following a collection procedure similar to prior offline RL works \citep{kumar2019stabilizing, wu2019behavior}. We first train a near-optimal policy through online interaction and then use this policy to collect a certain number of trajectories in the environment. The collected dataset has $5$ trajectories with $195$ transitions and an average return of $1$.

\vspace{-0.5em}
\paragraph{IFT Optimizer} We use a community implementation of the IFT optimizer\footnote{Available at \url{https://github.com/money-shredder/iftopt}.} for our bi-level optimization.



\section{Conclusion}\label{sec:conclusion}

\paragraph{Summary} We proposed a novel reward poisoning method that reveals the vulnerability of O2O RL fine-tuning under a novel stealthiness notion: the impact occurs only during online fine-tuning, while the offline RL performance remains intact. Our approach leverages the distribution shift phenomenon during O2O transfer by promoting $Q$-function over-estimation for out-of-distribution actions through a bi-level optimization performed with the application of the implicit function theorem.

\paragraph{Limitation} Our work only tested critic-regularized offline RL methods (CQL, BRAC, and PBRL), as our method is motivated by over-estimating the $Q$-function to make those critic regularizations less effective. It remains unclear whether such vulnerability exists in other categories of O2O algorithms, such as actor regularization~\citep{nair2020awac}, replay distribution correction~\citep{lee2022offline}, or policy ensemble~\citep{zhang2022policy, wang2023train}.

\paragraph{Future Direction} To further extend our understanding of O2O RL, it is important to study the aforementioned non-critic-regularized methods, as each of these categories may present unique vulnerabilities and characteristics that differ from critic-regularized methods.

Additionally, exploring effective defense is vital for a robust O2O training pipeline. Future research could focus on developing resilient learning algorithms and enhancing data sanitization techniques to detect and remove perturbed data.



\paragraph{Societal Impact} 
% 
Our work focuses on understanding the vulnerability of RL algorithms, particularly in the context of O2O transfer. While we introduce a novel reward poisoning method to study vulnerabilities in RL fine-tuning, it is important to highlight that our research is conducted strictly within a controlled experimental setting and is intended purely for academic and scientific purposes.

The environments we use, Frozen Lake and MuJoCo locomotion tasks, are toy-level simulations. These simplified scenarios ensure that our research remains theoretical and cannot be misused by third parties to cause real-world harm. Our intention is to identify weaknesses in RL systems to help develop more resilient and secure algorithms.


By exposing and analyzing these vulnerabilities, we aim to contribute to the broader field of RL safety and robustness, ultimately leading to stronger and more reliable RL fine-tuning models. This, in turn, can enhance the safety and performance of RL applications in various domains.

Our work does not support or encourage the malicious use of reward poisoning techniques. Instead, our findings are intended to serve as a foundation for developing effective defense strategies against such attacks. By sharing our insights with the research community, we hope to foster a collaborative effort towards mitigating the risks associated with adversarial attacks in RL.

Overall, our work is designed to advance the field of RL in a positive and constructive manner, with the ultimate aim of creating safer and more robust RL systems that can benefit society as a whole.



\paragraph{Acknowledgement}

We thank the reviewers and the meta-reviewer for assessing our paper and for their constructive
feedback. 
This work is supported by NSF grant RI:1910146 and NIH
grant R01CA258827.

\newpage
\bibliography{paper}
\bibliographystyle{uai2024}

\end{document}
