%\documentclass{uai2023} % for initial submission
\documentclass[accepted]{uai2023} % after acceptance, for a revised
                                  
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2023} % ptmx math instead of Computer
                                         % Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2023} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
%\usepackage{tikz} % nice language for creating drawings and diagrams

\usepackage{soul}
\usepackage{url}
%\usepackage[hidelinks]{hyperref}
%\usepackage[utf8]{inputenc}
%\usepackage[small]{caption}
\usepackage{graphicx}
\usepackage{amsmath}
\usepackage{amsthm}
\usepackage{algorithm}
%\usepackage{algorithmic}
\usepackage{algpseudocode}
\usepackage{color}
\usepackage{amsfonts,amssymb}
\usepackage{epstopdf}
\usepackage{stfloats}
\usepackage{diagbox}
\usepackage{multirow}
\urlstyle{same}
\usepackage[switch]{lineno}

\usepackage{xspace}
\newcommand{\dor}{DOREA\xspace}
\newcommand{\dora}{DOREA agent\xspace}
\usepackage{makecell}
\usepackage{ulem}

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

\title{An Effective Negotiating Agent Framework based on Deep Offline Reinforcement Learning}

% The standard author block has changed for UAI 2023 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is authomatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{Siqi Chen}
\author[1]{Jianing Zhao}
\author[2]{Gerhard Weiss}
\author[1]{Ran Su}
\author[3]{Kaiyou Lei\thanks{Corresponding author, Kaiyou Lei <kylei2022@163.com>}}

% Add affiliations after the authors
\affil[1]{%
    College of Intelligence and Computing\\
    Tianjin University\\
    Tianjin, China
}
\affil[2]{%
    Department of Advanced Computing Sciences\\ 
    Maastricht University\\
    Maastricht, the Netherlands
}
\affil[3]{%
    %School of Computer and Information Science\\ 
    College of Computer and Information Science\\
    Southwest University\\
    Chongqing, China
  }
  
\begin{document}
\maketitle

\begin{abstract}
Learning is crucial for automated negotiation, and recent years have witnessed remarkable achievements in the application of reinforcement learning (RL) to various negotiation tasks. Conventional RL methods generally focus on learning from active interactions with opposing negotiators. However, collecting online data is expensive in many realistic negotiation scenarios. While previous studies partially mitigate this problem through the use of opponent simulators (i.e., agents following known strategies), in reality it is usually hard to fully capture an opponent's negotiation strategy. A further challenge lies in an agent's ability to adapt to dynamic variations in an opponent's preferences or strategies, which may occur from time to time for different reasons in subsequent negotiations. In response to these challenges, this article proposes a novel Deep Offline Reinforcement learning Negotiating Agent framework that learns an effective strategy from previously collected negotiation datasets without requiring interaction with an opponent. This is in contrast to existing RL-based negotiation approaches, which all rely on active interaction with opponents. Furthermore, a strategy fine-tuning mechanism is included to adjust the learned strategy in response to changes in the opponent's preferences or strategy. The performance of the proposed framework is evaluated against a diverse set of state-of-the-art baselines under different settings. Experimental results show that the framework learns effective strategies exclusively from offline datasets and is also capable of effectively adapting to changes in an opponent's preferences or strategy.
\end{abstract}

\section{Introduction}
\label{intro}

Negotiation is a process in which parties with different interests exchange offers to explore the possibility of achieving mutual benefit, resolving
conflicts, or finding mutually acceptable solutions. As such, negotiation can serve as a fundamental and powerful mechanism for managing conflicts~\citep{jennings01}.
This mechanism, however, can be time-consuming and costly for humans~\citep{fatima:an}.
Automated negotiation~\citep{chen201501,chen202202} has therefore become a subject of central interest in multi-agent systems over the past decade due to its advantages over non-computerized negotiation, such as reducing the effort required of human negotiators and reaching better outcomes by compensating for the limitations of human computational and reasoning abilities.

%Learning is critical for automated negotiation. 
Reinforcement learning (RL) is a powerful learning paradigm for control tasks.
Specifically, RL can be utilized to automatically acquire near-optimal behavioral skills (represented by policies) for given tasks.
The successful application of RL algorithms in diverse fields (e.g., natural language
processing, computer vision and complex games~\citep{Silver2017,Devlin2019}) has also led to the exploration of RL in automated negotiation~\citep{bakker2019,ijcai2020-42,CHANG2021,SenguptaAdaptive,chen202102,YangCN20,aaai2023,chen202302}. 
Despite the remarkable progress that has been achieved so far, conventional RL methods for negotiation typically focus on online learning from active interactions with the environment (i.e., everything in the negotiation scenario including the opponent and the domain) to iteratively collect data to be used for policy improvement. 
However, this kind of online learning is of limited value and often impractical for negotiation, mainly because data collection based on online interactions is very expensive.
For example, training an RL agent from scratch in an e-commerce scenario against a negotiation partner is likely to lead to a large number of unacceptable results and a low-quality customer experience. 
While previous approaches partially mitigate this problem by using opponent simulators (i.e., agents applying known strategies) for training, in realistic settings it is usually hard to fully capture an opponent's negotiation strategy due to uncertain user states and actions, noisy environments, and the fact that negotiators aim to hide information related to their strategies in order to hamper exploitation by their opponents.

Due to the limited value of online RL for automated negotiation, a key question is whether data collected during previous negotiation sessions can be effectively utilized by an agent to learn its negotiation skills. 
In particular, can an agent do so and, moreover, adapt its negotiation strategy when the opponent's preferences or strategy change for various reasons (e.g., different social motives of users, market demand)? 
Therefore, a novel Deep Offline Reinforcement learning Negotiating Agent (DOREA) framework is proposed, which can learn an effective strategy from offline datasets of previously collected negotiation experiences. 
The \dor framework also enables an agent to fine-tune a learned strategy and to adapt it to changes in the opponent's preferences or strategy. %in subsequent negotiations
%\footnote{In online negotiation, the agent can freely interact with the other agent following its strategy, which is in contrast to what the agent does learning in offline datasets.}.
%\footnote{Note that online negotiation in this context refers to the negotiation that an agent conducts with the other party after offline RL training (i.e., the agent can have interactions with the opponent), and offline negotiation data means negotiation history between two parties, which is used for offline RL learning in \dor framework.}. 

%may lead to severe bootstrap errors, which destroys the good initial policy obtained via offline RL.

% In contrast, negotiation data between parties is easily obtainable. 
% Thus, a question naturally arises -- can we exclusively leverage the previously collected negotiation data to learn the strategy (or RL policy in this context) like data-driven learning paradigms. 
% Moreover, how can the learned strategy adapt to the situations when the preferences change due to various reasons (e.g., different social motives of users, market demand)?  
% In response to this question, a novel deep offline reinforcement learning negotiating agent framework is proposed, which can learn from datasets of previously collected negotiation experiences. 
% This framework does also enable an agent to fine-tune a learned policy and to adapt it to state-action distribution shifts induced by changes in either preferences or opponent strategy may lead to severe bootstrap errors, which destroys the good initial policy obtained via offline RL.

%In this paper, we observe that state-action distribution shift may lead to severe bootstrap
%error during fine-tuning, which destroys the good initial policy obtained via offline RL.
%This paradigm can be extremely valuable in settings where online interaction is impractical, either because data collection is expensive or dangerous (e.g., in robotics [10], education [11], healthcare [12], and autonomous driving [13]).

The remainder of this paper is structured as follows.
Section~\ref{related} overviews important related work.
Section~\ref{pre} provides the reader with background knowledge that is relevant for the remaining sections.
The technicalities of the DOREA framework are presented in Section~\ref{appro}.
An in-depth analysis is given in Section~\ref{exps}.
Lastly, Section~\ref{concs} concludes and identifies interesting future research directions.


\section{Related Work}
\label{related}

Recently, RL-based negotiating agents have attracted considerable research attention~\citep{chen202101,chen202203}.
For example, \citet{bakker2019} propose an RL framework (RLBOA) built on the BOA architecture for automated negotiation. 
Tabular Q-learning is used to train the bidding strategy. 
To have a compact state representation, RLBOA maps the offers to the utility space and discretizes the utility space into a number of equal bins. 
A problem with such discretization is that it can lead to loss of information conveyed in the offers, e.g., the state/action domain structure.
Moreover, the Q-learning approach suffers from large state spaces and from overestimation of Q-values.
\citet{ijcai2020-42} pre-train a negotiation strategy through supervised learning (SL) with synthetic data in order to accelerate the learning process. Initialized with the learned SL strategy, the negotiation agent is then evolved with additional negotiation experience using a model-free deep RL method called Deep Deterministic Policy Gradient (DDPG) \citep{DBLP:TS15}. 
A limitation of this approach is that it only addresses single-issue negotiations.
\citet{chen202102} considered negotiation scenarios in which the opponent may change its strategy at times. 
They proposed a negotiating agent based on Bayesian policy reuse to detect an opponent strategy and respond with the best learned RL policy from existing policies.
\citet{aaai2023} proposed a reward-based negotiating agent strategy through a multi-issue policy network. 
The policy network was trained to predict the optimal policy in policy-based RL without incorporating utility functions. 
% compare ijcai 2022 below
%The most related work to ours is done by 
%However, less research attention is given to historical data to learn a negotiation strategy.

Although the existing work has advanced the field of automated negotiation, it still suffers from one common limitation, that is, the requirement for a large number of online interactions with the environment in order to train the policy.
%In automated negotiation, few attempts have been made to use historical data to create a negotiation strategy
The work most closely related to ours is \citet{Sengupta22}. 
There, a negotiation framework is proposed that trains a base model for its bidding strategy from negotiation history. 
A binary classifier detects changes in utility functions, and an adapted model then adjusts to such changes automatically by applying a parameter-sharing-based transfer learning technique to datasets newly collected during negotiation. 
A drawback of this framework is that the opponent must keep its strategy fixed all the time, otherwise both the classifier and adapted model will be ineffective (as they are trained by the negotiation traces produced by the opponent strategy).
%there is a fixed opponent strategy 
%in specific domains and  
%where opponent strategies are assumed to be fixed.
%assuming opponent strategies are fixed. 
%Besides, a mechanism is provided to automatically adapt to such changes by updating the bidding strategy.
In contrast, our approach is considerably broader in its applicable range of negotiation tasks because it is not restricted to any particular opponent strategy, utility function, or quality of previously collected datasets (as we will show later in Section~\ref{exp_influence}, even a dataset collected by a simple strategy can produce a negotiating agent based on the \dor framework whose performance is still acceptable). 

%considerably more generous in negotiation tasks, i.e., not limited to any opponent strategy or utility or domains.

%\subsection{Offline Reinforcement Learning}
Offline RL is a new RL paradigm concerned with learning exclusively from datasets of previously-collected experiences \citep{levine2020}.
This learning pattern is very valuable in environments where online interaction is impractical or expensive, and has achieved remarkable successes in robotics \citep{chen2022lapo,yu2021conservative}, autonomous driving \citep{tennenholtz2022uncertainty}, healthcare \citep{fatemi2021medical}, and other fields \citep{Prudencio2022,chen202201,chen202204}.  
Although recently much research effort has been devoted to learning useful negotiation strategies with RL, to the best of our knowledge, our work is the first attempt to use (1) offline RL for learning an effective negotiation strategy and (2) offline-to-online techniques for fine-tuning the learned strategy and adapting it to changes of opponent preferences or strategies. 

%under changing utility functions.
% After learning the offline policies, people can still choose to adjust the policies online. 
% The additional advantage is that their initial policies may be safer and cheaper to interact with the environment than the initial random policies.
% Distribution shift is one challenge of offline RL.
% Rencent model-free based methods attempted to solve this problem by constraining the learned policy to be close to the behavior policy via implicit or explict divergence regularition.
% For example,
%  Batch Constrained Q-learning (BCQ)~\cite{fujimoto2019off}directly constraning the learned policy to the behavior policy used to collect the dataset.
%  Conservative Q-Learning (CQL)~\cite{kumar2020conservative} proposed a complimentary approach to reducing the harmful effect of out-of-distribution Q-values by learning a conservative Q-function and explicitly punishing Q-values of actions not seen in the dataset.
% ~\cite{fujimoto2021minimalist} introduces TD3+BC (Twin Delayed Deep Deterministic policy gradient + Behavior Cloning), a minimalist algorithm that does not even contain a model of the behavioral policy. 
% Inspite of its simplicity, it surprisingly achieves state of the art performance on benchmark tasks.
% ~\cite{niu2022trust}proposed a Dynamics-Aware Hybrid Offlineand-Online Reinforcement Learning (H2O) framework. 
% This framework uses offline and online reinforcement learning for training at the same time. 
% The framework combines limited real data in offline RL and unrestricted exploration through imperfect simulators in online RL to address the drawbacks of both approaches.
% Model-based RL algorithms provide another solution to offline RL.
% ~\cite{kidambi2020morel} propose a method named Model-based Offline Reinforcement Learning (MOReL), which measures their model’s epistemic uncertainty through an ensemble of dynamics models. 
% ~\cite{yu2020mopo} propose a method named MOPO which add an additional reward penalty on generated transitions with large variance from the learned dynamic model. 


\section{Preliminaries}
\label{pre}

\subsection{Negotiation Settings}
\label{negosettings}  

This work adopts a bilateral multi-issue negotiation environment widely used in the automated negotiation field \citep[e.g.,][]{chen201302,chen201401,chen201502,SenguptaAdaptive,chen202102}. 
A negotiation scenario consists of a domain description and preference profiles of both parties.
%Both parties have private preference over multi issues under negotiation.
The preference profiles of a domain determine the utility functions (as shown below).
%constitute the utility space~$\mathcal{U}$. 
Let $I$ be the set of negotiation agents, with $i$ representing a specific agent ($i \in \{o,s\}$, where $s$ refers to the agent itself and $o$ to its opponent). 
$J$ is the set of issues under negotiation, with $j$ being a particular issue ($j \in \{1, \ldots, n\}$, where $n$ is the number of issues).
%Participants aim at reaching an agreement by a given deadline referred to as $T_{\textit{max}}$.
%~\citep{ito2011new}. 
%During the negotiation, two parties send alternating offers to each other until both sides agree on an offer together, or a deadline is reached \citep{ito2011new}.
%An offer is thereby a vector of values, with one value for each issue.
The utility function of agent $i$ maps any negotiation outcome $\omega$ from the outcome space $\Omega$ to a real-valued number in the range $[0,1]$, and is defined as:

\begin{equation}
U_i(\omega)=\sum_{j=1}^n(w_j^i \cdot V_j^i(v_{jk})) \quad
\end{equation}

where $\omega$ is an outcome represented as a vector of values, with one value for each issue; $v_{jk}$ is the $k$-th possible choice of issue $j$; $V_j^i$ is the evaluation function of agent $i$ for issue $j$, which maps a choice of issue $j$ (e.g., $v_{jk}$) to a real number in the interval $[0,1]$; and $w_j^i$ ($j \in \{1, \ldots, n\}$) is the weight that agent $i$ ascribes to issue $j$.
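
As a purely illustrative instance of this definition (the numbers are hypothetical and not taken from any domain used in this work), consider $n=2$ issues with weights $w_1^i=0.6$ and $w_2^i=0.4$, assumed here to sum to one, and an outcome $\omega$ whose chosen values are evaluated as $V_1^i=0.5$ and $V_2^i=1.0$. The utility of $\omega$ for agent $i$ would then be
\begin{equation*}
U_i(\omega) = 0.6 \cdot 0.5 + 0.4 \cdot 1.0 = 0.7 .
\end{equation*}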

During negotiation, both parties exchange offers in each round to express their demands, relying on the stacked alternating offers protocol \citep{soap2017}.
%Specifically, each agent in turn makes an offer $\omega$, to the other agent who may then accept or reject the proposal. 
%This process continues until one agent accepts a proposal or the deadline expires.
%If the negotiation fails, each agent $a_i$ receives a fixed utility value $u_r\in\mathbb{R}$ (a.k.a the reservation value that can be expected in the case of negotiation failure).

%Each agent $a_i$ has a utility function $u_i$ that is assigned to each other $\omega\in\Omega$ a utility value $u_1(\omega)$ and $u_2(\omega)$ corresponding to this offer.



\subsection{Reinforcement Learning}
%TODO:The formula and description are different
We follow the standard protocol that formulates an RL environment as a Markov decision process (MDP), that is, $M=(\mathcal{S},\mathcal{A},\mathbb{P},r,\gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $\mathbb{P}:\mathcal{S}\times\mathcal{A} \to \mathcal{S}$ is the transition function, $r(s,a)$ is the reward function, and $\gamma\in[0,1)$ is the discount factor. 
A policy is a distribution $\pi(a \mid s)$, which denotes the probability of taking action $a$ conditioned on the current state $s$. 
The objective of the RL agent is to find a policy that maximizes the expected return $\mathbb{E}_\pi[ {\textstyle \sum_{t=0}^{\infty}}\gamma^t r_t ]$. 
Everything in the negotiation scenario including the opponent is considered as the environment. 

%Next, we consider each component (states, actions, rewards) of the MDP to model the negotiation problem.
%For our offline RL, everything in the negotiation scenario including the opponent is considered as the environment. 
%Let us denote the state and action in an environment as $s_t$ and $a_t$ respectively. The state consists of only the information about the offers and the action determines what utility value to bid next. For a negotiation session with timelimit $t_{max}$, we defined our state space and action space as

\textbf{States}.
As negotiation domains vary significantly in structure (e.g., the number of issues, the issue types, and the size of the outcome space), states must be described in a domain-independent way.
Following the ideas presented in \citet{chen202102,chen202301}, this work employs a similar approach by representing an outcome $\omega$ as $U_s(\omega)$ ($U_s$ is the utility function of the negotiating agent).
Specifically, two factors are taken into account.
First, the timeline, which is relevant because negotiation fails if no agreement can be achieved before the deadline ($T_{\textit{max}}$).
Second, the offer trajectory, which is crucial because it has a strong impact on the agent's decision-making. 
Therefore, the state $s$ at time $t$ is defined as follows:

\begin{align}  
\setlength{\abovedisplayskip}{3pt}   \setlength{\belowdisplayskip}{3pt}
\begin{split}         
     s_{t}=& \bigg(\frac{t}{T_{\max }}, u_{s}(\omega_o^{t-3}), u_{s}(\omega_s^{t-3}),  \\
     &  u_{s}(\omega_o^{t-2}), u_{s}(\omega_s^{t-2}), u_{s}(\omega_o^{t-1}), u_{s}(\omega_s^{t-1})\bigg)
\end{split}  
\end{align}  
where $T_{\max}$ denotes the maximum number of rounds of a negotiation session,
$\omega_o^{t-n}$ denotes the offer received from the opponent at step $t-n$, $\omega_s^{t-n}$ denotes the offer proposed by the \dor 
 agent, and $u_s$ denotes the self utility function.
Note that including more pairs of $ (\omega_s^{t-n} $, $ \omega_o^{t-n} )$ (i.e., $n>3$) could improve the effectiveness of the agent, but at the cost of considerably more computational resources and time. In practice, the current choice runs efficiently and makes no significant difference compared to the cases in which $n=5$ or $n=7$ is adopted.
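
For illustration only, suppose $T_{\max}=100$ and $t=20$, and that the three most recent opponent offers have (hypothetical) utilities $0.35$, $0.40$, and $0.45$ for the agent, listed from oldest to newest, while the agent's own three most recent offers have utilities $0.95$, $0.92$, and $0.90$. Under the definition above, the state would read
\begin{equation*}
s_{20} = (0.2,\, 0.35,\, 0.95,\, 0.40,\, 0.92,\, 0.45,\, 0.90),
\end{equation*}
which has the same dimension regardless of the number of issues or the size of the outcome space of the underlying domain.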

\textbf{Actions}.
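The set of actions at a given state consists of all possible target utility values in the range $[u_r, 1]$, where $u_r$ denotes the agent's reservation value (i.e., the utility obtained if the negotiation fails).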
So, the action at time $t$ is defined as $a_t=u^t_{s}$ (where $ u^t_{s} $ denotes the utility of the next offer).
To generate the offer corresponding to the utility value $u_s^t$, we define an inverse utility function $ \mathcal{F}^{-1} $ that maps a real-valued number $ u $ to an outcome $ \omega $ by
selecting, among the outcomes at the given utility, the one that maximizes the estimated opponent utility.
Formally, the inverse utility function is defined as

\begin{align}
 \setlength{\abovedisplayskip}{3pt}   \setlength{\belowdisplayskip}{3pt}
		\begin{split}
			\mathcal{F}^{-1}\left( u_s^t \right) &= \mathop{\arg\max}\limits_{\omega} U_o^{'}\left( \omega\right) 
		\end{split}
\end{align}
where $ U_o^{'} $ denotes the opponent's utility function estimated on the basis of issue frequency of the opponent's historical offers, following the approach of \citet{van2012agent}.

% In this way, the original legal proposal will be lost. 
% We also need a function $\mathcal{F}:\mathcal{U}\rightarrow\Omega$ to map utility $u$ into legal proposal $\omega$, where $\Omega$ denotes the proposal space.
% \begin{equation}  
% \mathcal{F}\left(u_{s}\right)  & =\underset{\omega}{\arg \max } U_{o}^{\prime}(\omega), \text { where } u_{s}  & \leq U_{s}(\omega) \leq u_{s}+\Delta_{u}  \end{equation}

\textbf{Rewards}.
The agent is given a positive reward when an agreement is reached, and a penalty of $-1$ when no agreement is reached before the deadline. 
%delete?
The RL agent's acceptance strategy is simple: if the opponent's offer is better than the agent's intended next offer, the agent accepts it; otherwise, the agent rejects it.
Formally, the reward function is defined as follows:
\begin{small} 
\begin{equation}     
r_{t+1}\left(s_{t}, a_{t}\right)=\begin{cases}
U_{s}(\omega), & \text{if there is an agreement } \omega, \\ 
-1, & \text{if no agreement is reached by the deadline,} \\ 
0, & \text{otherwise.} \end{cases}
\end{equation} 
\end{small}
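
To illustrate how this sparse reward interacts with the discount factor, consider a hypothetical session with three decision steps and $\gamma=0.9$, where the first reward is weighted by $\gamma^0$ (an indexing convention assumed only for this example). If an agreement $\omega$ with $U_s(\omega)=0.8$ is reached at the third step, the discounted return is
\begin{equation*}
0 + 0.9 \cdot 0 + 0.9^{2} \cdot 0.8 = 0.648,
\end{equation*}
whereas a session that fails at the third step yields $0.9^{2} \cdot (-1) = -0.81$. Under these assumed values, earlier agreements of equal utility receive a higher return, and a failed negotiation is always worse than any agreement.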

Actor-critic approaches can provide an effective way to optimize the RL objective. 
In the conventional actor-critic formalism \citep{barto1983neuronlike,sutton2018reinforcement}, an approximated Q-function $Q_\theta$ is learned by minimizing the squared Bellman error (referred to as policy evaluation), and the policy $\pi_\phi$ is optimized by maximizing the Q-function (referred to as policy improvement).
The Q-function $Q_\theta(s,a)$ is an estimate of how good it is to take action $a$ in state $s$.
The above objectives are as follows:
\begin{small}
\begin{align} 
\begin{split}
\setlength{\abovedisplayskip}{3pt}   \setlength{\belowdisplayskip}{3pt}
Q(\theta)&=\arg \min _{Q} \mathbb{E}_{\bigl(\mathbf{s}, \mathbf{a},\mathbf{s}^{\prime}\bigr) \sim\mathcal{D}}\biggl[\Bigl(Q(\mathbf{s}, \mathbf{a})  \\   &-\Bigl(r(\mathbf{s}, \mathbf{a})+\gamma \mathbb{E}_{\mathbf{a}^{\prime} \sim \pi_\phi(\mathbf{a}^{\prime} \mid \mathbf{s}^{\prime})}      \bigl[Q_\theta(\mathbf{s}^{\prime},\mathbf{a}^{\prime})\bigr]\Bigr)\Bigr)^{2}\biggr]  \label{eq:5}
\end{split} 
\end{align} 
\end{small} 

\begin{align} 
\begin{split} 
\setlength{\abovedisplayskip}{3pt}   \setlength{\belowdisplayskip}{3pt}
    & \pi_\phi = \arg \max _{\pi} \mathbb{E}_{\mathbf{s} \sim \mathcal{D}}\left[\mathbb{E}_{\mathbf{a} \sim \pi_\phi(\mathbf{a} \mid \mathbf{s})}\left[Q_\theta(\mathbf{s}, \mathbf{a})\right]\right] 
    \label{eq:6)}
\end{split} 
\end{align} 

where $\mathcal{D}$ can either be the replay buffer $\mathcal{B}$ generated by a previous policy $\pi_\phi$ through online environment interactions, or a fixed dataset $\mathcal{D}=\{(s_t^i,a_t^i,s_{t+1}^i,r_t^i)\}_{i=1}^n$, as is common in the offline RL setting.

\section{DOREA Framework}
\label{appro}

\begin{figure}[ht]  \centering 
%\includegraphics[width=0.5\textwidth]
\includegraphics[width=3.2in, height=2.8in]{picture/framework.eps}  
\caption{Overview of the proposed Deep Offline Reinforcement learning Negotiating Agent (DOREA) framework.}  
\label{Fig:1}   
\end{figure}

The DOREA framework consists of two key components: an offline learning based strategy module and a strategy fine-tuning mechanism.
Figure~\ref{Fig:1} provides an overview of the framework.


\subsection{Offline Learning Based Strategy Module}
\label{offline module}

The offline learning based strategy module 
comprises two steps.
First, a negotiation history denoted by $\mathcal{H}$ is collected. This history consists of previous negotiation traces between two parties, including the negotiation scenario, the exchanged offers between them, time stamp of each offer, both sides' preferences (utility functions), and the negotiation results (e.g., agreement/failure).
Each party follows a negotiation strategy, and $\mathcal{H}$ may be produced by a mixture of multiple strategies.
These data $h$ ($h \in \mathcal{H}$) are converted into transitions of RL (i.e., $h = (s,a,r,s') $) through a pre-processing procedure
(e.g., mapping all offers to utility values $r$, generating corresponding action $a$ and state $s$), 
and then saved as offline data $\mathcal{D}_{off}$.

Second, the module aims to learn an effective negotiation strategy from historical datasets $\mathcal{D}_{off}$ via offline RL.
However, the collected negotiation datasets may be suboptimal (e.g., absent data or data of non-expert quality), and their coverage of the state and action space is limited. This may result in a distribution shift (that is, the offline RL agent encounters online data $\mathcal{D}_{on}$ whose state-action distribution differs from that of the offline data $\mathcal{D}_{off}$), which causes classic off-policy RL algorithms to overestimate the Q-values of out-of-distribution (OOD) actions.
Consequently, the learned negotiation strategy might choose potentially inappropriate actions.
Therefore, this framework employs Conservative Q-learning (CQL) \citep{kumar2020conservative}, which can reduce the harmful effect of a distribution shift by explicitly penalizing the Q-values of actions not available in the offline dataset $\mathcal{D}_{off}$.
CQL pessimistically evaluates the current policy and obtains a lower bound of the true Q-function. 
It trains the Q-function by minimizing the sum of the standard temporal-difference (TD) error and a regularizer (see Eq.~\ref{eq:9}). 
The regularizer minimizes the expected Q-value under a sampling distribution that emphasizes overestimated actions, while maximizing the expected Q-value on the offline dataset.
CQL can be instantiated as an actor-critic algorithm like SAC \citep{haarnoja2018soft}.
SAC is an off-policy algorithm designed to optimize a stochastic policy, whose objective is to maximize both the expected return and the entropy of the policy:

\begin{equation}
\setlength{\abovedisplayskip}{3pt}   \setlength{\belowdisplayskip}{3pt}
\pi_\phi=\underset{\pi}{\arg\max} \sum_{t=0}^{T} \mathbb{E}_{s,a\sim\pi}\gamma^tr_t(s,a)+\alpha\mathbb{H}(\pi(\cdot \mid s)) 
\label{eq:7}
\end{equation} 

where $\mathbb{H}$ is the entropy, $\alpha>0$ is the temperature parameter, $\gamma$ is the discount factor, and $r_t$ is the reward function at time step $t$.
The corresponding Q-function $Q^{\pi}(s, a)$ can be expressed as:

\begin{equation}  
\setlength{\abovedisplayskip}{3pt}   \setlength{\belowdisplayskip}{3pt}
Q_\theta(s, a)\!=\!\underset{s, a \sim \pi}{\mathbb{E}}\left[\sum_{t=0}^{\infty} \gamma^{t} r\!\left(s, a\right)\!+\!\alpha\! \sum_{t=1}^{\infty} \gamma^{t} \mathbb{H}\!\left(\pi\left(\cdot \mid s\right)\right)\,\middle|\,s, a\right]
\end{equation}

Here, a variant of CQL, namely CQL($\mathcal{H}$), is chosen because it generally outperforms other variants \citep{kumar2020conservative}.
In order to mitigate the impact of distribution shift more effectively, multiple ($N$) pessimistic Q-functions are employed.
Each policy evaluation step $Q(\theta_i)$ ($i \in \{1, \ldots, N\}$, where $\theta_i$ denotes the parameters of the $i$-th Q-function) 
solves the following minimization problem:

\begin{align}
\begin{split}
\setlength{\abovedisplayskip}{3pt}   \setlength{\belowdisplayskip}{3pt}
Q(\theta_i)\!=\!&\min _{Q} \!\alpha \mathbb{E}_{\mathbf{s} \sim \mathcal{D}_{off}}\!\underbrace{\!\bigg[\!\log \!\sum_{\mathbf{a}} \!\exp Q(\mathbf{s},\! \mathbf{a}) \!-\!\mathbb{E}_{\mathbf{a} \sim \hat{\pi}_{\beta}(\mathbf{a} \mid \mathbf{s})}[Q(\mathbf{s},\! \mathbf{a})]\!\bigg]\!}_{\text {CQL regularizer }}\\
&+\frac{1}{2} \underbrace{\mathbb{E}_{\mathbf{s}, \mathbf{a}, \mathbf{s}^{\prime} \sim \mathcal{D}_{off}}\left[\left(Q_{\theta_i}\!-\!\boldsymbol{B}^{\pi_{\phi_i}} Q_{\bar{\theta}_i}\right)^{2}\right]}_{\text {standard TD error}}
\label{eq:9}
\end{split}
\end{align}

where $\hat{\pi}_{\beta}(a_{0} \mid s_{0}):=\frac{\sum_{s, a \in \mathcal{D}_{off}} \mathit{1} \left[s=s_{0}, a=a_{0}\right]}{\sum_{s \in \mathcal{D}_{off}} \mathit{1}\left[s=s_{0}\right]}$ is the empirical behavior strategy, $\alpha$ is the trade-off factor, $\bar{\theta}_i$ is the delayed parameter, and $\boldsymbol{B}^{\pi_{\phi_i}}$ is the Bellman operator, which constitutes the Bellman error in the second term of Eq.~\eqref{eq:9}.
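To see why the first term of Eq.~\eqref{eq:9} acts as a conservative penalty, the following short check may help; it relies only on a standard property of the log-sum-exp function and is added for illustration rather than taken from the derivation of this paper. For every state $\mathbf{s}$,
\begin{align*}
\log \sum_{\mathbf{a}} \exp Q(\mathbf{s}, \mathbf{a}) &\geq \max_{\mathbf{a}} Q(\mathbf{s}, \mathbf{a}) \\
&\geq \mathbb{E}_{\mathbf{a} \sim \hat{\pi}_{\beta}(\mathbf{a} \mid \mathbf{s})}\left[Q(\mathbf{s}, \mathbf{a})\right],
\end{align*}
so the regularizer is non-negative. Minimizing it pushes down the Q-values across all actions, most strongly those with the largest values, while pushing up the Q-values of actions that appear in $\mathcal{D}_{off}$; the net effect is to penalize actions that are rare or absent in the dataset.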
The policy improvement step $\pi(\phi_i)$ is the same as that of SAC, defined in Eq.~\eqref{eq:7}.
The learned strategy is represented by an ensemble of $N$ CQL-based Q-functions and policies that are trained via the update rules in Eq.~\eqref{eq:7} and Eq.~\eqref{eq:9}; it is expressed as $\{Q_{\theta_i},\pi_{\phi_i}\}_{i=1}^N$, where $\theta_i$ and $\phi_i$ represent the parameters of the $i$-th Q-function and policy, respectively.
The corresponding Q-function and policy are described as follows:

\begin{small}
\begin{align} 
\begin{split}
\setlength{\abovedisplayskip}{3pt}   \setlength{\belowdisplayskip}{3pt}
Q_{\theta}\!:=\!\frac{1}{N}& \sum_{i=1}^{N} Q_{\theta_{i}}, \\      
\pi_{\phi}(\cdot \!\mid\! s)\!=\!\mathcal{N}\bigg(\!\frac{1}{N}\!\sum_{i=1}^{N} \mu_{\phi_{i}}(s),&\frac{1}{N} \!\sum_{i=1}^{N}\left(\sigma_{\phi_{i}}^{2}(s)\!+\!\mu_{\phi_{i}}^{2}(s)\right)\!-\!\mu_{\phi}^{2}(s)\!\bigg) 
\end{split} 
\end{align} 
\end{small}
where $\theta:=\{\theta_i\}_{i=1}^N$, $\phi:=\{\phi_i\}_{i=1}^N$, and $\mu_{\phi}(s):=\frac{1}{N}\sum_{i=1}^{N}\mu_{\phi_i}(s)$ is the ensemble mean.
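
As a brief check of the variance term of $\pi_{\phi}$, assume that an action is drawn by first picking one of the $N$ member policies uniformly at random and then sampling from its Gaussian $\mathcal{N}(\mu_{\phi_i}(s), \sigma_{\phi_i}^{2}(s))$; this mixture view is our reading of the ensemble and is not spelled out in the text. With $\mu_{\phi}(s)=\frac{1}{N}\sum_{i=1}^{N}\mu_{\phi_i}(s)$ the ensemble mean, the variance of this mixture is
\begin{align*}
\operatorname{Var}[a \mid s] &= \mathbb{E}\left[a^{2} \mid s\right]-\left(\mathbb{E}[a \mid s]\right)^{2} \\
&= \frac{1}{N} \sum_{i=1}^{N}\left(\sigma_{\phi_{i}}^{2}(s)+\mu_{\phi_{i}}^{2}(s)\right)-\mu_{\phi}^{2}(s),
\end{align*}
which coincides with the second argument of the Gaussian above. Under this assumption, $\pi_{\phi}$ is the single Gaussian that matches the first two moments of the uniform mixture of the member policies.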


\subsection{Strategy Fine-tuning Mechanism}

Having obtained a strategy via the offline learning based strategy module, the DOREA framework employs strategy fine-tuning to adjust its strategy when there is a change in the opponent's preferences or strategy in subsequent online negotiation.
To effectively adapt to changes in the opponent, inspired by the work of \citet{lee22d}, the strategy fine-tuning mechanism aims at safely utilizing online
samples and mitigating the distribution shift more effectively.


Because of the distribution shift problem, a good initial offline strategy may be destroyed quickly if these online data are used directly with off-policy RL algorithms.
It is thus necessary to utilize offline and online data effectively to fine-tune the strategy.
As such, a prioritized sampling scheme component called balanced experience replay is used.
This component utilizes online data together with offline data that are related to the current policy.
In this way, the \dora can implicitly recognize the change of an opponent's strategy or utility function without explicitly modelling the opponent.

The history of online negotiations is denoted as $\mathcal{H}^{new}$, and each element $h^{new} \in \mathcal{H}^{new}$ is saved in $\mathcal{D}_{on}$ after the same pre-processing as in Section~\ref{offline module}.
The \dor framework creates a prioritized buffer, which stores both the offline negotiation data $\mathcal{D}_{off}$ and the online data $\mathcal{D}_{on}$ during fine-tuning.
Then, the prioritized buffer sorts all available samples according to their \textit{online-ness}.
To measure the \textit{online-ness} of a sample, we use the density ratio $\omega(s,a) :=d^{on}(s,a)/d^{off}(s,a)$, and each sample is assigned a sampling probability proportional to this ratio; here, $d^{on}(s,a)$ and $d^{off}(s,a)$ denote the distributions of state-action pairs in the online and offline buffers, respectively.
\dor estimates the density ratio by training a neural network $\omega_{\psi }(s,a)$, called the density ratio estimator.
The training procedure for the density ratio estimator $\omega_{\psi }(s,a)$ follows the approach of \citet{sinha2022experience} and uses the variational representation of $f$-divergences~\citep{NIPS2007_72da7fd6}.
Let $f(y):=y \log \frac{2 y}{y+1}+\log \frac{2}{y+1}$; the Jensen-Shannon (JS) divergence is then defined as $D_{J S}(P \| Q)=\int_{\mathcal{X}} f(d P(x) / d Q(x)) d Q(x)$.
The model $\omega_{\psi }$ is updated by maximizing the lower bound of the JS divergence:

\begin{equation}
\setlength{\abovedisplayskip}{3pt}   \setlength{\belowdisplayskip}{3pt}
    \mathcal{L}^{\mathrm{DR}}(\psi)=\mathbb{E}_{x \sim P}\left[f^{\prime}\left(\omega_{\psi}(x)\right)\right]-\mathbb{E}_{x \sim Q}\left[f^{*}\left(f^{\prime}\left(\omega_{\psi}(x)\right)\right)\right] 
    \label{eq:11}
\end{equation}

where $f^*$ is the convex conjugate of $f$. 
The expectation in the first term of Eq.~\eqref{eq:11} is estimated with samples from $\mathcal{D}_{on}$, and the expectation in the second term with samples from $\mathcal{D}_{off}$.
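
For this particular $f$, both terms of Eq.~\eqref{eq:11} can be written in closed form. The following short derivation is a sketch added for illustration; it uses only the definition of $f$ given above. Differentiating gives $f^{\prime}(y)=\log \frac{2 y}{y+1}$, and the convex conjugate $f^{*}(t)=\sup_{y>0}\{t y-f(y)\}$ evaluates to $f^{*}(t)=-\log \left(2-e^{t}\right)$ for $t<\log 2$. Substituting $t=f^{\prime}(\omega_{\psi}(x))$ yields $f^{*}(f^{\prime}(\omega_{\psi}(x)))=-\log \frac{2}{\omega_{\psi}(x)+1}$, so that
\begin{align*}
\mathcal{L}^{\mathrm{DR}}(\psi)={} & \mathbb{E}_{x \sim P}\left[\log \frac{2\, \omega_{\psi}(x)}{\omega_{\psi}(x)+1}\right] \\
& +\mathbb{E}_{x \sim Q}\left[\log \frac{2}{\omega_{\psi}(x)+1}\right].
\end{align*}
Both expectations can therefore be estimated directly from minibatches, provided that the network output $\omega_{\psi}(x)$ is constrained to be positive (for example through a softplus or exponential output layer, which is an assumption here and not a detail stated in the text).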


Additionally, we employ an ensemble agent whose parameters are initialized by $\{Q_{\theta_i},\pi_{\phi_i}\}_{i=1}^N$ obtained in the offline learning module.
During strategy fine-tuning, $\theta$ and $\phi$ are updated via the SAC update rules in Eq.~\eqref{eq:5} and Eq.~\eqref{eq:7}, respectively.


\section{Experiments}
\label{exps}

Three experiments are conducted in order to demonstrate the effectiveness
of the \dor framework. 
The first experiment explores the following three performance aspects: effectiveness of the negotiating agent strategy learned on the basis of previously collected offline data; impact of data collected by more advanced strategies on the performance of the \dor agent; and performance of the learned strategy in comparison to the strategies used to collect the data.
The second (third) experiment investigates whether the \dor agent learned from offline datasets can also adapt to changes of its opponent’s preferences (strategy) in subsequent online negotiations.


\subsection{Experimental Setup}

\begin{table}[!ht]
		\caption{Statistics of all 18 domains in the experiments. The domains are classified into three groups according to outcome space (i.e., small, medium and large domains).}
        \label{tab:1}
		\centering
        \resizebox{1.0\columnwidth}{!}{
		\begin{tabular}{lccc}
			\hline
			Domain  & Outcome Space  & Opposition & Number of Issues\\
			\hline
                NiceOrDie    &  3 & 0.840 & 1 \\
                Ultimatum     & 9  & 0.545 & 2 \\
                FiftyFifty2013  & 11 & 0.707 & 1 \\
                Laptop       & 27  & 0.160 & 3 \\
                Planes        & 27    & 0.164 & 3 \\
                DefensiveCharms & 36 & 0.322 & 3  \\ \hline
                Coffee       & 112   & 0.447 & 3  \\
                Outfit       & 128   & 0.198 & 4  \\
                DogChoosing  & 270   & 0.051 & 5 \\
                Acquisition   & 384  & 0.117 & 5 \\
                HouseKeeping  & 384  & 0.272 & 5  \\
                Icecream     & 720   & 0.148 & 4 \\ \hline
                Animal       & 1152   & 0.110 & 5  \\
                Camera      & 3600  & 0.212 & 6 \\
                Lunch		& 3840	& 0.399 & 6 \\
                SmartPhone   & 12000   & 0.224 & 6 \\
                Kitchen      & 15625   & 0.057 & 6 \\
                Wholesaler  & 56700    & 0.308 & 7 \\
			\hline
		\end{tabular}
            }
\end{table}

In our experimental settings, each agent plays against an opponent in every domain for a number of repetitions. Moreover, in each repetition, a pair of agents negotiates twice, exchanging the order in which they start bidding.
The experiments consider the whole set of domains created for ANAC 2013. As shown in Table~\ref{tab:1}, these domains differ in their size of outcome space (i.e., the set of possible outcomes), ranging from 3 to 56700, in the opposition (i.e., the minimal Euclidean distance to the optimal outcome for both sides), ranging from 0.051 to 0.84, and in the number of negotiated issues, ranging from 1 to 7.   
The ANAC 2013 domains were chosen because 1) these domains cover a wide range of domain characteristics, 2)
designers of all agents know these domains well, so none of the agents is at a disadvantage, and 3) these domains are also adopted in other recent work \citep{SenguptaAdaptive,chen202102,deanalysis,chen202301} for comparability reasons.
To support RL training and evaluation in a convenient way, we developed a Python-based negotiation environment that also provides a core set of abstract behaviors (interfaces) for implementing a negotiating agent.

During each negotiation session, the reservation value for all domains is 0, the discount factor of negotiation outcomes is ignored in negotiations, 
and the maximum number of rounds per session is 1000.
The repetition number is set to 300.
For the implementation details of the \dora, the batch size is 256 and the size of both the offline and online replay buffers is set to 2e+6.
The learning rates of the actor network and the critic network are 1e-4 and 3e-4, respectively.
The discount factor in RL training is 0.99.
\dora is trained for 1e+6 timesteps.
Moreover, the CQL algorithm is based on an open-source SAC implementation\footnote{See \url{https://github.com/vitchyr/rlkit}.}; the other parameter settings are identical to the setup of \citet{kumar2020conservative}.
Following the suggestion of \citet{lee22d}, the ensemble size $N$ is set to 5.
More details can be found in the appendix.

% The following metrics are considered in experiments: 
% \begin{enumerate}
%             \item[(1)]
%     \textit{\textbf{Domain utility}}: the average utility of all agents $\in A$ (including self play) in domain $d \in D$, where $A$ and $D$ denote the set of agents and domains, respectively.
% 		\item[(2)] \textit{\textbf{Average utility}}: the average utility acquired by an agent $a \in A$ when negotiating with every agent $b \in A$ (including $a$) across all domains $D$.
% 		\item[(3)] \textit{\textbf{Average utility against opponent $a$}}: the average utility obtained by other agents $b \in A \backslash a$ when negotiating against opponent $a$ across all domains $D$.
% \end{enumerate}

\subsection{Influence of Offline Dataset}
\label{exp_influence}

To investigate whether a useful strategy can be learned from offline datasets and how the offline datasets influence the performance of \dor, two different datasets were collected separately.
Both datasets involve the same four opponents, each employing a distinct strategy from the four ANAC winner agents' strategies (winner strategies) -- AgreeableAgent2018, PonpokoAgent, Caduceus and Atlas3\footnote{
These were the ANAC winners in 2018, 2017, 2016 and 2015, respectively.
}.
The first one (referred to as winner dataset) consists of the negotiation traces generated by four agents with each using one of the winner strategies playing against those opponents in all 18 domains.
The other dataset (referred to as random dataset) was built from negotiations between a simple random agent that uses a random bidding strategy and also accepts offers according to a probability distribution (random strategy) and the four opponents.
Moreover, the negotiations between the random agent and the opponents were repeated four times in order to match the size of the winner dataset.
By training separately on the two different datasets, two negotiating agents, referred to as \dor-winner and \dor-random, are obtained.
%As trained on offline datasets comprised of negotiations using different strategies against the same set of opponents, it is interesting to investigate whether the two \dora agents can perform well against the same set of opponents.
Note that, as the experiment below aims at analyzing the influence of datasets on offline learning (corresponding to Sec.~\ref{offline module}),
the strategy fine-tuning mechanism is disabled so that performance improvements achieved through this mechanism do not affect the results.

\begin{figure}[ht] 
\centering 
\includegraphics[width=0.5\textwidth]{picture/5_2_1_new.eps} 
\caption{
Box plots showing the utility against the four opponents (each using one ANAC winner's strategy) for \dor-winner, \dor-random and two baselines (the random strategy and the average utility of the four winner agents).
The results are obtained in the HouseKeeping domain. 
Points represent the utility of agreements reached when playing against each opponent, and outliers are marked by the diamond symbol.} 
\label{Fig:2}  
\end{figure}

Figure~\ref{Fig:2} compares the performance of the two \dor agents and two baselines (i.e., strategies used for collecting the two offline datasets) against the four opponents encountered in the offline datasets. 
As depicted in the figure, 
\dor-winner clearly achieved the best performance, whereas \dor-random had a much lower utility against each of the four winners.
Specifically, \dor-winner led \dor-random by a large margin of between 104.5\% and 175.8\% in the four cases, and it achieved a mean score of 0.85 against the four opponents, 132.4\% higher than that of \dor-random.

The results indicate that training a \dor agent
with samples from advanced strategies rather than simple strategies can bring about a considerable performance improvement.
This is because
advanced strategies can exhibit more useful state-action pairs leading to high rewards.
Another valuable observation is that both \dor-winner and \dor-random managed to outperform
the strategies used for collecting
the two offline datasets.
More precisely,
\dor-winner exceeded the average performance of the four winner strategies in terms of average utility against opponent by 28.4\%, and \dor-random outperformed the random agent by 30.6\%.
The \dor framework's capability of handling distribution shift (see Sec.~\ref{offline module}) may account for this success.
 
\begin{figure*}[!ht]  
\centering  
\includegraphics[width=0.85\textwidth]{picture/5_2_1b_new.eps}  
\caption{Domain utility of the two DOREA agents and baselines in all 18 domains. The average score of all agents in each domain is marked by a red solid line.}  
\label{Fig:3}   
\end{figure*}

Next, to look closer into the performance of the \dor agent, Figure~\ref{Fig:3} shows the results for each of the 18 domains, where seven agents (including two \dor agents, the random agent and four ANAC winner agents) are considered and the number of repetitions was set to 100 for each domain to ensure statistical significance of the results.
One can see that
\dor-winner was the most successful agent with a notable advantage, and ranked first in all domains except the NiceOrDie domain, in which it ranked second. Moreover, it scored 37.3\% higher than the average score of all agents across the 18 domains, and outperformed the second-best agent (Atlas3) by a margin of 22.4\%.
To sum up, the experimental results show that the \dor framework is capable of learning an effective strategy from offline datasets and that the learned strategy was more performant than those strategies used for collecting the data. 
%\textcolor{red}{due to the superiority of offline learning}. 

\subsection{Performance of \dor with Changes in the Opponent Preferences}
\label{sec5.3}

\begin{figure*}[htp]   
\centering   
\includegraphics[width=1.0\textwidth]{picture/5_3_a_test.pdf}  
\caption{Four illustrative examples of fine-tuning performance of \dora against four ANAC winner agents in the HouseKeeping domain. 
DOREA w/o sft represents the DOREA agent without strategy fine-tuning, SAC denotes the agent learning from scratch using SAC and SAC-sft denotes an SAC based agent initialized by the \dora. The solid lines and shaded regions represent mean and standard deviation, respectively.}
\label{Fig:4}    
\end{figure*}

The opponent encountered in the offline dataset may change its preference profile in subsequent online negotiations for many reasons that are hard to model. This experiment therefore studies the performance of the \dor framework against opponents with varying preferences.
We assume that the opponent's preferences remain static for 250 sessions before being changed again in online negotiations.
For simplicity, we also assume that the opponent keeps its strategy fixed when changing preferences.
For each domain, 100 distinct preference profiles of the opposing party are randomly generated (i.e., these preferences differ from those used in the offline dataset and from each other).
In particular, this experiment focuses on the offline-to-online performance against an opponent, that is, how well an agent can adapt to an opponent when it changes from the preferences shown in the offline dataset to some different preferences in subsequent online negotiations.
The \dora is trained with the winner dataset as described above.
Three baselines are introduced for comparative evaluation -- the \dora without strategy fine-tuning (denoted as DOREA w/o sft), the RL agent that employs the SAC algorithm and learns from scratch online (denoted as SAC agent), and another SAC-based agent initialized with the parameters of the \dora (denoted as SAC-sft agent).

Illustrative examples of online negotiations against the four opponents in the HouseKeeping domain are presented in Fig.~\ref{Fig:4}, where the results are averaged over the negotiations in which the opponent tries all of the 100 preference profiles.
Quantitatively similar results have been obtained for the other domains, which are not reported here due to limited space.
According to the figure, the \dora clearly outperformed the baselines in terms of learning efficiency and final performance. 
More precisely, the \dora achieved a stable performance after about 50 to 70 sessions, whereas both the SAC-sft and the SAC agent reached it much more slowly (after approximately 200 sessions).
Besides, the \dora obtained the highest average utility of 0.81, leading the SAC agent (i.e., learning from scratch) and the DOREA agent w/o sft (i.e., no fine-tuning) by a large margin.
This shows the effectiveness of the strategy fine-tuning mechanism, which provides helpful offline data for the current negotiation and speeds up the fine-tuning process, starting from a pessimistic initialization.

Table~\ref{tab:3} summarizes the performance of the \dora and the baselines after 200 sessions in domains of small, medium and large size (refer to Table~\ref{tab:1}).
Like the results observed above, the \dora was still the best agent across the three classes of domains with an average utility of 0.837.
It clearly achieved a better performance, leading the DOREA w/o sft by a margin of 28.8\% on average. 
The SAC-sft agent, following the \dora, was ranked second in all three classes of domains.
In sum, \dora managed to outperform the baselines when competing against an opponent that changes its preferences in online negotiations.

%The SAC-sft agent made the 2nd place in small domains, and the DOREA w/o sft made the 2nd place in medium and large domains.
%To sum up, \dora managed to .
%The average utility of the \dora was is better than that of the SAC agent, leading a margin between $22.38\%$ to $48.14\%$ in the six domains. 
%Clearly, the \dora also achieves a better performance than its variant without strategy fine-tuning. 
%To sum up, these methods improve the sampling efficiency and final performance of negotiation tasks.
%varying degrees with the change of the opponent's utility function $U^{o}_{ft}$. %\textcolor{red}{and in the range of $22.86\%$ to $49.09\%$ with the change of its own utility function $U^{s}_{ft}$.}
% \textcolor{red}{
% The rationality of our offline-to-online fine-tuning scheme.
% On the one hand, we used pessimistic Q function to estimate the value during offline training, which can ensure that our critic network has a pessimistic mood at the initial stage of online fine tuning.
% In the online fine-tuning stage, online samples are very important to fine-tuning, but due to distribution shift (e.g., changes in preference profiles), online samples are also potentially dangerous OOD samples. 
% The use of pessimistic Q-function slows down the adjustment speed to a certain extent, to prevent the policy adjustment from being too radical and serious performance degradation. 
% Of course, in order to make efficient use of online data, we give priority to the samples encountered online in the balanced replay scheme, and also encourage the use of samples with the approach strategy in offline data sets. In addition, we use multiple Q functions of offline pessimistic training to ensure the degree of pessimism of our strategy at the initial stage.}

\begin{table}[ht] 
\caption{Average utility in three classes of domains; the bounds are based on the 95\% confidence interval.}
\centering    
\setlength{\tabcolsep}{1.2mm}{} 
\resizebox{0.5\textwidth}{!}{ 
\begin{tabular}{c|cccc}           
\toprule               
Domain & DOREA  & DOREA w/o sft & SAC & SAC-sft\\  \hline  
Small domain & $\mathbf{0.79}_{\pm0.02}$  & $0.57_{\pm0.04}$  & $0.61_{\pm0.04}$ & $0.74_{\pm0.03}$\\
Medium domain & $\mathbf{0.88}_{\pm0.03}$  & $0.71_{\pm0.05}$ & $0.69_{\pm0.06}$ & $0.81_{\pm0.04}$  \\  
Large domain & $\mathbf{0.84}_{\pm0.03}$  & $0.67_{\pm0.03}$  & $0.65_{\pm0.40}$ & $0.77_{\pm0.04}$ \\  
\bottomrule              
\end{tabular} 
}        
\label{tab:3}  
\end{table}  


\subsection{Performance of \dor with Changes in the Opponent Strategies}
\label{sec5.4}

This experiment investigates the performance of the \dora against opponents who adopt a different strategy in subsequent online negotiations. 
Here, an opponent can use a new strategy that has not been seen during the offline learning phase.
As such, the opponent strategy pool not only includes the ANAC winner agents used in Sec.~\ref{exp_influence}, that is, AgreeableAgent2018, PonpokoAgent, Caduceus and Atlas3, but also considers the runners-up of the respective ANAC editions -- Agent36, CaduceusDC16, YXAgent and ParsAgent -- as new strategies.
Moreover, MiCRO \citep{deanalysis}, a recently proposed effective negotiation strategy, is also included in the pool.
The experiment settings here are similar to those of Sec.~\ref{sec5.3}, except that an opponent can change its strategy while its preferences are kept fixed.

\begin{table}[!ht] 
\caption{Comparison of the \dora with baselines against an opponent with a different strategy in online negotiations. All results are obtained across all 18 domains.}  
\centering   
\resizebox{\linewidth}{!}{ 
\begin{tabular}{c|c|c|c|c}         
\toprule                
& \makecell[c]{Average util. \\against opponent} & \makecell[c]{SAC-sft} &\makecell[c]{DOREA\\ w/o sft }  & \makecell[c]{DOREA\\}  \\ \hline  
AgreeableAgent2018 & $0.54_{\pm0.06}$ & $0.79_{\pm0.06}$ & $\textbf{0.83}_{\pm0.03}$   & $0.82_{\pm0.07}$  \\  
PonpokoAgent & $0.60_{\pm0.04}$ & $0.85_{\pm0.02}$ & $0.84_{\pm0.05}$   & $\textbf{0.87}_{\pm0.03}$   \\           
Caduceus & $0.62_{\pm0.06}$ & $0.84_{\pm0.03}$ & $0.86_{\pm0.04}$    & $\textbf{0.88}_{\pm0.07}$   \\ 
Atlas3 & $0.52_{\pm0.08}$ & $0.81_{\pm0.03}$ & $0.85_{\pm0.06}$    & $\textbf{0.87}_{\pm0.01}$   \\
\midrule 
CaduceusDC16 & $0.61_{\pm0.02}$ & $0.65_{\pm0.04}$ & $0.69_{\pm0.05}$    & $\textbf{0.72}_{\pm0.01}$  \\  
YXAgent & $0.45_{\pm0.08}$ & $0.42_{\pm0.07}$ & $0.41_{\pm0.02}$     & $\textbf{0.52}_{\pm0.06}$  \\       
Agent36 & $0.47_{\pm0.04}$ & $0.55_{\pm0.04}$ & $0.57_{\pm0.04}$    & $\textbf{0.71}_{\pm0.04}$   \\ 
ParsAgent & $0.53_{\pm0.05}$ & $0.61_{\pm0.02}$ & $0.65_{\pm0.03}$    & $\textbf{0.73}_{\pm0.05}$   \\         
MiCRO & $0.51_{\pm0.02}$ & $0.59_{\pm0.05}$ & $0.62_{\pm0.05}$   &  $\textbf{0.69}_{\pm0.04}$    \\      \bottomrule              
\end{tabular}  }  
\label{tab:2}     
\end{table}

The results are given in Table~\ref{tab:2}, where the column ``Average util. against opponent'' indicates the average utility against an agent (each row) achieved by all agents in the opponent strategy pool, and each entry of the other columns is the average utility obtained by the column agent playing against the row agent. The upper part (first four rows) of the table represents strategies seen in the offline datasets and the lower part represents new strategies.

Some interesting observations follow from these outcomes. 
First, when encountering the four winner strategies that have been used during offline training, the \dora w/o sft exceeded the mean performance of the opponent strategies (see the second column of the table) by 48.96\%.
However, this advantage decreased to about 13.95\% when the opponent switched to an unknown strategy.
This demonstrates that overall the \dora w/o sft was an effective strategy, but strategy adjustment was required when facing unknown strategies if a stronger performance is expected.
Then, relying on the fine-tuning mechanism, the \dora further improved its performance, achieving on average an increase of 9.58\% over its variant without fine-tuning.
There was only one special case where the \dora got a slightly worse utility (around 1.20\%) than the \dora w/o sft, namely against AgreeableAgent2018.
We suspect that in this case, the initial strategy was already good enough, making the \dora end up with a similar performance level.
The SAC-sft agent again lagged behind the \dora by a considerable margin, as in the results shown in Sec.~\ref{sec5.3}.
These results validate that the strategy fine-tuning mechanism is effective for negotiations where the opponent changes its strategy.
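
For orientation, the relative figures above can be approximately recovered from Table~\ref{tab:2}, assuming that each percentage is the mean of the per-opponent relative differences; this averaging scheme is our reading and is not stated explicitly. For the four strategies seen offline, comparing the \dora w/o sft with the pool average gives
\begin{equation*}
\frac{1}{4}\left(\frac{0.83}{0.54}+\frac{0.84}{0.60}+\frac{0.86}{0.62}+\frac{0.85}{0.52}\right)-1 \approx 0.490,
\end{equation*}
and for the AgreeableAgent2018 row the fine-tuned agent trails its variant by $1-0.82/0.83 \approx 0.012$. Small deviations from the reported values are to be expected because the table entries are rounded to two decimals.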

%the \dora was the most successful one against all opponents.
%It achieved higher utility than SAC-sft, \dor w/o sft and the mean performance of all opponents when playing against almost all opponent strategies.






% better than other competitors due to the effective strategy learned during the offline training phase.
% \textcolor{red}{The \dora w/o sft also clearly achieved a higher score against each of the opponent strategies than the average utility against the corresponding opponent obtained by all agents in the opponent pool, leading by a margin of 50.23\% on average.}   
% Then, relying on the fine-tuning mechanism, the \dora further improves its performance by taking advantage of the strategy fine-tuning mechanism, achieving on average an improvement of 4.23\% over its variant without fine-tuning. 
% There was only one special case where \dora got a slightly worse utility (around 3.52\%) than \dora w/o sft against AgreebleAgent2018. 
% We suspect that in this case, the initial strategy is already good enough, thereby achieving a similar performance level of \dora w/o sft.
% %Although better than the \dora w/o sft, 
% The SAC-sft agent again lagged behind the \dora with a considerable difference like results shown in Sec.\ref{sec5.3}.
% These results validate that the strategy fine-tuning mechanism is effective for negotiations where the opponent changes its strategy. %and the improvement it brings about would be even greater for low-quality initial strategies.
% \textcolor{red}{\sout{One can see that \dora-random w/o sft performs poorly, with a utility being lower than both \dora and the mean performance of all agents against each ANAC agent.
% However, when strategy fine-tuning is enabled, \dora-random shows a considerable improvement of about 100\%.
% These results validate that strategy fine-tuning is effective for negotiation tasks, and the improvement it brings about would be even greater for low-quality initial strategies.
% }}

% our DOREA agents outperformed 8 agents in all the domains.
% \dora (winner) get highest average utility against 7 opponent in all 8 opponent agents.
% %DOREA(runner-up) gain certain advantages against other opponents, and has achieved the highest average utility when against ParsAgent.
% We also consider the performance of fine-tuning the offline strategy.
% Under this setting, we change the opponent's strategy(change the opponent) without changing its utility function, and keep the fine-tuning mechanism open and observe the performance of different DOREA agent.
% Compared with offline strategy, the average utility of online fine-tuning strategy has been improved.

%%%done-Gerhard
\section{Conclusion and Future work}
\label{concs}

This paper proposes a novel Deep Offline Reinforcement learning Negotiating Agent (DOREA) framework to learn a strategy from previous negotiation datasets. 
%generated by any strategy. %which solve in practice. 
The \dor framework consists of two key components: 
the offline learning based strategy module and the strategy fine-tuning mechanism. 
The offline learning based strategy module leverages previously collected datasets to learn an effective negotiation strategy without interaction with opponents.
Moreover, the strategy fine-tuning mechanism quickly fine-tunes the learned strategy via interactions and allows the agent to adapt to changes of opponent preferences or strategies.
%The performance of the DOREA framework is evaluated based on a diverse set of state-of-the-art baselines under different settings.
Experimental results show that it is effective against a diverse set of state-of-the-art negotiating agents when exclusively using offline datasets, and is also capable of adapting to opponent preference or strategy changes.
%The experimental analysis took various key aspects of automated negotiation into account, including the average utility and agreement achievement rate. 
%In addition, an analysis was also performed from the transfer perspective.

We think the results clearly justify investing further research effort in this approach and open several new research avenues, among which we consider the following as most promising.
First, as opponent modeling is another helpful way to improve the efficiency of negotiation, it is worthwhile to investigate how to combine opponent modeling techniques with the proposed framework.
Then, as the acceptance strategy also has an impact on the performance of the learned strategy, it is very promising to explore the possibility of training the acceptance strategy instead of using the simple one used in the framework.
A third important avenue we see is to enlarge the scope of the proposed framework to other negotiation forms such as concurrent negotiations and multi-lateral negotiations.



\begin{contributions} 					
Siqi Chen conceived the presented idea, designed the experiments and wrote the paper.
Jianing Zhao created the code, developed the negotiation environment, and carried out the experiments.
Gerhard Weiss edited \& reviewed the paper and conducted formal analysis.
Ran Su created the figures \&  tables, analyzed the data and wrote the paper.
Kaiyou Lei reviewed and supervised the work. 
\end{contributions}

\begin{acknowledgements} 
This work was supported by the National Natural Science Foundation of China (Grant Nos.: 61602391, 62222311), and Ant Group.
\end{acknowledgements}

% References
\normalem
\bibliography{sigproc2022}
%\bibliographystyle{plainnat}
%\bibliographystyle{abbrvnat}
\end{document}

