\section{Problem Formulation}
\label{sec:problem}

\input{uai_2024/tikz/podp}

In the PODP setting that we consider,
let $\intent \in \intentdomain$ represent a user's latent, conversation-level \emph{goal}. Each PODP \emph{episode} (i.e., a user-chatbot conversation) begins with the user expressing their goal, $\intent$, in a potentially lossy manner via a natural language query, $\query \in \querydomain$. Per Definition~\ref{def:underspec}, we consider a query $\query$ to be \emph{under-specified} if there is an information gap between the user's goal and their stated query:

\begin{definition}[Under-specification]
\label{def:underspec}
Query under-specification is the partial observability of the user's goal given a query, i.e., $\Pr(\intentdomain\mid\querydomain)$ is unknown and not deterministic.
\end{definition}
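
As a purely illustrative instance of Definition~\ref{def:underspec} (the query, goals, and probabilities below are hypothetical and are not drawn from any dataset), suppose the query $\query=$``Tell me about Python'' is compatible with two goals, $\intent_1$ (the programming language) and $\intent_2$ (the snake), with
\[
\Pr(\intent_1 \mid \query) = 0.7, \qquad \Pr(\intent_2 \mid \query) = 0.3 .
\]
Because neither conditional probability equals one, the goal is not determined by the query, so $\query$ is under-specified in the sense above. By contrast, a query with $\Pr(\intent_1 \mid \query) = 1$ would be fully specified. In practice the chatbot does not have access to these values, which is the ``unknown'' part of the definition.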

Table~\ref{tab:OA_underspec_examples} lists some examples of under-specified queries in the OpenAssistant dataset (see Section~\ref{sec:meQueryUnderspecCommon} for details). 

Once initiated, a conversation is assumed to proceed iteratively until terminated by the user (Figure~\ref{fig:podp}). In this context, the chatbot's natural language responses constitute the \emph{action space} of the PODP and are denoted by $\action \in \actiondomain$, while the user's follow-up utterances constitute \emph{observations} denoted by $\observation \in \observationdomain$. 
We denote the multi-turn, variable-length \emph{conversation history} between the user and chatbot by $C\coloneq \query \times \conversationHistory$. We use $\mathcal{C}\coloneq \querydomain \times \conversationHistoryDomain$ to refer to the space of conversation histories. Then, for any conversation $C$ with user goal $\intent$, the task of the chatbot system is to produce actions with maximum utility according to a fixed but unknown \emph{user utility function}, $\utility: \intentdomain \times \mathcal{C} \rightarrow \mathbb{R}$. 
Although the reward function of the PODP, $\utility$, is unknown, we can observe samples from it. 
For example, many \copilots~allow users to rate their conversations; these ratings can be directly interpreted as samples of $\utility(\intent,C)$. 
Recent work~\citep{lin2024interpretable} infers $\utility$ across a user population using a small sample of rated conversations. 
In general, $\utility$ can rely on a mix of implicit factors, such as response length, and explicit factors, such as thumbs up/down or user ratings of paired responses. 
Moreover, in Figure~\ref{fig:pareto} we saw that the cognitive cost imposed on the user can be another component influencing $\utility$; in our experiments, we use response length---$\texttt{len}(\action)$---as a simple proxy for a user's cognitive cost from action $\action$.

We define the \emph{policy} $\policy$ of a chatbot interacting with a user as a stationary (but not necessarily Markovian) mapping from conversation histories to natural language responses $\policy: \mathcal{C} \rightarrow \actiondomain$ (Figure~\ref{fig:podp}). An optimal chatbot policy is one that maximizes expected utility:
\begin{equation}
\label{eq:ideal_policy}
\policy^\ast \approx \argmax_\policy \mathbb{E}_{\{\intent, \query\}} \mathbb{E}_{\action \sim \policy} [\utility(\intent, \query, \conversationHistory, \action)]. 
\end{equation}
In Equation~\ref{eq:ideal_policy}, note that the policy influences the responses $\action$ in all turns of the conversation, and that the pair $(\intent, \query)$ is sampled from $\Pr(\intentdomain,\querydomain)$, the joint distribution over the user population. 

\subsection{Policies Induced By Prompting LLMs}

System messages (also known as \emph{prompts}) are often used to ``steer'' an LLM and induce specific behaviors (e.g., $\prompt=$``Behave as a helpful assistant''). 
For LLMs that do not support a separate system message, the prompt $\prompt$ and conversation transcript can be concatenated to form the LLM's input context $\coloneq \prompt \circ C$. Otherwise, PODP policies can be induced by supplying $\prompt$ as the system message and using the LLM input context $\coloneq C$. In both cases, we denote the induced PODP policy by $\policy^\prompt$.
If we restrict our attention to the chatbot policies we can access via prompting, we can rephrase the \emph{policy} optimization objective (i.e., Equation~\ref{eq:ideal_policy}) in terms of \emph{prompt}~optimization:
\begin{equation}
\label{eq:prompt_policy}
\prompt^\ast \approx \argmax_{\prompt} \mathbb{E}_{\{\intent, \query\}} \mathbb{E}_{\action \sim \policy^\prompt} [\utility(\intent, \query, \conversationHistory, \action)]. 
\end{equation}

When we implement a policy by querying a black-box LLM API with context $\coloneq C$ (i.e., $\prompt$ is empty), we refer to the induced PODP policy as the RLHF policy $\defaultPolicy$. We can expect good out-of-the-box PODP performance from an LLM only if its RLHF fine-tuning guarantees that $\defaultPolicy \approx \policy^\ast$, which is unverifiable.

\subsection{Query Under-specification Causes Sub-optimal Interactions}

Modern LLMs are typically fine-tuned via RLHF, where the training objective~\citep{NEURIPS2022_b1efde53} corresponds to: 
\begin{equation}
\label{eq:rlhf_policy}
\defaultPolicy \approx \argmax_\policy \mathbb{E}_{\{\intent, \query\} \sim \text{lab}} \mathbb{E}_{\action \sim \policy} [\utility(\intent, \query, \action)].
\end{equation}

The combination of query under-specification and RLHF fine-tuning impacts policy learning (i.e., via Equation~\ref{eq:rlhf_policy}) in two ways: (1) distribution shifts between the preferences of annotators and those of end-users may skew the learned policy; and (2) RLHF's emphasis on annotation of, and optimization over, \emph{single-turn} interactions produces myopic policies that greedily maximize single-turn utility. 

With respect to (1), annotators may not be able to reliably infer users' \emph{true} preferences (i.e., $\intent$) when evaluating possible responses to user queries, i.e., $\Pr_{\text{lab}}(\intentdomain\mid\querydomain) \neq \Pr(\intentdomain\mid\querydomain)$. The utility function may also shift: for example, \citet{singhal2023long} observe that RLHF annotators may prefer longer, more detailed responses than end-users do.

With respect to (2), the focus on single-turn interactions means annotators are less likely to be exposed to conversations where a chatbot asks the user clarification questions to better understand and respond to the user's query, because such conversations will, by definition, require multiple turns. In the single-turn setting, annotators may also perceive responses that attempt to answer users' queries (albeit incorrectly or verbosely) as more \emph{helpful} than responses containing clarification questions. Policy learning with such preferences may thus underestimate the value of uncertainty-reducing behaviors such as clarification, and the resulting policy may be sub-optimal for \emph{multi-turn} conversational outcomes in PODPs. 
We empirically show that these challenges render $\defaultPolicy$ sub-optimal compared to $\policy^\ast$. 
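
A toy calculation (with hypothetical numbers chosen only for illustration, not drawn from any experiment) makes point (2) concrete. Suppose a query $\query$ is compatible with two equally likely goals, a correct answer yields utility $1$, an incorrect one yields $0$, and a clarification question costs the user $c > 0$ but reveals the goal. Then
\[
\mathbb{E}[\utility \mid \textsc{Respond}] = \frac{1}{2}\cdot 1 + \frac{1}{2}\cdot 0 = \frac{1}{2},
\qquad
\mathbb{E}[\utility \mid \textsc{Clarify}] = 1 - c .
\]
Over the full conversation, \textsc{Clarify} is preferable whenever $c < \frac{1}{2}$. An annotator who scores only the first turn, however, sees a direct answer worth $\frac{1}{2}$ in expectation and a clarification question that, on its own, answers nothing (utility at most $0$ in this toy model), and so prefers \textsc{Respond} regardless of $c$.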

\subsection{Meta-Policies}
\label{sec:meta-policies}

When prompting LLMs to produce chatbot responses, we are not limited to using a fixed prompt for all conversation turns. 
Instead, we can define a meta-policy, $\recalibratedPolicy: \mathcal{C} \mapsto \prompt$, as a mapping from conversation prefixes to prompts. A PODP agent acting during an episode can first invoke the meta-policy $\recalibratedPolicy$ and then query the LLM with prompt $\prompt\coloneq\recalibratedPolicy(C)$ to produce its action.
For PODP policies implemented through a composition of a meta-policy with an LLM, the original problem of finding a good $\policy^\ast$ is replaced with finding a good \emph{meta-policy} $\recalibratedPolicy^\ast$:
\begin{equation}
\label{eq:meta_policy}
\begin{split}
\recalibratedPolicy^\ast \approx \argmax_{\recalibratedPolicy} \; & \mathbb{E}_{\{\intent, \query\}} \mathbb{E}_{\action \sim \policy^\prompt} [\utility(\intent, \query, \conversationHistory, \action) \mid \\
& \prompt = \recalibratedPolicy(\query, \conversationHistory)].
\end{split}
\end{equation}

Note that learning a meta-policy $\recalibratedPolicy$ is a \emph{different} decision-making problem from the PODP itself: its action space consists of prompts rather than chatbot responses. 
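
To see how the policy classes in this section relate, it helps to introduce a shorthand (introduced here for illustration only): let $J(\policy) \coloneq \mathbb{E}_{\{\intent, \query\}} \mathbb{E}_{\action \sim \policy} [\utility(\intent, \query, \conversationHistory, \action)]$ denote the expected utility of a policy, and let $\policy^{\recalibratedPolicy}$ denote the policy obtained by composing a meta-policy with the LLM. A fixed prompt is the special case of a constant meta-policy, $\recalibratedPolicy(C) = \prompt$ for all $C$, and the empty prompt recovers $\defaultPolicy$. If the maximizations are solved exactly (rather than approximately, as the $\approx$ in Equations~\ref{eq:ideal_policy} and~\ref{eq:prompt_policy} allows), the nesting of these classes gives
\[
J(\defaultPolicy) \;\le\; J(\policy^{\prompt^\ast}) \;\le\; J(\policy^{\recalibratedPolicy^\ast}) \;\le\; J(\policy^\ast).
\]
The chain says nothing about how large each gap is; that remains an empirical question.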

\subsection{Characterizing and Inducing Chatbot Response Behaviors} 
\label{sec:taxonomy}

To empirically evaluate $\defaultPolicy$ and to design prompt-based interventions, we introduce a taxonomy (detailed in Section~\ref{sec:llmSuboptWhenUnderspec}) that can be used to (1) characterize LLM response behavior; and (2) restrict the meta-policy's action space.
Regarding (1), we refer to the distribution of response strategies of $\defaultPolicy$ as the LLM's ``conversational prior'' (e.g., see Figure~\ref{fig:me2_distOverTauPred} for GPT-4's conversational prior). 

To build intuition for how this taxonomy may serve both purposes, note that $\defaultPolicy$ can be viewed as a hierarchical probabilistic process in which the chatbot first samples a latent \actiontype, $\sampledactiontype \sim \ActionTypeSet$, and then generates a natural language response conditioned on the \actiontype, $\action \mid \sampledactiontype$. Then, if $\defaultPolicy$ is found to be miscalibrated in its distribution over $\ActionTypeSet$, we can \emph{intervene} via prompts to promote 
desired response behavior(s). 
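
Under this hierarchical view, and assuming for illustration that both stages condition on the conversation history $C$, the response distribution under $\defaultPolicy$ can be written as a mixture over \actiontypes:
\[
\Pr(\action \mid C) = \sum_{\sampledactiontype \in \ActionTypeSet} \Pr(\sampledactiontype \mid C)\, \Pr(\action \mid \sampledactiontype, C).
\]
The first factor is the distribution over response strategies, and the second is the LLM's realization of a given strategy in natural language. A prompt-based intervention can then be read as an attempt to reshape the first factor while, ideally, leaving the second to the LLM.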

\input{uai_2024/tikz/response_action_spectrum}

We specifically consider a set of response strategies, $\ActionTypeSet=\{ \textsc{Refuse}, \textsc{Respond}, \textsc{Hedge}, \textsc{Clarify}, \textsc{Interrogate} \}$. To motivate this choice, recall that in the PODP, the chatbot cannot observe the user's intent, $\intent$, and must instead act based on the \emph{belief state}, i.e., $\Pr(\intent\mid \query, \conversationHistory)$. In this context, possible \actiontypes~lie along a spectrum characterized by the relative \emph{absence} or \emph{presence} of behavior(s) that reduce belief uncertainty (Figure~\ref{fig:spectrum}). 
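
The belief state can be made concrete with a one-turn Bayes update, which is a standard identity rather than anything specific to this setting. After the chatbot takes action $\action$ and the user replies with observation $\observation$,
\[
\Pr(\intent \mid \query, \action, \observation) \;\propto\; \Pr(\observation \mid \intent, \query, \action)\, \Pr(\intent \mid \query),
\]
where we use the fact that the first action is chosen from $\query$ alone and therefore carries no additional information about $\intent$. The posterior differs from the prior $\Pr(\intent \mid \query)$ only to the extent that the likelihood of the user's reply varies with $\intent$. This is why a clarification question, whose answers differ across goals, can reduce uncertainty, while an action whose replies look the same under every goal cannot.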

On the \emph{uncertainty-agnostic} end of this spectrum, the chatbot may rely on its inductive prior to \emph{respond directly} despite uncertainty about the user's preferences (\textsc{Respond}). 
Responding directly relies on assumptions and/or potentially spurious semantic correlations between the preferences the user \emph{does} express and those that the \copilot~must infer. 
On the \emph{uncertainty-reducing} end, a chatbot may ask an unbounded number of questions before responding (\textsc{Interrogate}). This can allow the system to best approximate a user's fully specified intent, but engaging in such an exchange is irrational for the user. 
As Figure~\ref{fig:pareto} shows, any deviation from the \textsc{Respond} \actiontype~must be made thoughtfully, lest the user experience a worse cost-utility trade-off even as the system reduces uncertainty in its beliefs. 

In a PODP, it is critical to balance information-seeking (exploration) against utility maximization (exploitation). In Section~\ref{sec:motivatingExperiments}, we demonstrate that $\defaultPolicy$ places too much weight on \actiontypes~that myopically maximize one-step utility (i.e., \textsc{Respond} and \textsc{Hedge}). 
In Section~\ref{sec:dataAgnosticInterventions}, we demonstrate that a simple prompt is able to shift the distribution over response strategies 
toward \textsc{Clarify} when queries are under-specified, and thereby improve the PODP policy. 