\section{Algorithmic Approach}
\label{sec:algApproach}

In this section, we outline two algorithmic interventions to improve upon $\defaultPolicy$ in PODPs. The first intervention uses a fixed prompt to an LLM-based policy $\policy^\prompt$ that nudges the LLM to prefer cost-aware, uncertainty-reducing \actiontypes{} such as clarifications when appropriate. We saw in Section~\ref{sec:motivatingExperiments} that this \emph{data-agnostic} approach can be substantially better than $\defaultPolicy$ when queries are under-specified and users patiently respond to all clarifications. However, real-world users may vary in their propensity to engage with clarifying questions. We therefore devise a second intervention in Section~\ref{sec:dataBasedIntervention} that uses historical conversation logs to fit a \emph{meta-policy} $\recalibratedPolicy$ that is better suited to the PODP.
 
\subsection{Data-Agnostic Interventions}
\label{sec:dataAgnosticInterventions}
We saw in Section~\ref{sec:motivatingExperiments} that \copilots{} are sufficiently capable of \emph{detecting} under-specified queries (Section~\ref{sec:meQueryUnderspecCommon}) and of generating \textsc{Clarify} responses when prompted explicitly (Section~\ref{sec:suboptDefaultPolicyMultiStep}). However, absent intervention (i.e., when relying on the baseline system message in $\defaultPolicy$), they do \emph{not} appear to condition sufficiently on their latent under-specification judgments when generating responses. Thus, we consider two approaches that explicitly emphasize the possibility of under-specification and the benefits of clarification when appropriate, while allowing graceful recovery of the default system behavior when warranted (e.g., when queries are well-specified).


\paragraph{Approach 1: Chain of Thought (CoT).}
We evaluate a chain-of-thought~\citep{wei2022chain} intervention in the form of a modified system message that encourages the \copilot{} to ``ask yourself whether you have sufficient information to provide a good answer, and then respond accordingly'' when responding to queries (see Appendix~\ref{app:tauSysMsgs}). 

\paragraph{Approach 2: Clarify When Appropriate (Clarify-Flex).}
We also evaluate a more flexible, context-aware relaxation of the ``always clarify'' system message that we experimented with in Section~\ref{sec:suboptDefaultPolicyMultiStep}. This modified system message instructs the \copilot~to ask clarifying questions about important factors only \emph{if} they have not been specified, and to respond directly otherwise (see Appendix~\ref{app:tauSysMsgs}).

\paragraph{Key Findings and Limitations.}
To compare our data-agnostic interventions to $\defaultPolicy{}$, we conduct a slightly modified version of the two-step recommendation experiment presented in Section~\ref{sec:meDefaultPolicyMiscalibrated}. Here, we consider prompts $\prompt_0 \in \{\textsc{Baseline}, \textsc{CoT}, \textsc{ClarifyFlex}\}$ at the first step, and sequential combinations in which the second-step prompt $\prompt_1$ either repeats $\prompt_0$ or reverts to \textsc{Baseline}, i.e., $\{(\prompt_0, \prompt_1) \mid \prompt_1 = \prompt_0 \lor \prompt_1 = \textsc{Baseline}\}$.
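
For concreteness, the set of sequential combinations can be enumerated explicitly. Under the stated constraint on $\prompt_1$, it contains five sequences:
\begin{equation*}
\begin{aligned}
\{\, & (\textsc{Baseline}, \textsc{Baseline}), \\
     & (\textsc{CoT}, \textsc{CoT}),\ (\textsc{CoT}, \textsc{Baseline}), \\
     & (\textsc{ClarifyFlex}, \textsc{ClarifyFlex}), \\
     & (\textsc{ClarifyFlex}, \textsc{Baseline}) \,\}.
\end{aligned}
\end{equation*}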

We begin by using our LLM-based $\tau$-classifier to map each intervention to a distribution over \actiontypes, so as to assess the extent to which highlighting uncertainty and encouraging contextual awareness at response generation time induces changes in response behavior relative to baseline. As Figure~\ref{fig:distOverTauHatFlexibleTaus} illustrates, while the \textsc{CoT} intervention behaves quite similarly to the \textsc{Baseline}, \textsc{ClarifyFlex} meaningfully diverges, favoring \emph{interrogation} when queries are critically under-specified, then shifting toward \emph{clarify}, and finally toward \emph{direct response} (i.e., converging with \textsc{Baseline}) as the degree of specification increases. 

\begin{figure}[!htb]
    \includegraphics[width=\linewidth]{uai_2024/images/dist_over_t0_tau_hat_flexible_taus.png}
    \vspace{-6mm}
    \caption{Distribution of the \actiontypes{} $\hat{\sampledactiontype}_0$ induced by the three prompts $\prompt_0 \in \{\textsc{Baseline}, \textsc{CoT}, \textsc{ClarifyFlex}\}$, grouped by under-specification level.}
    \label{fig:distOverTauHatFlexibleTaus}
\end{figure}

Next, we examine the distribution over the average utility of recommended items for each sequential combination of $(\prompt_0, \prompt_1)$. As Figure~\ref{fig:distOverAvgUtilFlexibleTaus} illustrates, (\textsc{ClarifyFlex, Baseline}) is the best-performing sequential combination when queries are critically under-specified, with its relative advantage diminishing as specification increases. When queries are sufficiently specified, (\textsc{ClarifyFlex, Baseline}) and (\textsc{CoT, Baseline}) obtain slightly higher median $\bar{\utility}$ than (\textsc{Baseline, Baseline}), but we generally see convergence because both the baseline and the interventions tend toward direct response in this setting.

\begin{figure}[!htb]
    \includegraphics[width=\linewidth]{uai_2024/images/dist_over_avg_util_by_tau_seq_flexible_taus.png}
    \caption{Distribution of $\bar{\utility}$ for each $(\prompt_0, \prompt_1)$ sequence; grouped by under-specification levels.}
    \label{fig:distOverAvgUtilFlexibleTaus}
\end{figure}

From this analysis, we conclude that among the data-agnostic interventions we consider, \textsc{ClarifyFlex} is best able to improve upon the baseline $\defaultPolicy{}$ when queries are critically under-specified, while maintaining the flexibility to converge to \emph{direct response} as specification increases. 
In summary, Figures~\ref{fig:distOverTauHatFlexibleTaus} and~\ref{fig:distOverAvgUtilFlexibleTaus} show that, with a synthetic user model that provides templatized answers to clarification questions, it is possible to improve upon the performance of the baseline LLM: \textsc{ClarifyFlex} performs better than $\defaultPolicy$ when evaluated in the PODP.

\subsection{Data-Based Intervention}
\label{sec:dataBasedIntervention}

Here, we introduce an intervention that leverages collected conversation logs to learn \emph{when} and \emph{how} to improve upon $\defaultPolicy{}$. It does so by redistributing probability mass away from uncertainty-agnostic \emph{direct response} and cost-agnostic \emph{hedging} toward cost- and context-aware \actiontypes{} such as \emph{clarify} when appropriate. This approach is more tunable and adaptive to different user populations than the data-agnostic interventions we consider in Section~\ref{sec:dataAgnosticInterventions}.

We begin by considering meta-policies $\recalibratedPolicy$ as described in Section~\ref{sec:meta-policies}. 
Recall that learning a mapping $\recalibratedPolicy: \mathcal{C} \mapsto \prompt$ is a \emph{different} decision-making problem from learning a policy for the original PODP.
As described in Section~\ref{sec:taxonomy}, we use the taxonomy we developed in Table~\ref{tbl:response_types} to reduce the action space of the meta-policies.
Given a set of \actiontypes{} $\ActionTypeSet$ with corresponding prompts $\{\prompt_\tau : \tau \in \ActionTypeSet\}$, we consider the restricted set of meta-policies $\recalibratedPolicy: \mathcal{C} \mapsto \ActionTypeSet$. A PODP agent using $\recalibratedPolicy$ will, at each timestep, first calculate $\hat{\sampledactiontype} = \recalibratedPolicy(\mathcal{C})$, then look up the corresponding prompt $\prompt_{\hat{\sampledactiontype}}$, and finally query the LLM with $(\prompt_{\hat{\sampledactiontype}}, \mathcal{C})$ to produce an action in the PODP.
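
The per-timestep procedure can be summarized compactly as follows. This is a notational sketch only: the subscript $t$ on the conversation and the symbol $\mathrm{LLM}(\cdot \mid \cdot)$ for the LLM's response distribution are shorthand introduced here for illustration, not notation defined elsewhere in the paper.
\begin{equation*}
\hat{\sampledactiontype}_t = \recalibratedPolicy(\mathcal{C}_t),
\qquad
\action_t \sim \mathrm{LLM}\bigl(\,\cdot \mid \prompt_{\hat{\sampledactiontype}_t},\, \mathcal{C}_t\bigr).
\end{equation*}
Under this view, the meta-policy only selects among the $|\ActionTypeSet|$ available prompts; the content of the response is still generated by the LLM.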

Conceptually, if we could simulate the PODP environment, we could learn a meta-policy $\recalibratedPolicy$ through online reinforcement learning (RL): sample prompts at each turn of the conversation from the current $\recalibratedPolicy$, observe the resulting conversation-level outcomes, and update the parameters of $\recalibratedPolicy$ using, e.g., PPO.
However, we typically cannot simulate user-chatbot conversations with high fidelity, and running online RL directly with users can be very sample-inefficient and result in a poor user experience.

Instead, we use an offline approach inspired by Asymmetric Imitation Learning~\citep{pinto2018asymmetric}. 
We assume access to a dataset $\data$ containing logs of user-chatbot dialogues along with conversation-level utility ratings,
$\data = \{ (C_1, U_1), \dots, (C_n, U_n) \}$.
Such a dataset can be collected, for example, from an already deployed chatbot.
Notice that the data contains signals about the true $\intent_i$ (i.e., $U_i \ \coloneq \utility(\intent_i, C_i)$) beyond what can be inferred from $C_i$, but the learner $\recalibratedPolicy$ does not have access to $\intent_i$. Hence, imitating optimal actions in $\data$ reduces to asymmetric imitation learning.

We use the $\sampledactiontype$-classifier developed in Section~\ref{sec:llmSuboptWhenUnderspec} to annotate all of the chatbot responses in $\data$ with their \actiontype~$\hat{\sampledactiontype}$. 
Consequently, we can estimate a Q-value function $Q(C, \hat{\sampledactiontype})$ on the annotated data as:
\begin{equation*}
\hat{Q} = \argmin_Q \sum_{i=1}^{n} \sum_{\action_j \in C_i} \bigl( Q(C_i[:\action_j], \hat{\sampledactiontype}(\action_j)) - U_i \bigr)^2,
\end{equation*}
where $C[:\action]$ denotes the prefix of conversation $C$ up to, but not including, the chatbot response $\action$.
The Q-value function $\hat{Q}(C, \tau)$ estimates the eventual utility obtained if the agent takes \actiontype{} $\tau$ upon observing conversation $C$ and then follows the baseline system (i.e., $\defaultPolicy{}$) at all future timesteps.
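
To see why the regression objective admits this interpretation, recall a standard property of squared-error regression: over an unrestricted function class, the population minimizer is the conditional mean of the target. As a sketch, assuming that the logged responses in $\data$ were produced by $\defaultPolicy$ and writing $\mathrm{E}$ for the expectation over logged conversations (notation introduced here for illustration),
\begin{equation*}
\hat{Q}(c, \tau) \approx \mathrm{E}\bigl[\, U \mid C[:\action] = c,\ \hat{\sampledactiontype}(\action) = \tau \,\bigr].
\end{equation*}
With a restricted function class or finite data, this relationship holds only approximately.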

When new conversations arrive, we evaluate the predicted $Q$ values for each $\tau \in \ActionTypeSet$ and choose the argmax:
 \begin{equation}
 \label{eq:metapol}
     \recalibratedPolicy(C) = \argmax_{\tau \in \ActionTypeSet} \hat{Q}(C, \tau).
 \end{equation}

We empirically evaluate this Q-value estimation approach in the synthetic recommendation experiment. We operationalize reward as the average utility (i.e., alignment between an item's features and the user's true preferences) over the set of recommended items. In the synthetic setup, we can generate responses (and eventual conversation rewards) for all possible $\tau \in \ActionTypeSet$ for each query seen in the dataset $\data$, so we can compute $Q^\ast$ for all queries seen in $\data$. However, we still need to estimate $Q$ for new queries as they arrive in order to implement Equation~\ref{eq:metapol}.

We construct a regressor to estimate $Q^\ast$ as follows: we use a pre-trained SentenceTransformer model~\citep{mpnet-base-v2} to encode a stratified sample of our synthetic corpus (we stratify by the degree of under-specification so that the resulting distribution over labels mimics the OpenAssistant results we report in Figure~\ref{fig:oass_underspecification_rates}). 

Then, for a new conversation history (e.g., a query $\query$),
we encode it using the same embedding model and retrieve its $k$ nearest neighbors, with $k=5$. We then look up each neighbor's $Q^\ast$ values and corresponding $\tau$, and predict the Q-value of each candidate $\tau$ as the average of the $Q^\ast(\tau)$ values contributed by the neighbors. This is akin to an asymmetric imitation learning baseline~\citep{sinclair2023hindsight}. We greedily choose the argmax $\tau$ at $t_0$, simulate user answers to LLM responses containing questions as in Section~\ref{sec:suboptDefaultPolicyMultiStep}, follow $\defaultPolicy{}$ at $t_1$, and report the resulting episode-level rewards (i.e., the average utility over items in the recommendation set). We present empirical results for this approach in Figure~\ref{fig:exp4results} and observe that our learned meta-policy achieves higher reward than the baseline.
The empirical results demonstrate that both strategies we evaluate, \emph{designing good prompts} (Section~\ref{sec:dataAgnosticInterventions}) and \emph{learning meta-policies} (Section~\ref{sec:dataBasedIntervention}), can improve upon $\defaultPolicy$. We observe in Figure~\ref{fig:exp4results} that the meta-policy is slightly preferred over \textsc{ClarifyFlex}; however, this ordering may not be universal. When historical data is not representative of future conversations, we may prefer \textsc{ClarifyFlex} over learning a meta-policy.
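
The nearest-neighbor estimate used in this experiment can be written out as follows. This is a sketch under stated assumptions: $\phi$ denotes the sentence-embedding model, $\mathcal{N}_k(\query)$ denotes the $k$ stored queries whose embeddings are closest to $\phi(\query)$ (the similarity measure is left unspecified), and each neighbor is assumed to contribute a $Q^\ast$ value for every $\tau \in \ActionTypeSet$. The symbols $\phi$ and $\mathcal{N}_k$ are introduced here for illustration only.
\begin{equation*}
\hat{Q}(\query, \tau) = \frac{1}{k} \sum_{\query' \in \mathcal{N}_k(\query)} Q^\ast(\query', \tau),
\qquad k = 5.
\end{equation*}
Plugging this estimate into Equation~\ref{eq:metapol} yields the greedy choice of $\tau$ at $t_0$.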

\begin{figure}[!htb]
    \centering
    \includegraphics[width=0.8\linewidth]{uai_2024/images/baseline_vs_clarflex_vs_metapolicy_by_underspec_lbl.png}
    \caption{Our learned meta-policy outperforms the baseline across all under-specification buckets, especially when queries are critically under-specified, and converges to the baseline when queries are sufficiently specified.}
    \label{fig:exp4results}
\end{figure}