\section{Introduction}
\label{sec:introduction}

\begin{figure}[!tb]
    \includegraphics[width=\linewidth]{uai_2024/images/example.png}
    \caption{An example failure where a user's query is under-specified (blue text). Current \copilots~produce long responses in order to hedge against uncertainty (purple text). Clarifying the user's context can avert this failure.}
    \label{fig:example}
\end{figure}

In contrast to their task- or domain-specific predecessors, modern conversational agents employ large language models (LLMs) to achieve high levels of proficiency (i.e., at or exceeding that of humans) in challenging, open-domain settings~\citep{openai2023gpt4}. The implicit objective for the agent in such settings is to respond to a user in a way that maximizes the user's utility given their conversation goal(s).

However, humans are often unable or rationally unwilling to fully verbalize (i.e., explicitly state) their goals and preferences for various reasons (e.g., efficiency) and may instead rely on their conversational partner(s) to fill in the gaps~\citep{piantadosi2012communicative}. This leads users to issue \emph{under-specified} queries in which the \copilot{} observes only a subset of the preferences and constraints required to provide a high-quality answer (see Figure~\ref{fig:example} for an example). Empirically, we observe that under-specification is common:
we classified a random sub-sample of the queries in the OpenAssistant dataset~\citep{köpf2023openassistant} and found that more than 23\% of the sampled queries are
severely under-specified (see Figure~\ref{fig:oass_underspecification_rates} and Section~\ref{sec:meQueryUnderspecCommon} for details).

\begin{figure}[tbp]
\begin{center}
    \includegraphics[width=\linewidth]{uai_2024/images/distribution_revised.pdf}
    \caption{Real-world users asked severely under-specified queries more than $23\%$ of the time in the OpenAssistant dataset ($n = 600$).}
    \label{fig:oass_underspecification_rates}
    \end{center}
\end{figure}


\input{uai_2024/tikz/pareto}

In this paper, we explore the relationship between query under-specification, LLM response behavior, and user satisfaction. We begin by proposing a taxonomy of LLM response strategy types (see Table~\ref{tbl:response_types}) to characterize the behavior of SoTA models in the face of query under-specification---i.e., their ``conversational priors''---
with respect to utility and cognitive cost~\citep{tankelevitch2023metacognitive}. Figure~\ref{fig:pareto} provides an illustrative example.
Note that each \actiontype{} (a) can be characterized by syntactic and semantic features (e.g., length and the presence or absence of conditional statements or questions) and (b) gives rise to a joint distribution over cost and utility that imposes different trade-offs depending on the user's true but latent preferences.

We use this taxonomy and a combination of synthetic and real-world queries to empirically demonstrate that: (a) SoTA LLMs are predisposed to respond directly or hedge in lieu of asking a small number of clarifying questions when queries are under-specified; and (b) such miscalibration can lead to unsatisfactory and/or sub-optimal performance on downstream tasks (as illustrated in Figure~\ref{fig:example} and Section~\ref{sec:meDefaultPolicyMiscalibrated}).

To address the miscalibration of LLMs outlined above, we formalize user-chatbot interactions as a partially observable decision process (PODP), where a user with a partially observable goal engages in a turn-by-turn conversation with a chatbot. 
In this PODP, the chatbot's policy $\policy$
is a fixed mapping from conversation prefixes (which can span multiple turns) to natural language responses. Then, for any given conversation and user goal,
the chatbot seeks to provide a natural language response that maximizes utility according to a fixed but unknown user utility function. 
Note that utility is computed with respect to the user's \emph{latent} goal, 
which may be \emph{fully} or \emph{partially} observable via their query.
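
As a rough sketch of this objective (the notation below is introduced here for illustration only and is an assumption, not the formal definition), let $g$ denote the user's latent goal, $h_t$ the conversation prefix at turn $t$, $a_t = \policy(h_t)$ the chatbot's response, and $u(a, g)$ the user's utility for a response $a$ under goal $g$. Because the chatbot cannot observe $g$ directly, one natural reading is that it targets a response with high expected utility given what the prefix reveals:
% Assumption: the expectation is taken over the goals that remain plausible given h_t.
\[
    a_t^{\star} \in \arg\max_{a} \; \mathrm{E}\left[\, u(a, g) \mid h_t \,\right].
\]
When the query fully specifies the goal, the expectation is over a single goal and the objective reduces to maximizing $u(a, g)$; when the goal is only partially observable, the expectation averages over the goals that remain plausible given $h_t$. Since $u$ is fixed but unknown to the chatbot, this expression should be read as an idealized target rather than a quantity the chatbot can compute.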

Intuitively, when the goal 
is partially observable and the user is amenable to answering a small number of clarifying questions, a policy that produces a natural language response containing questions at timestep $t_0$ and incorporates the information gained to produce higher-quality responses at future timestep(s) will yield higher expected \emph{cumulative} utility, relative to a myopic policy that tends to respond directly or hedge at $t_0$. 
We build upon this insight to propose two interventions (Sections~\ref{sec:dataAgnosticInterventions} and~\ref{sec:dataBasedIntervention}) that make \copilots{} produce better-calibrated responses in the face of query under-specification. Both interventions require only API access to frozen, black-box LLMs.
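
To illustrate the intuition about clarifying versus myopic policies, consider a two-turn sketch under simplifying assumptions that are introduced here for illustration only (the symbols $U_{\mathrm{dir}}$, $U_{\mathrm{inf}}$, $c$, and $p$ are hypothetical and not part of the formal model). Suppose that responding directly at $t_0$ yields expected utility $U_{\mathrm{dir}}$; that asking a clarifying question at $t_0$ incurs a cost $c \geq 0$; that the user answers with probability $p$; and that the follow-up response yields expected utility $U_{\mathrm{inf}}$ if the user answered and $U_{\mathrm{dir}}$ otherwise. The clarifying policy then attains higher expected cumulative utility than the myopic one whenever
% Left-hand side: clarify at t_0, then respond; right-hand side: respond directly at t_0.
\[
    -c + p\,U_{\mathrm{inf}} + (1 - p)\,U_{\mathrm{dir}} > U_{\mathrm{dir}}
    \quad\Longleftrightarrow\quad
    p\,\bigl(U_{\mathrm{inf}} - U_{\mathrm{dir}}\bigr) > c .
\]
In words, clarification pays off when the expected gain from the information obtained exceeds the cost of asking. As the query becomes better specified, the gap $U_{\mathrm{inf}} - U_{\mathrm{dir}}$ shrinks toward zero and responding directly becomes preferable.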

Our first intervention (Section~\ref{sec:dataAgnosticInterventions}) is inspired by prior research on the generation of clarification questions~\citep{rao-daume-iii-2018-learning, majumder2021ask}, and uses a static, ``clarification-aware'' prompt to nudge LLMs to clarify when appropriate rather than reverting to default response behavior.
Our second intervention (Section~\ref{sec:dataBasedIntervention}) leverages historical conversation logs to learn a meta-policy---i.e., a mapping from conversation prefixes to a finite set of prompts. 
Then, during a PODP episode, the chatbot first invokes this meta-policy and then calls the LLM with the resulting prompt
to produce a contextually appropriate PODP action.
We expect the two proposed interventions to be effective in different data regimes: if high-quality logged data is readily available, the approach in Section~\ref{sec:dataBasedIntervention} is a practical alternative to resource-intensive approaches such as fine-tuning LLMs on the collected data.
Conversely, if we do not have access to sufficient high-quality data, we may prefer the data-agnostic approach of Section~\ref{sec:dataAgnosticInterventions}.
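
Schematically, and using notation that is assumed here purely for illustration, both interventions can be viewed as choosing a prompt before calling the frozen LLM. Let $h_t$ denote the conversation prefix at turn $t$, $\mathcal{P} = \{p_1, \dots, p_K\}$ the finite set of prompts, $\mu$ a meta-policy mapping a prefix to a prompt in $\mathcal{P}$, and $\mathrm{LLM}(p, h_t)$ the response generated by the black-box model under prompt $p$. The chatbot's response $a_t$ is then
% Assumption: mu, P, and LLM(.,.) are illustrative names, not the paper's own symbols.
\[
    a_t = \mathrm{LLM}\bigl(\mu(h_t),\, h_t\bigr), \qquad \mu(h_t) \in \mathcal{P}.
\]
Under this view, the data-agnostic intervention corresponds to a constant $\mu$ that always returns the single clarification-aware prompt, whereas the data-based intervention learns $\mu$ from logged conversations.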

In Section~\ref{sec:limitations}, we highlight that our proposed interventions can be further improved, for instance by reasoning about which clarification questions are ``good'' to ask (a choice currently left to the LLM) and about the propensity of users to answer with relevant information.
Empirically, we evaluate both interventions on recommendation tasks featuring a synthetic user model. We find that each intervention achieves higher expected utility relative to the baseline when queries are under-specified, and converges to the baseline as query specification increases.