\section{Motivating Experiments}
\label{sec:motivatingExperiments}

Here, we establish that: (1) query under-specification is common in real-world human-chatbot conversations; and (2) $\defaultPolicy$ can be sub-optimal when queries are under-specified. 

\subsection{Query Under-specification is Common}
\label{sec:meQueryUnderspecCommon}
We annotated the OpenAssistant dataset~\citep{köpf2023openassistant} to explore how often users issue under-specified queries to open-domain \copilots. 
We restricted our study to queries in English with at least $3$ words ($\approx 40\%$ of the more than $10{,}000$ conversations) and subsampled 600 queries uniformly at random.  
We created an LLM-based classifier that maps each query to a predicted under-specification label, and we also validate its accuracy on a synthetic corpus (see Appendix~\ref{app:HelperLLMUnderspecClassifier}). The class labels are:
\begin{itemize}[left=0pt]
    \item \textsc{Critical under}: One or more important factors upon which an answer to this query might depend are not specified or are unknown; 
    it is difficult to provide a high-quality response without knowing these factors.
    \item \textsc{Minor under}: Less important factors that the query might depend on are not specified or are unknown; however, it is possible to provide a high-quality response even without knowing these factors.
    \item \textsc{Sufficient}: All important factors upon which an answer to this query might depend are sufficiently specified.
\end{itemize}

Figure~\ref{fig:oass_underspecification_rates} summarizes the results of this experiment, which show that query under-specification is prevalent. A few examples of critically under-specified queries are listed in Table~\ref{tab:OA_underspec_examples}. Note also that many OpenAssistant users have experience with prompting, so we conjecture that under-specified queries are even more prevalent among novice user populations. 

\begin{table}[ht]
    \centering
    \resizebox{0.99\linewidth}{!}{
    \begin{tabular}{|l|}
    \toprule
    \textbf{Critically Under-specified Queries (Abridged)}\\
    \midrule
    Suggest me places near 72nd St where I can park my car.\\
    \hline
    What are some up and coming and high quality youtube channels \\
    in science and technology that I have probably not heard of?\\
    \hline
    A friend of mine barely talks to me anymore and I don't know why.\\
    \bottomrule
    \end{tabular}
    }
\caption{Examples from the OpenAssistant dataset tagged by our classifier (details in Appendix~\ref{app:HelperLLMUnderspecClassifier}).}
\label{tab:OA_underspec_examples}
\end{table}

\subsection{LLM Policies Can Be Sub-optimal When Queries are Under-specified} 
\label{sec:llmSuboptWhenUnderspec}

When queries are under-specified, $\defaultPolicy$~has difficulty trading off information seeking against greedy, utility-maximizing responses. To study this, we define seven broad categories of query responses in Table~\ref{tbl:response_types}. We use these definitions with an LLM-based classifier, which we validate in Appendix~\ref{app:HelperLLMTauClassifier}. Let $\tau$ be the predicted response type of response $\action$. Our experiments show that, for both real-world and synthetic queries, the current SoTA LLM, GPT-4, prefers to either respond directly or hedge, instead of clarifying via a short question.

\begin{table}[ht]
    \centering
    \resizebox{0.99\linewidth}{!}{
        \begin{tabular}{ll}
        \toprule
        \textbf{Response type $\tau$} & \textbf{Response characteristics} \\
        \midrule
        \textsc{Refuse} & Contains an explicit or implicit refusal to answer.\\
        \textsc{Direct response} & No questions or hedging; addresses query.\\
        \textsc{Hedge} & Many answers, conditioned on uncertain factors.\\
        \textsc{Clarify} & Limited/prioritized set of questions (i.e., $\leq 3$).  \\
        \textsc{Interrogate} & Large/exhaustive number of questions (i.e., $> 3$). \\
        \textsc{Missing} & The response is empty/blank.\\
        \textsc{Miscellaneous} & Describes or follows query instructions.\\
        \bottomrule
        \end{tabular}
    }
    \caption{For the motivating experiments in Section~\ref{sec:motivatingExperiments}, we categorize LLM responses into seven response types.}
    \label{tbl:response_types}
\end{table}

\label{sec:meDefaultPolicyMiscalibrated}

\subsubsection{Synthetic query corpus}
\label{sec:synthCorpusConstruction}
The goal of the synthetic corpus is to provide a full-information setting in which we can explicitly control the degree of under-specification and measure the utility of any given response. We generate queries for three recommendation domains (movies, gifts, plants), each with four constraint dimensions $\intent_i$ that can be active (set to a specific value, e.g., $\intent_{\text{age}} = $ ``25--35 years'') or inactive (e.g., $\intent_{\text{age}} = \emptyset$). We base this setup on~\citet{radlinski2019coached}, who studied users' preferences for movies expressed in a conversational recommendation setting. The user's goal is then to get a recommendation that satisfies \emph{all} of these constraints. Constraint values and the number of active dimensions are sampled uniformly at random. After determining the ground-truth user goal $\intent$, we generate a potentially under-specified user query by sampling a subset of the active constraint dimensions to reveal. With a slight abuse of notation, let $\query$ be the vector of revealed active constraints. We categorize the resulting queries as:  
\begin{small}
\begin{align}
\label{eq:multiClassGroundTruthLabels}
\query \mapsto \begin{cases}
  \textsc{critical under } &  |\query| \leq 1,\\
  \textsc{sufficient } & |\query| = |\intent|, \\
  \textsc{minor under } &  \text{otherwise.}
\end{cases}
\end{align}
\end{small}
Details can be found in Appendix~\ref{app:synthetic_data}.
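As an illustrative reading of Eq.~\eqref{eq:multiClassGroundTruthLabels} (a sketch that assumes $|\cdot|$ counts active constraints and that all four dimensions of the user goal are active, so that $|\intent| = 4$), the labels would partition the possible queries as follows:
\begin{small}
\begin{align*}
|\query| \in \{0, 1\} &\;\mapsto\; \textsc{critical under},\\
|\query| \in \{2, 3\} &\;\mapsto\; \textsc{minor under},\\
|\query| = 4 &\;\mapsto\; \textsc{sufficient}.
\end{align*}
\end{small}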


\subsubsection{Sub-optimality of LLM in single-step interaction}
\label{sec:subOptDefaultSingleStep}

For each query, we use GPT-4 with the default system message to generate a natural language response $\action \sim \defaultPolicy$ and assign it a response type label $\sampledactiontype$ from Table~\ref{tbl:response_types} using our LLM-based classifier.

Figure~\ref{fig:me2_distOverTauPred} shows the distribution over response strategies (by corpus and under-specification severity) for $\defaultPolicy{}$.
We observe that, for both synthetic and real-world queries, the uncertainty-agnostic \textsc{Direct Response} strategy is preferred by a large margin across \emph{all} under-specification buckets. While there is evidence that uncertainty-aware \actiontypes{} (i.e., \textsc{Hedge}, \textsc{Clarify}, and \textsc{Interrogate}) 
are used more often as under-specification rises, the magnitudes still show a clear bias of $\defaultPolicy{}$ toward responding or hedging, rather than clarifying, in the face of under-specification. This indicates that there is headroom to improve utility even over SoTA LLMs.
 
 \begin{figure}[tb]
    \includegraphics[width=\linewidth]{uai_2024/images/dist_over_tau_pred_by_underspec.png}
    \caption{Even under severe levels of under-specification, GPT-4 prefers to directly answer a user query.}
    \label{fig:me2_distOverTauPred}
\end{figure}


\subsubsection{Sub-optimality of LLM in multi-step interactions}
\label{sec:suboptDefaultPolicyMultiStep}

Intuitively, a policy that asks a few relevant questions at the outset should be able to outperform $\defaultPolicy{}$ in many cases, since $\defaultPolicy$ often defaults to \textsc{Direct Response}. The following two-step recommendation task shows this. 
We compare $\defaultPolicy{}$ with two simple static policies described in Table~\ref{tab:respStrat}. We use 
modified system messages to encourage different behavior for the first 
response, and follow $\defaultPolicy$ thereafter (for full prompts, see Appendix~\ref{app:tauSysMsgs}).


\begin{table}[tbph]
    \centering
    \resizebox{0.95\linewidth}{!}{
        \begin{tabular}{ll}
        \toprule
        \textbf{Policy $\policy^\prompt$} & \textbf{System prompt $\prompt_0$} \\
        \midrule
        $\defaultPolicy{}$ & Default LLM system message (unmodified). \\
        $\policy^{\text{Clarify}}$ & Ask about $\leq 3$ of the \emph{most relevant} factors.  \\
        $\policy^{\text{Hedge}}$ & Condition on option(s) for each uncertain factor.\\
        \bottomrule
        \end{tabular}
    }
    \caption{We evaluated three different policies that encourage different initial \actiontypes~to show the possible room for improvement in multi-step interactions.}
    \label{tab:respStrat}
\end{table}



We use the queries and ground-truth user goals from the synthetic query corpus outlined in Section~\ref{sec:synthCorpusConstruction}, but focus on the movie domain only, following~\citet{cheng2023llfbench}. Each episode begins at $t=0$ (denoted $t_0$) with the user issuing query $\query$ to ask for movie recommendations that satisfy their true preferences $\intent$. When the \copilot{} provides recommendations (i.e., chooses action types \textsc{Direct Response} or \textsc{Hedge}), we terminate the episode and compute the utility of the recommendation. If the chatbot asks questions, we use another LLM as a user simulator, requiring the latter to divulge information in a templatized format about constraints $\intent_i$ \emph{only} if explicitly asked (see Appendix~\ref{app:simulatingUserResponseToQuestions}).  

\textbf{Item Utilities.} We begin by measuring the utility of the items recommended by each $\policy$, operationalized as the fraction of constraints (out of $4$) that an item satisfies, averaged across all items recommended to the user.
Instead of comparing individual policies, we compare response types $\sampledactiontype$ to eliminate cases where setting the system message $\prompt$ did not induce the desired response type. 
Figure~\ref{fig:me2_distAvgUtil} shows how multi-step episode utilities develop when we group by the type of the first system response, $\sampledactiontype_0$. \textsc{Clarify} does not generate any utility at time $t_0$, since no recommendations have been made, but does much better in the second time step $t=1$, especially for critically under-specified queries. When we \textsc{Hedge} in the beginning, we do get utility at $t_0$, but generate less than when we directly reply, since utility is averaged over \emph{all} (possibly irrelevant) recommendations.
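To make the utility measure concrete, one way to write it (a sketch in which the recommended item set $R$, the indicator $\mathbb{1}[\cdot]$, and the numbers below are illustrative assumptions rather than notation or results from our experiments) is
\begin{small}
\begin{align*}
\utility = \frac{1}{|R|} \sum_{r \in R} \frac{1}{4} \sum_{i=1}^{4} \mathbb{1}\left[ r \text{ satisfies } \intent_i \right].
\end{align*}
\end{small}
For example, a hypothetical hedged response recommending four items that satisfy $4$, $2$, $1$, and $1$ of the constraints would score $\frac{1}{4}\left(1 + 0.5 + 0.25 + 0.25\right) = 0.5$, whereas a hypothetical direct response recommending two items that each satisfy $3$ constraints would score $\frac{1}{2}\left(0.75 + 0.75\right) = 0.75$. This is consistent with averaging over all recommended items penalizing responses that enumerate many options.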

These findings suggest that there is headroom for improvement over $\defaultPolicy$ in multi-step interactions. 

\begin{figure}[tb]
    \includegraphics[width=\linewidth]{uai_2024/images/dist_of_accumulated_util_by_t0_tau_pred_and_timestep.png}
    \caption{Distribution of accumulated item utilities $\utility$ at timesteps $t=0, 1$; grouped by under-specification levels.}
    \label{fig:me2_distAvgUtil}
\end{figure}

\textbf{Costs.} We now consider how the \emph{cost} of capturing this headroom (i.e., moving from an under-specified query to a more fully specified version) varies across the uncertainty-aware strategies that we consider, namely \textsc{Hedge} and \textsc{Clarify}. To proxy for the cognitive burden associated with reading and answering clarification questions, or with parsing the many cases or conditions mentioned in hedging responses, we define a cost function $c: \actiondomain \rightarrow \mathbb{R}_{\geq 0}$ with $c(\action) := \texttt{len}(\action)$ (measured by counting all unigrams in $\action$). 
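As a small worked example, assuming whitespace tokenization purely for illustration (the exact tokenizer is not specified here) and a hypothetical clarifying question, the cost would be
\begin{small}
\begin{align*}
c\left(\text{``Which genre do you prefer?''}\right) = 5,
\end{align*}
\end{small}
whereas a response that enumerates recommendations for several possible values of each unspecified factor would incur a cost that grows with the number of cases it lists.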

\begin{figure}[!htb]
\centering
    \includegraphics[width=0.85\linewidth]{uai_2024/images/dist_of_t0_resp_cost_t0_tau_pred.png}
    \caption{Distribution of response cost at $t=0$ for each \actiontype~$\sampledactiontype_0$; grouped by under-specification levels.}
    \label{fig:me2_distt0RespCost}
\end{figure}

Figure~\ref{fig:me2_distt0RespCost} illustrates the benefit of \textsc{Clarify}: it carries a relatively low cost in terms of output length. Interestingly, the \textsc{Direct Response} action produces the longest answers of all response types when queries are critically under-specified. Inspecting the produced responses, we see that \textsc{Direct Response} produces long answers by adding explanations or extended lists of recommendations. When queries are sufficiently specified, \textsc{Hedge} leads to the highest-cost answers, as it still enumerates many answer options. Overall, we see that \textsc{Clarify} obtains the lowest average cost across \emph{all} under-specification buckets, suggesting that a policy could achieve higher utility at lower cost by choosing the \textsc{Clarify} action more often. 