\section{Helper LLM-based Classifiers}
\label{app:appHelperLLMClassifiers}
In this section, we provide descriptions, system messages, and validation results for each of the helper-LLM-based classifiers that we rely on throughout the paper.

\subsection{LLM-based query underspecification classifier}
\label{app:HelperLLMUnderspecClassifier}
\paragraph{Task description:} We use a helper LLM to map queries from our synthetic and real-world corpora to a set of class labels that describe the extent to which a given query is (or is not) under-specified, i.e., $\{\textsc{Critical under}, \textsc{Minor under}, \textsc{Sufficient}\}$.
We introduce these labels in Section~\ref{sec:meQueryUnderspecCommon}, in the context of the labels that we (i.e., human annotators) manually assign to a randomly sampled subset of the OpenAssistant corpus; the same definitions apply when the helper LLM is asked to classify queries. For convenience, we repeat them below:

\begin{itemize}[left=0pt]
    \item \textsc{Critical under}: One or more important factors upon which an answer to this query might depend are not specified or are unknown; (annotators agree that) it is difficult to provide a high-quality response without knowing these factors.
    \item \textsc{Minor under}: One or more less important factors upon which an answer to this query might depend are not specified or are unknown; however, it is possible to provide a high-quality response even without knowing these factors.
    \item \textsc{Sufficient}: All important factors upon which an answer to this query might depend are sufficiently specified.
\end{itemize}

\paragraph{Helper LLM prompt:}
The prompt that we provide to the helper LLM for this task is shown below; it is also included in the \texttt{helper\_task\_system\_messages.json} file in our supplemental materials.
\begin{lstlisting}[language=json]
 {
    "classify_queries_multiclass": "For each query in this list <list>{{input.question}}</list>, assign exactly one of the following labels:\n
         - sufficient: All important factors upon which an answer to this query might depend are sufficiently specified.\n
         - minor_under: One or more less important factors upon which an answer to this query might depend are not specified or are unknown; however, it is possible to provide a high-quality response even without knowing these factors.\n
         - critical_under: One or more important factors upon which an answer to this query might depend are not specified or are unknown; it is difficult to provide a high-quality response without knowing these factors.\n
     You MUST assign EXACTLY ONE label from the list above.\n
     Return your answer as a string.\n
     DO NOT answer any questions contained in the query, or include any expository text.\n
     The result should be DIRECTLY parsable in Python."
 }
\end{lstlisting}
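
Because the prompt asks for a string that is directly parsable in Python, a downstream consumer still has to normalize and validate whatever the helper LLM returns. The following is a minimal sketch of one way this could be done; the function name \texttt{parse\_underspec\_label} and the choice to return \texttt{None} for unrecognized replies are illustrative assumptions and are not taken from the supplemental materials.
\begin{lstlisting}[language=Python]
import ast

# Label strings as they appear in the prompt above.
VALID_LABELS = {"sufficient", "minor_under", "critical_under"}


def parse_underspec_label(raw):
    """Hypothetical helper: map a raw reply to a valid label, else None."""
    text = raw.strip()
    # The reply may be a quoted Python string literal (e.g. "'sufficient'")
    # or a bare word; try literal parsing first, then fall back to raw text.
    try:
        value = ast.literal_eval(text)
    except (ValueError, SyntaxError):
        value = text
    if not isinstance(value, str):
        return None
    label = value.strip().lower()
    return label if label in VALID_LABELS else None
\end{lstlisting}
Returning \texttt{None} rather than raising an exception is one possible design choice; it lets the caller decide whether to retry the call or discard the query.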

\paragraph{Helper LLM configuration:} We use GPT-4~\citep{openai2023gpt4} for all query underspecification classification calls. 

\paragraph{Validation:}
We use our synthetic query corpus to validate our use of this LLM-based underspecification classifier. As we describe in Section~\ref{sec:synthCorpusConstruction} and detail in Appendix~\ref{app:synthetic_data}, by virtue of how we construct these queries, we control the number of attributes that are revealed. As such, we have access to ground-truth underspecification labels defined in terms of the number of revealed attributes, referred to (with slight abuse of notation) as $|q|$ in the mapping shown below. Note that $|q|$ takes values in $\{0, \dots, |\theta|-1\}$ for masked queries, and will be equal to $|\theta|$ for sufficiently specified queries, where $|\theta|$ refers to the cardinality of the intent-specific attribute space. 

\begin{small}
\begin{align*}
\label{eq:multiClassGroundTruthLabels}
q \mapsto \begin{cases}
  \textsc{Critical under} &  |q| \leq 1,\\
  \textsc{Sufficient} & |q| = |\theta|, \\
  \textsc{Minor under} &  \text{otherwise.}
\end{cases}
\end{align*}
\end{small}
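
The mapping above can be sketched in a few lines of Python. This is an illustration only: the function name \texttt{ground\_truth\_label} and its argument names are hypothetical, and the cases are checked in the order in which they are listed above, which is an assumption that only matters in the degenerate case $|\theta| \leq 1$.
\begin{lstlisting}[language=Python]
def ground_truth_label(num_revealed, num_attributes):
    """Hypothetical helper implementing the case mapping shown above.

    num_revealed   -- |q|, the number of attributes revealed in the query
    num_attributes -- |theta|, the size of the intent-specific attribute space
    """
    if not 0 <= num_revealed <= num_attributes:
        raise ValueError("num_revealed must lie in [0, num_attributes]")
    if num_revealed <= 1:
        return "critical_under"
    if num_revealed == num_attributes:
        return "sufficient"
    return "minor_under"
\end{lstlisting}
For example, with a hypothetical attribute space of size $|\theta| = 5$, queries revealing 0 or 1 attributes map to \textsc{Critical under}, queries revealing 2 to 4 attributes map to \textsc{Minor under}, and queries revealing all 5 attributes map to \textsc{Sufficient}.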

We evaluate our LLM-based query underspecification classifier on our synthetic query corpus, which contains 600 queries split across the following intent domains: movie recommendation, gift recommendation, and plant recommendation. Below, we report performance metrics and confusion matrices over all synthetic queries, as well as separately for each intent.

\begin{table}[!htb]
\begin{center}
\begin{tabular}{lllll}
\toprule
 & precision & recall & f1-score & support \\
\midrule
critical\_under & 0.583 & 0.139 & 0.224 & 202 \\
minor\_under & 0.443 & 0.472 & 0.457 & 398 \\
sufficient & 0.720 & 0.873 & 0.789 & 600 \\
accuracy &  &  & 0.617 & 1200 \\
macro avg & 0.582 & 0.495 & 0.490 & 1200 \\
weighted avg & 0.605 & 0.617 & 0.584 & 1200 \\
\bottomrule
\end{tabular}
\caption{Classifier performance: over all intents}
\label{overall multi-class underspec}
\end{center}
\end{table}
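
As a guide to reading the all-intents table above, the following worked example recomputes three of its entries from the other reported values. It assumes the standard definitions, namely that the F1 score is the harmonic mean of precision ($P$) and recall ($R$), that the macro average is the unweighted mean over classes, and that the weighted average weights each class by its support; small discrepancies can arise because the inputs are already rounded to three decimal places.
\begin{small}
\begin{align*}
F_1^{\text{critical}} &= \frac{2 P R}{P + R} = \frac{2 \times 0.583 \times 0.139}{0.583 + 0.139} = \frac{0.162074}{0.722} \approx 0.224, \\
F_1^{\text{macro}} &= \frac{0.224 + 0.457 + 0.789}{3} = \frac{1.470}{3} = 0.490, \\
F_1^{\text{weighted}} &= \frac{202 \times 0.224 + 398 \times 0.457 + 600 \times 0.789}{1200} = \frac{700.534}{1200} \approx 0.584.
\end{align*}
\end{small}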

\begin{figure}[!htb]
\begin{center}
    \includegraphics[width=0.5\linewidth]{uai_2024/images/queryUnderspec_conf_mat_gpt4_all_intents.png}
    \caption{Confusion matrix: all intents}
    \label{fig:cmAllIntents}
\end{center}
\end{figure}


%['intent'] ('movie_rec',)
\begin{table}[!htb]
\begin{center}
\begin{tabular}{lllll}
\toprule
 & precision & recall & f1-score & support \\
\midrule
critical\_under & 0.059 & 0.010 & 0.017 & 99 \\
minor\_under & 0.284 & 0.214 & 0.244 & 196 \\
sufficient & 0.642 & 0.925 & 0.758 & 295 \\
accuracy &  &  & 0.536 & 590 \\
macro avg & 0.328 & 0.383 & 0.340 & 590 \\
weighted avg & 0.425 & 0.536 & 0.463 & 590 \\
\bottomrule
\end{tabular}
\caption{Classifier performance: movie recommendation queries}
\label{movie rec multi-class}
\end{center}
\end{table}

\begin{figure}[!htb]
\begin{center}
    \includegraphics[width=0.5\linewidth]{uai_2024/images/queryUnderspec_conf_mat_gpt4_movie_rec.png}
    \caption{Confusion matrix: movie recommendation queries}
    \label{fig:cmMovieRecs}
    \end{center}
\end{figure}

%['intent'] ('gift_rec',)
\begin{table}[!htb]
\begin{center}
\begin{tabular}{lllll}
\toprule
 & precision & recall & f1-score & support \\
\midrule
critical\_under & 0.852 & 0.371 & 0.517 & 62 \\
minor\_under & 0.652 & 0.861 & 0.742 & 122 \\
sufficient & 0.928 & 0.908 & 0.918 & 184 \\
accuracy &  &  & 0.802 & 368 \\
macro avg & 0.811 & 0.713 & 0.725 & 368 \\
weighted avg & 0.824 & 0.802 & 0.792 & 368 \\
\bottomrule
\end{tabular}
\caption{Classifier performance: gift recommendation queries}
\label{gift rec multi-class}
\end{center}
\end{table}

\begin{figure}[!htb]
\begin{center}
    \includegraphics[width=0.5\linewidth]{uai_2024/images/queryUnderspec_conf_mat_gpt4_gift_rec.png}
    \caption{Confusion matrix: gift recommendation queries}
    \label{fig:cmGiftRecs} 
\end{center}
\end{figure}


%['intent'] ('plant_rec',)
\begin{table}[!htb]
\begin{center}
\begin{tabular}{lllll}
\toprule
 & precision & recall & f1-score & support \\
\midrule
critical\_under & 1.000 & 0.098 & 0.178 & 41 \\
minor\_under & 0.357 & 0.512 & 0.421 & 80 \\
sufficient & 0.683 & 0.694 & 0.689 & 121 \\
accuracy &  &  & 0.533 & 242 \\
macro avg & 0.680 & 0.435 & 0.429 & 242 \\
weighted avg & 0.629 & 0.533 & 0.513 & 242 \\
\bottomrule
\end{tabular}
\caption{Classifier performance: plant recommendation queries}
\label{plant rec multi-class}
\end{center}
\end{table}

\begin{figure}[!htb]
\begin{center}
    \includegraphics[width=0.5\linewidth]{uai_2024/images/queryUnderspec_conf_mat_gpt4_plant_rec.png}
    \caption{Confusion matrix: plant recommendation queries}
    \label{fig:cmPlantRecs}
\end{center}
\end{figure}

\pagebreak 

%\ch{\paragraph{Example OpenAssistant queries classified by helper LLM as critically underspecified:}
%\begin{itemize}
%    \item Please give me a prompt for stable diffusion to generate a good looking image.
%    \item How is the USA president at war 2?
%    \item A friend of mine barely responds or talks to me anymore and I don't know why.
%    \item What are some up and coming and high quality youtube channels in science and technology that I have probably not heard of? Note that I am subscribed to close to 1000 channels.
%    \item What temperature will it be next week?
%    \item How is the education in Korea?
%    \item For how long per day is it advised to take off a removable cast?
%    \item Suggest me places near 72nd St where I can park my car. Please also order them by price and add the price of each one to the right of their names.
%    \item How do I prepare for a job interview?
%    \item What is the weather like in Prague?
%\end{itemize}}



\subsection{LLM-based \ActionType{} classifier}
\label{app:HelperLLMTauClassifier}

\paragraph{Task description:} We use a helper-LLM-based $\tau$ classifier to map chatbot natural language responses to a set of labels intended to characterize a given response's syntactic and semantic contents. We primarily use this classifier to assess whether, and to what extent, our modified system messages \emph{actually} produce observable effects in the intended direction(s), and/or whether the induced behavior converges with that of $\defaultPolicy$.

The label set we use for this classifier includes the set of response strategies that we refer to as $\mathcal{T}$ throughout the paper, i.e., $\{ \textsc{Interrogate}, \textsc{Clarify}, \textsc{Hedge} \}$, as well as four additional options, i.e., $\{ \textsc{Direct response}, \textsc{Refuse}, \textsc{Miscellaneous}, \textsc{Missing} \}$. While we do not explicitly induce this latter set of behaviors, we need the \textsc{Direct response} option to characterize the baseline system behavior and, more broadly, uncertainty-agnostic LLM responses in general. The \textsc{Refuse}, \textsc{Miscellaneous}, and \textsc{Missing} options are needed to characterize the behavior of $\defaultPolicy$ in open-domain settings such as the OpenAssistant corpus we consider, as well as to handle rare parsing/extraction errors that result in inadvertently blank LLM responses. The defining characteristics of each label are given in the task system message shown in the next paragraph.

\paragraph{Helper LLM prompt:}
The prompt that we provide to the helper LLM for this task is shown below; it is also included in the \texttt{helper\_task\_system\_messages.json} file in our supplemental materials.
\begin{lstlisting}[language=json]
{
    "sm_map_llmr_to_tau": "For each (query,response) in this list <list>{{input.pair}}</list>, map the response to exactly one of the following labels:\n
    
        - interrogate: The response contains a large number (i.e., more than 3) of follow-up questions and and does NOT contain plausible responses conditioned on possible answers to these questions.\n
        - clarify: The response contains a limited number (i.e., 3 or less) of follow-up questions and does NOT contain plausible responses conditioned on possible answers to these questions.\n
        - hedging: The response does not commit to one specific answer but instead provides many plausible/possible/qualified answers, options, or conditions under which certain answers/options may or may not hold. It may also discuss (potentially conflicting) different view points without taking a definitive stance.\n
        - direct_response: The response does NOT contain questions. The response does NOT contain multiple plausible answers, with corresponding descriptions of conditions or criteria under which each response would be suitable.\n
        - refuse: The response contains an explicit or implicit refusal to answer. It may mention criteria which would be needed in order to provide an answer, but it does NOT contain plausible responses conditioned on these criteria.\n 
        - misc: The response may describe, summarize, or try to explain the query, or appear to follow instructions provided in the query (rather than answer an information-seeking request or ask clarifying questions).\n
        - missing_response: The response is empty or blank.\n

    You MUST assign exactly one label from the list above.\n
    Return your answer as a string.\n
    DO NOT answer any questions contained in the response, or include any expository text.\n
    The result should be DIRECTLY parsable in Python."
}
\end{lstlisting}

\paragraph{Helper LLM configuration:} We use GPT-4~\citep{openai2023gpt4} for all $\tau$ classification calls.

\paragraph{Validation:} We manually annotate $\defaultPolicy$ responses to a subset of the OpenAssistant corpus that we consider, and use these human-assigned ground-truth $\tau$s to validate our helper-LLM-based $\tau$ classifier. Some of our $\tau$s of interest (i.e., \textsc{Clarify}, \textsc{Hedge}, and \textsc{Interrogate}) are not sufficiently represented among the $\defaultPolicy$ responses. We thus use our system-message-based interventions to induce responses for these strategies and include them, unlabeled, in the subset that we manually annotate. We report classification performance metrics and a confusion matrix below.

\begin{table}[!htb]
\begin{center}
\begin{tabular}{lllll}
\toprule
 & precision & recall & f1-score & support \\
\midrule
clarify & 0.763 & 0.935 & 0.841 & 31 \\
direct\_response & 0.822 & 0.903 & 0.861 & 154 \\
hedging & 0.925 & 0.649 & 0.763 & 57 \\
interrogate & 1.000 & 0.680 & 0.810 & 25 \\
misc & 0.154 & 0.250 & 0.190 & 8 \\
refuse & 0.667 & 0.400 & 0.500 & 5 \\
accuracy &  &  & 0.807 & 280 \\
macro avg & 0.722 & 0.636 & 0.661 & 280 \\
weighted avg & 0.831 & 0.807 & 0.808 & 280 \\
\bottomrule
\end{tabular}
\caption{$\tau$-classifier performance on human-annotated LLM responses to OpenAssistant queries}
\label{overall multi-class tau}
\end{center}
\end{table}
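
Per-class metrics of the kind reported in the table above can be computed from paired lists of human-assigned and helper-LLM-assigned labels. The following is a minimal, dependency-free sketch under that assumption; the function name \texttt{per\_class\_metrics} and the convention of reporting a precision of zero for a label that is never predicted are illustrative choices and do not describe the code used to produce the reported results.
\begin{lstlisting}[language=Python]
from collections import Counter


def per_class_metrics(y_true, y_pred):
    """Hypothetical helper: {label: (precision, recall, f1, support)}."""
    support = Counter(y_true)      # ground-truth occurrences per label
    predicted = Counter(y_pred)    # predicted occurrences per label
    true_pos = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    metrics = {}
    for label in sorted(support):
        tp = true_pos[label]
        # Assumed convention: precision is 0.0 if the label is never predicted.
        precision = tp / predicted[label] if predicted[label] else 0.0
        recall = tp / support[label]
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = (precision, recall, f1, support[label])
    return metrics
\end{lstlisting}
Only labels that occur in the ground-truth list are reported, which would be consistent with the absence of a \texttt{missing\_response} row in the table if no annotated response carried that label.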

\begin{figure}[!htb]
\begin{center}
    \includegraphics[width=0.5\linewidth]{uai_2024/images/oasst_tau_labels_confusion_matrix.png}
    \caption{Confusion matrix: annotated OpenAssistant query responses}
    \label{fig:cmTauLblsOasst}
\end{center}
\end{figure}


