\section{Empirical Evaluation}
\label{sec:empiricalEvaluation}
% Establish our main claims and then explain how this section is set up and which supporting pieces and claims we use to support our main claims
In this section, we present results for a set of experiments designed to demonstrate that (1) 
% query-underspecification is a real-world problem, and 
$\defaultPolicy{}$ is miscalibrated in some commonly occurring situations characterized by query underspecification; (2) the approach we propose in Section~\ref{sec:algApproach} recalibrates $\defaultPolicy{}$ to respond more appropriately in the face of underspecification; and (3) our recalibrated meta-policy, $\recalibratedPolicy{}$, improves expected reward relative to the baseline in the multi-turn recommendation setting.

\subsection{Experimental Setup}
\label{sec:experimentalSetup}

\par \textbf{Datasets}:
% Describe synthetic and real-world (OpenAssistant) datasets; can re-use (parts of) previous motivating experiments description of synth dataset here; when we introduce the synthetic dataset, we motivate it in terms of Helper CLAIM A

\par \textbf{Response strategies}:
% Describe what is meant by ``baseline'' and then describe each $\tau$ we evaluate; provide summaries of sys messages

\par \textbf{Helper subroutines}:
% Explain the helper LLM calls/classifiers that we use; briefly describe each and include pointers to relevant appendix section(s) where we can provide full details including the prompt text

% LLM-based classifier for degree of query-underspecification
% LLM-based classifier for LLM responses (i.e., maps each response to $\tau \in \mathcal{T}$
% LLM-based extraction of recommended items and questions
% LLM-based mapping from extracted questions to synthetic query parameter space, to facilitate templatized user response generation/simulation

\par \textbf{Response evaluation functions}:
% introduce/define cost and (two-step) utility/reward functions

% This way, we have introduced all the tools/models/functions/notation our helper claims rely on, and we can explain them later without worrying about introducing undefined terms


% MAIN CLAIM 1: $\pi^{llm}$ is miscalibrated in some common situations.

% MAIN CLAIM 2: Our approach successfully recalibrates $\pi$ to become more appropriate, thereby becoming better than $\pi^{LLM}$ overall.

% Helper CLAIM A: We can induce situations where $\pi^{LLM}$ might be off, so that we can study MC1 --> describe synthetic recsys setup.
% % we do not need results to support claim a; rather, we provide details about our synthetic query corpus construction procedure. [DONE; this content exists in old drafts and we can bring it back in or put it in appendix]







% In this section, we present results for a set of experiments designed to (1) motivate our focus on interventions in support of cost- and context-aware uncertainty reduction; and (2) demonstrate that the meta-policy approach we propose in Section~\ref{sec:algApproach} results in improved expected reward relative to baseline in the multi-turn recommendation setting. 

% \subsection{Can LLMs Reliably Detect Underspecified Queries?}
% % \subsection{Can LLMs reliably detect underspecified queries?}
% \label{sec:exp1}
% We begin by demonstrating that while \copilots{} do a decent job of \emph{detecting} under-specified queries in the zero-shot setting, the default policy---i.e., \defaultPolicy, appears to empirically over-weight the (latent) ``direct response'' option in all but the most extreme cases of query under-specification. 

% % \begin{itemize}
% %     \item Describe synthetic query corpus construction procedure.
% %     \item 
% % \end{itemize}

% % \subsection{Is under-specification a problem ``in the wild''?}
% % \subsection{Is Underspecification a Real-World Problem?}
% \subsection{How Prevalent is Underspecification ``in the Wild''?}
% We proceed to use the LLM-based query classification procedure outlined in Section~\ref{sec:exp1} to examine the composition of random query subsets drawn from three publicly available datasets featuring conversations between humans and \copilots{} that have been collected with human consent---namely, WildChat~\citep{zhao2024inthewildchat} and OpenAssistant 1 and 2~\cite{köpf2023openassistant}.

% \textcolor{magenta}{briefly describe each dataset and the inclusion/exclusion criteria we use when sampling; show resulting distribution over query labels for the subsamples; do some text summarization/summary stat analysis to reason about what, if anything, characterizes/differentiates underspecified queries at the lexical/syntactic/semantic levels. conclude by establishing that this problem does exist ``in the wild'' and warrants attention due to sub-optimality of baseline policy in the face of such uncertainty/ the potential for nudging the tau distribution towards uncertainty-reducing response strategies to improve performance on downstream recommendation tasks.}

% \begin{figure}[!htb]
% \begin{center}
%     \includegraphics[width=0.7\linewidth]{uai_2024/images/exp1_gpt4_oasst_query_lbl_dist.png}
%     \caption{placeholder; update w/addtl results or convert to table}
%     \label{fig:}
% \end{center}
% \end{figure}

% % \subsection{Do $\tau$-based Interventions Reliably Shift \defaultPolicy?}
% \subsection{Is \defaultPolicy{} Controllable via $\tau$-based Interventions?}
% % \subsection{Are System-message-based Interventions Effective?}

% \textcolor{magenta}{demonstrate that the $\tau$-based interventions we propose succeed with high probability in temporarily shifting pi LLM's response behavior in the intended semantic direction.}
% %(3) that tau-based interventions succeed with high probability in temporarily shifting pi LLM's response behavior in the intended semantic direction. 

% % examine the relative frequency of under-specified queries among random subsets drawn from three publicly available datasets featuring conversations between humans and \copilots{}

% \subsection{} 

% from WildChat~\cite{} and OpenAsst1~\cite{} and 2~\cite{}

% in three publicly available datasets featuring conversations between humans and \copilots{}

% , responsibly collected 

% exhibits a bias toward direct response 

% %We begin by empirically demonstrating that while \copilots{} do a decent job of detecting under-specified queries in the zero-shot setting, they 

% often exhibit a \emph{bias to respond}

% they are less likely to condition their responses in a way that allows them to address this uncertainty in all but the most extreme cases of underspecification. 






%for downstream recommendation tasks, relative to baseline.

%via a learned meta-policy

% for our algorithmic approach to offer an improvement upon the baseline pi LLM in expectation, we must establish the following: (1) that LLMs are able to detect under-specified queries when they occur in synthetic settings where we can control the degree of under-specification. (2) that under-specified queries exist ``in the wild'' and can be detected in-real time as we establish in (1). (3) that tau-based interventions succeed with high probability in temporarily shifting pi LLM's response behavior in the intended semantic direction. (4) That pi-llm in the absence of interventions yields sub-optimal expected reward in a synthetic setting where we can control the degree of under-specification. (5) in the two-step setting where reward is computed with respect to the set of items that are ultimately recommended, our learned rho-based meta policy offers increased expected reward compared to a baseline in which the only constraint imposed is on the number of items to recommend.



% In this section, we conduct a series of experiments to 

% demonstrate that: (1) LLMs are able to correctly distinguish between 






Experiments we conduct to validate the LLM-based operations we perform in the main experiments:
\begin{itemize}
    \item Experiment 0.1 (GPT4): How good is our ``detect under-specification'' classifier? We need this because Experiment 1 relies on this classifier. Use the ``Can LLMs reliably detect under-specification'' results, perhaps in the appendix. % use synthetic queries
    \item Experiment 0.2 (GPT4): How good is our ``detect $\tau$'' classifier? We need this because Experiment 2 relies on this classifier. Perhaps put in the appendix.
    \item Experiment 0.3 (GPT4): If we use an LLM as a ``user model'', how do we ensure that it does not leak additional information about latent preferences or relevance? We need this because Experiments 4 and 5 rely on such user models.
\end{itemize}

\begin{itemize}
    \item Experiment 1 (GPT4): What fraction of real-world queries are under-specified? Run the ``detect under-specification'' setup on a slice of real-world queries (e.g., the OpenAssistant dataset).
    \item Experiment 2 (GPT4): How controllable is $\pi_{LLM}$ when we switch its prompt? Create Figure 4 (a histogram over $\tau$) for $\pi_{LLM}$ responses, and do the same for $\tau$-prompted LLMs. Verify that setting the system message to $\tau$ indeed produces responses of that type with high probability.
    % 
\end{itemize}

\textcolor{magenta}{I'm going to summarize the setup and drop in summary plots as we get results; text/plots are not final, just an attempt to get all the pieces in place that we can refine when we have a sense of exact narrative}
\par \textbf{Experiment 1 (GPT4): What fraction of real-world queries are under-specified?}

In this experiment, we draw a random subsample of 600 queries, denoted $C_{\text{oasst}}$, from the OpenAssistant 1 dataset~\cite{köpf2023openassistant}. This dataset consists of conversation trees, in which successive levels correspond to alternating conversational turns between a user and an ``assistant''. We are interested in under-specification at $t_0$; as such, we restrict our attention to the root node of each conversation tree, which contains the first user-issued utterance. We apply two additional inclusion criteria: a query must be classified as English and must contain at least 3 (not necessarily unique) unigrams. We then ask a helper LLM (GPT4) to review each query $q \in C_{\text{oasst}}$ and map it to exactly one predicted label, $\hat{y}$. The prompt for this task describes and distinguishes between the labels as follows (see Appendix~\ref{app:meOneClassify} for details):
\begin{itemize}[left=0pt]
    \item \textsc{Sufficient}: All important factors upon which an answer to this query might depend are sufficiently specified.
    \item \textsc{Minor under}: One or more less important factors upon which an answer to this query might depend are not specified or are unknown; however, it is possible to provide a high-quality response even without knowing these factors.
    \item \textsc{Critical under}: One or more important factors upon which an answer to this query might depend are not specified or are unknown; it is difficult to provide a high-quality response without knowing these factors.
\end{itemize}
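
\par As a sketch of how the outcome of this experiment can be summarized (the symbols $\mathcal{Y}$ and $\hat{p}_y$ are introduced here for illustration only and are not defined elsewhere), let $\mathcal{Y}$ denote the set of three labels above and let $\hat{y}(q)$ denote the label that the helper LLM assigns to query $q$. The empirical label distribution over the subsample is then
\begin{equation*}
    \hat{p}_y = \frac{1}{|C_{\text{oasst}}|} \sum_{q \in C_{\text{oasst}}} \mathbf{1}\left[\hat{y}(q) = y\right], \qquad y \in \mathcal{Y}.
\end{equation*}
Writing $\hat{p}_{\text{minor}}$ and $\hat{p}_{\text{critical}}$ for the fractions assigned to the two under-specification labels, one candidate answer to the question posed by this experiment is $\hat{p}_{\text{minor}} + \hat{p}_{\text{critical}}$; a stricter reading counts only $\hat{p}_{\text{critical}}$. Which of the two to report is a choice we assume here, not one fixed by the setup above.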

%Figure~\ref{fig:exp1_oasstLblDist} illustrates the resulting distribution over labels:
%\begin{figure}[!htb]
%\begin{center}
%    \includegraphics[width=0.7\linewidth]{uai_2024/images/exp1_gpt4_oasst_query_lbl_dist.png}
%    \caption{todo}
%    \label{fig:exp1_oasstLblDist}
%    \end{center}
%\end{figure}
% I think we should probably try to do some semantic analysis (topic modeling? another llm call? most frequently occurring ngrams; number of tokens, etc.?) to summarize/compare each of these query subsets

\par \textbf{Experiment 2 (GPT4): How controllable is $\pi_{LLM}$ when we condition on different $\tau \in \mathcal{T}$?}

In this experiment, we take our synthetic query corpus and ask each LLM that we evaluate (currently, GPT4) to generate a set of responses, each conditioned on a different $\tau \in \mathcal{T}$. We then provide a helper LLM (GPT4) with each (query, $\tau$-induced response) pair and ask it to map the response to a label $\hat{\tau} \in \mathcal{T}^\prime$, where $\mathcal{T}^\prime \coloneq (\mathcal{T} \setminus \{\text{baseline}\}) \cup \{\text{direct response}\}$. The label space includes a direct-response label because the LLM may decide to respond directly even when it is conditioned on a different $\tau$.
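
\par One way to quantify controllability from these labels, offered as a sketch under assumed notation ($Q$ for the synthetic query corpus, $r_\tau(q)$ for the response generated for query $q$ under strategy $\tau$, and $\hat{\tau}(q, r)$ for the helper LLM's label for the pair $(q, r)$; none of these symbols is defined elsewhere), is the empirical distribution of predicted labels for each conditioning strategy:
\begin{equation*}
    \hat{P}(\tau^\prime \mid \tau) = \frac{1}{|Q|} \sum_{q \in Q} \mathbf{1}\left[\hat{\tau}\big(q, r_\tau(q)\big) = \tau^\prime\right], \qquad \tau \in \mathcal{T},\ \tau^\prime \in \mathcal{T}^\prime.
\end{equation*}
For each $\tau \neq \text{baseline}$, the entry $\hat{P}(\tau \mid \tau)$ estimates how often the intervention produces a response of the intended type, while the row for the baseline gives the distribution of response types without any intervention.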







%%%%%%%% 







%%%% EXPERIMENT 0.1 (GPT4)
% How good is our "detect under-specification" classifer? This is needed because we use this classifier in Experiment 1. Use the "Can LLMs reliably detect under-specification" results, perhaps put in appendix.

% corpus: Need ground truth labels to evaluate. So, perhaps use known masking of theta and checking recovery of the extent of masking? (i.e. existing results?)
% using this classifier based on its strong performance is slightly different than the original intent which was zero-shot bc we want to see what it does without "help".
% Fair point. If 0-shot perf is already "good enough" then we can report them for this purpose also. If 0-shot is not "good enough" then agree that we may need to ensure this classifer becomes "good enough" before we run it on WildChat.

%%%% EXPERIMENT 0.2 (GPT4)
%% How good is our "detect tau" classifer? This is needed because we use this classifier in Experiment 2. Perhaps put in appendix.

%  corpus: Need ground truth labels to evaluate. Perhaps we manually annotate a few responses to assess classifier accuracy? 
% ^ makes sense; so we use synthetic dataset here and manually label responses for some subset (n=100?). 

%%%% EXPERIMENT 0.3 (GPT4)
%% If using LLM as a "User Model", how do we ensure it is not leaking additional information about latent preferences/relevances? This is needed because we use such User Models in Experiment 4,5.

% corpus: ok, for this seems like we need access to the query, full prefs, and question we are showing the LLM "user" and then to manually label whether anything other than what was asked shows up in the answer?
% that can work. So far, we can only show the prompts for the user model, which were engineered a bit to reduce chance of info leakage. But we have no assessment of if they still do/don't, so a manual annotation of info leakage can help here.
% not sure if this would work for news data bc not quite clear how we are making full vs underspecified query there but for synthetic we can also rig the Q/A process st the LLM "user" only sees the true pref relevant to the question asked
% partly why Tobias was restricting news rec clarifications to be yes/no single-sentence questions (to reduce chance of info leakage about the gold document).

% How prevalent is the under-specification problem?
% Answer in 2 parts: how often are queries under-specified (expt1), and how sub-optimal is pi_LLM for them (expt3).

%%%% EXPERIMENT 1 (GPT4)
%% What fraction of real-world queries are under-specified? --> run the "detect under-specification" setup on a slice of real-world queries (e.g. WildChat dataset).

% questions: do we get ground-truth labels? or are we justifying using it unsupervised here b/c of experiment 0.1? will we just consider user turn0 from wildchat? (eg, the initial query)
% answer: If we can run Expt 0.1, then we just consider turn0 of WildChat and appeal to generalization of classifier quality from Expt 0.1

%%%% EXPERIMENT 2 (GPT4)
%% How controllable is the pi_LLM by switching the prompts for it?
%% Create Figure 4 (histogram of tau's) for pi_LLM responses. Do this also for tau-prompted LLMs. Verify that setting the sys message to tau indeed produces responses of that type w.h.p.

% corpus: synth reco dataset?

% also note: the labels for responses induced by taus not necessarily equal to the tau space, because there is the possibility the llm decides no clarification is needed and responds directly, even if tau == clarify
% exactly! In the worst case, tau really doesn't nudge the LLM all that much. This expt will show how steerable the LLM policy is through the prompts. 
% ^ agree. though, would also say: graceful "recovery" or "degradation" of tau toward baseline (ie , direct response) as degree of specification approaches "full" is a desirable property of the intervention.
% Do we expect the tau to have that graceful behavior, or do we expect the meta-policy (rho) to place more and more probability mass on "default" (i.e. empty prompt) as queries become more fully specified?
% yes, although we could nudge it more explicitly. ie, the clarify prompt right now says: ""If the question is under-specified (i.e. the answer depends on many factors that have not been specified), ask the user about the most relevant factors so that you will be able to produce a good answer."  <- we don't tell it explicitly what to do if the question is *not* under-specified.
% From the current exposition point of view, it seems easier to argue that tau=Clarify will always clarify, and that rho will learn to put more probability on tau=default when appropriate. From performance of the approach point of view, seems like tau=Clarify gracefully reverting to default behavior (pi_LLM which is far more expressive policy than turn-wise rho can decide the interpolation) can be better.
% yeah, we would need to change the prompt text if we want it to *always* clarify. we previously had something like, always ask vs. something closer to the current clarify, which is like "ask when needed".
% we've decided to revise tau texts s.t. the 'context-aware' aspect is removed and behavior should be deterministic  for the most part. this way we shift the 'context-aware' part of the argument to the learned meta-policy. 

%%%% EXPERIMENT 3 (Many different LLMs)
% How sub-optimal is pi_LLM on queries where we KNOW clarification is mandatory? --> consider a toy example of an extreme shift (uses coverage of clarification facets dataset)  run the "detect headroom for improvement" on carefully constructed query set where tau=Clarify is guaranteed to be the optimal action

% corpus?
% what do we mean by 'extreme shift' here? like, some subset of synthetic dataset with high degree of under-spec?
% 'extreme shift' refers to the artificial space of theta's and the uniform distribution over those thetas that we use as the query/user population in our synth expt --- this may be far from what real-world users prefer or ask about (e.g. 99% of real-world users may want one type of plantrec theta). Just a reminder to us that this is an artificially generated extreme shift case, but one where we are reasonably sure that CLARIFY can produce an improvement over pi_LLM.

%%%% EXPERIMENT 3/ALTERNATIVE 1 %%% planning to do this one, w/synthetic queries and tau texts that map q to specific action whp
%% How sub-optimal is pi_LLM on under-specified queries? --> (needs synthetic utility function about relevance of recommended item; not the coverage of clarification facets || also, needs user model) run the "detect headroom for improvement" setup on the fraction of under-specified queries identified from Expt1 AND/OR on under-specified queries generated from the Theta-masking process across 3 recommendation domains.
% --> also establishes usefulness of the tau space.

% this is the two-step experiment we've previously discussed 
% Not quite, that is expt4 below. Here we would need to evaluate all possible action sequences of rho and pick the highest utility sequences to report the "headroom for improvement" no?
% ah, agree. yeah, this is the step that makes exp4 possible



%%%% EXPERIMENT 3/ALTERNATIVE 2
%% How sub-optimal is pi_LLM on real-world queries? --> Take the results of experiment 1. Take the most confident predictions of "UNDERSPECIFIED". Pipe those through pi_LLM, classify the response types using Expt 0.2 "detect tau" classifer. We are arguing that ideally the response type for these queries is Clarify. ???

% REDO EXPERIMENT 3 across several LLMs --- GPT4, Llama2, ...

% Ok, so you have convinced me that pi_LLM is suboptimal, and it is a prevalent problem. Does the proposed fix actually work?

%%%% EXPERIMENT 4 (Many different LLMs)
%% Does the rho meta-policy result in a policy that is better than the default policy? 

% REDO EXPERIMENT 4 across several LLMs --- GPT4, Llama2, ...

%%%% EXPERIMENT 5 (GPT4)
%% Repeat experiment 4 on news recommendation task. Benefit of news recommendation is that it does not require specifying a toy preference space theta, while still being a realistic problem (i.e. retrieval).



% I think we actually move this last one as a more 'evaluation'-style experiment after proposing the method, since 1-3 above should help to motivate it.
\subsection{Does intervention to promote uncertainty reduction improve performance on downstream tasks?}
% What is the impact of uncertainty reduction on downstream performance (i.e., item recommendation)?}
\label{sec:}










Options:

Revisit Experiment 2.
\begin{itemize}
\item Two-step experiment with masked queries. Utility: whether the eventually recommended item satisfies the fully specified user preferences.
\end{itemize}
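
\par A minimal way to write this utility down, assuming notation that the notes above do not fix ($\theta$ for the fully specified user preferences and $a$ for the item recommended at the end of the second step), is the indicator
\begin{equation*}
    U(a, \theta) = \mathbf{1}\left[a \text{ satisfies } \theta\right],
\end{equation*}
so that the expected utility of a policy is the fraction of masked queries for which the final recommendation satisfies the user's full preferences.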

Real-world data.
\begin{itemize}
\item Need a dataset/candidate set that has not already been ingested into the LLM, but that can fit into the context window.
\item Some degree of redundancy/overlap between candidate items.
\item Realistic query distribution. Ideas for generating one:
\begin{itemize}
    \item Use referring search queries for web pages (e.g., what did people search for to reach this page?).
    \item Pick a target item and then work backwards, instructing an LLM to generate initial queries for it.
\end{itemize}
\item Algorithm:
\begin{enumerate}
\item Issue the (first) query and let the system respond.
\item If the target item is in the output, stop. Otherwise, answer the system's question via a user model that can answer only general questions about the target article, not idiosyncratic ones.
\item Repeat the two steps above for at most $K$ rounds. Utility is measured as the number of questions asked plus the number of incorrect items returned before the correct one (lower is better).
\end{enumerate}
\end{itemize}
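
\par The utility in the algorithm above can be sketched as follows; the symbols are assumptions introduced for illustration. For a single episode, let $n_{\text{q}}$ be the number of questions the system asks and $n_{\text{inc}}$ the number of incorrect items it returns before the target item. The quantity described above is then a cost,
\begin{equation*}
    c = n_{\text{q}} + n_{\text{inc}},
\end{equation*}
accumulated over at most $K$ rounds. For example, a system that asks two questions and returns one incorrect item before the target item incurs $c = 2 + 1 = 3$, whereas a system that returns the target item immediately incurs $c = 0$. The notes above do not say how to score an episode in which the target item is never returned within $K$ rounds; that case would need a separate convention.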

