\section{Limitations}
\label{sec:limitations}

While our empirical results demonstrate that both of our proposed interventions improve expected conversation-level utility when queries are under-specified, the way we have modeled user-chatbot interactions has several limitations. First, our model assumes that users are both \emph{willing} and \emph{able} to answer clarification questions when asked: that is, that they will (1) ``tolerate'' the questions with high probability (i.e., will not defect by exiting the conversation), and (2) truthfully reveal their preferences. In practice, the propensity and ability to answer will vary among users and across query intent domains (e.g., due to personal preferences or epistemic uncertainty regarding a specific topic).

In our empirical results, the optimistic nature of these assumptions is offset by the conservative nature of the information gain we consider: LLM questions often ask for more granularity about already-revealed $\intent$s, and while real users would often be able to provide such detail, our lossy, parameterized approximation cannot. As such, any improvement in expected reward associated with sequential response strategies that incorporate uncertainty reduction at $t_0$ may be underestimated. We have focused on undiscounted expected utility maximization, but introducing a discount rate would be one way to capture heterogeneity with respect to question tolerance.
Human validation of our proposed interventions will also be critical: while the interventions are well-motivated from an information-theoretic perspective, for some users, the marginal improvement in expected utility may not outweigh the cognitive cost of answering questions.
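
As a minimal sketch of the discounting idea, and not part of the model evaluated here, suppose that $r_k$ denotes the reward at the $k$-th turn after $t_0$, that $K$ is the conversation horizon, and that $\gamma \in (0, 1]$ is a discount factor; all three symbols are illustrative notation introduced only for this example. The discounted conversation-level utility could then be written as
\[
U_{\gamma} = \mathrm{E}\left[ \sum_{k=0}^{K} \gamma^{k} r_{k} \right],
\]
where $\gamma = 1$ recovers the undiscounted objective. If we further assume that a clarifying question at $t_0$ earns no reward itself and defers the substantive response by one turn, then clarifying would be preferred only when $\gamma$ times the expected reward after clarification exceeds the expected reward of answering directly. Under this sketch, a smaller $\gamma$ would correspond to a less question-tolerant user.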

Additionally, while we have relied on helper LLMs to classify queries and responses (i.e., with respect to under-specification and response strategy), human validation of these classifiers is an important next step. We have considered a relatively restricted intent domain, but in more general settings, reasonable annotators may disagree about whether a query is under-specified when they do not have access to ground-truth $\intent$. Relatedly, we have focused on a recommendation setting (i.e., movie recommendations) that admits objective computation of utility; extending our approach to intents characterized by more subjective evaluation criteria may require alternative approaches to modeling utility.

In the data-based intervention outlined in Section~\ref{sec:dataBasedIntervention}, we have assumed that historical conversation logs are representative of the user population and of the joint distribution over users and queries seen in the online setting. This assumption may be violated in practice, with potentially negative consequences for meta-policy performance. Our estimates of the prevalence of query under-specification may also contain artifacts (e.g., due to small sample size and non-stationarity of the user population).

Finally, we have made assumptions regarding the prompt-based steerability of LLMs, along with the ability of LLMs to select ``good'' clarification questions when prompted to clarify. Empirically validating these assumptions on a broad set of LLMs and studying the generation and selection of questions that maximize marginal information gain are important directions for future work.

% Query under-specification, in particular, is an area where it is reasonable to expect relatively high levels of inter-annotator disagreement 




