\section{Related Work}
\label{sec:relatedWork}

Even though LLMs are powerful conversationalists and recommenders~\citep{he2023large}, they have many failure modes~\citep{borji2023categorical}, such as generating hallucinations or failing to complete more complex reasoning tasks~\citep[Section 8]{bubeck2023sparks}. Regarding LLM-powered conversations that require stronger collaboration between two parties, \citet{lin2023decisionoriented} introduce the concept of ``decision-oriented dialogues'' and show that current LLMs are still far from human performance. In this paper, we investigate a specific cause of such failures (query under-specification) and show how LLMs can be improved to handle it.

We conjecture that LLMs' poor handling of under-specified queries is an artifact introduced or amplified during post-training and alignment workflows such as reinforcement learning from human feedback (RLHF)~\citep{NEURIPS2022_b1efde53}. In RLHF, LLMs are fine-tuned to output results that align with the preferences of annotators. Status quo approaches focus on pairwise comparisons of single-step responses to a given input query. As such, well-specified and/or simpler queries that admit multiple possible, high-quality responses \emph{without} the need for clarification questions may be over-represented during fine-tuning. Additionally, when annotators \emph{do} encounter under-specified queries, their preferences about how to handle ambiguity may differ in meaningful ways from those of end-users, skewing the learned policy. For example,~\citet{singhal2023long} observe that annotators tend to prefer longer responses (which help to ``cover all bases'' when queries are under-specified) relative to end-users, who must bear the cognitive cost of LLM verbosity. Annotators may also provide feedback they feel is ``expected'' of them that diverges from their true conversational preferences~\citep[due to the Hawthorne effect; see][]{mccambridge2014systematic}.
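
As a sketch of what pairwise comparison of single-step responses typically amounts to, reward models are commonly fit with a Bradley--Terry-style preference model. The symbols below ($x$, $y_w$, $y_l$, $r_\phi$, $\sigma$) are illustrative notation that we introduce here, not necessarily that of the cited works:
\[
P(y_w \succ y_l \mid x) = \sigma\bigl(r_\phi(x, y_w) - r_\phi(x, y_l)\bigr),
\]
where $x$ is a query, $y_w$ and $y_l$ are the preferred and dispreferred responses, $r_\phi$ is a learned scalar reward, and $\sigma$ is the logistic function. Under this formulation, each comparison scores a single response to a single query, so the benefit of a clarifying question, which is only realized in later turns, is not directly represented.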

Query under-specification has been studied and addressed in information retrieval~\citep{dang2010query,azad2019query}. There are two broad families of approaches: algorithmic and user-centric. Algorithmic approaches include query expansion~\citep{azad2019query} and query reformulation~\citep{dang2010query}. User-centric approaches focus on asking good clarifying questions~\citep{rao-daume-iii-2018-learning,majumder2021ask}. 
Hybrid approaches are possible: for instance,~\citet{diao2023active} use active learning to determine what questions to ask in an LLM's context window so as to improve its reasoning. 
We take a user-centric approach of seeking clarification, and rely on a suitably prompted LLM (rather than a separate active learning policy) to discover appropriate questions to ask. 

We showed that LLMs are misaligned when queries are under-specified. 
Others have shown misalignment for other reasons (e.g., toxicity~\citep{bai2022constitutional}) and studied better ways to align LLMs.
There are two kinds of approaches to align LLMs better: fine-tuning (e.g., DPO~\citep{rafailov2023direct}, KTO~\citep{ethayarajh2024kto}, and RLHF~\citep{NEURIPS2022_b1efde53}) and prompt injection (e.g., Constitutional AI~\citep{bai2022constitutional} and meta-prompting~\citep{qin2021learning}). We take the latter approach and extend the meta-prompting of~\citet{qin2021learning} to work not only with soft prompts but also with natural language prompts and black-box LLMs.

Our proposed interventions rely on asking users clarification questions. User studies conducted with search engines~\citep{zamani2020analyzing} and pre-LLM conversation systems~\citep{christakopoulou2016towards} demonstrated that users \emph{do} engage with clarifying questions in those contexts. Conducting user studies in \copilots~to assess users' propensity to answer questions is an exciting avenue for future work.

We frame the conversation between a user and a chatbot as a PODP, which is mathematically equivalent to a partially observable Markov decision process (POMDP)~\citep{littman2009tutorial}. Others have framed the interactions as multiple rounds of bandit interactions~\citep{zuo2022hierarchical}, but as we argued before, single-turn utility maximization is too myopic for multi-turn conversational outcomes. Thus, we adapt solution concepts from POMDPs, such as Q-learning~\citep{watkins1992q} and information gathering~\citep{sadigh2016information}, for use with LLM-induced policies.
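
For reference, the standard tabular Q-learning update can be sketched as follows. The symbols ($s_t$, $a_t$, $r_t$, $\alpha$, $\gamma$) are generic textbook notation introduced here for illustration and are not specific to our formulation:
\[
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \Bigl[ r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \Bigr],
\]
where $\alpha \in (0, 1]$ is a learning rate and $\gamma \in [0, 1)$ is a discount factor. In a partially observable setting, the state $s_t$ is typically replaced by a belief state or by the observed history, which in a conversation would correspond to the dialogue so far. The $\gamma \max_{a'} Q(s_{t+1}, a')$ term is what distinguishes this from single-turn utility maximization: it credits an action, such as asking a question, for value obtained in later turns.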
