Abstract: A peculiarity of conversational search systems is that they involve mixed initiatives, such as system-generated query clarifying questions. Evaluating such systems at a large scale on the end task of IR is very challenging and requires adequate datasets containing such interactions. However, current datasets focus only on traditional ad-hoc IR tasks or on query clarification tasks, the latter usually being seen as a reformulation of the initial query.
Only a few datasets are known to contain both document relevance judgments and the associated clarification interactions, such as Qulac and ClariQ. Both are based on the TREC Web Track 2009-12 collection but cover a very limited number of topics (237 topics), far from enough for training and testing conversational IR models.
To fill this gap, we propose a methodology to automatically build large-scale conversational IR datasets from ad-hoc IR datasets in order to facilitate exploration of conversational IR.
Our methodology is based on two processes: 1) generating query clarification interactions through query clarification and answer generators, and 2) augmenting ad-hoc IR datasets with simulated interactions.
In this paper, we focus on MsMarco and augment it with query clarification and answer simulations. We perform a thorough evaluation showing the quality and the relevance of the generated interactions for each initial query. This paper shows the feasibility and utility of augmenting ad-hoc IR datasets for conversational IR.
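To make the two-step methodology concrete, the sketch below shows how a single ad-hoc query could be augmented with a simulated clarification exchange. It is only a minimal illustration assuming Hugging Face text2text pipelines, with `google/flan-t5-base` as a stand-in for the paper's actual clarification and answer generators; prompts and model choices are assumptions, not the authors' implementation.

```python
# Minimal sketch of the two-step augmentation, NOT the authors' actual pipeline:
# model checkpoints and prompts below are placeholders.
from transformers import pipeline

# Stand-ins for the paper's clarification and answer generators.
cq_generator = pipeline("text2text-generation", model="google/flan-t5-base")
answer_simulator = pipeline("text2text-generation", model="google/flan-t5-base")

def augment(query: str, relevant_passage: str) -> dict:
    """Attach a simulated clarification exchange to one ad-hoc IR query."""
    # Step 1: generate a clarifying question conditioned on the query.
    cq = cq_generator(f"Ask a clarifying question for the query: {query}",
                      max_new_tokens=32)[0]["generated_text"]
    # Step 2: simulate the user's answer, grounded in the relevant passage
    # (used as a proxy for the search intent).
    ans = answer_simulator(f"question: {cq} context: {relevant_passage}",
                           max_new_tokens=32)[0]["generated_text"]
    return {"query": query, "clarifying_question": cq, "answer": ans}
```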
Submission Length: Long submission (more than 12 pages of main content)
Changes Since Last Submission: ### Section 4.2.2 Human Evaluation (Add)
```tex
Concerning accuracy, we acknowledge that it may be low. We observed that this discrepancy may arise from questions labeled as non-natural. The main limitation comes from the automatic facet extraction: some keywords may not be representative of the document topics. Regarding the kappa, the disagreement may stem from various causes; naturalness may play an important role in the judgments, as observed in Table \ref{table:evalH}. Additionally, annotators must always choose a preferred question, even if none of the proposed questions is actually useful given the query or if the clarifying questions are similar, thus adding noise to the agreement.
\begin{table}[t]
\centering
\begin{tabular}{ p{\linewidth} }
\toprule
\textbf{Examples for human evaluation}\\
\midrule
\textbf{Query}: webster family definition \\
\textbf{CQ1}: Are you looking for Noah Webster (1758-1843) lexicographer? \\
\textbf{CQ2}: Would you like to know more about Webster family definition? \\
\textbf{CQ3}: Are you referring to the lexicographer Noah Webster (1758-1843)? \\
\textbf{Passage}: Noah Webster (1758-1843) was a lexicographer and a language reformer. He is often called the Father of American Scholarship and Education. In his lifetime, he was also a lawyer, schoolmaster, author, newspaper editor, and an outspoken politician. \\ \midrule
\textbf{Query}: what is venous thromboembolism\\
\textbf{CQ1}: Would you like to know more about venous thromboembolism?\\
\textbf{CQ2}: Would you like to know more about venous thromboembolism? \\
\textbf{CQ3}: Are you looking for venous thromboembolism?\\
\textbf{Passage}: Venous thromboembolism (VTE) is the formation of blood clots in the vein. When a clot forms in a deep vein, usually in the leg, it is called a deep vein thrombosis or DVT. If that clot breaks loose and travels to the lungs, it is called a pulmonary embolism or PE. Together, DVT and PE are known as VTE - a dangerous and potentially deadly medical condition. \\
\bottomrule
\end{tabular}
\caption{Examples of clarifying questions shown to human evaluators, who must select the most natural, relevant, and useful question.}
\label{table:evalH}
\end{table}
```
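For reference, agreement on the forced "preferred question" choice can be measured as in the short sketch below. This is only an illustration of the kappa computation with made-up annotations, not the paper's evaluation code.

```python
# Illustration of pairwise annotator agreement on the preferred clarifying
# question; the annotations below are invented for the example.
from sklearn.metrics import cohen_kappa_score

# Each entry is the clarifying question (CQ1/CQ2/CQ3) picked for one query.
annotator_a = ["CQ1", "CQ3", "CQ2", "CQ1", "CQ2"]
annotator_b = ["CQ1", "CQ1", "CQ2", "CQ3", "CQ2"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")
```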
### Introduction (Add)
```tex
The resulting dataset is called MIMarco for Mixed Initiative MsMarco.
```
### Section 5.4 (Replace)
Remove
```tex
For $30.3\%$ of the queries, BM25+CLART5 obtains a better MRR@10 while $11.1\%$ obtain a lower MRR@10.
```
Add
```tex
Out of all the queries, BM25+CLART5 achieves a superior MRR@10 for $30.3\%$ of them, whereas it yields a lower MRR@10 for $11.1\%$ of the queries.
```
### Section 3.3 (Add)
```tex
We acknowledge that this choice relies on the strong hypothesis that the annotated passages represent the user's search intent; we discuss this further in Section \ref{conclusion}.
```
### Table 9
Add the missing values in Table 9.
### Reference to Table 10
Add reference to Table 10 in the text: "as well as examples of failure cases in Table \ref{table:multi_quaali_fail}."
### Baseline in Table 7 (Add)
Add a baseline to Table 7: MonoT5 without fine-tuning but with interaction data, and reference it in the text.
```tex
We compare MonoT5's performance with and without interactions. Interactions are added to the model's context, following the same pattern as for CLART5, as shown in Equation \ref{clart5:context}.
```
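As a rough illustration of this baseline, the sketch below shows one way the clarification exchange could be concatenated into the standard MonoT5 prompt. The exact context pattern is defined by Equation \ref{clart5:context} in the paper, so the template here is only indicative.

```python
# Hedged sketch: folding a clarification exchange into the MonoT5 input.
# The concatenation scheme is an assumption; the paper's pattern is given
# by Equation \ref{clart5:context}.
def monot5_input(query: str, cq: str, answer: str, passage: str,
                 with_interaction: bool = True) -> str:
    context = f"{query} {cq} {answer}" if with_interaction else query
    # Standard MonoT5 prompt (Nogueira et al., 2020); the model scores the
    # probability of generating "true" vs. "false" after "Relevant:".
    return f"Query: {context} Document: {passage} Relevant:"

print(monot5_input("what is venous thromboembolism",
                   "Would you like to know more about venous thromboembolism?",
                   "yes, a definition of the condition",
                   "Venous thromboembolism (VTE) is the formation of blood clots in the vein."))
```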
Assigned Action Editor: ~Edward_Grefenstette1
Submission Number: 2103