\begin{figure}
    \centering
    \subfigure[Average SUS score. The dashed line marks a score of 68, above which a system is considered usable \citep{Brooke1996SUS:Scale}.]{\label{fig:average-sus}%
        \includegraphics[width=0.5\textwidth]{Image/average_sus.jpg}
    }%
    %  \begin{subfigure}{0.5\textwidth}
    %      \centering
    %      \includegraphics[width=\textwidth]{Image/average_sus.jpg}
    %      \captionsetup{width=.95\textwidth}
    %      \caption{}
    %      \label{fig:average-sus}
    %  \end{subfigure}%
    % 
    \subfigure[Average ratio of timely and precise cues. The error bars indicate standard error across sessions.]{\label{fig:average-timely-precise}%
        \includegraphics[width=0.5\textwidth]{Image/average_timely-precise.jpg}
    }%
    % \begin{subfigure}{0.5\textwidth}
    %     \centering
    %     \captionsetup{width=.8\textwidth}
    %     \includegraphics[width=\textwidth]{Image/average_timely-precise.jpg}
    %     \caption{}
    %     \label{fig:average-timely-precise}
    % \end{subfigure}
    % 
    \caption{Measurements averaged over all sessions.}
    \label{fig:average-measurement-comparison}
\end{figure}

\begin{figure}
    \centering
    \subfigure[Gaze is available]{\label{fig:btx-distri-with-gaze}%
        \includegraphics[width=0.5\textwidth]{Image/cue-distri-wizard-algo-with-gaze.jpg}
    }%
    %  \begin{subfigure}{0.5\textwidth}
    %      \centering
    %      \includegraphics[width=\textwidth]{Image/cue-distri-wizard-algo-with-gaze.jpg}
    %      \captionsetup{width=.8\textwidth}
    %      \caption{}
    %      \label{fig:btx-distri-with-gaze}
    %  \end{subfigure}%
    %  
    \subfigure[Gaze is not available]{\label{fig:btx-distri-without-gaze}%
        \includegraphics[width=0.5\textwidth]{Image/cue-distri-wizard-algo-without-gaze.jpg}
    }%    
    %  \begin{subfigure}{0.5\textwidth}
    %      \centering
    %      \captionsetup{width=.8\textwidth}
    %      \includegraphics[width=\textwidth]{Image/cue-distri-wizard-algo-without-gaze.jpg}
    %      \caption{}
    %      \label{fig:btx-distri-without-gaze}
    %  \end{subfigure}
    \caption{Distribution of cues, computed by dividing the number of cues at each level by the total number of cues from all sessions. The ratio of timely and precise cues at each level is shown by the circles.}
    \label{fig:btx-distri-plot}
\end{figure}

\section{Results}\label{sec:result}
\figureref{fig:average-measurement-comparison} shows the SUS score and the per-session ratio of cues labeled both timely and precise, each averaged across all sessions. When gaze is available, our framework achieves performance similar to that of the Wizard Group on both criteria, supporting Hypothesis 1. The bar plot of cue distributions computed across sessions in \figureref{fig:btx-distri-with-gaze} indicates that our framework, like the human wizard, generates diverse cues across different levels, and that it does so with good timing and precision, as reflected by the circles representing the ratio of timely and precise cues at each level.
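
For concreteness, the reported quantities can be written as follows. The notation is introduced here purely for illustration and is an assumption, not necessarily the exact computation behind the figures. Let $S$ be the number of sessions, $n_{s,\ell}$ the number of cues at level $\ell$ in session $s$, $n_s = \sum_{\ell} n_{s,\ell}$ the total number of cues in session $s$, and $m_s$ the number of those cues labeled both timely and precise. The per-session ratio and its average across sessions are then
\[
    r_s = \frac{m_s}{n_s}, \qquad \bar{r} = \frac{1}{S} \sum_{s=1}^{S} r_s .
\]
Assuming the error bars use the usual standard error of the mean with the sample standard deviation, they correspond to
\[
    \mathrm{SE} = \frac{1}{\sqrt{S}} \sqrt{\frac{1}{S-1} \sum_{s=1}^{S} \left( r_s - \bar{r} \right)^2} .
\]
The cue distribution pools cues over all sessions, so the share of cues at level $\ell$ is
\[
    d_\ell = \frac{\sum_{s=1}^{S} n_{s,\ell}}{\sum_{s=1}^{S} n_s}, \qquad \sum_{\ell} d_\ell = 1 .
\]
Finally, under the standard SUS scoring rule \citep{Brooke1996SUS:Scale}, a respondent's ratings $q_1, \dots, q_{10} \in \{1, \dots, 5\}$ are mapped to a score between 0 and 100 by
\[
    \mathrm{SUS} = 2.5 \left[ \sum_{i \in \{1,3,5,7,9\}} (q_i - 1) + \sum_{i \in \{2,4,6,8,10\}} (5 - q_i) \right] .
\]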

To validate Hypothesis 2, we first make a within-group comparison for the Automated Group. \figureref{fig:average-measurement-comparison} shows that, without gaze, the framework falls below the usability threshold (SUS below 68) and the quality of guidance drops significantly. The cause is twofold. First, without gaze the framework only provides guidance at the action level $A_N$ (\secref{sec:generic state estimate}), leading to the skewed distribution of cues in \figureref{fig:btx-distri-without-gaze}. Second, comparing the filled and empty circles for the ``Joint Actuation'' level in Figures \ref{fig:btx-ratio-precise} and \ref{fig:btx-ratio-timely} shows that, without gaze, the timing of guidance degrades more dramatically than its precision. Thus, poor cue timing is the primary reason for the degradation without gaze. Intuitively, this is expected: if only observations of arrow manipulation are available, the framework cannot tell whether the subject is off track until s/he makes drastically wrong movements, so cues come too late. One example is given in \href{https://drive.google.com/file/d/16VCQUJO8bzgxvKf7ro9K_rVtIHLNzBwo/view?usp=sharing}{this demo}.

Qualitatively similar, but smaller, performance drops are seen in the Wizard Group. Although usability is maintained (\figureref{fig:average-sus}), the quality of guidance drops significantly (\figureref{fig:average-timely-precise}). As with our framework, the blue circles in \figureref{fig:btx-ratio-plot} reveal the timeliness of guidance to be the limiting factor of performance. This is especially clear for cues at the pose mimicry task level, which is not surprising: since the pose mimicry task sits at the highest level of the hierarchy, it is the most subject to ambiguity given only observations of arrow manipulations. Indeed, the wizard remarked that s/he had to adopt a strategy similar to that of our framework when gaze information was absent, waiting until the subject made clearly incorrect or unexpected arrow manipulations before providing a cue at the pose mimicry level. \href{https://drive.google.com/file/d/1WBEmvSd3zE248x2CijQSJd9WC91KTIhs/view?usp=sharing}{This video} shows one such occasion from a Wizard Group session. This anecdotal evidence further supports Hypothesis 2.

\begin{figure}
    \centering
    \subfigure[Timely and precise.]{\label{fig:btx-ratio-timely-precise}%
        \includegraphics[width=0.5\textwidth]{Image/between-group-timely-precise.jpg}
    }%    
    %  \begin{subfigure}{0.5\textwidth}
    %      \centering
    %      \includegraphics[width=\textwidth]{Image/between-group-timely-precise.jpg}
    %      \captionsetup{width=.8\textwidth}
    %      \caption{}
    %      \label{fig:btx-ratio-timely-precise}
    %  \end{subfigure}%
    \subfigure[]{\label{fig:btx-ratio-legend}%
        \includegraphics[width=0.5\textwidth]{Image/between-group-legend.jpg}
    }     
    %  \begin{subfigure}{0.5\textwidth}
    %      \centering
    %      \captionsetup{width=.8\textwidth}
    %      \includegraphics[width=\textwidth]{Image/between-group-legend.jpg}
    %      \label{fig:btx-ratio-legend}
    %  \end{subfigure}  
    \subfigure[Precise.]{\label{fig:btx-ratio-precise}%
        \includegraphics[width=0.5\textwidth]{Image/between-group-precise.jpg}
    }%  
    %  \begin{subfigure}[b]{0.5\textwidth}
    %      \centering
    %      \includegraphics[width=\textwidth]{Image/between-group-precise.jpg}
    %      \captionsetup{width=.8\textwidth}
    %      \caption{}
    %      \label{fig:btx-ratio-precise}
    %  \end{subfigure}%
    \subfigure[Timely.]{\label{fig:btx-ratio-timely}%
        \includegraphics[width=0.5\textwidth]{Image/between-group-timely.jpg}
    }%  
    %  \begin{subfigure}[b]{0.5\textwidth}
    %      \centering
    %      \captionsetup{width=.8\textwidth}
    %      \includegraphics[width=\textwidth]{Image/between-group-timely.jpg}
    %      \caption{}
    %      \label{fig:btx-ratio-timely}
    %  \end{subfigure}
    \caption{Ratio of timely and/or precise cues per type.}
    \label{fig:btx-ratio-plot}
\end{figure}


We also make a between-group comparison for the condition in which gaze is not available. \figureref{fig:btx-distri-without-gaze} shows that, unlike our framework, the human wizard still provides cues at different levels in the absence of gaze. \figureref{fig:btx-ratio-plot} shows that the wizard provides higher-quality cues than our framework, especially in terms of timeliness. This is expected given the simplified movement models $P(\{c^t\}_{t=1}^{T} \mid x_i)$.

