\section{Discussion}\label{sec:discussion}

Our framework assumes that intention variables at each level are defined only for the optimal branch, even though alternative plans leading to the same goal exist. While our framework can adapt to changes in the plan, re-planning does not occur until the person has finished executing a task. If the human agent follows a sub-optimal (but equally viable) plan, our framework might therefore provide unwanted guidance. This assumption also leaves little room for creativity, e.g., if the agent wishes to exclude some part of the recipe based on preference.

Our framework also assumes that intention variables at different levels are independent, and that motion $\{c^t\}_{t=1}^{T}$ and intent $x_i$ are independent for non-leaf tasks. The effects of these oversimplifications are especially clear when gaze is not available: the distributions of cues generated by our framework and by the human wizard are quite different (\figureref{fig:btx-distri-without-gaze}). Unlike the human wizard, our framework generates no cues at the higher levels of the hierarchy.

To address these shortcomings, we could (and should) use a more complex inference model that incorporates more information from within and outside the optimal branch. Much information about high-level intent can be extracted from observed movements by reasoning through the tree. For example, if the subject performs a motion uniquely associated with a leaf-node task that appears in only one higher-level action, this provides strong evidence that the subject intends to perform that higher-level action.
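
As a minimal sketch of this kind of reasoning (an illustration under assumptions that our current framework does not make), let $x_h$ denote a high-level intent and $x_\ell$ a leaf-level intent, both hypothetical symbols introduced only for this example, and suppose that the motion depends on $x_h$ only through $x_\ell$. Then
\[
p\bigl(x_h \mid \{c^t\}_{t=1}^{T}\bigr) \;\propto\; p(x_h) \sum_{x_\ell} p(x_\ell \mid x_h)\, p\bigl(\{c^t\}_{t=1}^{T} \mid x_\ell\bigr).
\]
If a leaf task $\ell^\ast$ appears under only one high-level action $h^\ast$, then $p(x_\ell = \ell^\ast \mid x_h = h) = 0$ for all $h \neq h^\ast$. When the observed motion is likely only under $\ell^\ast$, the sum is negligible for every $h \neq h^\ast$, and the posterior concentrates on $h^\ast$.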

Despite their shortcomings, these dramatic oversimplifications serve a positive purpose in highlighting the power of the gaze cue. When gaze is available, our framework generates cues nearly as well as the human wizard (\figureref{fig:average-measurement-comparison} and \figureref{fig:btx-distri-with-gaze}). This suggests that little may be gained from additional observation of task-related behavior. This is consistent with the impressions of the human wizard, who remarked that providing guidance was difficult without gaze because deducing higher-level intent became more difficult. In sessions where gaze was available, the wizard relied heavily on the subject's gaze patterns to estimate high-level intention, as reflected in \href{https://drive.google.com/file/d/1BPI4Es8BWKtsEDW6-Uwtl5gVKG__4_oB/view?usp=sharing}{this video}. Interestingly, in the above example the wizard identifies the subject's wrong intention at the pose-mimicry level in a way close to what our framework does in \href{https://drive.google.com/file/d/1GqhQlm7RdSrZ6M7uEwHzzXqm0RQuXtaU/view?usp=sharing}{a similar situation}. The efficacy of gaze allows our framework to compensate for its flawed movement model and to achieve human-level performance.

The HTN facilitates automatic gaze model construction, as we can identify the entities involved in each task from its task and method definitions. Although this specifies the spatial locations likely to be gazed at during task performance, it does not include information about the temporal components of the gaze trajectory, which we modelled using Markov chains. In this work, we hand-tuned the transition parameters to encode our empirical finding that the agent looks at irrelevant entities more frequently when executing the wrong task. Future work should address automated construction of the gaze models, as well as the extension to continuous, rather than discrete, gaze distributions.
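
For concreteness, one possible parameterization of such a chain is sketched below. The symbols $\mathcal{E}$, $\mathcal{R}_k$, $\epsilon_k$, and $g^t$ are hypothetical, introduced only for this illustration, and the form shown is not necessarily the hand-tuned one used in this work. Let $\mathcal{E}$ be the set of all entities, $\mathcal{R}_k \subset \mathcal{E}$ the entities that the HTN associates with task $k$, and $g^t \in \mathcal{E}$ the entity gazed at in time step $t$. Assuming that both $\mathcal{R}_k$ and its complement are non-empty, a transition model with a single parameter $\epsilon_k \in (0,1)$ is
\[
P_k(g^{t+1} = e' \mid g^{t} = e) = \left\{ \begin{array}{ll} (1-\epsilon_k)/|\mathcal{R}_k| & \mbox{if } e' \in \mathcal{R}_k, \\ \epsilon_k/|\mathcal{E}\setminus\mathcal{R}_k| & \mbox{otherwise,} \end{array} \right.
\]
so that each row sums to one and $\epsilon_k$ is the probability of glancing at an irrelevant entity. In this simplest case the next fixation does not depend on the current one; a richer model would let the rows differ. The likelihood of a gaze sequence is then $p(g^1)\prod_{t=1}^{T-1} P_k(g^{t+1} \mid g^{t})$. Under this sketch, the empirical finding above would be encoded by choosing a larger $\epsilon_k$ for the model of an agent executing the wrong task than for the model of an agent executing task $k$.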

Some problems relating to cue timeliness and precision might be addressed by incorporating some form of Theory-of-Mind into the framework to model agent state and the effect of past history \citep{Devin2016AnExecution, Favier2022RobustState}. \figref{fig:btx-ratio-timely} indicates that the timeliness of guidance from our framework is comparable to that from the wizard for lower-level tasks, but worse at the highest level (pose mimicry). We suspect that the main reason is that our framework cannot adapt to the agent's pace of execution, whereas the human wizard can modify their rate of feedback based on past observations of the agent's behavior. Our framework also generates unwanted cues most frequently at the pose-mimicry level. We suspect this is because the conditional probabilities $p(x_i \mid \cdot)$ are reset to the same initial values after each cue. A human agent typically retains high-level intent for a long time after receiving a high-level cue, but our framework retains no memory of past cues, possibly resulting in repeated cue generation. The reset also means that if the agent misses a cue, they must wait for the probabilities to build up again.

% Intuitively, human agents will retain high level intent for a long time after receiving a high level cue, but may need more repeated guidance at lower levels or if they missed a previous cue.

Beyond the directions identified above, our model could be improved in several other ways. The posing task we considered has only one type of task per level. Future work should examine plan trees of varying depth with more task types per level. Humans provide different forms of guidance for the same task, so the guidance controller could be extended to provide cues at increased granularity. We collected only subjective assessments. Future studies could collect objective metrics, such as task completion time and joint movement trajectories. Further work could be done to distinguish between the {\em need} and the {\em desire} for guidance. Our framework partially incorporates this distinction through the ``not seeking guidance'' variable, and we partially assessed it in the first multiple-choice question (\apdxref{apx: exp question playback}), but much work remains to be done.

