\begin{figure}[ht]
    \centering
    \subfigure[Flowchart of the framework.]{\label{fig:framework-generic-flowchart}%
      \includegraphics[width=0.3\textwidth]{Image/generic_flowchart.JPG}
     }%
    % \begin{subfigure}{0.3\textwidth}
    %     \centering
    %     \includegraphics[width=\textwidth]{Image/generic_flowchart.JPG}
    %     \captionsetup{width=1.0\textwidth}
    %     \caption{}
    %     \label{fig:framework-generic-flowchart}
    % \end{subfigure}%
    % 
    \subfigure[Steps followed by the intention estimator.]{\label{fig:framework-generic-conversion}%
      \includegraphics[width=0.7\textwidth]{Image/generic conversion.jpg}
    }%
    % \begin{subfigure}{0.7\textwidth}
    %     \centering
    %     \includegraphics[width=\textwidth]{Image/generic conversion.jpg}
    %     \captionsetup{width=1.0\textwidth}
    %     \caption{}
    %     \label{fig:framework-generic-conversion}
    % \end{subfigure}
    % 
    \caption{Overview of the framework.}
    \label{fig:framework-generic-overview}
\end{figure}

\section{Proposed Framework}\label{sec:framework}
Our proposed guidance framework consists of three components: an HTN planner, an intention estimator and a guidance controller (\figref{fig:framework-generic-flowchart}). The HTN planner receives observations about the task-related movements of the agent and maintains the world state of the HTN domain. Given the root task and the world state, the planner finds the set of tasks the agent is currently expected to execute, which we refer to as the \emph{optimal branch}. The intention estimator then estimates the agent's intention at each level of the optimal branch. These estimates are conditional probabilities given observations of the agent's movements and gaze. The guidance controller generates verbal and gestural cues to the agent based on the estimated intention. Intuitively, if all probabilities are high, the controller gives no cue. If one or more of the probabilities is low, the controller gives a cue appropriate for the highest level at which the probability is low. 

\subsection{HTN Planner}\label{sec:generic planner}

The planner keeps track of actions, which are executed via observable task-related movements, and maintains the world state for planning. We make the simplifying assumption that the agent is perfectly rational, i.e., follows the optimal plan. We use HATP \citep{Lallement2014HATP:Robotics} (released under the 2-clause BSD license) to find the optimal plan and extract the optimal branch, which corresponds to what the agent should be doing at the current time. However, any other HTN planner, such as SHOP2 \citep{Nau2003SHOP2:System}, could be used instead.
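
As an illustration of what extracting the optimal branch involves, the following minimal sketch walks an already-decomposed plan tree from the root to the first unfinished leaf action. It does not use HATP; the \texttt{Task} structure, the \texttt{done} flag, the assumption of totally ordered subtasks and the shape of the example tree are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Task:
    """Node of an already-decomposed plan tree (hypothetical structure)."""
    name: str
    children: List["Task"] = field(default_factory=list)
    done: bool = False  # only meaningful for leaf actions


def optimal_branch(root: Task) -> Optional[List[str]]:
    """Return task names from the root down to the first unfinished leaf action."""
    if not root.children:
        return None if root.done else [root.name]
    for child in root.children:  # subtasks are assumed to be totally ordered
        branch = optimal_branch(child)
        if branch is not None:
            return [root.name] + branch
    return None  # every action below this task has been executed


# Hypothetical fragment of the kitchen plan.
plan = Task("Recipe1()", [
    Task("Thaw(steak)", [
        Task("OpenFridge()", done=True),
        Task("TakeOut(steak)"),
    ]),
])
print(optimal_branch(plan))  # ['Recipe1()', 'Thaw(steak)', 'TakeOut(steak)']
```

The returned list plays the role of the tasks $A_1, \ldots, A_N$ used in the next subsection, with the goal first and the leaf action last.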

\subsection{Intention Estimation}\label{sec:generic state estimate}

\figref{fig:framework-generic-conversion} illustrates the steps followed for estimating intention at the current time step $T$. Let $\{A_i, i = 1,2,\ldots,N\}$ denote the tasks on the branch, where $N$ is the number of levels in the branch, $A_1$ is the goal and $A_N$ the leaf action. The intention is a set of independent, binary-valued random variables $\{x_i, i = 1,2,\ldots,N\}$, where $x_i = 1$ if the agent is executing, intending to execute, or at least aware of the need to execute $A_i$. To estimate intention, we compute the probability that $x_i = 1$ given the history of task-related movements $\{c^t\}_{t=1}^{T}$ and gaze $\{g^t\}_{t=1}^{T}$ up to time step $T$. By Bayes' rule,
\begin{align}
    P(x_i | \{c^t\}_{t=1}^{T},\{g^t\}_{t=1}^{T})
    &\propto P( \{c^t\}_{t=1}^{T}, \{g^t\}_{t=1}^{T} | x_i) \label{eq:Bayes-model-first}\\
    &\propto P( \{g^t\}_{t=1}^{T} | x_i)P( \{c^t\}_{t=1}^{T} | x_i) \label{eq:Bayes-model-independent}
\end{align}
where in \eqref{eq:Bayes-model-independent} we assume that movement and gaze are conditionally independent given the intention variable $x_i$. The conditional probabilities for the different levels $i$ are estimated separately, based on the same observed movement and gaze histories. We construct $N$ gaze models, one for each level $i$ of the hierarchy. However, we compute the conditional probability of the movement sequence only at the lowest level, $P( \{c^t\}_{t=1}^{T} | x_N)$, corresponding to actions. This assumes that the movement sequence is independent of the intermediate goals, which is certainly not true but, given the ambiguity pointed out above, is a reasonable simplifying assumption. More realistic movement models that do not assume independence could easily be incorporated. We discuss this issue further in \secref{sec:discussion}. 
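
Normalising over the two values of $x_i$ makes the proportionality explicit. As a sketch, write $\mathcal{L}_i(x)$ for the likelihood on the right-hand side of \eqref{eq:Bayes-model-independent} evaluated at $x_i = x$, and let $\pi_i = P(x_i = 1)$ denote a prior; both symbols are introduced here for illustration only. Then
\begin{align*}
    P(x_i = 1 | \{c^t\}_{t=1}^{T},\{g^t\}_{t=1}^{T})
    = \frac{\pi_i\,\mathcal{L}_i(1)}{\pi_i\,\mathcal{L}_i(1) + (1-\pi_i)\,\mathcal{L}_i(0)} .
\end{align*}
Omitting the prior, as in \eqref{eq:Bayes-model-first}, corresponds to the uniform choice $\pi_i = 1/2$, in which case the posterior reduces to $\mathcal{L}_i(1) / (\mathcal{L}_i(1) + \mathcal{L}_i(0))$.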

We treat $c^t$ as a discrete random variable that assumes one of three values: ``not observed'', ``consistent'' or ``inconsistent'', where consistency is defined with respect to action $A_N$. For instance, the apprentice's arm reaching for the fridge door is consistent with the action ``OpenFridge''. We model $P( \{c^t\}_{t=1}^{T} | x_N)$ with two 3-state Markov chains: one for $x_N = 1$ and one for $x_N=0$. Intuitively, sequences with many movements consistent with $A_N$ are more likely if $x_N=1$ than if $x_N=0$. 
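
The following minimal sketch shows how a movement sequence could be scored under the two 3-state chains and turned into a probability that $x_N = 1$. It covers the movement term only (no gaze), assumes first-order chains, and uses made-up initial distributions, transition matrices and a uniform prior; none of these numbers come from our experiments.

```python
import math

STATES = {"not observed": 0, "consistent": 1, "inconsistent": 2}

# Hypothetical parameters, keyed by the value of x_N. Every row sums to 1.
INIT = {1: [0.4, 0.5, 0.1], 0: [0.4, 0.1, 0.5]}
TRANS = {
    1: [[0.5, 0.4, 0.1],
        [0.2, 0.7, 0.1],
        [0.3, 0.4, 0.3]],
    0: [[0.5, 0.1, 0.4],
        [0.3, 0.3, 0.4],
        [0.2, 0.1, 0.7]],
}


def log_likelihood(seq, x):
    """log P(c^1, ..., c^T | x_N = x) under a first-order Markov chain."""
    idx = [STATES[s] for s in seq]
    ll = math.log(INIT[x][idx[0]])
    for prev, cur in zip(idx[:-1], idx[1:]):
        ll += math.log(TRANS[x][prev][cur])
    return ll


def posterior_intent(seq, prior=0.5):
    """P(x_N = 1 | c^1, ..., c^T), normalised over x_N in {0, 1}."""
    l1 = log_likelihood(seq, 1) + math.log(prior)
    l0 = log_likelihood(seq, 0) + math.log(1.0 - prior)
    m = max(l1, l0)  # subtract the maximum for numerical stability
    e1, e0 = math.exp(l1 - m), math.exp(l0 - m)
    return e1 / (e1 + e0)


print(posterior_intent(["not observed", "consistent", "consistent"]))
print(posterior_intent(["inconsistent", "inconsistent", "not observed"]))
```

With these illustrative parameters, a run of consistent movements pushes the estimate above one half, whereas a run of inconsistent movements pushes it below.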

We also treat $g^t$ as a discrete random variable. The number of possible values depends upon the number of physical entities associated with the task $A_i$, which we can extract automatically from the HTN based on the predicate describing $A_i$ and the method for $A_i$ used in the plan. Assuming a total of $M_i$ entities $\{e_1, e_2, \ldots, e_{M_i}\}$ involved in task $A_i$, $g^t$ assumes one of $M_i+2$ values: $M_i$ values indicating that the gaze falls on one of the associated entities $e_m$, $m = 1, \ldots, M_i$, one value indicating that the gaze does not fall on any of the relevant entities, and one value indicating no gaze observation. We model $P( \{g^t\}_{t=1}^{T} | x_i)$ with two $(M_i+2)$-state Markov chains: one for $x_i = 1$ and one for $x_i=0$. Distinguishing between gaze towards different entities enables us to model task-specific gaze patterns, e.g., alternations of the apprentice's gaze between the onion entity and the knife entity when performing the task ``Peel(onion,knife)'' in \figref{fig:key-example-htn-domain}.
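
To make the $(M_i+2)$-valued encoding concrete, the sketch below maps raw gaze targets to state indices for a single task. The function names, the state labels and the convention that \texttt{None} stands for a missing gaze observation are assumptions made for this example.

```python
from typing import List, Optional


def gaze_states(entities: List[str]) -> List[str]:
    """State labels of the (M_i + 2)-state gaze chain for one task."""
    return list(entities) + ["<other>", "<no observation>"]


def encode_gaze(target: Optional[str], entities: List[str]) -> int:
    """Map a raw gaze target to a state index; None means gaze was not observed."""
    if target is None:
        return len(entities) + 1      # the "no gaze observation" state
    if target in entities:
        return entities.index(target)  # one state per task-related entity
    return len(entities)              # gaze on something irrelevant to the task


peel_entities = ["onion", "knife"]    # entities of the task Peel(onion,knife)
raw = ["onion", "knife", None, "fridge", "onion"]
print(gaze_states(peel_entities))
print([encode_gaze(t, peel_entities) for t in raw])  # [0, 1, 3, 2, 0]
```

The resulting index sequence is what a pair of $(M_i+2)$-state Markov chains would score, in the same way as the movement sequence.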

Consider again the kitchen example. If the apprentice has the wrong recipe in mind, s/he is likely to look at the wrong recipe, an entity that does not appear in the predicate of the task ``Recipe1()'' (\figref{fig:generic-htn-plan-execution}). Gaze on the wrong recipe is therefore inconsistent, causing $P( \{g^t\}_{t=1}^{T} | x_{\text{Recipe1}})$ to drop for $x_{\text{Recipe1}}=1$. On the other hand, if everything is fine, gaze on the steak is consistent with the tasks ``Recipe1()'', ``Thaw(steak)'' and ``TakeOut(steak)'' at all levels of the hierarchy in \figref{fig:generic-htn-plan-execution}, since the entity $e_{\text{steak}}$ is relevant to all of those tasks. Intuitively, entities relevant to tasks on the optimal branch are organized hierarchically. Higher-level tasks have more related entities, whereas lower-level tasks have fewer: a subset of those related to the higher-level task. Thus, gaze models at higher levels are more spatially diffuse than gaze models at lower levels, which are more spatially localized. These differences in modelled gaze behavior enable the framework to use gaze for disambiguation among different levels of the pyramid.
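
The nesting of entity sets along the branch can be illustrated with a small sketch. The entity names and set contents below are hypothetical and chosen only to mirror the kitchen example; in the framework they would be extracted from the HTN.

```python
# Hypothetical entity sets for the three levels of the example branch,
# ordered from the goal (highest level) down to the leaf action.
BRANCH_ENTITIES = {
    "Recipe1()": {"recipe1", "steak", "fridge", "pan"},
    "Thaw(steak)": {"steak", "fridge"},
    "TakeOut(steak)": {"steak"},
}


def consistent_levels(gaze_target):
    """For each task on the branch, report whether the gaze target is one of its entities."""
    return {task: gaze_target in ents for task, ents in BRANCH_ENTITIES.items()}


print(consistent_levels("steak"))    # consistent at every level
print(consistent_levels("fridge"))   # consistent only at the two higher levels
print(consistent_levels("recipe2"))  # inconsistent everywhere, including Recipe1()
```

Gaze on an entity that is relevant only to the higher levels therefore lowers the estimate at the leaf while leaving the higher levels unaffected, which is the kind of evidence that separates the levels.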

We add an additional ``not seeking guidance'' variable $x_0$, which indicates whether the agent is seeking guidance ($x_0=0$) or not ($x_0=1$). In our example, when needing help, the apprentice might look towards the master chef. Our framework assumes a humanoid robot guide, which many people tend to anthropomorphize \citep{Damiano2018AnthropomorphismCo-evolution,Fussell2008HowRobots}. The associated gaze feature assumes three values: ``no-observation'', ``looking at the robot'' and ``looking elsewhere''. The associated gaze trajectories are modelled by 3-state Markov chains, similar to the above. 

\subsection{Guidance Controller}\label{sec:generic controller}

The guidance controller maps the estimated intention, $\{P(x_i | \cdot),i=0,1,2,\ldots,N\}$, where $\cdot$ abbreviates the movement/gaze trajectory, to a cue depending upon whether the agent \emph{wants} guidance ($P(x_0=1 | \cdot)$ is low) or \emph{needs} guidance ($P(x_i=1 | \cdot )$ is low for some $i$ between 1 and $N$). The guidance controller first evaluates whether the agent needs guidance. Formally, given a threshold $h$,
\begin{align}
    &\text{Provide guidance if}\quad \exists\, i \in \{1,2,\ldots,N\}:\ P(x_i = 1 \mid \cdot ) \leq h \\
    &\text{Choose cue at level }i \text{ where } i = \max\{n \mid P(x_n = 1 \mid \cdot ) \leq h,\ n = 1,2,\ldots,N\}
\end{align}
If the agent does not need guidance, the controller checks whether s/he \emph{wants} guidance. Formally, 
\begin{align}
    &\text{Provide guidance if}\quad P(x_0 = 1| \cdot ) \leq h \\ 
    &\text{Choose cue at level }i \text{ where } i = \operatorname*{arg\,min}_{n \in \{1,2,\ldots,N\}} P(x_n = 1 \mid \cdot )
\end{align}
The guidance cues are both task and level dependent. If guidance is needed and the agent is confused at multiple levels, the controller gives the cue for the highest level at which guidance is needed. If guidance is wanted but not needed, the controller provides the cue at the level with the lowest estimated intention. 
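
The two-stage decision rule can be summarised in a few lines. The sketch below follows the selection rules exactly as written in the equations above (a $\max$ over the indices below threshold in the ``needs'' case, an $\arg\min$ in the ``wants'' case); the function name, the list layout and the default threshold value are assumptions made for illustration.

```python
from typing import List, Optional


def choose_cue_level(p: List[float], h: float = 0.5) -> Optional[int]:
    """Select the level of the cue, or None if no cue should be given.

    p[0] is P(x_0 = 1 | .) and p[i] is P(x_i = 1 | .) for i = 1, ..., N.
    The threshold h = 0.5 is a placeholder, not a tuned value.
    """
    levels = range(1, len(p))
    low = [n for n in levels if p[n] <= h]
    if low:                      # the agent needs guidance
        return max(low)          # largest index n with P(x_n = 1 | .) <= h
    if p[0] <= h:                # the agent wants guidance
        return min(levels, key=lambda n: p[n])  # level with the lowest estimate
    return None                  # all probabilities are high: stay silent


print(choose_cue_level([0.9, 0.8, 0.3, 0.2]))  # 3: guidance is needed
print(choose_cue_level([0.2, 0.9, 0.7, 0.8]))  # 2: guidance is wanted
print(choose_cue_level([0.9, 0.9, 0.9, 0.9]))  # None: no cue
```

The returned level would then index into a task- and level-specific table of verbal and gestural cues.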










