\section{Introduction}\label{sec:intro}

Humans can offer effective guidance in different roles, such as mentor \citep{Marcdante2018ChoosingMentor}, teacher \citep{Stokhof2017HowEducation}, or coach \citep{LewisTheSport}. They rely not only on spoken language and observation of directly task-related movements, but also on observations of secondary behaviours, such as eye gaze \citep{Kredel2017Eye-TrackingResearch, Brennan2008CoordinatingSearch, Schneider2017Real-TimeQuality, Wang2019TheLearning}. Gaze is an important cue that helps people estimate others' intentions \citep{Foulsham2014EyeTasks, Newn2017EvaluatingGames, Neider2010CoordinatingGaze} and confidence \citep{Emhardt2020InferringMovements}, which is often crucial for proper guidance.

Here, we propose a framework that enables a humanoid robot to provide guidance to a human performing a hierarchical task based on observations of the human's task-related movements and intent estimation from eye gaze. While past work on human-robot collaboration has studied robot guidance in tasks such as coaching \citep{Fasola2013AElderly}, teaching \citep{Conati1997On-LineNetworks} or navigating \citep{Kanda2009AnMall}, to the best of our knowledge, eye gaze has not been exploited for robot guidance. More work has been done on intention estimation from eye gaze, which has been applied in settings ranging from eye typing \citep{Pi2017ProbabilisticTyping} to the control of wheelchairs \citep{Cojocaru2019UsingMovement}, robotic arms \citep{Li20173-D-Gaze-BasedImpairments} or exoskeletons \citep{Frisoli2012ATasks}. \citet{Huang2019NonverbalTeachers} studied the inverse problem of a robot student generating gaze towards human teachers. Gaze has also been exploited in action recognition \citep{LiInVideo}. In these studies, however, tasks typically have a flat structure: intention estimation finds the most probable goal from among a set of similar alternatives, e.g., desired destinations or directions of movement. Little attention has been paid to intention estimation from eye gaze for tasks with a multi-level hierarchical structure \citep{Erol1996ComplexityPlanning, Conati1997On-LineNetworks, Erol1994HTNExpressivity}.

One of the main problems in providing guidance in hierarchical tasks is determining the level at which guidance is needed. For example, consider a kitchen scenario where an apprentice is making a meal under the guidance of a master chef (\figureref{fig:key-example}). The apprentice might encounter difficulty at different levels of the task hierarchy, requiring different cues from the master chef. Suppose that the apprentice has peeled the onion and opened the refrigerator, and then stops. This might be due to confusion at a high level (the apprentice is unclear which recipe they are following) or at a lower level (the apprentice knows to follow Recipe 1, but cannot find the steak). To provide precise guidance, the master chef must estimate the apprentice's intention, i.e., which tasks are being executed at each level of the hierarchy.

\begin{figure}
    \centering
    \includegraphics[width=0.7\textwidth]{Image/key-example.jpg}
    \captionsetup{width=1.0\textwidth}
    \caption{Hierarchical tasks in a kitchen scenario represented as a tree structure. The highest level of the hierarchy (root node) corresponds to the recipe being followed. Leaf nodes result in observable movements by the actor. Arrows indicate the order in which tasks must be executed.}
    \label{fig:key-example}
\end{figure}


Intention estimation at all levels of the hierarchy based only on observations of the apprentice's task-related movements is difficult. Instantaneous observations typically provide direct information about task performance at the lowest level of the hierarchy, but different high-level tasks might contain the same lower-level tasks. For example, the task ``peeling the onion'' is consistent with both Recipes 1 and 2. Thus, based on this observation alone, the master chef cannot determine whether the apprentice is following the correct recipe, and must wait until the apprentice reaches for the steak or the pork to resolve the ambiguity. This delay impairs the timeliness of the cues being delivered.

These ambiguities, which lead to imprecise and poorly timed cues, can be resolved by observing secondary behaviours, such as gaze and facial expressions. In the first example above, the master chef might have high confidence that the apprentice knows which recipe to follow, having seen the apprentice look at the recipe earlier, before the apprentice started any cooking-related movements. In the second example, the chef might provide guidance earlier if the apprentice looks at the master chef and/or expresses confusion through facial expression.

This work proposes a framework for integrating observations of task-related movements and gaze, enabling a humanoid robot to provide timely and precise guidance to a human agent performing hierarchical tasks. We focus on humanoid robots, since the human tendency to anthropomorphize \citep{Damiano2018AnthropomorphismCo-evolution, Fussell2008HowRobots} will evoke social interactions \citep{Damiano2018AnthropomorphismCo-evolution} and elicit corresponding gaze behaviours \citep{Mukawa2001GazeExamples}, which can be exploited for disambiguation. We implemented and tested the framework in virtual reality and compared its performance against that of a human wizard to validate its effectiveness and to understand the contribution of gaze.




