Keywords: visual attention, foveal view, sharp vision, reinforcement learning, human-robot interaction
TL;DR: An investigation of foveal representations for training efficient attention models towards human-level sharp vision.
Abstract: Human vision can focus on subtle visual cues at high resolution by relying on a foveal view coupled with an attention mechanism. Several recent studies have proposed attention models based on deep reinforcement learning. However, these studies do not explicitly consider the design of the foveal representation, so its effect on an attention system remains unclear. In this paper, we investigate the effect of using a hierarchy of visual streams in training an efficient attention model towards achieving human-level sharp vision. We perform our evaluation on a simulated human-robot interaction task in which the agent attends to faces that are looking at it. The experimental results show that the performance of the system depends on factors such as the number of visual streams and their relative fields of view, and we demonstrate that maintaining a hierarchy within the visual streams is crucial for learning attention strategies.