Abstract: People have foveated vision and thus are generally able to attend to just a single object
within their field of view at a time. Our goal is to learn a model that can automatically
identify which object is being attended, given a person’s field of view captured by a
first-person camera. This problem is different from traditional salient object detection because
our goal is not to identify all of the salient objects in the scene, but to identify the single
object to which the camera wearer is attending. We present a model that learns from
very weak supervision: only the class label of the attended object in each frame,
without bounding boxes or other spatial location information. We show that
by learning disentangled representations for localization and classification, our model
can effectively localize novel attended objects that were never seen during training. We
propose a multi-stage knowledge distillation strategy to train our generalized localizer
model. To the best of our knowledge, our work is the first to explore the problem of
learning generalized attended object localization models in egocentric views under weak
supervision.