Less is More: Adaptive Feature Selection and Fusion for Eye Contact Detection

Published: 01 Jan 2024 · Last Modified: 05 Mar 2025 · ACM Multimedia 2024 · CC BY-SA 4.0
Abstract: Detecting eye contact is essential for embodied robots to engage in natural interactions with humans, enhancing the intuitiveness and comfort of these exchanges. However, eye contact detection is often challenging due to factors such as low contrast and various forms of occlusion. Existing methods employ convolutional neural networks (CNNs) or Transformers to learn discriminative representations, but usually ignore the influence of noisy or less relevant regions in facial images. To address this gap, we propose the deep feature selection and fusion network (FSFNet) for eye contact detection in multi-party conversations. Our method adaptively selects fine-grained visual features and reduces the impact of irrelevant features. Specifically, we present a local feature selection scheme that leverages attention scores to progressively concentrate on the most informative features. By integrating the carefully selected features into the multi-head self-attention module, we retain the favorable properties of Transformers while reducing the overall computational cost. We evaluate the proposed method on the official eye contact detection datasets, achieving promising results of 0.8174 and 0.79 on the validation and test sets, respectively. The source code is publicly available at https://github.com/ma-hnu/FSFNet.
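The core idea of attention-guided feature selection before self-attention can be sketched as follows. This is a minimal illustration, not the authors' exact FSFNet implementation: the scoring head, the top-k selection rule, and all dimensions (`dim`, `num_heads`, `keep`) are assumptions chosen for the example.

```python
import torch
import torch.nn as nn

class AttentiveTokenSelector(nn.Module):
    """Sketch of attention-score-based feature selection: keep only the
    top-k most informative patch tokens, then apply standard multi-head
    self-attention over the reduced token set. Running MHSA on fewer
    tokens is what lowers the computational cost, as the abstract notes.
    The scoring head and top-k rule are illustrative assumptions."""

    def __init__(self, dim=64, num_heads=4, keep=16):
        super().__init__()
        self.keep = keep
        self.score = nn.Linear(dim, 1)  # assumed per-token relevance head
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens):                   # tokens: (B, N, dim)
        scores = self.score(tokens).squeeze(-1)  # (B, N) relevance scores
        idx = scores.topk(self.keep, dim=1).indices  # indices of top-k tokens
        sel = torch.gather(
            tokens, 1, idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
        )                                        # (B, keep, dim) selected tokens
        out, _ = self.attn(sel, sel, sel)        # MHSA over the reduced set
        return out                               # (B, keep, dim)

# usage: 196 patch tokens reduced to 16 before self-attention
x = torch.randn(2, 196, 64)
y = AttentiveTokenSelector()(x)
print(y.shape)  # torch.Size([2, 16, 64])
```

In a full model, this selection could be applied progressively across Transformer blocks so that attention concentrates on an ever smaller set of informative facial regions.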