ACENet: Attention-Driven Contextual Features-Enhanced Lightweight EfficientNet for 2D Hand Pose Estimation
Abstract: Computer Vision (CV) has advanced remarkably in recent years, yet traditional input devices remain outdated, especially given the central role of hand movements in human interaction. This has intensified research in Hand Pose Estimation (HPE) for applications ranging from gesture-controlled car systems to virtual reality. Despite deep learning’s success in HPE, challenges such as hand flexibility, occlusion, and size variability persist. As emphasis shifts toward RGB-based 3D pose estimation and away from costlier multi-view camera systems, we propose a 2D HPE approach called ACENet: Attention-Driven Contextual Features-Enhanced Lightweight EfficientNet. The framework employs an EfficientNet backbone integrated with a Squeeze-and-Excitation (SE) block for feature extraction and is further enhanced by a Global Context (GC) block. It effectively predicts complex hand poses from single RGB images, emphasizing crucial features for richer representations. On the Carnegie Mellon University (CMU) Panoptic dataset, our model improves average accuracy by 2.11% over OCPM, underscoring its significant potential across various sectors.
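The channel-attention mechanism named in the abstract, the Squeeze-and-Excitation (SE) block, can be illustrated with a minimal NumPy sketch: a "squeeze" via global average pooling collapses each channel to a scalar, and an "excitation" bottleneck (ReLU then sigmoid) produces per-channel weights that rescale the input. The weight shapes, reduction ratio, and function names below are illustrative assumptions, not ACENet's actual implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_block(feature_map, w1, w2):
    """Squeeze-and-Excitation over a (C, H, W) feature map.

    Squeeze: global average pooling reduces each channel to one scalar.
    Excitation: a two-layer bottleneck (ReLU, then sigmoid) maps those
    scalars to per-channel weights in (0, 1) that rescale the input.
    """
    squeezed = feature_map.mean(axis=(1, 2))      # (C,)   global average pool
    hidden = np.maximum(0.0, w1 @ squeezed)       # (C/r,) bottleneck + ReLU
    weights = sigmoid(w2 @ hidden)                # (C,)   gates in (0, 1)
    return feature_map * weights[:, None, None]   # channel-wise reweighting

# Toy example: 8 channels, reduction ratio r = 4 (shapes are illustrative).
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16, 16))
w1 = rng.standard_normal((2, 8)) * 0.1  # squeeze 8 channels down to 2
w2 = rng.standard_normal((8, 2)) * 0.1  # expand back to 8 channel gates
y = se_block(x, w1, w2)
print(y.shape)  # (8, 16, 16): same shape, channels rescaled
```

Because the gates lie in (0, 1), the block can only attenuate channels, which is how it emphasizes informative features relative to less useful ones.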