Harmonizing local and global features: enhanced hand gesture segmentation using synergistic fusion of CNN and transformer networks

Published: 01 Jan 2024, Last Modified: 08 May 2025. Venue: Signal Image Video Process. 2024. License: CC BY-SA 4.0
Abstract: Hand gesture segmentation is an important research topic in computer vision. Despite ongoing efforts, accurate gesture segmentation remains challenging because of factors such as gesture morphology and intricate backgrounds. To address these challenges, we propose a novel hand gesture segmentation approach that combines the strengths of Convolutional Neural Networks (CNNs) for local feature extraction with those of Transformer networks for global feature integration. Specifically, we design two feature fusion modules. The first employs an attention mechanism to learn how to fuse the features extracted by the CNN and the Transformer. The second uses a combination of group convolution and activation functions to implement gating mechanisms, enhancing the response of crucial features while minimizing interference from weaker ones. Our proposed method achieves mIoU scores of 93.53%, 97.25%, and 90.39% on the OUHANDS, HGR1, and EgoHands hand gesture datasets, respectively, outperforming state-of-the-art methods.
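For readers unfamiliar with the reported metric, the following is a minimal sketch of how mean Intersection over Union (mIoU) is commonly computed for a pair of label masks. It is not taken from the paper: the function name `mean_iou`, the two-class (background/hand) setup, and the choice to skip classes absent from both masks are assumptions made for illustration, and the authors' exact averaging protocol may differ.

```python
def mean_iou(pred, target, num_classes=2):
    """Mean IoU over classes for two equally sized 2D label masks.

    Hypothetical helper for illustration; masks are lists of rows of
    integer class labels (assumed here: 0 = background, 1 = hand).
    """
    ious = []
    for cls in range(num_classes):
        inter = 0
        union = 0
        for pred_row, target_row in zip(pred, target):
            for p, t in zip(pred_row, target_row):
                in_pred = p == cls
                in_target = t == cls
                inter += in_pred and in_target   # pixel is in both masks
                union += in_pred or in_target    # pixel is in either mask
        if union > 0:  # assumption: skip classes absent from both masks
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy 2x3 masks: predicted labels vs. ground-truth labels.
pred = [[0, 1, 1],
        [0, 1, 0]]
target = [[0, 1, 1],
          [0, 0, 0]]
score = mean_iou(pred, target)
```

In this toy case the background class has IoU 3/4 and the hand class has IoU 2/3, so `score` is their average. A dataset-level score would typically aggregate over all test images, either per image or by accumulating intersections and unions first; which variant the paper uses is not stated in the abstract.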