When CNN meet with ViT: decision-level feature fusion for camouflaged object detection

Published: 01 Jan 2025, Last Modified: 24 Jul 2025, Vis. Comput. 2025, CC BY-SA 4.0
Abstract: Despite the significant advances in camouflaged object detection achieved by convolutional neural network (CNN) methods and vision transformer (ViT) methods, both have limitations. CNN-based methods cannot capture long-range dependencies due to their limited receptive fields, while ViT-based methods lose detailed information due to large-span aggregation. To address these issues, we introduce a novel model, the double-extraction and triple-fusion network (DTNet), which combines the global context modeling capability of ViT-based encoders with the detail capture capability of CNN-based encoders through decision-level feature fusion, compensating for their respective shortcomings to achieve more complete segmentation of camouflaged objects. Specifically, DTNet incorporates a boundary guidance module, designed to aggregate high-level and low-level boundary information through multi-scale feature decoding and thereby guide the local detail representation of the transformer. It also includes a global context aggregation module, which compresses information from adjacent channels from top to bottom and aggregates high-level and low-level scale information from bottom to top for feature decoding. Finally, it contains a multi-feature fusion module that fuses global context features with local detail features; this module applies channel-wise attention to assign different weights to long-range and short-range information. Extensive experiments show that DTNet significantly outperforms 20 recent state-of-the-art methods. The related code and datasets will be posted at https://github.com/KungFuProgrammerle/DTNet.
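The sketch below is a minimal, hypothetical illustration (not the authors' released code) of the decision-level fusion idea described in the abstract: features from a ViT branch and a CNN branch are concatenated and reweighted with a channel-attention gate before being merged. All module names, channel sizes, and tensor shapes are assumptions for demonstration only.

```python
# Illustrative sketch of channel-attention-based fusion of global (ViT) and
# local (CNN) features; names and hyperparameters are hypothetical.
import torch
import torch.nn as nn


class MultiFeatureFusion(nn.Module):
    """Fuse global-context and local-detail features with channel attention."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        # Squeeze-and-excitation style gate over the concatenated feature maps.
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * channels, 2 * channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(2 * channels // reduction, 2 * channels, kernel_size=1),
            nn.Sigmoid(),
        )
        self.project = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, global_feat: torch.Tensor, local_feat: torch.Tensor) -> torch.Tensor:
        # Concatenate along channels, then reweight each channel so that
        # long-range (ViT) and short-range (CNN) information contribute
        # with learned, input-dependent weights.
        x = torch.cat([global_feat, local_feat], dim=1)
        x = x * self.gate(x)
        return self.project(x)


if __name__ == "__main__":
    fuse = MultiFeatureFusion(channels=64)
    vit_feat = torch.randn(1, 64, 44, 44)   # stand-in for a ViT decoder feature map
    cnn_feat = torch.randn(1, 64, 44, 44)   # stand-in for a CNN decoder feature map
    print(fuse(vit_feat, cnn_feat).shape)   # torch.Size([1, 64, 44, 44])
```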