Abstract: As societal focus on image authenticity grows, image manipulation localization has become a crucial and challenging task in computer vision. Current methods relying on dual-stream encoders to extract features from both RGB and noise images often suffer from feature misalignment and information loss during fusion. Moreover, many localization methods use loss functions to identify manipulated areas, but balancing weights between manipulated regions and edges remains challenging. To address these challenges, we propose a novel method that integrates features in dual-stream networks with adaptive selective state spaces. By treating the two output features from the dual-stream encoder as system inputs, we construct a feature space that optimizes the system’s state space. Introducing temporal dynamics enriches the feature representation and enhances learning capabilities, significantly improving the accuracy and reliability of image manipulation localization. Additionally, we propose an edge residual review module that refines the boundaries of manipulated regions from the preliminary output, subsequently enhancing the input features for improved re-localization accuracy. Extensive experiments demonstrate that our approach yields competitive results on diverse large-scale image datasets, outperforming most state-of-the-art methods in both precision and robustness.
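To make the idea of fusing dual-stream features through a selective state space more concrete, the following is a minimal sketch and not the authors' implementation. It assumes scalar per-step features, a single hidden state, and a step size driven by the noise stream. The function name `selective_scan_fuse` and the parameters `a`, `w_delta`, and `b_delta` are hypothetical and chosen only for illustration.

```python
import math


def softplus(x):
    # Numerically stable log(1 + exp(x)); keeps the step size positive.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def selective_scan_fuse(rgb, noise, a=-1.0, w_delta=1.0, b_delta=0.0):
    """Fuse two equal-length scalar feature sequences with a one-state
    selective state-space recurrence (illustrative assumption only)."""
    if len(rgb) != len(noise):
        raise ValueError("rgb and noise must have the same length")
    if a >= 0:
        raise ValueError("a must be negative so the state decays")

    h = 0.0          # hidden state of the system
    fused = []
    for r, n in zip(rgb, noise):
        u = r + n    # combined system input (assumption: simple sum)
        # Input-dependent step size: this is what makes the scan "selective".
        delta = softplus(w_delta * n + b_delta)
        # Zero-order-hold discretization of dh/dt = a*h + u.
        a_bar = math.exp(delta * a)
        b_bar = (a_bar - 1.0) / a
        h = a_bar * h + b_bar * u
        fused.append(h)
    return fused


if __name__ == "__main__":
    print(selective_scan_fuse([1.0, 1.0, 1.0], [1.0, 1.0, 1.0]))
```

In this sketch a larger noise feature yields a larger step size, so the state forgets its history faster and follows the current input more closely. A real model would operate on multi-channel feature maps with learned projections.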
External IDs: dblp:journals/tcsv/WangCCXLZW26