Abstract: Deep Neural Networks (DNNs) are susceptible to adversarial inputs, such as imperceptible noise and naturally occurring challenging samples. This vulnerability likely arises from their passive, one-shot processing approach. In contrast, neuroscience suggests that human vision robustly identifies salient object features by actively switching between multiple fixation points (saccades) and processing its surroundings with non-uniform resolution (foveation). This information is processed via two pathways: the dorsal (where) and ventral (what) streams, which identify relevant input portions and discard irrelevant details. Building on this perspective, we outline a deep learning-based active dorsal-ventral vision system and adapt two prior methods, FALcon and GFNet, within this framework to evaluate their robustness. We conduct a comprehensive robustness analysis across three categories: adversarially crafted inputs evaluated under transfer attack scenarios, natural adversarial images, and foreground-distorted images. By learning from focused, downsampled glimpses at multiple distinct fixation points, these active methods significantly enhance the robustness of passive networks, achieving a 2-21% increase in accuracy. This improvement is demonstrated against a state-of-the-art transferable black-box attack. On ImageNet-A, a benchmark for naturally occurring hard samples, we show how distinct predictions from multiple fixation points yield performance gains of 1.5-2 times for both CNN- and Transformer-based networks. Lastly, we qualitatively demonstrate how an active vision system aligns more closely with human perception for structurally distorted images. This alignment leads to more stable and resilient predictions, with fewer catastrophic mispredictions. In contrast, passive methods, which rely on single-shot learning and inference, often lack the necessary structural understanding.
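As a rough illustration of the inference loop the abstract describes, the sketch below aggregates classifier predictions over downsampled glimpses taken at several fixation points. It is a minimal sketch, not the actual FALcon or GFNet pipeline: the function name `multi_fixation_predict`, the fixed `fixations` list, and the `glimpse_size`/`out_size` parameters are all hypothetical, and the real methods learn where to fixate via a dorsal ("where") pathway rather than using preset crop centers.

```python
import torch
import torch.nn.functional as F

def multi_fixation_predict(model, image, fixations, glimpse_size=96, out_size=64):
    """Aggregate class predictions from downsampled glimpses at several
    fixation points (hypothetical interface; the actual methods learn
    the fixation policy instead of taking fixed crops).

    image:     (3, H, W) tensor
    fixations: list of (row, col) glimpse centers
    """
    _, H, W = image.shape
    logits = []
    for (r, c) in fixations:
        # Clamp the glimpse window so it stays inside the image bounds.
        top = max(0, min(r - glimpse_size // 2, H - glimpse_size))
        left = max(0, min(c - glimpse_size // 2, W - glimpse_size))
        glimpse = image[:, top:top + glimpse_size, left:left + glimpse_size]
        # Stand-in for foveation: classify a low-resolution version of the glimpse.
        glimpse = F.interpolate(glimpse.unsqueeze(0), size=(out_size, out_size),
                                mode='bilinear', align_corners=False)
        logits.append(model(glimpse))
    # Average per-fixation logits so that distinct fixations vote on the label.
    return torch.stack(logits).mean(dim=0).argmax(dim=-1)
```

The averaging step reflects the intuition in the abstract: a single corrupted or unlucky view can produce a catastrophic misprediction, whereas several distinct fixations tend to agree on the salient object and stabilize the final label.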
Submission Length: Long submission (more than 12 pages of main content)
Changes Since Last Submission: This camera-ready version contains:
1. Final manuscript incorporating all changes based on the comments made by the action editor and the reviewers.
2. The GitHub link to the code.
3. The YouTube link to the video presentation.
Video: https://youtu.be/_o7cw6MI5o0
Code: https://github.com/Amitangshu1013/RAVS
Supplementary Material: pdf
Assigned Action Editor: ~Tim_Genewein1
Submission Number: 3255