Abstract: This paper focuses on improving the mathematical interpretability of convolutional neural networks (CNNs) in the context of image classification. Specifically, we tackle the instability issue arising in their first layer, which tends to learn parameters that closely resemble oriented band-pass filters when trained on datasets like ImageNet. Subsampled convolutions with such Gabor-like filters are prone to aliasing, causing sensitivity to small input shifts. In this context, we establish conditions under which the max pooling operator approximates a complex modulus, which is nearly shift invariant. We then derive a measure of shift invariance for subsampled convolutions followed by max pooling. In particular, we highlight the crucial role played by the filter's frequency and orientation in achieving stability. We experimentally validate our theory by considering a deterministic feature extractor based on the dual-tree complex wavelet packet transform, a particular case of discrete Gabor-like decomposition.
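The shift-sensitivity claim in the abstract can be illustrated with a minimal 1D sketch. This is not the paper's experimental setup: the signal is white noise rather than an image, the filter is a plain complex Gabor filter rather than a dual-tree complex wavelet packet filter, and the frequency, bandwidth, stride and pooling size (`omega`, `sigma`, `stride`, `pool`) are assumptions chosen for illustration. It compares how much three feature maps change when the input is shifted by one sample: the subsampled real part, the real part followed by max pooling, and the complex modulus.

```python
import numpy as np

def gabor_filter(n_taps, omega, sigma):
    # Complex Gabor filter: Gaussian window times a complex exponential.
    n = np.arange(-(n_taps // 2), n_taps // 2 + 1)
    return np.exp(-n**2 / (2 * sigma**2)) * np.exp(1j * omega * n), n

def circular_conv(x, w, offsets):
    # Circular convolution via the FFT, so that a circular input shift
    # produces exactly the same shift in the output.
    w_full = np.zeros(len(x), dtype=complex)
    w_full[offsets % len(x)] = w
    return np.fft.ifft(np.fft.fft(x) * np.fft.fft(w_full))

def sliding_max(r, k):
    # Max over k consecutive samples (circular), before any subsampling.
    return np.max(np.stack([np.roll(r, -j) for j in range(k)]), axis=0)

def features(x, w, offsets, stride, pool):
    y = circular_conv(x, w, offsets)
    real_sub = y.real[::stride]                   # subsampled real convolution
    maxpool = sliding_max(y.real, pool)[::stride] # real part + max pooling
    modulus = np.abs(y)[::stride]                 # complex modulus
    return real_sub, maxpool, modulus

def rel_change(a, b):
    return np.linalg.norm(a - b) / np.linalg.norm(a)

# Illustrative parameters (assumptions, not taken from the paper).
rng = np.random.default_rng(0)
x = rng.standard_normal(1024)
w, offsets = gabor_filter(65, omega=np.pi / 2, sigma=8.0)

f0 = features(x, w, offsets, stride=2, pool=4)
f1 = features(np.roll(x, 1), w, offsets, stride=2, pool=4)  # one-sample shift
err_real, err_max, err_mod = (rel_change(a, b) for a, b in zip(f0, f1))

print(f"real part, subsampled: {err_real:.3f}")
print(f"real part + max pool:  {err_max:.3f}")
print(f"complex modulus:       {err_mod:.3f}")
```

In this toy setting the pooling window spans one full period of the filter's oscillation, which is the favorable case; changing `omega` or `pool` shows how the filter frequency affects the stability of max pooling.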
Submission Length: Long submission (more than 12 pages of main content)
Changes Since Last Submission:
# Section 1.1 (Introduction - Motivations)
- Clarified motivations;
- Added a paragraph explicitly stating the restriction of the analysis to the first CNN layer;
- Updated references to include recent work on antialiasing and deep learning architectures;
- Included a disclaimer clarifying the applicability of the study with respect to Transformer-based models.
# Section 1.2 (Introduction - Related work)
- Corrected an erroneous reference (Oyallon et al., 2017);
- Introduced the dual-tree wavelet packet transform at first mention;
- Clarified terminology regarding the continuous versus discrete frameworks.
# Section 1.3 (Introduction - Paper outline)
- Clarified the relationship between stride and subsampling;
- Justified the use of standard convolution ($\ast$) instead of cross-correlation ($\star$);
- Replaced the vague term "Gabor hypothesis" with a more explicit formulation (also updated in Section 2.2);
- Added cross-references to key results to improve navigability;
- Clarified our contributions with respect to Waldspurger’s prior work.
# Section 6.3 (Experiments and Results)
- Explained that the MSE is computed separately for each filter frequency band;
- Added a concluding paragraph highlighting the alignment between theoretical predictions and empirical results.
# Section 7 (Conclusion)
- Rephrased the take-home message;
- Improved the discussion of our companion paper;
- Added a paragraph discussing three key limitations of the current study.
Assigned Action Editor: ~Vincent_Dumoulin1
Submission Number: 4292