As large as it gets – Studying Infinitely Large Convolutions via Neural Implicit Frequency Filters

Published: 20 May 2024, Last Modified: 20 May 2024. Accepted by TMLR.
Abstract: Recent work in neural networks for image classification has seen a strong tendency towards increasing the spatial context during encoding. Whether achieved through large convolution kernels or self-attention, models scale poorly with the increased spatial context, such that the improved model accuracy often comes at significant costs. In this paper, we propose a module for studying the effective filter size of convolutional neural networks (CNNs). To facilitate such a study, several challenges need to be addressed: (i) we need an effective means to train models with large filters (potentially as large as the input data) without increasing the number of learnable parameters, (ii) the employed convolution operation should be a plug-and-play module that can replace conventional convolutions in a CNN and allow for an efficient implementation in current frameworks, (iii) the study of filter sizes has to be decoupled from other aspects such as the network width or the number of learnable parameters, and (iv) the cost of the convolution operation itself has to remain manageable, i.e., we cannot naïvely increase the size of the convolution kernel. To address these challenges, we propose to learn the frequency representations of filter weights as neural implicit functions, such that the better scalability of the convolution in the frequency domain can be leveraged. Additionally, due to the implementation of the proposed neural implicit function, even large and expressive spatial filters can be parameterized by only a few learnable weights. Interestingly, our analysis shows that, although the proposed networks could learn very large convolution kernels, the learned filters are well localized and relatively small in practice when transformed from the frequency to the spatial domain. We anticipate that our analysis of individually optimized filter sizes will allow for more efficient, yet effective, models in the future. Our code is available at .
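Illustrative sketch: The core idea of the abstract, a filter defined by a small neural implicit function over frequency coordinates and applied by pointwise multiplication in the frequency domain, can be sketched roughly as follows. This is a simplified single-channel NumPy illustration, not the authors' implementation; the function names (`init_mlp`, `implicit_filter`, `frequency_conv`), the hidden width, and the ReLU MLP are assumptions made for the example.

```python
import numpy as np


def init_mlp(rng, hidden=16):
    """Hypothetical tiny MLP mapping a frequency coordinate (u, v)
    to the (real, imaginary) parts of the filter at that frequency."""
    return {
        "w1": rng.normal(0.0, 1.0, (2, hidden)),
        "b1": np.zeros(hidden),
        "w2": rng.normal(0.0, 1.0 / np.sqrt(hidden), (hidden, 2)),
        "b2": np.zeros(2),
    }


def implicit_filter(params, height, width):
    """Evaluate the MLP on every frequency coordinate of an H x W grid.
    The parameter count does not depend on H or W."""
    u = np.fft.fftfreq(height)  # frequencies in cycles per sample
    v = np.fft.fftfreq(width)
    coords = np.stack(np.meshgrid(u, v, indexing="ij"), axis=-1)  # (H, W, 2)
    hidden = np.maximum(coords @ params["w1"] + params["b1"], 0.0)  # ReLU
    out = hidden @ params["w2"] + params["b2"]  # (H, W, 2)
    return out[..., 0] + 1j * out[..., 1]


def frequency_conv(image, params):
    """Circular convolution with an input-sized filter, computed as a
    pointwise product in the frequency domain (convolution theorem)."""
    freq_image = np.fft.fft2(image)
    freq_filter = implicit_filter(params, *image.shape)
    # Keep the real part; for a real image this equals circular
    # convolution with the real part of the spatial kernel.
    return np.real(np.fft.ifft2(freq_image * freq_filter))


rng = np.random.default_rng(0)
params = init_mlp(rng)
image = rng.normal(size=(8, 8))
output = frequency_conv(image, params)
# The equivalent spatial kernel is as large as the input.
spatial_kernel = np.real(np.fft.ifft2(implicit_filter(params, 8, 8)))
```

In this sketch the same set of MLP weights yields a filter for any input resolution, which is one way to read the abstract's point that large spatial filters can be parameterized by only a few learnable weights. Transforming the frequency response back with an inverse FFT, as in `spatial_kernel`, is how one would inspect how localized the learned filter is.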
Certifications: Featured Certification
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission:
• We moved the comparison to previous works applying convolution in the frequency domain and beyond to the related work section in Table 1 and explicitly indicate that results in terms of numbers are not comparable. [4rXV, Mxbe, wtTg]
• We added an ablation on incorporating more network operations than the convolution in the frequency domain in Section F and list the results in Table A9 for CIFAR-10. [4rXV, Mxbe]
• We extended our related work section with literature about long convolutions. [4rXV]
• We changed Figure 2 for a clearer understanding of the workflow of NIFF and the advantages it provides compared [Mxbe, wtTg]
• We changed Figure 3 to highlight the difference between small spatial convolutions, large spatial convolutions, and NIFF, as well as the improvement in computational complexity that NIFF achieves compared to large spatial convolutions. [Mxbe]
• We explain the workflow of NIFF more clearly in Section 3.1. [4rXV]
• We evaluated the baseline on ImageNet-1k with the advanced image preprocessing suggested by Liu et al. (2022) and updated Table 2 accordingly. [Mxbe]
• We added the number of learnable hyperparameters for all networks and datasets (Tables 2, A1, and A2). [Mxbe]
• We added the results for ImageNet-100 to Table 2 to provide a better synopsis of all high-resolution results. Note that for these models, PyTorch numbers for timm baseline models are not available.
• We evaluated the inference time for our proposed NIFF, the baseline, and large spatial convolutions in Tables A7 and A8. [Mxbe]
• We added the training and inference runtimes for several networks in Tables A5 (prior A3) and A7 on CIFAR-10 and in Tables A6 (prior A4) and A8 on ImageNet-100. [4rXV, Mxbe]
• We changed the headings in Table A5 (prior Table A3) and A6 (prior Table A4) to clarify which column represents the baseline. [Mxbe]
• We added an evaluation of MobileNet-v3 on ImageNet-100 and CIFAR-10 in Tables A1 and A2 (as suggested by [wtTg]). Yet, we observe that both the baseline model and the NIFF version have comparably low accuracy. For MobileNets, the training pipeline is usually highly optimized for best performance. The data augmentation scheme from Liu et al. (2022b), which we employ for all training runs to achieve comparable results, does not seem to have a beneficial effect here, neither on ImageNet-1k (for Sandler et al. (2018), Table 2) nor on ImageNet-100 (for Sandler et al. (2018); Howard et al. (2019), Table A1). [wtTg]
• We added quantitative comparisons between full convolution and separated depth-wise and 1x1 convolutions for all ResNets and DenseNet-121 in Tables 2, A1, and A2. [4rXV]
• We ablated on linear, non-circular convolutions with our NIFF and quantitatively show in Tables A3 and A4 for ImageNet-100 and CIFAR-10 that mimicking linear kernels is not beneficial for the network performance. We discuss these findings in Section D. Further, we show the resulting spatial kernels in Figures A20 and A21. [4rXV]
• We briefly discuss in Section 5 that our findings are in line with prior work by Romero et al. (2022) and ablate on the shape of the learned filters in Section A (suggested by [4rXV]). Romero et al. (2022) learn large kernels in the spatial domain and combine these kernels with a Gaussian mask, which can be elliptic instead of the common square shape of CNN kernels. Hence, we ablate on fitting a Gaussian onto our spatial kernels and show in Figures A2 and A3 that the learned kernels exhibit square shapes on average (Section A). Please note that our filter parameterization differs significantly from the one proposed in Romero et al. (2022), so our kernels are not necessarily smooth and a Gaussian fit might not produce a reasonable mask. [4rXV]
• We ablated the effect of different padding methods prior to our NIFF in Section G and show the resulting spatial kernels in Figures A18 and A19. [4rXV]
• We extended Section 6 by a discussion of the overall receptive field of the network due to the second principal component of the last layer of the network, indicating that in combination with downsampling, this results in an overall large receptive field. [4rXV]
• We included the number of FLOPs for linear convolutions with our NIFFs in Figure 8.
• We removed duplicate references. [Mxbe]
• We also fixed minor grammatical issues which we encountered while proofreading the original text.
Following the AE's final suggestions, we made the following changes to our camera-ready version.
• We tempered the last line of our abstract.
• We added several references, including a discussion of related work on dynamic and steerable filters.
• We added the link to our GitHub repository to the abstract of our paper.
Supplementary Material: pdf
Assigned Action Editor: ~Evan_G_Shelhamer1
Submission Number: 2115