Spatial-Enhanced Multi-Level Wavelet Patching in Vision Transformers

Published: 01 Jan 2024, Last Modified: 14 Nov 2024. IEEE Signal Process. Lett. 2024. License: CC BY-SA 4.0
Abstract: We integrate multi-level wavelet transforms into the image patching stage of ViT, decomposing images into a diverse set of frequency-domain features. These features are combined with spatial features at the same scales, which enriches image detail and improves ViT's ability to delineate intricate textures and distinct edges. As a result, ViT's accuracy on the ImageNet100 dataset improves by 2.7%. The wavelet patching module is designed for versatility and fits into various ViT derivatives without architecture modifications. It improves the performance of several leading vision transformers by 0.46–4.3%, preserving parameter efficiency without a notable increase in FLOPs.
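To make the idea of pairing wavelet sub-bands with same-scale spatial features concrete, here is a minimal NumPy sketch. It is not the paper's module: the Haar wavelet, the average-pooling spatial branch, and the names `haar_dwt2` and `wavelet_patch_features` are all assumptions chosen for illustration, and the abstract does not specify which wavelet or fusion scheme the authors use.

```python
import numpy as np


def haar_dwt2(x):
    """One level of an orthonormal 2D Haar transform on an (H, W) array.

    H and W must be even. Returns the low-pass band and the three
    detail bands, each of shape (H/2, W/2).
    """
    a = x[0::2, 0::2]  # top-left pixel of every 2x2 block
    b = x[0::2, 1::2]  # top-right
    c = x[1::2, 0::2]  # bottom-left
    d = x[1::2, 1::2]  # bottom-right
    ll = (a + b + c + d) / 2.0  # low-pass (approximation)
    lh = (a + b - c - d) / 2.0  # difference between rows
    hl = (a - b + c - d) / 2.0  # difference between columns
    hh = (a - b - c + d) / 2.0  # diagonal difference
    return ll, (lh, hl, hh)


def wavelet_patch_features(img, levels=2):
    """Hypothetical multi-level feature builder (illustrative only).

    At each level, the three Haar detail bands are stacked with an
    average-pooled copy of the image at the same resolution. Average
    pooling stands in for a spatial branch; for Haar it is proportional
    to the LL band, so LL is not stacked separately.
    """
    feats = []
    ll = img
    for _ in range(levels):
        ll, (lh, hl, hh) = haar_dwt2(ll)
        h, w = ll.shape
        f = img.shape[0] // h  # downsampling factor at this level
        spatial = img.reshape(h, f, w, f).mean(axis=(1, 3))
        feats.append(np.stack([spatial, lh, hl, hh], axis=0))
    return feats


if __name__ == "__main__":
    image = np.random.default_rng(0).random((8, 8))
    for level, feat in enumerate(wavelet_patch_features(image, levels=2), 1):
        print(level, feat.shape)
```

For an 8×8 single-channel input with `levels=2`, the sketch yields one feature map of shape (4, 4, 4) and one of shape (4, 2, 2). In a real ViT these maps would still need to be projected to token embeddings, a step the abstract does not detail and the sketch omits.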