Abstract: The cost of deploying vision transformers increasingly represents a barrier to wider industrial adoption. Existing compression techniques either require additional end-to-end fine-tuning or significantly degrade energy efficiency, making them ill-suited for online (real-time) inference, where a prediction is made on each new input as it arrives. We introduce the Visual Word Tokenizer (VWT), a training-free method for reducing energy costs while retaining performance. The VWT groups frequently used visual subwords (image patches) into visual words while leaving infrequent ones intact. To do so, it leverages intra-image or inter-image statistics to identify similar visual concepts for sequence compression. Experimentally, we demonstrate a reduction in energy consumption of up to 47%. In contrast, the comparative approaches of 8-bit quantization and token merging can significantly increase energy costs (by 500% or more in some cases). Our results indicate that VWTs are well-suited for efficient online inference, with only a marginal compromise in performance.
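To make the grouping mechanism concrete, below is a minimal sketch of how the inter-image variant could operate, as we understand it from the abstract: each patch embedding is matched against a precomputed vocabulary of visual words, matched (frequent) patches are pooled into one token per word, and unmatched (infrequent) patches pass through intact. The names `visual_words` and `tau`, as well as the mean-pooling merge rule, are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def compress_sequence(patch_emb: torch.Tensor,
                      visual_words: torch.Tensor,
                      tau: float = 0.8) -> torch.Tensor:
    """patch_emb: (N, D) patch embeddings; visual_words: (K, D) centroids.
    Returns a shorter (<= N, D) sequence after merging frequent patches."""
    # Cosine similarity of every patch to every visual word: (N, K).
    sims = F.cosine_similarity(patch_emb.unsqueeze(1),
                               visual_words.unsqueeze(0), dim=-1)
    best_sim, best_word = sims.max(dim=1)
    matched = best_sim >= tau

    kept = [patch_emb[~matched]]           # infrequent patches remain intact
    for k in best_word[matched].unique():  # frequent patches merge per word
        group = patch_emb[matched & (best_word == k)]
        kept.append(group.mean(dim=0, keepdim=True))
    return torch.cat(kept, dim=0)

# Example: 196 ViT patches compressed against a 32-word vocabulary.
out = compress_sequence(torch.randn(196, 768), torch.randn(32, 768))
print(out.shape)  # (<=196, 768)
```

Because the vocabulary and threshold are fixed in advance, a step like this requires no fine-tuning, which is consistent with the training-free claim above.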
Submission Length: Regular submission (no more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=TY4qi6dBnA
Changes Since Last Submission: - In Table 2, we have combined the power and time results into a single energy measurement in Joules. This addresses a main criticism of the previous submission, namely that it was unclear whether a method with lower power but higher runtime was more efficient. The table now better represents the energy efficiency of each method under analysis. Figure 9 in the Appendix has been correspondingly switched to measuring energy.
- We have better organised the paper in a number of ways. First, we reduced the number of subsections by grouping our results into “Experimental Results” and “Ablation Studies” in both the main paper and the Appendix. We have also improved the captions and paragraphs (in “Experimental Results”) that describe Tables 1, 2, and 3. A yellow highlight has been applied to the VWT methods in Tables 2 and 3 for further visual clarity. Lastly, we have merged a previous subsection on choosing between the intra-image and inter-image approaches into the conclusion.
- We have incorporated the additional experiments requested in the previous reviews, namely the random dropping of tokens and the use of VWTs with quantization, into the Appendix (Tables 9 and 10). These changes were made during the previous review process, but we reiterate here our incorporation of the reviewer feedback.
Assigned Action Editor: ~Blake_Aaron_Richards1
Submission Number: 5761