Abstract: The cost of deploying vision transformers increasingly represents a barrier to wider industrial adoption. Existing compression techniques require additional end-to-end fine-tuning or incur a significant runtime penalty, making them ill-suited for online (real-time) inference, where a prediction is made on each new input as it arrives. We introduce the Visual Word Tokenizer (VWT), a training-free method for reducing energy costs while retaining performance and runtime. The VWT groups frequently used visual subwords (image patches) into visual words while infrequent ones remain intact. To do so, intra-image or inter-image statistics are leveraged to identify similar visual concepts for sequence compression. Experimentally, we demonstrate a reduction in wattage of up to 25% with only a 20% increase in runtime at most. Comparative approaches of 8-bit quantization and token merging achieve lower or similar energy efficiency but exact a higher toll on runtime (up to 100% or more). Our results indicate that VWTs are well-suited for efficient online inference with a marginal compromise on performance.
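The abstract describes grouping frequently occurring patches into visual words via inter-image statistics but gives no implementation details. The following is a minimal sketch of how such sequence compression might look, assuming patch embeddings are matched against a precomputed vocabulary of visual word centroids; the function and parameter names (compress_tokens, word_centroids, threshold) are hypothetical and are not the authors' API.

```python
import torch

def compress_tokens(patch_tokens, word_centroids, threshold):
    """Hedged sketch: merge patch tokens that match a visual word centroid.

    patch_tokens:   (N, D) patch embeddings for one image.
    word_centroids: (K, D) hypothetical precomputed "visual word" centroids
                    derived from inter-image statistics.
    Returns a shorter sequence: one averaged token per matched visual word,
    plus all unmatched (infrequent) tokens kept intact.
    """
    # Cosine similarity between every patch token and every centroid.
    sims = torch.nn.functional.normalize(patch_tokens, dim=-1) @ \
           torch.nn.functional.normalize(word_centroids, dim=-1).T  # (N, K)
    best_sim, best_word = sims.max(dim=-1)
    matched = best_sim >= threshold

    merged = []
    for w in best_word[matched].unique():
        group = patch_tokens[matched & (best_word == w)]
        merged.append(group.mean(dim=0))   # one token per visual word
    kept = patch_tokens[~matched]          # infrequent patches stay intact
    return torch.cat([torch.stack(merged), kept]) if merged else kept
```

Since no weights are updated, a step like this could in principle be applied to a frozen vision transformer at inference time, which is consistent with the training-free claim in the abstract.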
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: 1. Added additional description of the wattage calculation process for further clarity.
2. Changed the term "image space" to "pixel space" for further clarity.
3. Changed the subsection title "Random Matching of the Visual Words" to "Random Merging of Tokens" for further clarity.
4. Added Appendix subsections B.1 (and Table 9) on the random dropping of tokens and B.2 (and Table 10) on the use of VWTs with other compression techniques.
5. Improved terminology used in subsection 4.6 for further clarity.
Assigned Action Editor: ~Lu_Jiang1
Submission Number: 4604