PQV-Mobile: A Combined Pruning and Quantization Toolkit to Optimize Vision Transformers for Mobile Applications
Keywords: Quantization, Pruning, Vision Transformers, Mobile Applications
Abstract: While Vision Transformers (ViTs) are extremely effective at computer vision tasks and are replacing convolutional neural networks as the new state of the art, they are complex and memory-intensive models. To run these models effectively on resource-constrained mobile/edge systems, they must be not only compressed but also optimized and converted into deployment-friendly formats. To this end, this paper presents a combined pruning and quantization tool, called PQV-Mobile, to optimize vision transformers for mobile applications. The tool supports several types of structured pruning, based on magnitude importance, Taylor importance, and Hessian importance.
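As a rough illustration, this kind of importance-driven structured pruning can be sketched with the Torch-Pruning library (torch_pruning); the library choice, the timm DeiT model name, and the pruning ratio are assumptions for this example, not necessarily PQV-Mobile's exact API:

    import torch
    import torch_pruning as tp  # assumed structured-pruning backend for this sketch
    import timm

    # Illustrative model choice; any DeiT variant would do.
    model = timm.create_model("deit_tiny_patch16_224", pretrained=False)
    example_inputs = torch.randn(1, 3, 224, 224)

    # Pick one importance criterion; all three named in the abstract exist in Torch-Pruning.
    importance = tp.importance.MagnitudeImportance(p=2)
    # importance = tp.importance.TaylorImportance()   # gradient-based: needs a backward pass before step()
    # importance = tp.importance.HessianImportance()  # likewise gradient-based

    pruner = tp.pruner.MetaPruner(
        model,
        example_inputs,
        importance=importance,
        pruning_ratio=0.09375,        # e.g. the 9.375% ratio reported below
        ignored_layers=[model.head],  # keep the classifier head intact
    )
    pruner.step()  # physically removes channels, yielding a smaller dense model

In practice, pruning ViT attention blocks may need extra, head-aware configuration beyond this minimal sketch.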
The tool also supports quantization from FP32 to FP16 and int8, targeting different mobile hardware backends.
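A minimal int8 quantization and mobile-export path can likewise be sketched with stock PyTorch utilities (torch.ao.quantization.quantize_dynamic and torch.utils.mobile_optimizer.optimize_for_mobile); again, this is an assumed workflow for illustration, not PQV-Mobile's verbatim pipeline:

    import torch
    import timm
    from torch.utils.mobile_optimizer import optimize_for_mobile

    model = timm.create_model("deit_tiny_patch16_224", pretrained=False).eval()
    example_inputs = torch.randn(1, 3, 224, 224)

    # FP16 is essentially model.half(); the int8 path is shown below.
    # Dynamic int8 quantization of Linear layers, which hold most of a ViT's weights.
    quantized = torch.ao.quantization.quantize_dynamic(
        model, {torch.nn.Linear}, dtype=torch.qint8
    )

    # Convert to TorchScript, optimize for mobile, and save for the lite interpreter.
    scripted = torch.jit.trace(quantized, example_inputs)
    mobile = optimize_for_mobile(scripted)  # backend="CPU" by default; "vulkan"/"metal" target GPU backends
    mobile._save_for_lite_interpreter("deit_int8_mobile.ptl")  # hypothetical output filename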
We demonstrate the capabilities of our tool and show important latency-memory-accuracy trade-offs for different amounts of pruning and int8 quantization with Facebook Data Efficient Image Transformer (DeiT) models. Our results show that even when a DeiT model is pruned by 9.375%, quantized from FP32 to int8, and then optimized for mobile applications, latency is reduced by 7.18× with a small accuracy loss of 2.24%. We plan to open-source this tool.
Submission Number: 53