Keywords: LMMs, Multimodal LLMs, Skipping computation, Overparametrization, Model Compression.
TL;DR: The work shows that there is redundant computation inside MLLMs, and thus the potential to significantly reduce inference costs without sacrificing performance.
Abstract: Large Language Models (LLMs) have demonstrated remarkable success in both textual and multimodal domains. However, this success often comes with substantial computational costs, particularly when handling lengthy sequences of multimodal inputs. While recent efforts have focused on improving training efficiency through parameter- and data-efficient methods, inference costs have received less attention. In this study, we complement existing training-efficient approaches by investigating computation redundancy in Multimodal Large Language Models (MLLMs) during inference. We propose different methods to skip computations, such as skipping entire blocks, FFN layers, or self-attention (SA) layers. Additionally, we explore parallelizing certain layers, such as the FFN and SA layers, or even entire blocks, which reduces the overall model depth. Our findings show that (1) a significant amount of computation can be avoided at inference time, especially for tasks such as Visual Question Answering (VQA); (2) when training with compressed LLMs, over 97% of the original performance can be retained, even when skipping half of the blocks or removing 70% of the weights; and (3) properly training with smaller LLMs can yield performance comparable to LLMs 2 to 3 times larger. Finally, we extend our investigation to recent MLLMs, such as LLaVA-1.5, and observe similar trends. Our work shows that there is redundant computation inside MLLMs and thus the potential to significantly improve inference costs without sacrificing performance. The code is available here: https://github.com/mshukor/ima-lmms.
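The sketch below is a minimal, hypothetical illustration (not the authors' implementation; see the linked repository for that) of the skipping and parallelizing strategies named in the abstract, applied to a generic pre-norm transformer block in PyTorch. All module names and hyperparameters are assumptions for illustration only.

```python
import torch
import torch.nn as nn


class Block(nn.Module):
    """Generic pre-norm transformer block with optional SA/FFN skipping
    and a parallel SA+FFN variant (illustrative, not the paper's code)."""

    def __init__(self, dim=512, heads=8, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x, skip_sa=False, skip_ffn=False, parallel=False):
        if parallel:
            # Parallel variant: SA and FFN both read the block input,
            # reducing the effective sequential depth of the block.
            h = self.norm1(x)
            attn_out, _ = self.attn(h, h, h)
            return x + attn_out + self.ffn(self.norm2(x))
        if not skip_sa:
            h = self.norm1(x)
            attn_out, _ = self.attn(h, h, h)
            x = x + attn_out
        if not skip_ffn:
            x = x + self.ffn(self.norm2(x))
        # With both skips enabled, the block reduces to the residual path.
        return x


def forward_with_block_skipping(blocks, x, keep_every=2):
    """Skip entire blocks at inference time, e.g. keep every other block."""
    for i, blk in enumerate(blocks):
        if i % keep_every == 0:
            x = blk(x)
    return x


# Usage example: half the blocks are skipped entirely at inference time.
blocks = nn.ModuleList([Block() for _ in range(8)])
tokens = torch.randn(1, 16, 512)  # (batch, sequence of multimodal tokens, dim)
out = forward_with_block_skipping(blocks, tokens, keep_every=2)
```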
Submission Number: 3