【Proposal】Efficient Inference for Large Multimodal Models

20 Oct 2024 (modified: 05 Nov 2024) · THU 2024 Fall AML Submission · CC BY-NC-SA 4.0
Keywords: Large Multimodal Models, Multimodal Large Language Models, Visual Instruction Tuning, Speculative Decoding, Inference Acceleration
Abstract: Large Multimodal Models (LMMs) have achieved notable success in visual instruction tuning, yet their inference is time-consuming due to the auto-regressive decoding of the Large Language Model (LLM) backbone. Speculative Decoding (SD) has proven effective for lossless acceleration of auto-regressive decoding via a draft-then-verify paradigm. In this work, we explore applying speculative decoding to boost the inference efficiency of LMMs, with a particular focus on reusing information already produced during the LMM's multimodal processing, such as vision embeddings, hidden states, and key-value (KV) caches. In parallel, we develop alignment techniques between the target model and the draft model in this setting, with the goal of maximizing the achievable speedup. We anticipate a speedup ratio of over 2x. Code and models will be released in the near future.
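For concreteness, the sketch below illustrates the standard draft-then-verify loop that SD builds on (the speculative sampling acceptance rule: accept a drafted token with probability min(1, p/q), otherwise resample from the normalized residual max(0, p - q)). This is a minimal, self-contained toy illustration of that generic loop, not the proposal's method: `draft_model`, `target_model`, and the deterministic toy distributions are hypothetical stand-ins for the LMM draft/target pair.

```python
import random

VOCAB = 8  # toy vocabulary size

def toy_dist(tokens, bias):
    # Hypothetical stand-in for a model's next-token distribution:
    # a deterministic categorical distribution over VOCAB tokens.
    rng = random.Random(hash((tuple(tokens), bias)))
    w = [rng.random() + 1e-6 for _ in range(VOCAB)]
    s = sum(w)
    return [x / s for x in w]

def draft_model(tokens):   # small, fast draft model (toy)
    return toy_dist(tokens, bias=0)

def target_model(tokens):  # large, slow target model (toy)
    return toy_dist(tokens, bias=1)

def sample(dist, rng):
    # Draw one token index from a categorical distribution.
    r, acc = rng.random(), 0.0
    for tok, p in enumerate(dist):
        acc += p
        if r < acc:
            return tok
    return VOCAB - 1

def speculative_step(tokens, gamma, rng):
    """One draft-then-verify step: draft gamma tokens, verify with the target."""
    # 1) Draft phase: the small model proposes gamma tokens auto-regressively.
    drafted, q, ctx = [], [], list(tokens)
    for _ in range(gamma):
        dist = draft_model(ctx)
        tok = sample(dist, rng)
        drafted.append(tok)
        q.append(dist)
        ctx.append(tok)
    # 2) Verify phase: the target scores each drafted position
    #    (a single parallel forward pass in a real implementation).
    accepted = []
    for i, tok in enumerate(drafted):
        p = target_model(tokens + accepted)
        if rng.random() < min(1.0, p[tok] / q[i][tok]):
            accepted.append(tok)  # accept: output matches the target distribution
        else:
            # Reject: resample from the residual max(0, p - q), renormalized.
            resid = [max(0.0, pi - qi) for pi, qi in zip(p, q[i])]
            z = sum(resid)
            resid = [x / z for x in resid] if z > 0 else p
            accepted.append(sample(resid, rng))
            return accepted
    # All gamma tokens accepted: sample one bonus token from the target.
    accepted.append(sample(target_model(tokens + accepted), rng))
    return accepted

rng = random.Random(0)
prefix = [3, 1, 4]  # token ids after the (multimodal) prompt
print(speculative_step(prefix, gamma=4, rng=rng))
```

In an actual LMM setting, the draft and target would share the prompt's vision embeddings and KV caches rather than recomputing them, which is the kind of reuse the proposal targets.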
Submission Number: 22