Keywords: Wyner-Ziv coding, semantic compression, multimodal data fusion
Abstract: A typical large multimodal model (LMM) employs several encoders, one per modality, for contextual encoding. Since the encoders may reside on different devices, transmitting their outputs can incur prohibitive communication overhead in resource-constrained environments. A large language model (LLM) then fuses the encoded sources with text, which can be regarded as side information, before the generative process. This structure resembles the Wyner-Ziv problem, which promises considerable compression of multiple correlated sources when side information is available at the decoder. Motivated by the Wyner-Ziv theorem, we propose a novel compression algorithm for the encoded sources and evaluate it in terms of semantic efficiency. The algorithm is applied to two architectures that trade off performance and complexity: incorporating the sources (i) at the input of the decoder (for best performance) and (ii) at its later layers (for fast inference). The results indicate that, on poor (noisy/low-throughput) channels, compression degrades the fast-inference architecture less than the best-performance one, and that semantic similarity can be moderately preserved under certain conditions. Moreover, the performance drop is negligible for certain compression ratios in both approaches.
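To make the Wyner-Ziv setting referenced in the abstract concrete, the sketch below is a minimal, hypothetical illustration (not the paper's algorithm) of coset-based Wyner-Ziv compression: the encoder transmits only a coarse coset index for each feature, and the decoder reconstructs by exploiting correlated side information that the encoder never sees. The function names, quantization step, number of cosets, and the Gaussian correlation model are all illustrative assumptions.

```python
# Hypothetical sketch of Wyner-Ziv-style (coset/binning) compression.
# Only the coset index is transmitted; the decoder alone has side information y.
import numpy as np

rng = np.random.default_rng(0)

def wz_encode(x, step=1.0, num_cosets=4):
    """Encoder: uniform fine quantization, then keep only the coset (modulo) index.
    Rate is log2(num_cosets) bits per feature instead of the full quantizer index."""
    q = np.round(x / step).astype(int)   # fine quantization index (never transmitted)
    return np.mod(q, num_cosets)         # coset index (transmitted)

def wz_decode(coset, y, step=1.0, num_cosets=4, search=8):
    """Decoder: among candidate indices sharing the received coset, pick the one
    whose reconstruction is closest to the side information y."""
    base = np.round(y / step).astype(int)
    offsets = np.arange(-search, search + 1)
    cand = base[..., None] + offsets                        # candidate fine indices
    valid = np.mod(cand, num_cosets) == coset[..., None]    # must match received coset
    dist = np.abs(cand * step - y[..., None])
    dist = np.where(valid, dist, np.inf)                    # rule out wrong cosets
    idx = np.argmin(dist, axis=-1)
    best = np.take_along_axis(cand, idx[..., None], axis=-1).squeeze(-1)
    return best * step

# Toy correlated pair: side information y is a noisy copy of the source x,
# standing in for text that is correlated with the encoded modality features.
x = rng.normal(size=1000)                 # "encoder output" features
y = x + 0.1 * rng.normal(size=1000)       # side information available at the decoder
x_hat = wz_decode(wz_encode(x), y)
print("Reconstruction MSE:", np.mean((x - x_hat) ** 2))
```

Under these assumptions the transmitted rate drops to log2(num_cosets) bits per feature, while the decoder resolves the remaining ambiguity using the correlated side information, which is the intuition the abstract draws on for compressing multimodal encoder outputs fused with text in the LLM.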
Submission Number: 22