Highlights

• PILL bridges the gap between pre-trained LMs and multimodal understanding.
• MAG prevents visual information from interfering with the LLM's text modeling.
• MoMAE equips each modality with dedicated FFNs to address the modal entanglement issue.
• PILL exhibits superior efficiency and competitive performance.
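Taken together, the highlights describe a gated-injection design (MAG) paired with modality-routed feed-forward experts (MoMAE). The following is a minimal PyTorch sketch of those two ideas only; the class names, the tanh-gated residual, and the boolean routing mask are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn


class ModalityAttentionGate(nn.Module):
    """Sketch of a MAG-style gate: scales visual features before adding
    them to the text stream, so visual information cannot overwhelm the
    LLM's text modeling. Gating form is an assumption."""

    def __init__(self):
        super().__init__()
        # Initialized at zero so training starts text-dominant.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_hidden: torch.Tensor, visual_hidden: torch.Tensor) -> torch.Tensor:
        return text_hidden + torch.tanh(self.gate) * visual_hidden


class ModalityRoutedFFN(nn.Module):
    """Sketch of a MoMAE-style layer: each token is routed to a dedicated
    FFN for its modality, keeping visual and textual feed-forward
    computation disentangled."""

    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.text_ffn = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
        self.visual_ffn = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, hidden: torch.Tensor, is_visual: torch.Tensor) -> torch.Tensor:
        # is_visual: bool mask of shape (batch, seq) marking visual tokens.
        mask = is_visual.unsqueeze(-1)  # (batch, seq, 1), broadcasts over dim
        return torch.where(mask, self.visual_ffn(hidden), self.text_ffn(hidden))


if __name__ == "__main__":
    B, S, D = 2, 8, 64
    text, visual = torch.randn(B, S, D), torch.randn(B, S, D)
    fused = ModalityAttentionGate()(text, visual)
    is_visual = torch.zeros(B, S, dtype=torch.bool)
    is_visual[:, :3] = True  # pretend the first 3 positions are visual tokens
    out = ModalityRoutedFFN(D, 4 * D)(fused, is_visual)
    print(out.shape)  # torch.Size([2, 8, 64])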