Prompting as Multimodal Fusing

Anonymous

16 Nov 2021 (modified: 05 May 2023) · ACL ARR 2021 November Blind Submission
Abstract: Tsimpoukelli et al. (2021) devise Frozen, which empowers a language model to solve multimodal tasks by pretraining a vision encoder whose outputs serve as prompts to the language model. The vision encoder thus carries a dual objective: extracting image features and aligning the image/text representation spaces. We propose to disentangle these objectives by using prompt vectors to align the spaces, letting the vision encoder focus solely on extracting image features. We show that this disentangled approach is modular and parameter-efficient for tasks that involve two or more modalities.
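The architecture described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the dimensions, the random-projection stand-in for a trained vision encoder, and the names `vision_encoder`, `prompt_vectors`, and `build_lm_input` are all assumptions made for the sketch. The key idea shown is that learnable prompt vectors, rather than the vision encoder itself, are prepended to the input sequence to handle image/text alignment, so the frozen language model receives `[alignment prompts; visual tokens; text embeddings]`.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 8          # language-model embedding dimension (assumed)
N_IMG = 4      # number of visual tokens produced per image (assumed)
N_PROMPT = 2   # number of learnable alignment prompt vectors (assumed)

def vision_encoder(image):
    """Stand-in for a trained vision encoder: maps an image to
    N_IMG feature vectors in the LM embedding space. Here it is just
    a fixed random projection, for illustration only."""
    W = rng.standard_normal((image.size, N_IMG * D))
    return (image.reshape(1, -1) @ W).reshape(N_IMG, D)

# Disentangled variant (as we read the abstract): trainable prompt
# vectors take over the alignment objective, so the encoder only
# extracts features. The language model itself stays frozen.
prompt_vectors = rng.standard_normal((N_PROMPT, D))

def build_lm_input(image, text_embeddings):
    """Assemble the sequence fed to the frozen language model:
    [alignment prompts; visual tokens; text embeddings]."""
    visual_tokens = vision_encoder(image)
    return np.concatenate([prompt_vectors, visual_tokens, text_embeddings])

image = rng.standard_normal((16, 16))   # dummy image
text = rng.standard_normal((5, D))      # dummy text embeddings (5 tokens)
seq = build_lm_input(image, text)
print(seq.shape)  # (N_PROMPT + N_IMG + 5, D) = (11, 8)
```

Because the prompt vectors are the only new trainable parameters beyond the encoder, swapping in another modality only requires a new feature extractor and a fresh set of prompt vectors, which is what makes the approach modular and parameter-efficient.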