Keywords: medical analysis, vision-and-language, multi-modal learning
TL;DR: This paper proposes a simple approach to extracting generic representations from medical images and texts that can be applied to a broad range of medical tasks.
Abstract: Medical vision-and-language pre-training (Med-VLP) has shown promising improvements on many downstream medical tasks owing to its ability to extract generic representations from medical images and texts. In practice, there exist two typical paradigms, i.e., the fusion-encoder paradigm and the dual-encoder paradigm, depending on whether a heavy fusion module is used. The former performs better on multi-modal tasks owing to the sufficient interaction between modalities; the latter performs better on uni-modal and cross-modal tasks due to its single-modality encoding ability. To take advantage of both paradigms, we propose an effective yet straightforward scheme named PTUnifier that unifies them by making their input formats identical: we introduce visual and textual pseudo tokens, which serve as a feature bank storing the most representative images/texts. By doing so, a single model can process tasks with different input formats (i.e., image-only, text-only, and image-text pair). Furthermore, we construct a pool of pseudo tokens (instead of static ones) to improve diversity and scalability. Experimental results show that our approach achieves state-of-the-art results on a broad range of tasks, spanning uni-modal tasks (i.e., image/text classification and text summarization), cross-modal tasks (i.e., image-to-text generation and image-text/text-image retrieval), and multi-modal tasks (i.e., visual question answering), demonstrating its effectiveness. Note that the adoption of pseudo tokens is orthogonal to most existing Med-VLP approaches, and we believe our approach could be a beneficial and complementary extension to them.
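To make the pseudo-token idea concrete, below is a minimal PyTorch sketch, not the authors' implementation: the module name, all dimensions, the pool size, and the similarity-based selection rule are illustrative assumptions. The point it demonstrates is that when one modality is missing, pseudo tokens drawn from a learnable pool stand in for it, so a single shared encoder always receives a uniform image-text-style input.

```python
# Minimal sketch of pseudo-token unification (hypothetical, not PTUnifier's code).
import torch
import torch.nn as nn

class PseudoTokenUnifier(nn.Module):
    def __init__(self, dim=768, pool_size=128, num_pseudo=32):
        super().__init__()
        # Learnable pools of candidate pseudo tokens, one pool per modality.
        self.visual_pool = nn.Parameter(torch.randn(pool_size, dim))
        self.textual_pool = nn.Parameter(torch.randn(pool_size, dim))
        self.num_pseudo = num_pseudo
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def _select(self, pool, query):
        # One plausible selection rule (an assumption, details may differ in
        # the paper): pick the pool entries most similar to the mean feature
        # of the modality that IS present.
        q = query.mean(dim=1)                                # (B, dim)
        scores = q @ pool.t()                                # (B, pool_size)
        idx = scores.topk(self.num_pseudo, dim=-1).indices   # (B, num_pseudo)
        return pool[idx]                                     # (B, num_pseudo, dim)

    def forward(self, image_feats=None, text_feats=None):
        # image_feats / text_feats: (B, seq_len, dim) patch/token embeddings
        # already projected to a shared dimension; at least one must be given.
        if image_feats is None:
            image_feats = self._select(self.visual_pool, text_feats)
        if text_feats is None:
            text_feats = self._select(self.textual_pool, image_feats)
        # A single encoder handles every input format.
        return self.encoder(torch.cat([image_feats, text_feats], dim=1))

# Usage: the same model covers all three input formats.
model = PseudoTokenUnifier()
img = torch.randn(2, 49, 768)     # e.g., ViT patch embeddings
txt = torch.randn(2, 16, 768)     # e.g., BERT token embeddings
out_pair = model(img, txt)        # image-text pair
out_img = model(image_feats=img)  # image-only: textual pseudo tokens fill in
out_txt = model(text_feats=txt)   # text-only: visual pseudo tokens fill in
```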
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Machine Learning for Sciences (e.g., biology, physics, health sciences, social sciences, climate/sustainability)