Cross-Modal Generative Augmentation for Multimodal Biological Classification

Published: 20 May 2026, Last Modified: 20 May 2026. Accepted by TMLR. License: CC BY 4.0
Abstract: Recent advances in vision-language models have enabled cross-modal generation between text and images, achieving remarkable progress in general-domain understanding. However, their potential remains largely unexplored in scientific and biological applications, where datasets often couple complex visual observations with structured metadata or textual descriptors. We propose a cross-modal generative framework for enriching multimodal biological classification that supports direction-agnostic generation (image-to-text or text-to-image), depending on which modality is available. The framework integrates generative augmentation with multimodal alignment to augment both visual and textual representations, enabling the synthesis of complementary modality data that may otherwise be unavailable in biological datasets. Experimental results on the HAM10000 and EMPO500 datasets demonstrate improvements over baseline models across multiple evaluation metrics. The proposed framework is model-agnostic and compatible with open-weight alternatives, paving the way for biologically grounded multimodal generation and analysis.
Submission Type: Regular submission (no more than 12 pages of main content)
Supplementary Material: zip
Assigned Action Editor: ~Bryan_Allen_Plummer1
Submission Number: 7896