Multimodal summarization with modality features alignment and features filtering

Published: 01 Jan 2024, Last Modified: 31 Jul 2025, Neurocomputing 2024, CC BY-SA 4.0
Abstract: Highlights
• Maximum Mean Discrepancy is used to align the textual and visual modalities.
• CLIP is used to extract visual features, and a filter is used to enhance their utilization.
• Feasibility of a Large Language Model for data preprocessing.
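
The first highlight names Maximum Mean Discrepancy (MMD) as the alignment criterion between text and image features. As a rough illustration of what that quantity measures, the sketch below computes a biased estimate of squared MMD between two batches of feature vectors. The Gaussian kernel, the bandwidth `sigma`, the biased estimator, and the function names `gaussian_kernel` and `mmd2` are all assumptions made for this example; the abstract does not say which kernel or estimator the paper uses.

```python
import numpy as np


def gaussian_kernel(a, b, sigma=1.0):
    """Pairwise Gaussian (RBF) kernel between rows of a (n, d) and b (m, d)."""
    # Squared Euclidean distances via the expansion |a|^2 + |b|^2 - 2 a.b
    sq = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * (a @ b.T)
    )
    sq = np.maximum(sq, 0.0)  # guard against tiny negative values from rounding
    return np.exp(-sq / (2.0 * sigma**2))


def mmd2(x, y, sigma=1.0):
    """Biased estimate of squared MMD between samples x (n, d) and y (m, d).

    Assumption for illustration only: a single Gaussian kernel with a fixed
    bandwidth. The estimate is near zero when the two feature sets follow
    the same distribution and grows as they diverge.
    """
    k_xx = gaussian_kernel(x, x, sigma).mean()
    k_yy = gaussian_kernel(y, y, sigma).mean()
    k_xy = gaussian_kernel(x, y, sigma).mean()
    return k_xx + k_yy - 2.0 * k_xy


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical stand-ins for textual and visual feature batches.
    text_feats = rng.normal(size=(64, 8))
    image_feats = rng.normal(loc=2.0, size=(64, 8))
    print("MMD^2 (shifted features):", mmd2(text_feats, image_feats, sigma=4.0))
    print("MMD^2 (identical features):", mmd2(text_feats, text_feats, sigma=4.0))
```

Used as a training loss, a term like this would be minimized so that the two feature distributions move closer together. How the paper weights or combines it with the summarization objective is not stated in the abstract.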