Abstract: Highlights
• Maximum Mean Discrepancy to align the textual and visual modalities.
• Using CLIP to extract visual features and a filter to enhance utilization.
• Feasibility of Large Language Model for data preprocessing.
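To make the first highlight concrete, here is a minimal sketch of how a Maximum Mean Discrepancy (MMD) term between two sets of feature vectors might be computed. The highlights do not state the kernel, the bandwidth, or the estimator used, so the Gaussian (RBF) kernel, the `sigma` parameter, the biased estimator, and the function names `rbf_kernel` and `mmd2` are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def rbf_kernel(a, b, sigma=1.0):
    """Gaussian kernel matrix between rows of a (n, d) and rows of b (m, d)."""
    # Pairwise squared Euclidean distances via ||a||^2 + ||b||^2 - 2 a.b
    sq = (
        np.sum(a ** 2, axis=1)[:, None]
        + np.sum(b ** 2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    sq = np.maximum(sq, 0.0)  # clip tiny negatives caused by rounding
    return np.exp(-sq / (2.0 * sigma ** 2))


def mmd2(x, y, sigma=1.0):
    """Biased estimate of squared MMD between samples x and y.

    Assumption: x and y stand in for textual and visual feature batches
    that have already been projected to the same dimension.
    """
    k_xx = rbf_kernel(x, x, sigma).mean()
    k_yy = rbf_kernel(y, y, sigma).mean()
    k_xy = rbf_kernel(x, y, sigma).mean()
    return k_xx + k_yy - 2.0 * k_xy


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    text_feats = rng.normal(size=(64, 8))             # hypothetical text features
    image_feats = rng.normal(loc=2.0, size=(64, 8))   # hypothetical shifted image features
    print(mmd2(text_feats, image_feats))
```

Used as a training penalty, a value like this shrinks as the two feature distributions move closer together, which is the sense in which MMD can "align" modalities. In practice the bandwidth is often chosen from the data or several bandwidths are combined.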
External IDs: dblp:journals/ijon/TangLCL24