Abstract: Cross-modal embeddings form the foundation for multi-modal models. However, visualization methods for interpreting cross-modal embeddings have largely been confined to traditional dimensionality reduction (DR) techniques such as PCA and t-SNE. These DR methods focus on feature distributions within a single modality, while failing to incorporate metrics (e.g., CLIPScore) that span multiple modalities. This paper introduces AKRMap, a new DR technique designed to visualize cross-modal embedding metrics with enhanced accuracy by learning kernel regression of the metric landscape in the projection space. Specifically, AKRMap constructs a supervised projection network guided by a post-projection kernel regression loss, and employs adaptive generalized kernels that can be jointly optimized with the projection. This approach enables AKRMap to efficiently generate visualizations that capture complex metric distributions, while also supporting interactive features such as zoom and overlay for deeper exploration. Quantitative experiments demonstrate that AKRMap outperforms existing DR methods in generating more accurate and trustworthy visualizations. We further showcase the effectiveness of AKRMap in visualizing and comparing cross-modal embeddings for text-to-image models. Code and demo are available at https://github.com/yilinye/AKRMap.
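To make the idea of a projection network trained with a post-projection kernel regression loss concrete, below is a minimal sketch in PyTorch. It is not the authors' implementation: the network architecture, the Gaussian kernel with a single learnable bandwidth, and all variable names are illustrative assumptions standing in for the adaptive generalized kernels described in the abstract.

import torch
import torch.nn as nn

class ProjectionNet(nn.Module):
    """Maps high-dimensional cross-modal embeddings to 2D coordinates."""
    def __init__(self, dim_in, dim_hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_in, dim_hidden), nn.ReLU(),
            nn.Linear(dim_hidden, dim_hidden), nn.ReLU(),
            nn.Linear(dim_hidden, 2),
        )

    def forward(self, x):
        return self.net(x)

def kernel_regression(points, values, log_bandwidth):
    """Nadaraya-Watson estimate of each point's metric from its neighbors
    in the 2D projection, using a Gaussian kernel with a learnable bandwidth."""
    bw = torch.exp(log_bandwidth)
    d2 = torch.cdist(points, points).pow(2)           # pairwise squared distances in 2D
    w = torch.exp(-d2 / (2 * bw ** 2))
    w = w - torch.diag_embed(torch.diagonal(w))        # exclude each point's self-weight
    return (w @ values) / (w.sum(dim=1, keepdim=True) + 1e-8)

# Toy data: embeddings for text-image pairs and a per-pair metric (e.g., CLIPScore).
emb = torch.randn(512, 768)
metric = torch.rand(512, 1)

proj = ProjectionNet(emb.shape[1])
log_bw = nn.Parameter(torch.zeros(1))                  # kernel parameter optimized jointly with the projection
opt = torch.optim.Adam(list(proj.parameters()) + [log_bw], lr=1e-3)

for step in range(200):
    opt.zero_grad()
    z = proj(emb)                                       # 2D coordinates
    pred = kernel_regression(z, metric, log_bw)         # metric reconstructed after projection
    loss = nn.functional.mse_loss(pred, metric)         # post-projection kernel regression loss
    loss.backward()
    opt.step()

The key design point this sketch illustrates is that the loss is computed after projection, so the layout is pushed toward configurations where a simple kernel regression in 2D can reproduce the metric landscape.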
Lay Summary: An exponentially growing volume of images is being generated by AI from simple text prompts, yet users of generative AI platforms such as Midjourney or WebUI can only see a few images and prompts at a time on the interface, without knowing how well the model performs on a larger prompt corpus. We aim to provide a way for users to directly see and explore a large number of generated images together with their prompts through visualization.
In our paper, we develop a new visualization method called AKRMap, which shows each pair of prompt and generated image as a point in a dense 2D scatterplot, and renders an accompanying contour/heat map to reveal the global distribution of a metric that reflects how well each generated image matches its prompt. Our visualization can be displayed directly in computational notebooks and also supports interactive features such as zooming.
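For readers curious what such a view looks like in practice, here is an illustrative sketch of the scatterplot-plus-contour rendering in a notebook. It uses assumed synthetic data and a fixed-bandwidth Gaussian kernel for the background surface, and does not reflect the released tool's actual API.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
coords = rng.normal(size=(1000, 2))                                 # 2D coordinates of prompt-image pairs
metric = np.tanh(coords[:, 0]) + 0.1 * rng.normal(size=1000)        # stand-in for a per-pair metric like CLIPScore

# Evaluate a Gaussian-kernel regression of the metric on a regular grid.
gx, gy = np.meshgrid(np.linspace(-3, 3, 100), np.linspace(-3, 3, 100))
grid = np.stack([gx.ravel(), gy.ravel()], axis=1)
d2 = ((grid[:, None, :] - coords[None, :, :]) ** 2).sum(-1)
w = np.exp(-d2 / (2 * 0.3 ** 2))
surface = (w @ metric) / (w.sum(axis=1) + 1e-8)

# Contour map of the metric landscape with the point cloud overlaid.
plt.contourf(gx, gy, surface.reshape(gx.shape), levels=20, cmap="viridis")
plt.scatter(coords[:, 0], coords[:, 1], c=metric, s=2, cmap="viridis", edgecolors="none")
plt.colorbar(label="metric (e.g., CLIPScore)")
plt.show()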
We conduct experiments showing that our visualization method can help explore large generated datasets of up to half a million images and compare the outputs of different AI models. We release our visualization tool to support transparent, large-scale evaluation of generated data.
Link To Code: https://github.com/yilinye/AKRMap
Primary Area: General Machine Learning->Evaluation
Keywords: Visualization, Cross-modal Embeddings, Text-to-image generation
Submission Number: 9051