FreeStyler: A Free-Form Stylization Method via Multimodal Vector Quantization

Published: 01 Jan 2024, Last Modified: 13 Nov 2024 · CVM (2) 2024 · CC BY-SA 4.0
Abstract: Image stylization transforms an input image into a new one that retains the original content while adopting a different style. However, most existing works support only single-modal guidance, which limits their usefulness in real-world applications. To address this limitation, we propose FreeStyler, a flexible image stylization framework capable of handling various input scenarios. Unlike conventional methods that require both a content image and a style image to generate a stylized result, FreeStyler also supports cases where these references are absent, performing stylization from text or audio guidance instead. The core of FreeStyler is a vector-quantized style transfer framework that encodes content and style information into a shared discrete latent feature space, followed by a stylization transformer for style fusion and an image decoder for stylized image reconstruction. To enable free-form stylization, we introduce a novel pseudo-paired token predictor that estimates tokens from varying input forms without requiring additional text or audio data. Specifically, we leverage Contrastive Language-Image Pre-training (CLIP) as prior knowledge to align discrete representations across different modalities, and we train the framework on image and pseudo-caption pairs provided by Bootstrapping Language-Image Pre-training (BLIP). Qualitative and quantitative experiments demonstrate that our method outperforms state-of-the-art stylization methods.
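To make the described pipeline more concrete, below is a minimal PyTorch sketch of the components named in the abstract: a shared vector-quantized codebook, a content encoder, a stylization transformer, an image decoder, and a token predictor that maps a CLIP-space embedding to style tokens. This is not the authors' implementation; the module sizes, the name `StyleTokenPredictor`, and the random vector standing in for a CLIP text/audio embedding are illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of a vector-quantized stylization pipeline.
# Sizes and the StyleTokenPredictor component are hypothetical illustrations.
import torch
import torch.nn as nn


class VectorQuantizer(nn.Module):
    """Maps continuous features to the nearest entry of a shared discrete codebook."""
    def __init__(self, num_codes=512, dim=256):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):                               # z: (B, N, dim)
        w = self.codebook.weight
        # squared distances to every code, standard VQ-VAE style
        d = (z.pow(2).sum(-1, keepdim=True)
             - 2 * z @ w.t()
             + w.pow(2).sum(-1))
        idx = d.argmin(dim=-1)                          # (B, N) discrete token indices
        z_q = self.codebook(idx)
        z_q = z + (z_q - z).detach()                    # straight-through estimator
        return z_q, idx


class Encoder(nn.Module):
    """Down-samples an image into a grid of latent vectors (tokens)."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(128, dim, 4, 2, 1),
        )

    def forward(self, x):                               # x: (B, 3, H, W)
        f = self.net(x)                                 # (B, dim, H/8, W/8)
        return f.flatten(2).transpose(1, 2)             # (B, N, dim) token sequence


class StylizationTransformer(nn.Module):
    """Fuses content tokens with style tokens via attention blocks."""
    def __init__(self, dim=256, depth=4):
        super().__init__()
        layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerDecoder(layer, num_layers=depth)

    def forward(self, content_tok, style_tok):
        return self.blocks(tgt=content_tok, memory=style_tok)


class Decoder(nn.Module):
    """Reconstructs a stylized image from the fused token grid."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(dim, 128, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, 2, 1), nn.Tanh(),
        )

    def forward(self, tok, hw):
        b, n, d = tok.shape
        f = tok.transpose(1, 2).reshape(b, d, *hw)
        return self.net(f)


class StyleTokenPredictor(nn.Module):
    """Hypothetical pseudo-paired token predictor: maps a CLIP-space embedding
    (from text or audio) to style tokens, so no reference style image is needed."""
    def __init__(self, clip_dim=512, dim=256, num_tokens=64):
        super().__init__()
        self.proj = nn.Linear(clip_dim, num_tokens * dim)
        self.num_tokens, self.dim = num_tokens, dim

    def forward(self, clip_emb):                        # clip_emb: (B, clip_dim)
        return self.proj(clip_emb).view(-1, self.num_tokens, self.dim)


if __name__ == "__main__":
    enc, vq = Encoder(), VectorQuantizer()
    fuser, dec = StylizationTransformer(), Decoder()
    predictor = StyleTokenPredictor()

    content = torch.randn(1, 3, 64, 64)
    clip_style_emb = torch.randn(1, 512)                # stand-in for a CLIP text/audio embedding

    c_tok, _ = vq(enc(content))                         # discrete content tokens
    s_tok, _ = vq(predictor(clip_style_emb))            # style tokens predicted from CLIP space
    out = dec(fuser(c_tok, s_tok), hw=(8, 8))           # stylized image, (1, 3, 64, 64)
    print(out.shape)
```

In this sketch, quantizing both the encoded content features and the predicted style features against the same codebook mirrors the abstract's shared discrete latent space, and swapping the predictor's input between an image encoder, a text embedding, or an audio embedding is what would allow free-form guidance.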