Hierarchical Image Transformer Based on the Segment Anything Model

23 Sept 2023 (modified: 25 Mar 2024) · ICLR 2024 Conference · Desk Rejected Submission
Keywords: SAM, Hierarchical transformer, Variable mask ratio, Image transformer, LoRA
TL;DR: We propose a hierarchical image transformer fine-tuned with low-rank adaptation (LoRA) on top of the large-scale image-segmentation Segment Anything Model (SAM).
Abstract: We propose a hierarchical image transformer generation model based on the Segment Anything Model (SAM), and explore a new research paradigm of customizing a large-scale segmentation model for transformer-based image generation. samformer applies a low-rank adaptation (LoRA) fine-tuning strategy to the SAM image encoder and, together with self-attention, hierarchical VAE encoders, a masked transformer decoder, and a predictive transformer head, is fine-tuned on ImageNet-scale image datasets. SAM is chosen for its image-level semantic masks. Specifically, in the first stage we use SAM-based segmentation features for attention-based hierarchical encoding and decoding trained with reconstruction losses; this first-stage algorithm achieves better performance in both vector quantization and image generation. In the second stage, the masked transformer predicts masked tokens in parallel, attending to tokens in all directions at every layer. We also observe that a warm-up fine-tuning strategy allows samformer to converge reliably and reduce its losses. Unlike SAM, samformer can perform both unconditional and class-conditional image generation. Our trained B-size samformer model achieves FID scores of 3.58 for unconditional image generation on FFHQ and 4.28 for class-conditional image generation on ImageNet, comparable to state-of-the-art methods, while reducing large-model pre-training time by a factor of 3; the M-size pre-trained model also achieves results competitive with the state of the art. We conduct extensive experiments to verify the effectiveness of our design, and we further demonstrate the proposed method on multiple generation tasks, including unconditional, class-conditional, and text-conditional image generation. Because samformer updates only a small number of SAM parameters, its training cost is modest compared to other large generative models.
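The central fine-tuning strategy in the abstract is low-rank adaptation (LoRA) of the frozen SAM image encoder. The sketch below illustrates the general idea only: a hypothetical wrapper around a single linear projection in which the pretrained weights stay frozen and only small low-rank matrices are trained. Layer names, ranks, and dimensions are assumptions, not the paper's actual configuration.

```python
# Minimal LoRA sketch (not the authors' code): freeze a pretrained linear layer
# and learn a low-rank additive update W x + (alpha / r) * B A x.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear with trainable low-rank adapters A and B."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # pretrained weights stay frozen
            p.requires_grad = False
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path plus scaled low-rank update; only lora_a / lora_b get gradients.
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

# Illustrative use: wrap a qkv projection standing in for one SAM encoder block.
qkv = nn.Linear(768, 768 * 3)
adapted = LoRALinear(qkv, rank=8)
out = adapted(torch.randn(2, 196, 768))
print(out.shape)  # torch.Size([2, 196, 2304])
```

In this setup the number of trainable parameters is only rank * (in_features + out_features) per wrapped layer, which is consistent with the abstract's claim that samformer updates only a small fraction of SAM's parameters.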
Primary Area: generative models
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 8155