Skrr: Skip and Re-use Text Encoder Layers for Memory Efficient Text-to-Image Generation

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY-NC-SA 4.0
TL;DR: In this paper, we reduce the size of the text encoder in T2I diffusion models by skipping and re-using its layers (Skrr).
Abstract: Large-scale text encoders in text-to-image (T2I) diffusion models have demonstrated exceptional performance in generating high-quality images from textual prompts. Unlike denoising modules that rely on multiple iterative steps, text encoders require only a single forward pass to produce text embeddings. However, despite their minimal contribution to total inference time and floating-point operations (FLOPs), text encoders demand significantly higher memory usage, up to eight times more than denoising modules. To address this inefficiency, we propose Skip and Re-use layers (Skrr), a simple yet effective pruning strategy specifically designed for text encoders in T2I diffusion models. Skrr exploits the inherent redundancy in transformer blocks by selectively skipping or reusing certain layers in a manner tailored for T2I tasks, thereby reducing memory consumption without compromising performance. Extensive experiments demonstrate that Skrr maintains image quality comparable to the original model even under high sparsity levels, outperforming existing blockwise pruning methods. Furthermore, Skrr achieves state-of-the-art memory efficiency while preserving performance across multiple evaluation metrics, including the FID, CLIP, DreamSim, and GenEval scores.
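Illustrative sketch (not part of the submission): the abstract describes Skrr only at a high level, as selectively skipping or re-using transformer blocks in the text encoder. The toy Python/PyTorch code below is our own hedged illustration of that idea, not the authors' implementation; the function name apply_skrr_plan and the plan format ("keep" / "skip" / ("reuse", j)) are assumptions introduced purely for exposition. It shows how a per-layer decision could yield a pruned block stack in which re-used positions share weights with an earlier retained block, so fewer unique parameters are kept in memory.

# Hypothetical sketch of the skip-and-reuse idea; names and plan format are
# illustrative assumptions, not the authors' API.
import torch
import torch.nn as nn

def apply_skrr_plan(layers: nn.ModuleList, plan: list) -> nn.ModuleList:
    """Build a pruned layer stack from a per-layer plan.

    plan[i] is one of:
      "keep"        -- retain layer i with its own weights
      "skip"        -- drop layer i entirely
      ("reuse", j)  -- execute kept layer j (j < i) again in place of layer i,
                       so no extra parameters are stored for position i
    """
    kept = {}      # original index -> module actually kept in memory
    pruned = []
    for i, action in enumerate(plan):
        if action == "skip":
            continue
        if action == "keep":
            kept[i] = layers[i]
            pruned.append(layers[i])
        else:                       # ("reuse", j): share weights with layer j
            _, j = action
            pruned.append(kept[j])
    return nn.ModuleList(pruned)

# Toy usage: a 6-block "text encoder" reduced to 2 unique blocks
# (4 blocks are executed, but only 2 sets of weights are stored).
blocks = nn.ModuleList(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
    for _ in range(6)
)
plan = ["keep", "skip", ("reuse", 0), "keep", "skip", ("reuse", 3)]
pruned = apply_skrr_plan(blocks, plan)

x = torch.randn(1, 8, 64)           # (batch, tokens, hidden)
for blk in pruned:                   # single forward pass through pruned stack
    x = blk(x)

In the actual method, the choice of which layers to skip or re-use is made with a criterion tailored to T2I generation quality; the sketch above only illustrates how such a plan, once found, could be applied.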
Lay Summary: Modern tools can turn text into detailed images, helping artists, educators, and everyday users bring their ideas to life. These tools rely on powerful artificial intelligence (AI) models that understand language and generate visuals. But there’s a problem: the part of the model that reads the text — called the text encoder — uses a lot of memory, even though it only runs once per image. We created a method called Skrr to make this process more memory-efficient. Skrr works by skipping or reusing parts of the text encoder that don’t add much value. This helps reduce the size of the model without lowering the quality of the images it creates. It’s a bit like trimming unnecessary scenes from a movie without changing the story. Our experiments showed that even when we removed nearly half of the text encoder, the image quality remained high. In fact, Skrr outperformed other popular compression techniques on multiple benchmarks. This makes it easier to run text-to-image systems on devices with less memory — like phones or laptops — helping make powerful AI more accessible to more people.
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Primary Area: Deep Learning->Generative Models and Autoencoders
Keywords: Generative model, Efficiency, Pruning
Submission Number: 1093