Abstract: Controllable text-to-image (T2I) diffusion models generate images conditioned on both text
prompts and semantic inputs from other modalities, such as edge maps. However, current
controllable T2I methods commonly struggle with efficiency and faithfulness,
especially when conditioning on multiple inputs from the same or different modalities. In
this paper, we propose a novel Flexible and Efficient method, FlexEControl, for controllable
T2I generation. At the core of FlexEControl is a unique weight decomposition strategy, which
allows for streamlined integration of various input types. This approach not only enhances
the faithfulness of the generated image to the control, but also significantly reduces the
computational overhead typically associated with multimodal conditioning. Our approach
achieves a reduction of 41% in trainable parameters and 30% in memory usage compared
with Uni-ControlNet. Moreover, it doubles data efficiency and can flexibly generate images
under the guidance of multiple input conditions of various modalities.
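To make the idea of weight decomposition for multi-condition control concrete, below is a minimal, hypothetical PyTorch sketch (not the authors' released code). It assumes the decomposition takes the form of a factor shared across condition modalities combined with a small per-modality factor via a Kronecker product; the names `SharedDecomposedLinear`, `factor_in`, and `factor_out` are illustrative and not from the paper.

```python
# Hypothetical sketch: a linear projection whose weight is built from a factor
# shared across condition modalities plus a small per-modality factor, so adding
# a new modality adds only a few parameters instead of a full weight matrix.
import torch
import torch.nn as nn


class SharedDecomposedLinear(nn.Module):
    """Weight for modality m is kron(shared_factor, per_modality_factor[m])
    (an assumed form of weight decomposition, for illustration only)."""

    def __init__(self, in_features: int, out_features: int, num_modalities: int,
                 factor_in: int = 16, factor_out: int = 16):
        super().__init__()
        assert in_features % factor_in == 0 and out_features % factor_out == 0
        # Shared factor reused by every modality.
        self.shared = nn.Parameter(
            torch.randn(out_features // factor_out, in_features // factor_in) * 0.02)
        # One small factor per modality (e.g., edge map, depth, segmentation).
        self.per_modality = nn.Parameter(
            torch.randn(num_modalities, factor_out, factor_in) * 0.02)

    def forward(self, x: torch.Tensor, modality: int) -> torch.Tensor:
        # Materialize the modality-specific weight via a Kronecker product.
        weight = torch.kron(self.shared, self.per_modality[modality])  # (out, in)
        return x @ weight.T


# Usage: project an edge-map feature (modality 0) and a depth feature (modality 1)
# through the same module; only the small per-modality factors differ.
layer = SharedDecomposedLinear(in_features=64, out_features=128, num_modalities=2)
edge_feat = torch.randn(4, 64)
depth_feat = torch.randn(4, 64)
out_edge = layer(edge_feat, modality=0)    # shape (4, 128)
out_depth = layer(depth_feat, modality=1)  # shape (4, 128)
```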
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: We have made the following updates to the paper:
- Created a new Figure 2 and Figure 3 and updated the figure captions.
- Added more qualitative results, including cases with more than two conditions, cases with two foregrounds, and additional ablation results.
- Reorganized the experiments section and added additional experiments.
- Included additional experimental settings and implementation details.
- Added a limitation/failure case section.
- Revised the conclusion section.
Overall, we have also thoroughly proofread the entire paper, improved the writing and presentation, provided additional clarifications, and included more details. All modifications are highlighted in blue.
Assigned Action Editor: ~Hongsheng_Li3
Submission Number: 2559