Abstract: Highlights•Using large language models to provide rich semantic information as external knowledge.•Supervised contrastive learning can optimize the distribution of visual features.•It is robust for both natural and fine-grained scenes.•Cycle consistency promotes that generated images maintain similar content and structure.
Loading