The Preference is in the Details: Text-to-Image Preference Alignment with Fine-grained Visual Cues
Track: long paper (up to 10 pages)
Domain: machine learning
Abstract: Aligning text-to-image diffusion models with human preferences is essential for reliable deployment, yet existing approaches largely treat preference alignment as an output-level objective driven by coarse comparisons. Human visual judgment, however, is structured around fine-grained perceptual factors such as semantic coherence, texture fidelity, and local consistency, which require alignment at the level of internal representations. In this work, we present PreFine, a framework that reformulates preference learning as a representational alignment problem by introducing structured, fine-grained preference supervision through controlled perturbations of high-quality images. These perturbations induce targeted variations along perceptually meaningful axes, encouraging diffusion models to develop representations that are sensitive to localized degradations while remaining robust to irrelevant variations. We further introduce a difficulty-aware curriculum that progressively refines perceptual sensitivity during training, enabling improved alignment with human judgments. Our experiments show that PreFine consistently improves alignment metrics across models and datasets, with win-rate gains of up to 13.0% on Aesthetics Score and 15.2% on ImageReward. These results suggest that fine-grained preference supervision improves alignment between learned visual representations and human perceptual evaluation, highlighting the role of structured preference signals in scalable alignment of generative models.
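The abstract describes building preference pairs by applying controlled perturbations to high-quality images, with a difficulty-aware curriculum that makes the perturbations progressively subtler. The paper itself does not specify the operators or schedule here, so the sketch below is purely illustrative: the patch-noise perturbation, the patch size, and the linear strength schedule are all assumptions, not PreFine's actual method.

```python
import numpy as np

def local_noise(img, strength, rng):
    """Inject noise into one random patch of a [0, 1] float image
    (a stand-in for a localized degradation such as texture damage)."""
    h, w = img.shape[:2]
    ph, pw = h // 4, w // 4
    y = int(rng.integers(0, h - ph + 1))
    x = int(rng.integers(0, w - pw + 1))
    out = img.copy()
    noise = strength * rng.standard_normal((ph, pw) + img.shape[2:])
    out[y:y + ph, x:x + pw] += noise
    return np.clip(out, 0.0, 1.0)

def curriculum_strength(step, total_steps, start=0.5, end=0.05):
    """Difficulty-aware schedule: perturbation strength shrinks linearly
    over training, so later pairs demand finer perceptual sensitivity."""
    t = min(step / max(total_steps, 1), 1.0)
    return start + t * (end - start)

def make_preference_pair(img, step, total_steps, rng):
    """Return (preferred, degraded): the clean image is preferred over its
    perturbed counterpart at the current curriculum difficulty."""
    s = curriculum_strength(step, total_steps)
    return img, local_noise(img, s, rng)
```

Such pairs would then feed a standard preference objective (e.g., a DPO-style loss over the diffusion model), with the curriculum controlling how subtle each degradation is at a given training step.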
Presenter: ~Pulkit_Bansal1
Submission Number: 63