Abstract
In recent years, deep generative models have developed rapidly and can generate high-quality images from input text. Assessing the quality of synthetic images in a way that is consistent with human preferences is critical both for evaluating generative models and for selecting preferred images. Previous works aligned models with human preferences by training scoring models on image pairs with preference annotations; these carefully annotated pairs describe human preferences well. However, the current training paradigm for such preference models is to directly maximize the score of the preferred image while minimizing the score of the non-preferred image in each pair through a cross-entropy loss. This simple training paradigm has two main problems: 1) for image pairs of similar quality, blindly minimizing the score of the non-preferred image is unreasonable and easily leads to overfitting; 2) human robustness to small visual perturbations is not taken into account, so the resulting model cannot make stable choices. We therefore propose Stable Preference, which redefines the training paradigm of human preference models, together with an anti-interference loss that improves robustness to visual perturbations. Our method achieves state-of-the-art performance on two popular text-to-image human preference datasets. Extensive ablation studies and visualizations demonstrate the rationality and effectiveness of our method.
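To make the critique above concrete, the following is a minimal PyTorch sketch of the conventional pairwise paradigm the abstract describes: a scoring model is trained with a cross-entropy loss that pushes the preferred image's score above the non-preferred one. The `consistency_loss` term is only a hypothetical illustration of what a perturbation-robust objective could look like; it is an assumption for exposition, not the paper's Stable Preference or anti-interference loss, whose exact forms are not given in this section.

```python
# Sketch of the standard pairwise preference objective criticized in the abstract,
# plus a *hypothetical* perturbation-consistency term (not the paper's method).
import torch
import torch.nn.functional as F


def pairwise_ce_loss(score_preferred: torch.Tensor,
                     score_rejected: torch.Tensor) -> torch.Tensor:
    """Cross-entropy over the two scores; label 0 marks the preferred image."""
    logits = torch.stack([score_preferred, score_rejected], dim=1)  # (B, 2)
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)


def consistency_loss(score_clean: torch.Tensor,
                     score_perturbed: torch.Tensor) -> torch.Tensor:
    """Hypothetical robustness term: the score of an image and of a slightly
    perturbed copy (e.g. small noise or crop) should stay close."""
    return F.mse_loss(score_perturbed, score_clean)


if __name__ == "__main__":
    torch.manual_seed(0)
    # Placeholder scores for a batch of 4 image pairs; in practice these would
    # come from a CLIP-style scoring model applied to (prompt, image) inputs.
    s_pref, s_rej = torch.randn(4), torch.randn(4)
    s_pref_noisy = s_pref + 0.05 * torch.randn(4)  # stand-in for a perturbed view
    loss = pairwise_ce_loss(s_pref, s_rej) + 0.1 * consistency_loss(s_pref, s_pref_noisy)
    print(loss.item())
```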
This work was done when Hanting Li was an intern at Huawei Noah’s Ark Lab.
Acknowledgments
This work was supported by the Anhui Provincial Natural Science Foundation under Grant 2108085UD12. We acknowledge the support of the GPU cluster built by the MCC Lab of the Information Science and Technology Institution, USTC.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Li, H., Niu, H., Zhao, F. (2025). Stable Preference: Redefining Training Paradigm of Human Preference Model for Text-to-Image Synthesis. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15086. Springer, Cham. https://doi.org/10.1007/978-3-031-73390-1_15