
Stable Preference: Redefining Training Paradigm of Human Preference Model for Text-to-Image Synthesis

  • Conference paper
  • First Online:
Computer Vision – ECCV 2024 (ECCV 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15086)


Abstract

In recent years, deep generative models have developed rapidly and can generate high-quality images from input text. Assessing the quality of synthetic images in a way consistent with human preferences is critical both for evaluating generative models and for selecting preferred images. Previous works aligned models with human preferences by training scoring models on image pairs with preference annotations. These carefully annotated image pairs describe human preferences for choosing images well. However, the current training paradigm of these preference models directly maximizes the score of the preferred image while minimizing the score of the non-preferred image in each pair through a cross-entropy loss. This simple and naive training paradigm has two main problems: 1) For image pairs of similar quality, blindly minimizing the score of the non-preferred image is unreasonable and easily leads to overfitting. 2) Human robustness to small visual perturbations is not taken into account, so the final model cannot make stable choices. We therefore propose Stable Preference, which redefines the training paradigm of human preference models, together with an anti-interference loss that improves robustness to visual disturbances. Our method achieves state-of-the-art performance on two popular text-to-image human preference datasets. Extensive ablation studies and visualizations demonstrate the rationality and effectiveness of our method.
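For context, the conventional pairwise training objective criticized above can be written as a cross-entropy over the two scores of each annotated pair. The sketch below illustrates only that baseline paradigm, not the Stable Preference or anti-interference losses, which the abstract does not specify in detail; the function name pairwise_preference_loss and the score_fn placeholder are assumptions for illustration.

    import torch
    import torch.nn.functional as F

    def pairwise_preference_loss(score_preferred, score_rejected):
        # Baseline pairwise paradigm (an illustrative sketch, not the authors' method):
        # stack the two scores of each pair into 2-way logits, index 0 = preferred image.
        logits = torch.stack([score_preferred, score_rejected], dim=1)  # (batch, 2)
        targets = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
        # Cross-entropy pushes the preferred score up and the non-preferred score down,
        # which the paper argues is problematic when the two images are of similar quality.
        return F.cross_entropy(logits, targets)

    # Hypothetical usage with a CLIP-style scoring model score_fn(prompt, image):
    # loss = pairwise_preference_loss(score_fn(prompt, img_preferred),
    #                                 score_fn(prompt, img_rejected))

Because the targets are hard one-hot labels, the loss keeps widening the score gap even for near-tied pairs, which is the overfitting concern the abstract raises.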

This work was done when Hanting Li was an intern at Huawei Noah’s Ark Lab.



Acknowledgments

This work was supported by the Anhui Provincial Natural Science Foundation under Grant 2108085UD12. We acknowledge the support of the GPU cluster built by the MCC Lab of the Information Science and Technology Institution, USTC.

Author information


Corresponding author

Correspondence to Feng Zhao.



Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Li, H., Niu, H., Zhao, F. (2025). Stable Preference: Redefining Training Paradigm of Human Preference Model for Text-to-Image Synthesis. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15086. Springer, Cham. https://doi.org/10.1007/978-3-031-73390-1_15


  • DOI: https://doi.org/10.1007/978-3-031-73390-1_15

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-73389-5

  • Online ISBN: 978-3-031-73390-1

  • eBook Packages: Computer Science, Computer Science (R0)
