Abstract
Most text-to-image generation work focuses on semantic consistency and neglects the style of the generated image. In this paper, a novel text-to-image generation method is proposed that generates images under a style-image constraint. To provide more comprehensive information by mining long- and short-range dependencies, a multi-group attention module is introduced to capture multi-scale dependency information in the semantic feature. An adaptive multi-scale attention normalization is adopted to attend to multi-scale style features during style fusion: style information relevant to the semantic feature is selected by the style-feature attention, and the selected style information is transferred to the generated result by aligning the mean and variance of the semantic feature with those of the style feature. Experiments on common datasets demonstrate the validity of the proposed approach.
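The mean-and-variance alignment described above is the statistic-matching mechanism popularized by adaptive instance normalization (Huang and Belongie, ICCV 2017). The following minimal PyTorch sketch illustrates only that alignment step, not the paper's full adaptive multi-scale attention normalization; the function name and tensor shapes are illustrative assumptions.

```python
import torch


def align_mean_variance(semantic_feat: torch.Tensor,
                        style_feat: torch.Tensor,
                        eps: float = 1e-5) -> torch.Tensor:
    """Transfer per-channel style statistics onto a semantic feature map.

    Normalizes each channel of `semantic_feat` to zero mean and unit
    variance over the spatial dimensions, then rescales it with the
    per-channel mean and standard deviation of `style_feat`, in the
    spirit of adaptive instance normalization. Both inputs are assumed
    to be shaped (N, C, H, W).
    """
    # Per-channel statistics over the spatial dimensions (H, W).
    c_mean = semantic_feat.mean(dim=(2, 3), keepdim=True)
    c_std = semantic_feat.std(dim=(2, 3), keepdim=True) + eps
    s_mean = style_feat.mean(dim=(2, 3), keepdim=True)
    s_std = style_feat.std(dim=(2, 3), keepdim=True) + eps

    # Whiten the semantic feature, then re-color it with style statistics.
    return s_std * (semantic_feat - c_mean) / c_std + s_mean


if __name__ == "__main__":
    content = torch.randn(1, 64, 32, 32)  # semantic (content) feature map
    style = torch.randn(1, 64, 32, 32)    # style feature map
    out = align_mean_variance(content, style)
    print(out.shape)  # torch.Size([1, 64, 32, 32])
```

In the proposed method, this alignment is applied not to the raw feature maps but to style features that have first been weighted by attention at multiple scales, so that only style information relevant to the semantics is transferred.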
Acknowledgements
This work was partially supported by the National Natural Science Foundation of China (Nos. 62076153 and 62176144), the Major Fundamental Research Project of Shandong, China (No. ZR2019ZD03), and the Taishan Scholar Project of Shandong, China (No. ts20190924).
Author information
Contributions
ZW wrote the main manuscript text and prepared figures. All authors reviewed the manuscript.
Ethics declarations
Conflict of interest
The authors declare no competing interests.
Additional information
Communicated by B. Bao.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Wang, Z., Liu, L., Zhang, H. et al. Generative adversarial text-to-image generation with style image constraint. Multimedia Systems 29, 3291–3303 (2023). https://doi.org/10.1007/s00530-023-01160-4