Mask Embedding for Realistic High-Resolution Medical Image Synthesis

Ren, Yinhao; Zhu, Zhe; Li, Yingzhou; Kong, Dehan; Hou, Rui; Grimm, Lars J.; Marks, Jeffery R.; Lo, Joseph Y.

doi:10.1007/978-3-030-32226-7_47

Yinhao Ren¹⁶,
Zhe Zhu¹⁷,
Yingzhou Li¹⁸,
Dehan Kong¹⁹,
Rui Hou²⁰,
Lars J. Grimm¹⁷,
Jeffery R. Marks²¹ &
…
Joseph Y. Lo^16,17,20

Part of the book series: Lecture Notes in Computer Science ((LNIP,volume 11769))

Included in the following conference series:

International Conference on Medical Image Computing and Computer-Assisted Intervention

10k Accesses
7 Citations
3 Altmetric

Abstract

Generative Adversarial Networks (GANs) have found applications in natural image synthesis and begin to show promises generating synthetic medical images. In many cases, the ability to perform controlled image synthesis using masked priors such as shape and size of organs is desired. However, mask-guided image synthesis is challenging due to the pixel level mask constraint. While the few existing mask-guided image generation approaches suffer from the lack of fine-grained texture details, we tackle the issue of mask-guided stochastic image synthesis via mask embedding. Our novel architecture first encodes the input mask as an embedding vector and then inject these embedding into the random latent vector input. The intuition is to classify semantic masks into partitions before feature up-sampling for improved sample space mapping stability. We validate our approach on a large dataset containing 39,778 patients with 443,556 negative screening Full Field Digital Mammography (FFDM) images. Experimental results show that our approach can generate realistic high-resolution (\(256\times 512\)) images with pixel-level mask constraints, and outperform other state-of-the-art approaches.

You have full access to this open access chapter, Download conference paper PDF

A generative adversarial network for synthetization of regions of interest based on digital mammograms

Article Open access 13 April 2022

A multimodal comparison of latent denoising diffusion probabilistic models and generative adversarial networks for medical image synthesis

Article Open access 26 July 2023

XCAT-GAN for Synthesizing 3D Consistent Labeled Cardiac MR Images on Anatomically Variable XCAT Phantoms

Keywords

1 Introduction

The rapid development of generative models especially on training methodology [1, 9] and architecture [10], has led to the significant improvement of resolution and quality of the output images. In the image editing scenario, semantic control of the generated images such as object category and shape is highly desired. Many studies have explored conditional Generative Adversarial Networks (cGANs) [4, 12] using class labels (one-hot vectors) [4] and mask labels [8]. Producing highly stochastic outputs as well as capturing the full entropy of the conditional distributions are of a great challenge for current cGANs.

Most cGANs derive from the basic generator–discriminator architecture. By adding conditional information to both the generator and the discriminator, cGANs can control some characteristics of the generated images. The most straightforward way to incorporate class label information is to directly concatenate the label vector with the latent vector in the generator and then to concatenate the conditional information with the latent features in the discriminator [12]. On the other hand, incorporating a pixel-level mask label requires special design of the networks to preserve the fine-grained texture details while satisfying the mask constraint [8].

In this paper we propose a novel approach to improve the generation of realistic high-resolution medical images with semantic control. We use a U-Net [16] style generator that takes both a latent vector (Gaussian noise vector) and semantic mask. Our generator first perform embedding of the mask input and then concatenate the mask embedding vector to the latent noise vector as the input of the feature projection path. The mask is an image providing constraints and could be an edge map, a segmentation map, a gradient field, etc.

We summarize our contributions as follows:

1.
We propose to use mask embedding in semantic cGANs that takes both a mask and a random latent vector as the input. With this novel structure we can generate highly stochastic images with fine-grained details.
2.
We apply the proposed approach to a medical image synthesis task, and generated realistic high-resolution images. Specifically we synthesize mammograms with a binary mask that indicates the breast shape. To our best knowledge this is the first work that can generate realistic high-resolution mammograms with semantic control.

2 Motivation

The ability to synthesize FFDM images with cancer lesions in a controlled fashion is greatly desired by the medical imaging machine learning community. Specifically, a mask-guided stochastic generator for medical image data augmentation could potentially yield gains in detection and classification algorithms given the low occupancy of pathology related pixels in most medical imaging modalities. To actually realize this gain, the generator needs to (1) have efficient mechanisms for sample space mapping to majority of the available training images (e.g. Latent Vector); (2) learn joint distribution of semantic inputs and feature realizations (e.g. Mask Embedding). This work represent the first stage for our multi-stage study to synthesize mammogram with lesions. We investigate the feasibility of synthesizing clinically normal FFDM images with mask constraint.

3 Related Work

Medical image synthesis has been a well-motivated research topic for a long time. The ability to generate infinite number of realistic looking medical phantom greatly enables studies such as virtual clinical trials and data augmentation for computer aided diagnosis algorithms.

Recently many studies have been using GANs to synthesize medical images. Those methods can be grouped into unconditioned synthesis [6, 13] and conditioned synthesis [3, 14, 15]. There is also a detailed survey of medical image synthesis using GANs [18]. Note that mammogram synthesis using GANs has been proposed in [11], but their approach focuses on image realism and resolution in the unconditional setting, hence the shape or other morphological characteristics of their synthesized mammograms cannot be controlled.

4 Network Architecture

The challenge to train an image translation model with latent vector input is to produce latent features that are compatible with the input mask constraint. In other words, the initial latent projection layer could produce features that fall outside the manifold constrained by the semantic mask input in the latent space, resulting in the following deconvolution layers to compensate for this inefficiency, and eventually leading to model capacity reduction. Our solution is to pose an additional constraint on the initially projected features by injecting the mask embedding vector into the input latent vector. This process allows a more efficient initial feature projection that produces latent features that are compatible with the mask input, thus preserving the output image quality.

Our model consists of a generator and a discriminator shown in Fig. 1. The key concept is to perform mask embedding in the generator before the latent feature projection layer to increase the overall feature projection efficiency. The generator follows a U-Net style design that can be divided into the mask projection path and the latent projection path. The discriminator takes the output of the generator as well as the corresponding mask and produce a probability score.

The input of the generator’s mask projection path is a \(256\times 512\) mask (a binary image in our case). This mask projection path has 7 convolution layers each with a stride of 2 and depth of 8 features. The output of the mask projection path is a 32-dimensional vector (mask embedding) and is injected into the latent vector as the input of the latent projection layer. The latent vector is a 100-dimensional vector thus the input of the latent projection path is a 132-dimensional vector. Each mask projection feature block (except for the last one) is then concatenated onto the corresponding latent projection feature block to complete the U-Net structure. The initial latent feature block is produced by a dense layer followed by 7 deconvolution layers with stride of 2 and size of 4. The number of kernels of each deconvolution layer starts from 128 and decreases by a factor of 0.8 (rounded to the nearest integer) in each following layer. The output of the projection layer is the synthesized image.

5 Progressive Training

We used the progressive training strategy for GAN [9]. The training was divided into 7 phases. In each phase we doubled the network dimensions and gradually faded in the newly added layer. The model was grown from the resolution of \(4\times 8\) to \(256\times 512\). We stopped at this resolution due to hardware limitation. We adjusted the batch size and learning rate for each phases so that the standard WGAN-GP [5] converging mechanism can be achieved. We trained our model on three 1080 Ti for approximately a week to reach the maximum resolution. For each phase we train the network until the discriminator loss converges and no further observable improvement is made on the synthesized images. More details can be found in our open sourced implementation.

6 Experiments

6.1 Dataset

We used the mammography dataset collected from our institution. The dataset contains 39,778 negative screening subjects. Each exam has at least 4 images (Craniocaudal view and Mediolateral oblique view for each side of the breast), resulting in 443,556 images in total. The pixel values are truncated and normalized to [0, 1] using the window level settings provided by the DICOM header. Each image was padded and resized to \(256\times 512\). For each mammography image a binary skin mask is generated using Otsu thresholding, where 1 denotes breast region and 0 denotes background. For comparison against the pix2pix model [8], we extracted the edge map of each mammography images using Sobel filters in both horizontal and vertical direction and then overlay the texture map with the corresponding skin mask.

6.2 Results

Several example results using randomly sampled skin masks and latent vectors are shown in Fig. 2. We compared our proposed model with the pix2pix model and our baseline model without embedding mechanism. Our proposed model generates mammograms with much more realistic texture details.

6.3 Comparison to Pix2Pix Method

We compare the results of our approach with the well-known pix2pix image translation model that takes only semantic mask as input. Results are shown in Fig. 3(c). Due to the image transformation nature of this model, our first approach using a smooth skin mask as the only input failed to generate any meaningful texture details. In order to evaluate the response of pix2pix model to perturbation of mask input, we constructed the texture map as mentioned in Sect. 6.1. Even trained on masks with the prior information of high frequency tissue structures, the standard pix2pix model still under performs our proposed model in terms of fine-grained texture details and variety of parenchymal patterns generated. The pix2pix result lacks stochasticity in which a very limited mapping between mask input space and sample space is possible, thus limiting the output variation. Moreover, the same problem limits training stability since the model is forced to map similar input binary patterns with drastically different realization of tissue structures without having the proper mechanism.

6.4 Comparison to Baseline Method

We explore the effect of mask embedding mechanism by removing the mask embedding layer from our proposed model and training this baseline model from scratch. The design of our baseline model is equivalent to Tub-GAN [19]. The latent input vector is adjusted to be 100+32 so that the total number of parameters in the latent projection layer stays the same to our proposed model. The exact same training schedule for our proposed model is repeated. The results are shown in Fig. 3(d). The generated images have more high resolution details compared to pix2pix mode, but lack parenchyma complexity and usually contain obvious artifacts formed during up-sampling. This is an indication of model losing capacity due to the constraint posed by the mask input. A larger collection of comparison images can be found in supplementary material.

6.5 Evaluation

For natural image generation there have been several objective metrics to measure the performance of the generative models such as Inception Score [17], Mode Score [2] and Fréchet Inception Distance [7], in medical image generation however there is no such metric available. Thus we design a reader study and let the expert radiologists assess the realism and quality of the synthesized results.

We randomly picked 50 real breast masks and generated mammograms using the three different approaches: pix2pix, our approach without mask embedding and our approach using mask embedding. All images were presented to readers in random order. Two expert radiologists were asked to rate each mammogram using 5 scores (5: definitely realistic, 4: realistic, 3: neutral, 2: fake, 1: definitely fake). The averaged score for real mammograms, synthesized results using pix2pix, synthesized results using our approach without and with mask embedding are 3.78, 1.08, 1.34, 2.38 respectively. Although subjective, these numerical results confirm that our approach with mask embedding provides a considerable improvement in realism.

7 Conclusion

We have proposed to use binary mask constraint to guide image synthesis while preserving output variety and fine-grained texture details. The challenge was to compensate for the generator capacity reduction caused by the pixel-level mask constraint. Our solution is to use mask embedding to further guide the initial projection of latent features to increase the probability of latent features falling within the manifold constrained by the mask. Our approach enables the semantic control of the synthesized mammograms while ensuring the fine-grained texture details are looking realistic. This technique can potentially be applied to other high resolution medical image modalities as well as natural images.

References

Arjovsky, M., Chintala, S., Bottou, L.: Wasserstein generative adversarial networks. In: Precup, D., Teh, Y.W. (eds.) Proceedings of the 34th International Conference on Machine Learning, vol. 70, pp. 214–223 (2017)
Google Scholar
Che, T., Li, Y., Jacob, A.P., Bengio, Y., Li, W.: Mode regularized generative adversarial networks. CoRR abs/1612.02136 (2016)
Google Scholar
Costa, P., et al.: End-to-end adversarial retinal image synthesis. IEEE Trans. Med. Imaging 37(3), 781–791 (2018)
Article Google Scholar
Goodfellow, I., et al.: Generative adversarial nets. In: Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.D., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems, vol. 27, pp. 2672–2680 (2014)
Google Scholar
Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., Courville, A.C.: Improved training of Wasserstein GANs. In: Guyon, I., et al. (eds.) Advances in Neural Information Processing Systems, pp. 5767–5777 (2017)
Google Scholar
Han, C., et al.: GAN-based synthetic brain MR image generation. In: 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), pp. 734–738, April 2018. https://doi.org/10.1109/ISBI.2018.8363678
Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Klambauer, G., Hochreiter, S.: GANs trained by a two time-scale update rule converge to a Nash equilibrium. CoRR abs/1706.08500 (2017)
Google Scholar
Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017
Google Scholar
Karras, T., Aila, T., Laine, S., Lehtinen, J.: Progressive growing of GANs for improved quality, stability, and variation. CoRR abs/1710.10196 (2017)
Google Scholar
Karras, T., Laine, S., Aila, T.: A style-based generator architecture for generative adversarial networks. CoRR abs/1812.04948 (2018)
Google Scholar
Korkinof, D., Rijken, T., O’Neill, M., Yearsley, J., Harvey, H., Glocker, B.: High-resolution mammogram synthesis using progressive generative adversarial networks. CoRR abs/1807.03401 (2018)
Google Scholar
Mirza, M., Osindero, S.: Conditional generative adversarial nets. arXiv e-prints arXiv:1411.1784, November 2014
Moradi, M., Madani, A., Karargyris, A., Syeda-Mahmood, T.F.: Chest X-ray generation and data augmentation for cardiovascular abnormality classification, p. 57, March 2018
Google Scholar
Nie, D., et al.: Medical image synthesis with deep convolutional adversarial networks. IEEE Trans. Biomed. Eng. 65(12), 2720–2730 (2018)
Article Google Scholar
Nie, D., et al.: Medical image synthesis with context-aware generative adversarial networks. In: Descoteaux, M., Maier-Hein, L., Franz, A., Jannin, P., Collins, D.L., Duchesne, S. (eds.) MICCAI 2017. LNCS, vol. 10435, pp. 417–425. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-66179-7_48
Chapter Google Scholar
Ronneberger, O., Fischer, P., Brox, T.: U-net: convolutional networks for biomedical image segmentation. CoRR abs/1505.04597 (2015)
Google Scholar
Salimans, T., et al.: Improved techniques for training GANs. In: Lee, D.D., Sugiyama, M., Luxburg, U.V., Guyon, I., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 29, pp. 2234–2242 (2016)
Google Scholar
Yi, X., Walia, E., Babyn, P.: Generative adversarial network in medical imaging: a review. CoRR abs/1809.07294 (2018)
Google Scholar
Zhao, H., Li, H., Maurer-Stroh, S., Cheng, L.: Synthesizing retinal and neuronal images with generative adversarial nets. Med. Image Anal. 49, 14–26 (2018)
Article Google Scholar

Download references

Acknowledgments

This work was supported in part by NIH/NCI U01-CA214183 and U2C-CA233254, and an equipment donation by NVIDIA Corporation.

Author information

Authors and Affiliations

Department of Biomedical Engineering, Duke University, Durham, USA
Yinhao Ren & Joseph Y. Lo
Department of Radiology, Duke University School of Medicine, Durham, USA
Zhe Zhu, Lars J. Grimm & Joseph Y. Lo
Department of Mathematics, Duke University, Durham, USA
Yingzhou Li
Department of Automation, Beijing Institute of Technology, Beijing, China
Dehan Kong
Department of Electrical Engineering, Duke University, Durham, USA
Rui Hou & Joseph Y. Lo
Department of Surgery, Duke University School of Medicine, Durham, USA
Jeffery R. Marks

Authors

Yinhao Ren
View author publications
You can also search for this author in PubMed Google Scholar
Zhe Zhu
View author publications
You can also search for this author in PubMed Google Scholar
Yingzhou Li
View author publications
You can also search for this author in PubMed Google Scholar
Dehan Kong
View author publications
You can also search for this author in PubMed Google Scholar
Rui Hou
View author publications
You can also search for this author in PubMed Google Scholar
Lars J. Grimm
View author publications
You can also search for this author in PubMed Google Scholar
Jeffery R. Marks
View author publications
You can also search for this author in PubMed Google Scholar
Joseph Y. Lo
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Joseph Y. Lo .

Editor information

Editors and Affiliations

University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
Dinggang Shen
University of Georgia, Athens, GA, USA
Tianming Liu
Western University, London, ON, Canada
Terry M. Peters
Yale University, New Haven, CT, USA
Lawrence H. Staib
University of Strasbourg, Illkirch, France
Caroline Essert
United Imaging Intelligence, Shanghai, China
Sean Zhou
University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
Pew-Thian Yap
Western University, London, ON, Canada
Ali Khan

1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 9408 KB)

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Ren, Y. et al. (2019). Mask Embedding for Realistic High-Resolution Medical Image Synthesis. In: Shen, D., et al. Medical Image Computing and Computer Assisted Intervention – MICCAI 2019. MICCAI 2019. Lecture Notes in Computer Science(), vol 11769. Springer, Cham. https://doi.org/10.1007/978-3-030-32226-7_47

Download citation

DOI: https://doi.org/10.1007/978-3-030-32226-7_47
Published: 10 October 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-32225-0
Online ISBN: 978-3-030-32226-7
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

Societies and partnerships

The Medical Image Computing and Computer Assisted Intervention Society (opens in a new tab)

Navigation