Abstract
In image-to-image translation, each patch in the output should reflect the content of the corresponding patch in the input, independent of domain. We propose a straightforward method for doing so – maximizing mutual information between the two, using a framework based on contrastive learning. The method encourages two elements (corresponding patches) to map to a similar point in a learned feature space, relative to other elements (other patches) in the dataset, referred to as negatives. We explore several critical design choices for making contrastive learning effective in the image synthesis setting. Notably, we use a multilayer, patch-based approach, rather than operate on entire images. Furthermore, we draw negatives from within the input image itself, rather than from the rest of the dataset. We demonstrate that our framework enables one-sided translation in the unpaired image-to-image translation setting, while improving quality and reducing training time. In addition, our method can even be extended to the training setting where each “domain” is only a single image.
Notes
1. Pretrained model from https://github.com/kazuto1011/deeplab-pytorch.
References
Almahairi, A., Rajeswar, S., Sordoni, A., Bachman, P., Courville, A.: Augmented cyclegan: Learning many-to-many mappings from unpaired data. In: International Conference on Machine Learning (ICML) (2018)
Amodio, M., Krishnaswamy, S.: Travelgan: Image-to-image translation by transformation vector learning. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8983–8992 (2019)
Bachman, P., Hjelm, R.D., Buchwalter, W.: Learning representations by maximizing mutual information across views. In: Advances in Neural Information Processing Systems (NeurIPS) (2019)
Benaim, S., Wolf, L.: One-sided unsupervised domain mapping. In: Advances in Neural Information Processing Systems (NeurIPS) (2017)
Bousmalis, K., Silberman, N., Dohan, D., Erhan, D., Krishnan, D.: Unsupervised pixel-level domain adaptation with generative adversarial networks. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
Caesar, H., Uijlings, J., Ferrari, V.: Coco-stuff: Thing and stuff classes in context. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018)
Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI) 40(4), 834–848 (2018)
Chen, Q., Koltun, V.: Photographic image synthesis with cascaded refinement networks. In: IEEE International Conference on Computer Vision (ICCV) (2017)
Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning (ICML) (2020)
Choi, Y., Choi, M., Kim, M., Ha, J.W., Kim, S., Choo, J.: Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018)
Choi, Y., Uh, Y., Yoo, J., Ha, J.W.: Stargan v2: Diverse image synthesis for multiple domains. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2020)
Chopra, S., Hadsell, R., LeCun, Y.: Learning a similarity metric discriminatively, with application to face verification. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2005)
Cordts, M., et al.: The cityscapes dataset for semantic urban scene understanding. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016)
Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A Large-Scale Hierarchical Image Database. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2009)
Doersch, C., Gupta, A., Efros, A.A.: Unsupervised visual representation learning by context prediction. In: IEEE International Conference on Computer Vision (ICCV) (2015)
Dosovitskiy, A., Brox, T.: Generating images with perceptual similarity metrics based on deep networks. In: Advances in Neural Information Processing Systems (2016)
Dosovitskiy, A., Fischer, P., Springenberg, J.T., Riedmiller, M., Brox, T.: Discriminative unsupervised feature learning with exemplar convolutional neural networks. IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI) 38(9), 1734–1747 (2015)
Fu, H., Gong, M., Wang, C., Batmanghelich, K., Zhang, K., Tao, D.: Geometry-consistent generative adversarial networks for one-sided unsupervised domain mapping. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2019)
Gatys, L.A., Ecker, A.S., Bethge, M.: Image style transfer using convolutional neural networks. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016)
Gokaslan, A., Ramanujan, V., Ritchie, D., In Kim, K., Tompkin, J.: Improving shape deformation in unsupervised image-to-image translation. In: European Conference on Computer Vision (ECCV) (2018)
Goodfellow, I., et al.: Generative adversarial nets. In: Advances in Neural Information Processing Systems (2014)
Gu, S., Chen, C., Liao, J., Yuan, L.: Arbitrary style transfer with deep feature reshuffle. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018)
Gutmann, M., Hyvärinen, A.: Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In: International Conference on Artificial Intelligence and Statistics (AISTATS) (2010)
He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum contrast for unsupervised visual representation learning. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2020)
Hénaff, O.J., Razavi, A., Doersch, C., Eslami, S., Oord, A.v.d.: Data-efficient image recognition with contrastive predictive coding. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2019)
Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In: Advances in Neural Information Processing Systems (2017)
Hinton, G.E., Salakhutdinov, R.R.: Reducing the dimensionality of data with neural networks. Science 313(5786), 504–507 (2006)
Hjelm, R.D., et al.: Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670 (2018)
Hoffman, J., et al.: Cycada: Cycle-consistent adversarial domain adaptation. In: International Conference on Machine Learning (ICML) (2018)
Huang, X., Liu, M.Y., Belongie, S., Kautz, J.: Multimodal unsupervised image-to-image translation. European Conference on Computer Vision (ECCV) (2018)
Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
Isola, P., Zoran, D., Krishnan, D., Adelson, E.H.: Crisp boundary detection using pointwise mutual information. In: European Conference on Computer Vision (ECCV) (2014)
Isola, P., Zoran, D., Krishnan, D., Adelson, E.H.: Learning visual groups from co-occurrences in space and time. arXiv preprint arXiv:1511.06811 (2015)
Johnson, J., Alahi, A., Fei-Fei, L.: Perceptual losses for real-time style transfer and super-resolution. In: European Conference on Computer Vision (ECCV) (2016)
Karras, T., Laine, S., Aila, T.: A style-based generator architecture for generative adversarial networks. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2019)
Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., Aila, T.: Analyzing and improving the image quality of stylegan. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2020)
Kim, T., Cha, M., Kim, H., Lee, J., Kim, J.: Learning to discover cross-domain relations with generative adversarial networks. In: International Conference on Machine Learning (ICML) (2017)
Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: International Conference on Learning Representations (ICLR) (2015)
Kolkin, N., Salavon, J., Shakhnarovich, G.: Style transfer by relaxed optimal transport and self-similarity. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2019)
Larsson, G., Maire, M., Shakhnarovich, G.: Colorization as a proxy task for visual understanding. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 6874–6883 (2017)
Lee, H.Y., Tseng, H.Y., Huang, J.B., Singh, M.K., Yang, M.H.: Diverse image-to-image translation via disentangled representation. In: European Conference on Computer Vision (ECCV) (2018)
Li, C., et al.: Alice: Towards understanding adversarial learning for joint distribution matching. In: Advances in Neural Information Processing Systems (2017)
Liang, X., Zhang, H., Lin, L., Xing, E.: Generative semantic manipulation with mask-contrasting gan. In: European Conference on Computer Vision (ECCV) (2018)
Liu, M.Y., Breuel, T., Kautz, J.: Unsupervised image-to-image translation networks. In: Advances in Neural Information Processing Systems (2017)
Liu, M.Y., et al.: Few-shot unsupervised image-to-image translation. In: IEEE International Conference on Computer Vision (ICCV) (2019)
Lotter, W., Kreiman, G., Cox, D.: Deep predictive coding networks for video prediction and unsupervised learning. arXiv preprint arXiv:1605.08104 (2016)
Löwe, S., O’Connor, P., Veeling, B.: Putting an end to end-to-end: Gradient-isolated learning of representations. In: Advances in Neural Information Processing Systems (NeurIPS) (2019)
Luan, F., Paris, S., Shechtman, E., Bala, K.: Deep photo style transfer. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
Malisiewicz, T., Gupta, A., Efros, A.A.: Ensemble of Exemplar-SVMs for object detection and beyond. In: IEEE International Conference on Computer Vision (ICCV) (2011)
Mao, X., Li, Q., Xie, H., Lau, Y.R., Wang, Z., Smolley, S.P.: Least squares generative adversarial networks. In: IEEE International Conference on Computer Vision (ICCV) (2017)
Mechrez, R., Talmi, I., Shama, F., Zelnik-Manor, L.: Maintaining natural image statistics with the contextual loss. In: Asian Conference on Computer Vision (ACCV) (2018)
Mechrez, R., Talmi, I., Zelnik-Manor, L.: The contextual loss for image transformation with non-aligned data. In: European Conference on Computer Vision (ECCV) (2018)
Mescheder, L., Geiger, A., Nowozin, S.: Which training methods for gans do actually converge? In: International Conference on Machine Learning (ICML) (2018)
Misra, I., van der Maaten, L.: Self-supervised learning of pretext-invariant representations. arXiv preprint arXiv:1912.01991 (2019)
Misra, I., Zitnick, C.L., Hebert, M.: Shuffle and learn: unsupervised learning using temporal order verification. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 527–544. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_32
Ngiam, J., Khosla, A., Kim, M., Nam, J., Lee, H., Ng, A.Y.: Multimodal deep learning. In: International Conference on Machine Learning (ICML) (2011)
Oord, A.v.d., Li, Y., Vinyals, O.: Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018)
Owens, A., Wu, J., McDermott, J.H., Freeman, W.T., Torralba, A.: Ambient sound provides supervision for visual learning. In: European Conference on Computer Vision (ECCV) (2016)
Park, T., Liu, M.Y., Wang, T.C., Zhu, J.Y.: Semantic image synthesis with spatially-adaptive normalization. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2019)
Pathak, D., Krahenbuhl, P., Donahue, J., Darrell, T., Efros, A.A.: Context encoders: Feature learning by inpainting. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2536–2544 (2016)
Radford, A., Metz, L., Chintala, S.: Unsupervised representation learning with deep convolutional generative adversarial networks. In: International Conference on Learning Representations (ICLR) (2016)
Rao, K., Harris, C., Irpan, A., Levine, S., Ibarz, J., Khansari, M.: Rl-cyclegan: Reinforcement learning aware simulation-to-real. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2020)
Richter, S.R., Vineet, V., Roth, S., Koltun, V.: Playing for data: Ground truth from computer games. In: European Conference on Computer Vision (ECCV) (2016)
Shaham, T.R., Dekel, T., Michaeli, T.: Singan: Learning a generative model from a single natural image. In: IEEE International Conference on Computer Vision (ICCV) (2019)
Shocher, A., Bagon, S., Isola, P., Irani, M.: InGAN: Capturing and remapping the "DNA" of a natural image. In: IEEE International Conference on Computer Vision (ICCV) (2019)
Shocher, A., Cohen, N., Irani, M.: “zero-shot” super-resolution using deep internal learning. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018)
Shrivastava, A., Malisiewicz, T., Gupta, A., Efros, A.A.: Data-driven visual similarity for cross-domain image matching. ACM Transactions on Graphics (SIGGRAPH Asia) 30(6) (2011)
Shrivastava, A., Pfister, T., Tuzel, O., Susskind, J., Wang, W., Webb, R.: Learning from simulated and unsupervised images through adversarial training. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: International Conference on Learning Representations (ICLR) (2015)
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016)
Taigman, Y., Polyak, A., Wolf, L.: Unsupervised cross-domain image generation. In: International Conference on Learning Representations (ICLR) (2017)
Tang, H., Xu, D., Sebe, N., Yan, Y.: Attention-guided generative adversarial networks for unsupervised image-to-image translation. In: International Joint Conference on Neural Networks (IJCNN) (2019)
Tian, Y., Krishnan, D., Isola, P.: Contrastive multiview coding. arXiv preprint arXiv:1906.05849 (2019)
Torralba, A., Efros, A.A.: Unbiased look at dataset bias. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2011)
Ulyanov, D., Vedaldi, A., Lempitsky, V.: Improved texture networks: Maximizing quality and diversity in feed-forward stylization and texture synthesis. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
Vincent, P., Larochelle, H., Bengio, Y., Manzagol, P.A.: Extracting and composing robust features with denoising autoencoders. In: International Conference on Machine Learning (ICML) (2008)
Wang, T.C., Liu, M.Y., Zhu, J.Y., Tao, A., Kautz, J., Catanzaro, B.: High-resolution image synthesis and semantic manipulation with conditional gans. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018)
Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process. 13(4), 600–612 (2004)
Wu, W., Cao, K., Li, C., Qian, C., Loy, C.C.: Transgaga: Geometry-aware unsupervised image-to-image translation. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2019)
Wu, Z., Xiong, Y., Yu, S.X., Lin, D.: Unsupervised feature learning via non-parametric instance discrimination. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018)
Yi, Z., Zhang, H., Tan, P., Gong, M.: Dualgan: Unsupervised dual learning for image-to-image translation. In: IEEE International Conference on Computer Vision (ICCV) (2017)
Yoo, J., Uh, Y., Chun, S., Kang, B., Ha, J.W.: Photorealistic style transfer via wavelet transforms. In: IEEE International Conference on Computer Vision (ICCV) (2019)
Yu, F., Koltun, V., Funkhouser, T.: Dilated residual networks. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
Zhang, L., Zhang, L., Mou, X., Zhang, D.: Fsim: A feature similarity index for image quality assessment. IEEE Trans. Image Process. 20(8), 2378–2386 (2011)
Zhang, R., Isola, P., Efros, A.A.: Colorful image colorization. In: European Conference on Computer Vision (ECCV) (2016)
Zhang, R., Isola, P., Efros, A.A.: Split-brain autoencoders: Unsupervised learning by cross-channel prediction. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 586–595 (2018)
Zhang, R., Pfister, T., Li, J.: Harmonic unpaired image-to-image translation. In: International Conference on Learning Representations (ICLR) (2019)
Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: IEEE International Conference on Computer Vision (ICCV) (2017)
Zhu, J.Y., et al.: Toward multimodal image-to-image translation. In: Advances in Neural Information Processing Systems (2017)
Zontak, M., Irani, M.: Internal statistics of a single natural image. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2011)
Acknowledgements
We thank Allan Jabri and Phillip Isola for helpful discussion and feedback. Taesung Park is supported by a Samsung Scholarship and an Adobe Research Fellowship, and some of this work was done as an Adobe Research intern. This work was partially supported by NSF grant IIS-1633310, grant from SAP, and gifts from Berkeley DeepDrive and Adobe.
Appendices
Additional Image-to-Image Results
We first show additional, randomly selected results on datasets used in our main paper. We then show results on additional datasets.
1.1 Additional Comparisons
In Fig. 10, we show additional, randomly selected results for Horse\(\rightarrow \)Zebra and Cat\(\rightarrow \)Dog. This is an extension of Fig. 3 in the main paper. We compare to baseline methods CycleGAN [89], MUNIT [30], DRIT [41], Self-Distance and DistanceGAN [4], and GcGAN [18].
1.2 Additional Datasets
In Figs. 11 and 12, we show results on additional datasets, compared against the baseline method CycleGAN [89]. Our method provides better or comparable results, demonstrating its flexibility across a variety of datasets.
- Apple\(\rightarrow \)Orange contains 996 apple and 1,020 orange images from ImageNet and was introduced in CycleGAN [89].
- Yosemite Summer\(\rightarrow \)Winter contains 1,273 summer and 854 winter images of Yosemite scraped using the Flickr API; it was introduced in CycleGAN [89].
- GTA\(\rightarrow \)Cityscapes: GTA [63] contains 24,966 images, and Cityscapes [13] contains 19,998 images of street scenes from German cities. The task was originally used in CyCADA [29].
Additional Single Image Translation Results
We show additional results in Fig. 13 and Fig. 14, and describe training details below.
Training details. At each iteration, the input image is randomly scaled to a width between 384 and 1024, and we randomly sample 16 crops of size \(128\,\times \,128\). To avoid overfitting, we divide the crops into \(64\,\times \,64\) tiles before passing them to the discriminator. At test time, since the generator network is fully convolutional, it takes the input image at full size.
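As an illustration, a minimal PyTorch sketch of this augmentation pipeline follows. It assumes the rescaling preserves aspect ratio; the function names (`sample_crops`, `to_discriminator_tiles`) are ours and not the released implementation.

```python
import random
import torch
import torch.nn.functional as F

def sample_crops(img, num_crops=16, crop_size=128, min_w=384, max_w=1024):
    """img: (1, 3, H, W) tensor. Returns (num_crops, 3, crop_size, crop_size)."""
    _, _, h, w = img.shape
    new_w = random.randint(min_w, max_w)               # random target width in [384, 1024]
    new_h = max(crop_size, int(round(h * new_w / w)))  # keep aspect ratio (our assumption)
    img = F.interpolate(img, size=(new_h, new_w), mode='bilinear', align_corners=False)
    crops = []
    for _ in range(num_crops):
        top = random.randint(0, new_h - crop_size)
        left = random.randint(0, new_w - crop_size)
        crops.append(img[:, :, top:top + crop_size, left:left + crop_size])
    return torch.cat(crops, dim=0)

def to_discriminator_tiles(crops, tile=64):
    """Split each 128x128 crop into four 64x64 tiles before the discriminator."""
    b, c, h, w = crops.shape
    tiles = crops.unfold(2, tile, tile).unfold(3, tile, tile)  # (b, c, h//tile, w//tile, tile, tile)
    return tiles.permute(0, 2, 3, 1, 4, 5).reshape(-1, c, tile, tile)
```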
We found that adopting the architecture of StyleGAN2 [36] instead of CycleGAN slightly improves the output quality, although the difference is marginal. Our StyleGAN2-based generator consists of one downsampling block of the StyleGAN2 discriminator, 6 StyleGAN2 residual blocks, and one StyleGAN2 upsampling block. Our discriminator has the same architecture as StyleGAN2. Following StyleGAN2, we use the non-saturating GAN loss [61] with the R1 gradient penalty [53]. Since we do not use a style code, the style modulation layers of StyleGAN2 are removed.
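For concreteness, a hedged sketch of the non-saturating GAN loss with R1 penalty in its standard form follows; the penalty weight `r1_gamma` and the per-step (rather than lazy) regularization are placeholder choices, not necessarily the exact training settings.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(D, real, fake, r1_gamma=10.0):
    """Non-saturating discriminator loss with R1 gradient penalty on real images."""
    real = real.detach().requires_grad_(True)
    d_real, d_fake = D(real), D(fake.detach())
    loss = F.softplus(-d_real).mean() + F.softplus(d_fake).mean()
    grad, = torch.autograd.grad(outputs=d_real.sum(), inputs=real, create_graph=True)
    r1 = grad.pow(2).flatten(1).sum(1).mean()          # squared gradient norm per sample
    return loss + 0.5 * r1_gamma * r1

def generator_loss(D, fake):
    """Non-saturating generator loss."""
    return F.softplus(-D(fake)).mean()
```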
Single image results. In Figs. 13 and 14, we show additional comparison results for our method, Gatys et al. [19], STROTSS [39], WCT\(^2\) [82], and the CycleGAN baseline [89]. Note that the CycleGAN baseline adopts the same augmentation techniques and the same generator/discriminator architectures as our method. The image resolution is 1–2 megapixels. Please zoom in to see more visual details.
Both figures demonstrate that our results look more photorealistic than the CycleGAN baseline, Gatys et al. [19], and WCT\(^2\). The quality of our results is on par with that of STROTSS [39]. Note that STROTSS [39] compares to and outperforms recent style transfer methods (e.g., [22, 52]).
Figs. 13 and 14 (high-res painting to photo translation, I and II). We transfer Monet's paintings to reference natural photos, shown as insets at the top-left corners. Training requires only a single image from each domain. We compare our results to recent style and photo transfer methods, including Gatys et al. [19], WCT\(^2\) [82], STROTSS [39], and our modified patch-based CycleGAN [89]. Our method reproduces the texture of the reference photos while retaining the structure of the input paintings. Our results are at 1k–1.5k resolution.
Unpaired Translation Details and Analysis
1.1 Training Details
To show the effect of the proposed patch-based contrastive loss, we intentionally match the architecture and hyperparameter settings of CycleGAN, except for the loss function. This includes the ResNet-based generator [34] with 9 residual blocks, the PatchGAN discriminator [31], the Least Squares GAN loss [50], a batch size of 1, and the Adam optimizer [38] with learning rate 0.002.
Our full model CUT is trained for up to 400 epochs, while the fast variant FastCUT is trained for up to 200 epochs, following CycleGAN. Moreover, inspired by GcGAN [18], FastCUT is trained with flip-equivariance augmentation, where the input image to the generator is horizontally flipped and the output features are flipped back before computing the PatchNCE loss. Our encoder \(G_{\text {enc}}\) is the first half of the CycleGAN generator [89]. To calculate our multi-layer, patch-based contrastive loss, we extract features from 5 layers: the RGB pixels, the first and second downsampling convolutions, and the first and fifth residual blocks. These layers correspond to receptive fields of size 1 \(\times \) 1, 9 \(\times \) 9, 15 \(\times \) 15, 35 \(\times \) 35, and 99 \(\times \) 99. For each layer's features, we sample 256 random locations and apply a 2-layer MLP to obtain 256-dimensional final features. For our baseline model that uses a MoCo-style memory bank [24], we follow the settings of MoCo and use a momentum value of 0.999 with temperature 0.07. The size of the memory bank is 16,384 per layer, and we enqueue 256 patches per image per iteration.
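The per-layer sampling and projection can be summarized by a minimal sketch like the one below: it samples 256 spatial locations from each feature map, applies a 2-layer MLP, and L2-normalizes the result. The class name and the way channel counts are supplied are our own illustrative choices, not the released code.

```python
import torch
import torch.nn as nn

class PatchSampler(nn.Module):
    """Sample patch features from several encoder layers and project each
    sample with a 2-layer MLP to a 256-dim embedding."""
    def __init__(self, channels_per_layer, out_dim=256):
        super().__init__()
        self.mlps = nn.ModuleList(
            nn.Sequential(nn.Linear(c, out_dim), nn.ReLU(), nn.Linear(out_dim, out_dim))
            for c in channels_per_layer
        )

    def forward(self, feats, num_patches=256, patch_ids=None):
        out_feats, out_ids = [], []
        for i, feat in enumerate(feats):                   # feat: (B, C, H, W)
            b, c, h, w = feat.shape
            flat = feat.flatten(2).permute(0, 2, 1)        # (B, H*W, C)
            ids = patch_ids[i] if patch_ids is not None else \
                torch.randperm(h * w, device=feat.device)[:num_patches]
            x = flat[:, ids, :].reshape(-1, c)             # reuse the same locations for input/output
            x = self.mlps[i](x)
            x = x / (x.norm(dim=1, keepdim=True) + 1e-7)   # L2-normalize the embeddings
            out_feats.append(x)
            out_ids.append(ids)
        return out_feats, out_ids
```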
1.2 Evaluation Details
We list the details of our evaluation protocol.
Fréchet Inception Distance (FID [26]) throughout this paper is computed by resizing the images to 299-by-299 using the bilinear sampling of the PyTorch framework, and then taking the activations of the last average pooling layer of a pretrained Inception V3 [70], using the weights provided by the TensorFlow framework. We use the default setting of https://github.com/mseitzer/pytorch-fid. All test set images are used for evaluation, unless noted otherwise.
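As a sketch of the resizing step (the FID itself comes from the referenced pytorch-fid package, whose command-line entry point is shown in the comment), one could write:

```python
import torch.nn.functional as F

def resize_for_inception(images):
    """images: (N, 3, H, W) float tensor; bilinear resize to the 299x299 Inception input.
    The FID is then computed on the saved images with the pytorch-fid package,
    e.g. `python -m pytorch_fid path/to/real path/to/fake`."""
    return F.interpolate(images, size=(299, 299), mode='bilinear', align_corners=False)
```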
Semantic segmentation metrics on the Cityscapes dataset are computed as follows. First, we trained a semantic segmentation network using the DRN-D-22 architecture [83], with the recommended settings from https://github.com/fyu/drn (batch size 32, learning rate 0.01) for 250 epochs at 256 \(\times \) 128 resolution. The output images for the 500 validation labels are resized to 256 \(\times \) 128 using bicubic downsampling, passed to the trained DRN network, and compared against the ground-truth labels downsampled to the same size using nearest-neighbor sampling.
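The comparison against ground-truth labels typically reduces to confusion-matrix statistics (pixel accuracy, mean class accuracy, mean IoU). The sketch below shows one standard way to compute them; the exact metric definitions used in the paper may differ in minor details such as the handling of void labels.

```python
import numpy as np

def segmentation_scores(preds, labels, num_classes):
    """preds, labels: integer class maps of identical shape.
    Returns (pixel accuracy, mean class accuracy, mean IoU)."""
    valid = (labels >= 0) & (labels < num_classes)            # drop void/ignore labels
    idx = num_classes * labels[valid].astype(np.int64) + preds[valid].astype(np.int64)
    hist = np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)
    pixel_acc = np.diag(hist).sum() / hist.sum()
    class_acc = np.nanmean(np.diag(hist) / np.maximum(hist.sum(1), 1))
    iou = np.diag(hist) / np.maximum(hist.sum(1) + hist.sum(0) - np.diag(hist), 1)
    return pixel_acc, class_acc, np.nanmean(iou)
```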
1.3 Pseudocode
Here we provide pseudo-code for the PatchNCE loss in the PyTorch style. Our code and models are available at our GitHub repo.
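Below is a minimal PyTorch-style sketch of a PatchNCE-type loss consistent with the description above: for each sampled location, the corresponding input/output patch features form the positive pair, and the other locations in the same image serve as negatives. The temperature value mirrors the 0.07 used for the memory-bank baseline and is otherwise an assumption; this is an illustration, not the authors' exact listing.

```python
import torch
import torch.nn.functional as F

def patch_nce_loss(feat_q, feat_k, tau=0.07):
    """feat_q: output-image patch features; feat_k: input-image patch features
    at the same sampled locations, both (num_patches, dim) and L2-normalized.
    Diagonal pairs are positives; all other patches in the image are negatives."""
    feat_k = feat_k.detach()                           # no gradient through the key branch
    logits = feat_q @ feat_k.t() / tau                 # (num_patches, num_patches) similarities
    targets = torch.arange(feat_q.size(0), device=feat_q.device)
    return F.cross_entropy(logits, targets)
```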
Fig. 15 (distribution matching). We measure the percentage of pixels belonging to the horse/zebra bodies using a pre-trained semantic segmentation model. We find a distribution mismatch between the sizes of horses and zebras in the two image sets – zebras usually appear larger (36.8% vs. 17.9%). Our full method CUT has the flexibility to enlarge the horses, matching the training statistics better than CycleGAN [89]. Our faster variant FastCUT, trained with a larger PatchNCE weight (\(\lambda _{X}=10\)) and flip-equivariance augmentation, behaves more conservatively, like CycleGAN.
1.4 Distribution Matching
In Fig. 15, we show an interesting phenomenon of our method, caused by the training set imbalance of the horse\(\rightarrow \)zebra set. We use an off-the-shelf DeepLab model [7] trained on COCO-Stuff [6] to measure the percentage of pixels that belong to horses and zebras (see Footnote 1). The training set exhibits dataset bias [74]. On average, zebras appear in more close-up pictures than horses and take up about twice the number of pixels (\(37\%\) vs. \(18\%\)). To perfectly satisfy the discriminator, a translation model should attempt to match the statistics of the training set. Our method allows the flexibility for the horses to change size, and the percentage of output zebra pixels (\(31\%\)) better matches the training distribution (\(37\%\)) than the CycleGAN baseline (\(19\%\)). On the other hand, our fast variant FastCUT uses a larger weight (\(\lambda _{X} = 10\)) on the PatchNCE loss and flip-equivariance augmentation, and hence behaves more conservatively, more like CycleGAN. The strong distribution matching capacity has pros and cons. For certain applications, it can introduce undesired changes (e.g., zebra patterns on the background for horse\(\rightarrow \)zebra). On the other hand, it can enable dramatic geometric changes for applications such as Cat\(\rightarrow \)Dog.
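The pixel-fraction measurement amounts to counting segmentation-map pixels assigned to the relevant classes. The sketch below illustrates this; `segment` and the class indices are placeholders standing in for the COCO-Stuff DeepLab model referenced in Footnote 1.

```python
import torch

def class_pixel_fraction(images, segment, class_ids):
    """images: (N, 3, H, W); segment(images) -> (N, H, W) map of class indices.
    Returns the fraction of pixels assigned to any class in `class_ids`."""
    with torch.no_grad():
        pred = segment(images)
    mask = torch.zeros_like(pred, dtype=torch.bool)
    for c in class_ids:
        mask |= (pred == c)
    return mask.float().mean().item()
```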
1.5 Additional Ablation Studies
In the paper, we mainly discussed the impact of the loss functions and the number of patches on the final performance. Here we present additional ablation studies on more subtle design choices. We run all variants on the Horse\(\rightarrow \)Zebra dataset [89]. The FID of our original model is 46.6. We compare it to the following two variants of our model:
- Ours without weight sharing for the encoder \(G_{\text {enc}}\) and MLP projection network \(H\): for this variant, when computing features \(\{\textit{\textbf{z}}_l\}_L=\{H_l(G_{\text {enc}}^l(\textit{\textbf{x}}))\}_L\), we use two separate encoders and MLP networks for embedding the input images (e.g., horses) and the generated images (e.g., zebras) into the feature space. They do not share any weights. The FID of this variant is 50.5, worse than our method. This shows that weight sharing helps stabilize training while reducing the number of parameters in our model.
- Ours without updating the decoder \(G_{\text {dec}}\) using the PatchNCE loss: in this variant, we stop the gradient of the PatchNCE loss \(\mathcal {L}_\text {PatchNCE}\) from propagating to the decoder \(G_{\text {dec}}\). In other words, the decoder \(G_{\text {dec}}\) is updated only through the adversarial loss \(\mathcal {L}_\text {GAN}\). The FID of this variant is 444.2, and the results contain severe artifacts. This shows that our \(\mathcal {L}_\text {PatchNCE}\) not only helps learn the encoder \(G_{\text {enc}}\), as done in previous unsupervised feature learning methods [24], but also learns a better decoder \(G_{\text {dec}}\) together with the GAN loss. Intuitively, if the generated result has many artifacts and is far from realistic, it is difficult for the encoder to find correspondences between the input and output, producing a large PatchNCE loss.
Copyright information
© 2020 Springer Nature Switzerland AG
Cite this paper
Park, T., Efros, A.A., Zhang, R., Zhu, J.Y.: Contrastive Learning for Unpaired Image-to-Image Translation. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M. (eds.) Computer Vision – ECCV 2020. Lecture Notes in Computer Science, vol. 12354. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58545-7_19