ScoreCL: augmentation-adaptive contrastive learning via score-matching function

Published in: Machine Learning

Abstract

Self-supervised contrastive learning (CL) has achieved state-of-the-art performance in representation learning by minimizing the distance between positive pairs while maximizing the distance between negative ones. It has recently been verified that models learn better representations from diversely augmented positive pairs, because such pairs make the model more view-invariant. However, only a few studies on CL have considered the difference between augmented views, and they have not gone beyond hand-crafted heuristics. In this paper, we first observe that the score-matching function can measure how much data has been changed from the original through augmentation. With this property, every pair in CL can be weighted adaptively by the difference of its score values, which boosts performance. We demonstrate the generality of our method, referred to as ScoreCL, by consistently improving various CL methods (SimCLR, SimSiam, W-MSE, and VICReg) by up to 3%p in image classification on the CIFAR and ImageNet datasets. Moreover, we conduct exhaustive experiments and ablations, including results on diverse downstream tasks, comparisons with possible baselines, and further applications when combined with other augmentation methods. We hope our exploration will inspire more research into exploiting score matching for CL.
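
The weighting idea can be illustrated with a short sketch. The snippet below is a hypothetical, minimal PyTorch illustration rather than the authors' released implementation: it assumes an arbitrary encoder, a score network score_net(x) that approximates the score (the gradient of the log-density), and a simple rule that weights each positive pair by the norm of the difference between the score estimates of its two views before averaging a SimCLR-style (NT-Xent) loss. All names are placeholders.

```python
# Illustrative sketch only: score-difference weighting plugged into a SimCLR-style
# (NT-Xent) objective. `encoder`, `score_net`, and the weighting rule are hypothetical
# stand-ins, not the paper's exact ScoreCL formulation.
import torch
import torch.nn as nn
import torch.nn.functional as F


def nt_xent_per_pair(z1, z2, temperature=0.5):
    """Return the InfoNCE loss of each positive pair (z1[i], z2[i]) separately."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    z = torch.cat([z1, z2], dim=0)                     # (2N, d)
    sim = z @ z.t() / temperature                      # pairwise cosine similarities
    n = z1.size(0)
    self_mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))    # exclude self-similarity
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    per_view = F.cross_entropy(sim, targets, reduction="none")  # (2N,)
    return 0.5 * (per_view[:n] + per_view[n:])         # one loss value per pair


def score_difference_weights(score_net, x1, x2, eps=1e-8):
    """Weight each pair by how differently its two views are scored."""
    with torch.no_grad():
        s1 = score_net(x1).flatten(1)                  # score estimate of view 1
        s2 = score_net(x2).flatten(1)                  # score estimate of view 2
        diff = (s1 - s2).norm(dim=1)                   # per-pair score discrepancy
        return diff / (diff.mean() + eps)              # normalise to mean 1


def weighted_contrastive_loss(encoder, score_net, x1, x2, temperature=0.5):
    z1, z2 = encoder(x1), encoder(x2)
    per_pair = nt_xent_per_pair(z1, z2, temperature)
    w = score_difference_weights(score_net, x1, x2)
    return (w * per_pair).mean()


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end on random 32x32 images.
    encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128))
    score_net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 3 * 32 * 32))
    x1, x2 = torch.randn(8, 3, 32, 32), torch.randn(8, 3, 32, 32)
    print(weighted_contrastive_loss(encoder, score_net, x1, x2).item())
```

The rule above only shows where a per-pair weight would enter a contrastive objective; the actual ScoreCL weighting is derived from score-matching estimates as described in the paper.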


Data availability

No datasets were generated or analysed during the current study.

Notes

  1. In this paper, we use CL to refer to contrastive learning and related methods that model image similarity and dissimilarity (or similarity only) between two or more augmented image views, encompassing Siamese networks and joint-embedding methods.

  2. https://www.cs.toronto.edu/~kriz/cifar.html.

  3. https://www.image-net.org/.

  4. https://cs.stanford.edu/~acoates/stl10/.

  5. https://data.vision.ee.ethz.ch/cvl/datasets_extra/food-101/.

  6. https://www.robots.ox.ac.uk/~vgg/data/flowers/102/.

  7. https://ai.stanford.edu/~jkrause/cars/car_dataset.html.

  8. https://www.robots.ox.ac.uk/~vgg/data/fgvc-aircraft/.

  9. https://www.robots.ox.ac.uk/~vgg/data/dtd/.

References

  • Bai, Y., Yang, E., Wang, Z., Du, Y., Han, B., Deng, C., Wang, D., & Liu, T. (2022). RSA: reducing semantic shift from aggressive augmentations for self-supervised learning. Advances in Neural Information Processing Systems, 35, 21128–21141.

  • Bansal, A., Borgnia, E., Chu, H.-M., Li, J. S., Kazemi, H., Huang, F., Goldblum, M., Geiping, J., & Goldstein, T. (2022). Cold diffusion: Inverting arbitrary image transforms without noise. arXiv preprint arXiv:2208.09392

  • Bardes, A., Ponce, J., & LeCun, Y. (2021). Vicreg: Variance-invariance-covariance regularization for self-supervised learning. arXiv preprint arXiv:2105.04906

  • Bossard, L., Guillaumin, M., & Gool, L. V. (2014). Food-101—Mining discriminative components with random forests. In European conference on computer vision (pp. 446–461). Springer.

  • Caron, M., Misra, I., Mairal, J., Goyal, P., Bojanowski, P., & Joulin, A. (2020). Unsupervised learning of visual features by contrasting cluster assignments. Advances in Neural Information Processing Systems, 33, 9912–9924.

  • Chen, X., & He, K. (2021). Exploring simple Siamese representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 15750–15758).

  • Chen, X., Fan, H., Girshick, R., & He, K. (2020b) Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297

  • Chen, T., Kornblith, S., Norouzi, M., & Hinton, G. (2020a). A simple framework for contrastive learning of visual representations. In International conference on machine learning (pp. 1597–1607). PMLR.

  • Cimpoi, M., Maji, S., Kokkinos, I., Mohamed, S., & Vedaldi, A. (2014). Describing textures in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR).

  • Coates, A., Ng, A., & Lee, H. (2011). An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics (pp. 215–223). JMLR Workshop and Conference Proceedings.

  • Cubuk, E. D., Zoph, B., Shlens, J., & Le, Q. V. (2020). Randaugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops (pp. 702–703).

  • Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition (pp. 248–255). IEEE.

  • Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N. (2021). An image is worth 16x16 words: Transformers for image recognition at scale. In International conference on learning representations. https://openreview.net/forum?id=YicbFdNTTy

  • Ermolov, A., Siarohin, A., Sangineto, E., & Sebe, N. (2021). Whitening for self-supervised representation learning. In International conference on machine learning (pp. 3015–3024). PMLR.

  • Everingham, M., Van Gool, L., Williams, C. K., Winn, J., & Zisserman, A. (2010). The pascal visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2), 303–338.

  • Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 580–587).

  • Gong, W., & Li, Y. (2021). Interpreting diffusion score matching using normalizing flow. arXiv preprint arXiv:2107.10072

  • Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al. (2020). Bootstrap your own latent-a new approach to self-supervised learning. Advances in Neural Information Processing Systems, 33, 21271–21284.

  • He, K., Fan, H., Wu, Y., Xie, S., & Girshick, R. (2020) Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 9729–9738).

  • He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2017). Mask R-CNN. In Proceedings of the IEEE international conference on computer vision (pp. 2961–2969).

  • Henaff, O. (2020). Data-efficient image recognition with contrastive predictive coding. In International conference on machine learning (pp. 4182–4192). PMLR.

  • Huang, W., Yi, M., & Zhao, X. (2021). Towards the generalization of contrastive self-supervised learning. arXiv preprint arXiv:2111.00743

  • Hyvärinen, A. (2008). Optimal approximation of signal priors. Neural Computation, 20(12), 3087–3110.

  • Hyvärinen, A., Hurri, J., & Hoyer, P. O. (2009). Estimation of non-normalized statistical models. Natural Image Statistics, 39, 419–426.

  • Kadkhodaie, Z., & Simoncelli, E. (2021). Stochastic solutions for linear inverse problems using the prior implicit in a denoiser. Advances in Neural Information Processing Systems, 34, 13242–13254.

  • Khosla, P., Teterwak, P., Wang, C., Sarna, A., Tian, Y., Isola, P., Maschinot, A., Liu, C., & Krishnan, D. (2020). Supervised contrastive learning. Advances in Neural Information Processing Systems, 33, 18661–18673.

  • Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980

  • Krause, J., Stark, M., Deng, J., & Fei-Fei, L. (2013). 3D object representations for fine-grained categorization. In Proceedings of the IEEE international conference on computer vision workshops (pp. 554–561).

  • Krizhevsky, A., & Hinton, G. (2009). Learning multiple layers of features from tiny images.

  • Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2017). Imagenet classification with deep convolutional neural networks. Communications of the ACM, 60(6), 84–90.

  • Lee, K., & Shin, J. (2022). RényiCL: Contrastive representation learning with skew Rényi divergence. arXiv preprint arXiv:2208.06270

  • Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C. L. (2014). Microsoft coco: Common objects in context. In European conference on computer vision (pp. 740–755). Springer.

  • Li, Y., Yang, M., Peng, D., Li, T., Huang, J., & Peng, X. (2022). Twin contrastive learning for online clustering. International Journal of Computer Vision, 130(9), 2205–2221.

  • Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3431–3440).

  • Mahmood, A., Oliva, J., & Styner, M. (2020). Multiscale score matching for out-of-distribution detection. arXiv preprint arXiv:2010.13132

  • Maji, S., Kannala, J., Rahtu, E., Blaschko, M., & Vedaldi, A. (2013). Fine-grained visual classification of aircraft. Technical report

  • Mo, S., Kang, H., Sohn, K., Li, C.-L., & Shin, J. (2021). Object-aware contrastive learning for debiased scene representation. Advances in Neural Information Processing Systems, 34, 12251–12264.

  • Nilsback, M.-E., & Zisserman, A. (2008). Automated flower classification over a large number of classes. In 2008 Sixth Indian conference on computer vision, graphics & image processing (pp. 722–729). IEEE.

  • Oord, A.V.d., Li, Y., & Vinyals, O. (2018). Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748

  • Peng, X., Wang, K., Zhu, Z., Wang, M., & You, Y. (2022). Crafting better contrastive views for Siamese representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 16031–16040).

  • Purushwalkam, S., & Gupta, A. (2020). Demystifying contrastive self-supervised learning: Invariances, augmentations and dataset biases. Advances in Neural Information Processing Systems, 33, 3407–3418.

  • Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems, 28, 1137–1149.

  • Robinson, J., Chuang, C.-Y., Sra, S., & Jegelka, S. (2020). Contrastive learning with hard negative samples. arXiv preprint arXiv:2010.04592

  • Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al. (2015). Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3), 211–252.

  • Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., & Ganguli, S. (2015). Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning (pp. 2256–2265). PMLR.

  • Song, J., & Ermon, S. (2020). Multi-label contrastive predictive coding. Advances in Neural Information Processing Systems, 33, 8161–8173.

  • Song, Y., & Ermon, S. (2019). Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems, 32.

  • Song, Y., Garg, S., Shi, J., & Ermon, S. (2020). Sliced score matching: A scalable approach to density and score estimation. In Uncertainty in artificial intelligence (pp. 574–584). PMLR.

  • Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., & Poole, B. (2020). Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456

  • Sutskever, I., Martens, J., Dahl, G., & Hinton, G. (2013). On the importance of initialization and momentum in deep learning. In International conference on machine learning (pp. 1139–1147). PMLR.

  • Tian, Y., Krishnan, D., & Isola, P. (2020b). Contrastive multiview coding. In European conference on computer vision (pp. 776–794). Springer.

  • Tian, Y., Sun, C., Poole, B., Krishnan, D., Schmid, C., & Isola, P. (2020a). What makes for good views for contrastive learning? Advances in Neural Information Processing Systems, 33, 6827–6839.

  • Vincent, P. (2011). A connection between score matching and denoising autoencoders. Neural Computation, 23(7), 1661–1674.

  • Wang, X., Fan, H., Tian, Y., Kihara, D., & Chen, X. (2022). On the importance of asymmetry for Siamese representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 16570–16579).

  • Wang, X., & Qi, G.-J. (2022). Contrastive learning with stronger augmentations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45, 5549–5560.

  • Wang, Z., Chen, Z., Li, Y., Guo, Y., Yu, J., Gong, M., & Liu, T. (2023). Mosaic representation learning for self-supervised visual pre-training. In The eleventh international conference on learning representations.

  • Wu, Y., Kirillov, A., Massa, F., Lo, W.-Y., & Girshick, R. (2019). Detectron2. https://github.com/facebookresearch/detectron2

  • Xie, J., Zhan, X., Liu, Z., Ong, Y.-S., & Loy, C. C. (2022). Delving into inter-image invariance for unsupervised visual representations. International Journal of Computer Vision, 130(12), 2994–3013.

  • You, Y., Li, J., Reddi, S., Hseu, J., Kumar, S., Bhojanapalli, S., Song, X., Demmel, J., Keutzer, K., & Hsieh, C.-J. (2019). Large batch optimization for deep learning: Training bert in 76 minutes. arXiv preprint arXiv:1904.00962

  • Zbontar, J., Jing, L., Misra, I., LeCun, Y., & Deny, S. (2021). Barlow twins: Self-supervised learning via redundancy reduction. In International conference on machine learning (pp. 12310–12320). PMLR.

  • Zhang, Q., & Chen, Y. (2021). Diffusion normalizing flow. Advances in Neural Information Processing Systems, 34, 16280–16291.

  • Zhang, R., Isola, P., Efros, A. A., Shechtman, E., & Wang, O. (2018). The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 586–595).

Author information

Contributions

JY Kim, S Kwon, and H Go contributed equally to this work. All authors reviewed the manuscript.

Corresponding authors

Correspondence to Jin-Young Kim or Hyun-Gyoon Kim.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Editor: Mingming Gong.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Kim, JY., Kwon, S., Go, H. et al. ScoreCL: augmentation-adaptive contrastive learning via score-matching function. Mach Learn 114, 12 (2025). https://doi.org/10.1007/s10994-024-06707-8

  • DOI: https://doi.org/10.1007/s10994-024-06707-8
