Abstract
Self-supervised contrastive learning (CL) has achieved state-of-the-art performance in representation learning by minimizing the distance between positive pairs while maximizing the distance between negative ones. Recent work has verified that the model learns better representations from diversely augmented positive pairs, because such pairs make the model more view-invariant. However, only a few studies on CL have considered the difference between augmented views, and they have not gone beyond hand-crafted heuristics. In this paper, we first observe that the score-matching function can measure how much data has been changed from the original by augmentation. Using this property, every pair in CL can be weighted adaptively by the difference of its score values, which boosts performance. We show the generality of our method, referred to as ScoreCL, by consistently improving various CL methods (SimCLR, SimSiam, W-MSE, and VICReg) by up to 3%p in image classification on the CIFAR and ImageNet datasets. Moreover, we conduct exhaustive experiments and ablations, including results on diverse downstream tasks, comparisons with possible baselines, and further applications in combination with other augmentation methods. We hope our exploration inspires more research on exploiting score matching for CL.
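The pair-weighting idea described above can be sketched in code. The snippet below is an illustrative assumption, not the paper's exact formulation: `score_gap_weight` stands in for the score-matching-based measure (here, the gap between the norms of two views' score estimates), and the weight is applied to the positive-pair term of a SimCLR-style NT-Xent loss. All function names and the specific weighting form are hypothetical.

```python
import numpy as np

def score_gap_weight(s1, s2, eps=1e-8):
    """Hypothetical per-pair weight: views whose score estimates differ more
    (i.e., were changed more unevenly by augmentation) get a larger weight."""
    gap = np.abs(np.linalg.norm(s1, axis=1) - np.linalg.norm(s2, axis=1))
    return 1.0 + gap / (gap.mean() + eps)  # normalized so weights stay O(1)

def ntxent_weighted(z1, z2, w, tau=0.5):
    """NT-Xent (SimCLR) loss where sample i's positive-pair term is scaled by w[i]."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    z = np.concatenate([z1, z2], axis=0)          # (2N, d) embeddings
    sim = z @ z.T / tau                           # pairwise cosine similarities
    np.fill_diagonal(sim, -np.inf)                # exclude self-similarity
    n = len(z1)
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])  # positive index
    # log-probability of picking the positive among all other samples
    logprob = sim[np.arange(2 * n), pos] - np.log(np.exp(sim).sum(axis=1))
    ww = np.concatenate([w, w])                   # same weight for both views
    return -(ww * logprob).mean()

# Usage with random stand-ins for embeddings and score estimates
rng = np.random.default_rng(0)
z1 = rng.normal(size=(4, 16))
z2 = z1 + 0.1 * rng.normal(size=(4, 16))          # mildly perturbed positives
s1, s2 = rng.normal(size=(4, 16)), rng.normal(size=(4, 16))
loss = ntxent_weighted(z1, z2, score_gap_weight(s1, s2))
```

In the unweighted case (`w = 1` everywhere) this reduces to the standard NT-Xent objective; the adaptive weights simply emphasize pairs whose views differ more under the score measure.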








Data availability
No datasets were generated or analysed during the current study.
Notes
In this paper, we use CL to refer to contrastive learning and related methods that model image similarity and dissimilarity (or similarity only) between two or more augmented image views, encompassing siamese networks and joint-embedding methods.
Author information
Authors and Affiliations
Contributions
JY Kim, S Kwon, and H Go contributed equally to this work. All authors reviewed the manuscript.
Corresponding authors
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Editor: Mingming Gong.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Kim, JY., Kwon, S., Go, H. et al. ScoreCL: augmentation-adaptive contrastive learning via score-matching function. Mach Learn 114, 12 (2025). https://doi.org/10.1007/s10994-024-06707-8