Abstract
The proliferation of open Pre-trained Language Models (PTLMs) on model registry platforms like Hugging Face (HF) presents both opportunities and challenges for companies building products around them. Similar to traditional software dependencies, PTLMs continue to evolve after a release. However, the current state of PTLM release practices on model registry platforms is plagued by a variety of inconsistencies, such as ambiguous naming conventions and inaccessible model training documentation. Given the knowledge gap on current PTLM release practices, our empirical study uses a mixed-methods approach to analyze the releases of 52,227 PTLMs on the most well-known model registry, HF. Our results reveal 148 different naming practices for PTLM releases, with 40.87% of changes to model weight files not reflected in the adopted name-based versioning practices or in the models' documentation. In addition, we identified that the 52,227 PTLMs are derived from only 299 different base models (the original models that were modified to create the 52,227 PTLMs), with Fine-tuning and Quantization being the most prevalent modification methods applied to these base models. Significant gaps in release transparency, in terms of training dataset specifications and model card availability, still exist, highlighting the need for standardized documentation. While we identified a model naming practice explicitly differentiating between major and minor PTLM releases, we did not find any significant difference in the types of changes that went into either type of release, suggesting that major/minor version numbers for PTLMs are often chosen arbitrarily. Our findings provide valuable insights to improve PTLM release practices, nudging the field towards more formal semantic versioning practices.
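To illustrate the kind of name-based versioning the abstract refers to, the following minimal sketch extracts trailing version-like tokens from HF repository names. The repository names and the regular expression are illustrative assumptions for this sketch, not the paper's actual extraction pipeline.

```python
import re

# Hypothetical repository names standing in for the 52,227 PTLMs
# analyzed in the study; none of these exact repos is claimed to exist.
repo_names = [
    "acme/bert-base-finetuned-v2",
    "acme/mistral-7b-instruct-v0.2",
    "acme/llama-2-7b-chat",   # "2" is part of the base-model name, not a release version
    "acme/gpt2-small",        # no explicit version token at all
]

# A deliberately naive pattern for trailing version-like tokens
# ("v2", "v0.2", ...); the 148 naming practices observed in the paper
# are far more varied than this single rule can capture.
VERSION_RE = re.compile(r"[-_.]v?(\d+(?:\.\d+)*)$", re.IGNORECASE)

for name in repo_names:
    match = VERSION_RE.search(name.split("/")[-1])
    print(f"{name:35s} -> version token: {match.group(1) if match else 'none'}")
```

On these samples, the rule flags only the explicit -v2/-v0.2 suffixes, mirroring the paper's finding that many weight changes leave no trace in the model name.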
Data Availability
The datasets generated and analyzed during this study are available in the replication package (Ajibode 2024).
Notes
By “model name,” we refer to the repository name, such as roneneldan/TinyStories-1M, which differs from the base model name, such as BERT.
Semantic Versioning 2.0.0: https://semver.org
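Semantic Versioning 2.0.0 assigns precedence over MAJOR.MINOR.PATCH components, which is the point of contrast with the name-based practices studied here. A minimal sketch of that precedence rule, assuming plain numeric versions and deliberately omitting the spec's pre-release and build-metadata rules:

```python
from typing import Tuple

def parse_semver(version: str) -> Tuple[int, int, int]:
    """Parse a plain MAJOR.MINOR.PATCH string; the full spec's
    pre-release and build-metadata rules are deliberately omitted."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch

# Tuple comparison is lexicographic, which matches semver precedence
# for plain numeric versions: bump MAJOR for breaking changes,
# MINOR for backwards-compatible additions, PATCH for fixes.
assert parse_semver("2.0.0") > parse_semver("1.9.3")
assert parse_semver("1.10.0") > parse_semver("1.9.9")
```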
References
Abebe SL, Ali N, Hassan AE (2016) An empirical study of software release notes. Empir Softw Eng 21:1107–1142
Ahn D, Almaatouq A, Gulabani M, Hosanagar K (2024) Impact of model interpretability and outcome feedback on trust in ai. In Proceedings of the CHI conference on human factors in computing systems, pp 1–25
Ajibode A (2024) Wip-24: Towards semantic versioning of pre-trained language models. https://github.com/SAILResearch/wip-24-adekunle-lm-release
Akoglu H (2018) User’s guide to correlation coefficients. Turk J Emerg Med 18(3):91–93
Alcobaça E, Siqueira F, Rivolli A, Garcia LPF, Oliva JT, De Carvalho ACPLF (2020) Mfe: towards reproducible meta-feature extraction. J Mach Learn Res 21(111):1–5
Ali S, Arcaini P, Pradhan D, Safdar SA, Yue T (2020) Quality indicators in search-based software engineering: an empirical evaluation. ACM Trans Softw Eng Methodol (TOSEM) 29(2):1–29
Bender EM, Gebru T, McMillan-Major A, Shmitchell S (2021) On the dangers of stochastic parrots: can language models be too big? In Proceedings of the 2021 ACM conference on fairness, accountability, and transparency, pp 610–623
Bhat A, Coursey A, Hu G, Li S, Nahar N, Zhou S, Kästner C, Guo JLC (2023) Aspirations and practice of ml model documentation: moving the needle with nudging and traceability. In Proceedings of the 2023 CHI conference on human factors in computing systems, pp 1–17
Bi T, Xia X, Lo D, Grundy J, Zimmermann T (2020) An empirical study of release note production and usage in practice. IEEE Trans Softw Eng 48(6):1834–1852
Bobrovskis S, Jurenoks A (2018) A survey of continuous integration, continuous delivery and continuous deployment. In BIR workshops, pp 314–322
Boslaugh S (2012) Statistics in a nutshell: A desktop quick reference. O'Reilly Media
Campbell JL, Quincy C, Osserman J, Pedersen OK (2013) Coding in-depth semistructured interviews: problems of unitization and intercoder reliability and agreement. Sociol Methods Res 42(3):294–320
Carvalho L, Seco JC (2021) Deep semantic versioning for evolution and variability. In Proceedings of the 23rd international symposium on principles and practice of declarative programming, pp 1–13
Castaño J, Martínez-Fernández S, Franch X, Bogner J (2023) Exploring the carbon footprint of hugging face’s ml models: a repository mining study. In 2023 ACM/IEEE international symposium on empirical software engineering and measurement (ESEM), IEEE, pp 1–12
Castaño J, Martínez-Fernández S, Franch X, Bogner J (2024) Analyzing the evolution and maintenance of ml models on hugging face. In 2024 IEEE/ACM 21st international conference on mining software repositories (MSR), IEEE, pp 607–618
Cocks K, Torgerson DJ (2013) Sample size calculations for pilot randomized trials: a confidence interval approach. J Clin Epidemiol 66(2):197–201
Conneau A, Khandelwal K, Goyal N, Chaudhary V, Wenzek G, Guzmán F, Grave E, Ott M, Zettlemoyer L, Stoyanov V (2019) Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116
Crisan A, Drouhard M, Vig J, Rajani N (2022) Interactive model cards: a human-centered approach to model documentation. In Proceedings of the 2022 ACM conference on fairness, accountability, and transparency, pp 427–439
Decan A, Mens T (2019) What do package dependencies tell us about semantic versioning? IEEE Trans Softw Eng 47(6):1226–1240
Decan A, Mens T, Claes M, Grosjean P (2016) When github meets cran: an analysis of inter-repository package dependency problems. In 2016 IEEE 23rd international conference on software analysis, evolution, and reengineering (SANER), vol 1. IEEE, pp 493–504
Dettmers T, Pagnoni A, Holtzman A, Zettlemoyer L (2024) Qlora: efficient finetuning of quantized llms. Adv Neural Inf Process Syst 36
Devlin J, Chang MW, Lee K, Toutanova K (2018) Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805
Ding N, Qin Y, Yang G, Wei F, Yang Z, Su Y, Hu S, Chen Y, Chan CM, Chen W et al (2023) Parameter-efficient fine-tuning of large-scale pre-trained language models. Nat Mach Intell 5(3):220–235
Domínguez-Álvarez D, Gorla A (2019) Release practices for ios and Android apps. In Proceedings of the 3rd ACM SIGSOFT international workshop on app market analytics, pp 15–18
Eldan R, Li Y (2023) Tinystories: How small can language models be and still speak coherent english? arXiv preprint arXiv:2305.07759
Gong Y, Liu G, Xue Y, Li R, Meng L (2023) A survey on dataset quality in machine learning. Inf Softw Technol, pp 107268
Gresta R, Durelli V, Cirilo E (2021) Naming practices in java projects: an empirical study. In Proceedings of the XX Brazilian symposium on software quality, pp 1–10
Houlsby N, Giurgiu A, Jastrzebski S, Morrone B, De Laroussilhe Q, Gesmundo A, Attariyan M, Gelly S (2019) Parameter-efficient transfer learning for nlp. In International conference on machine learning, PMLR, pp 2790–2799
Howard J, Ruder S (2018) Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146
Jacob B, Kligys S, Chen B, Zhu M, Tang M, Howard A, Adam H, Kalenichenko D (2018) Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2704–2713
Jiang AQ, Sablayrolles A, Mensch A, Bamford C, Chaplot DS, de las Casas D, Bressand F, Lengyel G, Lample G, Saulnier L et al (2023a) Mistral 7b. arXiv preprint arXiv:2310.06825
Jiang W, Cheung C, Kim M, Kim H, Thiruvathukal GK, Davis JC (2024a) Naming practices of pre-trained models in hugging face
Jiang W, Cheung C, Thiruvathukal GK, Davis JC (2023b) Exploring naming conventions (and defects) of pre-trained deep learning models in hugging face and other model hubs. arXiv preprint arXiv:2310.01642
Jiang W, Synovic N, Hyatt M, Schorlemmer TR, Sethi R, Lu YH, Thiruvathukal GK, Davis JC (2023c) An empirical study of pre-trained model reuse in the hugging face deep learning model registry. arXiv preprint arXiv:2303.02552
Jiang W, Synovic N, Sethi R, Indarapu A, Hyatt M, Schorlemmer TR, Thiruvathukal GK, Davis JC (2022) An empirical study of artifacts and security risks in the pre-trained model supply chain. In Proceedings of the 2022 ACM workshop on software supply chain offensive research and ecosystem defenses, pp 105–114
Jiang W, Yasmin J, Jones J, Synovic N, Kuo J, Bielanski N, Tian Y, Thiruvathukal GK, Davis JC (2024b) Peatmoss: A dataset and initial analysis of pre-trained models in open-source software. arXiv preprint arXiv:2402.00699
Jones J, Jiang W, Synovic N, Thiruvathukal G, Davis J (2024) What do we know about hugging face? a systematic literature review and quantitative validation of qualitative claims. In Proceedings of the 18th ACM/IEEE international symposium on empirical software engineering and measurement, pp 13–24
Kandpal N, Wallace E, Raffel C (2022) Deduplicating training data mitigates privacy risks in language models. In international conference on machine learning, PMLR, pp 10697–10707
Kathikar A, Nair A, Lazarine B, Sachdeva A, Samtani S (2023) Assessing the vulnerabilities of the open-source artificial intelligence (ai) landscape: a large-scale analysis of the hugging face platform. In 2023 IEEE international conference on intelligence and security informatics (ISI), IEEE, pp 1–6
Kerzazi N, Adams B (2016) Who needs release and devops engineers, and why? In Proceedings of the international workshop on continuous software evolution and delivery, pp 77–83
Khomh F, Dhaliwal T, Zou Y, Adams B (2012) Do faster releases improve software quality? an empirical case study of Mozilla Firefox. In 2012 9th IEEE working conference on mining software repositories (MSR), IEEE, pp 179–188
Kinahan S, Saidi P, Daliri A, Liss J, Berisha V (2024) Achieving reproducibility in eeg-based machine learning. In The 2024 ACM conference on fairness, accountability, and transparency, pp 1464–1474
Kirk HR, Jun Y, Volpin F, Iqbal H, Benussi E, Dreyer F, Shtedritski A, Asano Y (2021) Bias out-of-the-box: An empirical analysis of intersectional occupational biases in popular generative language models. Adv Neural Inf Process Syst 34:2611–2624
Lam P, Dietrich J, Pearce DJ (2020) Putting the semantics into semantic versioning. In Proceedings of the 2020 ACM SIGPLAN international symposium on new ideas, new paradigms, and reflections on programming and software, pp 157–179
Laukkanen E, Itkonen J, Lassenius C (2017) Problems, causes and solutions when adopting continuous delivery—a systematic literature review. Inf Softw Technol 82:55–79
Lawrie D, Morrell C, Feild H, Binkley D (2007) Effective identifier names for comprehension and memory. Innov Syst Softw Eng 3:303–318
Liu H, Tam D, Muqeeth M, Mohta J, Huang T, Bansal M, Raffel CA (2022) Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. Adv Neural Inf Process Syst 35:1950–1965
Liu Y, Chen C, Zhang R, Qin T, Ji X, Lin H, Yang M (2020) Enhancing the interoperability between deep learning frameworks by model conversion. In Proceedings of the 28th ACM joint meeting on European software engineering conference and symposium on the foundations of software engineering, pp 1320–1330
Liu Y, Ott M, Goyal N, Du J, Joshi M, Chen D, Levy O, Lewis M, Zettlemoyer L, Stoyanov V (2019) Roberta: a robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692
Loomes MJ, Nehaniv CL, Wernick P (2005) The naming of systems and software evolvability. In IEEE international workshop on software evolvability (Software-Evolvability’05), IEEE, pp 23–28
Mao HH (2020) A survey on self-supervised pre-training for sequential transfer learning in neural networks. arXiv preprint arXiv:2007.00800
Martin J. Fine-tuning and deployment. LinkedIn. https://www.linkedin.com/pulse/fine-tuning-deployment-dr-john-martin-yvqyf
Michlmayr M, Hunt F, Probert D (2007) Release management in free software projects: practices and problems. In open source development, adoption and innovation: IFIP working group 2.13 on open source software, June 11–14, 2007, Limerick, Ireland 3, Springer, pp 295–300
Min S, Seo M, Hajishirzi H (2017) Question answering through transfer learning from large fine-grained supervision data. arXiv preprint arXiv:1702.02171
Mitchell M, Wu S, Zaldivar A, Barnes P, Vasserman L, Hutchinson B, Spitzer E, Raji ID, Gebru T (2019) Model cards for model reporting. In Proceedings of the conference on fairness, accountability, and transparency, pp 220–229
Nayebi M, Adams B, Ruhe G (2016) Release practices for mobile apps–what do users and developers think? In 2016 IEEE 23rd international conference on software analysis, evolution, and reengineering (SANER), vol 1. IEEE, pp 552–562
Novakouski M, Lewis G, Anderson W, Davenport J (2012) Best practices for artifact versioning in service-oriented systems. Software Engineering Institute, Carnegie Mellon University, Pittsburgh, Pennsylvania, Technical Note CMU/SEI-2011-TN-009
Osborne C, Ding J, Kirk HR (2024) The ai community building the future? a quantitative analysis of development activity on hugging face hub. J Comput Soc Sci 7(2):2067–2105
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V et al (2011) Scikit-learn: Machine learning in python. J Mach Learn Res 12:2825–2830
Pérez J, Díaz J, Garcia-Martin J, Tabuenca B (2020) Systematic literature reviews in software engineering—enhancement of the study selection process using cohen’s kappa statistic. J Syst Softw 168:110657
OpenAI (2023) Gpt-4 technical report. arXiv preprint arXiv:2303.08774
Raemaekers S, van Deursen A, Visser J (2017) Semantic versioning and impact of breaking changes in the maven repository. J Syst Softw 129:140–158
Saldana J (2015) The Coding Manual for Qualitative Researchers. Sage Publications
Sarzynska-Wawer J, Wawer A, Pawlak A, Szymanowska J, Stefaniak I, Jarkiewicz M, Okruszek L (2021) Detecting formal thought disorder by deep contextualized word representations. Psychiatry Res 304:114135
Seacord RC, Hissam SA, Wallnau KC (1998) Agora: a search engine for software components. IEEE Internet Comput 2(6):62
Shahin M, Babar MA, Zhu L (2017) Continuous integration, delivery and deployment: a systematic review on approaches, tools, challenges and practices. IEEE Access 5:3909–3943
Singh AS, Masuku MB (2014) Sampling techniques & determination of sample size in applied statistics research: An overview. Int J Econ Commer Manag 2(11):1–22
Stuckenholz A (2005) Component evolution and versioning state of the art. ACM SIGSOFT Softw Eng Notes 30(1):7
Sun S, Cheng Y, Gan Z, Liu J (2019) Patient knowledge distillation for bert model compression. arXiv preprint arXiv:1908.09355
Taraghi M, Dorcelus G, Foundjem A, Tambon F, Khomh F (2024) Deep learning model reuse in the huggingface community: challenges, benefit and trends. arXiv preprint arXiv:2401.13177
Team G, Mesnard T, Hardin C, Dadashi R, Bhupatiraju S, Pathak S, Sifre L, Rivière M, Kale MS, Love J et al (2024) Gemma: open models based on gemini research and technology. arXiv preprint arXiv:2403.08295
Toma TR, Bezemer CP (2024) An exploratory study of dataset and model management in open source machine learning applications
Touvron H, Lavril T, Izacard G, Martinet X, Lachaux MA, Lacroix T, Rozière B, Goyal N, Hambro E, Azhar F et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971
Turri V, Morrison K, Robinson KM, Abidi C, Perer A, Forlizzi J, Dzombak R (2024) Transparency in the wild: Navigating transparency in a deployed ai system to broaden need-finding approaches. In The 2024 ACM conference on fairness, accountability, and transparency, pp 1494–1514
Vieira SM, Kaymak U, Sousa JMC (2010) Cohen’s kappa coefficient as a performance measure for feature selection. In International conference on fuzzy systems, IEEE, pp 1–8
Wadhwani A, Jain P (2020) Machine learning model cards transparency review: using model card toolkit. In 2020 IEEE pune section international conference (PuneCon), IEEE, pp 133–137
Wang H, Li J, Wu H, Hovy E, Sun Y (2022) Pre-trained language models and their applications. Engineering
Williams LL, Quave K (2019) Chapter 10–tests of proportions: chi-square, likelihood ratio, Fisher's exact test. Quantitative anthropology, pp 123–141
Wood JR, Wood LE (2008) Card sorting: current practices and beyond. J Usability Stud 4(1):1–6
Wortsman M, Ilharco G, Kim JW, Li M, Kornblith S, Roelofs R, Lopes RG, Hajishirzi H, Farhadi A, Namkoong H et al (2022) Robust fine-tuning of zero-shot models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 7959–7971
Xia B, Bi T, Xing Z, Lu Q, Zhu L (2023) An empirical study on software bill of materials: Where we stand and the road ahead. In 2023 IEEE/ACM 45th international conference on software engineering (ICSE), IEEE, pp 2630–2642
Xiu M, Jiang ZMJ, Adams B (2023) An exploratory study of machine learning model stores. IEEE Softw 38(1):114–122
Xu T, Zhou Y (2015) Systems approaches to tackling configuration errors: a survey. ACM Comput Surv (CSUR) 47(4):1–41
Yang L, Zhang H, Shen H, Huang X, Zhou X, Rong G, Shao D (2021) Quality assessment in systematic literature reviews: a software engineering perspective. Inf Softw Technol 130:106397
Yang Z, Shi J, Lo D (2024) Ecosystem of large language models for code. arXiv preprint arXiv:2405.16746
Yin Z, Ma X, Zheng J, Zhou Y, Bairavasundaram LN, Pasupathy S (2011) An empirical study on configuration errors in commercial and open source systems. In Proceedings of the 23rd ACM symposium on operating systems principles, pp 159–172
Zhao WX, Zhou K, Li J, Tang T, Wang X, Hou Y, Min Y, Zhang B, Zhang J, Dong Z et al (2023) A survey of large language models. arXiv preprint arXiv:2303.18223
Zhu M, Gupta S (2017) To prune, or not to prune: exploring the efficacy of pruning for model compression. arXiv preprint arXiv:1710.01878
Funding
This research was supported by the NSERC Discovery Grant.
Author information
Authors and Affiliations
Contributions
Adekunle Ajibode: Conceptualization, Data Collection, Methodology, Data Analysis, Writing - Original Draft.
Abdul Ali Bangash: Methodology, Data Validation, Writing - Review & Editing.
Filipe Roseiro Cogo: Methodology, Writing - Review & Editing.
Bram Adams: Supervision, Writing - Review & Editing, Conceptual Guidance, Research Direction.
Ahmed E. Hassan: Supervision, Research Direction.
Corresponding author
Ethics declarations
Conflicts of Interests/Competing Interests
The authors declare that they have no known competing interests or personal relationships that could have appeared to influence the work reported in this article.
Ethical Approval
This study does not involve human participants or animals.
Informed Consent
No human subjects were involved in this study.
Additional information
Communicated by: Markus Borg.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Ajibode, A., Bangash, A.A., Cogo, F.R. et al. Towards semantic versioning of open pre-trained language model releases on hugging face. Empir Software Eng 30, 78 (2025). https://doi.org/10.1007/s10664-025-10631-3