Abstract
Context:
When software is released publicly, it commonly ships with either the full text of the license or licenses under which it is published, or a detailed reference to them. Therefore, public licenses, including FOSS (free, open source software) licenses, are usually publicly available in source code repositories.
Objective:
To compile a dataset containing as many documents as possible that carry the text of software licenses or references to license terms, and then to characterize the dataset so that it can be used for further research or for practical purposes related to license analysis.
Method:
Retrieve from Software Heritage, the largest publicly available archive of FOSS source code, all versions of all files whose names are commonly used to convey licensing terms. All retrieved documents are then characterized in various ways, using automated and manual analyses.
Results:
The dataset consists of 6.9 million unique license files. Additional metadata about shipped license files is also provided, including file length measures, MIME type, SPDX license (detected using ScanCode), and oldest appearance, making the dataset ready to use in various contexts. The results of a manual analysis of 8102 documents are also included, providing a ground truth for further analysis. The dataset is released as open data, in the form of an archive file containing all deduplicated license files plus several portable CSV files with metadata that reference files via cryptographic checksums.
Conclusions:
Thanks to the extensive coverage of Software Heritage, the dataset presented in this paper covers a very large fraction of all software licenses for public code. We have assembled a large body of software licenses, characterized it quantitatively and qualitatively, and validated that it is mostly composed of licensing information and includes almost all known license texts. The dataset can be used to conduct empirical studies on open source licensing, to train automated license classifiers, to run natural language processing (NLP) analyses of legal texts, and to carry out historical and phylogenetic studies on FOSS licensing. It can also be used in practice to improve tools that detect licenses in source code.
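As an illustration of how metadata that references files via cryptographic checksums can be joined back to file contents, the following is a minimal sketch rather than the dataset's actual tooling. It assumes SHA1 as the checksum, and the column names (`sha1`, `mime_type`, `size`) and the sample row are hypothetical; the real CSV files may use different names and layouts.

```python
import csv
import hashlib
import io


def sha1_hex(data: bytes) -> str:
    """Return the hexadecimal SHA1 digest of a byte string."""
    return hashlib.sha1(data).hexdigest()


def index_by_checksum(csv_text: str, key: str = "sha1") -> dict:
    """Group CSV rows by checksum.

    The column name `key` is an assumption made for this sketch.
    A checksum may map to several rows, so values are lists.
    """
    index = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        index.setdefault(row[key], []).append(row)
    return index


# Hypothetical license blob and a one-row metadata table built from it.
sample_blob = b"Permission is hereby granted, free of charge...\n"
sample_csv = (
    "sha1,mime_type,size\n"
    f"{sha1_hex(sample_blob)},text/plain,{len(sample_blob)}\n"
)

index = index_by_checksum(sample_csv)
rows = index.get(sha1_hex(sample_blob), [])
```

Keeping a list of rows per checksum, instead of a single row, leaves room for metadata tables in which the same document appears more than once.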












Notes
License Inclusion Principles: https://github.com/spdx/license-list-XML/blob/main/DOCS/license-inclusion-principles.md
The version of the dataset discussed in this paper is available at https://annex.softwareheritage.org/public/dataset/license-blobs/2022-04-25/; other versions of the dataset (both past versions and future ones) are available starting from https://annex.softwareheritage.org/public/dataset/license-blobs/
Software Heritage is an archival project established in 2015 with the stated goal of collecting, preserving forever, and making publicly available the entire body of software, in the preferred form for making modifications to it. A detailed description of the project is out of scope for this paper; we therefore refer the interested reader to previous publications about the project, Di Cosmo and Zacchiroli (2017) and Abramatic et al. (2018), its homepage at https://www.softwareheritage.org, and the archive status page at https://archive.softwareheritage.org (accessed 2022-10-20), where one can find an up-to-date view of the software origins that are periodically crawled to populate the archive.
OSI (Open Source Initiative): https://opensource.org
OSI Approved licenses: https://opensource.org/licenses-draft (accessed on 2022-10-30)
SPDX license list: https://spdx.org/licenses/ (accessed on 2022-10-30)
ScanCode LicenseDB: https://scancode-licensedb.aboutcode.org/ (accessed on 2022-10-30)
https://annex.softwareheritage.org/public/dataset/license-blobs/2019-03-21/ (accessed 2022-11-10)
https://annex.softwareheritage.org/public/dataset/license-blobs/2021-03-23/ (accessed 2022-11-10)
https://annex.softwareheritage.org/public/dataset/license-blobs/2022-04-25/ (accessed 2022-11-10)
All dataset versions are available starting from https://annex.softwareheritage.org/public/dataset/license-blobs/
See the “How to apply the Apache License to your work” part of the Apache 2.0 license for an example of a license reference: https://www.apache.org/licenses/LICENSE-2.0 (accessed 2022-11-10).
If the document was found under several different filenames, as can happen, it will appear in the index once for each different filename.
Version used: ScanCode 31.2.1.
Details about the JSON schema: https://scancode-toolkit.readthedocs.io/en/stable/cli-reference/output-format.html (accessed 2022-11-09)
The complete SQL query is available as part of the dataset replication package (Gonzalez-Barahona et al. 2023), in the replication-package.tar.gz file.
https://scancode-toolkit.readthedocs.io/, accessed 2022-11-09
https://docs.softwareheritage.org/devel/swh-graph/api.html#leaves, accessed 2022-11-09
SWHID swh:1:cnt:36406a1eee032e80a284d3ed9f5176bba67be064
SWHID swh:1:cnt:cdc98c898b1d257ddb4752ee7a1c85ed3ddf5673
SWHID swh:1:cnt:2e26bf237427aaa56f99846acb1aeb94198119e9
SWHID swh:1:cnt:606a3bce98a4ade7d80c2761b8458d79438a3c6f
SWHID swh:1:cnt:78ec4db8002adeae4fcbfa5f56b3c1e51bfaf8c5
SWHID swh:1:cnt:c7f43dd49cbedb819fc247b3bfe5ae45841738dc
SWHID swh:1:cnt:9ea952f4a37478f17f2a2aafb45ced7a4df67de2
SWHID swh:1:cnt:aa3157cb23f7de5d062ab5d0bf0ffb44bb719df9
SWHID swh:1:cnt:509b6082ee6debe85c005d80f047668d70dd1cb8
SWHID swh:1:cnt:f961852cee6ee9e9a0b8a25af5d090ddb6abe6a8
SWHID swh:1:cnt:711ded4ae27c43ba18a71ad05e9466a268e4387a
SWHID swh:1:cnt:46ae7b2bee342168dc48d6ca7fa1753b98e525d8
SWHID swh:1:cnt:62319023a68b04f23ea30931bb1a7c1a3e741fba
SWHID swh:1:cnt:eb9ed7bfc458af9796b59426d54d0f97a199078f
SWHID swh:1:cnt:b864764d9fc4d55eb09e123e42ede11519556d18
SWHID swh:1:cnt:9bffa2d5a63151c8c9bf3d68e9f9445558273612
SWHID swh:1:cnt:c53a6c27009183d8304d26a213b1321bdfc0cb8d
SWHID swh:1:cnt:41a6fc531459dde48d1752f24eae007047361709
SWHID swh:1:cnt:4e5eebfdbebefe990e309ecbdd83842035d3852c
SWHID swh:1:cnt:105961e3702324fadaa808457338a984101d6028
SWHID swh:1:cnt:f3932de6d7f19b26afaa7bc8502c800476c2f0a5
SWHID swh:1:cnt:fed8329964dd68adcd3dc98dd405950e53614282
SWHID swh:1:cnt:60ff9a40c14915b25d265f2bdfb508274b6782fe
SWHID swh:1:cnt:ace0bbb7fe0a8677ef5ae001b5da076b2aa666a5
SWHID swh:1:cnt:9392142a987ee04c3f0d303a58b19df818df86b3
SWHID swh:1:cnt:eb531dc6990ca433ccde3100633780ad55aed22b
licen and licens are Python modules for dealing with the Document Collection.
path_from_filename is a function that returns the path of a document in the collection, given its name (SHA1).
For a full, ready-to-work program, check the file truth/random_forest.py in the dataset.
Software Heritage archive changelog page: https://docs.softwareheritage.org/devel/archive-changelog.html (accessed 2022-11-10)
https://spdx.org/licenses/, accessed 2022-11-10
Debian Copyright Review Tools: https://wiki.debian.org/CopyrightReviewTools
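The SWHIDs listed in the notes above identify individual file contents in the Software Heritage archive. As a hedged sketch (not code from the dataset), the snippet below parses such an identifier and recomputes a content SWHID from raw bytes. It relies on the fact that a `swh:1:cnt:` identifier carries the Git blob hash of the content, that is, the SHA1 of a `blob <length>\0` header followed by the bytes. The function names are hypothetical, and qualifiers (such as `;origin=...`) are deliberately not handled.

```python
import hashlib
import re

# Core SWHID: scheme, version 1, object type, 40 hex digits (no qualifiers).
SWHID_RE = re.compile(r"^swh:1:(cnt|dir|rev|rel|snp):([0-9a-f]{40})$")


def parse_swhid(swhid: str) -> tuple[str, str]:
    """Split a core SWHID into (object_type, hex_hash)."""
    match = SWHID_RE.match(swhid.strip())
    if match is None:
        raise ValueError(f"not a core SWHID: {swhid!r}")
    return match.group(1), match.group(2)


def content_swhid(data: bytes) -> str:
    """Compute the SWHID of a file content from its raw bytes.

    The hash is the Git blob hash: SHA1 over a "blob <length>\\0" header
    followed by the content itself.
    """
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return "swh:1:cnt:" + hashlib.sha1(header + data).hexdigest()


object_type, hex_hash = parse_swhid(
    "swh:1:cnt:36406a1eee032e80a284d3ed9f5176bba67be064"
)
empty_swhid = content_swhid(b"")
```

Because of the header, the hash inside a content SWHID generally differs from the plain SHA1 of the same bytes. If the SHA1 names used in the document collection are plain SHA1 digests (an assumption here), the two identifiers cannot be compared directly and each has to be computed from the content.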
References
Abramatic JF, Di Cosmo R, Zacchiroli S (2018) Building the universal archive of source code. Communications of the ACM 61(10):29–31
Allançon T, Pietri A, Zacchiroli S (2021) The software heritage filesystem (swhfs): Integrating source code archival with development. In 43rd IEEE/ACM International Conference on Software Engineering: Companion Proceedings, ICSE Companion 2021, Madrid, Spain, May 25-28, 2021, pages 45–48. IEEE
Bird S (2006) NLTK: the natural language toolkit. In Nicoletta Calzolari, Claire Cardie, and Pierre Isabelle, editors, ACL 2006, 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, Sydney, Australia, 17-21 July 2006. The Association for Computer Linguistics
Boldi P, Pietri A, Vigna S, Zacchiroli S (2020) Ultra-large-scale repository analysis via graph compression. In SANER 2020: The 27th IEEE International Conference on Software Analysis, Evolution and Reengineering. IEEE, 2020
Caneill M, Germán DM, Zacchiroli S (2017) The debsources dataset: Two decades of free and open source software. Empirical Software Engineering 22:1405–1437
ClearlyDefined (2023) ClearlyDefined, 2023. https://clearlydefined.io. Accessed 2023-05-08
Collet Y (2022) RFC 8878 - Zstandard compression and the “application/zstd” media type, 2021. Accessed 2022-01-24
Di Cosmo R, Gruenpeter M, Zacchiroli S (2018) Identifiers for digital objects: the case of software source code preservation. In Proceedings of the 15th International Conference on Digital Preservation, iPRES 2018, Boston, USA
Di Cosmo R, Zacchiroli S (2017) Software Heritage: Why and how to preserve software source code. In Proceedings of the 14th International Conference on Digital Preservation, iPRES 2017
Di Penta M, German DM, Guéhéneuc YG, Antoniol G (2010) An exploratory study of the evolution of software licensing. In Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering - Volume 1, ICSE ’10, pages 145–154, New York, NY, USA, 2010. Association for Computing Machinery
Dyer R, Nguyen HA, Rajan H, Nguyen TN (2015) Boa: Ultra-large-scale software repository and source-code mining. ACM Trans. Softw Eng Methodol 25(1):7:1–7:34
Flint SW, Chauhan J, Dyer R (2021) Escaping the time pit: Pitfalls and guidelines for using time-based git data. In 18th IEEE/ACM International Conference on Mining Software Repositories, MSR 2021, Madrid, Spain, May 17-19, 2021, pages 85–96. IEEE, 2021
Gandhi RA, Germonprez M, Link GJP (2018) Open data standards for open source software risk management routines: An examination of SPDX. In Forte A, Prilla M, Vivacqua AS, Müller C, and Robert LP Jr., editors, Proceedings of the 2018 ACM Conference on Supporting Groupwork, GROUP 2018, Sanibel Island, FL, USA, January 07 - 10, pages 219–229. ACM, 2018
German DM, Di Penta M, Davies J (2010) Understanding and auditing the licensing of open source software distributions. In 2010 IEEE 18th International Conference on Program Comprehension 84–93
German DM, González-Barahona JM (2009) An empirical study of the reuse of software licensed under the GNU General Public License. In Boldyreff C, Crowston K, Lundell B, and Wasserman AI, editors, Open Source Ecosystems: Diverse Communities Interacting, pages 185–198, Berlin, Heidelberg, 2009. Springer Berlin Heidelberg
German DM, Hassan AE (2009) License integration patterns: Addressing license mismatches in component-based development. In 2009 IEEE 31st International Conference on Software Engineering 188–198
Germán DM, Manabe Y, Inoue K (2010) A sentence-matching method for automatic license identification of source code files. In Pecheur C, Andrews J, and Di Nitto E, editors, ASE 2010, 25th IEEE/ACM International Conference on Automated Software Engineering, Antwerp, Belgium, September 20-24, pages 437–446. ACM, 2010
Germán DM, Di Penta M (2012) A method for open source license compliance of java applications. IEEE Softw 29(3):58–63
GitHub. Licensee (2023). https://licensee.github.io/licensee/. Accessed 2023-05-08
Gobeille R (2008) The fossology project. In Hassan AE, Lanza M, and Godfrey MW, editors, Proceedings of the 2008 International Working Conference on Mining Software Repositories, MSR 2008 (Co-located with ICSE), Leipzig, Germany, May 10-11, 2008, Proceedings 47–50. ACM
Gomulkiewicz RW (2009) Open source license proliferation: Helpful diversity or hopeless confusion. Wash. UJL & Pol’y 30:261
Gonzalez-Barahona JM, Montes-Leon S, Robles G, Zacchiroli S (2023) The Software Heritage License Dataset (2022 Edition). https://doi.org/10.5281/zenodo.8200352
Gousios G, Spinellis D (2012) Ghtorrent: Github’s data from a firehose. In Lanza M, Di Penta M, and Xie T, editors, 9th IEEE Working Conference of Mining Software Repositories, MSR, pages 12–21. IEEE Computer Society, 2012
Harutyunyan N (2020) Managing your open source supply chain-why and how? Computer 53(6):77–81
Libraries.io. Libraries.io (2023). https://libraries.io. Accessed 2023-05-08
Lindberg V (2008) Intellectual property and open source: a practical guide to protecting. O’Reilly Media, Inc., 2008
Ma Y, Dey T, Bogart C, Amreen S, Valiev M, Tutko A, Kennard D, Zaretzki R, Mockus A (2021) World of code: enabling a research workflow for mining and analyzing the universe of open source VCS data. Empir Softw Eng 26(2):22
Manabe Y, German DM, Inoue K (2014) Analyzing the relationship between the license of packages and their files in free and open source software. In Corral L, Sillitti A, Succi G, Vlasenko J, and Wasserman AI, editors, Open Source Software: Mobile Open Source Technologies 51–60, Berlin, Heidelberg, 2014. Springer Berlin Heidelberg
Manabe Y, Hayase Y, Inoue K (2010) Evolutional analysis of licenses in FOSS. In Andrea Capiluppi, Anthony Cleve, and Naouel Moha, editors, Proceedings of the Joint ERCIM Workshop on Software Evolution (EVOL) and International Workshop on Principles of Software Evolution (IWPSE), Antwerp, Belgium, September 20-21, 2010, pages 83–87. ACM, 2010
Maryka T, Germán DM, Poo-Caamaño G (2015) On the variability of the BSD and MIT licenses. In Ernesto Damiani, Fulvio Frati, Dirk Riehle, and Anthony I. Wasserman, editors, Open Source Systems: Adoption and Impact - 11th IFIP WG 2.13 International Conference, OSS 2015, Florence, Italy, May 16-17, 2015, Proceedings, volume 451 of IFIP Advances in Information and Communication Technology 146–156. Springer, 2015
Maryka T, German DM, Poo-Caamaño G (2015) On the variability of the BSD and MIT licenses. In Damiani E, Frati F, Riehle D, and Wasserman AI, editors, Open Source Systems: Adoption and Impact (OSS 2015), pages 146–156. Springer International Publishing, Cham
McKinney W et al (2011) Pandas: a foundational python library for data analysis and statistics. Python for high performance and scientific computing 14(9):1–9
nexB. ScanCode (2022). https://www.aboutcode.org/projects/scancode.html. Accessed 2022-01-25
nexB. ScanCode LicenseDB (2022). https://scancode-licensedb.aboutcode.org/. Accessed 2022-01-26
Ombredanne P (2020) Free and open source software license compliance: Tools for software composition analysis. Computer 53(10):105–109
Open Source Initiative (2022) Machine readable OSI license information, 2022. https://github.com/OpenSourceOrg/licenses/. Accessed 2022-01-26
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V et al (2011) Scikit-learn: Machine learning in Python. J Mach Learn Res 12:2825–2830
Phipps S, Zacchiroli S (2020) Continuous open source license compliance. Computer 53(12):115–119
Pietri A, Spinellis D, Zacchiroli S (2019) The Software Heritage graph dataset: public software development under one roof. In Storey MAD, Adams B, and Haiduc S, editors, Proceedings of the 16th International Conference on Mining Software Repositories, MSR 2019, 26-27 May 2019, Montreal, Canada., pages 138–142. IEEE / ACM
Rosen L (2005) Open source licensing, volume 692. Prentice Hall
Rousseau G, Di Cosmo R, Zacchiroli S (2020) Software provenance tracking at the scale of public source code. Empirical Software Engineering 25(4):2930–2959
Shafranovich Y (2005) RFC 4180 - common format and MIME type for comma-separated values (CSV) files, 2005. Accessed 2022-01-24
SPDX Workgroup (2020) Software package data exchange licence list, 2019. https://spdx.org/license-list, retrieved 30 March 2020
Srinivasa-Desikan B (2018) Natural Language Processing and Computational Linguistics: A practical guide to text analysis with Python, Gensim, spaCy, and Keras. Packt Publishing Ltd, 2018
Stewart K, Odence P, Rockett E (2010) Software package data exchange (SPDX) specification. IFOSS L Rev 2:191
The CodeMeta Project (2023) The CodeMeta Project, 2023. https://codemeta.github.io/. Accessed 2023-05-08
The Open Group (2018) file: determine file type, 2018. https://pubs.opengroup.org/onlinepubs/9699919799/utilities/file.html. Accessed 2022-01-25
Vendome C, Bavota G, Di Penta M, Vásquez ML, Germán DM, Poshyvanyk D (2017) License usage and changes: a large-scale study on GitHub. Empir Softw Eng 22(3):1537–1577
Vendome C, Linares-Vásquez M, Bavota G, Di Penta M, German DM, Poshyvanyk D (2015) When and why developers adopt and change software licenses. In 2015 IEEE International Conference on Software Maintenance and Evolution (ICSME) pages 31–40
Vendome C, Vásquez ML, Bavota G, Di Penta M, Germán DM, Poshyvanyk D (2017) Machine learning-based detection of open source license exceptions. In Sebastián Uchitel, Alessandro Orso, and Martin P. Robillard, editors, Proceedings of the 39th International Conference on Software Engineering, ICSE 2017, Buenos Aires, Argentina, May 20-28, 2017, pages 118–129. IEEE / ACM, 2017
Xu S, Gao Y, Fan L, Liu Z, Liu Y, and Ji H (2023) Lidetector: License incompatibility detection for open source software. ACM Trans. Softw Eng Methodol 32(1)
Zacchiroli S (2022) A large-scale dataset of (open source) license text variants. In The 2022 Mining Software Repositories Conference (MSR 2022), pages 757–761. ACM, 2022
Zhang D, Luo P, Tang W, Zhou M (2021) Osldetector: Identifying open-source libraries through binary analysis. In Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering, ASE ’20, pages 1312–1315, New York, NY, USA, 2021. Association for Computing Machinery
Acknowledgements
This work was made possible by Software Heritage, the great library of source code: https://www.softwareheritage.org. The authors would like to thank Valentin Lorentz from the Software Heritage engineering team for his help in releasing the new version of the license dataset documented in this paper and streamlining the dataset publication process.
Ethics declarations
Conflicts of interest
The authors declared that they have no conflict of interest.
Additional information
Communicated by: Nicole Novielli, Shane McIntosh, David Lo.
About this article
Cite this article
Gonzalez-Barahona, J.M., Montes-Leon, S., Robles, G. et al. The software heritage license dataset (2022 edition). Empir Software Eng 28, 147 (2023). https://doi.org/10.1007/s10664-023-10377-w
DOI: https://doi.org/10.1007/s10664-023-10377-w