Software Heritage: the universal archive of software source codes

Roberto Di Cosmo

Software is the engine of our industry, the fuel of innovation, the essential instrument we use to communicate, to maintain ourselves, to perform any kind of transaction and operation, to organize ourselves in society and form our political opinions. Software is crucial to the functioning of economic, social and political organizations, whether public or private, whether on mobile devices or in the cloud. It is also the indispensable mediator that enables access to all digital information, and it is, along with articles and data, one of the pillars of modern research (Noorden et al., 2014).

Software therefore represents an important part of our scientific, technical and industrial heritage.

If one looks closely, it is easy to see that the real knowledge contained in software is not in the executable programs, but in the "source code", which, according to the definition used in the GPL, is "the preferred form for a developer to make a change to a program."1 Source code is a special form of knowledge: it is made to be understood by a human being, the developer, and can be mechanically translated into a form to be executed directly on a machine. The very terminology used by the computing community is telling: "programming languages" are used to "write" software. As Harold Abelson wrote as early as 1985, "programs must be written first so that other human beings can read them" (Abelson & Sussman, 1985).

The source code of software is therefore a human creation in the same way as other written documents, and software developers deserve the same respect as other creators.

Software source code is therefore valuable heritage, as already argued by Len Shustek in a fine 2006 article (Shustek, 2006) as well as by Donald Knuth (Knuth, 1984), and it is thus essential to work on its preservation.

This is one of the missions of Software Heritage, an initiative launched in 2015 with the support of Inria,2 to collect, organize, preserve and make easily accessible all publicly available source code on the planet, regardless of where and how it was developed or distributed.

A complex task

Archiving all available source code is a complex task. As detailed in the literature (Abramatic et al., 2018), one must deploy different strategies depending on whether one seeks to collect open-source or proprietary source code, and source code that is readily available online is not treated in the same way as source code that resides on older physical media.

For open-source code that is readily available online, the most appropriate approach is to build a harvester that automatically collects content from a wide variety of collaborative development platforms, such as GitHub, GitLab.com or Bitbucket, and from software package distribution platforms, such as Debian, NPM, CRAN or PyPI.
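The harvesting idea can be illustrated with a minimal sketch. It is not the actual Software Heritage implementation: the `Origin` type, the `harvest` function, the two stand-in listers and the `example` URLs are all hypothetical, and the listers return fixed data instead of querying a real platform. The sketch only shows the general shape of the approach: one small adapter per platform, and a loop that visits them all and keeps each source location once.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Iterator, List, Set


@dataclass(frozen=True)
class Origin:
    """A location from which source code can be fetched (hypothetical model)."""
    platform: str
    url: str


# A lister is any function that returns the origins known to one platform.
Lister = Callable[[], Iterable[Origin]]


def harvest(listers: Dict[str, Lister]) -> Iterator[Origin]:
    """Visit every registered platform and yield each origin URL only once."""
    seen: Set[str] = set()
    for lister in listers.values():
        for origin in lister():
            if origin.url in seen:
                continue  # already collected through another platform
            seen.add(origin.url)
            yield origin


# Stand-in listers with fixed data; a real one would query the platform.
def forge_lister() -> List[Origin]:
    return [
        Origin("forge", "https://forge.example/alice/tool"),
        Origin("forge", "https://forge.example/bob/lib"),
    ]


def package_lister() -> List[Origin]:
    return [
        Origin("packages", "https://packages.example/lib"),
        Origin("packages", "https://forge.example/bob/lib"),  # seen before
    ]


harvested = list(harvest({"forge": forge_lister, "packages": package_lister}))
for origin in harvested:
    print(origin.platform, origin.url)
```

Keeping the platform-specific logic behind a common function signature means a new platform can be added without touching the harvesting loop.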

For the source code of old software, a real process of computer archaeology must be set up. We have already started this work in a collaboration with the University of Pisa and UNESCO, which has resulted in the SWHAP process. SWHAP has been used to find, document and archive software that is important in the history of computing in Italy. It has recently been extended with the Software Stories project, which aims to highlight all the historical elements around software whose source code has been found.

A universal mission

The founding principles of Software Heritage are (Abramatic et al., 2018; Di Cosmo & Zacchiroli, 2017): the systematic use of open-source software to build the Software Heritage infrastructure, so that its operation can be understood, and replicated if necessary; the construction of a global network of independent mirrors of the archive, because a large number of copies is the best protection against loss and attack; and a non-profit, international, multi-stakeholder structure, to minimize the risk of single points of failure and to ensure that Software Heritage will indeed serve all.

For such a mission, institutional legitimacy is required, as well as a real capacity for openness to enable a broad consensus. The framework agreement signed between Inria and UNESCO on April 3, 2017, and renewed in November 2021, is essential in this regard.

Past, present, future: much more than an archive!

Software Heritage now has an infrastructure that grows day by day. While the bulk of the archive's content is the result of automatic harvesting, some real treasures are beginning to be uncovered through the patient work of recovering significant historical software, following an acquisition process that has been developed in collaboration with the University of Pisa and UNESCO.3

Figure 1: Number of projects, source files, and versions archived in Software Heritage as of June 2022

While exhaustiveness is still far from being achieved, the archive already contains the largest corpus of source code available on the planet, with more than 180 million archived origins and over 12 billion unique source files, each equipped with an intrinsic identifier based on cryptographic hashes (Di Cosmo et al., 2018).
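To make the notion of an intrinsic identifier concrete, here is a minimal sketch of how such an identifier can be computed for the content of a single file. The article does not spell out the scheme, so the details below are an assumption based on the published SWHID format for file contents: a SHA-1 hash computed over the bytes of the file preceded by a Git-style `blob` header, rendered with the prefix `swh:1:cnt:`. The function name `content_swhid` is chosen here for illustration only.

```python
import hashlib


def content_swhid(data: bytes) -> str:
    """Return an intrinsic identifier computed from the file content alone.

    Assumption: the identifier follows the SWHID scheme for contents,
    which hashes the bytes the same way Git hashes a blob object.
    """
    # Git-style header: object type, content length in bytes, NUL separator.
    header = b"blob %d\x00" % len(data)
    digest = hashlib.sha1(header + data).hexdigest()
    return f"swh:1:cnt:{digest}"


# The identifier depends only on the bytes, not on a file name or location.
print(content_swhid(b""))
# prints swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Because the identifier is derived from the content itself, two identical files found on different platforms receive the same identifier, and anyone holding a copy can recompute it to check that the copy is intact.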

This unique infrastructure serves several purposes. It is, of course, about preserving for future generations the source code of the past that made the history of Computer Science and the Information Society. But also, and above all, we are trying to build a very large telescope that will allow us to explore the present evolution of the software development galaxy, in order to better understand it, to improve it, and to build a better technological future.

A strategic issue, which needs to be known

The Software Heritage archive is already the most important collection of source code in the world, but there is still a lot of work to do, and a wide range of players, from those working in cultural heritage to industry, from research to public administration, must be brought together to achieve this. To make this possible we are counting on a growing network of ambassadors, including the Computer History Museum in Ljubljana, Slovenia.

It is clear that software has now become an essential component of all human activity, and therefore unrestricted access to publicly available software source code is becoming a digital sovereignty issue for all nations.

The unique infrastructure that Software Heritage is building, and its universal approach, are essential elements in meeting the challenges of digital sovereignty while preserving the common good dimension of the archive.

It is therefore of the utmost importance that institutional, industrial, academic and civil society actors grasp the importance of these issues, and that Europe positions itself quickly, by providing the necessary resources to make Software Heritage grow and last, by taking its place alongside other international actors who are already committed to this project, and by supporting the creation of an international non-profit institution that will carry out this mission over the long term.

Notes

1. GNU General Public License, version 2, 1991. Retrieved September 2015.

2. Created in 1967, Inria is a public scientific and technological institution specialized in mathematics and computer science, under the dual supervision of the French Ministry of Higher Education, Research and Innovation and the Ministry of Economy and Finance.