A few months back I posted a quick analysis of trends in where bioinformaticians choose to host their source code. A clear trend emerging in the bioinformatics community is to use github as the primary repository of bioinformatics code in published papers. While I am a big fan of github and I support its widespread adoption, in that post I noted my concerns about the ease with which an individual can delete a published repository. In contrast to SourceForge, where it is extremely difficult to delete a repository once files have been released and this can only be done by SourceForge itself, deleting a repository on github takes only a few seconds and can be done (accidentally or intentionally) by the user who created the repository.
Just to see how easy this is, I’ve copied the process for deleting a repository on github here:
- Go to the repo’s admin page
- Click “Delete this repository”
- Read the warnings and enter the name of the repository you want to delete
- Click “I understand the consequences, delete this repository
Given the increasing use of github in publications, I feel the issue of repository deletion on github needs to be discussed by scientists and publishers more in the context of the important issue of long-term maintenance of published code. The reason I see this as important is that most github repositories are published via individual user accounts, and thus only one person holds the keys to preservation of the published code. Furthermore, I suspect funders, editors, publishers and (most) PIs have no idea how easy it is under the current model to delete published code. Call me a bit paranoid, but I see it is my responsibility as a PI to ensure the long-term preservation of published code, since I’m the one who signs off of data/resource plans in grants/final reports. Better to be safe than sorry, right?
On this note, I was pleased to see a retweet in my stream this week (via C. Titus Brown) concerning news that the journal Computers & Geosciences has adopted an official policy for hosting published code on github:
The mechanism that Computers & Geosciences has adopted to ensure long-term preservation of code in their journal is very simple – for the editor to fork code submitted by a github user into a journal organization (note: a similar idea was also suggested independently by Andrew Perry in the comments to my previous post). As clearly stated in the github repository deletion mechanism “Deleting a private repo will delete all forks of the repo. Deleting a public repo will not.” Thus, once Computers & Geosciences has forked the code, risk to the author, journal and community of a single point of failure is substantially ameliorated, with very little overhead to authors or publishers.
So what about the many other journals that have no such digital preservation policy but currently publish papers with bioinformatics code in github? Well, as a stopgap measure until other journals get on board with similar policies (PLOS & BMC, please lead the way!), I’ve taken the initiative to create a github organization called BioinformaticsArchive to serve this function. Currently, I’ve forked code for all but one of the 64 publications with github URLs in their PubMed record. One of the scary/interesting things to observe from this endeavor is just how fragile the current situation is. Of the 63 repositories I’ve forked, about 50% (n=31) had not been previously forked by any other user on github and could have been easily deleted, with consequent loss to the scientific community.
I am aware (thanks to Marc Robinson Rechavi) there are many more published github repositories in the full-text of articles (including two from our lab), which I will endeavor to dig out and add to this archive asap. If anyone else would like to help out with the endeavor, or knows of published repositories that should included, send me an email or tweet and I’ll add them to the archive. Comments on how to improve on the current state of preservation of published bioinformatics code on github and what can be learned form Computers and Geosciences new model policy are most welcome!