On the Preservation of Published Bioinformatics Code on GitHub

A few months back I posted a quick analysis of trends in where bioinformaticians choose to host their source code. A clear trend emerging in the bioinformatics community is to use GitHub as the primary repository for bioinformatics code in published papers. While I am a big fan of GitHub and support its widespread adoption, in that post I noted my concerns about the ease with which an individual can delete a published repository. In contrast to SourceForge, where deleting a repository once files have been released is extremely difficult and can only be done by SourceForge itself, deleting a repository on GitHub takes only a few seconds and can be done (accidentally or intentionally) by the user who created it.

Just to see how easy this is, I’ve copied the process for deleting a repository on GitHub here (a scripted equivalent via the API is sketched just after the list):

  • Go to the repo’s admin page
  • Click “Delete this repository”
  • Read the warnings and enter the name of the repository you want to delete
  • Click “I understand the consequences, delete this repository”
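
Deletion is just as easy to script. Here is a minimal sketch using the GitHub API – note that OWNER/REPO and the token are placeholders, and the account must be authorized to delete the repository:

```python
# Minimal sketch: deleting a repository via the GitHub API.
# OWNER, REPO, and the token are placeholders, not real values.
import requests

token = "PERSONAL_ACCESS_TOKEN"  # must be authorized to delete the repo
resp = requests.delete(
    "https://api.github.com/repos/OWNER/REPO",
    headers={"Authorization": f"token {token}"},
)
# GitHub answers 204 No Content on success; the repository is gone.
print(resp.status_code)
```

One HTTP request, and the published record disappears.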

Given the increasing use of GitHub in publications, I feel the issue of repository deletion on GitHub needs to be discussed more by scientists and publishers in the context of the important issue of long-term maintenance of published code. The reason I see this as important is that most GitHub repositories are published via individual user accounts, and thus only one person holds the keys to preservation of the published code. Furthermore, I suspect funders, editors, publishers and (most) PIs have no idea how easy it is under the current model to delete published code. Call me a bit paranoid, but I see it as my responsibility as a PI to ensure the long-term preservation of published code, since I’m the one who signs off on data/resource plans in grants/final reports. Better to be safe than sorry, right?

On this note, I was pleased to see a retweet in my stream this week (via C. Titus Brown) with news that the journal Computers & Geosciences has adopted an official policy for hosting published code on GitHub.

The mechanism that Computers & Geosciences has adopted to ensure long-term preservation of code in their journal is very simple – the editor forks code submitted by a GitHub user into a journal organization (note: a similar idea was also suggested independently by Andrew Perry in the comments to my previous post). As GitHub’s repository deletion warning clearly states, “Deleting a private repo will delete all forks of the repo. Deleting a public repo will not.” Thus, once Computers & Geosciences has forked the code, the risk to the author, journal and community of a single point of failure is substantially reduced, with very little overhead for authors or publishers.
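
This journal-side fork is also easy to automate. A minimal sketch, assuming a token for an editor account with rights to create repositories in the journal’s organization (all names below are placeholders):

```python
# Minimal sketch: fork an author's repository into a journal organization.
# AUTHOR/PAPER_CODE, JournalOrg, and the token are placeholders.
import requests

token = "JOURNAL_EDITOR_TOKEN"
resp = requests.post(
    "https://api.github.com/repos/AUTHOR/PAPER_CODE/forks",
    headers={"Authorization": f"token {token}"},
    json={"organization": "JournalOrg"},  # the fork lands in the journal's org
)
# GitHub answers 202 Accepted; the fork is created asynchronously.
print(resp.status_code)
```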

So what about the many other journals that have no such digital preservation policy but currently publish papers with bioinformatics code on GitHub? Well, as a stopgap measure until other journals get on board with similar policies (PLOS & BMC, please lead the way!), I’ve taken the initiative to create a GitHub organization called BioinformaticsArchive to serve this function. Currently, I’ve forked code for all but one of the 64 publications with GitHub URLs in their PubMed record. One of the scary/interesting things to observe from this endeavor is just how fragile the current situation is. Of the 63 repositories I’ve forked, about 50% (n=31) had not previously been forked by any other user on GitHub and could have easily been deleted, with consequent loss to the scientific community.
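
For anyone who would like to reproduce or extend this census, here is a sketch of how one might enumerate PubMed records that mention github.com, using the NCBI E-utilities (the query term and retmax value are just examples):

```python
# Sketch: list PubMed records that mention github.com, via NCBI esearch.
import requests

resp = requests.get(
    "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
    params={"db": "pubmed", "term": "github.com", "retmax": 200, "retmode": "json"},
)
pmids = resp.json()["esearchresult"]["idlist"]
print(len(pmids), "PubMed records mention github.com")
```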

I am aware (thanks to Marc Robinson-Rechavi) that there are many more published GitHub repositories in the full text of articles (including two from our lab), which I will endeavor to dig out and add to this archive ASAP. If anyone else would like to help out with the endeavor, or knows of published repositories that should be included, send me an email or tweet and I’ll add them to the archive. Comments on how to improve the current state of preservation of published bioinformatics code on GitHub, and on what can be learned from Computers & Geosciences’ new model policy, are most welcome!

28 thoughts on “On the Preservation of Published Bioinformatics Code on GitHub”

  1. The Computers & Geosciences policy of taking a GitHub fork of the repository is very neat, but that will only capture the codebase at the time of the fork (i.e., roughly speaking, the codebase at the point of the associated paper’s publication – an important snapshot).

    I do have a concern, however: if the original repository is then updated (e.g. bug fixes) and then accidentally or otherwise deleted, the journal’s fork would never have contained those bug fixes (unless they have automated tracking of the upstream repository?).

    • You are absolutely right that in its simplest form the fork-at-time-of-publication strategy will only provide one snapshot of the codebase. However, I’d argue that preserving some code is better than risking the loss of everything. You could also look at this basic strategy as equivalent to archiving the code as supplemental files attached to the publication on the journal website, which likewise cannot be updated. Moreover, deletion is less likely to affect projects that are being actively worked on, and having one fork should make it easier for people to find the latest version in the upstream repository. Ideally, however, enlightened journals would implement some sort of periodic update system along the lines you suggest.
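
      Such a periodic update could be a simple scheduled job. A minimal sketch, assuming the journal’s fork is cloned locally with the author’s original repository configured as a git remote named “upstream” (the branch name is an assumption):

      ```python
      # Sketch: refresh the journal's fork from the author's repository.
      # Assumes a local clone of the fork with an "upstream" remote configured.
      import subprocess

      subprocess.run(["git", "fetch", "upstream"], check=True)
      subprocess.run(["git", "merge", "--ff-only", "upstream/master"], check=True)
      subprocess.run(["git", "push", "origin", "master"], check=True)
      ```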

  2. Thank you for bringing up an important issue. Source code preservation policies at SourceForge have been rather important for preserving open-source-licensed code. In my case SourceForge preserved the SLRI toolkit my group made in Toronto. As the intellectual property underlying the code was sold to Thomson Reuters in 2007, my host institution and the dealmakers pressured me to delete the repository. SourceForge policy kept it on the site. In retrospect we were given some bad advice about some impending downfall of SourceForge, and I regret not ignoring that advice and putting all of our code on SourceForge, as was originally intended. The aftermath of all this is that of everything my group did under the guise of open source, only about 30% is preserved and online; the rest is buried in an intellectual property shoebox at Thomson Reuters.
    Host institutions have a lot of power of ownership over your intellectual property. If you win the right to post work as open source, GitHub’s delete policy means that your host institution can later override this and require you to take your code out of circulation. GitHub is great, but for the sake of preservation, SourceForge has the right policy, protecting your decision to go open source from later manipulation by your host institution when the code becomes “valuable”.

    • If it’s open source, can’t you just fork the software if your host institution decides to take down the work? The right to create forks is a central aspect of open source.

      • Certainly, once released as open source, yes, anyone can fork/copy the code. The original authors doing so in this situation would probably have to disobey their employer, which puts them in a very awkward position.

        The problem is that a member of the public can’t take a fork/copy if the public repository has already been deleted – and right now, if the code is on GitHub, the owner can easily delete it (either voluntarily or under pressure from their employer).

  3. A few people have said this throughout history, but we are essentially all prostitutes – whilst some sell their bodies, others sell their minds. That’s what we do, we take a salary, and in return all of our ideas and thoughts belong to our employers. This is nothing new or controversial.

  4. It might also be a good idea to augment this with the simple act of cloning each of these repositories to a machine under your own control, as sketched below. If GitHub were to go belly-up tomorrow, then even these forked repos under your GitHub account would be gone.
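
    A rough sketch of such a local mirror, using the GitHub API to enumerate the archive organization’s repositories (only the first page of results is handled; a full script would follow the API’s pagination links):

    ```python
    # Sketch: mirror-clone every repository in the BioinformaticsArchive org.
    import requests
    import subprocess

    repos = requests.get(
        "https://api.github.com/orgs/BioinformaticsArchive/repos",
        params={"per_page": 100},
    ).json()

    for repo in repos:
        # --mirror copies all refs, so the local clone is a complete backup
        subprocess.run(["git", "clone", "--mirror", repo["clone_url"]], check=True)
    ```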

  5. This is a great initiative! Once it’s published and we’ve made the time-of-publication snapshot, I hope that github.com/boscoh/inmembrane can be forked too.

    To mitigate the (slim) possibility of GitHub going belly up overnight, I’d suggest making a mirrored version on something like Bitbucket – they have free unlimited academic accounts, and a simple mechanism to clone projects directly from GitHub:

    https://confluence.atlassian.com/display/BITBUCKET/Importing+code+from+an+existing+project#Importingcodefromanexistingproject-Importfromahostingsiteorprojectusingtheimporter

  6. This is a great idea, and it can be automated with some buy-in from the authors/repo maintainers. All you would need is an agreed-upon tag included in the repo name or the README.md; then crawl GitHub and fork all the repos with #published-science or #published-pubmed-repo or the equivalent (see the sketch below). I don’t see why anyone would resist that, given that the code is already public and published – so all we really need is a good tag and some publicity.
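
    A sketch of what that crawl might look like, using GitHub’s repository search with an in:readme qualifier – the tag, token and destination organization are all assumptions:

    ```python
    # Sketch: find repositories whose README carries an agreed-upon tag and
    # fork each one into an archive organization. The tag name, token, and
    # organization are placeholders.
    import requests

    token = "ARCHIVE_BOT_TOKEN"
    headers = {"Authorization": f"token {token}"}

    hits = requests.get(
        "https://api.github.com/search/repositories",
        params={"q": "published-science in:readme"},
        headers=headers,
    ).json()["items"]

    for repo in hits:
        requests.post(
            f"https://api.github.com/repos/{repo['full_name']}/forks",
            headers=headers,
            json={"organization": "BioinformaticsArchive"},
        )
    ```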

  7. This is great, thanks for raising an important issue and launching a clever initiative to address it.

    I think a more robust, if slightly more cumbersome, approach would be to place a copy of the git repository on Figshare. Unlike GitHub, Figshare data is archived by the CLOCKSS.org network, making it substantially less likely to be lost if the original provider (Figshare, GitHub) disappears from the face of the planet. Figshare is already becoming a popular repository for other supplemental material, and supports multiple versions of an object, ORCID integration, etc. Their API could help automate the process of archiving git repositories, though it is not as slick as forking or as easy to keep synchronized with the original.
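
    One way to automate this would be to pack the repository into a single git bundle and deposit that. A rough sketch, assuming a Figshare API token, with the article metadata as placeholders (the file upload itself follows Figshare’s multi-part upload flow, omitted here):

    ```python
    # Sketch: snapshot a git repository as one bundle file, then create a
    # Figshare article to hold it. The token and metadata are placeholders.
    import subprocess
    import requests

    # Pack all branches and tags into a single file
    subprocess.run(["git", "bundle", "create", "paper-code.bundle", "--all"], check=True)

    token = "FIGSHARE_API_TOKEN"
    resp = requests.post(
        "https://api.figshare.com/v2/account/articles",
        headers={"Authorization": f"token {token}"},
        json={"title": "Git repository snapshot for <paper>", "defined_type": "dataset"},
    )
    print(resp.status_code, resp.json())
    # Attaching paper-code.bundle would then follow Figshare's multi-part
    # file upload flow, omitted here for brevity.
    ```

    One nice property of a bundle is that it remains usable with git: `git clone paper-code.bundle` recovers a working repository.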

    • Once it’s on Figshare, though, a git repository is a dead entity – not clonable, forkable or accessible by git. In other words, you are archiving a different thing.

      The CLOCKSS backup is a nice idea, but it does raise the question of why our research libraries will not provide this sort of backup facility directly to scientists, and in a way that is sympathetic to the file format.
      A much better complement, I think, would be to maintain a master list with all the git URLs; this would make cloning the entire lot very easy for anyone who cared. In short, it would be CLOCKSS but using git.

      A second point is worth mentioning: GitHub does not necessarily maintain good license metadata. While forking on GitHub is legal (the owners of public repos agree to this in GitHub’s terms and conditions, even if they haven’t read them), forking elsewhere, including on Figshare, might well constitute copyright violation. For an example, consider broadgsa/gatk-protected, which would fall into this category.

      • “Once it’s on Figshare, though, a git repository is a dead entity – not clonable, forkable or accessible by git. In other words, you are archiving a different thing.”

        Yes, agreed. The only advantage here is for the CLOCKSS backup in case the original git repository disappears. If there is a viable and open source alternative to GitHub (I know of a few but haven’t tried them myself), it would be great if research libraries and journals forked copies after publication (with appropriate git tags) to their servers. Good point re: licensing.

      • Having a unique and permanent identifier for an archived digital object (in this case a git repo associated with a publication) makes no sense to you?

      • DOIs are not unique. It is straightforward to award multiple DOIs for the same object, and there is no restriction within the DOI system which prevents this. CrossRef DOIs do have a “we will fine you if you do this” post-hoc system for preventing it – but then, Figshare is not using CrossRef DOIs. And even with CrossRef, this requires a definition of equality between two papers. So you can have multiple identifiers for a single entity.

        Obtaining a unique ID in the other sense: yes, in some ways this has an advantage, although you are totally dependent on Figshare maintaining their DOIs. They don’t have to, and can break the link any time they choose. Or, indeed, they can point it at anything they like. You are not paying them, so you have no control over this.

        Finally, their supposed utility rests on the assumption that people actually use them. Given that it’s easier to cut and paste the URL from the browser’s title bar, why would they not just do that instead?

        Where I need a stable identifier – for example, for an ontology – I use PURLs. These work perfectly well; here I avoid the last problem because ontology URLs are not used in a browser.

  8. Pingback: Archiving of bioinformatics software | m's Bioinformatics

  9. Pingback: Archiving of bioinformatics software | Bioinfo Toolbox

  10. Pingback: How can we ensure the persistence of analysis software? – EvoPhylo

  11. Pingback: Reproducible phylogenetics part 2b; what – EvoPhylo

  12. Pingback: Developing a modern data workflow for regularly updated data -
