On the Preservation of Published Bioinformatics Code on Github

A few months back I posted a quick analysis of trends in where bioinformaticians choose to host their source code. A clear trend emerging in the bioinformatics community is to use github as the primary repository for bioinformatics code in published papers. While I am a big fan of github and support its widespread adoption, in that post I noted my concerns about the ease with which an individual can delete a published repository. On SourceForge, deleting a repository once files have been released is extremely difficult and can only be done by SourceForge itself; on github, deleting a repository takes only a few seconds and can be done (accidentally or intentionally) by the user who created it.

Just to see how easy this is, I’ve copied the process for deleting a repository on github here:

  • Go to the repo’s admin page
  • Click “Delete this repository”
  • Read the warnings and enter the name of the repository you want to delete
  • Click “I understand the consequences, delete this repository”
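To underline how little stands between a published repository and oblivion, the same deletion is a single call to GitHub’s REST API (DELETE /repos/{owner}/{repo}) by anyone holding an authorized token. A minimal sketch with placeholder names; it only constructs the request, it does not send it:

```python
# Repository deletion is one authenticated HTTP DELETE against GitHub's
# REST API. The owner/repo/token below are placeholders, not a real repo.
import urllib.request

def build_delete_request(owner, repo, token):
    """Construct (but do not send) the API request that deletes a repository."""
    url = "https://api.github.com/repos/{}/{}".format(owner, repo)
    req = urllib.request.Request(url, method="DELETE")
    req.add_header("Authorization", "token " + token)
    return req

req = build_delete_request("someuser", "published-tool", "SECRET-TOKEN")
print(req.get_method(), req.full_url)
# DELETE https://api.github.com/repos/someuser/published-tool
```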

Given the increasing use of github in publications, I feel the issue of repository deletion needs to be discussed more by scientists and publishers in the context of the important issue of long-term maintenance of published code. This matters because most github repositories are published via individual user accounts, and thus only one person holds the keys to the preservation of the published code. Furthermore, I suspect funders, editors, publishers and (most) PIs have no idea how easy it is under the current model to delete published code. Call me a bit paranoid, but I see it as my responsibility as a PI to ensure the long-term preservation of published code, since I’m the one who signs off on data/resource plans in grants and final reports. Better to be safe than sorry, right?

On this note, I was pleased to see a retweet in my stream this week (via C. Titus Brown) concerning news that the journal Computers & Geosciences has adopted an official policy for hosting published code on github:

The mechanism that Computers & Geosciences has adopted to ensure long-term preservation of code in their journal is very simple – the editor forks code submitted by a github user into a journal organization (note: a similar idea was also suggested independently by Andrew Perry in the comments to my previous post). As clearly stated in the github repository deletion mechanism, “Deleting a private repo will delete all forks of the repo. Deleting a public repo will not.” Thus, once Computers & Geosciences has forked the code, the risk to the author, journal and community of a single point of failure is substantially mitigated, with very little overhead for authors or publishers.
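The fork step itself is easy to automate. GitHub’s REST API exposes forking via POST /repos/{owner}/{repo}/forks, with an "organization" field to land the fork in a journal account. A rough sketch (the organization and repo names are placeholders, and only the request construction is shown):

```python
# Sketch of a fork-on-publication step: POST /repos/{owner}/{repo}/forks
# with an "organization" field forks the repo into an organization account.
# "JournalArchive" and the repo names are illustrative placeholders.
import json
import urllib.request

def build_fork_request(owner, repo, organization):
    """Construct (but do not send) a request forking owner/repo into an org."""
    url = "https://api.github.com/repos/{}/{}/forks".format(owner, repo)
    payload = json.dumps({"organization": organization}).encode("utf-8")
    req = urllib.request.Request(url, data=payload, method="POST")
    req.add_header("Content-Type", "application/json")
    return req

req = build_fork_request("someuser", "published-tool", "JournalArchive")
print(req.full_url)
# https://api.github.com/repos/someuser/published-tool/forks
```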

So what about the many other journals that have no such digital preservation policy but currently publish papers with bioinformatics code in github? Well, as a stopgap measure until other journals get on board with similar policies (PLOS & BMC, please lead the way!), I’ve taken the initiative to create a github organization called BioinformaticsArchive to serve this function. Currently, I’ve forked code for all but one of the 64 publications with github URLs in their PubMed record. One of the scary/interesting things to observe from this endeavor is just how fragile the current situation is. Of the 63 repositories I’ve forked, about 50% (n=31) had not been previously forked by any other user on github and could have been easily deleted, with consequent loss to the scientific community.
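For anyone who wants to repeat or extend this harvest, the URL-mining step is straightforward. A minimal sketch (the regex and the sample abstract are illustrative, not the exact query used against PubMed):

```python
# Pull GitHub repository URLs out of abstract or full text with a regex,
# as one might when building an archive like BioinformaticsArchive.
import re

GITHUB_URL = re.compile(r"https?://github\.com/[\w.-]+/[\w.-]+")

def find_github_repos(text):
    """Return the GitHub repository URLs mentioned in a block of text."""
    return GITHUB_URL.findall(text)

abstract = ("Source code is freely available at "
            "https://github.com/someuser/sometool under the MIT license.")
print(find_github_repos(abstract))
# ['https://github.com/someuser/sometool']
```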

I am aware (thanks to Marc Robinson-Rechavi) that there are many more published github repositories in the full text of articles (including two from our lab), which I will endeavor to dig out and add to this archive asap. If anyone else would like to help out with the endeavor, or knows of published repositories that should be included, send me an email or tweet and I’ll add them to the archive. Comments on how to improve the current state of preservation of published bioinformatics code on github, and on what can be learned from Computers & Geosciences’ new model policy, are most welcome!


25 Responses to “On the Preservation of Published Bioinformatics Code on Github”


  1. Peter November 8, 2012 at 11:37 am

    The Computers & Geosciences policy to take a github fork of the repository is very neat, but that will only capture the code-base at the time of the fork (i.e. roughly speaking, the code-base at the point of the associated paper’s publication – an important snapshot).

    I do have a concern, however: if the original repository is then updated (e.g. bug fixes) and then accidentally or otherwise deleted, the journal’s fork would never have contained those bug fixes (unless they have automated tracking of the upstream repository?).

    • caseybergman November 8, 2012 at 2:51 pm

      You are absolutely right that in its simplest form the fork-at-time-of-publication strategy will only provide one snapshot of the codebase. However, I’d argue that preserving some code is better than risking the loss of everything. You could also look at this basic strategy as equivalent to archiving the code as supplemental files attached to the publication on the journal website, which likewise cannot be updated. Moreover, deletion is less likely to affect projects that are being actively worked on, and having one fork should make it easier for people to find the latest version in the upstream repository. Ideally, however, enlightened journals would implement some sort of periodic update system along the lines you suggest.

  2. Christopher Hogue November 8, 2012 at 11:55 am

    Thank-you for bringing up an important issue. Source code preservation policies at SourceForge have been rather important for preserving open-source licensed code. In my case SourceForge preserved the SLRI toolkit my group made in Toronto. As the intellectual property underlying the code was sold to Thomson Reuters in 2007, my host institution and the dealmakers pressured me to delete the repository. SourceForge policy kept it on the site. In retrospect we were given some bad advice about some impending downfall of SourceForge, and I regret not ignoring that advice and putting all of our code on SourceForge, as was originally intended. The aftermath of all this is that of everything my group did under the guise of open source, only about 30% is preserved and online; the rest is buried in an intellectual property shoebox at Thomson Reuters.
    Host institutions have a lot of power of ownership over your intellectual property. Even if you win the right to release work as open source, the GitHub delete policy means that your host institution can override this and require you to take your code out of circulation. GitHub is great, but for the sake of preservation SourceForge has the right policy, protecting your decision to go open source from later manipulation by your host institution when the code becomes “valuable”.

    • Kai Blin November 9, 2012 at 9:11 am

      If it’s open source, can’t you just fork the software if your host institution decides to take down the work? The right to create forks is a central aspect of open source.

      • Peter November 9, 2012 at 10:01 am

        Certainly, once released as Open Source, yes, anyone can fork/copy the code. The original authors doing so in this situation would probably have to disobey their employer, which puts them in a very awkward position.

        The problem is that a member of the public can’t take a fork/copy if the public repository has already been deleted – and right now, if the code is on github, the owner can easily delete it (either voluntarily or under pressure from their employer).

  3. Titus Brown November 8, 2012 at 12:55 pm

    Could you grab ged-lab/khmer and screed while you’re at it?

  4. Mick Watson (@BioMickWatson) November 8, 2012 at 1:39 pm

    A few people have said this throughout history, but we are essentially all prostitutes – whilst some sell their bodies, others sell their minds. That’s what we do, we take a salary, and in return all of our ideas and thoughts belong to our employers. This is nothing new or controversial.

  5. Chris Maloney November 8, 2012 at 3:49 pm

    It might also be a good idea to augment this with a simple act of cloning each of these repositories to a machine under your own control. If GitHub were to go belly-up tomorrow, then even these forked repos under your GitHub account would be gone.

  6. Torsten Seemann November 9, 2012 at 11:55 am

    What’s worth more – a software paper with no citations or the software github repository with no forks…?

  7. Andrew Perry November 11, 2012 at 2:24 am

    This is a great initiative ! Once it’s published and we’ve made the time-of-publication snapshot, I hope that github.com/boscoh/inmembrane can be forked too.

    To mitigate the (slim) possibility of Github going belly up overnight, I’d suggest making a mirrored version on something like Bitbucket – they have free unlimited academic accounts, and a simple mechanism to clone projects directly from Github:

    https://confluence.atlassian.com/display/BITBUCKET/Importing+code+from+an+existing+project#Importingcodefromanexistingproject-Importfromahostingsiteorprojectusingtheimporter

  8. Karmel March 26, 2013 at 10:35 pm

    This is a great idea, and can be automated with some buy-in from the authors/repo-maintainers. All you would need is an agreed-upon tag included in the repo name or the README.md; then crawl github and fork all the repos with #published-science or #published-pubmed-repo or the equivalent. I don’t see why anyone would resist that, given that the code is already public and published – so all we really need is a good tag and some publicity.
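    Such a crawl could start from GitHub’s repository search API. A sketch of just the query construction, assuming a hypothetical marker tag and GitHub’s `in:` search qualifiers; everything after the query would be the forking step shown earlier:

```python
# Build a GitHub search-API URL that looks for a marker tag (here the
# hypothetical "published-science") in repository names and READMEs.
import urllib.parse

def search_url(tag):
    """Build a GitHub search-API URL for repositories carrying a marker tag."""
    query = urllib.parse.urlencode({"q": tag + " in:readme,name"})
    return "https://api.github.com/search/repositories?" + query

print(search_url("published-science"))
```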

  9. Carl Boettiger (@cboettig) March 27, 2013 at 12:49 am

    This is great, thanks for raising an important issue and launching a clever initiative to address it.

    I think a more robust, if slightly more cumbersome, approach would be to place a copy of the git repository on Figshare. Unlike Github, Figshare data is archived by the CLOCKSS.org network, making it substantially less likely to be lost if the original provider (figshare, github) disappears from the face of the planet. Figshare is already becoming a popular repository for other supplemental material, and supports multiple versions of an object, ORCID integration, etc. Their API could help automate the process of archiving git repositories, though it is not as slick as forking or as easy to keep synchronized with the original.

    • Phillip Lord March 28, 2013 at 12:50 pm

      Once it’s on Figshare, though, a git repository is a dead entity: not clonable, forkable or accessible by git. In other words, you are archiving a different thing.

      The CLOCKSS backup is a nice idea, but it does raise the question of why our research libraries will not provide this sort of backup facility directly to scientists, and in a way that is sympathetic to the file format.
      A much better complement, I think, would be to maintain a master list of all the git URLs; this would make cloning the entire lot very easy for anyone who cared. In short, it would be CLOCKSS, but using git.

      A second point is worth mentioning. Github does not necessarily maintain good license metadata. So, while forking on github is legal (the owners of public repos agree to this in github’s terms and conditions, even if they haven’t read them), forking elsewhere, including on figshare, might well constitute copyright violation. For an example, consider broadgsa/gatk-protected, which would fall into this category.

      • Karthik Ram March 28, 2013 at 6:50 pm

        Once it’s on Figshare, though, a git repository is a dead entity: not clonable, forkable or accessible by git. In other words, you are archiving a different thing.

        Yes, agreed. The only advantage here is for the CLOCKSS backup in case the original git repository disappears. If there is a viable and open source alternative to GitHub (I know of a few but haven’t tried them myself), it would be great if research libraries and journals forked copies after publication (with appropriate git tags) to their servers. Good point re: licensing.

  10. Robert Lanfear (@RobLanfear) March 27, 2013 at 10:11 am

    Great idea.

    Please grab PartitionFinder:

    https://github.com/brettc/partitionfinder/

    I like the figshare point too. It seems to me that a useful solution is to tag released versions and add each one to FigShare, so that anyone can then (in principle) replicate anyone else’s analysis.

    Can one archive a zip file of a tagged git repo in Figshare? That might make everyone’s life a bit easier.

  11. Karthik Ram March 27, 2013 at 4:09 pm

    Yes, one can archive a git repo in figshare. I just recently wrote a paper on why scientists need to use git to increase reproducibility. I wrote the paper in a GitHub repo, deposited the zipped repo in figshare and included the doi in the paper itself (you can reserve a doi ahead of time). So yes, it is possible.

    • Phillip Lord March 28, 2013 at 12:52 pm

      I am still unable to see how this makes any sense at all. I mean, are DOIs really so magic?

      • Karthik Ram March 28, 2013 at 6:45 pm

        Having a unique and permanent identifier for an archived digital object (in this case a git repo associated with a publication) makes no sense to you?

      • Phillip Lord April 2, 2013 at 9:02 am

        DOIs are not unique. It is straightforward to award multiple DOIs for the same object, and there is no restriction within the DOI system which prevents this. CrossRef DOIs do have a “we will fine you if you do this” post-hoc system for preventing this. But, then, Figshare is not using CrossRef DOIs. And even with CrossRef this requires a definition of equality between two papers. So, you can have multiple identifiers for a single entity.

        Obtaining a unique ID in the other sense; yes, in some ways this has an advantage, although you are totally dependent on figshare maintaining their DOIs. They don’t have to, and can break the link any time they choose. Or, indeed, they can point it at anything they like. You are not paying them, so you have no control over this.

        Finally, their supposed utility works on the basis that people actually use them. Given that it’s easier to cut and paste the URL from the title bar, why would they not do this?

        Where I need stable identifiers – for example, for an ontology – I use purls. These work perfectly well; here I avoid the last problem because ontology URLs are not used in a browser.


  1. Archiving of bioinformatics software | m's Bioinformatics Trackback on August 16, 2013 at 2:22 pm
  2. Archiving of bioinformatics software | Bioinfo Toolbox Trackback on September 14, 2013 at 5:14 am
