Incentivising open data & reproducible research through pre-publication private access to NGS data at EBI

Yesterday Ewan Birney posted a series of tweets expressing surprise that more people don’t take advantage of ENA’s programmatic access to submit and store next-generation sequencing (NGS) data at EBI, which I tried to respond to in broken Twitter English. This post attempts to clarify how I think ENA’s system could be improved in ways that would benefit both data archiving and reproducible research, and possibly increase uptake and sustainability of the service.

I’ve been a heavy consumer of NGS data from EBI for a couple of years, mainly thanks to their plain-vanilla fastq.gz downloads and clean REST interface for extracting NGS metadata. But I’ve only recently gone through the process of submitting NGS data to ENA myself, first using their web portal and then taking advantage of REST-based programmatic access. Aside from the issue of how best to transfer many big files to EBI in an automatic way (which I’ve blogged about here), I’ve been quite impressed by how well-documented and efficient ENA’s NGS submission process is. For those who’ve had bad experiences submitting to SRA, I agree with Ewan that ENA provides a great service, and I’d suggest giving EBI a try.

In brief, the current ENA submission process entails:

  1. transfer of the user’s NGS data to EBI’s “dropbox”, which is basically a private storage area on EBI’s servers that requires user/password authentication (done by user)
  2. creation and submission of metadata files with information about runs and samples (done by user)
  3. validation of data/metadata and creation of accession numbers for the projects/experiments/samples/runs (done by EBI)
  4. conversion of submitted NGS data to an EBI-formatted version, giving new IDs to each read and connecting appropriate metadata to each NGS data file (done by EBI)
  5. public release of accession-number-based annotated data (done by EBI on the user’s release date or after publication)
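
As a small illustration of the user-side steps (1 and 2), here is a minimal sketch of how one might record a checksum for each file before transferring it, so that the copy in the dropbox can later be checked against the local original. The function names and the `*.fastq.gz` pattern are my own assumptions for illustration, and the use of MD5 is likewise an assumption here; this is not ENA’s tooling, just one way to do it with Python’s standard library.

```python
import hashlib
from pathlib import Path


def md5sum(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to cope with big files."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_manifest(directory, pattern="*.fastq.gz"):
    """Map each matching file name in `directory` to its MD5 digest.

    The directory layout and file pattern are hypothetical; adjust to your data.
    """
    return {p.name: md5sum(p) for p in sorted(Path(directory).glob(pattern))}
```

Calling `checksum_manifest("my_run_directory")` on a (hypothetical) local directory gives a dictionary that can be written alongside the metadata files and compared after upload.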

Where I see the biggest room for improvement is in the “hupped” phase, when data is submitted but private. During this phase, I can store data at EBI privately for up to two years, and thus keep a remote back-up of my data for free, which is great, but only in its original submitted format. I can’t, however, access the exact version of my data that will ultimately become public, i.e. using the REST interface with what will be the published accession numbers on data with converted read IDs. For these reasons, I can’t write pipelines that use the exact data that will be referenced in a paper, and thus I cannot fully verify that the results I publish can be reproduced by someone else. Additionally, I can’t “proof” what my submission looks like, and thus I have to wait until the submission is live to make any corrections to my data/metadata if they haven’t been converted as intended. As a workaround, I’ve been releasing data pre-publication, doing data checks and programming around the live data to ensure that my pipelines and results are reproducible. I suspect not all labs would be comfortable doing this, mainly for fear of getting scooped using their own data.

In experiencing ENA’s data submission system from the twin viewpoints of a data producer and consumer, I’ve had a few thoughts about how to improve the system that could also address the issue of wider community uptake. The first change I would suggest as a simple improvement to EBI’s current service would be to allow REST/browser access to a private, live version of formatted NGS data/metadata during the “hupped” phase with simple HTTP-based password authentication. This would allow users to submit and store their data privately, but also to have access to the “final” product prior to release. This small change could have many benefits, including:

  • incentivising submission of NGS data early in the life-cycle of a project rather than as an after-thought during publication,
  • reducing the risk of local data loss or failure to submit NGS data at the time of publication,
  • allowing distributed project partners to access big data files from a single, high-bandwidth, secure location,
  • allowing quality checks on final version of data/metadata prior to publication/data release, and
  • allowing analysis pipelines to use the final archived version of data/metadata, ensuring complete reproducibility and unified integration with other public datasets.
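
To make the first suggestion concrete, here is a minimal sketch of what pipeline code might look like if such password-protected access existed. Everything specific in it is hypothetical: the base URL is a placeholder on example.org, the accession is a made-up placeholder, and the function names are mine. No such private endpoint exists at EBI as far as I know. The only real mechanism shown is standard HTTP Basic authentication via Python’s standard library.

```python
import base64
import urllib.request

# Hypothetical placeholder: NOT a real EBI/ENA endpoint.
PRIVATE_BASE_URL = "https://private-archive.example.org/data"


def build_private_request(accession, user, password, base_url=PRIVATE_BASE_URL):
    """Build a request for a not-yet-public accession using HTTP Basic auth."""
    # Basic auth sends base64("user:password") in the Authorization header.
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(f"{base_url}/{accession}")
    request.add_header("Authorization", f"Basic {token}")
    return request


def fetch_private(accession, user, password):
    """Download the private record; works only against a server that exists."""
    request = build_private_request(accession, user, password)
    with urllib.request.urlopen(request) as response:
        return response.read()
```

The appeal of this design is that, on release, a pipeline would only need to drop the credentials and point at the public URL for the same accession, so the code that was tested pre-publication is the code that readers can run.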

A second change, which I suspect is more difficult to implement, would be to allow users to pay to store their data for longer than a fixed period of time. I’d say two years is close to the minimum time it takes to get from data coming off a sequencer to a published paper. Thus, I suspect there are many users who are reluctant to submit and store data at ENA prior to paper submission, since their data might be made public before they are ready to share. But if users could pay a modest monthly/quarterly fee to store their data privately past the free period up until publication, this might encourage them to deposit early and gain the benefits of storing/checking/using the live data, without fear that their data will be released earlier than they would like. This change could also lead to a new, low-risk funding stream for EBI, since they would only be charging for extended private access to data that is already on disk.
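
As a back-of-the-envelope sketch of how such a fee might accumulate, assume the two-year free period mentioned above and a flat monthly fee. The fee value and the function name are purely illustrative assumptions of mine, not anything EBI has proposed.

```python
def private_storage_cost(months_private, free_months=24, monthly_fee=10.0):
    """Total fee for keeping data private for `months_private` months.

    `free_months` reflects the two-year free period; `monthly_fee` is a
    hypothetical flat rate in arbitrary currency units.
    """
    # Only months beyond the free period are billed.
    billable_months = max(0, months_private - free_months)
    return billable_months * monthly_fee
```

Under these assumed numbers, data kept private for 18 months costs nothing, while 30 months costs six months of fees, and the total keeps growing for every further month the data stays private.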

The extended pay-for-privacy model works well for both the user and the community, and could ultimately encourage more early open data release. Paying users will benefit from replicated, offsite storage in publication-ready formats without fear of getting scooped. This will come as a great benefit to many users who are currently struggling with local NGS data storage issues. Reciprocally, the community benefits because contributors who want to pay for extended private data end up supporting common infrastructure disproportionately more than those who release data publicly early. And since it becomes increasingly costly to keep your data private, there is ultimately an incentive to make your data public. This scheme would especially benefit preservation of the large amounts of usable data that go stale or never see the light of day because of delays or failures to write up and thus never get submitted to ENA. And of course, once published, private data would be made openly available immediately, all in a well-formatted and curated manner that the community can benefit from. What’s not to like?

Thoughts on whether, or how, these half-baked ideas could be turned into reality are much appreciated in the comments below.