Incentivising open data & reproducible research through pre-publication private access to NGS data at EBI

Yesterday Ewan Birney posted a series of tweets expressing surprise that more people don’t take advantage of ENA’s programmatic access to submit and store next-generation sequencing (NGS) data at EBI, which I tried to respond to in broken Twitter English. This post attempts to clarify how I think ENA’s system could be improved in ways that would benefit both data archiving and reproducible research, and possibly increase uptake and sustainability of the service.

I’ve been a heavy consumer of NGS data from EBI for a couple of years, mainly thanks to their plain-vanilla fastq.gz downloads and clean REST interface for extracting NGS metadata. But I’ve only just recently gone through the process of submitting NGS data to ENA myself, first using their web portal and more recently taking advantage of REST-based programmatic access. Aside from the issue of how best to transfer many big files to EBI in an automatic way (which I’ve blogged about here), I’ve been quite impressed by how well-documented and efficient ENA’s NGS submission process is. For those who’ve had bad experiences submitting to SRA, I agree with Ewan that ENA provides a great service, and I’d suggest giving EBI a try.

In brief, the current ENA submission process entails:

  1. transfer of the user’s NGS data to EBI’s “dropbox”, which is basically a private storage area on EBI’s servers that requires user/password authentication (done by user)
  2. creation and submission of metadata files with information about runs and samples (done by user)
  3. validation of data/metadata and creation of accession numbers for the projects/experiments/samples/runs (done by EBI)
  4. conversion of submitted NGS data to an EBI-formatted version, giving new IDs to each read and connecting appropriate metadata to each NGS data file (done by EBI)
  5. public release of accession-number-based annotated data (done by EBI on the user’s release date or after publication)

Where I see the biggest room for improvement is in the “hupped” phase, when data is submitted but private. During this phase, I can store data at EBI privately for up to two years, and thus keep a remote back-up of my data for free, which is great, but only in its original submitted format. I can’t, however, access the exact version of my data that will ultimately become public, i.e. using the REST interface with what will be the published accession numbers on data with converted read IDs. For these reasons, I can’t write pipelines that use the exact data that will be referenced in a paper, and thus I cannot fully verify that the results I publish can be reproduced by someone else. Additionally, I can’t “proof” what my submission looks like, and thus I have to wait until the submission is live to make any corrections to my data/metadata if they haven’t been converted as intended. As a workaround, I’ve been releasing data pre-publication, doing data checks and programming around the live data to ensure that my pipelines and results are reproducible. I suspect not all labs would be comfortable doing this, mainly for fear of getting scooped using their own data.

In experiencing ENA’s data submission system from the twin viewpoints of a data producer and consumer, I’ve had a few thoughts about how to improve the system that could also address the issue of wider community uptake. The first change I would suggest as a simple improvement to EBI’s current service would be to allow REST/browser access to a private, live version of formatted NGS data/metadata during the “hupped” phase with simple HTTP-based password authentication.  This would allow users to submit and store their data privately, but also to have access to the “final” product prior to release. This small change could have many benefits, including:

  • incentivising submission of NGS data early in the life-cycle of a project rather than as an after-thought during publication,
  • reducing the risk of local data loss or failure to submit NGS data at the time of publication,
  • allowing distributed project partners to access big data files from a single, high-bandwidth, secure location,
  • allowing quality checks on the final version of data/metadata prior to publication/data release, and
  • allowing analysis pipelines to use the final archived version of data/metadata, ensuring complete reproducibility and unified integration with other public datasets.
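
To make the first suggestion concrete, here is a minimal sketch of what a password-protected request for pre-release data could look like from the user’s side. ENA does not offer this service, so the base URL, the path layout, the accession and the credentials below are all placeholders I have made up for illustration; only the HTTP Basic authentication header itself is standard. The sketch builds the request without sending it.

```python
import base64
import urllib.request

# HYPOTHETICAL endpoint: this private service does not exist.
# The host and path layout are assumptions made for illustration only.
PRIVATE_BASE_URL = "https://private.example.org/ena/data/view"


def build_private_request(accession, user, password):
    """Build (but do not send) a GET request using HTTP Basic authentication."""
    # Basic auth is the base64 encoding of "user:password".
    credentials = f"{user}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    request = urllib.request.Request(f"{PRIVATE_BASE_URL}/{accession}")
    request.add_header("Authorization", f"Basic {token}")
    return request


# Placeholder accession and credentials, not real ones.
request = build_private_request("ERR000000", "submitter", "secret")
print(request.get_method(), request.full_url)
```

Because the accession in the URL would be the same one that later becomes public, a pipeline written against the private endpoint would only need its base URL changed, and the authentication header dropped, once the data is released.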

A second change, which I suspect is more difficult to implement, would be to allow users to pay to store their data for longer than a fixed period of time. I’d say two years is around the lower time limit from when data comes off a sequencer to a paper being published. Thus, I suspect there are many users who are reluctant to submit and store data at ENA prior to paper submission, since their data might be made public before they are ready to share. But if users could pay a modest monthly/quarterly fee to store their data privately past the free period up until publication, this might encourage them to deposit early and gain the benefits of storing/checking/using the live data, without fear that their data will be released earlier than they would like. This change could also lead to a new, low-risk funding stream for EBI, since they would only be charging for more time of private access to data that is already on disk.

The extended pay-for-privacy model works well for both the user and the community, and could ultimately encourage more early open data release. Paying users will benefit from replicated, offsite storage in publication-ready formats without fear of getting scooped. This will come as a great benefit to many users who are currently struggling with local NGS data storage issues. Reciprocally, the community benefits because contributors who want to pay for extended private data end up supporting common infrastructure disproportionately more than those who release data publicly early. And since it becomes increasingly costly to keep your data private, there is ultimately an incentive to make your data public. This scheme would especially benefit preservation of the large amounts of usable data that go stale or never see the light of day because of delays or failures to write up and thus never get submitted to ENA. And of course, once published, private data would be made openly available immediately, all in a well-formatted and curated manner that the community can benefit from. What’s not to like?

Thoughts on if, or how, these half-baked ideas could be turned into reality are much appreciated in the comments below.


Battling Administrivia Using an Intramural Question & Answer Forum

The life of a modern academic involves juggling many disparate tasks, and like a computer using more physical memory than it has, swapping between various tasks leads to inefficiency and low performance in our jobs. Personally, the time fragmentation and friction induced by transitioning from task to task seems to be one of the main sources of stress in my work life.  The main reason for this is that many daily tasks on my to-do list are essential but fiddly and time-consuming administrivia (placing orders, filling in forms, entering marks into a database) that prevent me from getting to the things that I enjoy about being an academic: doing research, interacting with students, reading papers, etc.

I would go so far as to say that the mismatch between the desires of most academics and the reality of their jobs is the main source of academic “burnout” and low morale in what otherwise should be an awesome profession. I would also venture that administrivia is one of the major sources of the long hours we endure, since after wading through the “chaff”, we will (dammit!) put in the time on nights and weekends for the things we are most passionate about to sustain our souls. And based on the frequency of sentiments relating to this topic flowing through my Twitter feed, I’d say the negative impact of administrivia is a pervasive problem in modern academic life, not restricted to any one institute.

While it is tempting to propose ameliorating the administrivia problem by simply eliminating bureaucracy, the growth of the administrative sector in higher education makes this solution a virtual impossibility. I have ultimately become resigned to the fact that the fundamentally inefficient nature of university bureaucracy cannot be structurally reformed, and have begun to seek other solutions to make my work life better. In doing so, I believe I’ve hit on a simple solution to the administrivia problem that I’m hoping might help others as well. In fact, I’m now convinced this solution is simple and powerful enough to actually be effective.

Accepting that it cannot be fully eliminated, my view is that the key to reducing the time and morale burden of administrivia is to realize that most routine tasks in University life are just protocols that require some amount of tacit knowledge about policies or procedures. Thus, all that is needed to reduce the negative impact of administrivia to its lowest possible level is to develop a system whereby accurate and relevant protocols can be placed at one’s fingertips so that they can be completed as fast as possible. The problem is that such protocols either don’t exist, don’t exist in a written form, or exist as scattered documents across various filesystems and offices that you have to expend substantial time finding. So how do we develop such protocols without generating more bureaucracy and exacerbating the problem we are attempting to solve?

My source of inspiration for ameliorating administrivia with minimal overhead comes from the positive experiences I have had using online Question and Answer (Q & A) forums based on the Stack Exchange model (principally the BioStars site for answering questions about bioinformatics). For those not familiar with such systems, the Q & A model popularized by the Stack Exchange platform (and its clones) is a system that allows questions to be asked and answers to be voted on, moderated, edited and commented on in a very intuitive and user-friendly manner. For some reason I am not able to fully explain, the engineering behind the Q & A model naturally facilitates both knowledge exchange and community building in a way that is on the whole extremely positive, and seems to prevent the worst aspects of human nature commonly found on older internet forums and commenting systems.

So here is my proposal for battling the impact of academic administrivia: implement an intramural, University-specific Q & A forum for academic and administrative staff to pose and answer each other’s practical questions, converting tacit knowledge stored in people’s heads, inboxes and intranets into a single knowledge-bank that can be efficiently used and re-used by others who have the same queries. The need for an “intramural” solution, and the reason this strategy cannot be applied globally as it has been for Linux administration, Poker or Biblical Hermeneutics, is that Universities (for better or worse) have their own local policies and procedures that can’t be easily shared or benefit from general worldwide input.

We have been piloting the use of the Open Source Question Answer (OSQA) platform (a clone of Stack Exchange) among a subset of our faculty for about a year, with good uptake and virtually unanimous endorsement from everyone who has used it. We currently enforce a real-name policy for users, have limited the system to questions of procedure only, and have encouraged users to answer their own questions after solving burdensome tasks. To make things easy to administer technically, we are using an out-of-the-box virtual machine of OSQA provided by Bitnami. The anonymized screenshot below gives a flavor of the banal, yet time-consuming queries that arise repeatedly in our institution that such a system makes easier to accomplish. I trust colleagues at other institutions will find similar tasks frustratingly familiar.

[Screenshot: anonymized questions from our pilot Q & A forum]

The main reason I am posting this idea now is that I am scheduled to give a demo and presentation to my Dean and management team this week to propose rolling this system out to a wider audience. In preparation for this pitch, I’ve been trying to assemble a list of pros and cons that I am sure is incomplete and would benefit from the input of other people familiar with how Universities and Q & A platforms work.

The pros of an intramural Q & A platform for battling administrivia I’ve come up with so far include:

  • Increasing efficiency, leading to higher productivity for both academic and administrative staff;
  • Reducing the sense of frustration about bureaucratic tasks, leading to higher morale;
  • Improving sense of empowerment and community among academic and administrative staff;
  • Providing better documentation of procedures and policies;
  • Serving as an “aide memoire”;
  • Aiding the success of junior academic staff;
  • Ameliorating the effects of administrative turnover;
  • Providing a platform for people who may not speak up in staff meetings to contribute;
  • Allowing “best practice” to emerge through crowd-sourcing;
  • Identifying common problems that should be prioritized for improvement;
  • Identifying like-minded problem solvers in a big institution;
  • Integrating easily around existing IT platforms;
  • Scaling to any level of deployment (lab group, department, faculty, school, etc.);
  • Allowing information to be accessible 24/7, even when administrative offices are closed (H/T @jdenavascues).

I confess struggling to find true cons, but these might include (rejoinders in parentheses):

  • Security risks (can be solved with proper administration and authentication);
  • Inappropriate content (real name policy should minimize, can be solved with proper moderation);
  • Answers might be “impolitic” (real name policy should minimize, can be solved with proper moderation; H/T @DrLabRatOry);
  • Time wasting (unlikely since whole point is to enhance productivity);
  • Lack of uptake (even if the 90-9-1 rule applies, it is an improvement on the status quo);
  • Perceived as threat to administrative staff (far from it, this approach benefits administrative staff as much as academic staff);
  • Information could become stale (can be solved with proper moderation and periodic updating).

I’d be very interested to get feedback from others about this general strategy (especially by Tues PM 17 Sep 2013), thoughts on related efforts, or how intramural Q & A platforms could be used in other ways in an academic setting beyond battling administrivia in the comments below.
