Incentivising open data & reproducible research through pre-publication private access to NGS data at EBI

Yesterday Ewan Birney posted a series of tweets expressing surprise that more people don’t take advantage of ENA’s programmatic access for submitting and storing next-generation sequencing (NGS) data at EBI, which I tried to respond to in broken Twitter English. This post attempts to clarify how I think ENA’s system could be improved in ways that would benefit both data archiving and reproducible research, and possibly increase uptake and sustainability of the service.

I’ve been a heavy consumer of NGS data from EBI for a couple of years, mainly thanks to their plain-vanilla fastq.gz downloads and clean REST interface for extracting NGS metadata. But only recently have I gone through the process of submitting NGS data to ENA myself, first using their web portal and more recently taking advantage of REST-based programmatic access. Aside from the issue of how best to transfer many big files to EBI automatically (which I’ve blogged about here), I’ve been quite impressed by how well-documented and efficient ENA’s NGS submission process is. For those who’ve had bad experiences submitting to SRA, I agree with Ewan that ENA provides a great service, and I’d suggest giving EBI a try.
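To give a concrete flavour of the consumer side, here is a minimal sketch of pulling run-level metadata (including fastq.gz FTP paths) for a public study via ENA’s filereport REST endpoint. The URL, parameters and field names are my assumptions based on ENA’s documentation, so check the current docs before relying on them.

```python
# Minimal sketch: fetch run-level metadata for a public ENA study via the
# filereport REST endpoint. The URL and field names are assumptions based
# on ENA documentation and may need adjusting.
import requests

ENA_FILEREPORT = "https://www.ebi.ac.uk/ena/portal/api/filereport"

def ena_run_table(study_accession):
    """Return a list of dicts, one per sequencing run in the study."""
    params = {
        "accession": study_accession,
        "result": "read_run",
        "fields": "run_accession,sample_accession,fastq_ftp,fastq_md5",
        "format": "tsv",
    }
    response = requests.get(ENA_FILEREPORT, params=params)
    response.raise_for_status()
    lines = response.text.strip().split("\n")
    header = lines[0].split("\t")
    return [dict(zip(header, row.split("\t"))) for row in lines[1:]]

# Example with a placeholder accession:
# for run in ena_run_table("PRJEB0000"):
#     print(run["run_accession"], run["fastq_ftp"])
```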

In brief, the current ENA submission process entails:

  1. transfer of the user’s NGS data to EBI’s “dropbox”, which is essentially a private storage area on EBI’s servers that requires username/password authentication (done by the user);
  2. creation and submission of metadata files with information about runs and samples (done by the user; a rough sketch of this step appears after the list);
  3. validation of the data/metadata and creation of accession numbers for the projects/experiments/samples/runs (done by EBI);
  4. conversion of the submitted NGS data to an EBI-formatted version, giving new IDs to each read and connecting the appropriate metadata to each NGS data file (done by EBI); and
  5. public release of the accession-numbered, annotated data (done by EBI on the user’s release date or after publication).
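For anyone curious what the programmatic route for step 2 looks like in practice, here is a rough sketch in Python. The endpoint URL and multipart form-field names are assumptions drawn from ENA’s programmatic submission documentation, and the XML file names are placeholders, so treat this as illustrative rather than definitive.

```python
# Rough sketch of a programmatic ENA metadata submission (step 2 above).
# The endpoint URL and form-field names are assumptions based on ENA's
# programmatic submission docs; the XML file names are placeholders.
# Data files themselves are uploaded separately (step 1).
import requests

SUBMIT_URL = "https://www.ebi.ac.uk/ena/submit/drop-box/submit/"  # assumed

def submit_metadata(username, password, xml_paths):
    """POST submission/experiment/run/sample XMLs to the ENA dropbox.

    xml_paths maps a form-field name (e.g. 'SUBMISSION', 'RUN') to a
    local XML file path. Returns the receipt XML, which contains the
    assigned accession numbers.
    """
    files = {field: open(path, "rb") for field, path in xml_paths.items()}
    try:
        response = requests.post(SUBMIT_URL, files=files,
                                 auth=(username, password))
        response.raise_for_status()
        return response.text
    finally:
        for handle in files.values():
            handle.close()

# Example (placeholder credentials and file names):
# receipt = submit_metadata("Webin-XXXX", "secret",
#                           {"SUBMISSION": "submission.xml",
#                            "EXPERIMENT": "experiment.xml",
#                            "RUN": "run.xml",
#                            "SAMPLE": "sample.xml"})
```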

Where I see the biggest room for improvement is in the “hupped” phase, when data has been submitted but remains private. During this phase, I can store data at EBI privately for up to two years, and thus keep a remote back-up of my data for free, which is great, but only in its original submitted format. I can’t, however, access the exact version of my data that will ultimately become public, i.e. use the REST interface with what will be the published accession numbers on data with converted read IDs. For these reasons, I can’t write pipelines that use the exact data that will be referenced in a paper, and thus I cannot fully verify that the results I publish can be reproduced by someone else. Additionally, I can’t “proof” what my submission will look like, and thus I have to wait until the submission is live to make any corrections to my data/metadata if they haven’t been converted as intended. As a workaround, I’ve been releasing data pre-publication, doing data checks and programming around the live data to ensure that my pipelines and results are reproducible. I suspect not all labs would be comfortable doing this, mainly for fear of getting scooped with their own data.

In experiencing ENA’s data submission system from the twin viewpoints of a data producer and consumer, I’ve had a few thoughts about how to improve the system that could also address the issue of wider community uptake. The first change I would suggest, a simple improvement to EBI’s current service, would be to allow REST/browser access to a private, live version of the formatted NGS data/metadata during the “hupped” phase, with simple HTTP-based password authentication (a purely hypothetical sketch of what this might look like from the user’s side follows the list below). This would allow users to submit and store their data privately, but also to have access to the “final” product prior to release. This small change could have many benefits, including:

  • incentivising submission of NGS data early in the life-cycle of a project rather than as an after-thought during publication,
  • reducing the risk of local data loss or failure to submit NGS data at the time of publication,
  • allowing distributed project partners to access big data files from a single, high-bandwidth, secure location,
  • allowing quality checks on final version of data/metadata prior to publication/data release, and
  • allowing analysis pipelines to use the final archived version of data/metadata, ensuring complete reproducibility and unified integration with other public datasets.
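To make the proposal concrete, the sketch below shows what user-side access to “hupped” data might look like. It is purely hypothetical: the endpoint, parameters and authentication scheme are all imagined for illustration and do not correspond to a real ENA service.

```python
# Purely hypothetical sketch of authenticated access to private ("hupped")
# data, as proposed above. The endpoint, parameters and authentication
# scheme are imagined for illustration; no such ENA service exists yet.
import requests

PRIVATE_FILEREPORT = "https://www.ebi.ac.uk/ena/private/api/filereport"  # imagined

def fetch_private_run_table(study_accession, webin_user, webin_password):
    """Fetch the formatted, pre-release run table for a private study."""
    params = {"accession": study_accession, "result": "read_run"}
    response = requests.get(PRIVATE_FILEREPORT, params=params,
                            auth=(webin_user, webin_password))
    response.raise_for_status()
    return response.text

# A pipeline could then be written against the same accessions and file
# paths that will eventually go public, e.g.:
# table = fetch_private_run_table("PRJEB0000", "Webin-XXXX", "secret")
```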

A second change, which I suspect is more difficult to implement, would be to allow users to pay to store their data privately for longer than a fixed period of time. I’d say two years is around the lower limit on the time from data coming off a sequencer to a paper being published. Thus, I suspect there are many users who are reluctant to submit and store data at ENA prior to paper submission, since their data might be made public before they are ready to share it. But if users could pay a modest monthly/quarterly fee to store their data privately past the free period, up until publication, this might encourage them to deposit early and gain the benefits of storing/checking/using the live data, without fear that their data will be released earlier than they would like. This change could also create a new, low-risk funding stream for EBI, since they would only be charging for extended private access to data that is already on disk.

The extended pay-for-privacy model works well for both the user and the community, and could ultimately encourage more early open data release. Paying users would benefit from replicated, offsite storage in publication-ready formats without fear of getting scooped, which would be a great help to the many users currently struggling with local NGS data storage issues. Reciprocally, the community benefits because contributors who pay for extended privacy end up supporting common infrastructure disproportionately more than those who release their data publicly early. And since it becomes increasingly costly to keep your data private, there is ultimately an incentive to make your data public. This scheme would especially benefit preservation of the large amounts of usable data that go stale or never see the light of day because of delays or failures to write up, and thus never get submitted to ENA. And of course, once published, private data would be made openly available immediately, all in a well-formatted and curated manner that the community can benefit from. What’s not to like?

Thoughts on whether, or how, these half-baked ideas could be turned into reality are much appreciated in the comments below.


Launch of the PLOS Text Mining Collection

Just a quick post to announce that the PLOS Text Mining Collection is now live!

This PLOS Collection arose out of a twitter conversation with Theo Bloom last year, and has come together through the hard work of the authors of the papers in the Collection, the PLOS Collections team (in particular Sam Moore and Jennifer Horsely), and my co-organizers Larry Hunter and Andrey Rzhetsky. Many thanks to all for seeing this effort to completion.

Because of the large body of work in the area of text mining published in PLOS, we struggled with how best to present all these papers in the collection without diluting the experience for the reader. In the end, we decided only to highlight new work from the last two years and major reviews/tutorials at the time of launch. However, as this is a living collection, new articles will be included in the future, and the aim is to include previously published work as well. We hope to see many more papers in the area of text mining published in the PLOS family of journals in the future.

An overview of the PLOS Text Mining Collection is below (cross-posted at the PLOS EveryONE blog), and a commentary on the Collection is available at the Official PLOS Blog entitled “A mine of information – the PLOS Text Mining Collection”.

Background to the PLOS Text Mining Collection

Text Mining is an interdisciplinary field combining techniques from linguistics, computer science and statistics to build tools that can efficiently retrieve and extract information from digital text. Over the last few decades, there has been increasing interest in text mining research because of the potential commercial and academic benefits this technology might enable. However, as with the promises of many new technologies, the benefits of text mining are still not clear to most academic researchers.

This situation is now poised to change for several reasons. First, the rate of growth of the scientific literature has now outstripped the ability of individuals to keep pace with new publications, even in a restricted field of study. Second, text-mining tools have steadily increased in accuracy and sophistication to the point where they are now suitable for widespread application. Finally, the rapid increase in availability of digital text in an Open Access format now permits text-mining tools to be applied more freely than ever before.

To acknowledge these changes and the growing body of work in the area of text mining research, today PLOS launches the Text Mining Collection, a compendium of major reviews and recent highlights published in the PLOS family of journals on the topic of text mining. As PLOS is one of the major publishers of the Open Access scientific literature, it is perhaps no coincidence that research in text mining is flourishing in PLOS journals. As noted above, the widespread application and societal benefits of text mining are most easily achieved under an Open Access model of publishing, where the barriers to obtaining published articles are minimized and the ability to remix and redistribute data extracted from text is explicitly permitted. Furthermore, PLOS is one of the few publishers actively promoting text mining research by providing an open Application Programming Interface (API) to mine their journal content.
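As an illustration of the kind of programmatic access referred to above, here is a minimal sketch of querying the PLOS Search API (a Solr-based service) for text mining articles. The field names and query syntax are my assumptions based on the API documentation, and an API key may be required, so check the current docs before use.

```python
# Minimal sketch of querying the PLOS Search API for articles mentioning
# text mining. The field names and query syntax are assumptions based on
# the API documentation; an API key may be required.
import requests

PLOS_SEARCH = "http://api.plos.org/search"

def search_plos(query, rows=10):
    params = {
        "q": query,                         # Solr query syntax
        "fl": "id,title_display,journal",   # fields to return (assumed names)
        "wt": "json",
        "rows": rows,
    }
    response = requests.get(PLOS_SEARCH, params=params)
    response.raise_for_status()
    return response.json()["response"]["docs"]

# Example:
# for doc in search_plos('everything:"text mining"'):
#     print(doc.get("id"), "-", doc.get("title_display"))
```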

Text Mining in PLOS

Since virtually the beginning of its history [1], PLOS has actively promoted the field of text mining by publishing reviews, opinions, tutorials and dozens of primary research articles in this area in PLOS Biology, PLOS Computational Biology and, increasingly, PLOS ONE. Because of the large number of text mining papers in PLOS journals, we are only able to highlight a subset of these works in the first instance of the PLOS Text Mining Collection. These include major reviews and tutorials published over the last decade [1][2][3][4][5][6], plus a selection of research papers from the last two years [7][8][9][10][11][12][13][14][15][16][17][18][19] and three new papers arising from the call for papers for this collection [20][21][22].
The research papers included in the collection at launch provide important overviews of the field and reflect many exciting contemporary areas of research in text mining, such as:

  • methods to extract textual information from figures [7];
  • methods to cluster [8] and navigate [15] the burgeoning biomedical literature;
  • integration of text-mining tools into bioinformatics workflow systems [9];
  • use of text-mined data in the construction of biological networks [10];
  • application of text-mining tools to non-traditional textual sources such as electronic patient records [11] and social media [12];
  • generating links between the biomedical literature and genomic databases [13];
  • application of text-mining approaches in new areas such as the Environmental Sciences [14] and Humanities [16][17];
  • named entity recognition [18];
  • assisting the development of ontologies [19];
  • extraction of biomolecular interactions and events [20][21]; and
  • assisting database curation [22].

Looking Forward

As this is a living collection, it is worth discussing two issues we hope to see addressed in articles that are added to the PLOS text mining collection in the future: scaling up and opening up. While application of text mining tools to abstracts of all biomedical papers in the MEDLINE database is increasingly common, there have been remarkably few efforts that have applied text mining to the entirety of the full text articles in a given domain, even in the biomedical sciences [4][23]. Therefore, we hope to see more text mining applications scaled up to use the full text of all Open Access articles. Scaling up will not only maximize the utility of text-mining technologies and their uptake by end users, but will also demonstrate that demand for access to full text articles exists in the text mining and wider academic communities.

Likewise, we hope to see more text-mining software systems made freely or openly available in the future. As an example of the state of affairs in the field, only 25% of the research articles highlighted in the PLOS text mining collection at launch provide source code or executable software of any kind [13][16][19][21]. The lack of availability of software or source code accompanying published research articles is, of course, not unique to the field of text mining. It is a general problem limiting progress and reproducibility in many fields of science, which authors, reviewers and editors have a duty to address. Making release of open source software the rule, rather than the exception, should further catalyze advances in text mining, as it has in other fields of computational research that have made extremely rapid progress in the last decades (such as genome bioinformatics).

By opening up the code base in text mining research, and deploying text-mining tools at scale on the rapidly growing corpus of full-text Open Access articles, we are confident this powerful technology will make good on its promise to catalyze scholarly endeavors in the digital age.

References

1. Dickman S (2003) Tough mining: the challenges of searching the scientific literature. PLoS Biology 1: e48. doi:10.1371/journal.pbio.0000048.
2. Rebholz-Schuhmann D, Kirsch H, Couto F (2005) Facts from Text—Is Text Mining Ready to Deliver? PLoS Biology 3: e65. doi:10.1371/journal.pbio.0030065.
3. Cohen B, Hunter L (2008) Getting started in text mining. PLoS Computational Biology 4: e20. doi:10.1371/journal.pcbi.0040020.
4. Bourne PE, Fink JL, Gerstein M (2008) Open access: taking full advantage of the content. PLoS Computational Biology 4: e1000037. doi:10.1371/journal.pcbi.1000037.
5. Rzhetsky A, Seringhaus M, Gerstein M (2009) Getting Started in Text Mining: Part Two. PLoS Computational Biology 5: e1000411. doi:10.1371/journal.pcbi.1000411.
6. Rodriguez-Esteban R (2009) Biomedical Text Mining and Its Applications. PLoS Computational Biology 5: e1000597. doi:10.1371/journal.pcbi.1000597.
7. Kim D, Yu H (2011) Figure text extraction in biomedical literature. PLoS ONE 6: e15338. doi:10.1371/journal.pone.0015338.
8. Boyack K, Newman D, Duhon R, Klavans R, Patek M, et al. (2011) Clustering More than Two Million Biomedical Publications: Comparing the Accuracies of Nine Text-Based Similarity Approaches. PLoS ONE 6: e18029. doi:10.1371/journal.pone.0018029.
9. Kolluru B, Hawizy L, Murray-Rust P, Tsujii J, Ananiadou S (2011) Using workflows to explore and optimise named entity recognition for chemistry. PLoS ONE 6: e20181. doi:10.1371/journal.pone.0020181.
10. Hayasaka S, Hugenschmidt C, Laurienti P (2011) A network of genes, genetic disorders, and brain areas. PLoS ONE 6: e20907. doi:10.1371/journal.pone.0020907.
11. Roque F, Jensen P, Schmock H, Dalgaard M, Andreatta M, et al. (2011) Using electronic patient records to discover disease correlations and stratify patient cohorts. PLoS Computational Biology 7: e1002141. doi:10.1371/journal.pcbi.1002141.
12. Salathé M, Khandelwal S (2011) Assessing Vaccination Sentiments with Online Social Media: Implications for Infectious Disease Dynamics and Control. PLoS Computational Biology 7: e1002199. doi:10.1371/journal.pcbi.1002199.
13. Baran J, Gerner M, Haeussler M, Nenadic G, Bergman C (2011) pubmed2ensembl: a resource for mining the biological literature on genes. PLoS ONE 6: e24716. doi:10.1371/journal.pone.0024716.
14. Fisher R, Knowlton N, Brainard R, Caley J (2011) Differences among major taxa in the extent of ecological knowledge across four major ecosystems. PLoS ONE 6: e26556. doi:10.1371/journal.pone.0026556.
15. Hossain S, Gresock J, Edmonds Y, Helm R, Potts M, et al. (2012) Connecting the dots between PubMed abstracts. PLoS ONE 7: e29509. doi:10.1371/journal.pone.0029509.
16. Ebrahimpour M, Putniņš TJ, Berryman MJ, Allison A, Ng BW-H, et al. (2013) Automated authorship attribution using advanced signal classification techniques. PLoS ONE 8: e54998. doi:10.1371/journal.pone.0054998.
17. Acerbi A, Lampos V, Garnett P, Bentley RA (2013) The Expression of Emotions in 20th Century Books. PLoS ONE 8: e59030. doi:10.1371/journal.pone.0059030.
18. Groza T, Hunter J, Zankl A (2013) Mining Skeletal Phenotype Descriptions from Scientific Literature. PLoS ONE 8: e55656. doi:10.1371/journal.pone.0055656.
19. Seltmann KC, Pénzes Z, Yoder MJ, Bertone MA, Deans AR (2013) Utilizing Descriptive Statements from the Biodiversity Heritage Library to Expand the Hymenoptera Anatomy Ontology. PLoS ONE 8: e55674. doi:10.1371/journal.pone.0055674.
20. Van Landeghem S, Bjorne J, Wei C-H, Hakala K, Pyysal S, et al. (2013) Large-Scale Event Extraction from Literature with Multi-Level Gene Normalization. PLoS ONE 8: e55814. doi:10.1371/journal.pone.0055814.
21. Liu H, Hunter L, Keselj V, Verspoor K (2013) Approximate Subgraph Matching-based Literature Mining for Biomedical Events and Relations. PLoS ONE 8: e60954. doi:10.1371/journal.pone.0060954.
22. Davis A, Weigers T, Johnson R, Lay J, Lennon-Hopkins K, et al. (2013) Text mining effectively scores and ranks the literature for improving chemical-gene-disease curation at the Comparative Toxicogenomics Database. PLoS ONE 8: e58201. doi:10.1371/journal.pone.0058201.
23. Bergman CM (2012) Why Are There So Few Efforts to Text Mine the Open Access Subset of PubMed Central? https://caseybergman.wordpress.com/2012/03/02/why-are-there-so-few-efforts-to-text-mine-the-open-access-subset-of-pubmed-central/.

On the Preservation of Published Bioinformatics Code on Github

A few months back I posted a quick analysis of trends in where bioinformaticians choose to host their source code. A clear trend emerging in the bioinformatics community is to use github as the primary repository of bioinformatics code in published papers.  While I am a big fan of github and I support its widespread adoption, in that post I noted my concerns about the ease with which an individual can delete a published repository. In contrast to SourceForge, where it is extremely difficult to delete a repository once files have been released and this can only be done by SourceForge itself, deleting a repository on github takes only a few seconds and can be done (accidentally or intentionally) by the user who created the repository.

Just to see how easy this is, I’ve copied the process for deleting a repository on github here:

  • Go to the repo’s admin page
  • Click “Delete this repository”
  • Read the warnings and enter the name of the repository you want to delete
  • Click “I understand the consequences, delete this repository”

Given the increasing use of github in publications, I feel the issue of repository deletion on github needs to be discussed more by scientists and publishers in the context of the long-term maintenance of published code. The reason I see this as important is that most github repositories are published via individual user accounts, and thus only one person holds the keys to preservation of the published code. Furthermore, I suspect funders, editors, publishers and (most) PIs have no idea how easy it is under the current model to delete published code. Call me a bit paranoid, but I see it as my responsibility as a PI to ensure the long-term preservation of published code, since I’m the one who signs off on data/resource plans in grants/final reports. Better to be safe than sorry, right?

On this note, I was pleased to see a retweet in my stream this week (via C. Titus Brown) concerning news that the journal Computers & Geosciences has adopted an official policy for hosting published code on github.

The mechanism that Computers & Geosciences has adopted to ensure long-term preservation of code in their journal is very simple: the editor forks code submitted by a github user into a journal organization (note: a similar idea was also suggested independently by Andrew Perry in the comments to my previous post). As the github repository deletion mechanism clearly states, “Deleting a private repo will delete all forks of the repo. Deleting a public repo will not.” Thus, once Computers & Geosciences has forked the code, the risk to the author, journal and community of a single point of failure is substantially mitigated, with very little overhead for authors or publishers.
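For journals (or anyone else) wanting to automate this kind of preservation fork, the GitHub API exposes the fork operation directly. A minimal sketch in Python is below; the organization and repository names are placeholders, and an access token with the appropriate scope is assumed.

```python
# Minimal sketch: fork a published repository into an archival organization
# via the GitHub REST API. Owner/repo/organization names are placeholders
# and a personal access token with the right scope is assumed.
import requests

API = "https://api.github.com"

def fork_to_org(owner, repo, organization, token):
    """Fork owner/repo into the given organization and return the fork metadata."""
    url = f"{API}/repos/{owner}/{repo}/forks"
    headers = {"Authorization": f"token {token}",
               "Accept": "application/vnd.github+json"}
    response = requests.post(url, json={"organization": organization},
                             headers=headers)
    response.raise_for_status()  # forking is asynchronous; expect 202 Accepted
    return response.json()

# Example (placeholder names):
# fork_to_org("some-author", "published-tool", "BioinformaticsArchive", TOKEN)
```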

So what about the many other journals that have no such digital preservation policy but currently publish papers with bioinformatics code in github? Well, as a stopgap measure until other journals get on board with similar policies (PLOS & BMC, please lead the way!), I’ve taken the initiative to create a github organization called BioinformaticsArchive to serve this function. Currently, I’ve forked code for all but one of the 64 publications with github URLs in their PubMed record. One of the scary/interesting things to observe from this endeavor is just how fragile the current situation is. Of the 63 repositories I’ve forked, about 50% (n=31) had not been previously forked by any other user on github and could have been easily deleted, with consequent loss to the scientific community.
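The fragility estimate above is straightforward to reproduce, since the public GitHub API reports a fork count for every repository. Here is a sketch that flags published repositories with no forks, assuming a plain list of “owner/repo” strings as input; the repository name in the example is a placeholder.

```python
# Sketch of the fragility check described above: given a list of published
# "owner/repo" strings, report those that have no forks on GitHub (or that
# have already disappeared). Unauthenticated requests are rate-limited,
# so a token may be needed for longer lists.
import requests

def unforked_repos(repo_list, token=None):
    headers = {"Authorization": f"token {token}"} if token else {}
    at_risk = []
    for full_name in repo_list:
        r = requests.get(f"https://api.github.com/repos/{full_name}",
                         headers=headers)
        if r.status_code == 404:
            at_risk.append((full_name, "already gone"))
            continue
        r.raise_for_status()
        if r.json().get("forks_count", 0) == 0:
            at_risk.append((full_name, "no forks"))
    return at_risk

# Example with a placeholder repository name:
# print(unforked_repos(["some-author/published-tool"]))
```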

I am aware (thanks to Marc Robinson Rechavi) that there are many more published github repositories in the full text of articles (including two from our lab), which I will endeavor to dig out and add to this archive asap. If anyone else would like to help out with this endeavor, or knows of published repositories that should be included, send me an email or tweet and I’ll add them to the archive. Comments on how to improve the current state of preservation of published bioinformatics code on github, and on what can be learned from Computers & Geosciences’ new model policy, are most welcome!

Top N Reasons To Do A Ph.D. or Post-Doc in Bioinformatics/Computational Biology

For the last few years I’ve given a talk to incoming Ph.D. students in Molecular Biology on why they should consider doing Computational Biology research. I’m fairly passionate about making this pitch, since I strongly believe all 21st century Biologists should have a greater (or lesser) degree of computational training, and that the best time to gain that training is during a Ph.D. or a Post-Doc.

I’ve decided to post an expanded version of the reasons I give for why Biology trainees should gain computational skills in hopes of encouraging a wider audience to consider a research path in Computational Biology. For simplicity, I define the field of Computational Biology to include Bioinformatics as well, although there are important distinctions between these two disciplines. Also, I note that this list is geared towards convincing students with a background in Molecular Biology to consider moving into Computational Biology, but core aspects and variants of the arguments here should apply to people with backgrounds in other disciplines (e.g. Ecology, Neuroscience) as well. Here we go…

0. Computing is the key skill set for 21st century biology: As time progresses, Biology is becoming a more quantitative science. Over the last three centuries, biology has transformed from an observational science into an experimental science into a data science. As the low-hanging fruit gets picked, fundamental discoveries are getting harder to make using observation and experiment alone. In the future, new discoveries will require leveraging big datasets and using advanced analytical methods. Big data and complex models require computational skills. Full stop. There is no way to escape this reality.

But if you don’t take my word for it, listen to Nobel-prize-winning pioneer of molecular biology Walter Gilbert, who made this same argument about the future of biology over 20 years ago:

To use this flood of [sequence] knowledge, which will pour across the computer networks of the world, biologists not only must become computer literate, but also change their approach to the problem of understanding life.

Or listen to Nobel-prize winning pioneer of molecular biology Sydney Brenner, who has been banging on about this issue for years:

I spent many hours persuading people that computing was not only going to be the essential tool for biological research but would also provide models for analyzing complexity…The development of sequencing techniques and their widespread application has generated enormous databases of information, and the need for computers is no longer questioned

1. Computational skills are highly transferable: Let’s face it, not everyone doing a Ph.D. or Post-Doc in Biology is going to go on to a career in academic research. The Washington Post recently reported that “only 14 percent of those with a Ph.D. in biology and the life sciences now land a coveted academic position within five years“. So if there is a high probability that your Ph.D. or Post-Doc training will need to be used outside of academic research, why not acquire the most broadly applicable skill set that you can? Experimental skills only transfer to laboratory jobs in the biosciences or medical job market. Computational skills transfer across this sector, plus a much wider market outside of the (bio)sciences. Increasing your computational chops won’t just give you a better chance at landing a job. It will have added benefits in your own life as well, since you will have a deeper appreciation for how computers work and more mastery when you interact with computers in your daily life.

2. Computing will help improve your core scientific skills: Biology is inherently a messy subject. While some Biologists are rigorously trained in how to cope with this messiness through good experimental design and statistical analysis (here’s looking at you, my Ecologist sisters and brothers), the sad truth is that many (most?) Biologists have bad habits when it comes to data collection and analysis. Computing forces you to confront and tame the very human tendency to do science in ad hoc ways, and therefore it naturally develops core scientific skills such as: logically planning experiments, collecting data consistently, developing reproducible methodology, and analysing your data with proper statistical methods. So even if you can’t be convinced to abandon the bench or field forever, computational training will develop scientific best practice that crosses over and enhances your experimental skill set.

3. You should use your Ph.D./Post-Doc to develop new skills: Most Biologists come into their Ph.D. with some experimental training from high school and undergraduate studies. OK, so maybe this training isn’t cutting edge and you haven’t done advanced research to really hone your experimental skills, but nevertheless you do have some amount of training under your belt. In contrast, the vast majority of Biology Ph.D. students have no training in scientific computing skills beyond using Excel or a GUI-based statistics package. So use your Ph.D. or Post-Doc time for what it should be: training in something new, not just further developing a skill set that you already have.

My view is that the best time to train in Computational Biology is during a Ph.D., and the last chance to do this is likely to be as a Post-Doc. This is because during your Ph.D. you have time, secure funding and a departmental structure to protect you that you will never have again in your career. Gaining computational skills as a Post-Doc is also a great option, but shorter contracts, greater PI dependency, and higher expectations to publish mean that you typically don’t have as much time to re-train as you would during a Ph.D. Good luck finding the time to re-tool as a PI.

4. You will develop a more unique skill set in Biology: As noted above, the vast majority of Biologists have experimental training, but very few have advanced Computational training. While this is (thankfully!) changing, you will still be at a competitive advantage for at least a decade or more in terms of getting results in post-genomic Biology if you can code. And because you will be able to get results that many others cannot, plus the fact that you will have skills that set you apart from the herd, you will be more competitive on the job market. Straight up.

5. You will publish more papers: While it may not always feel like it, a Ph.D. or Post-Doc goes by quickly. Therefore, you don’t have a lot of time to waste on experiments that fail, if you want to stay in the game. Don’t get me wrong, Computational Biology will provide you more than your fair share of failed experiments, but crucially they will fail in hours/days instead of weeks/months, and therefore allow you to move on to something that works more quickly. As a result, you are very likely to publish more papers per unit time in Computational Biology. Whether or not you believe the old chestnut that experimental papers are somehow “harder” and therefore have more worth (I don’t), it is clear that publication remains the hard currency of science. Moreover, the adage that search committees “know how to count even if they can’t read” is still as true as ever. More seriously, what employers and funding agencies want to see is junior researchers who have good ideas and can take them to completion. Publication is the proof that you can finish projects. Computational Biology will allow you to demonstrate that you are a finisher, and that you have what it takes to succeed in science, a little bit faster than the next guy or gal.

6. You will have more flexibility in your research: I would say one of the greatest things about being a Computational Biologist is that you are not as constrained in your research as you are when you do Experimental Biology. Sure, you can only work on projects that are amenable to computational analysis, but this scope is vast, from Computational Neuroscience to Theoretical Ecology and anything and everything in between. You can also move flexibly from topic to topic more easily than you can if your skill set is linked to specific experimental techniques. This flexibility in scope allows you to satisfy your intellectual curiosity or chase the latest trend as you wish. Most importantly for trainees, the flexibility (and low cost, see below) afforded by Computational Biology research allows you to make the case to your PI to develop your own research programme earlier in your career. This is crucial, since the more experience you have designing independent projects early in your career, the more likely you will be to succeed if/when you make it to the big time.

7. You will have more flexibility in working practices: ‘Nuff said:

Seriously though, Computational Biology has many pluses when it comes to balancing work and life while still maintaining a high level of productivity. Unlike being chained to the bench, you can do Computational Biology from pretty much anywhere, and telecommuting/working from home are standard practices in Computational Biology. Over the longer term, this flexibility in working practices helps you to accommodate career breaks, manage the tough times life will throw at you, and make big life decisions like starting a family easier, since you can integrate coding and submitting jobs to the cluster into your life much better than you can integrate racing back to the lab to flip stocks or harvest cells. Let me say it loud and clear right here: if you want to have a career in academic Biological research and also have a family, choosing to do a Ph.D. or Post-Doc in Computational Biology will be more likely to get you to this goal than if you are stuck in the lab. This is not just true for women, as I and others can attest.

8. Computational research is cost-effective: With the wealth of data now publicly available, Computational Biology research is cheaper than most experimental work that requires a large consumables budget. This is important for a number of reasons. Primarily, work in Computational Biology is less dependent on grant funding, and therefore you don’t have to be a slave to trends or waste inordinate amounts of time chasing grants; you can actually just get on with the job of doing the science you want to do. This is especially important in tough economic times like the present moment. As mentioned above, the reduced cost of Computational Biology research also allows trainees to design their own research at an earlier career stage, since you will not be as reliant on a PI to authorize expenditure for your project. Cost-efficiency is also very important when you are starting your group, and for maintaining continuity of productivity when riding out troughs in funding or group size. Finally, the cost-efficiency of Computational Biology allows researchers in developing scientific economies to be on a par with researchers in rich countries. In my opinion, trainees from BRICS nations and other developing economies (sorry to use this somewhat judgemental term) should really consider choosing Computational Biology as a way to get to the top of the class globally without being limited by the need for big budgets.

9. A successful scientist ends up in an office: This is the kicker. If you succeed and get that “coveted” PI position, you will ultimately end up stuck in an office. True, some brave souls still find time to make it into the lab to do experiments, but they are a rare breed. The truth is that the native habitat of an academic researcher is sitting in their office in front of their computer. You can’t do a lick of wet lab or field work from the office, but you can still do Computational Biology research from behind a desk! As noted by Webb Miller, one of the most highly cited bioinformaticians ever, continuing to do your own research is also one of the best ways to stay motivated about your work over the long haul of a career. Remember that the long-term goal is to be a “Principal Investigator”, not an “In Principle Investigator”, so if you’ve really wanted to do research since you were young, then ask yourself: why train in skills you will never ultimately use for the majority of your career, while somebody else in your lab gets to have fun making all the discoveries?

[10. You will understand why lists should start with the number zero.]

A major reason I have for posting this list is to start more discussion about the benefits of doing research in Computational Biology. I have deliberately made this a top N (not a top 10) list so that good ideas can be added to the above. I’ll update this post with good suggestions from the comments, and give full credit to the originator.

Where Do Bioinformaticians Host Their Code?

A while back I was piqued by a discussion on BioStar about “Where would you host your open source code repository today?“, which got me thinking about the relative merits of the different sites for hosting bioinformatics software. I am not an evangelist for any particular version control system or hosting site, and I leave it to readers to look into these systems themselves, or at the BioStar thread, for more on the relative merits of the major hosting services: Sourceforge, Google Code, github and bitbucket. My aim here is not to advocate any particular system (although as a lab head I have certain predilections*), but to answer a straightforward empirical question: where do bioinformaticians host their code?

To do this, I’ve queried PubMed for keywords in the URLs of the four major hosting services listed above to get estimates of their uptake in biomedical publications. This simple analysis clearly has some caveats, including the fact that many publications link to hosting services in sections of the paper outside the abstract, and that many bioinformaticians (frustratingly) release code via institutional or personal webpages. Furthermore, the various hosting services arose at different times in history, so it is also important to interpret these data in a temporal context. These (and other) caveats aside, the following provides an overview of how the bioinformatics community votes with its feet in terms of hosting code on the major repository systems…
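The query itself is easy to reproduce. Here is a minimal sketch using Biopython’s Entrez module; the exact term syntax for matching hosting-service URLs is an assumption, and PubMed’s indexing of URL fragments may affect the counts, so treat the numbers as estimates.

```python
# Minimal sketch of the PubMed keyword query described above, using
# Biopython's Entrez module. The term syntax for matching hosting-service
# URLs is an assumption; PubMed's tokenisation of URLs may affect counts.
from Bio import Entrez

Entrez.email = "you@example.org"  # placeholder; NCBI asks for a contact address

def pubmed_count(term):
    handle = Entrez.esearch(db="pubmed", term=term, retmax=0)
    record = Entrez.read(handle)
    handle.close()
    return int(record["Count"])

for host in ["sourceforge.net", "code.google.com", "github.com", "bitbucket.org"]:
    print(host, pubmed_count(f'"{host}"'))
```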

First of all, the bad news: of the many thousands of articles published in the field of bioinformatics, as of Dec 31 2012 just under 700 papers (n=676) have easily discoverable code linked to a major repository in their abstract. The totals for each repository system are: 446 on Sourceforge, 152 on Google Code, 78 on github and only 5 on bitbucket. So, by far, the majority of authors have chosen not to host their code on a major repository. But for the minority of authors who have chosen to release their code via a stable repository system, most use Sourceforge (which is the oldest and most established source code repository), and effectively nobody is using bitbucket.

The first paper linking published code to a major repository system appeared only a decade ago, in 2002, and a breakdown of the growth in code hosting since then looks like this:

Year   Sourceforge   Google Code   github
2002   4             0             0
2003   3             0             0
2004   10            0             0
2005   21            1             0
2006   24            0             0
2007   30            1             0
2008   30            10            0
2009   48            10            0
2010   69            21            8
2011   94            46            18
2012   113           63            52
Total  446           152           78

Trends in bioinformatics code repository usage 2002-2012.

A few things are clear from these results: 1) there is an upward trend in biomedical researchers hosting their code on major repository sites, 2) Sourceforge has clearly been the dominant player in the biomedical code repository game to date, but 3) the current growth rate of github appears to be outstripping both Sourceforge and Google Code. Furthermore, it appears that github is not experiencing any lag in uptake, as was observed in the 2002-2004 period for Sourceforge and the 2006-2009 period for Google Code. It is good to see that new players in the hosting market are being accepted at a quicker rate than they were a decade ago.

Hopefully the upward trend for bioinformaticians to release their code via a major code hosting service will continue (keep up the good work, brothers and sisters!), and this will ultimately create a snowball effect such that it is no longer acceptable to publish bioinformatics software without releasing it openly into the wild.


* As a lab manager I prefer to use Sourceforge for our published work, since Sourceforge has a very draconian policy when it comes to deleting projects, which prevents accidental or willful deletion of a repository. In my opinion, Google Code and (especially) github are too permissive in terms of allowing projects to be deleted. As a lab head, I see it as my duty to ensure the long-term preservation of published code above all other considerations. I am aware that there are mechanisms to protect against deletion of repositories on github and Google Code, but I suspect that most lab heads do not utilize them, and that a substantial fraction of published academic code is one click away from deletion.

Nominations for the Benjamin Franklin Award for Open Access in the Life Sciences

Earlier this week I received an email with the annual call for nominations for the Benjamin Franklin Award for Open Access in the Life Sciences. While I am in general not that fussed about the importance of academic accolades, I think this is a great award, since it recognizes contributions in a sub-discipline of biology — computational biology, or bioinformatics — that are specifically made in the spirit of open innovation. By placing the emphasis on recognizing openness as an achievement, the Franklin Award goes beyond other related honors (such as those awarded by the International Society for Computational Biology) and, in my view, captures the true spirit of what scientists should be striving for in their work.

In looking over the list of past recipients, few would argue that the award has not gone to major contributors to the open source/open access movements in biology. In thinking about who might be appropriate to add to this list, two people sprang to mind whom I’ve had the good fortune to work with in the past, both of whom have made a major impression on my (and many others’) thinking and working practices in computational biology. So without further ado, here are my nominations for the 2012 Benjamin Franklin Award for Open Access in the Life Sciences (in chronological order of my interaction with them)…

Suzanna Lewis

Suzanna Lewis (Lawrence Berkeley National Laboratory) is one of the pioneers of developing open standards and software for genome annotation and ontologies. She led the team responsible for the systematic annotation of the Drosophila melanogaster genome, which included development of the Gadfly annotation pipeline and database framework, and the annotation curation/visualization tool Apollo. Lewis’ work in genome annotation also includes playing instrumental roles in the GASP community assessment exercises to evaluate the state of the art in genome annotation, the development of the GBrowse genome browser, and the data coordination center for the modENCODE project. In addition to her work in genome annotation, Lewis has been a leader in the development of open biological ontologies (OBO, NCBO), contributing to the Gene Ontology, Sequence Ontology, and Uberon anatomy ontologies, and developing open software for editing and navigating ontologies (AmiGO, OBO-Edit, and Phenote).

Carole Goble

Carole Goble (University of Manchester) is widely recognized as a visionary in the development of software to support automated workflows in biology. She has been a leader of the myGrid and Open Middleware Infrastructure Institute consortia, which have generated a large number of highly innovative open resources for e-research in the life sciences including the Taverna Workbench for developing and deploying workflows, the BioCatalogue registry of bioinformatics web services, and the social-networking inspired myExperiment workflow repository. Goble has also played an instrumental role in the development of semantic-web tools for constructing and analyzing life science ontologies, the development of ontologies for describing bioinformatics resources, as well as ontology-based tools such as RightField for managing life science data.

I hope others join me in acknowledging the outputs of these two open innovators as being more than worthy of the Franklin Award, support their nomination, and cast votes in their favor this year and/or in years to come!