Launch of the PLOS Text Mining Collection

Just a quick post to announce that the PLOS Text Mining Collection is now live!

This PLOS Collection arose out of a twitter conversation with Theo Bloom last year, and has come together through the hard work of the authors of the papers in the Collection, the PLOS Collections team (in particular Sam Moore and Jennifer Horsely), and my co-organizers Larry Hunter and Andrey Rzhetsky. Many thanks to all for seeing this effort to completion.

Because of the large body of work in the area of text mining published in PLOS, we struggled with how best to present all these papers in the collection without diluting the experience for the reader. In the end, we decided only to highlight new work from the last two years and major reviews/tutorials at the time of launch. However, as this is a living collection, new articles will be included in the future, and the aim is to include previously published work as well. We hope to see many more papers in the area of text mining published in the PLOS family of journals in the future.

An overview of the PLOS Text Mining Collection is below (cross-posted at the PLOS EveryONE blog) and a commentary on Collection is available at the Official PLOS Blog entitled “A mine of information – the PLOS Text Mining Collection“.

Background to the PLOS Text Mining Collection

Text Mining is an interdisciplinary field combining techniques from linguistics, computer science and statistics to build tools that can efficiently retrieve and extract information from digital text. Over the last few decades, there has been increasing interest in text mining research because of the potential commercial and academic benefits this technology might enable. However, as with the promises of many new technologies, the benefits of text mining are still not clear to most academic researchers.

This situation is now poised to change for several reasons. First, the rate of growth of the scientific literature has now outstripped the ability of individuals to keep pace with new publications, even in a restricted field of study. Second, text-mining tools have steadily increased in accuracy and sophistication to the point where they are now suitable for widespread application. Finally, the rapid increase in availability of digital text in an Open Access format now permits text-mining tools to be applied more freely than ever before.

To acknowledge these changes and the growing body of work in the area of text mining research, today PLOS launches the Text Mining Collection, a compendium of major reviews and recent highlights published in the PLOS family of journals on the topic of text mining. As one of the major publishers of the Open Access scientific literature, it is perhaps no coincidence that research in text mining in PLOS journals is flourishing. As noted above, the widespread application and societal benefits of text mining is most easily achieved under an Open Access model of publishing, where the barriers to obtaining published articles are minimized and the ability to remix and redistribute data extracted from text is explicitly permitted. Furthermore, PLOS is one of the few publishers who is actively promoting text mining research by providing an open Application Programming Interface to mine their journal content.

Text Mining in PLOS

Since virtually the beginning of its history [1], PLOS has actively promoted the field of text mining by publishing reviews, opinions, tutorials and dozens of primary research articles in this area in PLOS Biology, PLOS Computational Biology and, increasingly, PLOS ONE. Because of the large number of text mining papers in PLOS journals, we are only able to highlight a subset of these works in the first instance of the PLOS Text Mining Collection. These include major reviews and tutorials published over the last decade [1][2][3][4][5][6], plus a selection of research papers from the last two years [7][8][9][10][11][12][13][14][15][16][17][18][19] and three new papers arising from the call for papers for this collection [20][21][22].
The research papers included in the collection at launch provide important overviews of the field and reflect many exciting contemporary areas of research in text mining, such as:

  • methods to extract textual information from figures [7];
  • methods to cluster [8] and navigate [15] the burgeoning biomedical literature;
  • integration of text-mining tools into bioinformatics workflow systems [9];
  • use of text-mined data in the construction of biological networks [10];
  • application of text-mining tools to non-traditional textual sources such as electronic patient records [11] and social media [12];
  • generating links between the biomedical literature and genomic databases [13];
  • application of text-mining approaches in new areas such as the Environmental Sciences [14] and Humanities [16][17];
  • named entity recognition [18];
  • assisting the development of ontologies [19];
  • extraction of biomolecular interactions and events [20][21]; and
  • assisting database curation [22].

Looking Forward

As this is a living collection, it is worth discussing two issues we hope to see addressed in articles that are added to the PLOS text mining collection in the future: scaling up and opening up. While application of text mining tools to abstracts of all biomedical papers in the MEDLINE database is increasingly common, there have been remarkably few efforts that have applied text mining to the entirety of the full text articles in a given domain, even in the biomedical sciences [4][23]. Therefore, we hope to see more text mining applications scaled up to use the full text of all Open Access articles. Scaling up will maximize the utility of text-mining technologies and the uptake by end users, but also demonstrate that demand for access to full text articles exists by the text mining and wider academic communities.

Likewise, we hope to see more text-mining software systems made freely or openly available in the future. As an example of the state of affairs in the field, only 25% of the research articles highlighted in the PLOS text mining collection at launch provide source code or executable software of any kind [13][16][19][21]. The lack of availability of software or source code accompanying published research articles is, of course, not unique to the field of text mining. It is a general problem limiting progress and reproducibility in many fields of science, which authors, reviewers and editors have a duty to address. Making release of open source software the rule, rather than the exception, should further catalyze advances in text mining, as it has in other fields of computational research that have made extremely rapid progress in the last decades (such as genome bioinformatics).

By opening up the code base in text mining research, and deploying text-mining tools at scale on the rapidly growing corpus of full-text Open Access articles, we are confident this powerful technology will make good on its promise to catalyze scholarly endeavors in the digital age.


1. Dickman S (2003) Tough mining: the challenges of searching the scientific literature. PLoS biology 1: e48. doi:10.1371/journal.pbio.0000048.
2. Rebholz-Schuhmann D, Kirsch H, Couto F (2005) Facts from Text—Is Text Mining Ready to Deliver? PLoS Biol 3: e65. doi:10.1371/journal.pbio.0030065.
3. Cohen B, Hunter L (2008) Getting started in text mining. PLoS computational biology 4: e20. doi:10.1371/journal.pcbi.0040020.
4. Bourne PE, Fink JL, Gerstein M (2008) Open access: taking full advantage of the content. PLoS computational biology 4: e1000037+. doi:10.1371/journal.pcbi.1000037.
5. Rzhetsky A, Seringhaus M, Gerstein M (2009) Getting Started in Text Mining: Part Two. PLoS Comput Biol 5: e1000411. doi:10.1371/journal.pcbi.1000411.
6. Rodriguez-Esteban R (2009) Biomedical Text Mining and Its Applications. PLoS Comput Biol 5: e1000597. doi:10.1371/journal.pcbi.1000597.
7. Kim D, Yu H (2011) Figure text extraction in biomedical literature. PloS one 6: e15338. doi:10.1371/journal.pone.0015338.
8. Boyack K, Newman D, Duhon R, Klavans R, Patek M, et al. (2011) Clustering More than Two Million Biomedical Publications: Comparing the Accuracies of Nine Text-Based Similarity Approaches. PLoS ONE 6: e18029. doi:10.1371/journal.pone.0018029.
9. Kolluru B, Hawizy L, Murray-Rust P, Tsujii J, Ananiadou S (2011) Using workflows to explore and optimise named entity recognition for chemistry. PloS one 6: e20181. doi:10.1371/journal.pone.0020181.
10. Hayasaka S, Hugenschmidt C, Laurienti P (2011) A network of genes, genetic disorders, and brain areas. PloS one 6: e20907. doi:10.1371/journal.pone.0020907.
11. Roque F, Jensen P, Schmock H, Dalgaard M, Andreatta M, et al. (2011) Using electronic patient records to discover disease correlations and stratify patient cohorts. PLoS computational biology 7: e1002141. doi:10.1371/journal.pcbi.1002141.
12. Salathé M, Khandelwal S (2011) Assessing Vaccination Sentiments with Online Social Media: Implications for Infectious Disease Dynamics and Control. PLoS Comput Biol 7: e1002199. doi:10.1371/journal.pcbi.1002199.
13. Baran J, Gerner M, Haeussler M, Nenadic G, Bergman C (2011) pubmed2ensembl: a resource for mining the biological literature on genes. PloS one 6: e24716. doi:10.1371/journal.pone.0024716.
14. Fisher R, Knowlton N, Brainard R, Caley J (2011) Differences among major taxa in the extent of ecological knowledge across four major ecosystems. PloS one 6: e26556. doi:10.1371/journal.pone.0026556.
15. Hossain S, Gresock J, Edmonds Y, Helm R, Potts M, et al. (2012) Connecting the dots between PubMed abstracts. PloS one 7: e29509. doi:10.1371/journal.pone.0029509.
16. Ebrahimpour M, Putniņš TJ, Berryman MJ, Allison A, Ng BW-H, et al. (2013) Automated authorship attribution using advanced signal classification techniques. PLoS ONE 8: e54998. doi:10.1371/journal.pone.0054998.
17. Acerbi A, Lampos V, Garnett P, Bentley RA (2013) The Expression of Emotions in 20th Century Books. PLoS ONE 8: e59030. doi:10.1371/journal.pone.0059030.
18. Groza T, Hunter J, Zankl A (2013) Mining Skeletal Phenotype Descriptions from Scientific Literature. PLoS ONE 8: e55656. doi:10.1371/journal.pone.0055656.
19. Seltmann KC, Pénzes Z, Yoder MJ, Bertone MA, Deans AR (2013) Utilizing Descriptive Statements from the Biodiversity Heritage Library to Expand the Hymenoptera Anatomy Ontology. PLoS ONE 8: e55674. doi:10.1371/journal.pone.0055674.
20. Van Landeghem S, Bjorne J, Wei C-H, Hakala K, Pyysal S, et al. (2013) Large-Scale Event Extraction from Literature with Multi-Level Gene Normalization. PLOS ONE 8: e55814. doi:10.1371/journal.pone.0055814
21. Liu H, Hunter L, Keselj V, Verspoor K (2013) Approximate Subgraph Matching-based Literature Mining for Biomedical Events and Relations. PLoS ONE 8(4): e60954. doi:10.1371/journal.pone.0060954
22. Davis A, Weigers T, Johnson R, Lay J, Lennon-Hopkins K, et al. (2013) Text mining effectively scores and ranks the literature for improving chemical-gene-disease curation at the Comparative Toxicogenomics Database. PLOS ONE 8: e58201. doi:10.1371/journal.pone.0058201
23. Bergman CM (2012) Why Are There So Few Efforts to Text Mine the Open Access Subset of PubMed Central?

Directed Genome Sequencing: the Key to Deciphering the Fabric of Life in 1993

Seeing the #AAASmtg hashtag flowing on my twitter stream over the last few days reminded my that my former post-doc advisor Sue Celniker must be enjoying her well-deserved election to the American Association for the Advancement of Science (AAAS). Sue has made a number of major contributions to Drosophila genomics, and I personally owe her for the chance to spend my journeyman years with her and so many other talented people in the Berkeley Drosophila Genome Project. I even would go so far as to say that it was Sue’s 1995 paper with Ed Lewis on the “Complete sequence of the bithorax complex of Drosophila” that first got me interested in “genomics.” I remember being completely in awe of the Genbank accession from this paper which was over 300,000 bp long! Man, this had to be the future. (In fact the accession number for the BX-C region, U31961, is etched in my brain like some telephone numbers from my childhood.) By the time I arrived at BDGP in 2001, the sequencing of the BX-C was already ancient history, as was the directed sequencing strategy used for this project.  These rapid changes made discovery of a set of discarded propaganda posters collecting dust in Reed George’s office that were made at the time (circa 1993) extolling the virtues of “Directed Genome Sequencing” as the key to “Deciphering the Fabric of Life” all the more poignant. I dug a photo I took of one of these posters today to commemerate the recognition of this pioneering effort (below). Here’s to a bygone era, and hats off to pioneers like Sue who paved the road for the rest of us in (Drosophila) genomics!


A Case for Junior/Senior Partnership Grants

Much has been made in recently years over funding crises in the US and Europe, which are the inevitable result of the Great Recession superimposed on top of the end of exponential growth in Science. Governments hamstrung by austerity measures or lack of political will have been forced to abandon increases in scientific funding, going so far even as to freeze funds for awarded grants in Spain (see translation here). The consequences of this stagnant period of inputs to scientific progress will be felt for many years to come, materially in terms of basic and applied discoveries, but also socially in terms of the impacts on an entire generation of scientists who are just beginning their independent careers.

Why are early stage researchers hit hardest by stagnation or decreases in funding? Simply because access to funding is not a level playing field for all scientists, and is in fact highly dependent on career stage and experience. Therefore, increased competition for resources is expected to hit younger scientists disproportionately harder relative to established researchers because of many factors, including:

  • less experience in the art of writing grants,
  • less experience in reviewing grants,
  • less experience serving on grant panels,
  • shorter scientific and management track record,
  • and a less highly developed social network.

The specific negative effect that a general increase in resource competition has on young researchers is (in my view) the best explanation for the extremely worrying downward trends in the proportion of young PIs receiving NIH grants, and the increasing upward trend in the age to receipt of first RO1 in the USA, shown in the following diagrams from the NIH Rock Talk Blog:

Thankfully, this issue which is being discussed seriously by NIH’s Deputy Director for Extramural Research, Dr. Sally Rockey, as publication of these data attests to.  [I would very much welcome if other funding agencies published similar demographic breakdowns of their funding to address whether this is a global effect.] However, not all see these trends as worrying and interpret them on socially-neutral demographic grounds.

To help combat the inherent age-based iniquities in access research funding, funding agencies typically ring-fence funding for early-stage researchers under a “New Investigator” type umbrella. In fact, Sally Rockey provides a link to an impressive history of initiatives the NIH has undertaken to tackle the New Investigator issue. But what is striking to me is that despite putting a series of different New Investigator mechanisms in place, the negative impacts on early-stage researchers have only worsened over the last three decades. Thus New Investigator programmes are clearly not enough to redress this issue, and new solutions must be sought out. Furthermore ring-fencing funding for junior researchers necessarily creates an us-vs-them mentality, which can have counterproductive repercussions among different scientific cohorts. And while New Investigator programmes are widely supported in principle, trade-offs in resource allocation can lead to unstable to changes in policy, as witnessed in the case of the now-defunct NERC New Investigator programme.

So, what of it? Is this post just another bemoaning the sorry state of affairs in funding for early-stage researchers? No, or at least, not only. Actually, my motivation is to constructively propose a relatively simple (naive?) mechanism to fund research projects that can address the inequities in funding across career stages, but which also has the additional benefit of engendering mentorship and transfer of skills across the generations: the Junior/Senior Partnership Grant. [As with all (good) ideas, such a model has been proposed before by the Women’s Cancer Network, but does not appear to be adopted by major federal funding agencies.]

The idea behind a Junior/Senior Partnership funding “scheme” is simple. Based on some criteria (years since PhD or first tenure-track position, number of successful PI awards, number of wrinkles, etc.) researchers would be classified as Junior or Senior. Based on your classification, to be eligible for an award under such a programme, at least one Junior and one Senior PI would need to be co-applicants on grant and have distinct contributions to the grant and project management. This simple mechanism would ensure that young PIs get a piece of the funding pie and allow them to establish a track record, just as a New Investigator schemes do.  But it would also obviate the need for reform to rely on the altruistic stepping aside by Senior scientists to make way for their Junior colleagues, as there would be positive (financial) incentives for them to lend a hand down the generations. And by reconfiguring resource allocation from “us-vs-them” to “we’re-all-in-this-together,” Junior/Senior Partnership Grants would further provide a natural mechanism for Senior PIs to transfer expertise in grant writing and project management to their Junior colleagues in a meaningful way, rather than in the lip-service manner that is normally paid in most institutions. Finally, and most importantly, the knowledge transfer through such a scheme would strengthen the future expertise base in Science, which all indicators would suggest is currently at risk.

Related Posts:

From Electron to Retrotransposon: “-on” the Origin of a Common Suffix in Molecular Biology

Over the last year or so, I have become increasingly interested in understanding the origin of major concepts in genetics and molecular biology. This is driven by several motivating factors, primarily to cure my ignorance/satisfy my curiosity but also to be able to answer student queries more authoritatively and unearth unsolved questions in biology. One of the more interesting stories I have stumbled across relates to why so many terms in molecular biology (e.g. codon, replicon, exon, intron, transposon, etc.) end with the suffix “-on”? While nowhere as pervasive the “-ome” suffix that has contaminated biological parlance of late,  the suffix “-on” has clearly left its mark in some of the most frequently used terms the lexicon of molecular biology.

According to Brenner (1996) [1], the common use of the suffix “-on” in molecular biological terms can be traced to Seymour Benzer’s dissection of the fine structure of the rII locus in bacteriophage T4, which overturned the classical idea that a gene is an indivisible unit:

To mark this new view of the gene, Seymour invented new terms for the now different units of mutation, recombination and function. As he was a physicist, he modelled his terms on those of physics and just as electrons, protons and neutrons replaced the once indivisible atom, so genes came to be composed of mutons, recons and cistrons. The the unit of function, the cistron was based on the cis–trans complementation test, of which only the trans part is usually done…Of these terms, only cistron came to be widely used. It is conjectured that the other two, the muton and the recon, disappeared because Seymour failed to follow the first rule for inventing new words, which is to check what they may mean in other languages…Seymour’s pioneering invention of units was followed by a spate of other new names not all of which will survive. One that seems to have taken root is codon, which I invented in 1957; and the terms intron and exon, coined by Walter Gilbert, are certain to survive as well. Operon is moot; it is still frequently used in prokaryotic genetics but as the weight of research shifts to eukaryotes, which do not have such units of regulation, it may be lost. Replicon, invented by Francis Jacob and myself in 1962, seems also to have survived, despite the fact that we paid insufficient attention to how it sounded in other languages.

Thus, the fact that many molecular biological terms end in “-on” (initiated by Benzer) owes its origin to patterns of nomenclature in chemistry/nuclear physics (which itself began with Stoney’s proposal of the term electron in 1894) and the desire to identify “fundamental units” of biological structure and function.

While Brenner’s commentary provides a crucial first-hand account to understand the origin of these terms, it does not provide any primary references concerning the coining of these terms. So I’ve spent some time digging out the original usage for a number of more common molecular biology “-ons”, which I thought many be of use or interest to others.

The terms reconmuton and cistron were defined by Benzer (1957) [2] as follows:

  • Recon: “The unit of recombination will be defined as the smallest element in the one-dimensional array that is interchangeable (but not divisible) by genetic recombination. One such element will be referred to as a “recon.””
  • Muton: “The unit of mutation, the “muton” will be defined as the smallest element that, when altered, can give rise to a mutant form of the organism.”
  • Cistron: “A unit of function can be defined genetically, independent of biochemical information, by means of the elegant cistrans comparison devised by [Ed] Lewis…Such a map segment, corresponding to a function which is unitary as defined by the cistrans test applied to the heterocaryon, will be defined as a cistron.”

I have not been able to find a definitive first reference that defines the term codon the fundamental unit of the genetic code. According to Brenner (1996) [1] and US National Library of Medicine’s Profiles in Science webpage on Marshall Nirenberg [2], the term codon was introduced by Brenner in 1957 “to describe the fundamental units engaged in protein synthesis, even though the units had yet to be fully determined. Francis Crick popularized the term in 1959. After 1962, Nirenberg began to use “codon” to characterize the three-letter RNA code words” [3].

The term operon was introduced by Jacob et al. (1960) [4] and defined as follows (italics theirs):

  • Operon: “Celle-ci comprendrait des unités d’expression coordonée (opérons) constituées par un opérateur et le group de gènes de structure coodoneés par lui.”

The term replicon was introduced by Jacob and Brenner (1963) [4] and defined as follows (italics theirs):

  • Replicon: “Il est donc clair qu’un chromosome (de bactérie ou de phage) ou un épisome constitue une unité de réplication indépendante ou réplicon, dont la reproduction est régie par la présence et l’activité de certain déterminants qu’il porte. Les caractères des réplicons exigent qu’ils déterminent des systèmes spécifique gouvernant leur propre réplication.”

Near and dear to my heart is the term transposon, which was first introduced by Hedges and Jacob (1974) [7] (italics theirs):

  • Transposon: “We designate DNA sequences with transposition potential as transposons (units of transposition)”

The very commonly used terms intron and exon were defined by Gilbert (1978) [6] as follows:

  • Intron & Exon: “The notion of the cistron, the genetic unit of function that one thought to correspond to a polypeptide chain, must be replaced by that of a transcription unit containing regions which will be lost from the mature messenger – which I suggest we call introns (for intragenic regions) – alternating with regions which will be expressed – exons.”

And finally, Boeke et al. (1985) [8] defined the term retrotransposon in the following passage (italics theirs):

  • Retrotransposon: “These observations, together with the finding that introns are spliced out of the Ty upon transposition, suggest that reverse transcription is a step in the transposition of Ty elements…We therefore propose the term retrotransposon  for Ty and related elements.”

So there you have it, from electron to retrotransposon in just a few steps. I’ve left out some lesser used terms with this suffix for the moment (e.g. regulon, stimulon, modulon), so as not to let this post go -on and -on. If anyone has any major terms to add here or corrections to my reading of the tea leaves, please let me know in the comments below.

[1] Brenner, S. (1995) “Loose end: Molecular biology by numbers… one.” Current Biology 5(8): 964.
[2] Benzer, S. (1957) “The Elementary Units of Heredity.” in Symposium on the Chemical Basis of Heredity p. 70–93.  Johns Hopkins University Press
[4] Jacob, F., et al. (1960) “L’opéron: groupe de gènes à expression coordonnée par un opérateur.” C.R. Acad. Sci. Paris 250: 1727-1729.
[5] Jacob, F., and S. Brenner. (1963) “Sur la regulation de la synthese du DNA chez les bacteries: l’hypothese du replicon.” C. R. Acad. Sci 246: 298-300.
[6] Gilbert, W. (1978) “Why genes in pieces?.” Nature 271(5645): 501.
[7] Hedges, R. W., and A. E. Jacob. (1974) “Transposition of ampicillin resistance from RP4 to other replicons.” Molecular and General Genetics MGG 132(1): 31-40.
[8] Boeke, J.D., et al. (1985) “Ty elements transpose through an RNA intermediate.” Cell 40(3): 491.
Jim Shapiro (University of Chicago) gave very helpful pointers to possible places where the term “transposon” might have originally have been introduced.

Accelerating Your Science with arXiv and Google Scholar

As part of my recent conversion to using arXiv, I’ve been struck by how posting preprints arXiv synergizes incredibly well with Google Scholar. I’ve tried to make some of these points on Twitter and elsewhere, but I thought I’d try to summarize here what I see as a very powerful approach to accelerating Open Science using arXiv and several features of the Google Scholar toolkit. Part of the motivation for writing this post is that I’ve tried to make this same pitch to several of my colleagues, and was hoping to be able to point them to a coherent version of this argument, which might be of use for others as well.

A couple of preliminaries. First, the main point of this post is not about trying to convince people to post preprints to arXiv. The benefits of preprinting on arXiv are manifold (early availability of results, allowing others to build on your work sooner, prepublication feedback on your manuscript, feedback from many eyes not just 2-3 reviewers, availability of manuscript in open access format, mechanism to establish scientific priority, opportunity to publicize your work in blogs/twitter, increased duration for citations) and have been ably summarized elsewhere. This post is specifically about how one can get the most out of preprinting on arXiv by using Google Scholar tools.

Secondly, it is important to make sure people are aware of two relatively recent developments in the Google Scholar toolkit beyond the basic Google Scholar search functionality — namely, Google Scholar Citations and Google Scholar Updates. Google Scholar Citations allows users to build a personal profile of their publications, which draws in citation data from the Google Scholar database, allowing you to “check who is citing your publications, graph citations over time, and compute several citation metrics”, which also will “appear in Google Scholar results when people search for your name.” While Google Scholar Citations has been around for a little over a year now, I often find that many Scientists are either not aware that it exists, or have not activated their profile yet, even though it is scarily easy to set up. Another more recent feature available for those with active Google Scholar Citations profiles is called Google Scholar Updates, a tool that can analyze “your articles (as identified in your Scholar profile), scan the entire web looking for new articles relevant to your research, and then show you the most relevant articles when you visit Scholar”. As others have commented, Google Scholar Updates provides a big step forward in sifting through the scientific literature, since it provides a tailored set of articles delivered to your browser based on your previous publication record.

With these preliminaries in mind, what I want to discuss now is how a Google Scholar plays so well with preprints on arXiv to accelerate science when done in the Open. By posting preprint to arXiv and activating your Google Scholar Citation profile, you immediately gain several advantages, including the following:

  1. arXiv preprints are rapidly indexed by Google Scholar (with 1-2 days in my experience) and thus can be discovered easily by others using a standard Google Scholar search.
  2. arXiv preprints are listed in your Google Scholar profile, so when people browse your profile for your most recent papers they will find arXiv preprints at the top of the list (e.g. see Graham Coop’s Google Scholar profile here).
  3. Citations to your arXiv preprints are automatically updated in your Google Scholar profile, allowing you to see who is citing your most recent work.
  4. References included in your arXiv preprints will be indexed by Google Scholar and linked to citations in other people’s Google Scholar profiles, allowing them to find your arXiv preprint via citations to their work.
  5. Inclusion of an arXiv preprint in your Google Scholar profile allows Google Scholar Updates to provide better recommendations for what you should read, which is particularly important when you are moving into a new area of research that you have not previously published on.
  6. [Update June 14, 2013] Once Google Scholar has indexed your preprint on arXiv it will automatically generate a set of Related Articles, which you can browse to identify previously published work related to your manuscript.  This is especially useful at the preprint stage, since you can incorporate related articles you may have missed before submission or during revision.

I probably have overlooked other possible benefits of the synergy between these two technologies, since they are only dawning on me as I become more familiar with these symbiotic scholarly tools myself. What’s abundantly clear to me at this stage though is that by embracing Open Science and using arXiv together with Google Scholar puts you at a fairly substantial competitive advantage in terms of your scholarship, in ways that are simply not possible using the classical approach to publishing in biology.

Suggesting Reviewers in the Era of arXiv and Twitter

Along with many others in the evolutionary genetics community, I’ve recently converted to using arXiv as a preprint server for new papers from my lab. In so doing, I’ve confronted an unexpected ethical question concerning pre-printing and the use of social media, which I was hoping to generate some discussion about as this practice becomes more common in the scientific community. The question concerns the suggestion of reviewers for a journal submission of a paper that has previously been submitted to arXiv and then subsequently discussed on social media platforms like Twitter. Specifically put, the question is: is it ethical to suggest reviewers for a journal submission based on tweets about your arXiv preprint?

To see how this ethical issue arises, I’ll first describe my current workflow for submitting to arXiv and publicizing it on Twitter. Then, I’ll propose an alternative that might be considered to be “gaming” the system, and discuss precedents in the pre-social media world that might inform the resolution of this issue.

My current workflow for submission to arXiv and announcement on twitter is as follows:

  1. submit manuscript to a journal with suggested reviewers based on personal judgement;
  2. deposit the same version of the manuscript that was submitted to journal in arXiv;
  3. wait until arXiv submission is live and then tweet links to the arXiv preprint.

From doing this a few times (as well as benefiting from additional Twitter exposure via Haldane’s Sieve), I’ve realized that there can often be fairly substantive feedback about an arXiv submission via twitter in the form of who (re)tweets links to it and what people are saying about the manuscript. It doesn’t take much thought to realize that this information could potentially be used to influence a journal submission in the form of which reviewers to suggest or oppose using an alternative workflow:

  1. submit manuscript to arXiv;
  2. wait until arXiv submission is live and then tweet about it;
  3. moniter and assimilate feedback from Twitter;
  4. submit manuscript to journal with suggested and opposed reviewers based on Twitter activity.

This second workflow incidentally also arises under the first workflow if your initial journal submission is rejected, since there would naturally be a time lag in which it would be difficult to fully ignore activity on Twitter about an arXiv submission.

Now, I want to be clear that I haven’t and don’t intend to use the second workflow (yet), since I have not fully decided if this an ethical approach to suggesting reviewers. Nevertheless, I lean towards the view that it is no more or less ethical than the current mechanisms of selecting suggested reviewers based on: (1) perceived allies/rivals with relevant expertise or (2) informal feedback on the work in question presented at meetings.

In the former case of using who you perceive to be for or against your work, you are relying on personal experience and subjective opinions about researchers in your field, both good and bad, to inform your choice of suggested or opposed reviewers. This is some sense no different qualitatively to using information on Twitter prior to journal submission, but is instead based on a closed network using past information, rather than an open network using information specific to the piece of work in question. The latter case of suggesting reviewers based on feedback from meeting presentations is perhaps more similar to the matter at hand, and I suspect would be considered by most scientists to be a perfectly valid mechanism to suggest or oppose reviewers for a journal submission.

Now, of course I recognize that suggested reviewers are just that, and editors can use or ignore these suggestions as they wish, so this issue may in fact be moot. However, based on my experience, suggested reviewers are indeed frequently used by editors (if not, why would they be there?). Thus resolving whether smoking out opinions on Twitter is considered “fair play” is probably something the scientific community should consider more thoroughly in the near future, and I’d be happy to hear what other folks think about this in the comments below.

On the Preservation of Published Bioinformatics Code on Github

A few months back I posted a quick analysis of trends in where bioinformaticians choose to host their source code. A clear trend emerging in the bioinformatics community is to use github as the primary repository of bioinformatics code in published papers.  While I am a big fan of github and I support its widespread adoption, in that post I noted my concerns about the ease with which an individual can delete a published repository. In contrast to SourceForge, where it is extremely difficult to delete a repository once files have been released and this can only be done by SourceForge itself, deleting a repository on github takes only a few seconds and can be done (accidentally or intentionally) by the user who created the repository.

Just to see how easy this is, I’ve copied the process for deleting a repository on github here:

  • Go to the repo’s admin page

  • Click “Delete this repository”

  • Read the warnings and enter the name of the repository you want to delete
  • Click “I understand the consequences, delete this repository

Given the increasing use of github in publications, I feel the issue of repository deletion on github needs to be discussed by scientists and publishers more in the context of the important issue of long-term maintenance of published code. The reason I see this as important is that most github repositories are published via individual user accounts, and thus only one person holds the keys to preservation of the published code. Furthermore, I suspect funders, editors, publishers and (most) PIs have no idea how easy it is under the current model to delete published code. Call me a bit paranoid, but I see it is my responsibility as a PI to ensure the long-term preservation of published code, since I’m the one who signs off of data/resource plans in grants/final reports. Better to be safe than sorry, right?

On this note, I was pleased to see a retweet in my stream this week (via C. Titus Brown) concerning news that the journal Computers & Geosciences has adopted an official policy for hosting published code on github:

The mechanism that Computers & Geosciences has adopted to ensure long-term preservation of code in their journal is very simple – for the editor to fork code submitted by a github user into a journal organization (note: a similar idea was also suggested independently by Andrew Perry in the comments to my previous post). As clearly stated in the github repository deletion mechanism “Deleting a private repo will delete all forks of the repo. Deleting a public repo will not.” Thus, once Computers & Geosciences has forked the code, risk to the author, journal and community of a single point of failure is substantially ameliorated, with very little overhead to authors or publishers.

So what about the many other journals that have no such digital preservation policy but currently publish papers with bioinformatics code in github? Well, as a stopgap measure until other journals get on board with similar policies (PLOS & BMC, please lead the way!), I’ve taken the initiative to create a github organization called BioinformaticsArchive to serve this function. Currently, I’ve forked code for all but one of the 64 publications with github URLs in their PubMed record. One of the scary/interesting things to observe from this endeavor is just how fragile the current situation is. Of the 63 repositories I’ve forked, about 50% (n=31) had not been previously forked by any other user on github and could have been easily deleted, with consequent loss to the scientific community.

I am aware (thanks to Marc Robinson Rechavi) there are many more published github repositories in the full-text of articles (including two from our lab), which I will endeavor to dig out and add to this archive asap. If anyone else would like to help out with the endeavor, or knows of published repositories that should included, send me an email or tweet and I’ll add them to the archive. Comments on how to improve on the current state of preservation of published bioinformatics code on github and what can be learned form Computers and Geosciences new model policy are most welcome!

Why You Should Reject the “Rejection Improves Impact” Meme

Over the last two weeks, a meme has been making the rounds in the scientific twittersphere that goes something like “Rejection of a scientific manuscript improves its eventual impact”.  This idea is based a recent analysis of patterns of manuscript submission reported in Science by Calcagno et al., which has been actively touted in the scientific press and seems to have touched a nerve with many scientists.

Nature News reported on this article on the first day of its publication (11 Oct 2012), with the statement that “papers published after having first been rejected elsewhere receive significantly more citations on average than ones accepted on first submission” (emphasis mine). The Scientist led its piece on the same day entitled “The Benefits of Rejection” with the claim that “Chances are, if a researcher resubmits her work to another journal, it will be cited more often”. Science Insider led the next day with the claim that “Rejection before publication is rare, and for those who are forced to revise and resubmit, the process will boost your citation record”. Influential science media figure Ed Yong tweeted “What doesn’t kill you makes you stronger – papers get more citations if they were initially rejected”. The message from the scientific media is clear: submitting your papers to selective journals and having them rejected is ultimately worth it, since you’ll get more citations when they are published somewhere lower down the scientific publishing food chain.

I will take on faith that the primary result of Calcagno et al. that underlies this meme is sound, since it has been vetted by the highest standard of editorial and peer review at Science magazine. However, I do note that it not possible to independently verify this result since the raw data for this analysis was not made available at the time of publication (contravening Science’s “Making Data Maximally Available Policy“), and has not been made available even after being queried. What I want to explore here is why this meme is so uncritically being propagated in the scientific press and twittersphere.

As succinctly noted by Joe Pickrell, anyone who takes even a cursory look at the basis for this claim would see that it is at best a weak effect*, and is clearly being overblown by the media and scientists alike.

Taken at face value, the way I read this graph is that papers that are rejected then published elsewhere have a median value of ~0.95 citations, whereas papers that are accepted at the first journal they are submitted to have a median value of ~0.90 citations. Although not explicitly stated in the figure legend or in the main text, I assume these results are on a natural log scale since, based on the font and layout, this plot was most likely made in R and the natural scale is the default in R (also, the authors refer the natural scale in a different figure earlier in the text). Thus, the median number of citations per article that rejection may provide an author is on the order of ~0.1.  Even if this result is on the log10 scale, this difference translates to a boost of less than one citation.  While statistically significant, this can hardly be described as a “significant increase” in citation. Still excited?

More importantly, the analysis of the effects of rejection on citation is univariate and ignores all most other possible confounding explanatory variables.  It is easy to imagine a large number of other confounding effects that could lead to this weak difference (number of reviews obtained, choice of original and final journals, number of authors, rejection rate/citation differences among discipline or subdiscipline, etc., etc.). In fact, in panel B of the same figure 4, the authors show a stronger effect of changing discipline on the number of citations in resubmitted manuscripts. Why a deeper multivariate analysis was not performed to back up the headline claim that “rejection improves impact” is hard to understand from a critical perspective. [UPDATE 26/10/2012: Bala Iyengar pointed out to me a page on the author’s website that discusses the effects of controlling for year and publishing journal on the citation effect, which led me to re-read the paper and supplemental materials more closely and see that these two factors are in fact controlled for in the main analysis of the paper. No other possible confounding factors are controlled for however.]

So what is going on here? Why did Science allow such a weak effect with a relatively superficial analysis to be published in the one of the supposedly most selective journals? Why are major science media outlets pushing this incredibly small boost in citations that is (possibly) associated with rejection? Likewise, why are scientists so uncritically posting links to the Nature and Scientist news pieces and repeating “Rejection Improves Impact” meme?

I believe the answer to the first two questions is clear: Nature and Science have a vested interest in making the case that it is in the best interest of scientists to submit their most important work to (their) highly selective journals and risk having it be rejected.  This gives Nature and Science first crack at selecting the best science and serves to maintain their hegemony in the scientific publishing marketplace. If this interpretation is true, it is an incredibly self-serving stance for Nature and Science to take, and one that may back-fire since, on the whole, scientists are not stupid people who blindly accept nonsense. More importantly though, using the pages of Science and Nature as a marketing campaign to convince scientists to submit their work to these journals risks their credibility as arbiters of “truth”. If Science and Nature go so far as to publish and hype weak, self-serving scientometric effects to get us to submit our work there, what’s to say that would they not do the same for actual scientific results?

But why are scientists taking the bait on this one?  This is more difficult to understand, but most likely has to do with the possibility that most people repeating this meme have not read the paper. Topsy records over 700 and 150 tweets to the Nature News and Scientist news pieces, but only ~10 posts to the original article in Science. Taken at face value, roughly 80-fold more scientists are reading the news about this article than reading the article itself. To be fair, this is due in part to the fact that the article is not open access and is behind a paywall, whereas the news pieces are freely available**. But this is only the proximal cause. The ultimate cause is likely that many scientists are happy to receive (uncritically, it seems) any justification, however tenuous, for continuing to play the high-impact factor journal sweepstakes. Now we have a scientifically valid reason to take the risk of being rejected by top-tier journals, even if it doesn’t pay off. Right? Right?

The real shame in the “Rejection Improves Impact” spin is that an important take-home message of Calcagno et al. is that the vast majority of papers (>75%) are published in the first journal to which they are submitted.  As a scientific community we should continue to maintain and improve this trend, selecting the appropriate home for our work on initial submission. Justifying pipe-dreams that waste precious time based on self-serving spin that benefits the closed-access publishing industry should be firmly: Rejected.

Don’t worry, it’s probably in the best interest of Science and Nature that you believe this meme.

* To be fair, Science Insider does acknowledge that the effect is weak: “previously rejected papers had a slight bump in the number of times they were cited by other papers” (emphasis mine).

** Following a link available on the author’s website, you can access this article for free here.

Calcagno, V., Demoinet, E., Gollner, K., Guidi, L., Ruths, D., & de Mazancourt, C. (2012). Flows of Research Manuscripts Among Scientific Journals Reveal Hidden Submission Patterns Science DOI: 10.1126/science.1227833

Related Posts

On The Neutral Sequence Fallacy

Beginning in the late 1960s, Motoo Kimura overturned over a century of “pan-selectionist” thinking in evolutionary biology by proposing what has come to be called The Neutral Theory of Molecular Evolution. The Neutral Theory in its basic form states that the dynamics of the majority of changes observed at the molecular level are governed by the force of Genetic Drift, rather than Darwinian (i.e. Positive) Natural Selection. As with all paradigm shifts in Science, there was much of controversy over the Neutral Theory in its early years, but nevertheless the Neutral Theory has firmly established itself as the null hypothesis for studies of evolution at the molecular level since the mid-1980s.

Despite its widespread adoption, over the last ten years or so there has been a worrying increase in abuse of terminology concerning the Neutral Theory, which I will collectively term here the “Neutral Sequence Fallacy” (inspired by T. Ryan Gregory’s Platypus Fallacy). The Neutral Sequence Fallacy arises when the distinct concepts of functional constraint and selective neutrality are conflated, leading to the mistaken description of functionally unconstrained sequences as being “Neutral”. The Fallacy, in short, is to assign the term Neutral to a particular biomolecular sequence.

The Neutral Sequence Fallacy now routinely causes problems in the fields of evolutionary and genome biology, both in terms of generating conceptual muddles as well as shifting the goalposts needed to reject the null model of sequence evolution. I have intended to write about this problem for years in order to put a halt to this growing abuse of Neutral terminology, but unfortunately never found the time. However, this issue has unfortunately reared its head more strongly in the last few days with new forms of the Neutral Sequence Fallacy arising in the context of discussions about the ENCODE project, motivating a rough version of this critique to finally see the light of day. Here I will try to sketch out the origins of the Neutral Sequence Fallacy, in its original pre-genomic form that was debunked by Kimura while he was alive, and in its modern post-genomic form that has proliferated unchecked since the early comparative genomic era.

The Neutral Sequence Fallacy draws on several misconceptions about the Neutral Theory, and begins with the abbreviation of the theory’s name from its full form (The Neutral Mutation – Random Drift Hypothesis) to its colloquial form (The Neutral Theory). This abbreviation de-emphasizes that the concept of selective neutrality applies to mutations (i.e. variants, alleles), not biomolecular sequences (i.e. regions of the genome, proteins). Simply put, only variants of a sequence can be neutral or non-neutral, not sequences themselves.

The key misconception that permits the Neutral Sequence Fallacy to flourish is the incorrect notion that if a sequence is neutrally evolving, it implies a lack of functional constraint operating on that sequence, and vice versa. Other ways to state this misconception are: “a sequence is Neutral if it is under no selective constraint” or conversely “selective constraint rejects Neutrality”. This misconception arose originally in the 1970s, shortly after the proposal of The Neutral Theory when many researchers were first coming to terms with what the theory meant. This misconception became prevalent enough that it was the first to be addressed head-on by Kimura (1983) nearly 30 years ago in section 3.6 of his book The Neutral Theory of Molecular Evolution entitled “On some misunderstandings and criticisms” (emphasis is mine):

Since a number of criticisms and comments have been made regarding my neutral theory, often based on misunderstandings, I would like to take this opportunity to discuss some of them. The neutral theory by no means claims that the genes involved are functionless as mistakenly suggested by Zuckerkandl (1978). They may or may not be, but what the neutral theory assumes is that the mutant forms of each gene participating in molecular evolution are selectively nearly equivalent, that is, they can do the job equally well in terms of survival and reproduction of the individual. (p. 50)

As pointed out by Kimura and Ohta (1977), functional constraints are consistent with neutral substitutions within a class of mutants. For example, if a group of amino acids are constrained to be hydrophilic, there can be random changes within the codons producing such amino acids…There is, of course, negative selection against hydrophobic mutants in this region, but, as mentioned before, negative selection does not contradict the neutral theory.  (p. 53)

It is understandable how this misconception arises, because in the limit of zero functional constraint (e.g. in a non-functional pseudogene), all alleles become effectively equivalent to one another and are therefore selectively neutral. However, this does not mean that an unconstrained sequence is Neutral (unless we redefine the meaning of Neutrality, see below), because a sequence itself cannot be Neutral, only variants of a sequence can be Neutral with respect to each other.

It is crucial in this context to understand that the Neutral Theory accommodates all levels of selective constraint, and sequences under selective constraint can evolve Neutrally (see formal statement of this in Equation 5.1 of Kimura 1983). This point is often lost on many people. Until you get this, you don’t understand the Neutral Theory. A simple example shows how this is true. Consider a single codon in a protein coding region that codes for a degenerate amino acid. Deletion of the third codon position would creat a frameshift, and thus a third position “silent” site is indeed functional. However, alternative codons for this amino acid are functionally equivalent and evolve (close to) neutrally. The fact that these alternative alleles evolve neutrally has to do with their equivalence of function, not the degree of their functional constraint.


To demonstrate the The Neutral Sequence Fallacy, I’d like to point out a few clear examples of this misconception in action.  The majority of transgressions in this area come from the genomics community where people may not have been formally trained in evolution, but I am sad to say that an increasing number of evolutionary biologists are also falling victim to The Neutral Sequence Fallacy these days. My reckoning is that the The Neutral Sequence Fallacy gained traction again in the post-genomic era around the time of the mouse genome paper by Waterston et al. (2002). In this widely-read paper, putatively unconstrained ancestral repeats were referred to (incorrectly) as “neutrally evolving DNA”, and used to estimate the fraction of the human genome under selective constraint. This analysis culminated with the following question: “How can we cleanly separate neutral and selected sequences?”. Under the Neutral Theory, this question makes no sense. First, sequences cannot be neutral; and second the framework used to detect functional constraints by comparative genomics assumes Neutral evolution of both classes of sites (unconstrained and constrained) – i.e. most changes between species are driven by Genetic Drift not Positive Selection. The proper formulation of this question should have been: “How can we cleanly separate unconstrained and constrained sequences?”.

Here is another clear example of the Neutral Sequence Fallacy in action from Lunter et al. (2006):

Figure 5 from Lunter et al. (2006). Notice how in the top panel, regions of the genome are contrasted as being “Neutral” vs. “Functional”. Here the term “Neutral” is being used incorrectly to mean selectively unconstrained. The bottom panel shows how indels are suppressed in Functional regions leading to intergap segments.

Here are a couple of more examples of the Neutral Sequence Fallacy in action, right in the title of fairly high-profile comparative genomics papers:

Title from Enlistki et al (2003). Notice that the functional class of “Regulatory DNA” is incorrectly contrasted as being the complement of nonfunctional “Neutral Sites”. In fact, both classes of sites are assumed to evolve neutrally in the authors’ model.

Title from Chin et al (2005). As above, notice how the concept of “Functionally conserved” is incorrectly stated to be the opposite of “Neutral sequence” and both classes of sites are assumed to evolve neutrally in the authors’ model.

I don’t mean to single these papers out, they just happen to represent very clear examples of the Neutral Sequence Fallacy in action. In fact, the Lunter et al. (2006) paper is one of my all time favorites, but it bugs the hell out of me when I have to unpick student’s misconceptions after they read it. Frustratingly, the list of papers repeating the Neutral Sequence Fallacy is long and growing. I have recently started to collect them as a citeulike library to provide examples for students to understand how not to make this common mistake. (If anyone else would like to contribute to this effort, please let me know — there is much work to be done to reverse this trend.)


So what’s the big deal here?  Some would argue that these authors actually know what they are talking about, but they just happen to be using the wrong terminology. I wish that this were the case, but very often it is not. In many papers that I read or review that perpetrate the Neutral Sequence Fallacy, I usually find further examples of seriously flawed evolutionary reasoning, suggesting that they actually do not have a deep understanding of the issues at hand. In fact, evidence of the Neutral Sequence Fallacy is usually a clear hallmark in a paper that the authors are most likely practicing population genetics or molecular evolution without a license. This leads to a Neutral Sequence Fallacy of the 1st Kind: where authors do not understand the difference between the concepts functional constraint and selective neutrality. The problems for the Neutral Theory caused by violations of the 1st Kind are deep and clear. Because the Neutral Theory is not fully understood, it is possible to construct a straw-man version of the null hypothesis of Neutrality that can easily be “rejected” simply by finding evidence of selective constraint. Furthermore, because selectively unconstrained sequences are asserted (incorrectly) to be “Neutral” without actually evaluating their mode of evolution, this conceptual error undermines the entire value of the Neutral Theory as a null hypothesis testing framework.

But some authors really do know the difference between these ideas, and just happen to be using the term “Neutral” as shorthand for the term “Unconstrained.” Increasingly, I see some of my respected peers making this mistake in print who are card-carrying molecular evolutionists and do know their stuff. In these cases what is happening is a Neutral Sequence Fallacy of the 2nd Kind: understanding the difference between functional constraint and selective neutrality, but using lazy terminology that confuses these ideas in print. This is most often found in the context of studies on noncoding DNA where, in the absence of the genetic code to conveniently constrain terminology, people use terms like “neutral standard” or “neutral region” or “neutral sites” or “neutral proxy” in place of  “putatively unconstrained”. While the meaning of violations of the 2nd Kind can be overlooked and parsed correctly by experts in molecular evolution (I hope), this sloppy language causes substantial confusion about the Neutral Theory by students or non-evolutionary biologists who are new to the field, and leads to whole swathes of subsequent violations of the 1st Kind. Moreover, defining sequences as Neutral serves those with an Adaptationist agenda: since a control region is defined as being Neutral, all mutations that occur in that region must therefore be neutral as well, and thus any potential complications of the non-neutrality of mutations in one’s control region are conveniently swept under the carpet. Violations of the 2nd Kind are often quite insidious since they are generally perpetrated by people with some authority in evolutionary biology, often who are unaware of their misuse of terminology and who will vigorously deny that they are using terms which perpetuate a classical misconception laid to rest by Kimura 30 years ago.


Which brings us to the most recent incarnation of the Neutral Sequence Fallacy in the context of the ENCODE project. In a companion post explaining the main findings of the ENCODE Project, Ewan Birney describes how the ENCODE Project reinforced recent findings that many biochemical events operate on the genome that are highly reproducible, but have no known function. In describing these event, Birney states:

I really hate the phrase “biological noise” in this context. I would argue that “biologically neutral” is the better term, expressing that there are totally reproducible, cell-type-specific biochemical events that natural selection does not care about. This is similar to the neutral theory of amino acid evolution, which suggests that most amino acid changes are not selected either for or against…Whichever term you use, we can agree that some of these events are “neutral” and are not relevant for evolution.

Under the standard view of the Neutral Theory, Birney misuses the term “Neutral” here to mean lack of functional constraint, repeating the classical form of the Neutral Sequence Fallacy. Because of this, I argue that Birney’s proposed terminology be rejected, since it will perpetuate a classic misconception in Biology. Instead, I propose the term “biologically inert”.

But wait a minute, you say, this is actually a transgression of the 2nd Kind. Really what is going on here is a matter of semantics. Birney knows the difference between functional constraint and selective neutrality. He is just formalizing the creeping misuse of the term Neutral to mean “Nonfunctional” that has been happening over the last decade.  If so, then I argue he is proposing to assign to the term Neutral the primary misconception of the Neutral Theory previously debunked by Kimura. This is a very dangerous proposal, since it will lead to further confusion in genomics arising from the “overloading” of the term Neutral (Kimura’s meaning: selectively equivalent; Birney’s meaning: no functional constraint). This muddle will subsequently prevent most scientists from properly understanding the Neutral Theory, and lead to many further examples of the Neutral Sequence Fallacy of both Kinds.

In my view, semantic switches like this are dangerous in Science, since they massively hinder communication and, therefore, progress. Semantic switches also lead to a distortion of understanding about key concepts in science. A famous case in point is Watson’s semantic switch of Crick’s term “Central Dogma” that corrupted Crick’s beautifully crafted original concept into the watered down textbook misinterpretation that is most often repeated: “DNA makes RNA make protein” (See Larry Moran’s blog for more on this).  Some may say this is the great thing about language, the same word can mean different things to different people. This view is best characterized in the immortal words of Humpty-Dumpty in Lewis Carroll’s Through the Looking Glass:

Others, including myself, disagree and prefer to have fixed definitions for scientific terms.

In a second recent case of the Neutral Sequence Fallacy creeping into discussions in the context of ENCODE, Michael Eisen proposes that we develop a “A neutral theory of molecular function” to interpret the meaning of these reproducible biochemical events that have no known function. Inspired by the introduction of a new null hypothesis in evolutionary biology ushered in by the Neutral Theory, Eisen calls for a new “neutral null hypothesis” that requires the molecular functions to be proven, not assumed. I laud any attempt to promote the use of null models for hypothesis testing in molecular biology, and whole-heartedly agree with Eisen’s main message about the need for a null model for molecular function.

But I disagree with Eisen’s proposal for a “neutral null hypothesis”, which from my reading of his piece, directly couples the null hypothesis for function with the null hypothesis for sequence evolution. By synonymizing the Ho of the functional model with the Ho of the evolutionary model, regions of the genome that would fail to reject the null functional model (i.e. have no functional constraint) will then be conflated with “being Neutral” (incorrect) or evolving neutrally (potentially correct), whereas those regions that reject the null functional model will be immediately considered as evolving non-neutrally (which may not always be the case since functional regions can evolve neutrally). While I assume this is not what is intended by Eisen, this is almost inevitably the outcome of suggesting a “neutral null hypothesis” in the context of biomolecular sequences. A “neutral null hypothesis for molecular function” makes it all to easy to merge the concepts of functional constraint and selective neutrality, which will inevitably lead many to the Neutral Sequence Fallacy. As Kimura does, Eisen should formally decouple the concept of functional constraint on a sequence from the mode of evolution by which that sequence evolves. Eisen should instead be promoting a “A null model of molecular function” that cleanly separates the concepts of function and evolution (an example of such a null model is embodied in Sean Eddy’s Random Genome Project). If not, I fear this conflation of concepts, like Birney’s semantic switch, will lead to more examples of the Neutral Sequence Fallacy of both Kinds.


The Neutral Sequence Fallacy shares many sociological similarities with the chronic misuse and misconceptions about the concept of Homology. As discussed by Marabotti and Facchiano in their article “When it comes to homology, bad habits die hard“, there was a peak of misuse of the term Homology in the mid-1980s, which lead to backlash of many publications demanding more rigorous use of the term Homology. Despite this backlash and the best efforts of many scientists to stem the tide of misuse of Homology, ~43% of abstracts surveyed in 2007 use Homology incorrectly, down from 51% in 1986 before the assault on its misuse began. As anyone teaching the concept knows, unpicking misconceptions about Homology vs. Similarity is crucial for getting students to understand evolutionary theory. I argue that the same is true for the distinction between Functional Constraint and Selective Neutrality. When it comes to Functional Constraints on biomolecular sequences, our choice of terminology should be anything but Neutral.


Chin CS, Chuang JH, & Li H (2005). Genome-wide regulatory complexity in yeast promoters: separation of functionally conserved and neutral sequence. Genome research, 15 (2), 205-13 PMID: 15653830

Elnitski L, Hardison RC, Li J, Yang S, Kolbe D, Eswara P, O’Connor MJ, Schwartz S, Miller W, & Chiaromonte F (2003). Distinguishing regulatory DNA from neutral sites. Genome research, 13 (1), 64-72 PMID: 12529307

Lunter G, Ponting CP, & Hein J (2006). Genome-wide identification of human functional DNA using a neutral indel model. PLoS computational biology, 2 (1) PMID: 16410828

Marabotti A, & Facchiano A (2009). When it comes to homology, bad habits die hard. Trends in biochemical sciences, 34 (3), 98-9 PMID: 19181528

Waterston RH, Lindblad-Toh K, Birney E, Rogers J, Abril JF, Agarwal P, Agarwala R, Ainscough R, Alexandersson M, An P, Antonarakis SE, Attwood J, Baertsch R, Bailey J, Barlow K, Beck S, Berry E, Birren B, Bloom T, Bork P, Botcherby M, Bray N, Brent MR, Brown DG, Brown SD, Bult C, Burton J, Butler J, Campbell RD, Carninci P, Cawley S, Chiaromonte F, Chinwalla AT, Church DM, Clamp M, Clee C, Collins FS, Cook LL, Copley RR, Coulson A, Couronne O, Cuff J, Curwen V, Cutts T, Daly M, David R, Davies J, Delehaunty KD, Deri J, Dermitzakis ET, Dewey C, Dickens NJ, Diekhans M, Dodge S, Dubchak I, Dunn DM, Eddy SR, Elnitski L, Emes RD, Eswara P, Eyras E, Felsenfeld A, Fewell GA, Flicek P, Foley K, Frankel WN, Fulton LA, Fulton RS, Furey TS, Gage D, Gibbs RA, Glusman G, Gnerre S, Goldman N, Goodstadt L, Grafham D, Graves TA, Green ED, Gregory S, Guigó R, Guyer M, Hardison RC, Haussler D, Hayashizaki Y, Hillier LW, Hinrichs A, Hlavina W, Holzer T, Hsu F, Hua A, Hubbard T, Hunt A, Jackson I, Jaffe DB, Johnson LS, Jones M, Jones TA, Joy A, Kamal M, Karlsson EK, Karolchik D, Kasprzyk A, Kawai J, Keibler E, Kells C, Kent WJ, Kirby A, Kolbe DL, Korf I, Kucherlapati RS, Kulbokas EJ, Kulp D, Landers T, Leger JP, Leonard S, Letunic I, Levine R, Li J, Li M, Lloyd C, Lucas S, Ma B, Maglott DR, Mardis ER, Matthews L, Mauceli E, Mayer JH, McCarthy M, McCombie WR, McLaren S, McLay K, McPherson JD, Meldrim J, Meredith B, Mesirov JP, Miller W, Miner TL, Mongin E, Montgomery KT, Morgan M, Mott R, Mullikin JC, Muzny DM, Nash WE, Nelson JO, Nhan MN, Nicol R, Ning Z, Nusbaum C, O’Connor MJ, Okazaki Y, Oliver K, Overton-Larty E, Pachter L, Parra G, Pepin KH, Peterson J, Pevzner P, Plumb R, Pohl CS, Poliakov A, Ponce TC, Ponting CP, Potter S, Quail M, Reymond A, Roe BA, Roskin KM, Rubin EM, Rust AG, Santos R, Sapojnikov V, Schultz B, Schultz J, Schwartz MS, Schwartz S, Scott C, Seaman S, Searle S, Sharpe T, Sheridan A, Shownkeen R, Sims S, Singer JB, Slater G, Smit A, Smith DR, Spencer B, Stabenau A, Stange-Thomann N, Sugnet C, Suyama M, Tesler G, Thompson J, Torrents D, Trevaskis E, Tromp J, Ucla C, Ureta-Vidal A, Vinson JP, Von Niederhausern AC, Wade CM, Wall M, Weber RJ, Weiss RB, Wendl MC, West AP, Wetterstrand K, Wheeler R, Whelan S, Wierzbowski J, Willey D, Williams S, Wilson RK, Winter E, Worley KC, Wyman D, Yang S, Yang SP, Zdobnov EM, Zody MC, & Lander ES (2002). Initial sequencing and comparative analysis of the mouse genome. Nature, 420 (6915), 520-62 PMID: 12466850


Thanks to Chip Aquadro for originally pointing out to me when I perpetrated the Neutral Sequence Fallacy (of the 1st Kind!) during a journal club as an undergraduate in his lab. I can distinctly recall hot, embarrassment of the moment while being schooled in this important issue by a master. Thanks also to Alan Moses, who was the first of many people I converted to the light on this issue, and who has encouraged me since to write this up for a wider audience. Thanks also to Douda Bensasson for putting up with me ranting about this issue for years, and helpful comments on this post.

Related Posts:

The Cost to Science of the ENCODE Publication Embargo

The big buzz in the genomics twittersphere today is the release of over 30 publications on the human ENCODE project. This is a heroic achievement, both in terms of science and publishing, with many groundbreaking discoveries in biology and pioneering developments in publishing to be found in this set of papers. It is a triumph that all of these papers are freely available to read, and much is being said elsewhere in the blogosphere about the virtues of this project and the lessons learned from the publication of these data. I’d like to pick up here on an important point made by Daniel MacArthur in his post about the delays in the publication of these landmark papers that have arisen from the common practice of embargoing papers in genomics. To be clear, I am not talking about embargoing the use of data (which is also problematic), but embargoing the release of manuscripts that have been accepted for publication after peer review.

MacArthur writes:

Many of us in the genomics community were aware of the progress the [ENCODE] project had been making via conference presentations and hallway conversations with participants. However, many other researchers who might have benefited from early access to the ENCODE data simply weren’t aware of its existence until today’s dramatic announcement – and as a result, these people are 6-12 months behind in their analyses.

It is important to emphasize that these publication delays are by design, and are driven primarily by the journals that set the publication schedules for major genomics papers. I saw first-hand how Nature sets the agenda for major genomics papers and their associated companion papers as part of the Drosophila 12 Genomes Project. This insider’s view left a distinctly bad taste in my mouth about how much control a single journal has over some of the most important community resource papers that are published in Biology.  To give more people insight into this process, I am posting the agenda set by Nature for publication (in reverse chronological order) of the main Drosophila 12 Genomes paper, which went something like this:

7 Nov 2007: papers are published, embargo lifted on main/companion papers
28 Sept 2007: papers must be in production
21 Sept 2007: revised versions of papers received
17 Aug 2007: reviews are returned to authors
27 Jul 2007: papers are submitted

Not only was acceptance of the manuscript essentially assumed by the Nature editorial staff, the entire timeline was spelled out in advance, with an embargo built in to the process from the outset. Seeing this process unfold first hand was shocking to me, and has made me very skeptical of the power that the major journals have to dictate terms about how we, and other journals, publish our work.

Personally, I cannot see how this embargo system serves anyone in science other than the major journals. There is no valid scientific reason that major genome papers and their companions cannot be made available as online accepted preprints, as is now standard practice in the publishing industry. As scientists, we have a duty to ensure that the science we produce is released to the general public and community of scientists as rapidly and openly as possible. We do not have a duty to serve the agenda of a journal to increase their cachet or revenue stream. I am aware that we need to accept delays due to quality control via the peer review and publication process. But the delays due to the normal peer review process are bad enough, as ably discussed recently by Leslie Voshall. Why on earth would we accept that journals build in further unnecessary delays into the publication process?

This of course leads to the pertinent question: how harmful is this system of embargoes? Well, we can estimate put an upper estimate on * this pretty easily from the submission/acceptance dates of the main and companion ENCODE papers (see table below). In general, most ENCODE papers were embargoed for a minimum of 2 months but some were embargoed for up to nearly 7 months. Ignoring (unfairly) the direct impact that these delays may have on the careers of PhD students and post-docs involved, something on the order of 112 months of access to these important papers have been lost to all scientists by this single embargo. Put another way, nearly up to * 10 years of access time to these papers has been collectively lost to science because of the ENCODE embargo. To the extent that these papers are crucial for understanding the human genome, and the consequences this knowledge has for human health, this decade lost to humanity is clearly unacceptable. Let us hope that the ENCODE project puts an end to the era of journal-mandated embargoes in genomics.

DOI Date Received Date Accepted Date published Months in review Months in embargo
nature11247 24-Nov-11 29-May-12 05-Sep-12 6.0 3.2
nature11233 10-Dec-11 15-May-12 05-Sep-12 5.1 3.6
nature11232 15-Dec-11 15-May-12 05-Sep-12 4.9 3.6
nature11212 11-Dec-11 10-May-12 05-Sep-12 4.9 3.8
nature11245 09-Dec-11 22-May-12 05-Sep-12 5.3 3.4
nature11279 09-Dec-11 01-Jun-12 05-Sep-12 5.6 3.1
gr.134445.111 06-Nov-11 07-Feb-12 05-Sep-12 3.0 6.8
gr.134957.111 16-Nov-11 01-May-12 05-Sep-12 5.4 4.1
gr.133553.111 17-Oct-11 05-Jun-12 05-Sep-12 7.5 3.0
gr.134767.111 11-Nov-11 03-May-12 05-Sep-12 5.6 4.0
gr.136838.111 21-Dec-11 30-Apr-12 05-Sep-12 4.2 4.1
gr.127761.111 16-Jun-11 27-Mar-12 05-Sep-12 9.2 5.2
gr.136101.111 09-Dec-11 30-Apr-12 05-Sep-12 4.6 4.1
gr.134890.111 23-Nov-11 10-May-12 05-Sep-12 5.5 3.8
gr.134478.111 07-Nov-11 01-May-12 05-Sep-12 5.7 4.1
gr.135129.111 21-Nov-11 08-Jun-12 05-Sep-12 6.5 2.9
gr.127712.111 15-Jun-11 27-Mar-12 05-Sep-12 9.2 5.2
gr.136366.111 13-Dec-11 04-May-12 05-Sep-12 4.6 4.0
gr.136127.111 16-Dec-11 24-May-12 05-Sep-12 5.2 3.4
gr.135350.111 25-Nov-11 22-May-12 05-Sep-12 5.8 3.4
gr.132159.111 17-Sep-11 07-Mar-12 05-Sep-12 5.5 5.9
gr.137323.112 05-Jan-12 02-May-12 05-Sep-12 3.8 4.1
gr.139105.112 25-Mar-12 07-Jun-12 05-Sep-12 2.4 2.9
gr.136184.111 10-Dec-11 10-May-12 05-Sep-12 4.9 3.8
gb-2012-13-9-r48 21-Dec-11 08-Jun-12 05-Sep-12 5.5 2.9
gb-2012-13-9-r49 28-Mar-12 08-Jun-12 05-Sep-12 2.3 2.9
gb-2012-13-9-r50 04-Dec-11 18-Jun-12 05-Sep-12 6.4 2.5
gb-2012-13-9-r51 23-Mar-12 25-Jun-12 05-Sep-12 3.0 2.3
gb-2012-13-9-r52 09-Mar-12 25-May-12 05-Sep-12 2.5 3.3
gb-2012-13-9-r53 29-Mar-12 19-Jun-12 05-Sep-12 2.6 2.5
Min 2.3 2.3
Max 9.2 6.8
Avg 5.1 3.7
Sum 152.7 112.1


* Based on a converation on twitter with Chris Cole, I’ve revised this to be estimate to reflect the upper bound, rather than a point estimate of time lost to science.