Why You Should Reject the “Rejection Improves Impact” Meme

Over the last two weeks, a meme has been making the rounds in the scientific twittersphere that goes something like “Rejection of a scientific manuscript improves its eventual impact”. This idea is based on a recent analysis of patterns of manuscript submission reported in Science by Calcagno et al., which has been actively touted in the scientific press and seems to have touched a nerve with many scientists.

Nature News reported on this article on the first day of its publication (11 Oct 2012), with the statement that “papers published after having first been rejected elsewhere receive significantly more citations on average than ones accepted on first submission” (emphasis mine). The Scientist led its same-day piece, entitled “The Benefits of Rejection”, with the claim that “Chances are, if a researcher resubmits her work to another journal, it will be cited more often”. Science Insider led the next day with the claim that “Rejection before publication is rare, and for those who are forced to revise and resubmit, the process will boost your citation record”. Influential science media figure Ed Yong tweeted “What doesn’t kill you makes you stronger – papers get more citations if they were initially rejected”. The message from the scientific media is clear: submitting your papers to selective journals and having them rejected is ultimately worth it, since you’ll get more citations when they are published somewhere lower down the scientific publishing food chain.

I will take on faith that the primary result of Calcagno et al. that underlies this meme is sound, since it has been vetted by the highest standard of editorial and peer review at Science magazine. However, I do note that it is not possible to independently verify this result, since the raw data for this analysis was not made available at the time of publication (contravening Science’s “Making Data Maximally Available Policy“), and has not been made available even after being requested. What I want to explore here is why this meme is being so uncritically propagated in the scientific press and twittersphere.

As succinctly noted by Joe Pickrell, anyone who takes even a cursory look at the basis for this claim would see that it is at best a weak effect*, and is clearly being overblown by the media and scientists alike.

Taken at face value, the way I read this graph is that papers that are rejected and then published elsewhere have a median value of ~0.95 on the (log-scaled) citation axis, whereas papers that are accepted at the first journal they are submitted to have a median value of ~0.90. Although not explicitly stated in the figure legend or in the main text, I assume these results are on a natural log scale since, based on the font and layout, this plot was most likely made in R, and natural log is the default in R (also, the authors refer to the natural scale in a different figure earlier in the text). Thus, the median number of extra citations per article that rejection may provide an author is on the order of ~0.1. Even if this result is on the log10 scale, this difference translates to a boost of less than one citation. While statistically significant, this can hardly be described as a “significant increase” in citation. Still excited?
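To make the size of this effect concrete, here is a quick back-of-the-envelope calculation in Python, using my eyeballed median values of ~0.95 and ~0.90 from the figure (these are my estimates, not exact numbers from the paper), under both the natural log and log10 readings of the axis:

```python
import math

# Eyeballed medians (log-scale citations) for rejected-then-published
# vs. accepted-on-first-submission papers, read off the figure.
rejected, accepted = 0.95, 0.90

# If the axis is natural log, back-transform with exp():
print(math.exp(rejected) - math.exp(accepted))  # ~0.13 extra citations

# If the axis is log10, back-transform with 10**x:
print(10**rejected - 10**accepted)              # ~0.97 extra citations
```

Either way, the back-transformed gap is somewhere between a small fraction of a citation and about one citation per paper, in line with the numbers quoted above.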

More importantly, the analysis of the effects of rejection on citation is univariate and ignores most other possible confounding explanatory variables. It is easy to imagine a large number of other confounding effects that could lead to this weak difference (number of reviews obtained, choice of original and final journals, number of authors, rejection rate/citation differences among disciplines or subdisciplines, etc., etc.). In fact, in panel B of the same figure 4, the authors show a stronger effect of changing discipline on the number of citations in resubmitted manuscripts. Why a deeper multivariate analysis was not performed to back up the headline claim that “rejection improves impact” is hard to understand from a critical perspective. [UPDATE 26/10/2012: Bala Iyengar pointed out to me a page on the author's website that discusses the effects of controlling for year and publishing journal on the citation effect, which led me to re-read the paper and supplemental materials more closely and see that these two factors are in fact controlled for in the main analysis of the paper. No other possible confounding factors are controlled for, however.]

So what is going on here? Why did Science allow such a weak effect with a relatively superficial analysis to be published in one of the supposedly most selective journals? Why are major science media outlets pushing this incredibly small boost in citations that is (possibly) associated with rejection? Likewise, why are scientists so uncritically posting links to the Nature and Scientist news pieces and repeating the “Rejection Improves Impact” meme?

I believe the answer to the first two questions is clear: Nature and Science have a vested interest in making the case that it is in the best interest of scientists to submit their most important work to (their) highly selective journals and risk having it be rejected. This gives Nature and Science first crack at selecting the best science and serves to maintain their hegemony in the scientific publishing marketplace. If this interpretation is true, it is an incredibly self-serving stance for Nature and Science to take, and one that may backfire since, on the whole, scientists are not stupid people who blindly accept nonsense. More importantly though, using the pages of Science and Nature as a marketing campaign to convince scientists to submit their work to these journals risks their credibility as arbiters of “truth”. If Science and Nature go so far as to publish and hype weak, self-serving scientometric effects to get us to submit our work there, what’s to say they would not do the same for actual scientific results?

But why are scientists taking the bait on this one? This is more difficult to understand, but most likely has to do with the possibility that most people repeating this meme have not read the paper. Topsy records over 700 and 150 tweets linking to the Nature News and The Scientist pieces, respectively, but only ~10 posts linking to the original article in Science. Taken at face value, roughly 80-fold more scientists are reading the news about this article than reading the article itself. To be fair, this is due in part to the fact that the article is not open access and is behind a paywall, whereas the news pieces are freely available**. But this is only the proximal cause. The ultimate cause is likely that many scientists are happy to receive (uncritically, it seems) any justification, however tenuous, for continuing to play the high-impact-factor journal sweepstakes. Now we have a scientifically valid reason to take the risk of being rejected by top-tier journals, even if it doesn’t pay off. Right? Right?

The real shame in the “Rejection Improves Impact” spin is that an important take-home message of Calcagno et al. is that the vast majority of papers (>75%) are published in the first journal to which they are submitted.  As a scientific community we should continue to maintain and improve this trend, selecting the appropriate home for our work on initial submission. Justifying pipe-dreams that waste precious time based on self-serving spin that benefits the closed-access publishing industry should be firmly: Rejected.

Don’t worry, it’s probably in the best interest of Science and Nature that you believe this meme.

* To be fair, Science Insider does acknowledge that the effect is weak: “previously rejected papers had a slight bump in the number of times they were cited by other papers” (emphasis mine).

** Following a link available on the author’s website, you can access this article for free here.

References
Calcagno, V., Demoinet, E., Gollner, K., Guidi, L., Ruths, D., & de Mazancourt, C. (2012). Flows of Research Manuscripts Among Scientific Journals Reveal Hidden Submission Patterns. Science DOI: 10.1126/science.1227833

On The Neutral Sequence Fallacy

Beginning in the late 1960s, Motoo Kimura overturned over a century of “pan-selectionist” thinking in evolutionary biology by proposing what has come to be called The Neutral Theory of Molecular Evolution. The Neutral Theory in its basic form states that the dynamics of the majority of changes observed at the molecular level are governed by the force of Genetic Drift, rather than Darwinian (i.e. Positive) Natural Selection. As with all paradigm shifts in Science, there was much controversy over the Neutral Theory in its early years, but nevertheless the Neutral Theory has firmly established itself as the null hypothesis for studies of evolution at the molecular level since the mid-1980s.

Despite its widespread adoption, over the last ten years or so there has been a worrying increase in abuse of terminology concerning the Neutral Theory, which I will collectively term here the “Neutral Sequence Fallacy” (inspired by T. Ryan Gregory’s Platypus Fallacy). The Neutral Sequence Fallacy arises when the distinct concepts of functional constraint and selective neutrality are conflated, leading to the mistaken description of functionally unconstrained sequences as being “Neutral”. The Fallacy, in short, is to assign the term Neutral to a particular biomolecular sequence.

The Neutral Sequence Fallacy now routinely causes problems in the fields of evolutionary and genome biology, both in terms of generating conceptual muddles as well as shifting the goalposts needed to reject the null model of sequence evolution. I have intended to write about this problem for years in order to put a halt to this growing abuse of Neutral terminology, but unfortunately never found the time. However, this issue has reared its head more strongly in the last few days, with new forms of the Neutral Sequence Fallacy arising in the context of discussions about the ENCODE project, motivating a rough version of this critique to finally see the light of day. Here I will try to sketch out the origins of the Neutral Sequence Fallacy, in its original pre-genomic form that was debunked by Kimura while he was alive, and in its modern post-genomic form that has proliferated unchecked since the early comparative genomic era.

The Neutral Sequence Fallacy draws on several misconceptions about the Neutral Theory, and begins with the abbreviation of the theory’s name from its full form (The Neutral Mutation – Random Drift Hypothesis) to its colloquial form (The Neutral Theory). This abbreviation de-emphasizes that the concept of selective neutrality applies to mutations (i.e. variants, alleles), not biomolecular sequences (i.e. regions of the genome, proteins). Simply put, only variants of a sequence can be neutral or non-neutral, not sequences themselves.

The key misconception that permits the Neutral Sequence Fallacy to flourish is the incorrect notion that if a sequence is neutrally evolving, it implies a lack of functional constraint operating on that sequence, and vice versa. Other ways to state this misconception are: “a sequence is Neutral if it is under no selective constraint” or conversely “selective constraint rejects Neutrality”. This misconception arose originally in the 1970s, shortly after the proposal of The Neutral Theory when many researchers were first coming to terms with what the theory meant. This misconception became prevalent enough that it was the first to be addressed head-on by Kimura (1983) nearly 30 years ago in section 3.6 of his book The Neutral Theory of Molecular Evolution entitled “On some misunderstandings and criticisms” (emphasis is mine):

Since a number of criticisms and comments have been made regarding my neutral theory, often based on misunderstandings, I would like to take this opportunity to discuss some of them. The neutral theory by no means claims that the genes involved are functionless as mistakenly suggested by Zuckerkandl (1978). They may or may not be, but what the neutral theory assumes is that the mutant forms of each gene participating in molecular evolution are selectively nearly equivalent, that is, they can do the job equally well in terms of survival and reproduction of the individual. (p. 50)

As pointed out by Kimura and Ohta (1977), functional constraints are consistent with neutral substitutions within a class of mutants. For example, if a group of amino acids are constrained to be hydrophilic, there can be random changes within the codons producing such amino acids…There is, of course, negative selection against hydrophobic mutants in this region, but, as mentioned before, negative selection does not contradict the neutral theory.  (p. 53)

It is understandable how this misconception arises, because in the limit of zero functional constraint (e.g. in a non-functional pseudogene), all alleles become effectively equivalent to one another and are therefore selectively neutral. However, this does not mean that an unconstrained sequence is Neutral (unless we redefine the meaning of Neutrality, see below), because a sequence itself cannot be Neutral, only variants of a sequence can be Neutral with respect to each other.

It is crucial in this context to understand that the Neutral Theory accommodates all levels of selective constraint, and sequences under selective constraint can evolve Neutrally (see the formal statement of this in Equation 5.1 of Kimura 1983). This point is often lost on many people. Until you get this, you don’t understand the Neutral Theory. A simple example shows how this is true. Consider a single codon in a protein-coding region that codes for an amino acid with degenerate (synonymous) codons. Deletion of the third codon position would create a frameshift, and thus a third-position “silent” site is indeed functional. However, alternative codons for this amino acid are functionally equivalent and evolve (close to) neutrally. The fact that these alternative alleles evolve neutrally has to do with their equivalence of function, not the degree of their functional constraint.
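A minimal sketch in Python makes this example concrete. The codon table below contains only the four standard genetic code entries for glycine, purely for illustration:

```python
# The four synonymous codons for glycine in the standard genetic code.
GLYCINE_CODONS = {"GGT", "GGC", "GGA", "GGG"}

def translate_codon(codon):
    return "Gly" if codon in GLYCINE_CODONS else "other"

# Third-position substitutions among these codons are functionally
# equivalent variants of each other, so they can evolve (close to) neutrally.
for codon in sorted(GLYCINE_CODONS):
    print(codon, "->", translate_codon(codon))

# Yet the third position is still functional: deleting it (rather than
# substituting it) shifts the reading frame of every downstream codon.
mrna = "GGT" + "GCT" + "TGG"            # Gly-Ala-Trp
frameshifted = "GG" + "GCTTGG"          # one base deleted from codon 1
print([frameshifted[i:i + 3] for i in range(0, len(frameshifted), 3)])
```

The point of the toy example is that neutrality is a property of variants relative to each other, not of the site: the very same site supports both neutral variants (synonymous substitutions) and strongly deleterious ones (a deletion).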

~~~~

To demonstrate the Neutral Sequence Fallacy, I’d like to point out a few clear examples of this misconception in action. The majority of transgressions in this area come from the genomics community, where people may not have been formally trained in evolution, but I am sad to say that an increasing number of evolutionary biologists are also falling victim to the Neutral Sequence Fallacy these days. My reckoning is that the Neutral Sequence Fallacy gained traction again in the post-genomic era around the time of the mouse genome paper by Waterston et al. (2002). In this widely-read paper, putatively unconstrained ancestral repeats were referred to (incorrectly) as “neutrally evolving DNA”, and used to estimate the fraction of the human genome under selective constraint. This analysis culminated with the following question: “How can we cleanly separate neutral and selected sequences?”. Under the Neutral Theory, this question makes no sense. First, sequences cannot be neutral; and second, the framework used to detect functional constraints by comparative genomics assumes Neutral evolution of both classes of sites (unconstrained and constrained) – i.e. most changes between species are driven by Genetic Drift, not Positive Selection. The proper formulation of this question should have been: “How can we cleanly separate unconstrained and constrained sequences?”.

Here is another clear example of the Neutral Sequence Fallacy in action from Lunter et al. (2006):

Figure 5 from Lunter et al. (2006). Notice how in the top panel, regions of the genome are contrasted as being “Neutral” vs. “Functional”. Here the term “Neutral” is being used incorrectly to mean selectively unconstrained. The bottom panel shows how indels are suppressed in Functional regions leading to intergap segments.

Here are a couple more examples of the Neutral Sequence Fallacy in action, right in the titles of fairly high-profile comparative genomics papers:

Title from Elnitski et al. (2003). Notice that the functional class of “Regulatory DNA” is incorrectly contrasted as being the complement of nonfunctional “Neutral Sites”. In fact, both classes of sites are assumed to evolve neutrally in the authors’ model.

Title from Chin et al. (2005). As above, notice how the concept of “Functionally conserved” is incorrectly stated to be the opposite of “Neutral sequence”, even though both classes of sites are assumed to evolve neutrally in the authors’ model.

I don’t mean to single these papers out; they just happen to represent very clear examples of the Neutral Sequence Fallacy in action. In fact, the Lunter et al. (2006) paper is one of my all-time favorites, but it bugs the hell out of me when I have to unpick students’ misconceptions after they read it. Frustratingly, the list of papers repeating the Neutral Sequence Fallacy is long and growing. I have recently started to collect them in a CiteULike library to provide examples for students to understand how not to make this common mistake. (If anyone else would like to contribute to this effort, please let me know — there is much work to be done to reverse this trend.)

~~~~

So what’s the big deal here? Some would argue that these authors actually know what they are talking about, but just happen to be using the wrong terminology. I wish that this were the case, but very often it is not. In many papers that I read or review that perpetrate the Neutral Sequence Fallacy, I usually find further examples of seriously flawed evolutionary reasoning, suggesting that the authors do not in fact have a deep understanding of the issues at hand. Indeed, evidence of the Neutral Sequence Fallacy in a paper is usually a clear hallmark that the authors are practicing population genetics or molecular evolution without a license. This leads to the Neutral Sequence Fallacy of the 1st Kind: authors do not understand the difference between the concepts of functional constraint and selective neutrality. The problems for the Neutral Theory caused by violations of the 1st Kind are deep and clear. Because the Neutral Theory is not fully understood, it is possible to construct a straw-man version of the null hypothesis of Neutrality that can easily be “rejected” simply by finding evidence of selective constraint. Furthermore, because selectively unconstrained sequences are asserted (incorrectly) to be “Neutral” without actually evaluating their mode of evolution, this conceptual error undermines the entire value of the Neutral Theory as a null hypothesis testing framework.

But some authors really do know the difference between these ideas and just happen to be using the term “Neutral” as shorthand for “Unconstrained.” Increasingly, I see some of my respected peers, card-carrying molecular evolutionists who do know their stuff, making this mistake in print. In these cases what is happening is a Neutral Sequence Fallacy of the 2nd Kind: understanding the difference between functional constraint and selective neutrality, but using lazy terminology that confuses these ideas in print. This is most often found in the context of studies on noncoding DNA where, in the absence of the genetic code to conveniently constrain terminology, people use terms like “neutral standard” or “neutral region” or “neutral sites” or “neutral proxy” in place of “putatively unconstrained”. While violations of the 2nd Kind can be overlooked and parsed correctly by experts in molecular evolution (I hope), this sloppy language causes substantial confusion about the Neutral Theory among students and non-evolutionary biologists who are new to the field, and leads to whole swathes of subsequent violations of the 1st Kind. Moreover, defining sequences as Neutral serves those with an Adaptationist agenda: since a control region is defined as being Neutral, all mutations that occur in that region must therefore be neutral as well, and thus any potential complications of the non-neutrality of mutations in one’s control region are conveniently swept under the carpet. Violations of the 2nd Kind are often quite insidious, since they are generally perpetrated by people with some authority in evolutionary biology, who are often unaware of their misuse of terminology and who will vigorously deny that they are using terms which perpetuate a classical misconception laid to rest by Kimura 30 years ago.

~~~~

Which brings us to the most recent incarnation of the Neutral Sequence Fallacy, in the context of the ENCODE project. In a companion post explaining the main findings of the ENCODE Project, Ewan Birney describes how the ENCODE Project reinforced recent findings that many biochemical events operating on the genome are highly reproducible but have no known function. In describing these events, Birney states:

I really hate the phrase “biological noise” in this context. I would argue that “biologically neutral” is the better term, expressing that there are totally reproducible, cell-type-specific biochemical events that natural selection does not care about. This is similar to the neutral theory of amino acid evolution, which suggests that most amino acid changes are not selected either for or against…Whichever term you use, we can agree that some of these events are “neutral” and are not relevant for evolution.

Under the standard view of the Neutral Theory, Birney misuses the term “Neutral” here to mean lack of functional constraint, repeating the classical form of the Neutral Sequence Fallacy. Because of this, I argue that Birney’s proposed terminology should be rejected, since it will perpetuate a classic misconception in Biology. Instead, I propose the term “biologically inert”.

But wait a minute, you say, this is actually a transgression of the 2nd Kind. Really what is going on here is a matter of semantics. Birney knows the difference between functional constraint and selective neutrality. He is just formalizing the creeping misuse of the term Neutral to mean “Nonfunctional” that has been happening over the last decade.  If so, then I argue he is proposing to assign to the term Neutral the primary misconception of the Neutral Theory previously debunked by Kimura. This is a very dangerous proposal, since it will lead to further confusion in genomics arising from the “overloading” of the term Neutral (Kimura’s meaning: selectively equivalent; Birney’s meaning: no functional constraint). This muddle will subsequently prevent most scientists from properly understanding the Neutral Theory, and lead to many further examples of the Neutral Sequence Fallacy of both Kinds.

In my view, semantic switches like this are dangerous in Science, since they massively hinder communication and, therefore, progress. Semantic switches also lead to a distortion of understanding about key concepts in science. A famous case in point is Watson’s semantic switch of Crick’s term “Central Dogma”, which corrupted Crick’s beautifully crafted original concept into the watered-down textbook misinterpretation that is most often repeated: “DNA makes RNA makes protein” (see Larry Moran’s blog for more on this). Some may say this is the great thing about language, the same word can mean different things to different people. This view is best characterized in the immortal words of Humpty-Dumpty in Lewis Carroll’s Through the Looking Glass, who declares that when he uses a word, “it means just what I choose it to mean, neither more nor less”.

Others, including myself, disagree and prefer to have fixed definitions for scientific terms.

In a second recent case of the Neutral Sequence Fallacy creeping into discussions in the context of ENCODE, Michael Eisen proposes that we develop “A neutral theory of molecular function” to interpret the meaning of these reproducible biochemical events that have no known function. Inspired by the new null hypothesis that the Neutral Theory ushered into evolutionary biology, Eisen calls for a new “neutral null hypothesis” that requires molecular functions to be proven, not assumed. I laud any attempt to promote the use of null models for hypothesis testing in molecular biology, and whole-heartedly agree with Eisen’s main message about the need for a null model for molecular function.

But I disagree with Eisen’s proposal for a “neutral null hypothesis”, which, from my reading of his piece, directly couples the null hypothesis for function with the null hypothesis for sequence evolution. By synonymizing the H0 of the functional model with the H0 of the evolutionary model, regions of the genome that fail to reject the null functional model (i.e. have no functional constraint) will be conflated with “being Neutral” (incorrect) or evolving neutrally (potentially correct), whereas those regions that reject the null functional model will be immediately considered as evolving non-neutrally (which may not always be the case, since functional regions can evolve neutrally). While I assume this is not what Eisen intends, this is almost inevitably the outcome of suggesting a “neutral null hypothesis” in the context of biomolecular sequences. A “neutral null hypothesis for molecular function” makes it all too easy to merge the concepts of functional constraint and selective neutrality, which will inevitably lead many to the Neutral Sequence Fallacy. As Kimura did, Eisen should formally decouple the concept of functional constraint on a sequence from the mode of evolution by which that sequence evolves. Eisen should instead be promoting a “null model of molecular function” that cleanly separates the concepts of function and evolution (an example of such a null model is embodied in Sean Eddy’s Random Genome Project). If not, I fear this conflation of concepts, like Birney’s semantic switch, will lead to more examples of the Neutral Sequence Fallacy of both Kinds.

~~~~

The Neutral Sequence Fallacy shares many sociological similarities with the chronic misuse of, and misconceptions about, the concept of Homology. As discussed by Marabotti and Facchiano in their article “When it comes to homology, bad habits die hard“, there was a peak of misuse of the term Homology in the mid-1980s, which led to a backlash of publications demanding more rigorous use of the term. Despite this backlash and the best efforts of many scientists to stem the tide of misuse of Homology, ~43% of abstracts surveyed in 2007 still used Homology incorrectly, down from 51% in 1986 before the assault on its misuse began. As anyone teaching the concept knows, unpicking misconceptions about Homology vs. Similarity is crucial for getting students to understand evolutionary theory. I argue that the same is true for the distinction between Functional Constraint and Selective Neutrality. When it comes to Functional Constraints on biomolecular sequences, our choice of terminology should be anything but Neutral.

References:

Chin CS, Chuang JH, & Li H (2005). Genome-wide regulatory complexity in yeast promoters: separation of functionally conserved and neutral sequence. Genome research, 15 (2), 205-13 PMID: 15653830

Elnitski L, Hardison RC, Li J, Yang S, Kolbe D, Eswara P, O’Connor MJ, Schwartz S, Miller W, & Chiaromonte F (2003). Distinguishing regulatory DNA from neutral sites. Genome research, 13 (1), 64-72 PMID: 12529307

Lunter G, Ponting CP, & Hein J (2006). Genome-wide identification of human functional DNA using a neutral indel model. PLoS computational biology, 2 (1) PMID: 16410828

Marabotti A, & Facchiano A (2009). When it comes to homology, bad habits die hard. Trends in biochemical sciences, 34 (3), 98-9 PMID: 19181528

Waterston RH, Lindblad-Toh K, Birney E, Rogers J, Abril JF, Agarwal P, Agarwala R, Ainscough R, Alexandersson M, An P, Antonarakis SE, Attwood J, Baertsch R, Bailey J, Barlow K, Beck S, Berry E, Birren B, Bloom T, Bork P, Botcherby M, Bray N, Brent MR, Brown DG, Brown SD, Bult C, Burton J, Butler J, Campbell RD, Carninci P, Cawley S, Chiaromonte F, Chinwalla AT, Church DM, Clamp M, Clee C, Collins FS, Cook LL, Copley RR, Coulson A, Couronne O, Cuff J, Curwen V, Cutts T, Daly M, David R, Davies J, Delehaunty KD, Deri J, Dermitzakis ET, Dewey C, Dickens NJ, Diekhans M, Dodge S, Dubchak I, Dunn DM, Eddy SR, Elnitski L, Emes RD, Eswara P, Eyras E, Felsenfeld A, Fewell GA, Flicek P, Foley K, Frankel WN, Fulton LA, Fulton RS, Furey TS, Gage D, Gibbs RA, Glusman G, Gnerre S, Goldman N, Goodstadt L, Grafham D, Graves TA, Green ED, Gregory S, Guigó R, Guyer M, Hardison RC, Haussler D, Hayashizaki Y, Hillier LW, Hinrichs A, Hlavina W, Holzer T, Hsu F, Hua A, Hubbard T, Hunt A, Jackson I, Jaffe DB, Johnson LS, Jones M, Jones TA, Joy A, Kamal M, Karlsson EK, Karolchik D, Kasprzyk A, Kawai J, Keibler E, Kells C, Kent WJ, Kirby A, Kolbe DL, Korf I, Kucherlapati RS, Kulbokas EJ, Kulp D, Landers T, Leger JP, Leonard S, Letunic I, Levine R, Li J, Li M, Lloyd C, Lucas S, Ma B, Maglott DR, Mardis ER, Matthews L, Mauceli E, Mayer JH, McCarthy M, McCombie WR, McLaren S, McLay K, McPherson JD, Meldrim J, Meredith B, Mesirov JP, Miller W, Miner TL, Mongin E, Montgomery KT, Morgan M, Mott R, Mullikin JC, Muzny DM, Nash WE, Nelson JO, Nhan MN, Nicol R, Ning Z, Nusbaum C, O’Connor MJ, Okazaki Y, Oliver K, Overton-Larty E, Pachter L, Parra G, Pepin KH, Peterson J, Pevzner P, Plumb R, Pohl CS, Poliakov A, Ponce TC, Ponting CP, Potter S, Quail M, Reymond A, Roe BA, Roskin KM, Rubin EM, Rust AG, Santos R, Sapojnikov V, Schultz B, Schultz J, Schwartz MS, Schwartz S, Scott C, Seaman S, Searle S, Sharpe T, Sheridan A, Shownkeen R, Sims S, Singer JB, Slater G, Smit A, Smith DR, Spencer B, Stabenau A, Stange-Thomann N, Sugnet C, Suyama M, Tesler G, Thompson J, Torrents D, Trevaskis E, Tromp J, Ucla C, Ureta-Vidal A, Vinson JP, Von Niederhausern AC, Wade CM, Wall M, Weber RJ, Weiss RB, Wendl MC, West AP, Wetterstrand K, Wheeler R, Whelan S, Wierzbowski J, Willey D, Williams S, Wilson RK, Winter E, Worley KC, Wyman D, Yang S, Yang SP, Zdobnov EM, Zody MC, & Lander ES (2002). Initial sequencing and comparative analysis of the mouse genome. Nature, 420 (6915), 520-62 PMID: 12466850

Credits:

Thanks to Chip Aquadro for originally pointing out to me when I perpetrated the Neutral Sequence Fallacy (of the 1st Kind!) during a journal club as an undergraduate in his lab. I can distinctly recall the hot embarrassment of the moment while being schooled in this important issue by a master. Thanks also to Alan Moses, who was the first of many people I converted to the light on this issue, and who has encouraged me since to write this up for a wider audience. Thanks also to Douda Bensasson for putting up with me ranting about this issue for years, and for helpful comments on this post.

The Logistics of Scientific Growth in the 21st Century

Over the last few months, I’ve noticed a growing number of reports about declining opportunities and increasing pressure for early-stage academic researchers (Ph.D. students, post-docs and junior faculty). For example, the Washington Post published an article in early July about trends in the U.S. scientific job market entitled “U.S. pushes for more scientists, but the jobs aren’t there.” This post generated over 3,500 comments on the WaPo website alone and was highly discussed in the twittersphere. In mid-July, Inside Higher Ed reported that an ongoing study revealed a recent, precipitous drop in the interest of STEM (Science/Technology/Engineering/Mathematics) Ph.D. students in pursuing an academic tenure-track career. These results confirmed those published in PLoS ONE in May, which showed that the interest of STEM students surveyed in 2010 in pursuing an academic career declined during the course of their Ph.D. studies:

Figure 1. Percent of STEM Ph.D. students judging a career to be “extremely attractive”. Taken from Sauermann & Roach (2012).

Even for those lucky enough to get an academic appointment, the bad news seems to be that it is getting harder to establish a research program.  For example, the average age for a researcher to get their first NIH grant (a virtual requirement for tenure for many biologists in the US) is now 42 years old. National Public Radio quips “50 is the new 30, if you’re a promising scientist.”

I’ve found these reports very troubling since, after nearly fifteen years of slogging it out since my undergrad to achieve the UK equivalent of a “tenured” academic position, I am acutely aware of how hard the tenure track is for junior scientists at this stage in history. On a regular basis I see how the current system negatively affects the lives of talented students, post-docs and early-stage faculty. I have for some time wanted to write about my point of view on this issue, since I see these trends as indicators of bigger changes in the growth of science than individuals may be aware of. I’ve finally been inspired to do so by a recent piece by Euan Ritchie and Joern Fischer published in The Conversation entitled “Cracks in the ivory tower: is academia’s culture sustainable?“, which I think hits the nail on the head about the primary source of the current problems in academics: the deeply flawed philosophy that “more is always better”.

My view is that the declining opportunities and increasing malaise among early-stage academics are a by-product of the fact that the era of exponential growth in academic research is over. That’s nonsense, you say, the problems we are experiencing now are because of the current global economic downturn. What’s happening now is a temporary blip, things will return to happier days when we get back to “normal” economic growth and governments increase investment in research. Nonsense, I say. This has nothing to do with the current economic climate, and instead has more to do with long-term trends in the growth of scientific activity over the last three centuries.

My views are almost entirely derived from a book written by Derek de Solla Price entitled Little Science, Big Science. Price was a scientist-cum-historian who published this slim tome in 1963, based on a series of lectures at Brookhaven National Lab in 1962. It was a very influential book in the 1960s and 1970s, since it introduced citation analysis to a wide audience. Along with Eugene Garfield of ISI/Impact Factor fame (or infamy, depending on your point of view), Price is credited as being one of the founding fathers of Scientometrics. Sadly, this important book is now out of print, the Wikipedia page on this book is a stub with no information, and Google Books has not scanned it into their electronic library, showing just how far the ideas in this book are out of the current consciousness. I am not the first to lament that Price’s writings have been ignored in recent years.

In a few short chapters, Price covers large-scale trends in the growth of science and the scientific literature from its origins in the 17th century, which I urge readers to explore for themselves. I will focus here only on one of his key points that relates to the matter at hand — the pinch we are currently feeling in science. Price shows that as scientific disciplines matured in the 20th century, they achieved a characteristic exponential growth rate, which appears linear on a logarithmic scale. This can be seen in terms of both the output of scientific papers (Figure 2) and of scientists themselves (Figure 3).

Figure 2. Taken from de Solla Price 1963.

Figure 3. Taken from de Solla Price 1963.

Price showed that there was a roughly constant doubling time for different forms of scientific output (number of journals, number of papers, number of scientists, etc.) of about 10-15 years. That is, the amount of scientific output at a given point in history is twice as large as it was 10-15 years before. This incessant growth is why we all feel like it is so hard to keep up with the literature (and incidentally why I believe that text mining is now an essential tool). And these observations led Price to make the famous claim that “Eighty to 90 per cent of all the scientists who have ever lived are alive now”.

Crucially, Price pointed out that the doubling time of the number of scientists is much shorter than the doubling time of the overall human population (~50 years). Thus, the proportion of scientists relative to the total human population has been increasing for decades, if not centuries. Price makes the startling but obvious outcomes of this observation very clear: either everyone on earth will be a scientist one day, or the growth rate of science must decrease from its previous long-term trends. He then goes on to argue that the most likely outcome is the latter, and that scientific growth rates will change from exponential to logistic growth and reach saturation sometime within 100 years from the publication of his book in 1963 (Figure 4):

Figure 4. A model of logistic growth for Science (taken from de Solla Price 1963).
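To make Price’s argument concrete, here is a minimal sketch in Python of the two growth models he contrasts: pure exponential growth with a 10-15 year doubling time, versus logistic growth that saturates at a carrying capacity. The parameter values are illustrative only, not Price’s fitted estimates:

```python
import math

def exponential(t, n0=1.0, doubling_time=12.5):
    """Price's long-term trend: output doubles every 10-15 years."""
    return n0 * 2 ** (t / doubling_time)

def logistic(t, k=100.0, r=0.06, t_mid=80.0):
    """Saturating growth: nearly exponential early on, flattening
    as output approaches the carrying capacity k."""
    return k / (1 + math.exp(-r * (t - t_mid)))

# The two models look similar early on; the difference only becomes
# obvious as the logistic curve approaches saturation.
for t in (0, 40, 80, 120, 160):
    print(t, round(exponential(t), 1), round(logistic(t), 1))
```

The uncomfortable part of such a transition is the middle of the logistic curve: growth is still happening, but each cohort experiences visibly less expansion than the one before it.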

So maybe the bad news circulating in labs, coffee rooms and over the internet is not a short-term trend based on the current economic downturn, but instead reflects the product of a long-term trend in the history of science?  Perhaps the crunch that we are currently experiencing in academic research now is the byproduct of the fact that we are in Price’s transition from exponential to logistic growth in science? If so, the pressures we are experiencing now may simply reflect that the current rate of production of scientists is no longer matched to the long-term demand for scientists in society.

Whether or not this model of growth in science is true is clearly debatable (please do so below!). But if we are in the midst of making the transition from exponential to logistic growth in science, then there are a number of important implications that I feel scientists at all stages of their careers should be aware of:

1) For PhD students and post-docs: you have every right to be feeling like the opportunities in science may not be there for you as they were for your supervisors and professors. This message sucks, I know, but one important take-home message from this is that it may not have anything to do with your abilities; it may just have to do with when you came along in history. I am not saying that there will be no opportunities in the future, just fewer as a proportion of the total number of jobs in society relative to current levels. I’d argue that this is a cautiously optimistic view, since anticipating the long-term trends will help you develop more realistic and strategic approaches to making career choices.

2) For early-stage academics: your career trajectory is going to be more limited than you anticipated going into this gig. Sorry mate, but your lab is probably not going to be as big as you might think it should be, you will probably get fewer grants, and you will have more competition for resources than you witnessed in your PhD or post-doc supervisor’s lab. Get used to it. If you think you have it hard, see point 1). You are lucky to have a job in science. Also bear in mind that the people judging your career progression may hold expectations that are no longer relevant, and as a result you may have more conflict with senior members of staff during the earlier phases of your career than you expect. Most importantly, if you find that this new reality is true for you, then do your best to adjust your expectations for PhD students and post-docs as well.

3) For established academics: you came up during the halcyon days of growth in science, so bear in mind that you had it easy relative to those trying to make it today. So when you set your expectations for your students or junior colleagues in terms of performance, recruitment or tenure, be sure to take on board that they have it much harder now than you did at the corresponding point in your career [see points 1) and 2)]. A corollary of this point is that anyone actually succeeding in science now and in the future is (on average) probably better trained and works harder than you did at the corresponding point in your career, so on the whole you are probably dealing with someone who is more qualified for their job than you would be. So don’t judge your junior colleagues by out-of-date standards (ones that you yourself might not be able to meet in the current climate) or promote values from a bygone era of incessant growth. Instead, adjust your views of success for the 21st century and seek to promote a sustainable model of scientific career development that will fuel innovation for the next hundred years.

References

de Solla Price D (1963) Little Science, Big Science. New York: Columbia University Press.

Kealey T (2000). More is less. Economists and governments lag decades behind Derek Price’s thinking Nature, 405 (6784) PMID: 10830939

Sauermann H, & Roach M (2012). Science PhD career preferences: levels, changes, and advisor encouragement. PloS one, 7 (5) PMID: 22567149

An Open Archive of My F1000 Reviews

Following on from a recent conversation with David Stephens on Twitter about my decision to resign from Faculty of 1000, F1000 has clarified their terms for the submission of evaluations and confirmed that it is permissible to “reproduce personal evaluations on institutional & personal blogs if you clearly reference F1000”.

As such, I am delighted to be able to repost here an Open Archive of my F1000 contributions. Additionally, this post acts in a second capacity as my first contribution to the Research Blogging Network. Hopefully these commentaries will be of interest to some, and should add support to the Altmetrics profiles for these papers through systems like Total Impact.

Nelson CE, Hersh BM, & Carroll SB (2004). The regulatory content of intergenic DNA shapes genome architecture. Genome biology, 5 (4) PMID: 15059258

My review: This article reports that genes with complex expression have longer intergenic regions in both D. melanogaster and C. elegans, and introduces several innovative and complementary approaches to quantify the complexity of gene expression in these organisms. Additionally, the structure of intergenic DNA in genes with high complexity (e.g. receptors, specific transcription factors) is shown to be longer and more evenly distributed over 5′ and 3′ regions in D. melanogaster than in C. elegans, whereas genes with low complexity (e.g. metabolic genes, general transcription factors) are shown to have similar intergenic lengths in both species and exhibit no strong differences in length between 5′ and 3′ regions. This work suggests that the organization of noncoding DNA may reflect constraints on transcriptional regulation and that gene structure may yield insight into the functional complexity of uncharacterized genes in compact animal genomes. (@F1000: http://f1000.com/1032936)

Li R, Ye J, Li S, Wang J, Han Y, Ye C, Wang J, Yang H, Yu J, Wong GK, & Wang J (2005). ReAS: Recovery of ancestral sequences for transposable elements from the unassembled reads of a whole genome shotgun. PLoS computational biology, 1 (4) PMID: 16184192

My review: This paper presents a novel method for automating the laborious task of constructing libraries of transposable element (TE) consensus sequences. Since repetitive TE sequences confound whole-genome shotgun (WGS) assembly algorithms, sequence reads from TEs are initially screened from WGS assemblies based on overrepresented k-mer frequencies. Here, the authors invert the same principle, directly identifying TE consensus sequences from those same reads containing high frequency k-mers. The method was shown to identify all high copy number TEs and increase the effectiveness of repeat masking in the rice genome. By circumventing the inherent difficulties of TE consensus reconstruction from erroneously assembled genome sequences, and by providing a method to identify TEs prior to WGS assembly, this method provides a new strategy to increase the accuracy of WGS assemblies as well as our understanding of the TEs in genome sequences. (@F1000: http://f1000.com/1031746)
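The core trick described here, flagging reads that carry overrepresented k-mers as likely TE-derived, is easy to illustrate. The sketch below is my own toy rendering of that principle, not the actual ReAS implementation, and the k and min_count parameters are arbitrary:

```python
from collections import Counter

def high_frequency_kmers(reads, k=8, min_count=10):
    """Count all k-mers across the reads and keep the overrepresented
    ones, which likely derive from high copy number repeats."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return {kmer for kmer, n in counts.items() if n >= min_count}

def candidate_te_reads(reads, repeat_kmers, k=8):
    """Reads containing any high-frequency k-mer are candidate TE reads,
    from which consensus sequences can then be assembled."""
    return [r for r in reads
            if any(r[i:i + k] in repeat_kmers for i in range(len(r) - k + 1))]
```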

Rifkin SA, Houle D, Kim J, & White KP (2005). A mutation accumulation assay reveals a broad capacity for rapid evolution of gene expression. Nature, 438 (7065), 220-3 PMID: 16281035

My review: This paper reports empirical estimates of the mutational input to gene expression variation in Drosophila, knowledge of which is critical for understanding the mechanisms governing regulatory evolution. These direct estimates of mutational variance are compared to gene expression differences across species, revealing that the majority of genes have lower expression divergence than is expected if evolving solely by mutation and genetic drift. Mutational variances on a gene-by-gene basis range over several orders of magnitude and are shown to vary with gene function and developmental context. Similar results in C. elegans [1] provide strong support for stabilizing selection as the dominant mode of gene expression evolution. (@F1000: http://f1000.com/1040157)

References: 1. Denver DR, Morris K, Streelman JT, Kim SK, Lynch M, & Thomas WK (2005). The transcriptional consequences of mutation and natural selection in Caenorhabditis elegans. Nature genetics, 37 (5), 544-8 PMID: 15852004

Caspi A, & Pachter L (2006). Identification of transposable elements using multiple alignments of related genomes. Genome research, 16 (2), 260-70 PMID: 16354754

My review: This paper reports an innovative strategy for the de novo detection of transposable elements (TEs) in genome sequences based on comparative genomic data. By capitalizing on the fact that bursts of TE transposition create large insertions in multiple genomic locations, the authors show that detection of repeat insertion regions (RIRs) in alignments of multiple Drosophila genomes has high sensitivity to identify both individual instances and families of known TEs. This approach opens a new direction in the field of repeat detection and provides added value to TE annotations by placing insertion events in a phylogenetic context. (@F1000 http://f1000.com/1049265)

Simons C, Pheasant M, Makunin IV, & Mattick JS (2006). Transposon-free regions in mammalian genomes. Genome research, 16 (2), 164-72 PMID: 16365385

My review: This paper presents an intriguing analysis of transposon-free regions (TFRs) in the human and mouse genomes, under the hypothesis that TFRs indicate genomic regions where transposon insertion is deleterious and removed by purifying selection. The authors test and reject a model of random transposon distribution and investigate the properties of TFRs, which appear to be conserved in location across species and enriched for genes (especially transcription factors and micro-RNAs). An alternative mutational hypothesis not considered by the authors is the possibility for clustered transposon integration (i.e. preferential insertion into regions of the genome already containing transposons), which may provide a non-selective explanation for the apparent excess of TFRs in the human and mouse genomes. (@F1000: http://f1000.com/1010399)

Wheelan SJ, Scheifele LZ, Martínez-Murillo F, Irizarry RA, & Boeke JD (2006). Transposon insertion site profiling chip (TIP-chip). Proceedings of the National Academy of Sciences of the United States of America, 103 (47), 17632-7 PMID: 17101968

My review: This paper demonstrates the utility of whole-genome microarrays for the high-throughput mapping of eukaryotic transposable element (TE) insertions on a genome-wide basis. With an experimental design guided by first computationally digesting the genome into suitable fragments, followed by linker-PCR to amplify TE flanking regions and subsequent hybridization to tiling arrays, this method was shown to recover all detectable TE insertions with essentially no false positives in yeast. Although limited to species with available genome sequences, this approach circumvents inefficiencies and biases associated with the alternative of whole-genome shotgun resequencing to detect polymorphic TEs on a genome-wide scale. Application of this or related technologies (e.g. [1]) to more complex genomes should fill gaps in our understanding of the contribution of TE insertions to natural genetic variation. (@F1000: http://f1000.com/1088573)

References: 1. Gabriel A, Dapprich J, Kunkel M, Gresham D, Pratt SC, & Dunham MJ (2006). Global mapping of transposon location. PLoS genetics, 2 (12) PMID: 17173485

Haag-Liautard C, Dorris M, Maside X, Macaskill S, Halligan DL, Houle D, Charlesworth B, & Keightley PD (2007). Direct estimation of per nucleotide and genomic deleterious mutation rates in Drosophila. Nature, 445 (7123), 82-5 PMID: 17203060

My review: This paper presents the first direct estimates of nucleotide mutation rates across the Drosophila genome derived from mutation accumulation experiments. By using DHPLC to scan over 20 megabases of genomic DNA, the authors obtain several fundamental results concerning mutation at the molecular level in Drosophila: SNPs are more frequent than indels; deletions are more frequent than insertions; mutation rates are similar across coding, intronic and intergenic regions; and mutation rates may vary across genetic backgrounds. Results in D. melanogaster contrast with those obtained from mutation accumulation experiments in C. elegans (see [1], where indels are more frequent than SNPs, and insertions are more frequent than deletions), indicating that basic mutation processes may vary across metazoan taxa. (@F1000: http://f1000.com/1070688)

References: 1. Denver DR, Morris K, Lynch M, & Thomas WK (2004). High mutation rate and predominance of insertions in the Caenorhabditis elegans nuclear genome. Nature, 430 (7000), 679-82 PMID: 15295601

Katzourakis A, Pereira V, & Tristem M (2007). Effects of recombination rate on human endogenous retrovirus fixation and persistence. Journal of virology, 81 (19), 10712-7 PMID: 17634225

My review: This study shows that the persistence, but not the integration, of long-terminal repeat (LTR) containing human endogenous retroviruses (HERVs) is associated with local recombination rate, and suggests a link between intra-strand homologous recombination and meiotic exchange. This inference about the mechanisms controlling the transposable element (TE) abundance is obtained by demonstrating that total HERV density (full-length elements plus solo LTRs) is not correlated with recombination rate, whereas the ratio of full-length HERVs relative to solo LTRs is. This work relies critically on advanced computational methods to join TE fragments, demonstrating the need for such algorithms to make accurate inferences about the evolution of mobile DNA and to reveal new insights into genome biology. (@F1000: http://f1000.com/1091037)

Giordano J, Ge Y, Gelfand Y, Abrusán G, Benson G, & Warburton PE (2007). Evolutionary history of mammalian transposons determined by genome-wide defragmentation. PLoS computational biology, 3 (7) PMID: 17630829

My review: This article reports the first comprehensive stratigraphic record of transposable element (TE) activity in mammalian genomes based on several innovative computational methods that use information encoded in patterns of TE nesting. The authors first develop an efficient algorithm for detecting nests of TEs by intelligently joining TE fragments identified by RepeatMasker, which (in addition to providing an improved genome annotation) outputs a global “interruption matrix” that can be used by a second novel algorithm which generates a chronological ordering of TE activity by minimizing the nesting of young TEs into old TEs. Interruption matrix analysis yields results that support previous phylogenetic analyses of TE activity in humans but are not dependent on the assumption of a molecular clock. Comparison of the chronological orders of TE activity in six mammalian genomes provides unique insights into the ancestral and lineage-specific record of global TE activity in mammals. (@F1000: http://f1000.com/1089045)
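The logic of the interruption matrix is worth spelling out: if elements of family A are found inserted into elements of family B, then A must have been active after B’s copies were already in place. Here is a toy sketch of one crude way to turn such a matrix into a chronological ordering; the real algorithm instead searches for the ordering that minimizes young-into-old nesting violations, and the families and counts below are invented for illustration:

```python
# Toy interruption matrix: m[a][b] = number of times elements of family a
# are found inserted into (interrupting) elements of family b.
m = {
    "MIR": {"MIR": 0,  "Alu": 2,  "L1": 1},
    "Alu": {"MIR": 30, "Alu": 0,  "L1": 5},
    "L1":  {"MIR": 55, "Alu": 40, "L1": 0},
}
families = list(m)

def net_youth(f):
    """Families that frequently interrupt others (and are rarely
    interrupted themselves) were active more recently."""
    interrupts = sum(m[f][g] for g in families if g != f)
    interrupted = sum(m[g][f] for g in families if g != f)
    return interrupts - interrupted

# Sort oldest-to-youngest by this simple heuristic score.
print(sorted(families, key=net_youth))  # ['MIR', 'Alu', 'L1']
```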

Schuemie MJ, & Kors JA (2008). Jane: suggesting journals, finding experts. Bioinformatics (Oxford, England), 24 (5), 727-8 PMID: 18227119

My review: This paper introduces a fast method for finding related articles and relevant journals/experts based on user input text, and should help improve the referencing, review and publication of biomedical manuscripts. The JANE (Journal/Author Name Estimator) method uses a standard word frequency approach to find similar documents, then adds the scores in the top 50 records to produce a ranked list of journals or authors. Using either the abstract or full text, JANE suggested quite sensible journals and authors in seconds for a manuscript we have in press, whereas the related eTBLAST method [1] failed to complete while I wrote this review. JANE should prove to be a very useful text mining tool for authors and editors alike. (@F1000: http://f1000.com/1101037)

References: 1. Errami M, Wren JD, Hicks JM, & Garner HR (2007). eTBLAST: a web server to identify expert reviewers, appropriate journals and similar publications. Nucleic acids research, 35 (Web Server issue) PMID: 17452348
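The ranking strategy described in this review (score records by word-frequency similarity to the input text, then sum the scores of the top 50 hits per journal) can be sketched in a few lines of Python. This is a toy version of the idea only, not JANE’s actual code:

```python
from collections import Counter, defaultdict
import math

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(n * b[w] for w, n in a.items() if w in b)
    norm = (math.sqrt(sum(n * n for n in a.values())) *
            math.sqrt(sum(n * n for n in b.values())))
    return dot / norm if norm else 0.0

def suggest_journals(query_text, corpus, top_n=50):
    """corpus: list of (journal, abstract_text) pairs. Score every record
    against the query, then sum the similarities of the top_n records
    per journal to produce a ranked list of journals."""
    query = Counter(query_text.lower().split())
    scored = sorted(((cosine(query, Counter(text.lower().split())), journal)
                     for journal, text in corpus), reverse=True)
    totals = defaultdict(float)
    for score, journal in scored[:top_n]:
        totals[journal] += score
    return sorted(totals.items(), key=lambda kv: -kv[1])
```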

Pask AJ, Behringer RR, & Renfree MB (2008). Resurrection of DNA function in vivo from an extinct genome. PloS one, 3 (5) PMID: 18493600

My review: This paper reports the first transgenic analysis of a cis-regulatory element cloned from an extinct species. Although no differences were seen in the expression pattern of the collagen (Col2A1) enhancer from the extinct Tasmanian tiger and extant mouse, this work is an important proof of principle for using ancient DNA in the evolutionary analysis of gene regulation. (@F1000: http://f1000.com/1108816)

Ginsberg J, Mohebbi MH, Patel RS, Brammer L, Smolinski MS, & Brilliant L (2009). Detecting influenza epidemics using search engine query data. Nature, 457 (7232), 1012-4 PMID: 19020500

My review: A landmark paper in health bioinformatics demonstrating that Google searches can predict influenza trends in the United States. Predicting infectious disease outbreaks currently relies on patient reports gathered through clinical settings and submitted to government agencies such as the CDC. The possible use of patient “self-reporting” through internet search queries offers unprecedented real-time access to temporal and regional trends in infectious diseases. Here, the authors use a linear modeling strategy to learn which Google search terms best correlate with regional trends in influenza-related illness. This model explains flu trends over a 5 year period with startling accuracy, and was able to predict flu trends during 2007-2008 with a 1-2 week lead time ahead of CDC reports. The phenomenal use of crowd-based predictive health informatics revolutionizes the role of the internet in biomedical research and will likely set an important precedent in many areas of natural sciences. (@F1000: http://f1000.com/1127181)
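The modeling idea described here, fitting a linear relationship between search-query volume and officially reported influenza-like illness (ILI), can be sketched as follows. The numbers are invented toy data, and the actual model used logit transforms over many candidate queries, so treat this purely as an illustration of the principle:

```python
import numpy as np

# Toy data: weekly fraction of searches matching flu-related queries,
# and the CDC-reported % of physician visits for influenza-like illness.
query_fraction = np.array([0.010, 0.014, 0.022, 0.035, 0.030, 0.018])
ili_percent = np.array([1.1, 1.6, 2.5, 4.0, 3.4, 2.0])

# Fit a simple linear model, ili ~ a * query + b, by least squares.
A = np.vstack([query_fraction, np.ones_like(query_fraction)]).T
(a, b), *_ = np.linalg.lstsq(A, ili_percent, rcond=None)

# "Nowcast" current ILI activity from today's query volume, available
# 1-2 weeks ahead of the official surveillance reports.
print(a * 0.025 + b)
```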

Taher L, & Ovcharenko I (2009). Variable locus length in the human genome leads to ascertainment bias in functional inference for non-coding elements. Bioinformatics (Oxford, England), 25 (5), 578-84 PMID: 19168912

My review: This paper raises the important observation that differences in the length of genes can bias their functional classification using the Gene Ontology, and provides a simple method to correct for this inherent feature of genome architecture. A basic observation of genome biology is that genes differ widely in their size and structure within and between species. Understanding the causes and consequences of this variation in gene structure is an open challenge in genome biology. Previously, Nelson and colleagues [1] have shown, in flies and worms, that the length of intergenic regions is correlated with the regulatory complexity of genes and that genes from different Gene Ontology (GO) categories have drastically different lengths. Here, Taher and Ovcharenko confirm this observation of functionally non-random gene length in the human genome, and discuss the implications of this feature of genome organization on analyses that employ the GO for functional inference. Specifically, these authors show that random selection of noncoding DNA sequences from the human genome leads to the false inference of over- and under-representation of specific GO categories that preferentially contain longer or shorter genes, respectively. This finding has important implications for the large number of studies that employ a combination of gene expression microarrays and GO enrichment analysis, since gene expression is largely controlled by noncoding DNA. The authors provide a simple method to correct for this bias in GO analyses, and show that previous reports of the enrichment of “ultraconserved” noncoding DNA sequences in vertebrate developmental genes [2] may be a statistical artifact. (@F1000: http://f1000.com/1157594)

References: 1. Nelson CE, Hersh BM, & Carroll SB (2004). The regulatory content of intergenic DNA shapes genome architecture. Genome biology, 5 (4) PMID: 15059258

2. Bejerano G, Pheasant M, Makunin I, Stephen S, Kent WJ, Mattick JS, & Haussler D (2004). Ultraconserved elements in the human genome. Science (New York, N.Y.), 304 (5675), 1321-5 PMID: 15131266
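The ascertainment bias described in this review is easy to demonstrate with a short simulation. In the toy genome below, both GO categories contain the same number of genes, but one category’s genes are ten times longer; the category labels and lengths are invented for illustration:

```python
import random

# Toy genome: 50 long "regulation" genes and 50 short "metabolism" genes
# (hypothetical categories and lengths, for illustration only).
genes = ([(100_000, "regulation")] * 50) + ([(10_000, "metabolism")] * 50)
total = sum(length for length, _ in genes)

# Drop random points ("noncoding elements") onto the concatenated genome
# and record which gene locus each one hits.
hits = {"regulation": 0, "metabolism": 0}
for _ in range(10_000):
    x = random.uniform(0, total)
    for length, category in genes:
        if x < length:
            hits[category] += 1
            break
        x -= length

# Despite equal gene counts, random elements land in the long-gene
# category ~10-fold more often, purely because of locus length.
print(hits)
```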

Hindorff LA, Sethupathy P, Junkins HA, Ramos EM, Mehta JP, Collins FS, & Manolio TA (2009). Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proceedings of the National Academy of Sciences of the United States of America, 106 (23), 9362-7 PMID: 19474294

My review: This article introduces results from human genome-wide association studies (GWAS) into the realm of large-scale functional genomic data mining. These authors compile the first curated database of trait-associated single-nucleotide polymorphisms (SNPs) from GWAS studies (http://www.genome.gov/gwastudies/) that can be mined for general features of SNPs underlying phenotypes in humans. By analyzing 531 SNPs from 151 GWAS studies, the authors discover that trait-associated SNPs are predominantly in non-coding regions (43% intergenic, 45% intronic), but that non-synonymous and promoter trait-associated SNPs are enriched relative to expectations. The database is actively maintained and growing, and currently contains 3943 trait-associated SNPs from 796 publications. This important resource will facilitate data mining and integration with high-throughput functional genomics data (e.g. ChIP-seq), as well as meta-analyses, to address important questions in human genetics, such as the discovery of loci that affect multiple traits. While the interface to the GWAS catalog is rather limited, a related project (http://www.gwascentral.org/) [1] provides a much more powerful interface for searching and browsing data from the GWAS catalog. (@F1000: http://f1000.com/8408956)

References: 1. Thorisson GA, Lancaster O, Free RC, Hastings RK, Sarmah P, Dash D, Brahmachari SK, & Brookes AJ (2009). HGVbaseG2P: a central genetic association database. Nucleic acids research, 37 (Database issue) PMID: 18948288

Tamames J, & de Lorenzo V (2010). EnvMine: a text-mining system for the automatic extraction of contextual information. BMC bioinformatics, 11 PMID: 20515448

My review: This paper describes EnvMine, an innovative text-mining tool to obtain physico-chemical and geographical information about environmental genomics samples. This work represents a pioneering effort to apply text-mining technologies in the domain of ecology, providing novel methods to extract the units and variables of physico-chemical entities, as well as to link the location of samples to worldwide geographic coordinates via Google Maps. Application of EnvMine to full-text articles in the environmental genomics database envDB [1] revealed very high system performance, suggesting that information extracted by EnvMine will be of use to researchers seeking meta-data about environmental samples across different domains of biology. (@F1000: http://f1000.com/3502956)

References: 1. Tamames J, Abellán JJ, Pignatelli M, Camacho A, & Moya A (2010). Environmental distribution of prokaryotic taxa. BMC microbiology, 10 PMID: 20307274