Why You Should Reject the “Rejection Improves Impact” Meme


Over the last two weeks, a meme has been making the rounds in the scientific twittersphere that goes something like “Rejection of a scientific manuscript improves its eventual impact”.  This idea is based on a recent analysis of patterns of manuscript submission reported in Science by Calcagno et al., which has been actively touted in the scientific press and seems to have touched a nerve with many scientists.

Nature News reported on this article on the first day of its publication (11 Oct 2012), with the statement that “papers published after having first been rejected elsewhere receive significantly more citations on average than ones accepted on first submission” (emphasis mine). The Scientist led its piece on the same day entitled “The Benefits of Rejection” with the claim that “Chances are, if a researcher resubmits her work to another journal, it will be cited more often”. Science Insider led the next day with the claim that “Rejection before publication is rare, and for those who are forced to revise and resubmit, the process will boost your citation record”. Influential science media figure Ed Yong tweeted “What doesn’t kill you makes you stronger – papers get more citations if they were initially rejected”. The message from the scientific media is clear: submitting your papers to selective journals and having them rejected is ultimately worth it, since you’ll get more citations when they are published somewhere lower down the scientific publishing food chain.

I will take on faith that the primary result of Calcagno et al. that underlies this meme is sound, since it has been vetted by the highest standard of editorial and peer review at Science magazine. However, I do note that it is not possible to independently verify this result, since the raw data for this analysis was not made available at the time of publication (contravening Science’s “Making Data Maximally Available Policy”), and has not been made available even after being queried. What I want to explore here is why this meme is being so uncritically propagated in the scientific press and twittersphere.

As succinctly noted by Joe Pickrell, anyone who takes even a cursory look at the basis for this claim would see that it is at best a weak effect*, and is clearly being overblown by the media and scientists alike.

Taken at face value, the way I read this graph is that papers that are rejected then published elsewhere have a median value of ~0.95 citations, whereas papers that are accepted at the first journal they are submitted to have a median value of ~0.90 citations. Although not explicitly stated in the figure legend or in the main text, I assume these results are on a natural log scale since, based on the font and layout, this plot was most likely made in R and the natural scale is the default in R (also, the authors refer to the natural scale in a different figure earlier in the text). Thus, the median number of citations per article that rejection may provide an author is on the order of ~0.1.  Even if this result is on the log10 scale, this difference translates to a boost of less than one citation.  While statistically significant, this can hardly be described as a “significant increase” in citation. Still excited?
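To make the arithmetic concrete, here is a quick back-of-the-envelope check in Python of what this gap in log-scale medians translates to in raw citations, under both the natural-log and log10 readings of the figure. Note that the median values of ~0.95 and ~0.90 are my own estimates read off the plot, not numbers reported by the authors:

```python
from math import exp

# Approximate medians read off the figure (my estimates, not reported values)
median_resubmitted = 0.95   # rejected, then published elsewhere
median_first_accept = 0.90  # accepted at the first journal

# If the axis is natural log, back-transform to raw citation counts
boost_ln = exp(median_resubmitted) - exp(median_first_accept)

# If instead the axis is log10
boost_log10 = 10**median_resubmitted - 10**median_first_accept

print(f"Boost if ln scale:    ~{boost_ln:.2f} citations")
print(f"Boost if log10 scale: ~{boost_log10:.2f} citations")
```

Either way, the median benefit of rejection comes out to less than one citation per article.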

More importantly, the analysis of the effects of rejection on citation is univariate and ignores most other possible confounding explanatory variables.  It is easy to imagine a large number of other confounding effects that could lead to this weak difference (number of reviews obtained, choice of original and final journals, number of authors, rejection rate/citation differences among disciplines or subdisciplines, etc.). In fact, in panel B of the same figure 4, the authors show a stronger effect of changing discipline on the number of citations in resubmitted manuscripts. Why a deeper multivariate analysis was not performed to back up the headline claim that “rejection improves impact” is hard to understand from a critical perspective. [UPDATE 26/10/2012: Bala Iyengar pointed out to me a page on the author’s website that discusses the effects of controlling for year and publishing journal on the citation effect, which led me to re-read the paper and supplemental materials more closely and see that these two factors are in fact controlled for in the main analysis of the paper. No other possible confounding factors are controlled for, however.]

So what is going on here? Why did Science allow such a weak effect with a relatively superficial analysis to be published in one of the supposedly most selective journals? Why are major science media outlets pushing this incredibly small boost in citations that is (possibly) associated with rejection? Likewise, why are scientists so uncritically posting links to the Nature and Scientist news pieces and repeating the “Rejection Improves Impact” meme?

I believe the answer to the first two questions is clear: Nature and Science have a vested interest in making the case that it is in the best interest of scientists to submit their most important work to (their) highly selective journals and risk having it be rejected.  This gives Nature and Science first crack at selecting the best science and serves to maintain their hegemony in the scientific publishing marketplace. If this interpretation is true, it is an incredibly self-serving stance for Nature and Science to take, and one that may backfire since, on the whole, scientists are not stupid people who blindly accept nonsense. More importantly, though, using the pages of Science and Nature as a marketing campaign to convince scientists to submit their work to these journals risks their credibility as arbiters of “truth”. If Science and Nature go so far as to publish and hype weak, self-serving scientometric effects to get us to submit our work there, what’s to say they would not do the same for actual scientific results?

But why are scientists taking the bait on this one?  This is more difficult to understand, but most likely has to do with the possibility that most people repeating this meme have not read the paper. Topsy records over 700 and 150 tweets linking to the Nature News and Scientist news pieces, respectively, but only ~10 posts linking to the original article in Science. Taken at face value, roughly 80-fold more scientists are reading the news about this article than reading the article itself. To be fair, this is due in part to the fact that the article is not open access and is behind a paywall, whereas the news pieces are freely available**. But this is only the proximal cause. The ultimate cause is likely that many scientists are happy to receive (uncritically, it seems) any justification, however tenuous, for continuing to play the high-impact-factor journal sweepstakes. Now we have a scientifically valid reason to take the risk of being rejected by top-tier journals, even if it doesn’t pay off. Right? Right?

The real shame in the “Rejection Improves Impact” spin is that an important take-home message of Calcagno et al. is that the vast majority of papers (>75%) are published in the first journal to which they are submitted.  As a scientific community we should continue to maintain and improve this trend, selecting the appropriate home for our work on initial submission. Justifying pipe-dreams that waste precious time based on self-serving spin that benefits the closed-access publishing industry should be firmly: Rejected.

Don’t worry, it’s probably in the best interest of Science and Nature that you believe this meme.

* To be fair, Science Insider does acknowledge that the effect is weak: “previously rejected papers had a slight bump in the number of times they were cited by other papers” (emphasis mine).

** Following a link available on the author’s website, you can access this article for free here.

Calcagno, V., Demoinet, E., Gollner, K., Guidi, L., Ruths, D., & de Mazancourt, C. (2012). Flows of Research Manuscripts Among Scientific Journals Reveal Hidden Submission Patterns. Science. DOI: 10.1126/science.1227833

Related Posts


The Cost to Science of the ENCODE Publication Embargo

The big buzz in the genomics twittersphere today is the release of over 30 publications on the human ENCODE project. This is a heroic achievement, both in terms of science and publishing, with many groundbreaking discoveries in biology and pioneering developments in publishing to be found in this set of papers. It is a triumph that all of these papers are freely available to read, and much is being said elsewhere in the blogosphere about the virtues of this project and the lessons learned from the publication of these data. I’d like to pick up here on an important point made by Daniel MacArthur in his post about the delays in the publication of these landmark papers that have arisen from the common practice of embargoing papers in genomics. To be clear, I am not talking about embargoing the use of data (which is also problematic), but embargoing the release of manuscripts that have been accepted for publication after peer review.

MacArthur writes:

Many of us in the genomics community were aware of the progress the [ENCODE] project had been making via conference presentations and hallway conversations with participants. However, many other researchers who might have benefited from early access to the ENCODE data simply weren’t aware of its existence until today’s dramatic announcement – and as a result, these people are 6-12 months behind in their analyses.

It is important to emphasize that these publication delays are by design, and are driven primarily by the journals that set the publication schedules for major genomics papers. I saw first-hand how Nature sets the agenda for major genomics papers and their associated companion papers as part of the Drosophila 12 Genomes Project. This insider’s view left a distinctly bad taste in my mouth about how much control a single journal has over some of the most important community resource papers that are published in Biology.  To give more people insight into this process, I am posting the agenda set by Nature for publication (in reverse chronological order) of the main Drosophila 12 Genomes paper, which went something like this:

7 Nov 2007: papers are published, embargo lifted on main/companion papers
28 Sept 2007: papers must be in production
21 Sept 2007: revised versions of papers received
17 Aug 2007: reviews are returned to authors
27 Jul 2007: papers are submitted

Not only was acceptance of the manuscript essentially assumed by the Nature editorial staff, but the entire timeline was spelled out in advance, with an embargo built into the process from the outset. Seeing this process unfold first hand was shocking to me, and has made me very skeptical of the power that the major journals have to dictate terms about how we, and other journals, publish our work.

Personally, I cannot see how this embargo system serves anyone in science other than the major journals. There is no valid scientific reason that major genome papers and their companions cannot be made available as online accepted preprints, as is now standard practice in the publishing industry. As scientists, we have a duty to ensure that the science we produce is released to the general public and community of scientists as rapidly and openly as possible. We do not have a duty to serve the agenda of a journal to increase its cachet or revenue stream. I am aware that we need to accept delays due to quality control via the peer review and publication process. But the delays due to the normal peer review process are bad enough, as ably discussed recently by Leslie Vosshall. Why on earth would we accept that journals build further unnecessary delays into the publication process?

This of course leads to the pertinent question: how harmful is this system of embargoes? Well, we can put an upper estimate on* this pretty easily from the submission/acceptance dates of the main and companion ENCODE papers (see table below). In general, most ENCODE papers were embargoed for a minimum of 2 months, but some were embargoed for nearly 7 months. Ignoring (unfairly) the direct impact that these delays may have on the careers of PhD students and post-docs involved, something on the order of 112 months of access to these important papers has been lost to all scientists by this single embargo. Put another way, up to* 10 years of access time to these papers has been collectively lost to science because of the ENCODE embargo. To the extent that these papers are crucial for understanding the human genome, and the consequences this knowledge has for human health, this decade lost to humanity is clearly unacceptable. Let us hope that the ENCODE project puts an end to the era of journal-mandated embargoes in genomics.

DOI Date Received Date Accepted Date published Months in review Months in embargo
nature11247 24-Nov-11 29-May-12 05-Sep-12 6.0 3.2
nature11233 10-Dec-11 15-May-12 05-Sep-12 5.1 3.6
nature11232 15-Dec-11 15-May-12 05-Sep-12 4.9 3.6
nature11212 11-Dec-11 10-May-12 05-Sep-12 4.9 3.8
nature11245 09-Dec-11 22-May-12 05-Sep-12 5.3 3.4
nature11279 09-Dec-11 01-Jun-12 05-Sep-12 5.6 3.1
gr.134445.111 06-Nov-11 07-Feb-12 05-Sep-12 3.0 6.8
gr.134957.111 16-Nov-11 01-May-12 05-Sep-12 5.4 4.1
gr.133553.111 17-Oct-11 05-Jun-12 05-Sep-12 7.5 3.0
gr.134767.111 11-Nov-11 03-May-12 05-Sep-12 5.6 4.0
gr.136838.111 21-Dec-11 30-Apr-12 05-Sep-12 4.2 4.1
gr.127761.111 16-Jun-11 27-Mar-12 05-Sep-12 9.2 5.2
gr.136101.111 09-Dec-11 30-Apr-12 05-Sep-12 4.6 4.1
gr.134890.111 23-Nov-11 10-May-12 05-Sep-12 5.5 3.8
gr.134478.111 07-Nov-11 01-May-12 05-Sep-12 5.7 4.1
gr.135129.111 21-Nov-11 08-Jun-12 05-Sep-12 6.5 2.9
gr.127712.111 15-Jun-11 27-Mar-12 05-Sep-12 9.2 5.2
gr.136366.111 13-Dec-11 04-May-12 05-Sep-12 4.6 4.0
gr.136127.111 16-Dec-11 24-May-12 05-Sep-12 5.2 3.4
gr.135350.111 25-Nov-11 22-May-12 05-Sep-12 5.8 3.4
gr.132159.111 17-Sep-11 07-Mar-12 05-Sep-12 5.5 5.9
gr.137323.112 05-Jan-12 02-May-12 05-Sep-12 3.8 4.1
gr.139105.112 25-Mar-12 07-Jun-12 05-Sep-12 2.4 2.9
gr.136184.111 10-Dec-11 10-May-12 05-Sep-12 4.9 3.8
gb-2012-13-9-r48 21-Dec-11 08-Jun-12 05-Sep-12 5.5 2.9
gb-2012-13-9-r49 28-Mar-12 08-Jun-12 05-Sep-12 2.3 2.9
gb-2012-13-9-r50 04-Dec-11 18-Jun-12 05-Sep-12 6.4 2.5
gb-2012-13-9-r51 23-Mar-12 25-Jun-12 05-Sep-12 3.0 2.3
gb-2012-13-9-r52 09-Mar-12 25-May-12 05-Sep-12 2.5 3.3
gb-2012-13-9-r53 29-Mar-12 19-Jun-12 05-Sep-12 2.6 2.5
Min 2.3 2.3
Max 9.2 6.8
Avg 5.1 3.7
Sum 152.7 112.1
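The months-in-review and months-in-embargo columns above can be recomputed directly from the received/accepted/published dates. A minimal sketch in Python for three sample rows (the average-month-length convention is my assumption; results land within ~0.1 month of the table values):

```python
from datetime import date

# Three rows from the table above: (DOI, received, accepted, published)
papers = [
    ("nature11247", date(2011, 11, 24), date(2012, 5, 29), date(2012, 9, 5)),
    ("gr.134445.111", date(2011, 11, 6), date(2012, 2, 7), date(2012, 9, 5)),
    ("gb-2012-13-9-r51", date(2012, 3, 23), date(2012, 6, 25), date(2012, 9, 5)),
]

DAYS_PER_MONTH = 365.25 / 12  # average month length (assumed convention)

for doi, received, accepted, published in papers:
    in_review = (accepted - received).days / DAYS_PER_MONTH
    in_embargo = (published - accepted).days / DAYS_PER_MONTH
    print(f"{doi}: {in_review:.1f} months in review, "
          f"{in_embargo:.1f} months in embargo")
```

Summing the embargo column over all 30 papers gives the ~112 months quoted above.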


* Based on a conversation on Twitter with Chris Cole, I’ve revised this estimate to reflect an upper bound, rather than a point estimate, of the time lost to science.

Announcing the PLoS Text Mining Collection

Based on a spur-of-the-moment tweet earlier this year, and a positive follow-up from Theo Bloom, I’m very happy to announce that PLoS has now put the wheels in motion to develop a Collection of articles that highlight the importance of Text Mining research. The Call for Papers has just been announced today, and I’m very excited to see this effort highlight the synergy between Open Access, Altmetrics and Text Mining research. I’m particularly keen to see someone take the reins on writing a good description of the API for PLoS (and other publishers). And a good lesson to all to be careful to watch what you tweet!

The Call for Papers below is cross-posted at the PLoS Blog

Call for Papers: PLoS Text Mining Collection

The Public Library of Science (PLoS) seeks submissions in the broad field of text-mining research for a collection to be launched across all of its journals in 2013. All manuscripts submitted before October 30th, 2012 will be considered for the launch of the collection. Please read the following post for further information on how to submit your article.

The scientific literature is exponentially increasing in size, with thousands of new papers published every day. Few researchers are able to keep track of all new publications, even in their own field, reducing the quality of scholarship and leading to undesirable outcomes like redundant publication. While social media and expert recommendation systems provide partial solutions to the problem of keeping up with the literature, systematically identifying relevant articles and extracting key information from them can only come through automated text-mining technologies.

Research in text mining has made incredible advances over the last decade, driven through community challenges and increasingly sophisticated computational technologies. However, the promise of text mining to accelerate and enhance research largely has not yet been fulfilled, primarily since the vast majority of the published scientific literature is not published under an Open Access model. As Open Access publishing yields an ever-growing archive of unrestricted full-text articles, text mining will play an increasingly important role in drilling down to essential research and data in scientific literature in the 21st century scholarly landscape.

As part of its commitment to realizing the maximal utility of Open Access literature, PLoS is launching a collection of articles dedicated to highlighting the importance of research in the area of text mining. The launch of this Text Mining Collection complements related PLoS Collections on Open Access and Altmetrics (forthcoming), as well as the recent release of the PLoS Application Programming Interface, which provides an open API to PLoS journal content.

As part of this Text Mining Collection, we are making a call for high quality submissions that advance the field of text-mining research, including:

  • New methods for the retrieval or extraction of published scientific facts
  • Large-scale analysis of data extracted from the scientific literature
  • New interfaces for accessing the scientific literature
  • Semantic enrichment of scientific articles
  • Linking the literature to scientific databases
  • Application of text mining to database curation
  • Approaches for integrating text mining into workflows
  • Resources (ontologies, corpora) to improve text mining research

Please note that all manuscripts submitted before October 30th, 2012 will be considered for the launch of the collection (expected early 2013); submissions after this date will still be considered for the collection, but may not appear in the collection at launch.

Submission Guidelines
If you wish to submit your research to the PLoS Text Mining Collection, please consider the following when preparing your manuscript:

All articles must adhere to the submission guidelines of the PLoS journal to which you submit.
Standard PLoS policies and relevant publication fees apply to all submissions.
Submission to any PLoS journal as part of the Text Mining Collection does not guarantee publication.

When you are ready to submit your manuscript to the collection, please log in to the relevant PLoS manuscript submission system and mention the Collection’s name in your cover letter. This will ensure that the staff is aware of your submission to the Collection. The submission systems can be found on the individual journal websites.

Please contact Samuel Moore (smoore@plos.org) if you would like further information about how to submit your research to the PLoS Text Mining Collection.

Casey Bergman (University of Manchester)
Lawrence Hunter (University of Colorado-Denver)
Andrey Rzhetsky (University of Chicago)


Did Finishing the Drosophila Genome Legitimize Open Access Publishing?

I’m currently reading Glyn Moody‘s (2003) “Digital Code of Life: How Bioinformatics is Revolutionizing Science, Medicine, and Business” and greatly enjoying the writing as well as the whirlwind summary of the history of Bioinformatics and the (Human) Genome Project(s). Most of what Moody says that I am familiar with is quite accurate, and his scholarship is thorough, so I find his telling of the story compelling. One claim I find new and curious in this book is in his discussion of the sequencing of the Drosophila melanogaster genome, more precisely the “finishing” of this genome, and its impact on the legitimacy of Open Access publishing.

The sequencing of D. melanogaster was done as a collaboration between the Berkeley Drosophila Genome Project and Celera, as a test case to prove that whole-genome shotgun sequencing could be applied to large animal genomes.  I won’t go into the details here, but it is a widely regarded fact that the Adams et al. (2000) and Myers et al. (2000) papers in Science demonstrated the feasibility of whole-genome shotgun sequencing, but it was a lesser-known paper by Celniker et al. (2002) in Genome Biology which reported the “finished” D. melanogaster genome that proved the accuracy of whole-genome shotgun sequencing assembly. No controversy here.

More debatable is what Moody goes on to write about the Celniker et al. (2002) paper:

This was an important paper, then, and one that had a significance that went beyond its undoubted scientific value. For it appeared neither in Science, as the previous Drosophila papers had done, nor in Nature, the obvious alternative. Instead, it was published in Genome Biology. This describes itself as “a journal, delivered over the web.” That is, the Web is the primary medium, with the printed version offering a kind of summary of the online content in a convenient portable form. The originality of Genome Biology does not end there: all of its main research articles are available free online.

A description then follows of the history and virtues of PubMed Central and the earliest Open Access biomedical publishers BioMed Central and PLoS. Moody (emphasis mine) then returns to the issue of:

…whether a journal operating on [Open Access] principles could attract top-ranked scientists. This question was answered definitively in the affirmative with the announcement and analysis of the finished Drosophila sequence in January 2003. This key opening paper’s list of authors included not only [Craig] Venter, [Gene] Myers, and [Mark] Adams, but equally stellar representatives of the academic world of Science, such as Gerald Rubin, the boss of the fruit fly genome project, and Richard Gibbs, head of sequencing at Baylor College. Alongside this paper there were no less than nine other weighty contributions, including one on Apollo, a new tool for viewing and editing sequence annotation. For its own Drosophila extravaganza of March 2000, Science had marshalled seven papers in total. Clearly, Genome Biology had arrived, and with it a new commercial publishing model based on the latest way of showing the data.

This passage resonated with me since I was working at the BDGP at the time this special issue on the finishing of the Drosophila genome in Genome Biology was published, and was personally introduced to Open Access publishing through this event.  I recall Rubin walking the hallways of building 64 on his periodic visits promoting this idea, motivating us all to work hard to get our papers together by the end of 2002 for this unique opportunity. I also remember lugging around stacks of the printed issue at the Fly meeting in Chicago in 2003, plying unsuspecting punters with a copy of a journal that most people had never heard of, and having some of my first conversations with people on Open Access as a consequence.

What Moody doesn’t capture in this telling is the fact that Rubin’s decision to publish in Genome Biology almost surely owes itself to the influence that Mike Eisen had on Rubin and others in the genomics community in Berkeley at the time. Eisen and Rubin had recently collaborated on a paper, Eisen had made inroads in Berkeley on the Open Access issue by actively recruiting signatories for the PLoS open letter the year before, and Eisen himself published his first Open Access paper in Oct 2002 in Genome Biology. So clearly the idea of publishing in Open Access journals, and in particular Genome Biology, was in the air at the time, and it may not have been as bold a step for Rubin to take as Moody implies.

Nevertheless, it is a point that may have some truth, and I think it is interesting to consider whether the long-standing open data philosophy of the Drosophila genetics community that led to the Genome Biology special issue was a key turning point in the widespread success of Open Access publishing over the next decade. Surely the movement would have taken off anyway at some point. But in late 2002, when the BioMed Central journals were the only place to publish gold Open Access articles, few people had tested the waters since the launch of the BMC journals in 2000. While we cannot replay the tape, Moody’s claim is plausible in my view, and it is interesting to ask whether widespread buy-in to Open Access publishing in biology might have been delayed if Rubin had not insisted that the efforts of the Berkeley Drosophila Genome Project be published under an Open Access model.

UPDATE 25 March 2012

After tweeting this post, here is what Eisen and Moody have to say:

UPDATE 19 May 2012

It appears that the publication of another part of the Drosophila (meta)genome, its Wolbachia endosymbiont, played an important role in the conversion of Jonathan Eisen to supporting Open Access. Read more here.

The Roberts/Ashburner Response

A previous post on this blog shared a helpful boilerplate response to editors for politely declining to review for non-Open Access journals, which I received originally from Michael Ashburner. During a quick phone chat today, Ashburner told me that he in fact inherited a version of this response originally from Nobel laureate Richard Roberts, co-discoverer of introns, lead author on the Open Letter to Science calling for a “Genbank” of the scientific literature, and long-time editor of Nucleic Acids Research, one of the first classical journals to move to a fully Open Access model. So to give credit where it is due, I’ve updated the title of the “Just Say No” post to make the attribution of this letter more clear. We owe both Roberts and Ashburner many thanks for paving the way to a better model of scientific communication and leading by example.

Why the Research Works Act Doesn’t Affect Text-mining Research

As the central digital repository for life science publications, PubMed Central (PMC) is one of the most significant resources for making the Open Access movement a tangible reality for researchers and citizens around the world. Articles in PMC are deposited through two routes: either automatically by journals that participate in the PMC system, or directly by authors for journals that do not. Author deposition of peer-reviewed manuscripts in PMC is mandated by funders in order to make the results of publicly- or charity-funded research maximally accessible, and has led to over 180,000 articles being made free (gratis) to download that would otherwise be locked behind closed-access paywalls. Justifiably, there has been outrage over recent legislation (the Research Works Act) that would repeal the NIH mandate in the USA and thereby prevent important research from being freely available.

However, from a text-miner’s perspective author-deposited manuscripts in PMC are closed access since, while they can be downloaded and read individually, virtually none (<200) are available from the PMC’s Open Access subset that includes all articles that are free (libre) to download in bulk and text/data mine. This includes ~99% of the author deposited manuscripts from the journal Nature, despite a clear statement from 2009 entitled “Nature Publishing Group allows data- and text-mining on self-archived manuscripts”. In short, funder mandates only make manuscripts public but not open, and thus whether the RWA is passed or not is actually moot from a text/data-mining perspective.

Why is this important? The simple reason is that there are currently only ~400,000 articles in the PMC Open Access subset, and therefore author-deposited manuscripts are only two-fold less abundant than all articles currently available for text/data-mining. Thus what could be a potentially rich source of data for large-scale information extraction remains locked away from programmatic analysis. This is especially tragic considering the fact that at the point of manuscript acceptance, publishers have invested little-to-nothing into the publication process and their claim to copyright is most tenuous.

So instead of discussing whether we should support the status quo of weak funder mandates by working to block the RWA or expand NIH-like mandates (e.g. as under the Federal Research Public Access Act, FRPAA), the real discussion that needs to be had is how to make funder mandates stronger to insist (at a minimum) that author-deposited manuscripts be available for text/data-mining research. More of the same, no matter how much, only takes us half the distance towards the ultimate goals of the Open Access movement, and doesn’t permit the crucial text/data mining research that is needed to make sense of the deluge of information in the scientific literature.

Credits: Max Haussler for making me aware of the lack of author manuscripts in PMC a few years back, and Heather Piwowar for recently jump-starting the key conversation on how to push for improved text/data mining rights in future funder mandates.

Related Posts:

An Open Archive of My F1000 Reviews

Following on from a recent conversation with David Stephens on Twitter about my decision to resign from Faculty of 1000, F1000 has clarified their terms for the submission of evaluations and confirmed that it is permissible to “reproduce personal evaluations on institutional & personal blogs if you clearly reference F1000”.

As such, I am delighted to be able to repost here an Open Archive of my F1000 contributions. Additionally, this post acts in a second capacity as my first contribution to the Research Blogging Network. Hopefully these commentaries will be of interest to some, and should add support to the Altmetrics profiles for these papers through systems like Total Impact.


Nelson CE, Hersh BM, & Carroll SB (2004). The regulatory content of intergenic DNA shapes genome architecture. Genome biology, 5 (4) PMID: 15059258

My review: This article reports that genes with complex expression have longer intergenic regions in both D. melanogaster and C. elegans, and introduces several innovative and complementary approaches to quantify the complexity of gene expression in these organisms. Additionally, the structure of intergenic DNA in genes with high complexity (e.g. receptors, specific transcription factors) is shown to be longer and more evenly distributed over 5′ and 3′ regions in D. melanogaster than in C. elegans, whereas genes with low complexity (e.g. metabolic genes, general transcription factors) are shown to have similar intergenic lengths in both species and exhibit no strong differences in length between 5′ and 3′ regions. This work suggests that the organization of noncoding DNA may reflect constraints on transcriptional regulation and that gene structure may yield insight into the functional complexity of uncharacterized genes in compact animal genomes. (@F1000: http://f1000.com/1032936)


Li R, Ye J, Li S, Wang J, Han Y, Ye C, Wang J, Yang H, Yu J, Wong GK, & Wang J (2005). ReAS: Recovery of ancestral sequences for transposable elements from the unassembled reads of a whole genome shotgun. PLoS computational biology, 1 (4) PMID: 16184192

My review: This paper presents a novel method for automating the laborious task of constructing libraries of transposable element (TE) consensus sequences. Since repetitive TE sequences confound whole-genome shotgun (WGS) assembly algorithms, sequence reads from TEs are initially screened from WGS assemblies based on overrepresented k-mer frequencies. Here, the authors invert the same principle, directly identifying TE consensus sequences from those same reads containing high frequency k-mers. The method was shown to identify all high copy number TEs and increase the effectiveness of repeat masking in the rice genome. By circumventing the inherent difficulties of TE consensus reconstruction from erroneously assembled genome sequences, and by providing a method to identify TEs prior to WGS assembly, this method provides a new strategy to increase the accuracy of WGS assemblies as well as our understanding of the TEs in genome sequences. (@F1000: http://f1000.com/1031746)


Rifkin SA, Houle D, Kim J, & White KP (2005). A mutation accumulation assay reveals a broad capacity for rapid evolution of gene expression. Nature, 438 (7065), 220-3 PMID: 16281035

My review: This paper reports empirical estimates of the mutational input to gene expression variation in Drosophila, knowledge of which is critical for understanding the mechanisms governing regulatory evolution. These direct estimates of mutational variance are compared to gene expression differences across species, revealing that the majority of genes have lower expression divergence than is expected if evolving solely by mutation and genetic drift. Mutational variances on a gene-by-gene basis range over several orders of magnitude and are shown to vary with gene function and developmental context. Similar results in C. elegans [1] provide strong support for stabilizing selection as the dominant mode of gene expression evolution. (@F1000: http://f1000.com/1040157)

References: 1. Denver DR, Morris K, Streelman JT, Kim SK, Lynch M, & Thomas WK (2005). The transcriptional consequences of mutation and natural selection in Caenorhabditis elegans. Nature genetics, 37 (5), 544-8 PMID: 15852004


Caspi A, & Pachter L (2006). Identification of transposable elements using multiple alignments of related genomes. Genome research, 16 (2), 260-70 PMID: 16354754

My review: This paper reports an innovative strategy for the de novo detection of transposable elements (TEs) in genome sequences based on comparative genomic data. By capitalizing on the fact that bursts of TE transposition create large insertions in multiple genomic locations, the authors show that detection of repeat insertion regions (RIRs) in alignments of multiple Drosophila genomes has high sensitivity to identify both individual instances and families of known TEs. This approach opens a new direction in the field of repeat detection and provides added value to TE annotations by placing insertion events in a phylogenetic context. (@F1000: http://f1000.com/1049265)


Simons C, Pheasant M, Makunin IV, & Mattick JS (2006). Transposon-free regions in mammalian genomes. Genome research, 16 (2), 164-72 PMID: 16365385

My review: This paper presents an intriguing analysis of transposon-free regions (TFRs) in the human and mouse genomes, under the hypothesis that TFRs indicate genomic regions where transposon insertion is deleterious and removed by purifying selection. The authors test and reject a model of random transposon distribution and investigate the properties of TFRs, which appear to be conserved in location across species and enriched for genes (especially transcription factors and micro-RNAs). An alternative mutational hypothesis not considered by the authors is the possibility for clustered transposon integration (i.e. preferential insertion into regions of the genome already containing transposons), which may provide a non-selective explanation for the apparent excess of TFRs in the human and mouse genomes. (@F1000: http://f1000.com/1010399)
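The random-insertion null model that the authors test can be sketched as a simple simulation: place the observed number of insertions uniformly at random and ask how often at least as many long transposon-free gaps arise by chance. This is a toy illustration of the logic, not the paper's actual statistical procedure; all parameter names and thresholds are hypothetical:

```python
import random

def count_tfrs(positions, genome_len, min_gap):
    """Count gaps between consecutive insertions that are at least min_gap long."""
    pts = sorted(positions)
    gaps = [pts[0]] + [b - a for a, b in zip(pts, pts[1:])] + [genome_len - pts[-1]]
    return sum(g >= min_gap for g in gaps)

def tfr_p_value(observed_tfrs, n_insertions, genome_len, min_gap, n_sim=1000, seed=1):
    """Fraction of random placements showing at least as many TFRs as observed."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sim):
        pos = [rng.randrange(genome_len) for _ in range(n_insertions)]
        if count_tfrs(pos, genome_len, min_gap) >= observed_tfrs:
            hits += 1
    return hits / n_sim
```

A small p-value under this null would indicate more (or longer) TFRs than random placement predicts, which is the pattern the authors interpret as purifying selection.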


Wheelan SJ, Scheifele LZ, Martínez-Murillo F, Irizarry RA, & Boeke JD (2006). Transposon insertion site profiling chip (TIP-chip). Proceedings of the National Academy of Sciences of the United States of America, 103 (47), 17632-7 PMID: 17101968

My review: This paper demonstrates the utility of whole-genome microarrays for the high-throughput mapping of eukaryotic transposable element (TE) insertions on a genome-wide basis. With an experimental design guided by first computationally digesting the genome into suitable fragments, followed by linker-PCR to amplify TE flanking regions and subsequent hybridization to tiling arrays, this method was shown to recover all detectable TE insertions with essentially no false positives in yeast. Although limited to species with available genome sequences, this approach circumvents inefficiencies and biases associated with the alternative of whole-genome shotgun resequencing to detect polymorphic TEs on a genome-wide scale. Application of this or related technologies (e.g. [1]) to more complex genomes should fill gaps in our understanding of the contribution of TE insertions to natural genetic variation. (@F1000: http://f1000.com/1088573)

References: 1. Gabriel A, Dapprich J, Kunkel M, Gresham D, Pratt SC, & Dunham MJ (2006). Global mapping of transposon location. PLoS genetics, 2 (12) PMID: 17173485


Haag-Liautard C, Dorris M, Maside X, Macaskill S, Halligan DL, Houle D, Charlesworth B, & Keightley PD (2007). Direct estimation of per nucleotide and genomic deleterious mutation rates in Drosophila. Nature, 445 (7123), 82-5 PMID: 17203060

My review: This paper presents the first direct estimates of nucleotide mutation rates across the Drosophila genome derived from mutation accumulation experiments. By using DHPLC to scan over 20 megabases of genomic DNA, the authors obtain several fundamental results concerning mutation at the molecular level in Drosophila: SNPs are more frequent than indels; deletions are more frequent than insertions; mutation rates are similar across coding, intronic and intergenic regions; and mutation rates may vary across genetic backgrounds. Results in D. melanogaster contrast with those obtained from mutation accumulation experiments in C. elegans (see [1], where indels are more frequent than SNPs, and insertions are more frequent than deletions), indicating that basic mutation processes may vary across metazoan taxa. (@F1000: http://f1000.com/1070688)

References: 1. Denver DR, Morris K, Lynch M, & Thomas WK (2004). High mutation rate and predominance of insertions in the Caenorhabditis elegans nuclear genome. Nature, 430 (7000), 679-82 PMID: 15295601


Katzourakis A, Pereira V, & Tristem M (2007). Effects of recombination rate on human endogenous retrovirus fixation and persistence. Journal of virology, 81 (19), 10712-7 PMID: 17634225

My review: This study shows that the persistence, but not the integration, of long-terminal repeat (LTR) containing human endogenous retroviruses (HERVs) is associated with local recombination rate, and suggests a link between intra-strand homologous recombination and meiotic exchange. This inference about the mechanisms controlling transposable element (TE) abundance is obtained by demonstrating that total HERV density (full-length elements plus solo LTRs) is not correlated with recombination rate, whereas the ratio of full-length HERVs relative to solo LTRs is. This work relies critically on advanced computational methods to join TE fragments, demonstrating the need for such algorithms to make accurate inferences about the evolution of mobile DNA and to reveal new insights into genome biology. (@F1000: http://f1000.com/1091037)


Giordano J, Ge Y, Gelfand Y, Abrusán G, Benson G, & Warburton PE (2007). Evolutionary history of mammalian transposons determined by genome-wide defragmentation. PLoS computational biology, 3 (7) PMID: 17630829

My review: This article reports the first comprehensive stratigraphic record of transposable element (TE) activity in mammalian genomes based on several innovative computational methods that use information encoded in patterns of TE nesting. The authors first develop an efficient algorithm for detecting nests of TEs by intelligently joining TE fragments identified by RepeatMasker; in addition to providing an improved genome annotation, this yields a global “interruption matrix” that a second novel algorithm uses to generate a chronological ordering of TE activity by minimizing the nesting of young TEs into old TEs. Interruption matrix analysis yields results that support previous phylogenetic analyses of TE activity in humans but are not dependent on the assumption of a molecular clock. Comparison of the chronological orders of TE activity in six mammalian genomes provides unique insights into the ancestral and lineage-specific record of global TE activity in mammals. (@F1000: http://f1000.com/1089045)
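The logic of ordering TE families by minimizing "old-interrupts-young" nestings can be shown with a brute-force miniature. This is only a toy sketch of the principle, not the authors' algorithm (which must scale to whole genomes); the family names and interruption counts are invented:

```python
from itertools import permutations

def best_chronology(names, interrupts):
    """interrupts[a][b] = number of times family a is found inserted into family b.
    Younger families insert into older ones, so an ordering (oldest -> youngest)
    is penalized whenever an earlier (older) family interrupts a later (younger)
    one. Brute force over permutations; feasible only for a handful of families."""
    def cost(order):
        return sum(interrupts.get(order[i], {}).get(order[j], 0)
                   for i in range(len(order))
                   for j in range(i + 1, len(order)))
    return min(permutations(names), key=cost)
```

For realistic numbers of TE families an exhaustive search is impossible, which is why the paper's interruption-matrix method matters.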


Schuemie MJ, & Kors JA (2008). Jane: suggesting journals, finding experts. Bioinformatics (Oxford, England), 24 (5), 727-8 PMID: 18227119

My review: This paper introduces a fast method for finding related articles and relevant journals/experts based on user input text and should help improve the referencing, review and publication of biomedical manuscripts. The JANE (Journal/Author Name Estimator) method uses a standard word frequency approach to find similar documents, then adds the scores in the top 50 records to produce a ranked list of journals or authors. Using either the abstract or full-text, JANE suggested quite sensible journals and authors in seconds for a manuscript we have in press, whereas the related eTBLAST method [1] had still not completed by the time I finished writing this review. JANE should prove to be a very useful text mining tool for authors and editors alike. (@F1000: http://f1000.com/1101037)

References: 1. Errami M, Wren JD, Hicks JM, & Garner HR (2007). eTBLAST: a web server to identify expert reviewers, appropriate journals and similar publications. Nucleic acids research, 35 (Web Server issue) PMID: 17452348
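The word-frequency scoring that JANE uses can be approximated in a few lines: score each record against the query, then sum the scores of the top-matching records per journal. A toy sketch only; the cosine measure, record format, and function names are assumptions, not the actual JANE implementation:

```python
import math
from collections import Counter

def tf_cosine(a, b):
    """Cosine similarity between bag-of-words term-frequency vectors."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def suggest_journals(query, records, top_n=50):
    """Rank journals by summing similarity scores of their top-matching records.
    records: list of (journal, abstract_text) pairs."""
    scored = sorted(((tf_cosine(query, text), journal) for journal, text in records),
                    reverse=True)[:top_n]
    totals = {}
    for score, journal in scored:
        totals[journal] = totals.get(journal, 0.0) + score
    return sorted(totals.items(), key=lambda kv: -kv[1])
```

Summing over the top records (rather than taking only the single best match) rewards journals that publish many articles similar to the query, which is the behavior the review describes.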


Pask AJ, Behringer RR, & Renfree MB (2008). Resurrection of DNA function in vivo from an extinct genome. PloS one, 3 (5) PMID: 18493600

My review: This paper reports the first transgenic analysis of a cis-regulatory element cloned from an extinct species. Although no differences were seen in the expression pattern of the collagen (Col2A1) enhancer from the extinct Tasmanian tiger and extant mouse, this work is an important proof of principle for using ancient DNA in the evolutionary analysis of gene regulation. (@F1000: http://f1000.com/1108816)


Ginsberg J, Mohebbi MH, Patel RS, Brammer L, Smolinski MS, & Brilliant L (2009). Detecting influenza epidemics using search engine query data. Nature, 457 (7232), 1012-4 PMID: 19020500

My review: A landmark paper in health bioinformatics demonstrating that Google searches can predict influenza trends in the United States. Predicting infectious disease outbreaks currently relies on patient reports gathered through clinical settings and submitted to government agencies such as the CDC. The possible use of patient “self-reporting” through internet search queries offers unprecedented real-time access to temporal and regional trends in infectious diseases. Here, the authors use a linear modeling strategy to learn which Google search terms best correlate with regional trends in influenza-related illness. This model explains flu trends over a 5-year period with startling accuracy, and was able to predict flu trends during 2007-2008 with a 1-2 week lead time ahead of CDC reports. The phenomenal use of crowd-based predictive health informatics revolutionizes the role of the internet in biomedical research and will likely set an important precedent in many areas of natural sciences. (@F1000: http://f1000.com/1127181)
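At its core, the modeling strategy is a linear fit relating search-query frequency to influenza-like-illness rates (the paper fits on a logit scale and selects among millions of candidate queries). Stripped to its essentials, the fit reduces to ordinary least squares; a minimal single-predictor sketch, with all data invented:

```python
def fit_linear(x, y):
    """Ordinary least squares fit y ~ a*x + b for a single predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    a = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    return a, my - a * mx

def predict(model, x):
    """Apply the fitted line to new query-frequency values."""
    a, b = model
    return [a * xi + b for xi in x]
```

Once the coefficients are learned from historical CDC data, new weekly query frequencies can be plugged into `predict` immediately, which is what gives the method its 1-2 week lead over clinical reporting.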


Taher L, & Ovcharenko I (2009). Variable locus length in the human genome leads to ascertainment bias in functional inference for non-coding elements. Bioinformatics (Oxford, England), 25 (5), 578-84 PMID: 19168912

My review: This paper raises the important observation that differences in the length of genes can bias their functional classification using the Gene Ontology, and provides a simple method to correct for this inherent feature of genome architecture. A basic observation of genome biology is that genes differ widely in their size and structure within and between species. Understanding the causes and consequences of this variation in gene structure is an open challenge in genome biology. Previously, Nelson and colleagues [1] have shown, in flies and worms, that the length of intergenic regions is correlated with the regulatory complexity of genes and that genes from different Gene Ontology (GO) categories have drastically different lengths. Here, Taher and Ovcharenko confirm this observation of functionally non-random gene length in the human genome, and discuss the implications of this feature of genome organization for analyses that employ the GO for functional inference. Specifically, these authors show that random selection of noncoding DNA sequences from the human genome leads to the false inference of over- and under-representation of specific GO categories that preferentially contain longer or shorter genes, respectively. This finding has important implications for the large number of studies that employ a combination of gene expression microarrays and GO enrichment analysis, since gene expression is largely controlled by noncoding DNA. The authors provide a simple method to correct for this bias in GO analyses, and show that previous reports of the enrichment of “ultraconserved” noncoding DNA sequences in vertebrate developmental genes [2] may be a statistical artifact. (@F1000: http://f1000.com/1157594)

References: 1. Nelson CE, Hersh BM, & Carroll SB (2004). The regulatory content of intergenic DNA shapes genome architecture. Genome biology, 5 (4) PMID: 15059258

2. Bejerano G, Pheasant M, Makunin I, Stephen S, Kent WJ, Mattick JS, & Haussler D (2004). Ultraconserved elements in the human genome. Science (New York, N.Y.), 304 (5675), 1321-5 PMID: 15131266
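The bias Taher and Ovcharenko describe is easy to reproduce in simulation: random genomic positions land in genes in proportion to gene length, so GO categories dominated by long genes look spuriously enriched, and dividing hit counts by total category length removes the artifact. A toy sketch with invented category names and lengths (not the authors' correction procedure):

```python
import random

def sample_hits(genes, n_samples, seed=0):
    """genes: list of (go_category, length) pairs laid end to end. Uniform random
    positions hit a gene with probability proportional to its length, inflating
    GO categories that contain long genes."""
    rng = random.Random(seed)
    total = sum(length for _, length in genes)
    counts = {}
    for _ in range(n_samples):
        pos = rng.uniform(0, total)
        for cat, length in genes:
            if pos < length:
                counts[cat] = counts.get(cat, 0) + 1
                break
            pos -= length
    return counts

def length_corrected(counts, genes):
    """Normalize hit counts by total category length to remove the length bias."""
    cat_len = {}
    for cat, length in genes:
        cat_len[cat] = cat_len.get(cat, 0) + length
    return {cat: n / cat_len[cat] for cat, n in counts.items()}
```

In the raw counts the long-gene category dominates purely by length; after normalization the per-base-pair hit rates are equal, mirroring the correction logic advocated in the paper.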


Hindorff LA, Sethupathy P, Junkins HA, Ramos EM, Mehta JP, Collins FS, & Manolio TA (2009). Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proceedings of the National Academy of Sciences of the United States of America, 106 (23), 9362-7 PMID: 19474294

My review: This article introduces results from human genome-wide association studies (GWAS) into the realm of large-scale functional genomic data mining. These authors compile the first curated database of trait-associated single-nucleotide polymorphisms (SNPs) from GWAS (http://www.genome.gov/gwastudies/) that can be mined for general features of SNPs underlying phenotypes in humans. By analyzing 531 SNPs from 151 GWAS, the authors discover that trait-associated SNPs are predominantly in non-coding regions (43% intergenic, 45% intronic), but that non-synonymous and promoter trait-associated SNPs are enriched relative to expectations. The database is actively maintained and growing, and currently contains 3943 trait-associated SNPs from 796 publications. This important resource will facilitate data mining and integration with high-throughput functional genomics data (e.g. ChIP-seq), as well as meta-analyses, to address important questions in human genetics, such as the discovery of loci that affect multiple traits. While the interface to the GWAS catalog is rather limited, a related project (http://www.gwascentral.org/) [1] provides a much more powerful interface for searching and browsing data from the GWAS catalog. (@F1000: http://f1000.com/8408956)

References: 1. Thorisson GA, Lancaster O, Free RC, Hastings RK, Sarmah P, Dash D, Brahmachari SK, & Brookes AJ (2009). HGVbaseG2P: a central genetic association database. Nucleic acids research, 37 (Database issue) PMID: 18948288


Tamames J, & de Lorenzo V (2010). EnvMine: a text-mining system for the automatic extraction of contextual information. BMC bioinformatics, 11 PMID: 20515448

My review: This paper describes EnvMine, an innovative text-mining tool to obtain physico-chemical and geographical information about environmental genomics samples. This work represents a pioneering effort to apply text-mining technologies in the domain of ecology, providing novel methods to extract the units and variables of physico-chemical entities, as well as link the location of samples to worldwide geographic coordinates via Google Maps. Application of EnvMine to full-text articles in the environmental genomics database envDB [1] revealed very high system performance, suggesting that information extracted by EnvMine will be of use to researchers seeking meta-data about environmental samples across different domains of biology. (@F1000: http://f1000.com/3502956)

References: 1. Tamames J, Abellán JJ, Pignatelli M, Camacho A, & Moya A (2010). Environmental distribution of prokaryotic taxa. BMC microbiology, 10 PMID: 20307274