Did Finishing the Drosophila Genome Legitimize Open Access Publishing?

I’m currently reading Glyn Moody’s (2003) “Digital Code of Life: How Bioinformatics is Revolutionizing Science, Medicine, and Business” and greatly enjoying the writing as well as the whirlwind summary of the history of Bioinformatics and the (Human) Genome Project(s). Most of what Moody says that I am familiar with is quite accurate, and his scholarship is thorough, so I find his telling of the story compelling. One claim I find new and curious in this book is in his discussion of the sequencing of the Drosophila melanogaster genome, more precisely the “finishing” of this genome, and its impact on the legitimacy of Open Access publishing.

The sequencing of D. melanogaster was done as a collaboration between the Berkeley Drosophila Genome Project and Celera, as a test case to prove that whole-genome shotgun sequencing could be applied to large animal genomes. I won’t go into the details here, but it is widely accepted that the Adams et al. (2000) and Myers et al. (2000) papers in Science demonstrated the feasibility of whole-genome shotgun sequencing, and that it was a lesser-known paper by Celniker et al. (2002) in Genome Biology, reporting the “finished” D. melanogaster genome, that proved the accuracy of whole-genome shotgun sequence assembly. No controversy here.

More debatable is what Moody goes on to write about the Celniker et al. (2002) paper:

This was an important paper, then, and one that had a significance that went beyond its undoubted scientific value. For it appeared neither in Science, as the previous Drosophila papers had done, nor in Nature, the obvious alternative. Instead, it was published in Genome Biology. This describes itself as “a journal, delivered over the web.” That is, the Web is the primary medium, with the printed version offering a kind of summary of the online content in a convenient portable form. The originality of Genome Biology does not end there: all of its main research articles are available free online.

A description then follows of the history and virtues of PubMed Central and the earliest Open Access biomedical publishers BioMed Central and PLoS. Moody (emphasis mine) then returns to the issue of:

…whether a journal operating on [Open Access] principles could attract top-ranked scientists. This question was answered definitively in the affirmative with the announcement and analysis of the finished Drosophila sequence in January 2003. This key opening paper’s list of authors included not only [Craig] Venter, [Gene] Myers, and [Mark] Adams, but equally stellar representatives of the academic world of Science, such as Gerald Rubin, the boss of the fruit fly genome project, and Richard Gibbs, head of sequencing at Baylor College. Alongside this paper there were no less than nine other weighty contributions, including one on Apollo, a new tool for viewing and editing sequence annotation. For its own Drosophila extravaganza of March 2000, Science had marshalled seven papers in total. Clearly, Genome Biology had arrived, and with it a new commercial publishing model based on the latest way of showing the data.

This passage resonated with me since I was working at the BDGP at the time this special issue on the finishing of the Drosophila genome in Genome Biology was published, and was personally introduced to Open Access publishing through this event. I recall Rubin walking the hallways of building 64 on his periodic visits promoting this idea, motivating us all to work hard to get our papers together by the end of 2002 for this unique opportunity. I also remember lugging around stacks of the printed issue at the Fly meeting in Chicago in 2003, plying unsuspecting punters with a copy of a journal that most people had never heard of, and having some of my first conversations with people on Open Access as a consequence.

What Moody doesn’t capture in this telling is the fact that Rubin’s decision to publish in Genome Biology almost surely owes itself to the influence that Mike Eisen had on Rubin and others in the genomics community in Berkeley at the time. Eisen and Rubin had recently collaborated on a paper, Eisen had made inroads in Berkeley on the Open Access issue by actively recruiting signatories for the PLoS open letter the year before, and Eisen himself published his first Open Access paper in Oct 2002 in Genome Biology. So clearly the idea of publishing in Open Access journals, and in particular Genome Biology, was in the air at the time, and it may not have been as bold a step for Rubin to take as Moody implies.

Nevertheless, it is a point that may have some truth, and I think it is interesting to consider whether the long-standing open data philosophy of the Drosophila genetics community that led to the Genome Biology special issue was indeed a key turning point in the widespread success of Open Access publishing over the next decade. Surely the movement would have taken off anyway at some point. But in late 2002, when the BioMed Central journals were the only place to publish gold Open Access articles, few people had tested the waters since the launch of BMC journals in 2000. While we cannot replay the tape, Moody’s claim is plausible in my view, and it is interesting to ask whether widespread buy-in to Open Access publishing in biology might have been delayed if Rubin had not insisted that the efforts of the Berkeley Drosophila Genome Project be published under an Open Access model.

UPDATE 25 March 2012

After I tweeted this post, here is what Eisen and Moody had to say:

https://twitter.com/#!/mbeisen/status/183649695715966976

https://twitter.com/#!/caseybergman/status/183654936461049856

https://twitter.com/#!/mbeisen/status/183683170451980288

UPDATE 19 May 2012

It appears that the publication of another part of the Drosophila (meta)genome, its Wolbachia endosymbiont, played an important role in the conversion of Jonathan Eisen to supporting Open Access. Read more here.

Comments on the RCUK’s New Draft Policy on Open Access

RCUK, the umbrella agency that represents several major publicly-funded Research Councils in the UK, has recently released a draft document outlining revisions to its policy on Open Access publishing for RCUK-funded research. One of the leaders of the Open Access movement, Peter Suber, has provided strong assent for this policy and ably summarized the salient features of this document on a G+ post, with which I concur. Based on his encouragement to submit comments to RCUK directly, I’ve emailed the following two points for RCUK to consider in their revision of this policy:

From: Casey Bergman <Casey.Bergman@xxx.xx.xx>
Date: 18 March 2012 15:22:29 GMT
To: <communications@rcuk.ac.uk>
Subject: Open Access Feedback

Hello –

I write to support the draft RCUK policy on Open Access, but would like to raise two points that I see as crucial to effectively achieving the aims of libre Open Access:

1) The green OA route does not always ensure libre OA, and often green OA documents remain unavailable for text and data mining. For example, author-deposited manuscripts in (UK)PMC are not available for text mining, since they are not in the “OA subset” (see https://caseybergman.wordpress.com/2012/02/11/why-the-research-works-act-doesnt-affect-text-mining-research/). Thus, for RCUK to mandate libre OA via the green route, RCUK would need to work with repositories like (UK)PMC to make sure that green author-deposited manuscripts go into the OA subset that can be automatically downloaded for re-use.

2) Further information should be provided about the following comment: “In addition, the Research Councils are happy to work with individual institutions on how they might build an institutional Open Access fund drawing from the indirect costs on grants.” RCUK should take the lead on establishing financial models that are viable for recovering OA costs that can easily be adopted by universities. Promoting the development of University OA funds that can effectively recover costs from RCUK grants to support gold OA papers that are published after the life-time of a grant would be a major boost for publishing RCUK funded work under a libre OA model.

Yours sincerely,

Casey Bergman, Ph.D.
Faculty of Life Sciences
University of Manchester
Michael Smith Building
Oxford Road, M13 9PT
Manchester, UK

The Roberts/Ashburner Response

A previous post on this blog shared a helpful boilerplate response to editors for politely declining to review for non-Open Access journals, which I received originally from Michael Ashburner. During a quick phone chat today, Ashburner told me that he in fact inherited a version of this response originally from Nobel laureate Richard Roberts, co-discoverer of introns, lead author on the Open Letter to Science calling for a “Genbank” of the scientific literature, and long-time editor of Nucleic Acids Research, one of the first classical journals to move to a fully Open Access model. So to give credit where it is due, I’ve updated the title of the “Just Say No” post to make the attribution of this letter more clear. We owe both Roberts and Ashburner many thanks for paving the way to a better model of scientific communication and leading by example.

Why Are There So Few Efforts to Text Mine the Open Access Subset of PubMed Central?

The open access movement in scientific publishing has two broad aims: (i) to make scientific articles more broadly accessible and (ii) to permit unrestricted re-use of published scientific content. From its humble beginnings in 2001 with only two journals, PubMed Central (PMC) has grown to become the world’s largest repository of full-text open-access biomedical articles, containing nearly 2.4 million biomedical articles that can be freely downloaded by anyone around the world. Thus, while holding only ~11% of the total published biomedical literature, PMC can be viewed clearly as a major success in terms of making the biomedical literature more broadly accessible.

However, I argue that PMC has yet to catalyze similar success on the second goal of the open-access movement: unrestricted re-use of published scientific content. This point became clear to me when writing the discussions for two papers that my lab published last year. In digging around for references to cite, I was struck by how difficult it was to find examples of projects that applied text-mining tools to the entire set of open-access articles from PubMed Central. Unsure if this was a reflection of my ignorance or the actual state of the art in the field, I canvassed the biological text mining community, the bioinformatics community and two major open-access publishers for additional examples of text-mining on the entire open-access subset of PMC.

Surprisingly, I found that after a decade of existence only ~15 articles* have ever been published that have used the entire open-access subset of PMC for text-mining research. In other words, fewer than two research articles per year are being published that actually use the open-access contents of PubMed Central for large-scale data mining or service provision. I find the lack of uptake of PMC by text-mining researchers to be rather astonishing, considering it is an incredibly rich archive of the combined output of thousands of scientists worldwide.

This observation raises the question: why are there so few efforts to text mine content in PubMed Central? I don’t pretend to have the answer, but there are a number of plausible reasons that include (but are not limited to):

The Open Access subset of PMC is such a small component of the entire published literature that it is unusable.
Full-text mining research is such a difficult task that it cannot be usefully done.
The text-mining community is more focused on developing methods than applying them to the biomedical literature.
There is not an established community of users for full-text mining research.
[insert your interpretation in the comments below]

Personally, I see none of these as valid explanations of why applying text mining tools to the entirety of the PMC open access subset remains so rare. While it is a limitation that <2% of all articles in PubMed are in the PMC open-access subset, the facts contained within the introductions and discussions of this subset cover a substantially broader proportion of scientific knowledge. Big data? Not compared to other areas of bioscience like genomics. Mining even a few mammalian genomes’ worth of DNA sequence data is more technically and scientifically challenging than mining the English text of ~400,000 full-text articles. Text-miners also routinely apply their systems to MEDLINE abstracts, albeit often on a small scale, and there is a growing community of biocurators and bioinformaticians eager to consume data from full-text mining. So what is going on here?
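
As a rough sanity check, the rounded figures quoted in this post (2.4 million articles in PMC, ~11% of the literature, ~400,000 articles in the open-access subset, ~15 papers in a decade) can be compared against each other. The sketch below is only back-of-the-envelope arithmetic: it assumes that “the total published biomedical literature” and “all articles in PubMed” refer to the same denominator, and the variable names are purely illustrative.

```python
# Rounded figures quoted in the post. Treating "the total published
# biomedical literature" and "all articles in PubMed" as the same
# denominator is an assumption made only for this check.
pmc_total_articles = 2_400_000    # full-text articles in PMC
pmc_share_of_literature = 0.11    # PMC holds ~11% of the literature
oa_subset_articles = 400_000      # articles in the open-access subset
mining_papers = 15                # ~15 studies using the whole subset
years_of_pmc = 10                 # "a decade of existence"

# Size of the literature implied by the PMC figures.
literature_total = pmc_total_articles / pmc_share_of_literature

# Share of that literature covered by the open-access subset.
oa_share = oa_subset_articles / literature_total

# Average number of whole-subset text-mining papers per year.
papers_per_year = mining_papers / years_of_pmc

print(f"implied size of the literature: {literature_total:,.0f} articles")
print(f"open-access subset share: {oa_share:.2%}")
print(f"whole-subset mining papers per year: {papers_per_year:.1f}")
```

Under these assumptions the open-access subset comes out just under 2% of the literature and the publication rate below two papers per year, in line with the rounded figures used in the text.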

Perhaps it is worth drawing an analogy with another major resource that was released at roughly the same time as PMC: the human genome sequence. According to many, including those in the popular media, the promise of the human genome was oversold, perhaps to leverage financial support for this major project. Unfortunately, as Greg Petsko and Jonathan Eisen have argued, overselling the human genome project has had unintended negative consequences for the understanding, and perhaps funding, of basic research. Could the goal of reuse of open access articles likewise represent an overselling of the PMC repository? If so, then the open-access movement runs the risk of failing to deliver on one of the key planks in its platform. Failing to deliver on re-use could ultimately give funders (if no-one is using it, why should we pay?) and publishers (if no-one is using it, why should we make it open?) grounds to advocate green over gold open access, which could have a devastating impact on text-mining research, since author-deposited (green) manuscripts in PMC are off-limits for text-mining research.

I hope (and am actively working to prove) that re-use of the open access literature will not remain an unfulfilled promise. I suspect rather that we simply are in the lag phase before a period of explosive growth in full-text mining, akin to what happened in the field of genome-wide association studies after the publication of the human genome sequence. So text-miners, bioinformaticians, and computational biologists: do your part to maximize the utility of Varmus, Lipman and Brown’s vision of an arXiv for biology, and prove that the twin aims of the open access movement can be fulfilled.

* Published text mining studies using the entirety of the Open Access subset of PMC:

Annotating genes and genomes with DNA sequences extracted from biomedical articles. http://www.ncbi.nlm.nih.gov/pubmed/21325301
pubmed2ensembl: a resource for mining the biological literature on genes. http://www.ncbi.nlm.nih.gov/pubmed/21980353
Figure text extraction in biomedical literature. http://www.ncbi.nlm.nih.gov/pubmed/21249186
LINNAEUS: a species name identification system for biomedical literature. http://www.ncbi.nlm.nih.gov/pubmed/20149233
Systematic characterizations of text similarity in full text biomedical publications. http://www.ncbi.nlm.nih.gov/pubmed/20856807
Yale Image Finder (YIF): a new search engine for retrieving biomedical images. http://www.ncbi.nlm.nih.gov/pubmed/18614584
BioLit: integrating biological literature with databases. http://www.ncbi.nlm.nih.gov/pubmed/18515836
Figure mining for biomedical research. http://www.ncbi.nlm.nih.gov/pubmed/19439564
Author keywords in biomedical journal articles. http://www.ncbi.nlm.nih.gov/pubmed/21347036
UKPMC: a full text article resource for the life sciences. http://www.ncbi.nlm.nih.gov/pubmed/21062818
An exploration of mining gene expression mentions and their anatomical locations from biomedical text. http://dl.acm.org/citation.cfm?id=1869970
Extraction of data deposition statements from the literature: a method for automatically tracking research results. http://www.ncbi.nlm.nih.gov/pubmed/21998156
BioNOT: A searchable database of biomedical negated sentences. http://www.ncbi.nlm.nih.gov/pubmed/22032181
Intrinsic evaluation of text mining tools may not predict performance on realistic tasks. http://www.ncbi.nlm.nih.gov/pubmed/18229722
BioText Search Engine: beyond abstract search. http://www.ncbi.nlm.nih.gov/pubmed/17545178
Integration of Open Access Literature into the RCSB Protein Data Bank Using BioLit. http://www.ncbi.nlm.nih.gov/pubmed/20429930

UPDATE – New papers using PMC since original post

GeneView: a comprehensive semantic search engine for PubMed. http://www.ncbi.nlm.nih.gov/pubmed/22693219
BioContext: an integrated text mining system for large-scale extraction and contextualisation of biomolecular events. http://www.ncbi.nlm.nih.gov/pubmed/22711795

An Assembly of Fragments

Monthly Archives: March 2012
