The open access movement in scientific publishing has two broad aims: (i) to make scientific articles more broadly accessible and (ii) to permit unrestricted re-use of published scientific content. From its humble beginnings in 2001 with only two journals, PubMed Central (PMC) has grown to become the world’s largest repository of full-text open-access biomedical articles, containing nearly 2.4 million biomedical articles that can be freely downloaded by anyone around the world. Thus, while holding only ~11% of the total published biomedical literature, PMC can be viewed clearly as a major success in terms of making the biomedical literature more broadly accessible.
However, I argue that PMC has yet to catalyze similar success on the second goal of the open-access movement: unrestricted re-use of published scientific content. This point became clear to me when writing the discussions for two papers that my lab published last year. In digging around for references to cite, I was struck by how difficult it was to find examples of projects that applied text-mining tools to the entire set of open-access articles from PubMed Central. Unsure if this was a reflection of my ignorance or the actual state of the art in the field, I canvassed the biological text mining community, the bioinformatics community and two major open-access publishers for additional examples of text-mining on the entire open-access subset of PMC.
Surprisingly, I found that after a decade of existence only ~15 articles* have ever been published that have used the entire open-access subset of PMC for text-mining research. In other words, fewer than 2 research articles per year are being published that actually use the open-access contents of PubMed Central for large-scale data mining or service provision. I find the lack of uptake of PMC by text-mining researchers to be rather astonishing, considering it is an incredibly rich archive of the combined output of thousands of scientists worldwide.
This observation raises the question: why are there so few efforts to text mine content in PubMed Central? I don’t pretend to have the answer, but there are a number of plausible reasons that include (but are not limited to):
- The Open Access subset of PMC is such a small component of the entire published literature that it is unusable.
- Full-text mining research is such a difficult task that it cannot be usefully done.
- The text-mining community is more focused on developing methods than applying them to the biomedical literature.
- There is not an established community of users for full-text mining research.
- [insert your interpretation in the comments below]
Personally, I see none of these as valid explanations of why applying text mining tools to the entirety of the PMC open access subset remains so rare. While it is a limitation that <2% of all articles in PubMed are in the PMC open-access subset, the facts contained within the introductions and discussions of this subset cover a substantially broader proportion of scientific knowledge. Big data? Not compared to other areas of bioscience like genomics. Mining even a few mammalian genomes’ worth of DNA sequence data is more technically and scientifically challenging than mining the English text of ~400,000 full-text articles. Text-miners also routinely apply their systems to MEDLINE abstracts, albeit often on a small scale, and there is a growing community of biocurators and bioinformaticians eager to consume data from full-text mining. So what is going on here?
Perhaps it is worth drawing an analogy with another major resource that was released at roughly the same time as PMC: the human genome sequence. According to many, including those in the popular media, the promise of the human genome was oversold, perhaps to leverage financial support for this major project. Unfortunately, as Greg Petsko and Jonathan Eisen have argued, overselling the human genome project has had unintended negative consequences for the understanding, and perhaps funding, of basic research. Could the goal of reuse of open access articles likewise represent an overselling of the PMC repository? If so, then the open-access movement runs the risk of failing to deliver on one of the key planks in its platform. Failing to deliver on re-use could ultimately give funders (if no-one is using it, why should we pay) and publishers (if no-one is using it, why should we make it open) a reason to advocate green over gold open access, which could have a devastating impact on text-mining research, since author-deposited (green) manuscripts in PMC are off-limits for text-mining research.
I hope (and am actively working to prove) that re-use of the open access literature will not remain an unfulfilled promise. I suspect rather that we are simply in the lag phase before a period of explosive growth in full-text mining, akin to what happened in the field of genome-wide association studies after the publication of the human genome sequence. So text-miners, bioinformaticians, and computational biologists, do your part to maximize the utility of Varmus, Lipman and Brown’s vision of an arXiv for biology, and prove that the twin aims of the open access movement can be fulfilled.
* Published text mining studies using the entirety of the Open Access subset of PMC:
- Annotating genes and genomes with DNA sequences extracted from biomedical articles. http://www.ncbi.nlm.nih.gov/pubmed/21325301
- pubmed2ensembl: a resource for mining the biological literature on genes. http://www.ncbi.nlm.nih.gov/pubmed/21980353
- Figure text extraction in biomedical literature. http://www.ncbi.nlm.nih.gov/pubmed/21249186
- LINNAEUS: a species name identification system for biomedical literature. http://www.ncbi.nlm.nih.gov/pubmed/20149233
- Systematic Characterizations of Text Similarity in Full Text Biomedical Publications. http://www.ncbi.nlm.nih.gov/pubmed/20856807
- Yale Image Finder (YIF): a new search engine for retrieving biomedical images. http://www.ncbi.nlm.nih.gov/pubmed/18614584
- BioLit: integrating biological literature with databases. http://www.ncbi.nlm.nih.gov/pubmed/18515836
- Figure mining for biomedical research. http://www.ncbi.nlm.nih.gov/pubmed/19439564
- Author keywords in biomedical journal articles. http://www.ncbi.nlm.nih.gov/pubmed/21347036
- UKPMC: a full text article resource for the life sciences. http://www.ncbi.nlm.nih.gov/pubmed/21062818
- An exploration of mining gene expression mentions and their anatomical locations from biomedical text. http://dl.acm.org/citation.cfm?id=1869970
- Extraction of data deposition statements from the literature: a method for automatically tracking research results. http://www.ncbi.nlm.nih.gov/pubmed/21998156
- BioNOT: A searchable database of biomedical negated sentences. http://www.ncbi.nlm.nih.gov/pubmed/22032181
- Intrinsic evaluation of text mining tools may not predict performance on realistic tasks. http://www.ncbi.nlm.nih.gov/pubmed/18229722
- BioText Search Engine: beyond abstract search. http://www.ncbi.nlm.nih.gov/pubmed/17545178
- Integration of Open Access Literature into the RCSB Protein Data Bank Using BioLit. http://www.ncbi.nlm.nih.gov/pubmed/20429930
UPDATE – New papers using PMC since original post
- GeneView: a comprehensive semantic search engine for PubMed. http://www.ncbi.nlm.nih.gov/pubmed/22693219
- BioContext: an integrated text mining system for large-scale extraction and contextualisation of biomolecular events. http://www.ncbi.nlm.nih.gov/pubmed/22711795