Why Are There So Few Efforts to Text Mine the Open Access Subset of PubMed Central?

The open access movement in scientific publishing has two broad aims: (i) to make scientific articles more broadly accessible and (ii) to permit unrestricted re-use of published scientific content. From its humble beginnings in 2001 with only two journals, PubMed Central (PMC) has grown to become the world’s largest repository of full-text open-access biomedical articles, containing nearly 2.4 million biomedical articles that can be freely downloaded by anyone around the world. Thus, while holding only ~11% of the total published biomedical literature, PMC can clearly be viewed as a major success in terms of making the biomedical literature more broadly accessible.

However, I argue that PMC has yet to catalyze similar success on the second goal of the open-access movement: unrestricted re-use of published scientific content. This point became clear to me when writing the discussions for two papers that my lab published last year. In digging around for references to cite, I was struck by how difficult it was to find examples of projects that applied text-mining tools to the entire set of open-access articles from PubMed Central. Unsure if this was a reflection of my ignorance or the actual state of the art in the field, I canvassed the biological text mining community, the bioinformatics community and two major open-access publishers for additional examples of text-mining on the entire open-access subset of PMC.

Surprisingly, I found that after a decade of existence only ~15 articles* have ever been published that have used the entire open-access subset of PMC for text-mining research. In other words, fewer than 2 research articles per year are being published that actually use the open-access contents of PubMed Central for large-scale data mining or service provision. I find the lack of uptake of PMC by text-mining researchers to be rather astonishing, considering it is an incredibly rich archive of the combined output of thousands of scientists worldwide.

This observation raises the question: why are there so few efforts to text mine content in PubMed Central? I don’t pretend to have the answer, but there are a number of plausible reasons that include (but are not limited to):

  • The Open Access subset of PMC is such a small component of the entire published literature that it is unusable.
  • Full-text mining research is such a difficult task that it cannot be usefully done.
  • The text-mining community is more focused on developing methods than applying them to the biomedical literature.
  • There is not an established community of users for full-text mining research.
  • [insert your interpretation in the comments below]

Personally, I see none of these as valid explanations of why applying text-mining tools to the entirety of the PMC open access subset remains so rare. While it is true that <2% of all articles in PubMed are in the PMC open-access subset, which is a limitation, the facts contained within the introductions and discussions of this subset cover a substantially broader proportion of scientific knowledge. Big data? Not compared to other areas of bioscience like genomics. Mining even a few mammalian genomes’ worth of DNA sequence data is more technically and scientifically challenging than mining the English text of ~400,000 full-text articles. Text-miners also routinely apply their systems to MEDLINE abstracts, albeit often on a small scale, and there is a growing community of biocurators and bioinformaticians eager to consume data from full-text mining. So what is going on here?
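As a rough consistency check on the numbers quoted in this post (~2.4 million articles and ~11% of the literature for PMC as a whole, ~400,000 articles for the open-access subset), the short Python sketch below reproduces the <2% figure. The implied size of the literature is derived from those approximations and is an assumption of this sketch, not an official count; the variable and function names are purely illustrative.

```python
# All inputs are the approximate figures quoted in the post.
PMC_TOTAL_ARTICLES = 2_400_000    # full-text articles in PMC as a whole
PMC_SHARE_OF_LITERATURE = 0.11    # PMC as a fraction of the published literature
PMC_OA_SUBSET_ARTICLES = 400_000  # articles in the open-access subset


def implied_literature_size(pmc_total, pmc_share):
    """Total literature size implied by PMC's article count and its share."""
    return pmc_total / pmc_share


def subset_share(subset_articles, literature_size):
    """Fraction of the literature covered by a subset of articles."""
    return subset_articles / literature_size


literature = implied_literature_size(PMC_TOTAL_ARTICLES, PMC_SHARE_OF_LITERATURE)
oa_share = subset_share(PMC_OA_SUBSET_ARTICLES, literature)

print(f"Implied literature size: ~{literature / 1e6:.1f} million articles")
print(f"Open-access subset share: ~{oa_share:.1%}")
```

Under these assumptions the open-access subset comes out at a little under 2% of the literature, in line with the figure above.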

Perhaps it is worth drawing an analogy with another major resource that was released at roughly the same time as PMC: the human genome sequence. According to many, including those in the popular media, the promise of the human genome was oversold, perhaps to leverage financial support for this major project. Unfortunately, as Greg Petsko and Jonathan Eisen have argued, overselling the human genome project has had unintended negative consequences for the understanding, and perhaps funding, of basic research. Could the goal of reuse of open access articles likewise represent an overselling of the PMC repository? If so, then the open-access movement runs the risk of failing to deliver on one of the key planks in its platform. Failing to deliver on re-use could ultimately give funders (if no-one is using it, why should we pay?) and publishers (if no-one is using it, why should we make it open?) grounds to advocate green over gold open access, which could have a devastating impact on text-mining research, since author-deposited (green) manuscripts in PMC are off-limits for text-mining research.

I hope (and am actively working to prove) that re-use of the open access literature will not remain an unfulfilled promise. Rather, I suspect that we are simply in the lag phase before a period of explosive growth in full-text mining, akin to what happened in the field of genome-wide association studies after the publication of the human genome sequence. So text-miners, bioinformaticians, and computational biologists: do your part to maximize the utility of Varmus, Lipman and Brown’s vision of an arXiv for biology, and prove that the twin aims of the open access movement can be fulfilled.

* Published text mining studies using the entirety of the Open Access subset of PMC:

UPDATE – New papers using PMC since original post


16 thoughts on “Why Are There So Few Efforts to Text Mine the Open Access Subset of PubMed Central?”

  1. Why Are There So Few Efforts to Text Mine the Open Access Subset of PubMed Central? « Another Word For It

  2. I appreciate many of the points here, though I don’t understand why the analysis focuses only on applications that use the entirety of the PMC open access subset.

    To me this definition isn’t synonymous with “articles .. published that actually use the open-access contents of PubMed Central for large-scale data mining or service provision.”

    • I agree that this is a bit arbitrary, especially since historically what is “all” of PMC OA becomes outdated relative to a time-point in the future. And yes, there are situations where it is more relevant to do some query up-front to restrict full-text mining efforts. I guess I was trying to make the point that corpus-wide text-mining efforts are rare, and what this implies for the state of the art. Maybe we should try to put together another list of studies that use a subset of PMC OA. I wonder if this would be substantially more extensive or equally limited?

      • Yes, I think it would be valuable for sure. Different estimate, and seeing all the different uses would be interesting, inspiring, and show off more gaps.

        Need all these uses before concluding how useful PMC OA has been for text mining thus far.

        > corpus-wide text-mining efforts are rare, and what this implies for the state of the art

        ok, I’m with you on this. Good point.

  3. Casey,

    I don’t see a way to respond to the Nature Editorial online, but I’ll combine your blog response and the Editorial response here.

    The Editorial includes the statement “The promise is yet to be backed up with concrete examples of scientific success”. That’s only true in part. Our research group has been generating potential discovery over the past five years. I addressed the issue in detail in an ARIST chapter [1], and more recently in a comprehensive update of our discovery approach [2].

    However, the critics quoted in the Editorial may have a case. Useful text mining results have both a quantity and quality component. Typically, the computer and its built-in algorithmic rules do a reasonable job on the quantity component, but not on quality. For the latter, human judgment is essential. Much of the focus has been concentrated solely on the algorithms.
    For full text, I have seen very little published in the literature-related discovery area. I did some full text mining when I worked at MITRE. I addressed two aspects: information retrieval and information extraction. I published the information retrieval results in Journal of Information Science in 2010 [3]. If queries are generated properly, full text information retrieval can provide orders of magnitude increase in relevant articles retrieved, depending on the category of interest. In fact, we have found that the types of queries required to retrieve relevant documents from searching the full text are the same types of queries required to pinpoint concepts/papers with high discovery potential. For intel work, many of the categories of interest (e.g., suppliers, hardware, software specifics, etc) can only be found in the full text.

    For the information extraction, we examined a few approaches. The difficulty was extracting some of the important rare events. Operationally, important rare events (low frequency phenomena) had to be related to high frequency phenomena. Standard text mining procedures, such as in scientometrics, are based on statistical analysis of high frequency phenomena and are not really applicable to the low frequency extraction problem. I found a way to relate the important low frequency events to the high frequency and extract these rare events, but cannot comment on the downstream etiology of the results. The bottom line is that much useful information can be obtained from full text mining, if the right algorithms are used, and some human judgment is applied as well.


    [1]. Kostoff, R.N., Block, J.A., Solka, J.A., Briggs, M.B., Rushenberg, R.L., Stump, J.A., Johnson, D., Wyatt, J.R. “Literature-Related Discovery”. ARIST. 43. 243-285. 2008.
    [2]. Kostoff RN. Literature-related discovery and innovation — update. Technological Forecasting and Social Change (2012). doi:10.1016/j.techfore.2012.02.002. Also, see appended below.
    [3]. Kostoff RN. “Expanded information retrieval using full text searching”. Journal of Information Science. 36:1. 104-113. 2010.

    When reference [2] went online, I distributed an announcement letter to a few colleagues. I reproduce the letter below; it summarizes the contents and provides access.

    A recent publication updates the Literature-Related Discovery and Innovation (LRDI) technique, which identifies prevention and remediation measures for chronic and infectious diseases [1]. The information technology-based LRDI technique may be of interest to researchers in text mining, bioinformatics, and literature-based discovery, and the potential medical applications may be of special interest to researchers/clinicians focused on preventing, reducing, halting, or reversing progression of chronic and infectious diseases. To illustrate the potential power of LRDI, the article emphasizes the relationship between the results of our 2007 LRDI multiple sclerosis (MS) study and a recent demonstration of MS reversal.

    The findings in the update [1] include:
    * the role of comprehensive and precise information retrieval in discovery and innovation
    * the value of interdisciplinary research in discovery and innovation
    * the critical role of hormesis and synergy in preventative measures and accelerated healing
    * the critical need for cause removal in reversal of chronic disease
    * the severe under-reporting of critical variables in the clinical trials literature
    * the severe under-utilization of the broad biomedical literature for reversing chronic disease
    * concerns about the credibility and integrity of the medical literature in areas that concern commercial and government/political sensitivities

    Dr. Ronald N. Kostoff

    [1]. Kostoff RN. Literature-Related Discovery and Innovation – Update. Technological Forecasting and Social Change (2012). doi:10.1016/j.techfore.2012.02.002.
    *Pre-print full text version can be accessed at (http://stip.gatech.edu/wp-content/uploads/2012/02/LRD-UPDATE_TFSC_7_REV.pdf).
    *Journal posting access (http://dx.doi.org/10.1016/j.techfore.2012.02.002).

  4. Rothamsted, Council, ELC and the bioeconomy | Professor Douglas Kell's blog

  5. Casey,

    I’ve read the latest postings above, and they all seem to point in the same direction. Very few full text studies are being done, despite the potential benefits. That is my observation as well. Usually, in paradoxical situations like this, one has to examine incentives for deeper insights.

    For an example from a completely different area of study, most people believe that interdisciplinary research has myriad benefits, but publications reflecting real interdisciplinary research are relatively sparse. I examined this paradox in a Bioscience paper in 2002, and showed that, despite the flowery words about the benefits of interdisciplinary research, the reality was mainly disincentives to pursue this type of research. There is far more ‘bang for the buck’ in publishing traditional focused research in a discipline, where every slight change in a parameter could result in another publication. While interdisciplinary research could result in great benefits to science not available through more narrowly focused research, the time and effort required to understand the different disciplines and their inter-relationships does not, in most cases, pay off in terms of the metrics used to evaluate research productivity.

    As another example, climate change seems to be bearing down upon us, yet essentially nothing is being done to counter it. I’ve examined the motivations of all the major stakeholders related to climate change, and all are comfortable with the status quo, albeit for different reasons.

    From the few studies I’ve done in full text mining, far more and richer information is possible than mining titles or Abstracts. However, mining full text is intrinsically more difficult than mining Abstracts, and as in the examples above, I’m not sure it provides more ‘bang for the buck’ of interest to most researchers. In other words, the incentives for going to full text may not be there.

    However, I don’t buy the arguments of limited coverage as a valid reason for lack of studies. Many research studies could be categorized as proof-of-principle demonstration, and for that only limited coverage databases are required. Some of the original studies with Textpresso demonstrate that.

    The best way to move full text mining forward is to use the limited full text databases available, compare full text results with Abstract only results, and show the benefits (and additional costs as well). If these studies could show that a benefit-cost advantage exists for full text, full text mining would be well on its way to gaining acceptance. Our limited full text studies in 2009, only part of which were published, convinced me the benefits far outweighed the costs, but that was only one data point. Far more is required to convince a wider public.

    As a postscript, there is another advantage of full text that I haven’t seen mentioned elsewhere. Most publication information retrieval is based on text. Queries tend to be words and word combinations. One can then use these retrievals to explore citation networks and find additional relevant articles outside the text query terms used. But, full text especially contains much more than words/phrases. There are symbols and graphics of all types, like equations and curves. In theory, at least, these symbols could be used as search terms. One might have an interesting curve, and want to identify such curves in other literatures and how they were interpreted. Since curves typically are only presented in full text, this unexplored area has the potential of great payoff for full text mining.

  6. Launch of the PLOS Text Mining Collection | I wish you'd made me angry earlier

  7. Small update wrt percentages, as far as I’m currently looking at it. PubMed contains 25M articles, PMC has 3.6M articles (open, full access), of those 3.6M 1.04M can be downloaded in different formats using http://ftp.ncbi.nlm.nih.gov/pub/pmc
    I’d say it would be easier to download them all using FTP. Currently trying to download all (i.e. 25M) abstracts, and all entire (i.e. 3.6M) full-text articles… We’ll see what happens next.
    I expect one of the reasons (although a lot may have happened in the 3 years between when this page was added and my comment) is that researchers often stay on their “island”… So we need more multidisciplinary nerds – I volunteer :-)
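
The last comment above gives raw counts but not the percentages themselves. As a minimal sketch, assuming the commenter's approximate figures (25M PubMed records, 3.6M PMC articles, 1.04M downloadable articles), they can be worked out as follows; the names used are illustrative only.

```python
# Approximate counts as given in the comment above.
PUBMED_RECORDS = 25_000_000
PMC_ARTICLES = 3_600_000
PMC_DOWNLOADABLE = 1_040_000  # articles available for bulk download


def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


pmc_of_pubmed = percent(PMC_ARTICLES, PUBMED_RECORDS)
downloadable_of_pmc = percent(PMC_DOWNLOADABLE, PMC_ARTICLES)
downloadable_of_pubmed = percent(PMC_DOWNLOADABLE, PUBMED_RECORDS)

print(f"PMC as a share of PubMed:           {pmc_of_pubmed:.1f}%")
print(f"Downloadable as a share of PMC:     {downloadable_of_pmc:.1f}%")
print(f"Downloadable as a share of PubMed:  {downloadable_of_pubmed:.1f}%")
```

On these figures, roughly 14% of PubMed is in PMC, but only about 4% of PubMed is available for bulk download.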
