Why the Research Works Act Doesn’t Affect Text-mining Research

As the central digital repository for life science publications, PubMed Central (PMC) is one of the most significant resources for making the Open Access movement a tangible reality for researchers and citizens around the world. Articles in PMC are deposited through two routes: either automatically by journals that participate in the PMC system, or directly by authors for journals that do not. Author deposition of peer-reviewed manuscripts in PMC is mandated by funders in order to make the results of publicly- or charity-funded research maximally accessible, and has led to over 180,000 articles being made free (gratis) to download that would otherwise be locked behind closed-access paywalls. Justifiably, there has been outrage over recent legislation (the Research Works Act, RWA) that would repeal the NIH mandate in the USA and thereby prevent important research from being freely available.

However, from a text-miner’s perspective, author-deposited manuscripts in PMC are closed access: while they can be downloaded and read individually, virtually none (<200) are available from PMC’s Open Access subset, which includes all articles that are free (libre) to download in bulk and text/data mine. The same is true of ~99% of the author-deposited manuscripts from the journal Nature, despite a clear statement from 2009 entitled “Nature Publishing Group allows data- and text-mining on self-archived manuscripts”. In short, funder mandates make manuscripts public but not open, and thus whether the RWA is passed or not is actually moot from a text/data-mining perspective.

Why is this important? The simple reason is that there are currently only ~400,000 articles in the PMC Open Access subset, and therefore author-deposited manuscripts are only two-fold less abundant than all articles currently available for text/data-mining. Thus what could be a potentially rich source of data for large-scale information extraction remains locked away from programmatic analysis. This is especially tragic considering the fact that at the point of manuscript acceptance, publishers have invested little-to-nothing into the publication process and their claim to copyright is most tenuous.
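To make the comparison above concrete, here is a minimal back-of-the-envelope sketch in Python. It uses only the approximate figures quoted in this post (over 180,000 author manuscripts, ~400,000 articles in the Open Access subset, <200 author manuscripts already in the subset); the variable and function names are purely illustrative, and the counts are rough snapshots rather than authoritative PMC statistics.

```python
# Approximate figures quoted in the post; not live PMC statistics.
author_manuscripts = 180_000  # author-deposited manuscripts in PMC ("over 180,000")
oa_subset = 400_000           # articles in the PMC Open Access subset ("~400,000")
author_in_oa = 200            # upper bound on author manuscripts already in the subset ("<200")


def fold_difference(larger, smaller):
    """How many times larger one collection is than the other."""
    return larger / smaller


def potential_growth(current, added):
    """Fractional growth of a collection if `added` items were included."""
    return added / current


fold = fold_difference(oa_subset, author_manuscripts)
growth = potential_growth(oa_subset, author_manuscripts - author_in_oa)
share = author_in_oa / author_manuscripts

print(f"OA subset is {fold:.1f}-fold larger than the author-manuscript collection")
print(f"Opening the remaining author manuscripts would grow the minable corpus by {growth:.0%}")
print(f"At most {share:.2%} of author manuscripts are currently minable in bulk")
```

On these numbers, opening up the author manuscripts would enlarge the text-minable corpus by nearly half, while only about a tenth of a percent of them are available in bulk today.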

So instead of discussing whether we should support the status quo of weak funder mandates by working to block the RWA or expand NIH-like mandates (e.g. as under the Federal Research Public Access Act, FRPAA), the real discussion that needs to be had is how to make funder mandates stronger, so that they insist (at a minimum) that author-deposited manuscripts be available for text/data-mining research. More of the same, no matter how much, only takes us half the distance towards the ultimate goals of the Open Access movement, and doesn’t permit the crucial text/data-mining research that is needed to make sense of the deluge of information in the scientific literature.

Credits: Max Haussler for making me aware of the lack of author manuscripts in PMC a few years back, and Heather Piwowar for recently jump-starting the key conversation on how to push for improved text/data mining rights in future funder mandates.

8 thoughts on “Why the Research Works Act Doesn’t Affect Text-mining Research”

  1. Thanks for flagging this. We deposit >1000 manuscripts a year on behalf of authors in PMC (in addition to those which authors deposit themselves), and have been doing so since mid-2008, so I agree there should be more than the 200 your search found in this dataset (which all seem to be from NPG journals as far as I can see).

    We’ll look into this with PMC and get back to you with an update. But please be assured there is no change to NPG’s stated policy.

    Grace Baynes
    Nature Publishing Group

    • Hi Grace –

      This is excellent news. NPG has a very progressive policy on text/data-mining of author manuscripts, so it is a shame that there are technical barriers to actually using the data. Any progress towards making all NPG author-deposited manuscripts available via the PMC OA subset would establish an important precedent, and be welcomed by the text/data-mining community. Many thanks for looking into this issue.


  2. The Open Access Irony Awards: Naming and shaming « O'Really?

  3. Why Are There So Few Efforts to Text Mine the Open Access Subset of PubMed Central? « I wish you'd made me angry earlier

  4. Comments on the RCUK’s New Draft Policy on Open Access « I wish you'd made me angry earlier

  5. “while they can be downloaded and read individually, virtually none (<200) are available from the PMC’s Open Access subset that includes all articles that are free (libre) to download in bulk and text/data mine.”

    Why is this the case? Could there be a technical fix?

    • Sadly, no: this is not a technical issue, but a legal/sociological problem. Since there are complex per-journal copyright issues for self-archived manuscripts, PMC has taken the conservative approach of assuming that author-deposited manuscripts are not OA. For this to change, each journal would have to ask PMC to switch its author-deposited manuscripts into the OA subset.

      Looking at the numbers again today, it appears that 400+ author-deposited manuscripts are now in the OA subset, all of which are from Nature Publishing Group. So it looks like NPG has been making some inroads into fixing this problem.
