As the central digital repository for life science publications, PubMed Central (PMC) is one of the most significant resources for making the Open Access movement a tangible reality for researchers and citizens around the world. Articles in PMC are deposited through two routes: either automatically by journals that participate in the PMC system, or directly by authors for journals that do not. Author deposition of peer-reviewed manuscripts in PMC is mandated by funders in order to make the results of publicly or charity-funded research maximally accessible, and has led to over 180,000 articles being made free (gratis) to download that would otherwise be locked behind closed-access paywalls. Justifiably, there has been outrage over recent legislation (the Research Works Act, RWA) that would repeal the NIH mandate in the USA and thereby prevent important research from being freely available.
However, from a text-miner’s perspective, author-deposited manuscripts in PMC are closed access: while they can be downloaded and read individually, virtually none (<200) are available from PMC’s Open Access subset, which includes all articles that are free (libre) to download in bulk and text/data mine. The manuscripts missing from the subset include ~99% of the author-deposited manuscripts from the journal Nature, despite a clear statement from 2009 entitled “Nature Publishing Group allows data- and text-mining on self-archived manuscripts”. In short, funder mandates make manuscripts public but not open, and thus whether the RWA is passed or not is actually moot from a text/data-mining perspective.
Why is this important? The simple reason is that there are currently only ~400,000 articles in the PMC Open Access subset, and therefore author-deposited manuscripts (over 180,000) are only about two-fold less abundant than all articles currently available for text/data-mining. Thus what could be a rich source of data for large-scale information extraction remains locked away from programmatic analysis. This is especially tragic considering that, at the point of manuscript acceptance, publishers have invested little to nothing in the publication process and their claim to copyright is at its most tenuous.
So instead of discussing whether we should support the status quo of weak funder mandates by working to block the RWA or expand NIH-like mandates (e.g. as under the Federal Research Public Access Act, FRPAA), the real discussion that needs to be had is how to make funder mandates stronger, so that they insist (at a minimum) that author-deposited manuscripts be available for text/data-mining research. More of the same, no matter how much, only takes us half the distance towards the ultimate goals of the Open Access movement, and doesn’t permit the crucial text/data-mining research that is needed to make sense of the deluge of information in the scientific literature.
Credits: Max Haussler for making me aware of the lack of author manuscripts in PMC a few years back, and Heather Piwowar for recently jump-starting the key conversation on how to push for improved text/data mining rights in future funder mandates.