Archive for the 'text mining' Category

Launch of the PLOS Text Mining Collection

Just a quick post to announce that the PLOS Text Mining Collection is now live!

This PLOS Collection arose out of a twitter conversation with Theo Bloom last year, and has come together through the hard work of the authors of the papers in the Collection, the PLOS Collections team (in particular Sam Moore and Jennifer Horsely), and my co-organizers Larry Hunter and Andrey Rzhetsky. Many thanks to all for seeing this effort to completion.

Because of the large body of work in the area of text mining published in PLOS, we struggled with how best to present all these papers in the collection without diluting the experience for the reader. In the end, we decided only to highlight new work from the last two years and major reviews/tutorials at the time of launch. However, as this is a living collection, new articles will be included in the future, and the aim is to include previously published work as well. We hope to see many more papers in the area of text mining published in the PLOS family of journals in the future.

An overview of the PLOS Text Mining Collection is below (cross-posted at the PLOS EveryONE blog) and a commentary on the Collection, entitled “A mine of information – the PLOS Text Mining Collection”, is available at the Official PLOS Blog.

Background to the PLOS Text Mining Collection

Text Mining is an interdisciplinary field combining techniques from linguistics, computer science and statistics to build tools that can efficiently retrieve and extract information from digital text. Over the last few decades, there has been increasing interest in text mining research because of the potential commercial and academic benefits this technology might enable. However, as with the promises of many new technologies, the benefits of text mining are still not clear to most academic researchers.

This situation is now poised to change for several reasons. First, the rate of growth of the scientific literature has now outstripped the ability of individuals to keep pace with new publications, even in a restricted field of study. Second, text-mining tools have steadily increased in accuracy and sophistication to the point where they are now suitable for widespread application. Finally, the rapid increase in availability of digital text in an Open Access format now permits text-mining tools to be applied more freely than ever before.

To acknowledge these changes and the growing body of work in the area of text mining research, today PLOS launches the Text Mining Collection, a compendium of major reviews and recent highlights published in the PLOS family of journals on the topic of text mining. Given that PLOS is one of the major publishers of the Open Access scientific literature, it is perhaps no coincidence that research in text mining in PLOS journals is flourishing. As noted above, the widespread application and societal benefits of text mining are most easily achieved under an Open Access model of publishing, where the barriers to obtaining published articles are minimized and the ability to remix and redistribute data extracted from text is explicitly permitted. Furthermore, PLOS is one of the few publishers that actively promotes text mining research by providing an open Application Programming Interface to mine its journal content.

Text Mining in PLOS

Since virtually the beginning of its history [1], PLOS has actively promoted the field of text mining by publishing reviews, opinions, tutorials and dozens of primary research articles in this area in PLOS Biology, PLOS Computational Biology and, increasingly, PLOS ONE. Because of the large number of text mining papers in PLOS journals, we are only able to highlight a subset of these works in the first instance of the PLOS Text Mining Collection. These include major reviews and tutorials published over the last decade [1][2][3][4][5][6], plus a selection of research papers from the last two years [7][8][9][10][11][12][13][14][15][16][17][18][19] and three new papers arising from the call for papers for this collection [20][21][22].
The research papers included in the collection at launch provide important overviews of the field and reflect many exciting contemporary areas of research in text mining, such as:

  • methods to extract textual information from figures [7];
  • methods to cluster [8] and navigate [15] the burgeoning biomedical literature;
  • integration of text-mining tools into bioinformatics workflow systems [9];
  • use of text-mined data in the construction of biological networks [10];
  • application of text-mining tools to non-traditional textual sources such as electronic patient records [11] and social media [12];
  • generating links between the biomedical literature and genomic databases [13];
  • application of text-mining approaches in new areas such as the Environmental Sciences [14] and Humanities [16][17];
  • named entity recognition [18];
  • assisting the development of ontologies [19];
  • extraction of biomolecular interactions and events [20][21]; and
  • assisting database curation [22].

Looking Forward

As this is a living collection, it is worth discussing two issues we hope to see addressed in articles that are added to the PLOS text mining collection in the future: scaling up and opening up. While application of text mining tools to abstracts of all biomedical papers in the MEDLINE database is increasingly common, there have been remarkably few efforts that have applied text mining to the entirety of the full text articles in a given domain, even in the biomedical sciences [4][23]. Therefore, we hope to see more text mining applications scaled up to use the full text of all Open Access articles. Scaling up will not only maximize the utility of text-mining technologies and their uptake by end users, but also demonstrate that demand for access to full text articles exists in the text mining and wider academic communities.

Likewise, we hope to see more text-mining software systems made freely or openly available in the future. As an example of the state of affairs in the field, only 25% of the research articles highlighted in the PLOS text mining collection at launch provide source code or executable software of any kind [13][16][19][21]. The lack of availability of software or source code accompanying published research articles is, of course, not unique to the field of text mining. It is a general problem limiting progress and reproducibility in many fields of science, which authors, reviewers and editors have a duty to address. Making release of open source software the rule, rather than the exception, should further catalyze advances in text mining, as it has in other fields of computational research that have made extremely rapid progress in the last decades (such as genome bioinformatics).

By opening up the code base in text mining research, and deploying text-mining tools at scale on the rapidly growing corpus of full-text Open Access articles, we are confident this powerful technology will make good on its promise to catalyze scholarly endeavors in the digital age.

References

1. Dickman S (2003) Tough mining: the challenges of searching the scientific literature. PLoS Biol 1: e48. doi:10.1371/journal.pbio.0000048.
2. Rebholz-Schuhmann D, Kirsch H, Couto F (2005) Facts from Text—Is Text Mining Ready to Deliver? PLoS Biol 3: e65. doi:10.1371/journal.pbio.0030065.
3. Cohen B, Hunter L (2008) Getting started in text mining. PLoS Comput Biol 4: e20. doi:10.1371/journal.pcbi.0040020.
4. Bourne PE, Fink JL, Gerstein M (2008) Open access: taking full advantage of the content. PLoS Comput Biol 4: e1000037. doi:10.1371/journal.pcbi.1000037.
5. Rzhetsky A, Seringhaus M, Gerstein M (2009) Getting Started in Text Mining: Part Two. PLoS Comput Biol 5: e1000411. doi:10.1371/journal.pcbi.1000411.
6. Rodriguez-Esteban R (2009) Biomedical Text Mining and Its Applications. PLoS Comput Biol 5: e1000597. doi:10.1371/journal.pcbi.1000597.
7. Kim D, Yu H (2011) Figure text extraction in biomedical literature. PLoS ONE 6: e15338. doi:10.1371/journal.pone.0015338.
8. Boyack K, Newman D, Duhon R, Klavans R, Patek M, et al. (2011) Clustering More than Two Million Biomedical Publications: Comparing the Accuracies of Nine Text-Based Similarity Approaches. PLoS ONE 6: e18029. doi:10.1371/journal.pone.0018029.
9. Kolluru B, Hawizy L, Murray-Rust P, Tsujii J, Ananiadou S (2011) Using workflows to explore and optimise named entity recognition for chemistry. PLoS ONE 6: e20181. doi:10.1371/journal.pone.0020181.
10. Hayasaka S, Hugenschmidt C, Laurienti P (2011) A network of genes, genetic disorders, and brain areas. PLoS ONE 6: e20907. doi:10.1371/journal.pone.0020907.
11. Roque F, Jensen P, Schmock H, Dalgaard M, Andreatta M, et al. (2011) Using electronic patient records to discover disease correlations and stratify patient cohorts. PLoS Comput Biol 7: e1002141. doi:10.1371/journal.pcbi.1002141.
12. Salathé M, Khandelwal S (2011) Assessing Vaccination Sentiments with Online Social Media: Implications for Infectious Disease Dynamics and Control. PLoS Comput Biol 7: e1002199. doi:10.1371/journal.pcbi.1002199.
13. Baran J, Gerner M, Haeussler M, Nenadic G, Bergman C (2011) pubmed2ensembl: a resource for mining the biological literature on genes. PLoS ONE 6: e24716. doi:10.1371/journal.pone.0024716.
14. Fisher R, Knowlton N, Brainard R, Caley J (2011) Differences among major taxa in the extent of ecological knowledge across four major ecosystems. PLoS ONE 6: e26556. doi:10.1371/journal.pone.0026556.
15. Hossain S, Gresock J, Edmonds Y, Helm R, Potts M, et al. (2012) Connecting the dots between PubMed abstracts. PLoS ONE 7: e29509. doi:10.1371/journal.pone.0029509.
16. Ebrahimpour M, Putniņš TJ, Berryman MJ, Allison A, Ng BW-H, et al. (2013) Automated authorship attribution using advanced signal classification techniques. PLoS ONE 8: e54998. doi:10.1371/journal.pone.0054998.
17. Acerbi A, Lampos V, Garnett P, Bentley RA (2013) The Expression of Emotions in 20th Century Books. PLoS ONE 8: e59030. doi:10.1371/journal.pone.0059030.
18. Groza T, Hunter J, Zankl A (2013) Mining Skeletal Phenotype Descriptions from Scientific Literature. PLoS ONE 8: e55656. doi:10.1371/journal.pone.0055656.
19. Seltmann KC, Pénzes Z, Yoder MJ, Bertone MA, Deans AR (2013) Utilizing Descriptive Statements from the Biodiversity Heritage Library to Expand the Hymenoptera Anatomy Ontology. PLoS ONE 8: e55674. doi:10.1371/journal.pone.0055674.
20. Van Landeghem S, Björne J, Wei C-H, Hakala K, Pyysalo S, et al. (2013) Large-Scale Event Extraction from Literature with Multi-Level Gene Normalization. PLoS ONE 8: e55814. doi:10.1371/journal.pone.0055814.
21. Liu H, Hunter L, Keselj V, Verspoor K (2013) Approximate Subgraph Matching-based Literature Mining for Biomedical Events and Relations. PLoS ONE 8: e60954. doi:10.1371/journal.pone.0060954.
22. Davis A, Wiegers T, Johnson R, Lay J, Lennon-Hopkins K, et al. (2013) Text mining effectively scores and ranks the literature for improving chemical-gene-disease curation at the Comparative Toxicogenomics Database. PLoS ONE 8: e58201. doi:10.1371/journal.pone.0058201.
23. Bergman CM (2012) Why Are There So Few Efforts to Text Mine the Open Access Subset of PubMed Central? https://caseybergman.wordpress.com/2012/03/02/why-are-there-so-few-efforts-to-text-mine-the-open-access-subset-of-pubmed-central/.

Where Do Bioinformaticians Host Their Code?

A while back my interest was piqued by a discussion on BioStar about “Where would you host your open source code repository today?”, which got me thinking about the relative merits of the different sites for hosting bioinformatics software. I am not an evangelist for any particular version control system or hosting site, and I leave it to readers to have a look into these systems themselves or at the BioStar thread for more on the relative merits of major hosting services, such as Sourceforge, Google Code, github and bitbucket. My aim here is not to advocate any particular system (although as a lab head I have certain predilections*), but to answer the straightforward empirical question: where do bioinformaticians host their code?

To do this, I’ve queried PubMed for keywords in the URLs of the four major hosting services listed above to get estimates of their uptake in biomedical publications. This simple analysis clearly has some caveats, including the fact that many publications link to hosting services in sections of the paper outside the abstract, and that many bioinformaticians (frustratingly) release code via institutional or personal webpages. Furthermore, the various hosting services arose at different times in history, so it is also important to interpret these data in a temporal context. These (and other) caveats aside, the following provides an overview of how the bioinformatics community votes with their feet in terms of hosting their code on the major repository systems…
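For readers who want to try a similar count themselves, here is a minimal Python sketch using the NCBI E-utilities ESearch endpoint. The exact keywords and field restrictions used for this post are not recorded here, so the search terms, the restriction to title/abstract ([tiab]) and the per-year date filter ([dp]) are all assumptions, and the helper names are purely illustrative.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Hypothetical search keywords: the terms actually used for the post are not given.
HOSTS = {
    "Sourceforge": "sourceforge.net",
    "Google Code": "code.google.com",
    "github": "github.com",
    "bitbucket": "bitbucket.org",
}

def build_query(keyword, year):
    """PubMed query: keyword in title/abstract, limited to one publication year."""
    return f'"{keyword}"[tiab] AND {year}[dp]'

def esearch_url(term):
    """URL for an ESearch request; retmax=0 asks for the hit count without IDs."""
    params = {"db": "pubmed", "term": term, "retmax": 0, "retmode": "json"}
    return f"{ESEARCH}?{urlencode(params)}"

def parse_count(payload):
    """Extract the hit count from the body of an ESearch JSON response."""
    return int(json.loads(payload)["esearchresult"]["count"])

def count_hits(keyword, year):
    """Run one query against PubMed (requires network access)."""
    with urlopen(esearch_url(build_query(keyword, year))) as response:
        return parse_count(response.read().decode("utf-8"))
```

Calling count_hits(HOSTS["github"], 2012) would then return one cell of a year-by-repository table. Counts obtained this way will drift over time as PubMed is updated, so they should not be expected to match the numbers reported here exactly.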

First of all, the bad news: of the many thousands of articles published in the field of bioinformatics, as of Dec 31 2012 just under 700 papers (n=676) have easily discoverable code linked to a major repository in their abstract. The totals for each repository system are: 446 on Sourceforge, 152 on Google Code, 78 on github and only 5 on bitbucket. So, by far, the majority of authors have chosen not to host their code on a major repository. But for the minority of authors who have chosen to release their code via a stable repository system, most use Sourceforge (which is the oldest and most established source code repository) and effectively nobody is using bitbucket.

The first paper to link published code to a major repository system was only a decade ago in 2002, and a breakdown of the growth in code hosting since then looks like this:

Year   Sourceforge  Google Code  github
2002             4            0       0
2003             3            0       0
2004            10            0       0
2005            21            1       0
2006            24            0       0
2007            30            1       0
2008            30           10       0
2009            48           10       0
2010            69           21       8
2011            94           46      18
2012           113           63      52
Total          446          152      78

Trends in bioinformatics code repository usage 2002-2012.

A few things are clear from these results: 1) there is an upward trend in biomedical researchers hosting their code on major repository sites (the apparent downturn in 2012 is because data for this year is incomplete), 2) Sourceforge has clearly been the dominant player in the biomedical code repository game to date, but 3) the current growth rate of github appears to be outstripping both Sourceforge and Google Code. Furthermore, it appears that github is not experiencing any lag in uptake, as was observed in the 2002-2004 period for Sourceforge and 2006-2009 period for Google Code. It is good to see that new players in the hosting market are being accepted at a quicker rate than they were a decade ago.
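To put rough numbers on the growth-rate comparison, the short Python sketch below recomputes the column totals and the 2011 to 2012 fold change for each repository from the table. Using a single year-on-year ratio as the measure of "growth rate" is my simplification for illustration, not a formal trend analysis.

```python
# Papers per year linking to each repository, copied from the table in this post.
COUNTS = {
    "Sourceforge": {2002: 4, 2003: 3, 2004: 10, 2005: 21, 2006: 24, 2007: 30,
                    2008: 30, 2009: 48, 2010: 69, 2011: 94, 2012: 113},
    "Google Code": {2005: 1, 2006: 0, 2007: 1, 2008: 10, 2009: 10,
                    2010: 21, 2011: 46, 2012: 63},
    "github": {2010: 8, 2011: 18, 2012: 52},
}

def growth(host, year):
    """Fold change in paper count relative to the previous year."""
    return COUNTS[host][year] / COUNTS[host][year - 1]

for host in COUNTS:
    total = sum(COUNTS[host].values())
    print(f"{host:12s} total={total:3d}  2011->2012 growth={growth(host, 2012):.2f}x")

# Sourceforge  total=446  2011->2012 growth=1.20x
# Google Code  total=152  2011->2012 growth=1.37x
# github       total= 78  2011->2012 growth=2.89x
```

On this crude measure github nearly tripled in a year while the two older services grew by 20-40%, although github is starting from a much smaller base.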

Hopefully the upward trend for bioinformaticians to release their code via a major code hosting service will continue (keep up the good work, brothers and sisters!), and this will ultimately create a snowball effect such that it is no longer acceptable to publish bioinformatics software without releasing it openly into the wild.


* As a lab manager I prefer to use Sourceforge in our published work, since Sourceforge has a very draconian policy when it comes to deleting projects, which prevents accidental or willful deletion of a repository. In my opinion, Google Code and (especially) github are too permissive in terms of allowing projects to be deleted. As a lab head, I see it as my duty to ensure the long-term preservation of published code above all other considerations. I am aware that there are mechanisms to protect against deletion of repositories on github and Google Code, but I would suspect that most lab heads do not utilize them and that a substantial fraction of published academic code is one click away from deletion.

Announcing the PLoS Text Mining Collection

Based on a spur-of-the-moment tweet earlier this year, and a positive follow-up from Theo Bloom, I’m very happy to announce that PLoS has now put the wheels in motion to develop a Collection of articles that highlight the importance of Text Mining research. The Call for Papers has just been announced today, and I’m very excited to see this effort highlight the synergy between Open Access, Altmetrics and Text Mining research. I’m particularly keen to see someone take the reins on writing a good description of the API for PLoS (and other publishers). And a good lesson to all to be careful to watch what you tweet!

The Call for Papers below is cross-posted at the PLoS Blog.

Call for Papers: PLoS Text Mining Collection

The Public Library of Science (PLoS) seeks submissions in the broad field of text-mining research for a collection to be launched across all of its journals in 2013. All submissions submitted before October 30th, 2012 will be considered for the launch of the collection. Please read the following post for further information on how to submit your article.

The scientific literature is exponentially increasing in size, with thousands of new papers published every day. Few researchers are able to keep track of all new publications, even in their own field, reducing the quality of scholarship and leading to undesirable outcomes like redundant publication. While social media and expert recommendation systems provide partial solutions to the problem of keeping up with the literature, systematically identifying relevant articles and extracting key information from them can only come through automated text-mining technologies.

Research in text mining has made incredible advances over the last decade, driven through community challenges and increasingly sophisticated computational technologies. However, the promise of text mining to accelerate and enhance research largely has not yet been fulfilled, primarily since the vast majority of the published scientific literature is not published under an Open Access model. As Open Access publishing yields an ever-growing archive of unrestricted full-text articles, text mining will play an increasingly important role in drilling down to essential research and data in scientific literature in the 21st century scholarly landscape.

As part of its commitment to realizing the maximal utility of Open Access literature, PLoS is launching a collection of articles dedicated to highlighting the importance of research in the area of text mining. The launch of this Text Mining Collection complements related PLoS Collections on Open Access and Altmetrics (forthcoming), as well as the recent release of the PLoS Application Programming Interface, which provides an open API to PLoS journal content.

As part of this Text Mining Collection, we are making a call for high quality submissions that advance the field of text-mining research, including:

  • New methods for the retrieval or extraction of published scientific facts
  • Large-scale analysis of data extracted from the scientific literature
  • New interfaces for accessing the scientific literature
  • Semantic enrichment of scientific articles
  • Linking the literature to scientific databases
  • Application of text mining to database curation
  • Approaches for integrating text mining into workflows
  • Resources (ontologies, corpora) to improve text mining research

Please note that all submissions submitted before October 30th, 2012 will be considered for the launch of the collection (expected early 2013); submissions after this date will still be considered for the collection, but may not appear in the collection at launch.

Submission Guidelines
If you wish to submit your research to the PLoS Text Mining Collection, please consider the following when preparing your manuscript:

  • All articles must adhere to the submission guidelines of the PLoS journal to which you submit.
  • Standard PLoS policies and relevant publication fees apply to all submissions.
  • Submission to any PLoS journal as part of the Text Mining Collection does not guarantee publication.

When you are ready to submit your manuscript to the collection, please log in to the relevant PLoS manuscript submission system and mention the Collection’s name in your cover letter. This will ensure that the staff is aware of your submission to the Collection. The submission systems can be found on the individual journal websites.

Please contact Samuel Moore (smoore@plos.org) if you would like further information about how to submit your research to the PLoS Text Mining Collection.

Organizers
Casey Bergman (University of Manchester)
Lawrence Hunter (University of Colorado-Denver)
Andrey Rzhetsky (University of Chicago)

 

Comments on the RCUK’s New Draft Policy on Open Access

RCUK, the umbrella agency that represents several major publicly-funded Research Councils in the UK, has recently released a draft document outlining a revision to its policy on Open Access publishing for RCUK-funded research. One of the leaders of the Open Access movement, Peter Suber, has provided strong assent for this policy and ably summarized the salient features of this document on a G+ post, with which I concur. Based on his encouragement to submit comments to RCUK directly, I’ve emailed the following two points for RCUK to consider in their revision of this policy:

From: Casey Bergman <Casey.Bergman@xxx.xx.xx>
Date: 18 March 2012 15:22:29 GMT
To: <communications@rcuk.ac.uk>
Subject: Open Access Feedback

Hello -

I write to support the draft RCUK policy on Open Access, but would like to raise two points that I see as crucial to effectively achieving the aims of libre Open Access:

1) The green OA route does not always ensure libre OA, and often green OA documents remain unavailable for text and data mining. For example, author-deposited manuscripts in (UK)PMC are not available for text mining, since they are not in the “OA subset” (see https://caseybergman.wordpress.com/2012/02/11/why-the-research-works-act-doesnt-affect-text-mining-research/). Thus, for RCUK to mandate libre OA via the green route, RCUK would need to work with repositories like (UK)PMC to make sure that green author-deposited manuscripts go into the OA subset that can be automatically downloaded for re-use.

2) Further information should be provided about the following comment: “In addition, the Research Councils are happy to work with individual institutions on how they might build an institutional Open Access fund drawing from the indirect costs on grants.”  RCUK should take the lead on establishing financial models that are viable for recovering OA costs that can easily be adopted by universities.  Promoting the development of University OA funds that can effectively recover costs from RCUK grants to support gold OA papers that are published after the life-time of a grant would be a major boost for publishing RCUK funded work under a libre OA model.

Yours sincerely,

Casey Bergman, Ph.D.
Faculty of Life Sciences
University of Manchester
Michael Smith Building
Oxford Road, M13 9PT
Manchester, UK

Why Are There So Few Efforts to Text Mine the Open Access Subset of PubMed Central?

The open access movement in scientific publishing has two broad aims: (i) to make scientific articles more broadly accessible and (ii) to permit unrestricted re-use of published scientific content. From its humble beginnings in 2001 with only two journals, PubMed Central (PMC) has grown to become the world’s largest repository of full-text open-access biomedical articles, containing nearly 2.4 million biomedical articles that can be freely downloaded by anyone around the world. Thus, while holding only ~11% of the total published biomedical literature, PMC can be viewed clearly as a major success in terms of making the biomedical literature more broadly accessible.

However, I argue that PMC has yet to catalyze similar success on the second goal of the open-access movement: unrestricted re-use of published scientific content. This point became clear to me when writing the discussions for two papers that my lab published last year. In digging around for references to cite, I was struck by how difficult it was to find examples of projects that applied text-mining tools to the entire set of open-access articles from PubMed Central. Unsure if this was a reflection of my ignorance or the actual state of the art in the field, I canvassed the biological text mining community, the bioinformatics community and two major open-access publishers for additional examples of text-mining on the entire open-access subset of PMC.

Surprisingly, I found that after a decade of existence only ~15 articles* have ever been published that have used the entire open-access subset of PMC for text-mining research. In other words, fewer than two research articles per year are being published that actually use the open-access contents of PubMed Central for large-scale data mining or service provision. I find the lack of uptake of PMC by text-mining researchers to be rather astonishing, considering it is an incredibly rich archive of the combined output of thousands of scientists worldwide.

This observation raises the question: why are there so few efforts to text mine content in PubMed Central? I don’t pretend to have the answer, but there are a number of plausible reasons that include (but are not limited to):

  • The Open Access subset of PMC is such a small component of the entire published literature that it is unusable.
  • Full-text mining research is such a difficult task that it cannot be usefully done.
  • The text-mining community is more focused on developing methods than applying them to the biomedical literature.
  • There is not an established community of users for full-text mining research.
  • [insert your interpretation in the comments below]

Personally, I see none of these as valid explanations of why applying text mining tools to the entirety of the PMC open access subset remains so rare. While the fact that <2% of all articles in PubMed are in the PMC open-access subset is a limitation, the facts contained within the introductions and discussions of this subset cover a substantially broader proportion of scientific knowledge. Big data? Not compared to other areas of bioscience like genomics. Mining even a few mammalian genomes’ worth of DNA sequence data is more technically and scientifically challenging than mining the English text of ~400,000 full-text articles. Text-miners also routinely apply their systems to MEDLINE abstracts, albeit often on a small scale, and there is a growing community of biocurators and bioinformaticians eager to consume data from full-text mining. So what is going on here?
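The coverage figures quoted in this post can be cross-checked with a little arithmetic. The Python sketch below uses only the rounded numbers given in the text (2.4 million articles in PMC, PMC holding ~11% of the literature, ~400,000 articles in the open-access subset), so the outputs are approximate by construction.

```python
pmc_articles = 2_400_000   # "nearly 2.4 million" articles in PMC
pmc_share = 0.11           # PMC holds "~11%" of the published biomedical literature
oa_subset = 400_000        # "~400,000" full-text articles in the open-access subset

# Size of the whole literature implied by the first two figures.
total_literature = pmc_articles / pmc_share

# Fraction of the literature, and of PMC itself, that is open to bulk text mining.
oa_share_of_literature = oa_subset / total_literature
oa_share_of_pmc = oa_subset / pmc_articles

print(f"implied total literature: {total_literature / 1e6:.1f} million articles")
print(f"OA subset as share of literature: {oa_share_of_literature:.1%}")
print(f"OA subset as share of PMC: {oa_share_of_pmc:.1%}")

# implied total literature: 21.8 million articles
# OA subset as share of literature: 1.8%
# OA subset as share of PMC: 16.7%
```

So the "<2%" figure is consistent with the other numbers, and it also shows that only about one in six articles in PMC can actually be downloaded in bulk for mining.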

Perhaps it is worth drawing an analogy with another major resource that was released at roughly the same time as PMC: the human genome sequence. According to many, including those in the popular media, the promise of the human genome was oversold, perhaps to leverage financial support for this major project. Unfortunately, as Greg Petsko and Jonathan Eisen have argued, overselling the human genome project has had unintended negative consequences for the understanding, and perhaps funding, of basic research. Could the goal of reuse of open access articles likewise represent an overselling of the PMC repository? If so, then the open-access movement runs the risk of failing to deliver on one of the key planks in its platform. Failing to deliver on re-use could ultimately give funders (if no-one is using it, why should we pay) and publishers (if no-one is using it, why should we make it open) grounds to advocate green over gold open access, which could have a devastating impact on text-mining research, since author-deposited (green) manuscripts in PMC are off-limits for text-mining research.

I hope (and am actively working to prove) that re-use of the open access literature will not remain an unfulfilled promise. I suspect rather that we simply are in the lag phase before a period of explosive growth in full-text mining, akin to what happened in the field of genome-wide association studies after the publication of the human genome sequence. So text-miners, bioinformaticians, and computational biologists: do your part to maximize the utility of Varmus, Lipman and Brown’s vision of an arXiv for biology, and prove that the twin aims of the open access movement can be fulfilled.

* Published text mining studies using the entirety of the Open Access subset of PMC:

UPDATE – New papers using PMC since original post

Why the Research Works Act Doesn’t Affect Text-mining Research

As the central digital repository for life science publications, PubMed Central (PMC) is one of the most significant resources for making the Open Access movement a tangible reality for researchers and citizens around the world. Articles in PMC are deposited through two routes: either automatically by journals that participate in the PMC system, or directly by authors for journals that do not. Author deposition of peer-reviewed manuscripts in PMC is mandated by funders in order to make the results of publicly- or charity-funded research maximally accessible, and has led to over 180,000 articles being made free (gratis) to download that would otherwise be locked behind closed-access paywalls. Justifiably, there has been outrage over recent legislation (the Research Works Act) that would repeal the NIH mandate in the USA and thereby prevent important research from being freely available.

However, from a text-miner’s perspective author-deposited manuscripts in PMC are closed access since, while they can be downloaded and read individually, virtually none (<200) are available from the PMC’s Open Access subset that includes all articles that are free (libre) to download in bulk and text/data mine. This includes ~99% of the author deposited manuscripts from the journal Nature, despite a clear statement from 2009 entitled “Nature Publishing Group allows data- and text-mining on self-archived manuscripts”. In short, funder mandates only make manuscripts public but not open, and thus whether the RWA is passed or not is actually moot from a text/data-mining perspective.

Why is this important? The simple reason is that there are currently only ~400,000 articles in the PMC Open Access subset, and therefore author-deposited manuscripts are only two-fold less abundant than all articles currently available for text/data-mining. Thus what could be a potentially rich source of data for large-scale information extraction remains locked away from programmatic analysis. This is especially tragic considering the fact that at the point of manuscript acceptance, publishers have invested little-to-nothing into the publication process and their claim to copyright is most tenuous.
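The scale of what is locked away can be made concrete with the two figures quoted in this post. This is a rough Python sketch that treats the two sets as non-overlapping, which is an approximation (fewer than 200 author manuscripts are reported to be in the Open Access subset).

```python
oa_subset = 400_000           # "~400,000" articles in the PMC Open Access subset
author_manuscripts = 180_000  # "over 180,000" author-deposited manuscripts

# How many times larger the minable subset is than the manuscript set.
ratio = oa_subset / author_manuscripts

# Relative growth of the minable corpus if the manuscripts were opened up.
gain = author_manuscripts / oa_subset

print(f"OA subset is {ratio:.1f}x the size of the author-manuscript set")
print(f"opening the manuscripts would grow the minable corpus by {gain:.0%}")

# OA subset is 2.2x the size of the author-manuscript set
# opening the manuscripts would grow the minable corpus by 45%
```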

So instead of discussing whether we should support the status quo of weak funder mandates by working to block the RWA or expand NIH-like mandates (e.g. as under the Federal Research Public Access Act, FRPAA), the real discussion that needs to be had is how to make funder mandates stronger to insist (at a minimum) that author-deposited manuscripts be available for text/data-mining research. More of the same, no matter how much, only takes us half the distance towards the ultimate goals of the Open Access movement, and doesn’t permit the crucial text/data mining research that is needed to make sense of the deluge of information in the scientific literature.

Credits: Max Haeussler for making me aware of the lack of author manuscripts in PMC a few years back, and Heather Piwowar for recently jump-starting the key conversation on how to push for improved text/data mining rights in future funder mandates.

Why Doesn’t the Ecological Society of America Allow Their Open Access Content to be Text Mined?

A recent tweet from Todd Vision and a blog post by Jonathan Eisen have alerted me to the shameful defense of the status quo in scientific publishing advanced by the Ecological Society of America concerning the Office of Science and Technology Policy’s recent request for information on Open Access. This particular thread caught my eye because I still have fresh bruises from being denied access to Open Access ESA journal content for text-mining research. Denied access to Open content – how is this possible, you say?

Over the last two years I have been targeting scientific societies whose journals are not in the tiny fraction of the scientific literature in the PubMed Central Open Access subset, hoping to encourage them to release their content for text-mining research projects in my group (e.g. http://www.text2genome.org). My attitude has been that societies are the ones to go after, since they often hold the copyrights and are typically run by colleagues who I can directly appeal to. After productive (yet rather protracted) communication with The Genetics Society of America, the UK Genetics Society and the Society for Molecular Biology and Evolution, we’ve been able to obtain back-content for Genetics, Heredity and Molecular Biology and Evolution for our projects.* Heredity has gone so far as to announce on their home page that their content is now open for text-mining research (victory!).

In stark contrast, a similar line of inquiry with the Ecological Society of America has led to a very sour and unproductive experience, which I will summarize here to demonstrate that the ESA’s recent response letter to the OSTP is consistent with a general attitude of protecting their journal content. This narrative echoes Peter Murray-Rust’s painful story of his years negotiating with Elsevier for access to content, which likewise has no positive conclusion.

While Katherine McCarter is correct in saying that the ESA publishes a subset of their content under OA licenses, it is not true that this content is in any meaningful way “open” in a 21st-century, linked-data, remix-and-reuse context. Why? Because like virtually all of the ecological, agricultural and environmental literature, ESA OA content is not deposited in a public archive like PubMed Central, and can only be accessed via the ESA journal website. However, this content is not accessible to text-mining since the ESA journal permissions clearly state:

Altering, recompiling, systematic or programmatic copying, or reselling of text or other information from ESA Journals in any form or medium is prohibited. Systematic or programmatic downloading, service bureau redistribution services, printing for fee-for-service purposes and/or the systematic making of print or electronic copies for transmission to non-subscribing institutions are prohibited.

Since I have been burned in the past by aggressive closed-access publishers shutting down my office IP for naively downloading content that my university has a site license for, I dutifully went down the proper channel of requesting permissions to automatically download ESA content from the Permissions Editor, Dr. Cliff Duke. For the record, I can say that Dr. Duke has been faultlessly professional throughout the process and was positive about my initial request, an excerpt of which follows:

From: Cliff Duke <CSDuke@xxx.xxx>
Date: 28 June 2011 14:19:18 GMT+01:00
To: Casey Bergman <casey.bergman@xxx.xxx>
Subject: RE: request for permission to use ESA content in text-mining research

Casey,

In answer to your question — not that I recall, but I also don’t recall any previous similar requests, and I’ve been the permissions editor for about seven years. However, I doubt your request will be the last such, given the increasing interest in this kind of research.

Cliff

—–Original Message—–
From: Casey Bergman [mailto:casey.bergman@xxx.xxx]
Sent: Tuesday, June 28, 2011 9:14 AM
To: Cliff Duke
Subject: Re: request for permission to use ESA content in text-mining research

Dear Dr. Duke -

Many thanks for the very quick reply.  I appreciate your efforts in bringing this to the attention of the ESA leadership and I fully understand that this may take some time to sort out (it took several months with GSA).

One quick question at this stage: is there any precedent at ESA for permitting bulk access for text mining research?

Best regards,
Casey

—–Original Message—–
On 28 Jun 2011, at 13:48, Cliff Duke wrote:

Dr. Bergman,

I will discuss your request with our executive director and editors and get back to you as soon as I can. Our director is on travel this week, and I am on vacation next week, so it may be a couple of weeks before you hear back from us. Let me know if you have any questions meanwhile.

Regards,
Cliff Duke

Clifford S. Duke, Ph.D.
Permissions Editor

Ecological Society of America
1990 M Street NW, Suite 700
Washington, DC 20036
Phone: (202) 833-8773
Fax: (202) 833-8775
E-mail: csduke [at] esa.org

—–Original Message—–
From: Casey Bergman [mailto:casey.bergman@xxx.xxx]
Sent: Tuesday, June 28, 2011 8:34 AM
To: Cliff Duke
Subject: request for permission to use ESA content in text-mining research

Dear Dr. Duke -

Greetings, I am a researcher at the University of Manchester, with an interest in application of text and data mining to biological problems at the interface of computational and evolutionary biology. I am writing to request permission to use ESA journal content in text-mining research project that I am developing to submit as proposal to the UK Natural Environment Research Council. Specifically, I would like to request permission to automate download of the entire/the open-access subset of all ESA titles, which I understand is not permitted under the standard ESA policy (http://www.esapubs.org/esapubs/permissions.htm).

[SNIP]

Dr. Duke then put me in touch with the Managing Editor of the ESA journals, David Baldwin, who promptly ignored my request for several months, despite repeated emails and phone calls (e.g. on 6 September 2011, 27 September 2011 and 13 October 2011). I finally received one email from David Baldwin on 20 October 2011, where he promised (but failed) to get back to me a few days later:

From: Casey Bergman <casey.bergman@xxx.xxx>
Date: 20 October 2011 12:42:01 GMT+01:00
To: J David Baldwin <jdb27@xxx.xxx>
Cc: Cliff Duke <csduke@xxx.xxx>
Bcc: Casey Bergman <casey.bergman@xxx.xx>
Subject: Re: request for permission to use ESA content in text-mining research

Dear David -

Many thanks for replying. I fully understand that nonstandard requests can take some time. My previous interactions with Genetics and Heredity have also taken many months to lead to positive decisions on releasing content for text mining research.

All that is required in the short term is explicit permission to execute automated downloads on the http://www.esajournals.org site that abide by the limits of your systems (<50 sessions in 10 minutes).  No other technical issues need to be addressed on your side.

Also, I am happy in the first instance to restrict automated downloads to the Open Access subset of ESA publications, if a decision to permit access to the entirety of ESA content is more difficult.

Best regards,
Casey

On 20 Oct 2011, at 12:03, J David Baldwin wrote:

Dear Dr. Bergman–

I know you are keen to discuss this, but I’m afraid you have picked a particularly bad week for contacting me. Today (Thursday, 20 October) won’t be any better than the past two days, and tomorrow (Friday) I’ll be out of town. I’ll look over your request before Monday, 24 October, and will e-mail you again by then. I’m afraid I get handed off the nonstandard requests (“the buck stops here”), and yet I have my own priorities (i.e., working to keep the journal issues on schedule).

Cheers,
David

—–Original Message—–
From: Casey Bergman [mailto:casey.bergman@xxx.xxx]
Sent: Thursday, October 13, 2011 9:01 AM
To: J David Baldwin
Cc: Cliff Duke
Subject: Re: request for permission to use ESA content in text-mining research

Dear David -

Would there be a convenient time for you sometime in the next week or so to discuss how we might be able to access ESA content programmatically?  I am +5 hrs to you, so late afternoon for me = morning for you is typically the best timeline to arrange a call.

Best regards,
Casey
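
As an aside on the limit quoted in the email above (<50 sessions in 10 minutes): staying under a cap like that takes very little code on the downloader's side. The following is only a minimal sketch under stated assumptions, not the script used in this exchange. The class name SlidingWindowLimiter, the choice to count each request as one "session", and the default of 49 events per 600 seconds are all illustrative assumptions; the ESA did not specify how sessions are counted.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `max_events` calls in any `window_seconds` interval.

    Hypothetical helper for illustration: treating one request as one
    "session" is an assumption, not something stated by the publisher.
    """

    def __init__(self, max_events=49, window_seconds=600.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_events = max_events
        self.window_seconds = window_seconds
        self._clock = clock      # injectable so the limiter can be tested
        self._sleep = sleep      # without actually waiting
        self._stamps = deque()   # timestamps of recent events, oldest first

    def wait_time(self):
        """Return seconds to wait before the next event is allowed."""
        now = self._clock()
        # Forget events that have fallen out of the window.
        while self._stamps and now - self._stamps[0] >= self.window_seconds:
            self._stamps.popleft()
        if len(self._stamps) < self.max_events:
            return 0.0
        # Window is full: wait until the oldest event expires.
        return self.window_seconds - (now - self._stamps[0])

    def acquire(self):
        """Block until an event is allowed, then record it."""
        while True:
            delay = self.wait_time()
            if delay <= 0:
                break
            self._sleep(delay)
        self._stamps.append(self._clock())
```

A download loop would call acquire() on a single shared limiter immediately before each request, so that the script pauses on its own whenever the window is full rather than relying on the server to cut it off.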

Well I can report that as of 6 January 2012, the buck certainly has stopped with David Baldwin on this issue, since he still refuses to respond to the most minimal request for permission to automatically download only the Open Access subset of the ESA content. Since no effort is required on his part other than to say “yes”, I take his inaction to speak for the ESA that they have no interest in supporting text and data mining research on their content — even their OA content — which is fully consistent with their specious arguments for protectionism of society subsidies through closed access publishing put forward in their response letter to the OSTP. Given the undeniable importance of data in the ecological literature for science and society, the ESA should be ashamed for locking away this precious resource from the world and being adamant in their position that this is in any way morally or ethically justified.

I look forward to David Baldwin’s response on this request, and hope that the ESA is more progressive in their outlook toward open access publishing and text/data mining in the coming years….

* Credits to Tracey DePellegrin Connelly, Scott Hawley and Lauren McIntyre for helping to free Genetics content; Roger Butlin for helping free Heredity content; and Ken Wolfe, Soojin Yi and the SMBE council for helping to free MBE content.

Just Say No – The Roberts/Ashburner Response

UPDATE: see follow-up post “The Roberts/Ashburner Response” to get more of the story on the origin of this letter.

I had the pleasure of catching up with my post-doc mentor Michael Ashburner today, and among other things we discussed the ongoing development of UKPMC and the importance of open access publishing. Although I consider myself a strong open access advocate, I did not sign the PLoS open letter in 2001, since at the time I was a post-doc and not in a position fully to control where I published. Therefore I couldn’t be sure that I could abide by the manifesto 100%, and didn’t want to put my name to something I couldn’t deliver on. As it turns out this is still the case to a certain degree and (because of collaborations) my freely-available-article-index remains at a respectable 85% (33/39), but alas will never reach the coveted 100% mark.

Nevertheless, I have steadily adopted most of the policies of the open letter, especially as my group has gotten more heavily involved in text-mining research over the years. This became especially true after a nasty encounter with one publisher in 2008 caused campus IT to shut down my office IP for downloading articles from a journal for which our University has a site license, which radicalized me into more of an open access evangelist. After discussing this event at the time with Ashburner, he reminded me of the manifesto and one of its most powerful tools for changing the landscape of scholarly publishing – refusing to review for journals/publishers who do not submit their content to PubMed Central (see the white-list of journals here).

I have dug this letter out countless times since then and used versions of it when asked to review for non-PMC journals, as it expresses the principles in plain and powerful language. I had another call to dig it out today and thought that I’d post the “Ashburner response” so others have a model to follow if they choose this path.

Enjoy!

From: “Michael Ashburner” <michael.ashburner@xxx.xxx>
Date: 30 August 2008 13:48:03 GMT+01:00
To: “Casey Bergman” <casey.bergman@xxx.xxx>
Subject: Just say No

Dear Editor,

Thank you for your invitation to review for your journal. Because it is not open access and does not provide its back content to PubMed Central, or any similar resource, I regret that I am unwilling to do this.

I would urge you to seriously reconsider both policies and would ask that you send this letter to your co-editors and publisher. In the event that you do change your policy, even to the extent of providing your back content to PubMed Central, or a similar resource, then I will be happy to review for you.

The scientific literature is at present the most significant resource available to researchers. Without access to the literature we cannot do science in any scholarly manner. Your journal refuses to embrace the idea that the purpose of the scientific literature is to communicate knowledge, not to make a profit for publishers. Without the free input of manuscripts and referees’ time your journal would not exist. By and large, the great majority of the work you publish is paid for by taxpayers. We now, either as individuals or as researchers whose grants are top-sliced, have to pay to read our own work and that of our colleagues, either personally or through our institutes’ libraries. I find that, increasingly, literature that is not available by open access is simply being ignored. Moreover, I am very aware that, increasingly, discovering information from the literature relies on some sort of computational analysis. This can only be effective if the entire content of primary research papers is freely available. Finally, by not being an open access journal you are disenfranchising both scientists who cannot afford (or whose institutions cannot afford) to pay for access and the general public.

There are now several good models for open access publication, and I would urge your journal to adopt one of these. There is an extensive literature on open access publishing, and its economic implications. I would be pleased to send you references to this literature.

Yours sincerely,

Michael Ashburner

Is Science really “Making Data Maximally Available”?

Earlier this year Hanson, Sugden and Alberts [1] argued in a piece entitled “Making Data Maximally Available” that journals like Science play a crucial role in making scientific data “publicly and permanently available” and that efforts to improve the standard of supporting online materials will increase their utility and the impact of their associated publications. While I whole-heartedly agreed with their view that improving supplemental materials is a better solution to the current disorganized [2] and impermanent [3] state of affairs (as opposed to the unwise alternative of discarding them altogether [4]), there were a few things about this piece that really irked me. I had intended to write a letter to the editor on this with a colleague, but that unfortunately didn’t materialize, so I thought I’d post my points here.

First, the authors make an artificial distinction between the supporting online materials associated with a paper and the contents of the paper itself. Clearly the most important data in a scientific report is in the full text of the article, and thus if making data in supporting online materials “maximally available” is a goal, surely so must be making data in the full-text article itself. Second, in the context of the wider discussion on “big data” in which these points are made, it must be noted that maximal availability is only one step towards maximal utility, the other being maximal access. As the entire content of Science magazine is not available for unrestricted download and re-use from PubMed Central’s Open Access repository, maximal utility of data in the full text or supplemental materials of articles published in Science is currently fettered because it is not available for bulk text mining or data mining. Amazingly, this is true even for Author-deposited manuscripts in PubMed Central, which are not currently included in the PubMed Central Open Access subset and therefore not available for bulk download and re-use.

Therefore it seems imperative that, in addition to making a clarion call for the improved availability of data, code and references in supplemental materials, the Editors of Science should issue a clear policy statement about the use of full-text articles and supplemental online materials that are published in Science for text and data mining research. At a minimum, Science should join with other high profile journals such as Nature [5] in clarifying the use for text and data mining of Author-deposited manuscripts in PubMed Central, which are required to be deposited under funding body mandates for these very purposes. Additionally, Science should make a clear statement about the copyright and re-use policies for supporting online materials of all published articles, which are freely available for download without a Science subscription, and currently fall in the grey area between restricted and open access.

As we move firmly into the era of big data where issues of access and re-use of data are becoming increasingly acute, Science, as the representative publication of the world’s largest general scientific society, should take the lead in opening its content for text and data mining, to the mutual benefit of authors, researchers and the AAAS.

References:

1. Hanson et al. (2011) Making Data Maximally Available. Science 331:649
2. Santos et al. (2005) Supplementary data need to be kept in public repositories. Nature 438:738
3. Anderson et al. (2006) On the persistence of supplementary resources in biomedical publications. BMC Bioinformatics 7:260
4. Journal of Neuroscience policy on Supplemental Material
5. Nature Press release on data- and text-mining of self-archived manuscripts
