Archive for the 'data mining' Category

Launch of the PLOS Text Mining Collection

Just a quick post to announce that the PLOS Text Mining Collection is now live!

This PLOS Collection arose out of a twitter conversation with Theo Bloom last year, and has come together through the hard work of the authors of the papers in the Collection, the PLOS Collections team (in particular Sam Moore and Jennifer Horsely), and my co-organizers Larry Hunter and Andrey Rzhetsky. Many thanks to all for seeing this effort to completion.

Because of the large body of work in the area of text mining published in PLOS, we struggled with how best to present all these papers in the collection without diluting the experience for the reader. In the end, we decided only to highlight new work from the last two years and major reviews/tutorials at the time of launch. However, as this is a living collection, new articles will be included in the future, and the aim is to include previously published work as well. We hope to see many more papers in the area of text mining published in the PLOS family of journals in the future.

An overview of the PLOS Text Mining Collection is below (cross-posted at the PLOS EveryONE blog) and a commentary on the Collection is available at the Official PLOS Blog entitled “A mine of information – the PLOS Text Mining Collection”.

Background to the PLOS Text Mining Collection

Text Mining is an interdisciplinary field combining techniques from linguistics, computer science and statistics to build tools that can efficiently retrieve and extract information from digital text. Over the last few decades, there has been increasing interest in text mining research because of the potential commercial and academic benefits this technology might enable. However, as with the promises of many new technologies, the benefits of text mining are still not clear to most academic researchers.

This situation is now poised to change for several reasons. First, the rate of growth of the scientific literature has now outstripped the ability of individuals to keep pace with new publications, even in a restricted field of study. Second, text-mining tools have steadily increased in accuracy and sophistication to the point where they are now suitable for widespread application. Finally, the rapid increase in availability of digital text in an Open Access format now permits text-mining tools to be applied more freely than ever before.

To acknowledge these changes and the growing body of work in the area of text mining research, today PLOS launches the Text Mining Collection, a compendium of major reviews and recent highlights published in the PLOS family of journals on the topic of text mining. Given that PLOS is one of the major publishers of the Open Access scientific literature, it is perhaps no coincidence that research in text mining in PLOS journals is flourishing. As noted above, the widespread application and societal benefits of text mining are most easily achieved under an Open Access model of publishing, where the barriers to obtaining published articles are minimized and the ability to remix and redistribute data extracted from text is explicitly permitted. Furthermore, PLOS is one of the few publishers that is actively promoting text mining research by providing an open Application Programming Interface to mine its journal content.

Text Mining in PLOS

Since virtually the beginning of its history [1], PLOS has actively promoted the field of text mining by publishing reviews, opinions, tutorials and dozens of primary research articles in this area in PLOS Biology, PLOS Computational Biology and, increasingly, PLOS ONE. Because of the large number of text mining papers in PLOS journals, we are only able to highlight a subset of these works in the first instance of the PLOS Text Mining Collection. These include major reviews and tutorials published over the last decade [1][2][3][4][5][6], plus a selection of research papers from the last two years [7][8][9][10][11][12][13][14][15][16][17][18][19] and three new papers arising from the call for papers for this collection [20][21][22].
The research papers included in the collection at launch provide important overviews of the field and reflect many exciting contemporary areas of research in text mining, such as:

  • methods to extract textual information from figures [7];
  • methods to cluster [8] and navigate [15] the burgeoning biomedical literature;
  • integration of text-mining tools into bioinformatics workflow systems [9];
  • use of text-mined data in the construction of biological networks [10];
  • application of text-mining tools to non-traditional textual sources such as electronic patient records [11] and social media [12];
  • generating links between the biomedical literature and genomic databases [13];
  • application of text-mining approaches in new areas such as the Environmental Sciences [14] and Humanities [16][17];
  • named entity recognition [18];
  • assisting the development of ontologies [19];
  • extraction of biomolecular interactions and events [20][21]; and
  • assisting database curation [22].

Looking Forward

As this is a living collection, it is worth discussing two issues we hope to see addressed in articles that are added to the PLOS text mining collection in the future: scaling up and opening up. While application of text mining tools to abstracts of all biomedical papers in the MEDLINE database is increasingly common, there have been remarkably few efforts that have applied text mining to the entirety of the full text articles in a given domain, even in the biomedical sciences [4][23]. Therefore, we hope to see more text mining applications scaled up to use the full text of all Open Access articles. Scaling up will maximize the utility of text-mining technologies and their uptake by end users, and will also demonstrate that demand for access to full text articles exists in the text mining and wider academic communities.

Likewise, we hope to see more text-mining software systems made freely or openly available in the future. As an example of the state of affairs in the field, only 25% of the research articles highlighted in the PLOS text mining collection at launch provide source code or executable software of any kind [13][16][19][21]. The lack of availability of software or source code accompanying published research articles is, of course, not unique to the field of text mining. It is a general problem limiting progress and reproducibility in many fields of science, which authors, reviewers and editors have a duty to address. Making release of open source software the rule, rather than the exception, should further catalyze advances in text mining, as it has in other fields of computational research that have made extremely rapid progress in the last decades (such as genome bioinformatics).
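The 25% figure can be checked directly from the reference numbers used in this overview. The snippet below is only a minimal sketch in Python: the variable names are illustrative, and the lists simply restate the citation numbers given above ([7] to [22] for the research articles at launch, and [13][16][19][21] for those providing software).

```python
# Reference numbers as cited in this overview (not a query against any database).
research_papers = list(range(7, 23))   # [7]..[22]: research articles highlighted at launch
with_software = [13, 16, 19, 21]       # articles cited as providing source code or executables

# Fraction of highlighted research articles that provide software of any kind.
share = len(with_software) / len(research_papers)
print(f"{len(with_software)} of {len(research_papers)} research articles = {share:.0%}")
```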

By opening up the code base in text mining research, and deploying text-mining tools at scale on the rapidly growing corpus of full-text Open Access articles, we are confident this powerful technology will make good on its promise to catalyze scholarly endeavors in the digital age.

References

1. Dickman S (2003) Tough mining: the challenges of searching the scientific literature. PLoS Biol 1: e48. doi:10.1371/journal.pbio.0000048.
2. Rebholz-Schuhmann D, Kirsch H, Couto F (2005) Facts from Text—Is Text Mining Ready to Deliver? PLoS Biol 3: e65. doi:10.1371/journal.pbio.0030065.
3. Cohen B, Hunter L (2008) Getting started in text mining. PLoS Comput Biol 4: e20. doi:10.1371/journal.pcbi.0040020.
4. Bourne PE, Fink JL, Gerstein M (2008) Open access: taking full advantage of the content. PLoS Comput Biol 4: e1000037. doi:10.1371/journal.pcbi.1000037.
5. Rzhetsky A, Seringhaus M, Gerstein M (2009) Getting Started in Text Mining: Part Two. PLoS Comput Biol 5: e1000411. doi:10.1371/journal.pcbi.1000411.
6. Rodriguez-Esteban R (2009) Biomedical Text Mining and Its Applications. PLoS Comput Biol 5: e1000597. doi:10.1371/journal.pcbi.1000597.
7. Kim D, Yu H (2011) Figure text extraction in biomedical literature. PLoS ONE 6: e15338. doi:10.1371/journal.pone.0015338.
8. Boyack K, Newman D, Duhon R, Klavans R, Patek M, et al. (2011) Clustering More than Two Million Biomedical Publications: Comparing the Accuracies of Nine Text-Based Similarity Approaches. PLoS ONE 6: e18029. doi:10.1371/journal.pone.0018029.
9. Kolluru B, Hawizy L, Murray-Rust P, Tsujii J, Ananiadou S (2011) Using workflows to explore and optimise named entity recognition for chemistry. PLoS ONE 6: e20181. doi:10.1371/journal.pone.0020181.
10. Hayasaka S, Hugenschmidt C, Laurienti P (2011) A network of genes, genetic disorders, and brain areas. PLoS ONE 6: e20907. doi:10.1371/journal.pone.0020907.
11. Roque F, Jensen P, Schmock H, Dalgaard M, Andreatta M, et al. (2011) Using electronic patient records to discover disease correlations and stratify patient cohorts. PLoS Comput Biol 7: e1002141. doi:10.1371/journal.pcbi.1002141.
12. Salathé M, Khandelwal S (2011) Assessing Vaccination Sentiments with Online Social Media: Implications for Infectious Disease Dynamics and Control. PLoS Comput Biol 7: e1002199. doi:10.1371/journal.pcbi.1002199.
13. Baran J, Gerner M, Haeussler M, Nenadic G, Bergman C (2011) pubmed2ensembl: a resource for mining the biological literature on genes. PLoS ONE 6: e24716. doi:10.1371/journal.pone.0024716.
14. Fisher R, Knowlton N, Brainard R, Caley J (2011) Differences among major taxa in the extent of ecological knowledge across four major ecosystems. PLoS ONE 6: e26556. doi:10.1371/journal.pone.0026556.
15. Hossain S, Gresock J, Edmonds Y, Helm R, Potts M, et al. (2012) Connecting the dots between PubMed abstracts. PLoS ONE 7: e29509. doi:10.1371/journal.pone.0029509.
16. Ebrahimpour M, Putniņš TJ, Berryman MJ, Allison A, Ng BW-H, et al. (2013) Automated authorship attribution using advanced signal classification techniques. PLoS ONE 8: e54998. doi:10.1371/journal.pone.0054998.
17. Acerbi A, Lampos V, Garnett P, Bentley RA (2013) The Expression of Emotions in 20th Century Books. PLoS ONE 8: e59030. doi:10.1371/journal.pone.0059030.
18. Groza T, Hunter J, Zankl A (2013) Mining Skeletal Phenotype Descriptions from Scientific Literature. PLoS ONE 8: e55656. doi:10.1371/journal.pone.0055656.
19. Seltmann KC, Pénzes Z, Yoder MJ, Bertone MA, Deans AR (2013) Utilizing Descriptive Statements from the Biodiversity Heritage Library to Expand the Hymenoptera Anatomy Ontology. PLoS ONE 8: e55674. doi:10.1371/journal.pone.0055674.
20. Van Landeghem S, Bjorne J, Wei C-H, Hakala K, Pyysalo S, et al. (2013) Large-Scale Event Extraction from Literature with Multi-Level Gene Normalization. PLoS ONE 8: e55814. doi:10.1371/journal.pone.0055814.
21. Liu H, Hunter L, Keselj V, Verspoor K (2013) Approximate Subgraph Matching-based Literature Mining for Biomedical Events and Relations. PLoS ONE 8: e60954. doi:10.1371/journal.pone.0060954.
22. Davis A, Wiegers T, Johnson R, Lay J, Lennon-Hopkins K, et al. (2013) Text mining effectively scores and ranks the literature for improving chemical-gene-disease curation at the Comparative Toxicogenomics Database. PLoS ONE 8: e58201. doi:10.1371/journal.pone.0058201.
23. Bergman CM (2012) Why Are There So Few Efforts to Text Mine the Open Access Subset of PubMed Central? https://caseybergman.wordpress.com/2012/03/02/why-are-there-so-few-efforts-to-text-mine-the-open-access-subset-of-pubmed-central/.

Will the Democratization of Sequencing Undermine Openness in Genomics?

It is no secret, nor is it an accident, that the success of genome biology over the last two decades owes itself in large part to the Open Science ideals and practices that underpinned the Human Genome Project. From the development of the Bermuda principles in 1996 to the Ft. Lauderdale agreement in 2003, leaders in the genomics community fought for rapid, pre-publication data release policies that have (for the most part) protected the interests of genome sequencing centers and the research community alike.

As a consequence, progress in genomic data acquisition and analysis has been incredibly fast, leading to major basic and medical breakthroughs, thousands of publications, and ultimately to new technologies that now permit extremely high-throughput DNA sequencing. These new sequencing technologies now give individual groups sequencing capabilities that were previously only achievable by large sequencing centers. This development makes it timely to ask: how do the data release policies for primary genome sequences apply in the era of next-generation sequencing (NGS)?

My reading of the history of genome sequence release policies condenses the key issues as follows:

  • The Bermuda Principles say that assemblies of primary genomic sequences of human and other organisms should be made publicly available within 24 hours of their production.
  • The Ft. Lauderdale Agreement says that whole genome shotgun reads should be deposited in public repositories within one week of generation. (The agreement also encouraged applying this policy to other types of data from “community resource projects”, defined as research projects specifically devised and implemented to create a set of data, reagents or other material whose primary utility will be as a resource for the broad scientific community.)

Thus, the agreed standard in the genomics field is that raw sequence data from the primary genomic sequence of organisms should be made available within a week of generation. In my view this also applies to so-called “resequencing” efforts (like the 1000 Genomes Project), since genomic data from a new strain or individual is actually a new primary genome sequence.

The key question concerning genomic data release policies in the NGS era, then, is: do these data release policies apply only to sequencing centers or to any group producing primary genomic data? Now that you are a sequencing center, are you also bound by the obligations that sequencing centers have followed for a decade or more? This is an important issue to discuss for its own sake in order to promote Open Science, but also for the conundrums it throws up about data release policies in genomics. For example, if individual groups who are sequencing genomes are not bound by the same data release policies as sequencing centers, then a group at e.g. Sanger or Baylor working on a genome is actually now put at a competitive disadvantage in the NGS era because they would be forced to release their data.

I argue that if the wider research community does not abide by the current practices of early data release in genomics, the democratization of sequencing will lead to the slow death of openness in genomics. We could very well see a regression to the mean behavior of data hoarding (I sometimes call this “data mine, mine, mining”) that is sadly characteristic of most of biological sciences. In turn this could decelerate progress in genomics, leading to a backlog of terabytes of un(der)analyzed data rotting on disks around the world. Are you prepared to stand by, do nothing and bear witness to this bleak future? ; )

While many individual groups collecting primary genomic sequence data may hesitate to embrace the idea of pre-publication data release, it should be noted that there is also a standard procedure in place for protecting the interests of the data producer to have first chance to publish (or co-publish) large-scale analysis of the data, while permitting the wider research community to have early access. The Ft. Lauderdale agreement recognized that:

…very early data release model could potentially jeopardize the standard scientific practice that the investigators who generate primary data should have both the right and responsibility to publish the work in a peer-reviewed journal. Therefore, NHGRI agreed to the inclusion of a statement on the sequence trace data permitting the scientific community to use these unpublished data for all purposes, with the sole exception of publication of the results of a complete genome sequence assembly or other large-scale analyses in advance of the sequence producer’s initial publication.

This type of data producer protection proviso has been taken up by some community-led efforts to release large amounts of primary sequence data prior to publication, as laudably done by the Drosophila Population Genomics Project (Thanks Chuck!).

While the Ft. Lauderdale agreement in principle tries to balance the interests of the data producers and consumers, it is not without failings. As Mike Eisen points out on his blog:

In practice [the Ft. Lauderdale proviso] has also given data producers the power to create enormous consortia to analyze data they produce, effectively giving them disproportionate credit for the work of large communities. It’s a horrible policy that has significantly squelched the development of a robust genome analysis community that is independent of the big sequencing centers.

Eisen rejects the Ft. Lauderdale agreement in favor of a new policy he entitles The Batavia Open Genomic Data Licence. The Batavia License does not require an embargo period, nor does it require data users to inform data producers of how they intend to use the data, as is expected under the Ft. Lauderdale agreement, but it does require that groups using the data publish in an open access journal. Therefore the Batavia License is not truly open either, and I fear that it imposes unnecessary restrictions that will prevent its widespread uptake. The only truly Open Science policy for data release is a Creative Commons (CC-BY or CC-Zero) style license that has no restrictions other than attribution, a precedent that was established last year for the E. coli TY-2482 genome sequence (BGI you rock!).

A CC-style license will likely be too liberal for most labs generating their own data, and thus I argue we may be better off pushing for individual groups to use a Ft. Lauderdale style agreement to encourage the (admittedly less than optimal) status quo to be taken up by the wider community. Another option is for researchers to release their data early via “data publications” such as those being developed by journals such as GigaScience and F1000 Reports.

Whatever the mechanism, I join with Eisen in calling for wider participation by the research community in releasing their primary genomic sequence data. Indeed, it would be a truly sad twist of fate if the wider research community does not follow the genomic data release policies in the post-NGS era that were put in place in the pre-NGS era in order to protect their interests. I for one will do my best in the coming years to reciprocate the generosity that has made the Drosophila genomics community so great (in the long tradition of openness dating back to the Morgan school), by releasing any primary sequence data produced by my lab prior to publication. Watch this space.

Why Are There So Few Efforts to Text Mine the Open Access Subset of PubMed Central?

The open access movement in scientific publishing has two broad aims: (i) to make scientific articles more broadly accessible and (ii) to permit unrestricted re-use of published scientific content. From its humble beginnings in 2001 with only two journals, PubMed Central (PMC) has grown to become the world’s largest repository of full-text open-access biomedical articles, containing nearly 2.4 million biomedical articles that can be freely downloaded by anyone around the world. Thus, while holding only ~11% of the total published biomedical literature, PMC can be viewed clearly as a major success in terms of making the biomedical literature more broadly accessible.

However, I argue that PMC has yet to catalyze similar success on the second goal of the open-access movement: unrestricted re-use of published scientific content. This point became clear to me when writing the discussions for two papers that my lab published last year. In digging around for references to cite, I was struck by how difficult it was to find examples of projects that applied text-mining tools to the entire set of open-access articles from PubMed Central. Unsure if this was a reflection of my ignorance or the actual state of the art in the field, I canvassed the biological text mining community, the bioinformatics community and two major open-access publishers for additional examples of text-mining on the entire open-access subset of PMC.

Surprisingly, I found that after a decade of existence only ~15 articles* have ever been published that have used the entire open-access subset of PMC for text-mining research. In other words, fewer than two research articles per year are being published that actually use the open-access contents of PubMed Central for large-scale data mining or service provision. I find the lack of uptake of PMC by text-mining researchers to be rather astonishing, considering it is an incredibly rich archive of the combined output of thousands of scientists worldwide.

This observation raises the question: why are there so few efforts to text mine content in PubMed Central? I don’t pretend to have the answer, but there are a number of plausible reasons that include (but are not limited to):

  • The Open Access subset of PMC is such a small component of the entire published literature that it is unusable.
  • Full-text mining research is such a difficult task that it cannot be usefully done.
  • The text-mining community is more focused on developing methods than applying them to the biomedical literature.
  • There is not an established community of users for full-text mining research.
  • [insert your interpretation in the comments below]

Personally, I see none of these as valid explanations of why applying text mining tools to the entirety of the PMC open access subset remains so rare. While it is a limitation that <2% of all articles in PubMed are in the PMC open-access subset, the facts contained within the introductions and discussions of this subset cover a substantially broader proportion of scientific knowledge. Big data? Not compared to other areas of bioscience like genomics. Mining even a few mammalian genomes’ worth of DNA sequence data is more technically and scientifically challenging than mining the English text of ~400,000 full-text articles. Text-miners also routinely apply their systems to MEDLINE abstracts, albeit often on a small scale, and there is a growing community of biocurators and bioinformaticians eager to consume data from full-text mining. So what is going on here?

Perhaps it is worth drawing an analogy with another major resource that was released at roughly the same time as PMC: the human genome sequence. According to many, including those in the popular media, the promise of the human genome was oversold, perhaps to leverage financial support for this major project. Unfortunately, as Greg Petsko and Jonathan Eisen have argued, overselling the human genome project has had unintended negative consequences for the understanding, and perhaps funding, of basic research. Could the goal of reuse of open access articles likewise represent an overselling of the PMC repository? If so, then the open-access movement runs the risk of failing to deliver on one of the key planks in its platform. Failing to deliver on re-use could ultimately give funders (if no-one is using it, why should we pay) and publishers (if no-one is using it, why should we make it open) a justification to advocate green over gold open access, which could have a devastating impact on text-mining research, since author-deposited (green) manuscripts in PMC are off-limits for text-mining research.

I hope (and am actively working to prove) that re-use of the open access literature will not remain an unfulfilled promise. I suspect rather that we simply are in the lag phase before a period of explosive growth in full-text mining, akin to what happened in the field of genome-wide association studies after the publication of the human genome sequence. So text-miners, bioinformaticians, and computational biologists: do your part to maximize the utility of Varmus, Lipman and Brown’s vision of an arXiv for biology, and prove that the twin aims of the open access movement can be fulfilled.

* Published text mining studies using the entirety of the Open Access subset of PMC:

UPDATE – New papers using PMC since original post

Why the Research Works Act Doesn’t Affect Text-mining Research

As the central digital repository for life science publications, PubMed Central (PMC) is one of the most significant resources for making the Open Access movement a tangible reality for researchers and citizens around the world. Articles in PMC are deposited through two routes: either automatically by journals that participate in the PMC system, or directly by authors for journals that do not. Author deposition of peer-reviewed manuscripts in PMC is mandated by funders in order to make the results of publicly- or charity-funded research maximally accessible, and has led to over 180,000 articles being made free (gratis) to download that would otherwise be locked behind closed-access paywalls. Justifiably, there has been outrage over recent legislation (the Research Works Act) that would repeal the NIH mandate in the USA and thereby prevent important research from being freely available.

However, from a text-miner’s perspective, author-deposited manuscripts in PMC are closed access since, while they can be downloaded and read individually, virtually none (<200) are available from the PMC’s Open Access subset that includes all articles that are free (libre) to download in bulk and text/data mine. This includes ~99% of the author-deposited manuscripts from the journal Nature, despite a clear statement from 2009 entitled “Nature Publishing Group allows data- and text-mining on self-archived manuscripts”. In short, funder mandates only make manuscripts public but not open, and thus whether the RWA is passed or not is actually moot from a text/data-mining perspective.

Why is this important? The simple reason is that there are currently only ~400,000 articles in the PMC Open Access subset, and therefore author-deposited manuscripts are only two-fold less abundant than all articles currently available for text/data-mining. Thus what could be a potentially rich source of data for large-scale information extraction remains locked away from programmatic analysis. This is especially tragic considering the fact that at the point of manuscript acceptance, publishers have invested little-to-nothing into the publication process and their claim to copyright is most tenuous.
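As a rough back-of-the-envelope check on the “two-fold” claim, here is a minimal sketch in Python. The variable names are illustrative, and the counts are only the approximate figures quoted in this post (“over 180,000” author manuscripts and “~400,000” Open Access subset articles), not fresh queries against PMC.

```python
# Approximate counts quoted in this post (assumptions, not live PMC statistics).
author_manuscripts = 180_000   # author-deposited manuscripts, free to read individually
oa_subset = 400_000            # articles in the PMC Open Access subset, free to mine in bulk

# How many times larger the minable corpus is than the author-manuscript pool.
ratio = oa_subset / author_manuscripts

# Relative growth of the minable corpus if author manuscripts were opened to mining.
increase = author_manuscripts / oa_subset

print(f"OA subset is {ratio:.1f}x the size of the author-manuscript pool")
print(f"Opening author manuscripts would grow the minable corpus by about {increase:.0%}")
```

On these figures, opening up author manuscripts for mining would enlarge the corpus available for text/data-mining by a little under half.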

So instead of discussing whether we should support the status quo of weak funder mandates by working to block the RWA or expand NIH-like mandates (e.g. as under the Federal Research Public Access Act, FRPAA), the real discussion that needs to be had is how to make funder mandates stronger to insist (at a minimum) that author-deposited manuscripts be available for text/data-mining research. More of the same, no matter how much, only takes us half the distance towards the ultimate goals of the Open Access movement, and doesn’t permit the crucial text/data mining research that is needed to make sense of the deluge of information in the scientific literature.

Credits: Max Haussler for making me aware of the lack of author manuscripts in PMC a few years back, and Heather Piwowar for recently jump-starting the key conversation on how to push for improved text/data mining rights in future funder mandates.

Why Doesn’t the Ecological Society of America Allow Their Open Access Content to be Text Mined?

A recent tweet from Todd Vision and a blog post by Jonathan Eisen have alerted me to the shameful defense of the status quo in scientific publishing advanced by the Ecological Society of America concerning the Office of Science and Technology Policy’s recent request for information on Open Access. This particular thread caught my eye because I still have fresh bruises from being denied access to Open Access ESA journal content for text-mining research. Denied access to Open content – how is this possible, you say?

Over the last two years I have been targeting scientific societies whose journals are not in the tiny fraction of the scientific literature in the PubMed Central Open Access subset, hoping to encourage them to release their content for text-mining research projects in my group (e.g. http://www.text2genome.org). My attitude has been that societies are the ones to go after, since they often hold the copyrights and are typically run by colleagues whom I can directly appeal to. After productive (yet rather protracted) communication with The Genetics Society of America, the UK Genetics Society and the Society for Molecular Biology and Evolution, we’ve been able to obtain back-content for Genetics, Heredity and Molecular Biology and Evolution for our projects.* Heredity has gone so far as to announce on their home page that their content is now open for text-mining research (victory!).

In stark contrast, a similar line of inquiry with the Ecological Society of America has led to a very sour and unproductive experience, which I will summarize here to demonstrate that the ESA’s recent response letter to the OSTP is consistent with a general attitude of protecting their journal content. This narrative echoes Peter Murray-Rust’s painful story of his years negotiating with Elsevier for access to content, which likewise has no positive conclusion.

While Katherine McCarter is correct in saying that the ESA publishes a subset of their content under OA licenses, it is not true that this content is in any meaningful way “open” in a 21st-century, linked-data, remix-and-reuse context. Why? Because like virtually all of the ecological, agricultural and environmental literature, ESA OA content is not deposited in a public archive like PubMed Central, and can only be accessed via the ESA journal website. However, this content is not accessible to text-mining since the ESA journal permissions clearly state:

Altering, recompiling, systematic or programmatic copying, or reselling of text or other information from ESA Journals in any form or medium is prohibited. Systematic or programmatic downloading, service bureau redistribution services, printing for fee-for-service purposes and/or the systematic making of print or electronic copies for transmission to non-subscribing institutions are prohibited.

Since I have been burned in the past by aggressive closed-access publishers shutting down my office IP for naively downloading content that my university has a site license for, I dutifully went down the proper channel of requesting permissions to automatically download ESA content from the Permissions Editor, Dr. Cliff Duke. For the record, I can say that Dr. Duke has been faultlessly professional throughout the process and was positive about my initial request, an excerpt of which follows:

From: Cliff Duke <CSDuke@xxx.xxx>
Date: 28 June 2011 14:19:18 GMT+01:00
To: Casey Bergman <casey.bergman@xxx.xxx>
Subject: RE: request for permission to use ESA content in text-mining research

Casey,

In answer to your question — not that I recall, but I also don’t recall any previous similar requests, and I’ve been the permissions editor for about seven years. However, I doubt your request will be the last such, given the increasing interest in this kind of research.

Cliff

-----Original Message-----
From: Casey Bergman [mailto:casey.bergman@xxx.xxx]
Sent: Tuesday, June 28, 2011 9:14 AM
To: Cliff Duke
Subject: Re: request for permission to use ESA content in text-mining research

Dear Dr. Duke -

Many thanks for the very quick reply.  I appreciate your efforts in bringing this to the attention of the ESA leadership and I fully understand that this may take some time to sort out (it took several months with GSA).

One quick question at this stage: is there any precedent at ESA for permitting bulk access for text mining research?

Best regards,
Casey

-----Original Message-----
On 28 Jun 2011, at 13:48, Cliff Duke wrote:

Dr. Bergman,

I will discuss your request with our executive director and editors and get back to you as soon as I can. Our director is on travel this week, and I am on vacation next week, so it may be a couple of weeks before you hear back from us. Let me know if you have any questions meanwhile.

Regards,
Cliff Duke

Clifford S. Duke, Ph.D.
Permissions Editor

Ecological Society of America
1990 M Street NW, Suite 700
Washington, DC 20036
Phone: (202) 833-8773
Fax: (202) 833-8775
E-mail: csduke [at] esa.org

-----Original Message-----
From: Casey Bergman [mailto:casey.bergman@xxx.xxx]
Sent: Tuesday, June 28, 2011 8:34 AM
To: Cliff Duke
Subject: request for permission to use ESA content in text-mining research

Dear Dr. Duke -

Greetings, I am a researcher at the University of Manchester, with an interest in application of text and data mining to biological problems at the interface of computational and evolutionary biology. I am writing to request permission to use ESA journal content in text-mining research project that I am developing to submit as proposal to the UK Natural Environment Research Council. Specifically, I would like to request permission to automate download of the entire/the open-access subset of all ESA titles, which I understand is not permitted under the standard ESA policy (http://www.esapubs.org/esapubs/permissions.htm).

[SNIP]

Dr Duke then put me in touch with the Managing Editor of the ESA journals, David Baldwin, who promptly ignored my request for several months, despite repeated emails and phone calls (e.g. on 6 September 2011, 27 September 2011 and 13 October 2011). I finally received one email from David Baldwin on 20 October 2011, in which he promised (but failed) to get back to me a few days later:

From: Casey Bergman <casey.bergman@xxx.xxx>
Date: 20 October 2011 12:42:01 GMT+01:00
To: J David Baldwin <jdb27@xxx.xxx>
Cc: Cliff Duke <csduke@xxx.xxx>
Bcc: Casey Bergman <casey.bergman@xxx.xx>
Subject: Re: request for permission to use ESA content in text-mining research

Dear David -

Many thanks for replying. I fully understand that nonstandard requests can take some time. My previous interactions with Genetics and Heredity have also taken many months to lead to positive decisions on releasing content for text mining research.

All that is required in the short term is explicit permission to execute automated downloads on the http://www.esajournals.org site that abide by the limits of your systems (<50 sessions in 10 minutes).  No other technical issues need to be addressed on your side.

Also, I am happy in the first instance to restrict automated downloads to the Open Access subset of ESA publications, if a decision to permit access to the entirety of ESA content is more difficult.

Best regards,
Casey

On 20 Oct 2011, at 12:03, J David Baldwin wrote:

Dear Dr. Bergman–

I know you are keen to discuss this, but I’m afraid you have picked a particularly bad week for contacting me. Today (Thursday, 20 October) won’t be any better than the past two days, and tomorrow (Friday) I’ll be out of town. I’ll look over your request before Monday, 24 October, and will e-mail you again by then. I’m afraid I get handed off the nonstandard requests (“the buck stops here”), and yet I have my own priorities (i.e., working to keep the journal issues on schedule).

Cheers,
David

-----Original Message-----
From: Casey Bergman [mailto:casey.bergman@xxx.xxx]
Sent: Thursday, October 13, 2011 9:01 AM
To: J David Baldwin
Cc: Cliff Duke
Subject: Re: request for permission to use ESA content in text-mining research

Dear David -

Would there be a convenient time for you sometime in the next week or so to discuss how we might be able to access ESA content programmatically?  I am +5 hrs to you, so late afternoon for me = morning for you is typically the best timeline to arrange a call.

Best regards,
Casey

Well, I can report that as of 6 January 2012, the buck certainly has stopped with David Baldwin on this issue, since he still refuses to respond to the most minimal request for permission to automatically download only the Open Access subset of the ESA content. Since no effort is required on his part other than to say “yes”, I take his inaction to speak for the ESA: they have no interest in supporting text and data mining research on their content (even their OA content), which is fully consistent with the specious arguments for protecting society subsidies through closed-access publishing put forward in their response letter to the OSTP. Given the undeniable importance of data in the ecological literature for science and society, the ESA should be ashamed for locking away this precious resource from the world and for being adamant that this is in any way morally or ethically justified.

I look forward to David Baldwin’s response on this request, and hope that the ESA is more progressive in their outlook toward open access publishing and text/data mining in the coming years….

* Credits to Tracey DePellegrin Connelly, Scott Hawley and Lauren McIntyre for helping to free Genetics content; Roger Butlin for helping free Heredity content; and Ken Wolfe, Soojin Yi and the SMBE council for helping to free MBE content.

Is Science really “Making Data Maximally Available”?

Earlier this year Hanson, Sugden and Alberts [1] argued in a piece entitled “Making Data Maximally Available” that journals like Science play a crucial role in making scientific data “publicly and permanently available” and that efforts to improve the standard of supporting online materials will increase their utility and the impact of their associated publications. While I whole-heartedly agreed with their view that improving supplemental materials is a better solution to the current disorganized [2] and impermanent [3] state of affairs (as opposed to the unwise alternative of discarding them altogether [4]), there were a few things about this piece that really irked me. I had intended to write a letter to the editor on this with a colleague, but it unfortunately didn’t materialize, so I thought I’d post my concerns here.

First, the authors make an artificial distinction between the supporting online materials associated with a paper and the contents of the paper itself. Clearly the most important data in a scientific report is in the full text of the article, and thus if making data in supporting online materials “maximally available” is a goal, surely so must be making data in the full-text article itself. Second, in the context of the wider discussion on “big data” in which these points are made, it must be noted that maximal availability is only one step towards maximal utility, the other being maximal access. As the entire content of Science magazine is not available for unrestricted download and re-use from PubMed Central’s Open Access repository, maximal utility of data in the full text or supplemental materials of articles published in Science is currently fettered because it is not available for bulk text mining or data mining. Amazingly, this is true even for Author-deposited manuscripts in PubMed Central, which are not currently included in the PubMed Central Open Access subset and therefore not available for bulk download and re-use.

Therefore it seems imperative that, in addition to making a clarion call for the improved availability of data, code and references in supplemental materials, the Editors of Science should issue a clear policy statement about the use of full-text articles and supplemental online materials that are published in Science for text and data mining research. At a minimum, Science should join with other high profile journals such as Nature [5] in clarifying the use for text and data mining of Author-deposited manuscripts in PubMed Central, which are required to be deposited under funding body mandates for these very purposes. Additionally, Science should make a clear statement about the copyright and re-use policies for supporting online materials of all published articles, which are freely available for download without a Science subscription, and currently fall in the grey area between restricted and open access.

As we move firmly into the era of big data, where issues of access and re-use of data are becoming increasingly acute, Science, as the representative publication of the world’s largest general scientific society, should take the lead in opening its content for text and data mining, to the mutual benefit of authors, researchers and the AAAS.

References:

1. Hanson et al. (2011) Making Data Maximally Available. Science 331:649
2. Santos et al. (2005) Supplementary data need to be kept in public repositories. Nature 438:738
3. Anderson et al. (2006) On the persistence of supplementary resources in biomedical publications. BMC Bioinformatics 7:260
4. Journal of Neuroscience policy on Supplemental Material
5. Nature Press release on data- and text-mining of self-archived manuscripts
