Comments on the RCUK’s New Draft Policy on Open Access

RCUK, the umbrella agency that represents several major publicly-funded Research Councils in the UK, has recently released a draft document outlining revisions to its policy on Open Access publishing for RCUK-funded research. Peter Suber, one of the leaders of the Open Access movement, has strongly endorsed this policy and ably summarized the salient features of the document in a G+ post, with which I concur. Based on his encouragement to submit comments to RCUK directly, I’ve emailed the following two points for RCUK to consider in their revision of this policy:

From: Casey Bergman <Casey.Bergman@xxx.xx.xx>
Date: 18 March 2012 15:22:29 GMT
To: <communications@rcuk.ac.uk>
Subject: Open Access Feedback

Hello –

I write to support the draft RCUK policy on Open Access, but would like to raise two points that I see as crucial to effectively achieving the aims of libre Open Access:

1) The green OA route does not always ensure libre OA, and green OA documents often remain unavailable for text and data mining. For example, author-deposited manuscripts in (UK)PMC are not available for text mining, since they are not in the “OA subset” (see https://caseybergman.wordpress.com/2012/02/11/why-the-research-works-act-doesnt-affect-text-mining-research/). Thus, for RCUK to mandate libre OA via the green route, RCUK would need to work with repositories like (UK)PMC to ensure that green author-deposited manuscripts go into the OA subset that can be automatically downloaded for re-use.

2) Further information should be provided about the following comment: “In addition, the Research Councils are happy to work with individual institutions on how they might build an institutional Open Access fund drawing from the indirect costs on grants.” RCUK should take the lead on establishing viable financial models for recovering OA costs that can easily be adopted by universities. Promoting the development of university OA funds that can effectively recover costs from RCUK grants to support gold OA papers published after the lifetime of a grant would be a major boost for publishing RCUK-funded work under a libre OA model.

Yours sincerely,

Casey Bergman, Ph.D.
Faculty of Life Sciences
University of Manchester
Michael Smith Building
Oxford Road, M13 9PT
Manchester, UK


The Roberts/Ashburner Response

A previous post on this blog shared a helpful boilerplate response to editors for politely declining to review for non-Open Access journals, which I originally received from Michael Ashburner. During a quick phone chat today, Ashburner told me that he in fact inherited a version of this response from Nobel laureate Richard Roberts, co-discoverer of introns, lead author on the Open Letter to Science calling for a “Genbank” of the scientific literature, and long-time editor of Nucleic Acids Research, one of the first classical journals to move to a fully Open Access model. So, to give credit where it is due, I’ve updated the title of the “Just Say No” post to make the attribution of this letter more clear. We owe both Roberts and Ashburner many thanks for paving the way to a better model of scientific communication and for leading by example.

Why Are There So Few Efforts to Text Mine the Open Access Subset of PubMed Central?

The open access movement in scientific publishing has two broad aims: (i) to make scientific articles more broadly accessible and (ii) to permit unrestricted re-use of published scientific content. From its humble beginnings in 2000 with only two journals, PubMed Central (PMC) has grown to become the world’s largest repository of full-text open-access biomedical articles, containing nearly 2.4 million articles that can be freely downloaded by anyone around the world. Thus, while holding only ~11% of the total published biomedical literature, PMC can clearly be viewed as a major success in terms of making the biomedical literature more broadly accessible.

However, I argue that PMC has yet to catalyze similar success on the second goal of the open-access movement — unrestricted re-use of published scientific content. This point became clear to me when writing the discussions for two papers that my lab published last year. In digging around for references to cite, I was struck by how difficult it was to find examples of projects that had applied text-mining tools to the entire set of open-access articles in PubMed Central. Unsure whether this reflected my own ignorance or the actual state of the art in the field, I canvassed the biological text-mining community, the bioinformatics community and two major open-access publishers for additional examples of text-mining on the entire open-access subset of PMC.

Surprisingly, I found that after a decade of existence only ~15 articles* have ever been published that used the entire open-access subset of PMC for text-mining research. In other words, fewer than two research articles per year actually use the open-access contents of PubMed Central for large-scale data mining or service provision. I find this lack of uptake of PMC by text-mining researchers rather astonishing, considering that it is an incredibly rich archive of the combined output of thousands of scientists worldwide.

This observation raises the question: why are there so few efforts to text mine content in PubMed Central? I don’t pretend to have the answer, but there are a number of plausible reasons, which include (but are not limited to):

  • The Open Access subset of PMC is such a small component of the entire published literature that it is unusable.
  • Full-text mining research is such a difficult task that it cannot be usefully done.
  • The text-mining community is more focused on developing methods than applying them to the biomedical literature.
  • There is not an established community of users for full-text mining research.
  • [insert your interpretation in the comments below]

Personally, I see none of these as valid explanations of why applying text-mining tools to the entirety of the PMC open access subset remains so rare. While it is true that the PMC open-access subset holds <2% of all articles in PubMed, which is a limitation, the facts contained within the introductions and discussions of this subset cover a substantially broader proportion of scientific knowledge. Big data? Not compared to other areas of bioscience like genomics. Mining even a few mammalian genomes’ worth of DNA sequence data is more technically and scientifically challenging than mining the English text of ~400,000 full-text articles. Text-miners also routinely apply their systems to MEDLINE abstracts, albeit often on a small scale, and there is a growing community of biocurators and bioinformaticians eager to consume data from full-text mining. So what is going on here?
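To give a sense of how tractable bulk access is, here is a minimal sketch of enumerating the open-access subset for download, assuming the comma-separated file list that PMC publishes on its FTP site (the exact path and column layout are assumptions that should be checked against current PMC documentation):

```python
# A minimal sketch, assuming PMC's OA file list; path/columns unverified.
import csv
import io
import urllib.request

FILE_LIST_URL = "https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_file_list.csv"  # assumed path

def iter_oa_records(url=FILE_LIST_URL, limit=5):
    """Yield (archive_path, citation, accession) rows from the OA file list."""
    with urllib.request.urlopen(url) as response:
        text = io.TextIOWrapper(response, encoding="utf-8")
        reader = csv.reader(text)
        next(reader)  # skip the header row
        for i, row in enumerate(reader):
            if i >= limit:
                break
        # columns assumed: archive path, article citation, PMC accession
            yield row[0], row[1], row[2]

for path, citation, pmcid in iter_oa_records():
    # each archive holds the full text (XML and PDF) for one OA article
    print(pmcid, "https://ftp.ncbi.nlm.nih.gov/pub/pmc/" + path)
```

Anyone with a few lines of code like this and some disk space can, in principle, mirror the whole open-access subset on their own machine.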

Perhaps it is worth drawing an analogy with another major resource that was released at roughly the same time as PMC — the human genome sequence. According to many, including those in the popular media, the promise of the human genome was oversold, perhaps to leverage financial support for this major project. Unfortunately, as Greg Petsko and Jonathan Eisen have argued, overselling the human genome project has had unintended negative consequences for the understanding, and perhaps funding, of basic research. Could the goal of re-use of open access articles likewise represent an overselling of the PMC repository? If so, then the open-access movement runs the risk of failing to deliver on one of the key planks in its platform. Failing to deliver on re-use could ultimately lead funders (if no-one is using it, why should we pay?) and publishers (if no-one is using it, why should we make it open?) to advocate green over gold open access, which could have a devastating impact on text-mining research, since author-deposited (green) manuscripts in PMC are off-limits for text-mining research.

I hope (and am actively working to prove) that re-use of the open access literature will not remain an unfulfilled promise. Rather, I suspect we are simply in the lag phase before a period of explosive growth in full-text mining, akin to what happened in the field of genome-wide association studies after the publication of the human genome sequence. So text-miners, bioinformaticians and computational biologists: do your part to maximize the utility of Varmus, Lipman and Brown’s vision of an Arxiv for biology, and prove that the twin aims of the open access movement can be fulfilled.

* Published text mining studies using the entirety of the Open Access subset of PMC:

UPDATE – New papers using PMC since original post

Nominations for the Benjamin Franklin Award for Open Access in the Life Sciences

Earlier this week I received an email with the annual call for nominations for the Benjamin Franklin Award for Open Access in the Life Sciences. While I am in general not that fussed about the importance of academic accolades, I think this is a great award since it recognizes contributions in a sub-discipline of biology — computational biology, or bioinformatics — that are specifically made in the spirit of open innovation. By placing the emphasis on recognizing openness as an achievement, the Franklin Award goes beyond other related honors (such as those awarded by the International Society for Computational Biology) and, in my view, captures the essence of the true spirit of what scientists should be striving for in their work.

In looking over the past recipients, few would argue that the award has not gone to major contributors to the open source/open access movements in biology. In thinking about who might be appropriate to add to this list, two people sprang to mind with whom I’ve had the good fortune to work in the past, both of whom have made a major impression on my (and many others’) thinking and working practices in computational biology. So without further ado, here are my nominations for the 2012 Benjamin Franklin Award for Open Access in the Life Sciences (in chronological order of my interaction with them)…

Suzanna Lewis

Suzanna Lewis (Lawrence Berkeley National Laboratory) is one of the pioneers of developing open standards and software for genome annotation and ontologies. She led the team responsible for the systematic annotation of the Drosophila melanogaster genome, which included development of the Gadfly annotation pipeline and database framework, and the annotation curation/visualization tool Apollo. Lewis’ work in genome annotation also includes playing instrumental roles in the GASP community assessment exercises to evaluate the state of the art in genome annotation, development of the GBrowse genome browser, and the data coordination center for the modENCODE project. In addition to her work in genome annotation, Lewis has been a leader in the development of open biological ontologies (OBO, NCBO), contributing to the Gene Ontology, Sequence Ontology, and Uberon anatomy ontologies, and developing open software for editing and navigating ontologies (AmiGO, OBO-Edit, and Phenote).

Carole Goble

Carole Goble (University of Manchester) is widely recognized as a visionary in the development of software to support automated workflows in biology. She has been a leader of the myGrid and Open Middleware Infrastructure Institute consortia, which have generated a large number of highly innovative open resources for e-research in the life sciences, including the Taverna Workbench for developing and deploying workflows, the BioCatalogue registry of bioinformatics web services, and the social-networking-inspired myExperiment workflow repository. Goble has also played an instrumental role in the development of semantic-web tools for constructing and analyzing life science ontologies, the development of ontologies for describing bioinformatics resources, and ontology-based tools such as RightField for managing life science data.

I hope others join me in acknowledging the outputs of these two open innovators as being more than worthy of the Franklin Award, support their nomination, and cast votes in their favor this year and/or in years to come!

Why the Research Works Act Doesn’t Affect Text-mining Research

As the central digital repository for life science publications, PubMed Central (PMC) is one of the most significant resources for making the Open Access movement a tangible reality for researchers and citizens around the world. Articles in PMC are deposited through two routes: either automatically by journals that participate in the PMC system, or directly by authors for journals that do not. Author deposition of peer-reviewed manuscripts in PMC is mandated by funders in order to make the results of publicly- or charity-funded research maximally accessible, and has led to over 180,000 articles being made free (gratis) to download that would otherwise be locked behind closed-access paywalls. Justifiably, there has been outrage over recent legislation (the Research Works Act) that would repeal the NIH mandate in the USA and thereby prevent important research from being freely available.

However, from a text-miner’s perspective, author-deposited manuscripts in PMC are closed access: while they can be downloaded and read individually, virtually none (<200) are available in PMC’s Open Access subset, which includes all articles that are free (libre) to download in bulk and text/data mine. This includes ~99% of the author-deposited manuscripts from the journal Nature, despite a clear statement from 2009 entitled “Nature Publishing Group allows data- and text-mining on self-archived manuscripts”. In short, funder mandates only make manuscripts public, not open, and thus whether the RWA is passed or not is actually moot from a text/data-mining perspective.
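For readers who want to verify this for themselves, here is a hedged sketch that asks the PMC OA web service whether a given article is in the Open Access subset; the interpretation of the response (an error element signalling a non-OA article) is my assumption from observed responses rather than documented behavior:

```python
# A sketch using the PMC OA service; response handling is an assumption.
import urllib.request
import xml.etree.ElementTree as ET

OA_SERVICE = "https://www.ncbi.nlm.nih.gov/pmc/utils/oa/oa.fcgi"

def in_oa_subset(pmcid):
    """Return True if pmcid (e.g. 'PMC2719666') is in the OA subset."""
    with urllib.request.urlopen(f"{OA_SERVICE}?id={pmcid}") as response:
        root = ET.fromstring(response.read())
    return root.find("error") is None  # an <error> child means not libre OA

print(in_oa_subset("PMC2719666"))  # try the ID of an author-deposited manuscript
```

Run against the PMCIDs of author-deposited manuscripts, a check like this comes back False almost every time, which is precisely the problem.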

Why is this important? The simple reason is that there are currently only ~400,000 articles in the PMC Open Access subset, and therefore author-deposited manuscripts are only two-fold less abundant than all articles currently available for text/data-mining. Thus what could be a potentially rich source of data for large-scale information extraction remains locked away from programmatic analysis. This is especially tragic considering that, at the point of manuscript acceptance, publishers have invested little to nothing in the publication process and their claim to copyright is most tenuous.

So instead of discussing whether we should support the status quo of weak funder mandates by working to block the RWA or expand NIH-like mandates (e.g. as under the Federal Research Public Access Act, FRPAA), the real discussion that needs to be had is how to make funder mandates stronger, insisting (at a minimum) that author-deposited manuscripts be available for text/data-mining research. More of the same, no matter how much, only takes us half the distance towards the ultimate goals of the Open Access movement, and doesn’t permit the crucial text/data-mining research that is needed to make sense of the deluge of information in the scientific literature.

Credits: Max Haussler for making me aware of the lack of author manuscripts in PMC a few years back, and Heather Piwowar for recently jump-starting the key conversation on how to push for improved text/data mining rights in future funder mandates.


An Open Archive of My F1000 Reviews

Following on from a recent conversation with David Stephens on Twitter about my decision to resign from Faculty of 1000, F1000 has clarified their terms for the submission of evaluations and confirmed that it is permissible to “reproduce personal evaluations on institutional & personal blogs if you clearly reference F1000”.

As such, I am delighted to be able to repost here an Open Archive of my F1000 contributions. Additionally, this post acts in a second capacity as my first contribution to the Research Blogging Network. Hopefully these commentaries will be of interest to some, and they should add support to the Altmetrics profiles for these papers through systems like Total Impact.


Nelson CE, Hersh BM, & Carroll SB (2004). The regulatory content of intergenic DNA shapes genome architecture. Genome biology, 5 (4) PMID: 15059258

My review: This article reports that genes with complex expression have longer intergenic regions in both D. melanogaster and C. elegans, and introduces several innovative and complementary approaches to quantify the complexity of gene expression in these organisms. Additionally, the structure of intergenic DNA in genes with high complexity (e.g. receptors, specific transcription factors) is shown to be longer and more evenly distributed over 5′ and 3′ regions in D. melanogaster than in C. elegans, whereas genes with low complexity (e.g. metabolic genes, general transcription factors) are shown to have similar intergenic lengths in both species and exhibit no strong differences in length between 5′ and 3′ regions. This work suggests that the organization of noncoding DNA may reflect constraints on transcriptional regulation and that gene structure may yield insight into the functional complexity of uncharacterized genes in compact animal genomes. (@F1000: http://f1000.com/1032936)


Li R, Ye J, Li S, Wang J, Han Y, Ye C, Wang J, Yang H, Yu J, Wong GK, & Wang J (2005). ReAS: Recovery of ancestral sequences for transposable elements from the unassembled reads of a whole genome shotgun. PLoS computational biology, 1 (4) PMID: 16184192

My review: This paper presents a novel method for automating the laborious task of constructing libraries of transposable element (TE) consensus sequences. Since repetitive TE sequences confound whole-genome shotgun (WGS) assembly algorithms, sequence reads from TEs are initially screened from WGS assemblies based on overrepresented k-mer frequencies. Here, the authors invert the same principle, directly identifying TE consensus sequences from those same reads containing high frequency k-mers. The method was shown to identify all high copy number TEs and increase the effectiveness of repeat masking in the rice genome. By circumventing the inherent difficulties of TE consensus reconstruction from erroneously assembled genome sequences, and by providing a method to identify TEs prior to WGS assembly, this method provides a new strategy to increase the accuracy of WGS assemblies as well as our understanding of the TEs in genome sequences. (@F1000: http://f1000.com/1031746)
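As a toy illustration of the k-mer principle described above (and emphatically not the ReAS implementation itself), one might flag likely repeat-derived reads as follows; the k-mer size and cutoffs are arbitrary assumptions:

```python
# A toy sketch of the high-frequency k-mer idea; parameters are arbitrary.
from collections import Counter

def kmers(seq, k=16):
    return (seq[i:i + k] for i in range(len(seq) - k + 1))

def flag_repetitive_reads(reads, k=16, kmer_cutoff=10, read_fraction=0.5):
    """Return reads in which >= read_fraction of k-mers are high-frequency."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    flagged = []
    for read in reads:
        ks = list(kmers(read, k))
        high = sum(1 for km in ks if counts[km] >= kmer_cutoff)
        if ks and high / len(ks) >= read_fraction:
            flagged.append(read)
    return flagged
```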


Rifkin SA, Houle D, Kim J, & White KP (2005). A mutation accumulation assay reveals a broad capacity for rapid evolution of gene expression. Nature, 438 (7065), 220-3 PMID: 16281035

My review: This paper reports empirical estimates of the mutational input to gene expression variation in Drosophila, knowledge of which is critical for understanding the mechanisms governing regulatory evolution. These direct estimates of mutational variance are compared to gene expression differences across species, revealing that the majority of genes have lower expression divergence than is expected if evolving solely by mutation and genetic drift. Mutational variances on a gene-by-gene basis range over several orders of magnitude and are shown to vary with gene function and developmental context. Similar results in C. elegans [1] provide strong support for stabilizing selection as the dominant mode of gene expression evolution. (@F1000: http://f1000.com/1040157)

References: 1. Denver DR, Morris K, Streelman JT, Kim SK, Lynch M, & Thomas WK (2005). The transcriptional consequences of mutation and natural selection in Caenorhabditis elegans. Nature genetics, 37 (5), 544-8 PMID: 15852004


Caspi A, & Pachter L (2006). Identification of transposable elements using multiple alignments of related genomes. Genome research, 16 (2), 260-70 PMID: 16354754

My review: This paper reports an innovative strategy for the de novo detection of transposable elements (TEs) in genome sequences based on comparative genomic data. By capitalizing on the fact that bursts of TE transposition create large insertions in multiple genomic locations, the authors show that detection of repeat insertion regions (RIRs) in alignments of multiple Drosophila genomes has high sensitivity to identify both individual instances and families of known TEs. This approach opens a new direction in the field of repeat detection and provides added value to TE annotations by placing insertion events in a phylogenetic context. (@F1000: http://f1000.com/1049265)


Simons C, Pheasant M, Makunin IV, & Mattick JS (2006). Transposon-free regions in mammalian genomes. Genome research, 16 (2), 164-72 PMID: 16365385

My review: This paper presents an intriguing analysis of transposon-free regions (TFRs) in the human and mouse genomes, under the hypothesis that TFRs indicate genomic regions where transposon insertion is deleterious and removed by purifying selection. The authors test and reject a model of random transposon distribution and investigate the properties of TFRs, which appear to be conserved in location across species and enriched for genes (especially transcription factors and micro-RNAs). An alternative mutational hypothesis not considered by the authors is the possibility for clustered transposon integration (i.e. preferential insertion into regions of the genome already containing transposons), which may provide a non-selective explanation for the apparent excess of TFRs in the human and mouse genomes. (@F1000: http://f1000.com/1010399)


Wheelan SJ, Scheifele LZ, Martínez-Murillo F, Irizarry RA, & Boeke JD (2006). Transposon insertion site profiling chip (TIP-chip). Proceedings of the National Academy of Sciences of the United States of America, 103 (47), 17632-7 PMID: 17101968

My review: This paper demonstrates the utility of whole-genome microarrays for the high-throughput mapping of eukaryotic transposable element (TE) insertions on a genome-wide basis. With an experimental design guided by first computationally digesting the genome into suitable fragments, followed by linker-PCR to amplify TE flanking regions and subsequent hybridization to tiling arrays, this method was shown to recover all detectable TE insertions with essentially no false positives in yeast. Although limited to species with available genome sequences, this approach circumvents inefficiencies and biases associated with the alternative of whole-genome shotgun resequencing to detect polymorphic TEs on a genome-wide scale. Application of this or related technologies (e.g. [1]) to more complex genomes should fill gaps in our understanding of the contribution of TE insertions to natural genetic variation. (@F1000: http://f1000.com/1088573)

References: 1. Gabriel A, Dapprich J, Kunkel M, Gresham D, Pratt SC, & Dunham MJ (2006). Global mapping of transposon location. PLoS genetics, 2 (12) PMID: 17173485


Haag-Liautard C, Dorris M, Maside X, Macaskill S, Halligan DL, Houle D, Charlesworth B, & Keightley PD (2007). Direct estimation of per nucleotide and genomic deleterious mutation rates in Drosophila. Nature, 445 (7123), 82-5 PMID: 17203060

My review: This paper presents the first direct estimates of nucleotide mutation rates across the Drosophila genome derived from mutation accumulation experiments. By using DHPLC to scan over 20 megabases of genomic DNA, the authors obtain several fundamental results concerning mutation at the molecular level in Drosophila: SNPs are more frequent than indels; deletions are more frequent than insertions; mutation rates are similar across coding, intronic and intergenic regions; and mutation rates may vary across genetic backgrounds. Results in D. melanogaster contrast with those obtained from mutation accumulation experiments in C. elegans (see [1], where indels are more frequent than SNPs, and insertions are more frequent than deletions), indicating that basic mutation processes may vary across metazoan taxa. (@F1000: http://f1000.com/1070688)

References: 1. Denver DR, Morris K, Lynch M, & Thomas WK (2004). High mutation rate and predominance of insertions in the Caenorhabditis elegans nuclear genome. Nature, 430 (7000), 679-82 PMID: 15295601


Katzourakis A, Pereira V, & Tristem M (2007). Effects of recombination rate on human endogenous retrovirus fixation and persistence. Journal of virology, 81 (19), 10712-7 PMID: 17634225

My review: This study shows that the persistence, but not the integration, of long-terminal repeat (LTR) containing human endogenous retroviruses (HERVs) is associated with local recombination rate, and suggests a link between intra-strand homologous recombination and meiotic exchange. This inference about the mechanisms controlling the transposable element (TE) abundance is obtained by demonstrating that total HERV density (full-length elements plus solo LTRs) is not correlated with recombination rate, whereas the ratio of full-length HERVs relative to solo LTRs is. This work relies critically on advanced computational methods to join TE fragments, demonstrating the need for such algorithms to make accurate inferences about the evolution of mobile DNA and to reveal new insights into genome biology. (@F1000: http://f1000.com/1091037)


Giordano J, Ge Y, Gelfand Y, Abrusán G, Benson G, & Warburton PE (2007). Evolutionary history of mammalian transposons determined by genome-wide defragmentation. PLoS computational biology, 3 (7) PMID: 17630829

My review: This article reports the first comprehensive stratigraphic record of transposable element (TE) activity in mammalian genomes based on several innovative computational methods that use information encoded in patterns of TE nesting. The authors first develop an efficient algorithm for detecting nests of TEs by intelligently joining TE fragments identified by RepeatMasker, which (in addition to providing an improved genome annotation) outputs a global “interruption matrix” that can be used by a second novel algorithm which generates a chronological ordering of TE activity by minimizing the nesting of young TEs into old TEs. Interruption matrix analysis yields results that support previous phylogenetic analyses of TE activity in humans but are not dependent on the assumption of a molecular clock. Comparison of the chronological orders of TE activity in six mammalian genomes provides unique insights into the ancestral and lineage-specific record of global TE activity in mammals. (@F1000: http://f1000.com/1089045)


Schuemie MJ, & Kors JA (2008). Jane: suggesting journals, finding experts. Bioinformatics (Oxford, England), 24 (5), 727-8 PMID: 18227119

My review: This paper introduces a fast method for finding related articles and relevant journals/experts based on user input text and should help improve the referencing, review and publication of biomedical manuscripts. The JANE (Journal/Author Name Estimator) method uses a standard word frequency approach to find similar documents, then adds the scores in the top 50 records to produce a ranked list of journals or authors. Using either the abstract or full-text, JANE suggested quite sensible journals and authors in seconds for a manuscript we have in press, while the related eTBLAST method [1] failed to complete while I wrote this review. JANE should prove to be a very useful text mining tool for authors and editors alike. (@F1000: http://f1000.com/1101037)

References: 1. Errami M, Wren JD, Hicks JM, & Garner HR (2007). eTBLAST: a web server to identify expert reviewers, appropriate journals and similar publications. Nucleic acids research, 35 (Web Server issue) PMID: 17452348
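As an illustration of the kind of word-frequency scheme the JANE review describes (a sketch, not JANE's actual algorithm), journal suggestion from a query text might look like this, assuming a corpus of abstracts with known journal assignments:

```python
# A sketch of tf-idf similarity plus top-50 score aggregation per journal.
from collections import defaultdict
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def suggest_journals(query, abstracts, journals, top_n=50):
    """abstracts: list of texts; journals: journal name for each abstract."""
    vec = TfidfVectorizer(stop_words="english")
    doc_matrix = vec.fit_transform(abstracts)
    sims = cosine_similarity(vec.transform([query]), doc_matrix)[0]
    scores = defaultdict(float)
    for i in sims.argsort()[::-1][:top_n]:
        scores[journals[i]] += sims[i]  # sum scores of the top records
    return sorted(scores.items(), key=lambda kv: -kv[1])
```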


Pask AJ, Behringer RR, & Renfree MB (2008). Resurrection of DNA function in vivo from an extinct genome. PloS one, 3 (5) PMID: 18493600

My review: This paper reports the first transgenic analysis of a cis-regulatory element cloned from an extinct species. Although no differences were seen in the expression pattern of the collagen (Col2A1) enhancer from the extinct Tasmanian tiger and extant mouse, this work is an important proof of principle for using ancient DNA in the evolutionary analysis of gene regulation. (@F1000: http://f1000.com/1108816)


Ginsberg J, Mohebbi MH, Patel RS, Brammer L, Smolinski MS, & Brilliant L (2009). Detecting influenza epidemics using search engine query data. Nature, 457 (7232), 1012-4 PMID: 19020500

My review: A landmark paper in health bioinformatics demonstrating that Google searches can predict influenza trends in the United States. Predicting infectious disease outbreaks currently relies on patient reports gathered through clinical settings and submitted to government agencies such as the CDC. The possible use of patient “self-reporting” through internet search queries offers unprecedented real-time access to temporal and regional trends in infectious diseases. Here, the authors use a linear modeling strategy to learn which Google search terms best correlate with regional trends in influenza-related illness. This model explains flu trends over a 5 year period with startling accuracy, and was able to predict flu trends during 2007-2008 with a 1-2 week lead time ahead of CDC reports. The phenomenal use of crowd-based predictive health informatics revolutionizes the role of the internet in biomedical research and will likely set an important precedent in many areas of natural sciences. (@F1000: http://f1000.com/1127181)
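To make the modelling idea concrete, here is a stripped-down sketch (my reading of the paper's approach, not Google's actual pipeline): fit a univariate linear model relating the log-odds of an influenza-related query fraction to the log-odds of physician-visit rates:

```python
# A sketch of a logit-linear flu model; the functional form is assumed.
import numpy as np

def logit(p):
    p = np.asarray(p, dtype=float)
    return np.log(p / (1 - p))

def fit_flu_model(query_fraction, ili_rate):
    """Fit logit(ili_rate) = beta0 + beta1 * logit(query_fraction)."""
    beta1, beta0 = np.polyfit(logit(query_fraction), logit(ili_rate), 1)
    return beta0, beta1

def predict_ili(beta0, beta1, query_fraction):
    z = beta0 + beta1 * logit(query_fraction)
    return 1 / (1 + np.exp(-z))  # inverse logit back to a proportion
```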


Taher L, & Ovcharenko I (2009). Variable locus length in the human genome leads to ascertainment bias in functional inference for non-coding elements. Bioinformatics (Oxford, England), 25 (5), 578-84 PMID: 19168912

My review: This paper raises the important observation that differences in the length of genes can bias their functional classification using the Gene Ontology, and provides a simple method to correct for this inherent feature of genome architecture. A basic observation of genome biology is that genes differ widely in their size and structure within and between species. Understanding the causes and consequences of this variation in gene structure is an open challenge in genome biology. Previously, Nelson and colleagues [1] have shown, in flies and worms, that the length of intergenic regions is correlated with the regulatory complexity of genes and that genes from different Gene Ontology (GO) categories have drastically different lengths. Here, Taher and Ovcharenko confirm this observation of functionally non-random gene length in the human genome, and discuss the implications of this feature of genome organization on analyses that employ the GO for functional inference. Specifically, these authors show that random selection of noncoding DNA sequences from the human genome leads to the false inference of over- and under-representation of specific GO categories that preferentially contain longer or shorter genes, respectively. This finding has important implications for the large number of studies that employ a combination of gene expression microarrays and GO enrichment analysis, since gene expression is largely controlled by noncoding DNA. The authors provide a simple method to correct for this bias in GO analyses, and show that previous reports of the enrichment of “ultraconserved” noncoding DNA sequences in vertebrate developmental genes [2] may be a statistical artifact. (@F1000: http://f1000.com/1157594)

References: 1. Nelson CE, Hersh BM, & Carroll SB (2004). The regulatory content of intergenic DNA shapes genome architecture. Genome biology, 5 (4) PMID: 15059258

2. Bejerano G, Pheasant M, Makunin I, Stephen S, Kent WJ, Mattick JS, & Haussler D (2004). Ultraconserved elements in the human genome. Science (New York, N.Y.), 304 (5675), 1321-5 PMID: 15131266
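A simple way to see the proposed correction in spirit (an illustrative sketch, not the authors' exact procedure) is to build a length-matched null: when testing a gene set for GO enrichment, draw control sets in which each gene is matched on length, so that categories of long genes are not spuriously flagged:

```python
# An illustrative length-matched resampling null; tolerances are arbitrary.
import bisect
import random

def length_matched_controls(target_genes, gene_lengths, n_sets=1000, tol=0.1):
    """gene_lengths: dict gene -> length; returns n_sets control gene lists."""
    genes = sorted(gene_lengths, key=gene_lengths.get)
    lengths = [gene_lengths[g] for g in genes]
    controls = []
    for _ in range(n_sets):
        control = []
        for g in target_genes:
            lo = bisect.bisect_left(lengths, gene_lengths[g] * (1 - tol))
            hi = bisect.bisect_right(lengths, gene_lengths[g] * (1 + tol))
            control.append(random.choice(genes[lo:hi] or genes))
        controls.append(control)
    return controls  # compare GO counts in target vs. these null sets
```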


Hindorff LA, Sethupathy P, Junkins HA, Ramos EM, Mehta JP, Collins FS, & Manolio TA (2009). Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proceedings of the National Academy of Sciences of the United States of America, 106 (23), 9362-7 PMID: 19474294

My review: This article introduces results from human genome-wide association studies (GWAS) into the realm of large-scale functional genomic data mining. These authors compile the first curated database of trait-associated single-nucleotide polymorphisms (SNPs) from GWAS studies (http://www.genome.gov/gwastudies/) that can be mined for general features of SNPs underlying phenotypes in humans. By analyzing 531 SNPs from 151 GWAS studies, the authors discover that trait-associated SNPs are predominantly in non-coding regions (43% intergenic, 45% intronic), but that non-synonymous and promoter trait-associated SNPs are enriched relative to expectations. The database is actively maintained and growing, and currently contains 3943 trait-associated SNPs from 796 publications. This important resource will facilitate data mining and integration with high-throughput functional genomics data (e.g. ChIP-seq), as well as meta-analyses, to address important questions in human genetics, such as the discovery of loci that affect multiple traits. While the interface to the GWAS catalog is rather limited, a related project (http://www.gwascentral.org/) [1] provides a much more powerful interface for searching and browsing data from the GWAS catalog. (@F1000: http://f1000.com/8408956)

References: 1. Thorisson GA, Lancaster O, Free RC, Hastings RK, Sarmah P, Dash D, Brahmachari SK, & Brookes AJ (2009). HGVbaseG2P: a central genetic association database. Nucleic acids research, 37 (Database issue) PMID: 18948288
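For anyone wanting to reproduce this kind of tabulation, a sketch over a local download of the GWAS catalog might look as follows; the file name and the "CONTEXT" column header are assumptions about the export format and should be adjusted to the actual headers:

```python
# A sketch tabulating SNP functional context; column names are assumed.
import csv
from collections import Counter

def snp_context_proportions(path="gwas_catalog.tsv"):
    counts = Counter()
    with open(path, newline="") as fh:
        for row in csv.DictReader(fh, delimiter="\t"):
            counts[row.get("CONTEXT", "unknown")] += 1
    total = sum(counts.values())
    return {ctx: n / total for ctx, n in counts.most_common()}
```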


Tamames J, & de Lorenzo V (2010). EnvMine: a text-mining system for the automatic extraction of contextual information. BMC bioinformatics, 11 PMID: 20515448

My review: This paper describes EnvMine, an innovative text-mining tool to obtain physico-chemical and geographical information about environmental genomics samples. This work represents a pioneering effort to apply text-mining technologies in the domain of ecology, providing novel methods to extract the units and variables of physico-chemical entities, as well as to link the location of samples to worldwide geographic coordinates via Google Maps. Application of EnvMine to full-text articles in the environmental genomics database envDB [1] revealed very high system performance, suggesting that information extracted by EnvMine will be of use to researchers seeking meta-data about environmental samples across different domains of biology. (@F1000: http://f1000.com/3502956)

References: 1. Tamames J, Abellán JJ, Pignatelli M, Camacho A, & Moya A (2010). Environmental distribution of prokaryotic taxa. BMC microbiology, 10 PMID: 20307274
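As a toy illustration of the extraction task EnvMine addresses (not its actual rules), one can go surprisingly far with regular expressions that pull (variable, value, unit) triples out of free text; the patterns below are illustrative assumptions only:

```python
# A toy regex extractor for physico-chemical mentions; patterns are illustrative.
import re

PATTERNS = [
    re.compile(r"(temperature|salinity|depth)\D{0,20}?(\d+(?:\.\d+)?)\s*"
               r"(°?C|psu|m)\b", re.I),
    re.compile(r"\b(pH)\s*(\d+(?:\.\d+)?)()", re.I),  # pH is unitless
]

def extract_measurements(text):
    hits = []
    for pat in PATTERNS:
        for var, value, unit in pat.findall(text):
            hits.append((var.lower(), float(value), unit or None))
    return hits

print(extract_measurements("Samples were taken at a depth of 30 m and pH 8.2."))
# [('depth', 30.0, 'm'), ('ph', 8.2, None)]
```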

Goodbye F1000, Hello Faculty of a Million

Dr. Seuss' The Sneetches

In the children’s story The Sneetches, Dr. Seuss presents a world where certain members of society are marked by an arbitrary badge of distinction, and a canny opportunist uses this false basis of prestige for his financial gain*. What does this morality tale have to do with the scientific article recommendation service Faculty of 1000? Read on…

Currently ~3000 papers are published each day in the biosciences**. Navigating this sea of information to find articles relevant to your work is no small matter. Researchers can either sink or swim with the aid of (i) machine-based technologies such as search or text-mining tools, or (ii) human-based technologies like blogs or social-networking services that highlight relevant work through expert recommendation.

One of the first expert recommendation services was Faculty of 1000, launched in 2002 with the aim of “identifying and evaluating the most significant articles from biomedical research publications” through a peer-nominated “Faculty” of experts in various subject domains. Since the launch of F1000, several other mechanisms for expert literature recommendation have also come to the foreground, including academic social bookmarking tools like citeulike or Mendeley, the rise of Research Blogging, and new F1000-like services such as annotatr, The Third Reviewer, PaperCritic and TiNYARM.

Shortly after I started my group at the University of Manchester in 2005, I was invited to join the F1000 Faculty, which I gratefully accepted. At the time, I felt that it was a mark of distinction to be invited into this select club, and that it would be a good platform to voice my opinions on work I thought was notable. I was under no illusion that my induction was based only on merit, since the invitation came from my former post-doc mentor Michael Ashburner. I overlooked this issue at the time: when you are invited to join the “in-club” as a junior faculty member, it is very tempting to think things like this will play a positive role in your career progression. [Whether being in F1000 has helped my career I can’t say, but it certainly can’t have hurt, and I (sheepishly) admit to using it on grant and promotion applications in the past.]

Since then, I’ve tried to contribute to F1000 when I can [PAYWALL], but since it is not a core part of my job, I’ve only contributed ~15 reviews in 5 years. My philosophy has been only to contribute reviews on articles I think are of particular note and might be missed otherwise, not to review major papers in Nature/Science that everyone is already aware of. As time has progressed and it has become harder to commit time to non-essential tasks, I’ve contributed less and less, and the F1000 staff has pestered me frequently with reminders and phone calls to submit reviews. At times the pestering has been so severe that I have considered resigning just to get them off my back. And I’ve noticed that some colleagues I have a lot of respect for have also resigned from F1000, which made me wonder if they were likewise fed up with F1000’s nagging.

This summer, while reading a post on Jonathan Eisen’s Tree of Life blog, I came across a parenthetical remark about quitting F1000 that made me more aware of why their nagging was really getting to me:

I even posted a “dissent” regarding one of [Paul Hebert’s] earlier papers on Faculty of 1000 (which I used to contribute to before they become non open access).

This comment made me realize that the F1000 recommendation service is just another closed-access venture for publishers to make money off a product generated for free by the goodwill and labor of academics. Like closed access journals, my University pays twice to get F1000 content — once for my labor and once for the subscription to the service. But unlike a normal closed-access journal, in the case of F1000 there is not even a primary scientific publication to justify the arrangement. So by contributing to F1000, essentially I take time away from my core research and teaching activities to allow a company to commercialize my IP and pay someone to nag me! What’s even more strange about this situation is that there is no rational open-access equivalent of literature review services like F1000. By analogy with the OA publishing of the primary literature, for “secondary” services I would pay a company to post one of my reviews on someone else’s article. (Does Research Blogging for free sound like a better option to anyone?)

Thus I’ve come to realize that it is unjustified to contribute secondary commentary to F1000 on Open Access grounds, in the same way it is unjustified to submit primary papers to closed-access journals. If I really support Open Access publishing, then to contribute to F1000 I must either be a hypocrite or make an artificial distinction between the primary and secondary literature. But this gets to the crux of the matter: to the extent that recommendation services like F1000 are crucial for researchers to make sense of the onslaught of published data, surely these critical reviews should be Open for all, just as the primary literature should be. On the other hand, if such services are not crucial, why am I giving away my IP for free for a company to capitalize on?

Well, this question has been on my mind for a while, and I have looked into whether there is evidence that F1000 evaluations have real scientific worth in terms of highlighting good publications, which might provide a reason to keep contributing to the system. On this point the evidence is scant and mixed. An analysis by the Wellcome Trust finds a very weak correlation between F1000 evaluations and the evaluations of an internal panel of experts (driven almost entirely by a few clearly outstanding papers), with the majority of highly cited papers being missed by F1000 reviewers. An analysis by the MRC shows a ~2-fold increase in the median number of citations (from 2 to 4) for F1000-reviewed articles relative to other MRC-funded research. Likewise, an analysis of the Ecology literature shows similar trends, with marginally higher citation rates for F1000-reviewed work, but with many high-impact papers being missed. [Added 28 April 2012: Moreover, a multifactorial analysis by Priem et al. on a range of altmetric measures of impact for 24,331 PLoS articles clearly shows that the “F1000 indicator did not have shared variability with any of the derived factors” and that “Mendeley bookmark counts correlate more closely to Web of Science citations counts than expert ratings of F1000”.] Therefore the available evidence indicates that F1000 reviews do not capture the majority of good work being published, and that the work that is reviewed is only of marginally higher importance (in terms of citation) than unreviewed work.

So if (i) it goes against my OA principles, (ii) there is no evidence (on average) that my opinion matters quantitatively much more than anyone else’s, and (iii) there are equivalent open access systems to use, why should I continue contributing to F1000? The only answer I can come up with is that by being an F1000 reviewer, I gain a certain prestige for being in the “in club,” as well as some prestige-by-association for aligning myself with publications or scientists I perceive to be important. When stripped down like this, being a member of F1000 seems pretty close to being a Sneetch with a star, and the F1000 business model seems not too different from that used by Sylvester McMonkey McBean. Realizing this has made me feel more than a bit ashamed for letting the allure of being in the old-boys club and my scientific ego trick me into something I cannot rationally justify.

So, needless to say, I have recently decided to resign from F1000. I will instead continue to contribute my tagged articles to citeulike (as I have for several years), contribute more substantial reviews to this blog via the Research Blogging portal, and push the use of other Open literature recommendation systems like PaperCritic, who have recently made their user-supplied content available under a Creative Commons license. (Thanks for listening, PaperCritic!)

By supporting these Open services rather than the closed F1000 system (and perhaps convincing others to do the same) I feel more at home among the ranks of the true crowd-sourced “Faculty of 1,000,000” that we need to help filter the onslaught of publications. And just as Sylvester McMonkey McBean’s Star-On machine provided a disruptive technology for overturning perceptions of prestige by giving everyone a star in The Sneetches, I’m hopeful that these open-access web 2.0 systems will also do some good towards democratizing personal recommendation of the scientific literature.

* Note: This post should in no way be taken as an ad hominem attack on F1000 or its founder Vitek Tracz, whom I respect very much as a pioneer of Open Access biomedical publishing.

** This number is an estimate based on the real figure of ~2.5K papers/day deposited in MEDLINE, extrapolated to account for the large number of non-biomedical journals that are not indexed by MEDLINE. If anyone has better data on this, please comment below.