Did Finishing the Drosophila Genome Legitimize Open Access Publishing?

I’m currently reading Glyn Moody’s (2004) “Digital Code of Life: How Bioinformatics is Revolutionizing Science, Medicine, and Business” and greatly enjoying the writing as well as the whirlwind summary of the history of Bioinformatics and the (Human) Genome Project(s). Most of what Moody covers that I am familiar with is quite accurate, and his scholarship is thorough, so I find his telling of the story compelling. One claim I find new and curious in this book is in his discussion of the sequencing of the Drosophila melanogaster genome, more precisely the “finishing” of this genome, and its impact on the legitimacy of Open Access publishing.

The sequencing of D. melanogaster was done as a collaboration between the Berkeley Drosophila Genome Project (BDGP) and Celera, as a test case to prove that whole-genome shotgun sequencing could be applied to large animal genomes. I won’t go into the details here, but it is widely accepted that the Adams et al. (2000) and Myers et al. (2000) papers in Science demonstrated the feasibility of whole-genome shotgun sequencing, while it was a lesser-known paper by Celniker et al. (2002) in Genome Biology, reporting the “finished” D. melanogaster genome, that proved the accuracy of whole-genome shotgun assembly. No controversy here.

More debatable is what Moody goes on to write about the Celniker et al. (2002) paper:

This was an important paper, then, and one that had a significance that went beyond its undoubted scientific value. For it appeared neither in Science, as the previous Drosophila papers had done, nor in Nature, the obvious alternative. Instead, it was published in Genome Biology. This describes itself as “a journal, delivered over the web.” That is, the Web is the primary medium, with the printed version offering a kind of summary of the online content in a convenient portable form. The originality of Genome Biology does not end there: all of its main research articles are available free online.

A description then follows of the history and virtues of PubMed Central and the earliest Open Access biomedical publishers BioMed Central and PLoS. Moody (emphasis mine) then returns to the issue of:

…whether a journal operating on [Open Access] principles could attract top-ranked scientists. This question was answered definitively in the affirmative with the announcement and analysis of the finished Drosophila sequence in January 2003. This key opening paper’s list of authors included not only [Craig] Venter, [Gene] Myers, and [Mark] Adams, but equally stellar representatives of the academic world of Science, such as Gerald Rubin, the boss of the fruit fly genome project, and Richard Gibbs, head of sequencing at Baylor College. Alongside this paper there were no less than nine other weighty contributions, including one on Apollo, a new tool for viewing and editing sequence annotation. For its own Drosophila extravaganza of March 2000, Science had marshalled seven papers in total. Clearly, Genome Biology had arrived, and with it a new commercial publishing model based on the latest way of showing the data.

This passage resonated with me since I was working at the BDGP at the time this special issue on the finishing of the Drosophila genome in Genome Biology was published, and was personally introduced to Open Access publishing through this event.  I recall Rubin walking the hallways of building 64 on his periodic visits promoting this idea, motivating us all to work hard to get our papers together by the end of 2002 for this unique opportunity. I also remember lugging around stacks of the printed issue at the Fly meeting in Chicago in 2003, plying unsuspecting punters with a copy of a journal that most people had never heard of, and having some of my first conversations with people on Open Access as a consequence.

What Moody doesn’t capture in this telling is that Rubin’s decision to publish in Genome Biology almost surely owes itself to the influence that Mike Eisen had on Rubin and others in the genomics community in Berkeley at the time. Eisen and Rubin had recently collaborated on a paper, Eisen had made inroads in Berkeley on the Open Access issue by actively recruiting signatories for the PLoS open letter the year before, and Eisen himself published his first Open Access paper in Genome Biology in October 2002. So the idea of publishing in Open Access journals, and in Genome Biology in particular, was clearly in the air at the time, and it may not have been as bold a step for Rubin to take as Moody implies.

Nevertheless, Moody’s point may hold some truth, and I think it is interesting to consider whether the long-standing open data philosophy of the Drosophila genetics community that led to the Genome Biology special issue was a key turning point in the widespread success of Open Access publishing over the next decade. Surely the movement would have taken off anyway at some point. But in late 2002, when the BioMed Central journals were the only place to publish gold Open Access articles, few people had tested the waters since the launch of the BMC journals in 2000. While we cannot replay the tape, Moody’s claim is plausible in my view, and it is worth asking whether widespread buy-in to Open Access publishing in biology might have been delayed if Rubin had not insisted that the efforts of the Berkeley Drosophila Genome Project be published under an Open Access model.

UPDATE 25 March 2012

After tweeting this post, here is what Eisen and Moody have to say:

UPDATE 19 May 2012

It appears that the publication of another part of the Drosophila (meta)genome, its Wolbachia endosymbiont, played an important role in the conversion of Jonathan Eisen to supporting Open Access. Read more here.

The Roberts/Ashburner Response

A previous post on this blog shared a helpful boilerplate response to editors for politely declining to review for non-Open Access journals, which I received originally from Michael Ashburner. During a quick phone chat today, Ashburner told me that he in fact inherited a version of this response originally from Nobel laureate Richard Roberts, co-discoverer of introns, lead author on the Open Letter to Science calling for a “Genbank” of the scientific literature, and long-time editor of Nucleic Acids Research, one of the first classical journals to move to a fully Open Access model. So to give credit where it is due, I’ve updated the title of the “Just Say No” post to make the attribution of this letter more clear. We owe both Roberts and Ashburner many thanks for paving the way to a better model of scientific communication and leading by example.

Why the Research Works Act Doesn’t Affect Text-mining Research

As the central digital repository for life science publications, PubMed Central (PMC) is one of the most significant resources for making the Open Access movement a tangible reality for researchers and citizens around the world. Articles in PMC are deposited through two routes: either automatically by journals that participate in the PMC system, or directly by authors for journals that do not. Author deposition of peer-reviewed manuscripts in PMC is mandated by funders in order to make the results of publicly- or charity-funded research maximally accessible, and has led to over 180,000 articles being made free (gratis) to download that would otherwise be locked behind closed-access paywalls. Justifiably, there has been outrage over recent legislation (the Research Works Act) that would repeal the NIH mandate in the USA and thereby prevent important research from being freely available.

However, from a text-miner’s perspective, author-deposited manuscripts in PMC are closed access: while they can be downloaded and read individually, virtually none (<200) are available from PMC’s Open Access subset, which includes all articles that are free (libre) to download in bulk and text/data mine. This includes ~99% of the author-deposited manuscripts from the journal Nature, despite a clear statement from 2009 entitled “Nature Publishing Group allows data- and text-mining on self-archived manuscripts”. In short, funder mandates only make manuscripts public, not open, and thus whether the RWA is passed or not is actually moot from a text/data-mining perspective.

Why is this important? The simple reason is that there are currently only ~400,000 articles in the PMC Open Access subset, so the ~180,000 author-deposited manuscripts amount to nearly half as many articles as everything currently available for text/data mining. What could be a rich source of data for large-scale information extraction thus remains locked away from programmatic analysis. This is especially tragic considering that, at the point of manuscript acceptance, publishers have invested little to nothing in the publication process and their claim to copyright is most tenuous.
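To make the scale of the problem concrete, here is a minimal sketch (in Python) of how one might pull these counts directly from PubMed Central via the NCBI E-utilities. The esearch endpoint is real, but the “open access”[filter] and “author manuscript”[filter] query terms are assumptions about PMC’s search syntax and may need adjusting.

```python
# Sketch: count PMC records in the Open Access (libre) subset versus
# author-deposited manuscripts, using NCBI E-utilities (esearch).
# NOTE: the two [filter] terms below are assumed PMC query filters.
import json
import urllib.parse
import urllib.request

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def pmc_count(term: str) -> int:
    """Return the number of PMC records matching a query term."""
    params = urllib.parse.urlencode({"db": "pmc", "term": term, "retmode": "json"})
    with urllib.request.urlopen(f"{ESEARCH}?{params}") as resp:
        return int(json.load(resp)["esearchresult"]["count"])

if __name__ == "__main__":
    oa_subset = pmc_count('"open access"[filter]')        # bulk-minable (libre) subset
    author_ms = pmc_count('"author manuscript"[filter]')  # funder-mandated deposits
    print(f"PMC Open Access subset:       {oa_subset:,}")
    print(f"Author-deposited manuscripts: {author_ms:,}")
```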

So instead of discussing whether we should support the status quo of weak funder mandates by working to block the RWA or expand NIH-like mandates (e.g. under the Federal Research Public Access Act, FRPAA), the real discussion that needs to be had is how to make funder mandates stronger, insisting (at a minimum) that author-deposited manuscripts be available for text/data-mining research. More of the same, no matter how much, only takes us half the distance towards the ultimate goals of the Open Access movement, and doesn’t permit the crucial text/data-mining research that is needed to make sense of the deluge of information in the scientific literature.

Credits: Max Haussler for making me aware of the lack of author manuscripts in PMC a few years back, and Heather Piwowar for recently jump-starting the key conversation on how to push for improved text/data mining rights in future funder mandates.

An Open Archive of My F1000 Reviews

Following on from a recent conversation with David Stephens on Twitter about my decision to resign from Faculty of 1000, F1000 has clarified their terms for the submission of evaluations and confirmed that it is permissible to “reproduce personal evaluations on institutional & personal blogs if you clearly reference F1000”.

As such, I am delighted to be able to repost here an Open Archive of my F1000 contributions. Additionally, this post acts in a second capacity as my first contribution to the Research Blogging Network. Hopefully these commentaries will be of interest to some, and should add support to the Altmetrics profiles for these papers through systems like Total Impact.

Nelson CE, Hersh BM, & Carroll SB (2004). The regulatory content of intergenic DNA shapes genome architecture. Genome biology, 5 (4) PMID: 15059258

My review: This article reports that genes with complex expression have longer intergenic regions in both D. melanogaster and C. elegans, and introduces several innovative and complementary approaches to quantify the complexity of gene expression in these organisms. Additionally, the structure of intergenic DNA in genes with high complexity (e.g. receptors, specific transcription factors) is shown to be longer and more evenly distributed over 5′ and 3′ regions in D. melanogaster than in C. elegans, whereas genes with low complexity (e.g. metabolic genes, general transcription factors) are shown to have similar intergenic lengths in both species and exhibit no strong differences in length between 5′ and 3′ regions. This work suggests that the organization of noncoding DNA may reflect constraints on transcriptional regulation and that gene structure may yield insight into the functional complexity of uncharacterized genes in compact animal genomes. (@F1000: http://f1000.com/1032936)

Li R, Ye J, Li S, Wang J, Han Y, Ye C, Wang J, Yang H, Yu J, Wong GK, & Wang J (2005). ReAS: Recovery of ancestral sequences for transposable elements from the unassembled reads of a whole genome shotgun. PLoS computational biology, 1 (4) PMID: 16184192

My review: This paper presents a novel method for automating the laborious task of constructing libraries of transposable element (TE) consensus sequences. Since repetitive TE sequences confound whole-genome shotgun (WGS) assembly algorithms, sequence reads from TEs are initially screened from WGS assemblies based on overrepresented k-mer frequencies. Here, the authors invert the same principle, directly identifying TE consensus sequences from those same reads containing high frequency k-mers. The method was shown to identify all high copy number TEs and increase the effectiveness of repeat masking in the rice genome. By circumventing the inherent difficulties of TE consensus reconstruction from erroneously assembled genome sequences, and by providing a method to identify TEs prior to WGS assembly, this method provides a new strategy to increase the accuracy of WGS assemblies as well as our understanding of the TEs in genome sequences. (@F1000: http://f1000.com/1031746)
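As an illustration of the k-mer principle described above (and not of the ReAS implementation itself), a toy sketch might look like the following; the choice of k and the count threshold are arbitrary placeholders.

```python
# Toy sketch of k-mer screening: reads whose k-mers are highly overrepresented
# across the whole read set are likely to derive from high-copy-number repeats
# such as TEs. Illustrative only; not the ReAS algorithm.
from collections import Counter

def kmers(seq, k):
    return (seq[i:i + k] for i in range(len(seq) - k + 1))

def flag_repeat_reads(reads, k=17, min_count=50):
    """Return the reads containing at least one high-frequency k-mer."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return [r for r in reads if any(counts[km] >= min_count for km in kmers(r, k))]
```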

Rifkin SA, Houle D, Kim J, & White KP (2005). A mutation accumulation assay reveals a broad capacity for rapid evolution of gene expression. Nature, 438 (7065), 220-3 PMID: 16281035

My review: This paper reports empirical estimates of the mutational input to gene expression variation in Drosophila, knowledge of which is critical for understanding the mechanisms governing regulatory evolution. These direct estimates of mutational variance are compared to gene expression differences across species, revealing that the majority of genes have lower expression divergence than is expected if evolving solely by mutation and genetic drift. Mutational variances on a gene-by-gene basis range over several orders of magnitude and are shown to vary with gene function and developmental context. Similar results in C. elegans [1] provide strong support for stabilizing selection as the dominant mode of gene expression evolution. (@F1000: http://f1000.com/1040157)

References: 1. Denver DR, Morris K, Streelman JT, Kim SK, Lynch M, & Thomas WK (2005). The transcriptional consequences of mutation and natural selection in Caenorhabditis elegans. Nature genetics, 37 (5), 544-8 PMID: 15852004

Caspi A, & Pachter L (2006). Identification of transposable elements using multiple alignments of related genomes. Genome research, 16 (2), 260-70 PMID: 16354754

My review: This paper reports an innovative strategy for the de novo detection of transposable elements (TEs) in genome sequences based on comparative genomic data. By capitalizing on the fact that bursts of TE transposition create large insertions in multiple genomic locations, the authors show that detection of repeat insertion regions (RIRs) in alignments of multiple Drosophila genomes has high sensitivity to identify both individual instances and families of known TEs. This approach opens a new direction in the field of repeat detection and provides added value to TE annotations by placing insertion events in a phylogenetic context. (@F1000 http://f1000.com/1049265)

Simons C, Pheasant M, Makunin IV, & Mattick JS (2006). Transposon-free regions in mammalian genomes. Genome research, 16 (2), 164-72 PMID: 16365385

My review: This paper presents an intriguing analysis of transposon-free regions (TFRs) in the human and mouse genomes, under the hypothesis that TFRs indicate genomic regions where transposon insertion is deleterious and removed by purifying selection. The authors test and reject a model of random transposon distribution and investigate the properties of TFRs, which appear to be conserved in location across species and enriched for genes (especially transcription factors and micro-RNAs). An alternative mutational hypothesis not considered by the authors is the possibility for clustered transposon integration (i.e. preferential insertion into regions of the genome already containing transposons), which may provide a non-selective explanation for the apparent excess of TFRs in the human and mouse genomes. (@F1000: http://f1000.com/1010399)
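To make the null model of random transposon distribution concrete, here is a toy simulation sketch; the chromosome size, element number and element length are illustrative placeholders rather than values from the paper.

```python
# Toy null model: drop TEs of fixed length at random positions on a chromosome
# and record the lengths of the resulting transposon-free gaps, for comparison
# with the observed gap-length distribution. Parameters are placeholders.
import random

def simulate_tfr_lengths(chrom_len=100_000_000, n_te=50_000, te_len=300):
    starts = sorted(random.randrange(chrom_len - te_len) for _ in range(n_te))
    gaps, prev_end = [], 0
    for s in starts:
        if s > prev_end:
            gaps.append(s - prev_end)          # a transposon-free gap
        prev_end = max(prev_end, s + te_len)
    gaps.append(chrom_len - prev_end)
    return gaps
```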

Wheelan SJ, Scheifele LZ, Martínez-Murillo F, Irizarry RA, & Boeke JD (2006). Transposon insertion site profiling chip (TIP-chip). Proceedings of the National Academy of Sciences of the United States of America, 103 (47), 17632-7 PMID: 17101968

My review: This paper demonstrates the utility of whole-genome microarrays for the high-throughput mapping of eukaryotic transposable element (TE) insertions on a genome-wide basis. With an experimental design guided by first computationally digesting the genome into suitable fragments, followed by linker-PCR to amplify TE flanking regions and subsequent hybridization to tiling arrays, this method was shown to recover all detectable TE insertions with essentially no false positives in yeast. Although limited to species with available genome sequences, this approach circumvents inefficiencies and biases associated with the alternative of whole-genome shotgun resequencing to detect polymorphic TEs on a genome-wide scale. Application of this or related technologies (e.g. [1]) to more complex genomes should fill gaps in our understanding of the contribution of TE insertions to natural genetic variation. (@F1000: http://f1000.com/1088573)

References: 1. Gabriel A, Dapprich J, Kunkel M, Gresham D, Pratt SC, & Dunham MJ (2006). Global mapping of transposon location. PLoS genetics, 2 (12) PMID: 17173485
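The in-silico digestion step described in the review above can be sketched roughly as follows; the restriction site and fragment-size window are illustrative assumptions, not values from the paper.

```python
# Sketch of an in-silico restriction digest: cut the genome at each occurrence
# of a recognition site and keep fragments in a size range suitable for
# linker-PCR. The site and size window below are illustrative only.
import re

def digest(genome, site="GATC", size_range=(200, 2000)):
    cut_positions = [m.start() for m in re.finditer(site, genome)]
    cuts = [0] + cut_positions + [len(genome)]
    fragments = (genome[a:b] for a, b in zip(cuts, cuts[1:]))
    return [f for f in fragments if size_range[0] <= len(f) <= size_range[1]]
```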

Haag-Liautard C, Dorris M, Maside X, Macaskill S, Halligan DL, Houle D, Charlesworth B, & Keightley PD (2007). Direct estimation of per nucleotide and genomic deleterious mutation rates in Drosophila. Nature, 445 (7123), 82-5 PMID: 17203060

My review: This paper presents the first direct estimates of nucleotide mutation rates across the Drosophila genome derived from mutation accumulation experiments. By using DHPLC to scan over 20 megabases of genomic DNA, the authors obtain several fundamental results concerning mutation at the molecular level in Drosophila: SNPs are more frequent than indels; deletions are more frequent than insertions; mutation rates are similar across coding, intronic and intergenic regions; and mutation rates may vary across genetic backgrounds. Results in D. melanogaster contrast with those obtained from mutation accumulation experiments in C. elegans (see [1], where indels are more frequent than SNPs, and insertions are more frequent than deletions), indicating that basic mutation processes may vary across metazoan taxa. (@F1000: http://f1000.com/1070688)

References: 1. Denver DR, Morris K, Lynch M, & Thomas WK (2004). High mutation rate and predominance of insertions in the Caenorhabditis elegans nuclear genome. Nature, 430 (7000), 679-82 PMID: 15295601

Katzourakis A, Pereira V, & Tristem M (2007). Effects of recombination rate on human endogenous retrovirus fixation and persistence. Journal of virology, 81 (19), 10712-7 PMID: 17634225

My review: This study shows that the persistence, but not the integration, of long-terminal repeat (LTR) containing human endogenous retroviruses (HERVs) is associated with local recombination rate, and suggests a link between intra-strand homologous recombination and meiotic exchange. This inference about the mechanisms controlling the transposable element (TE) abundance is obtained by demonstrating that total HERV density (full-length elements plus solo LTRs) is not correlated with recombination rate, whereas the ratio of full-length HERVs relative to solo LTRs is. This work relies critically on advanced computational methods to join TE fragments, demonstrating the need for such algorithms to make accurate inferences about the evolution of mobile DNA and to reveal new insights into genome biology. (@F1000: http://f1000.com/1091037)

Giordano J, Ge Y, Gelfand Y, Abrusán G, Benson G, & Warburton PE (2007). Evolutionary history of mammalian transposons determined by genome-wide defragmentation. PLoS computational biology, 3 (7) PMID: 17630829

My review: This article reports the first comprehensive stratigraphic record of transposable element (TE) activity in mammalian genomes based on several innovative computational methods that use information encoded in patterns of TE nesting. The authors first develop an efficient algorithm for detecting nests of TEs by intelligently joining TE fragments identified by RepeatMasker, which (in addition to providing an improved genome annotation) outputs a global “interruption matrix” that can be used by a second novel algorithm which generates a chronological ordering of TE activity by minimizing the nesting of young TEs into old TEs. Interruption matrix analysis yields results that support previous phylogenetic analyses of TE activity in humans but are not dependent on the assumption of a molecular clock. Comparison of the chronological orders of TE activity in six mammalian genomes provides unique insights into the ancestral and lineage-specific record of global TE activity in mammals. (@F1000: http://f1000.com/1089045)

Schuemie MJ, & Kors JA (2008). Jane: suggesting journals, finding experts. Bioinformatics (Oxford, England), 24 (5), 727-8 PMID: 18227119

My review: This paper introduces a fast method for finding related articles and relevant journals/experts based on user input text and should help improve the referencing, review and publication of biomedical manuscripts. The JANE (Journal/Author Name Estimator) method uses a standard word frequency approach to find similar documents, then adds the scores in the top 50 records to produce a ranked list of journals or authors. Using either the abstract or full-text, JANE suggested quite sensible journals and authors in seconds for a manuscript we have in press, while the related eTBLAST method [1] failed to complete while I wrote this review. JANE should prove to be a very useful text mining tool for authors and editors alike. (@F1000: http://f1000.com/1101037)

References: 1. Errami M, Wren JD, Hicks JM, & Garner HR (2007). eTBLAST: a web server to identify expert reviewers, appropriate journals and similar publications. Nucleic acids research, 35 (Web Server issue) PMID: 17452348
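A minimal sketch of the word-frequency scoring idea described in the review above, using TF-IDF as a stand-in for JANE’s actual weighting scheme (scikit-learn is assumed to be available):

```python
# Sketch of JANE-style journal ranking: score the similarity between an input
# abstract and a corpus of abstracts by TF-IDF cosine similarity, then sum the
# scores of the top 50 most similar records per journal. Illustrative only.
from collections import defaultdict
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_journals(query, abstracts, journals, top_n=50):
    """abstracts[i] was published in journals[i]; returns (journal, score) pairs."""
    vec = TfidfVectorizer(stop_words="english")
    doc_matrix = vec.fit_transform(abstracts)
    sims = cosine_similarity(vec.transform([query]), doc_matrix)[0]
    scores = defaultdict(float)
    for idx in sims.argsort()[::-1][:top_n]:   # 50 most similar records
        scores[journals[idx]] += sims[idx]     # sum scores per journal
    return sorted(scores.items(), key=lambda kv: -kv[1])
```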

Pask AJ, Behringer RR, & Renfree MB (2008). Resurrection of DNA function in vivo from an extinct genome. PloS one, 3 (5) PMID: 18493600

My review: This paper reports the first transgenic analysis of a cis-regulatory element cloned from an extinct species. Although no differences were seen in the expression pattern of the collagen (Col2A1) enhancer from the extinct Tasmanian tiger and extant mouse, this work is an important proof of principle for using ancient DNA in the evolutionary analysis of gene regulation. (@F1000: http://f1000.com/1108816)

Ginsberg J, Mohebbi MH, Patel RS, Brammer L, Smolinski MS, & Brilliant L (2009). Detecting influenza epidemics using search engine query data. Nature, 457 (7232), 1012-4 PMID: 19020500

My review: A landmark paper in health bioinformatics demonstrating that Google searches can predict influenza trends in the United States. Predicting infectious disease outbreaks currently relies on patient reports gathered through clinical settings and submitted to government agencies such as the CDC. The possible use of patient “self-reporting” through internet search queries offers unprecedented real-time access to temporal and regional trends in infectious diseases. Here, the authors use a linear modeling strategy to learn which Google search terms best correlate with regional trends in influenza-related illness. This model explains flu trends over a 5-year period with startling accuracy, and was able to predict flu trends during 2007-2008 with a 1-2 week lead time ahead of CDC reports. The phenomenal use of crowd-based predictive health informatics revolutionizes the role of the internet in biomedical research and will likely set an important precedent in many areas of the natural sciences. (@F1000: http://f1000.com/1127181)
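The kind of linear modeling strategy described above can be sketched as a simple regression on the logit scale; the variable names and back-transformation below are a generic illustration of this style of “nowcasting” model, not the authors’ exact formulation.

```python
# Sketch of a Google-Flu-Trends-style linear model: regress logit(ILI rate)
# on logit(fraction of flu-related search queries), then nowcast ILI rates
# from new query data. Generic illustration only, not the paper's exact model.
import numpy as np

def logit(p):
    p = np.asarray(p, dtype=float)
    return np.log(p / (1 - p))

def fit_flu_model(query_fraction, ili_rate):
    """Least-squares fit on the logit scale; returns (intercept, slope)."""
    slope, intercept = np.polyfit(logit(query_fraction), logit(ili_rate), 1)
    return intercept, slope

def predict_ili(query_fraction, intercept, slope):
    y = intercept + slope * logit(query_fraction)
    return 1 / (1 + np.exp(-y))  # back-transform to a rate
```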

Taher L, & Ovcharenko I (2009). Variable locus length in the human genome leads to ascertainment bias in functional inference for non-coding elements. Bioinformatics (Oxford, England), 25 (5), 578-84 PMID: 19168912

My review: This paper raises the important observation that differences in the length of genes can bias their functional classification using the Gene Ontology, and provides a simple method to correct for this inherent feature of genome architecture. A basic observation of genome biology is that genes differ widely in their size and structure within and between species. Understanding the causes and consequences of this variation in gene structure is an open challenge in genome biology. Previously, Nelson and colleagues [1] have shown, in flies and worms, that the length of intergenic regions is correlated with the regulatory complexity of genes and that genes from different Gene Ontology (GO) categories have drastically different lengths. Here, Taher and Ovcharenko confirm this observation of functionally non-random gene length in the human genome, and discuss the implications of this feature of genome organization on analyses that employ the GO for functional inference. Specifically, these authors show that random selection of noncoding DNA sequences from the human genome leads to the false inference of over- and under-representation of specific GO categories that preferentially contain longer or shorter genes, respectively. This finding has important implications for the large number of studies that employ a combination of gene expression microarrays and GO enrichment analysis, since gene expression is largely controlled by noncoding DNA. The authors provide a simple method to correct for this bias in GO analyses, and show that previous reports of the enrichment of “ultraconserved” noncoding DNA sequences in vertebrate developmental genes [2] may be a statistical artifact. (@F1000: http://f1000.com/1157594)

References: 1. Nelson CE, Hersh BM, & Carroll SB (2004). The regulatory content of intergenic DNA shapes genome architecture. Genome biology, 5 (4) PMID: 15059258

2. Bejerano G, Pheasant M, Makunin I, Stephen S, Kent WJ, Mattick JS, & Haussler D (2004). Ultraconserved elements in the human genome. Science (New York, N.Y.), 304 (5675), 1321-5 PMID: 15131266
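One generic way to build a length-aware null model for GO enrichment, in the spirit of the correction discussed in the review above (but not the authors’ exact method), is to sample genes with probability proportional to locus length:

```python
# Sketch of a length-aware null model for GO enrichment: sample genes with
# probability proportional to locus length, so that GO categories containing
# long genes are not spuriously enriched when hits land in noncoding DNA.
# Generic illustration, not the authors' exact correction.
import random
from collections import Counter

def length_weighted_null(go_categories, lengths, n_hits, n_iter=1000):
    """go_categories[i] and lengths[i] describe gene i; returns one Counter of
    per-category hit counts for each resampling iteration."""
    return [
        Counter(random.choices(go_categories, weights=lengths, k=n_hits))
        for _ in range(n_iter)
    ]
```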

Hindorff LA, Sethupathy P, Junkins HA, Ramos EM, Mehta JP, Collins FS, & Manolio TA (2009). Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proceedings of the National Academy of Sciences of the United States of America, 106 (23), 9362-7 PMID: 19474294

My review: This article introduces results from human genome-wide association studies (GWAS) into the realm of large-scale functional genomic data mining. These authors compile the first curated database of trait-associated single-nucleotide polymorphisms (SNPs) from GWAS studies (http://www.genome.gov/gwastudies/) that can be mined for general features of SNPs underlying phenotypes in humans. By analyzing 531 SNPs from 151 GWAS studies, the authors discover that trait-associated SNPs are predominantly in non-coding regions (43% intergenic, 45% intronic), but that non-synonymous and promoter trait-associated SNPs are enriched relative to expectations. The database is actively maintained and growing, and currently contains 3943 trait-associated SNPs from 796 publications. This important resource will facilitate data mining and integration with high-throughput functional genomics data (e.g. ChIP-seq), as well as meta-analyses, to address important questions in human genetics, such as the discovery of loci that affect multiple traits. While the interface to the GWAS catalog is rather limited, a related project (http://www.gwascentral.org/) [1] provides a much more powerful interface for searching and browsing data from the GWAS catalog. (@F1000: http://f1000.com/8408956)

References: 1. Thorisson GA, Lancaster O, Free RC, Hastings RK, Sarmah P, Dash D, Brahmachari SK, & Brookes AJ (2009). HGVbaseG2P: a central genetic association database. Nucleic acids research, 37 (Database issue) PMID: 18948288

Tamames J, & de Lorenzo V (2010). EnvMine: a text-mining system for the automatic extraction of contextual information. BMC bioinformatics, 11 PMID: 20515448

My review: This paper describes EnvMine, an innovative text-mining tool to obtain physico-chemical and geographical information about environmental genomics samples. This work represents a pioneering effort to apply text-mining technologies in the domain of ecology, providing novel methods to extract the units and variables of physico-chemical entities, as well as link the location of samples to worldwide geographic coordinates via Google Maps. Application of EnvMine to full-text articles in the environmental genomics database envDB [1] revealed very high system performance, suggesting that information extracted by EnvMine will be of use to researchers seeking meta-data about environmental samples across different domains of biology. (@F1000: http://f1000.com/3502956)

References: 1. Tamames J, Abellán JJ, Pignatelli M, Camacho A, & Moya A (2010). Environmental distribution of prokaryotic taxa. BMC microbiology, 10 PMID: 20307274
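As a toy illustration of the kind of extraction described in the review above (EnvMine’s actual patterns and normalization are far more extensive), a few regular expressions for physico-chemical variables and their units might look like this:

```python
# Toy sketch of extracting physico-chemical variables and units from free text.
# The patterns below are illustrative assumptions, not EnvMine's rules.
import re

PATTERNS = {
    "temperature": re.compile(r"(-?\d+(?:\.\d+)?)\s*(?:°\s*C|degrees?\s+C(?:elsius)?)"),
    "pH":          re.compile(r"pH\s*(?:of\s*)?(\d+(?:\.\d+)?)"),
    "depth":       re.compile(r"depth\s*(?:of\s*)?(\d+(?:\.\d+)?)\s*(m|km)\b", re.I),
}

def extract_conditions(text):
    """Return the first match for each physico-chemical variable found in text."""
    return {name: m.groups() for name, pat in PATTERNS.items() if (m := pat.search(text))}

print(extract_conditions("Samples were taken at a depth of 30 m, pH 7.8, and 12.5 °C."))
```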

Goodbye F1000, Hello Faculty of a Million

Dr. Seuss' The Sneetches

In the children’s story The Sneetches, Dr. Seuss presents a world where certain members of society are marked by an arbitrary badge of distinction, and a canny opportunist uses this false basis of prestige for his financial gain*. What does this morality tale have to do with the scientific article recommendation service Faculty of 1000? Read on…

Currently ~3000 papers are published each day in the biosciences**. Navigating this sea of information to find articles relevant to your work is no small matter. Researchers can either sink or swim, aided by (i) machine-based technologies such as search or text-mining tools, or (ii) human-based technologies like blogs or social networking services that highlight relevant work through expert recommendation.

One of the first expert recommendation services was Faculty of 1000, launched in 2002 with the aim of “identifying and evaluating the most significant articles from biomedical research publications” through a peer-nominated “Faculty” of experts in various subject domains. Since the launch of F1000, several other mechanisms for expert literature recommendation have come to the foreground, including academic social bookmarking tools like citeulike or Mendeley, the rise of Research Blogging, and new F1000-like services such as annotatr, The Third Reviewer, PaperCritic and TiNYARM.

Shortly after I started my group at the University of Manchester in 2005, I was invited to join the F1000 Faculty, which I gratefully accepted. At the time, I felt that it was a mark of distinction to be invited into this select club, and that it would be a good platform to voice my opinions on what work I thought was notable. I was under no illusion that my induction was based only on merit, since the invitation came from my former post-doc mentor Michael Ashburner. I overlooked this issue at the time: when you are invited to join the “in-club” as a junior faculty member, it is very tempting to believe that things like this will play a positive role in your career progression. [Whether being in F1000 has helped my career I can’t say, but certainly it can’t have hurt, and I (sheepishly) admit to using it on grant and promotion applications in the past.]

Since then, I’ve tried to contribute to F1000 when I can [PAYWALL], but since it is not a core part of my job, I’ve only contributed ~15 reviews in 5 years. My philosophy has been only to contribute reviews on articles I think are of particular note and might be missed otherwise, not to review major papers in Nature/Science that everyone is already aware of. As time has progressed and it has become harder to commit time to non-essential tasks, I’ve contributed less and less, and the F1000 staff has pestered me frequently with reminders and phone calls to submit reviews. At times the pestering has been so severe that I have considered resigning just to get them off my back. And I’ve noticed that some colleagues I have a lot of respect for have also resigned from F1000, which made me wonder if they were likewise fed up with F1000’s nagging.

This summer, while reading a post on Jonathan Eisen’s Tree of Life blog, I came across a parenthetical remark he made about quitting F1000, which made me more aware of why their nagging was really getting to me:

I even posted a “dissent” regarding one of [Paul Hebert’s] earlier papers on Faculty of 1000 (which I used to contribute to before they become non open access).

This comment made me realize that the F1000 recommendation service is just another closed-access venture for publishers to make money off a product generated for free by the goodwill and labor of academics. Like closed access journals, my University pays twice to get F1000 content — once for my labor and once for the subscription to the service. But unlike a normal closed-access journal, in the case of F1000 there is not even a primary scientific publication to justify the arrangement. So by contributing to F1000, essentially I take time away from my core research and teaching activities to allow a company to commercialize my IP and pay someone to nag me! What’s even more strange about this situation is that there is no rational open-access equivalent of literature review services like F1000. By analogy with the OA publishing of the primary literature, for “secondary” services I would pay a company to post one of my reviews on someone else’s article. (Does Research Blogging for free sound like a better option to anyone?)

Thus I’ve come to realize that it is unjustified to contribute secondary commentary to F1000 on Open Access grounds, in the same way it is unjustified to submit primary papers to closed-access journals. If I really support Open Access publishing, then to contribute to F1000 I must either be a hypocrite or make an artificial distinction between the primary and secondary literature. But this gets to the crux of the matter: to the extent that recommendation services like F1000 are crucial for researchers to make sense of the onslaught of published data, surely these critical reviews should be Open for all, just as the primary literature should be. On the other hand, if such services are not crucial, why am I giving away my IP for free to a company to capitalize on?

Well, this question has been on my mind for a while, and I have looked into whether there is evidence that F1000 evaluations have real scientific worth in terms of highlighting good publications, which might provide a reason to keep contributing to the system. On this point the evidence is scant and mixed. An analysis by the Wellcome Trust finds a very weak correlation between F1000 evaluations and the evaluations of an internal panel of experts (driven almost entirely by a few clearly outstanding papers), with the majority of highly cited papers being missed by F1000 reviewers. An analysis by the MRC shows a ~2-fold increase in the median number of citations (from 2 to 4) for F1000-reviewed articles relative to other MRC-funded research. Likewise, an analysis of the ecology literature shows similar trends, with marginally higher citation rates for F1000-reviewed work, but with many high-impact papers being missed. [Added 28 April 2012: Moreover, a multifactorial analysis by Priem et al. on a range of altmetric measures of impact for 24,331 PLoS articles clearly shows that the “F1000 indicator did not have shared variability with any of the derived factors” and that “Mendeley bookmark counts correlate more closely to Web of Science citations counts than expert ratings of F1000”.] Therefore the available evidence indicates that F1000 reviews do not capture the majority of good work being published, and that the work that is reviewed is only of marginally higher importance (in terms of citation) than unreviewed work.

So if (i) it goes against my OA principles, (ii) there is no evidence (on average) that my opinion matters quantitatively much more than anyone else’s, and (iii) there are equivalent open access systems to use, why should I continue contributing to F1000? The only answer I can come up with is that by being an F1000 reviewer, I gain a certain prestige for being in the “in club,” as well as some prestige-by-association for aligning myself with publications or scientists I perceive to be important. When stripped down like this, being a member of F1000 seems pretty close to being a Sneetch with a star, and the F1000 business model seems not too different from the one used by Sylvester McMonkey McBean. Realizing this has made me feel more than a bit ashamed for letting the allure of being in the old-boys club and my scientific ego trick me into something I cannot rationally justify.

So, needless to say I have recently decided to resign from F1000. I will instead continue to contribute my tagged articles to citeulike (as I have for several years) and contribute more substantial reviews to this blog via the Research Blogging portal and push the use of other Open literature recommendation systems like PaperCritic, who have recently made their user-supplied content available under a Creative Commons license. (Thanks for listening PaperCritic!).

By supporting these Open services rather than the closed F1000 system (and perhaps convincing others to do the same) I feel more at home among the ranks of the true crowd-sourced “Faculty of 1,000,000” that we need to help filter the onslaught of publications. And just as Sylvester McMonkey McBean’s Star-On machine provided a disruptive technology for overturning perceptions of prestige by giving everyone a star in The Sneetches, I’m hopeful that these open-access web 2.0 systems will also do some good towards democratizing personal recommendation of the scientific literature.

* Note: This post should in no way be taken as an ad hominem attack on F1000 or its founder Vitek Tracz, whom I respect very much as a pioneer of Open Access biomedical publishing.

** This number is an estimate based on the real figure of ~2.5K papers/day deposited in MEDLINE, extrapolated to the large number of non-biomedical journals that are not indexed by MEDLINE. If anyone has better data on this, please comment below.
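For what it’s worth, the MEDLINE figure could be sanity-checked with the same NCBI esearch endpoint sketched earlier on this page; the [EDAT] (Entrez date) field tag and date-range syntax below are assumptions about PubMed’s query language.

```python
# Sketch: estimate papers per day in MEDLINE/PubMed by counting records with
# an Entrez date in a given year. The [EDAT] field tag and date format are
# assumed PubMed query syntax and may need adjusting.
import json
import urllib.parse
import urllib.request

def pubmed_count(term):
    params = urllib.parse.urlencode({"db": "pubmed", "term": term, "retmode": "json"})
    url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?" + params
    with urllib.request.urlopen(url) as resp:
        return int(json.load(resp)["esearchresult"]["count"])

n = pubmed_count('"2011/01/01"[EDAT] : "2011/12/31"[EDAT]')
print(f"PubMed records dated 2011: {n:,} (~{n / 365:.0f} per day)")
```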

Just Say No – The Roberts/Ashburner Response

UPDATE: see follow-up post “The Roberts/Ashburner Response” to get more of the story on the origin of this letter.

I had the pleasure of catching up with my post-doc mentor Michael Ashburner today, and among other things we discussed the ongoing development of UKPMC and the importance of open access publishing. Although I consider myself a strong open access advocate, I did not sign the PLoS open letter in 2001, since at the time I was a post-doc and not in a position fully to control where I published. Therefore I couldn’t be sure that I could abide by the manifesto 100%, and didn’t want to put my name to something I couldn’t deliver on. As it turns out this is still the case to a certain degree and (because of collaborations) my freely-available-article-index remains at a respectable 85% (33/39), but alas will never reach the coveted 100% mark.

Nevertheless, I have steadily adopted most of the policies of the open letter, especially as my group has become more heavily involved in text-mining research over the years. This became especially true after an encounter with a publisher in 2008 who forced my campus IT to shut down my office IP address because I was downloading articles from a journal for which our University has a site license. Perhaps unsurprisingly, this nasty experience radicalized me into more of an open access evangelist. After discussing this event at the time with Ashburner, he reminded me of the manifesto and one of its most powerful tools for changing the landscape of scholarly publishing – refusing to review for journals/publishers who do not submit their content to PubMed Central (see the white-list of journals here).

I have dug this letter out countless times since then and used versions of it when asked to review for non-PMC journals, as it expresses the principles in plain and powerful language. I had another call to dig it out today and thought that I’d post the “Ashburner response” so others have a model to follow if they choose this path.

Enjoy!

From: “Michael Ashburner” <michael.ashburner@xxx.xxx>
Date: 30 August 2008 13:48:03 GMT+01:00
To: “Casey Bergman” <casey.bergman@xxx.xxx>
Subject: Just say No

Dear Editor,

Thank you for your invitation to review for your journal. Because it is not open access and does not provide its back content to PubMed Central, or any similar resource, I regret that I am unwilling to do this.

I would urge you to seriously reconsider both policies and would ask that you send this letter to your co-editors and publisher. In the event that you do change your policy, even to the extent of providing your back content to PubMed Central, or a similar resource, then I will be happy to review for you.

The scientific literature is at present the most significant resource available to researchers. Without access to the literature we cannot do science in any scholarly manner. Your journal refuses to embrace the idea that the purpose of the scientific literature is to communicate knowledge, not to make a profit for publishers. Without the free input of manuscripts and referees’ time your journal would not exist. By and large, the great majority of the work you publish is paid for by taxpayers. We now, either as individuals or as researchers whose grants are top-sliced, have to pay to read our own work and that of our colleagues, either personally or through our institutes’ libraries. I find that, increasingly, literature that is not available by open access is simply being ignored. Moreover, I am very aware that, increasingly, discovering information from the literature relies on some sort of computational analysis. This can only be effective if the entire content of primary research papers is freely available. Finally, by not being an open access journal you are disenfranchising both scientists who cannot afford (or whose institutions cannot afford) to pay for access and the general public.

There are now several good models for open access publication, and I would urge your journal to adopt one of these. There is an extensive literature on open access publishing, and its economic implications. I would be pleased to send you references to this literature.

Yours sincerely,

Michael Ashburner

Is Science really “Making Data Maximally Available”?

Earlier this year Hanson, Sugden and Alberts [1] argued in a piece entitled “Making Data Maximally Available” that journals like Science play a crucial role in making scientific data “publicly and permanently available” and that efforts to improve the standard of supporting online materials will increase their utility and the impact of their associated publications. While I whole-heartedly agreed with their view that improving supplemental materials is a better solution to the current disorganized [2] and impermanent [3] state of affairs (as opposed to the unwise alternative of discarding them altogether [4]), a few things about this piece really irked me. I had intended to write a letter to the editor about this with a colleague, but that unfortunately didn’t materialize, so I thought I’d post my concerns here.

First, the authors make an artificial distinction between the supporting online materials associated with a paper and the contents of the paper itself. Clearly the most important data in a scientific report is in the full text of the article, and thus if making data in supporting online materials “maximally available” is a goal, surely so must be making data in the full-text article itself. Second, in the context of the wider discussion on “big data” in which these points are made, it must be noted that maximal availability is only one step towards maximal utility, the other being maximal access. As the entire content of Science magazine is not available for unrestricted download and re-use from PubMed Central‘s Open Access repository, the maximal utility of data in the full text or supplemental materials of articles published in Science is currently fettered, because it is not available for bulk text mining or data mining. Amazingly, this is true even for Author-deposited manuscripts in PubMed Central, which are not currently included in the PubMed Central Open Access subset and therefore not available for bulk download and re-use.

Therefore it seems imperative that, in addition to making a clarion call for the improved availability of data, code and references in supplemental materials, the Editors of Science issue a clear policy statement about the use of full-text articles and supplemental online materials published in Science for text- and data-mining research. At a minimum, Science should join other high-profile journals such as Nature [5] in clarifying that Author-deposited manuscripts in PubMed Central, which funding-body mandates require to be deposited for these very purposes, may be used for text and data mining. Additionally, Science should make a clear statement about the copyright and re-use policies for the supporting online materials of all published articles, which are freely available for download without a Science subscription and currently fall in the grey area between restricted and open access.

As we move firmly into the era of big data, where issues of access and re-use of data become increasingly acute, Science, as the representative publication of the world’s largest general scientific society, should take the lead in opening its content for text and data mining, to the mutual benefit of authors, researchers and the AAAS.

References:

1. Hanson et al. (2011) Making Data Maximally Available. Science 331:649
2. Santos et al. (2005) Supplementary data need to be kept in public repositories. Nature 438:738
3. Anderson et al. (2006) On the persistence of supplementary resources in biomedical publications. BMC Bioinformatics 7:260
4. Journal of Neuroscience policy on Supplemental Material
5. Nature Press release on data- and text-mining of self-archived manuscripts
