Keeping Up with the Scientific Literature using Twitterbots: The FlyPapers Experiment

A year ago I created a simple "twitterbot" called FlyPapers to help me stay on top of the Drosophila literature; it tweets links to new abstracts in PubMed and preprints in arXiv from a dedicated Twitter account (@fly_papers). While most 'bots on Twitter post spam or creative nonsense, an increasing number of people are exploring the use of twitterbots for more productive academic purposes. For example, Rod Page set up the @evoldir twitterbot back in 2009 as an alternative to receiving email posts from the Evoldir mailing list, and Gordon McNickle developed the @EcoLog_L twitterbot for the Ecolog-L mailing list. Similar to FlyPapers, others have established twitterbots for domain-specific literature feeds, such as @BioPapers for Quantitative Biology preprints on arXiv, @EcoEvoJournals for publications in Ecology & Evolution and @PlantEcologyBot for papers on Plant Ecology. More recently, Alberto Acerbi developed @CultEvoBot to post links to blogs and new articles on the topic of cultural evolution. (I recommend reading the posts by Rod, Gordon and Alberto for further insight into how and why they established these twitterbots.) One year in, I thought I'd summarize my thoughts on the FlyPapers experiment and make good on a promise to describe my set-up in case others are interested.

First, a few words on my motivation for creating FlyPapers. I have been receiving a daily update of all papers in the area of Drosophila in one form or another for nearly 10 years. My philosophy is that it is relatively easy to keep up with what is being published on a daily basis, but virtually impossible to catch up once you let the river of information flow for too long. I first received daily email updates from NCBI, which cluttered up my inbox and often got buried. I then migrated to RSS on Google Reader, which led to a similar problem of many unread posts accumulating that needed to be marked as "read". Ultimately, I realized that what I want from a personalized publication feed — a flow of links to articles that can be quickly scanned and clicked, but that requires no other action and can be ignored when I'm busy — was better suited to a Twitter client than an RSS reader. Moreover, in the spirit of "maximizing the value of your keystrokes", it seemed that a feed useful for me might also be useful for others, and Twitter was the natural medium for sharing it, since many scientists already use Twitter to post links to papers. Thus FlyPapers was born.

Setting up FlyPapers was straightforward and required no specialist know-how. I first created a dedicated Twitter account with a "catchy" name. Next, I created an account with dlvr.it, which takes an RSS/Twitter/email feed as input and routes the output to the FlyPapers Twitter account. I then set up an RSS feed from NCBI based on a search for the term "Drosophila" and added it as a source to the dlvr.it route. Shortly thereafter, I added an RSS feed for preprints in arXiv using the same search term to the same dlvr.it route. (Unfortunately, neither PeerJ Preprints nor bioRxiv currently offers custom RSS feeds, so they are not included in the FlyPapers stream.) NCBI and arXiv only push new articles once a day, and each article is posted automatically as a distinct tweet for ease of viewing, bookmarking and sharing. The only gotcha I experienced in setting the system up was making sure to set the "number of items displayed" high enough (e.g., 100) when creating the PubMed RSS feed. If the number of articles in one RSS update exceeds this limit, PubMed posts a URL to a PubMed query for the entire set of papers as a single RSS item, rather than links to each individual paper. (For Gordon's take on how he set up his twitterbots, see this thread.) [UPDATE 25/2/14: Rob Lanfear has posted detailed instructions for setting up a twitterbot using the strategy I describe above at https://github.com/roblanf/phypapers. See his comment below for more information.]
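
For anyone who would rather run the bot themselves than rely on dlvr.it, the core of the pipeline can be sketched in a few lines of Python. This is not the setup I use for FlyPapers; it is a minimal sketch assuming the feedparser and tweepy libraries, a placeholder PUBMED_RSS_URL for the feed generated by a PubMed "Drosophila" search, and placeholder Twitter API credentials for the bot account:

```python
import time
import feedparser  # RSS parsing
import tweepy      # Twitter API client

# Placeholder: the RSS feed URL generated from a PubMed search for "Drosophila"
# (remember to set the feed's "number of items displayed" high enough, e.g. 100)
PUBMED_RSS_URL = "https://example.org/pubmed-drosophila.rss"

# Placeholder credentials for the dedicated bot account
client = tweepy.Client(
    consumer_key="...", consumer_secret="...",
    access_token="...", access_token_secret="...",
)

seen = set()  # identifiers of articles already tweeted

while True:
    feed = feedparser.parse(PUBMED_RSS_URL)
    for entry in feed.entries:
        uid = entry.get("id", entry.link)
        if uid not in seen:
            seen.add(uid)
            # one tweet per article, title trimmed to leave room for the link
            client.create_tweet(text=f"{entry.title[:250]} {entry.link}")
    time.sleep(24 * 60 * 60)  # NCBI and arXiv only push new articles once a day
```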

So, has the experiment worked? Personally, I am finding FlyPapers a much more convenient way to stay on top of the Drosophila literature than any previous method I have used. Apparently others are finding this feed useful as well.

One year in, FlyPapers now has 333 followers in 16 countries, which is a far bigger and wider following than I would have ever imagined. Some of the followers are researchers I am familiar with in the Drosophila world, but most are students or post-docs I don’t know, which suggests the feed is finding relevant target audiences via natural processes on Twitter. The account has now posted 3,877 tweets, or ~10-11 tweets per day on average, which gives a rough scale for the amount of research being published annually on Drosophila. Around 10% of tweeted papers are getting retweeted (n=386) or favorited (n=444) by at least one person, and the breadth of topics being favorited/retweeted spans virtually all of Drosophila biology. These facts suggest that developing a twitterbot for domain-specific literature can indeed attract substantial numbers of like-minded individuals, and that automatically tweeting links to articles enables a significant proportion of papers in a field to easily be seen, bookmarked and shared.

Overall, I'm very pleased with the way FlyPapers is developing. I had hoped that one of the outcomes of this experiment would be to help promote Drosophila research, and this appears to be working. I had not expected it would act as a general hub for attracting Drosophila researchers who are active on Twitter, which is a nice surprise. One issue I hadn't considered a year ago was the potential for 'bots like FlyPapers to "game" Altmetrics scores. Frankly, any metric that can be so easily gamed by a primitive bot like FlyPapers probably has no real intrinsic value. However, it is true that this bot does add +1 to the Twitter count for all Drosophila papers. My view is that any attempt to correct for the influence of 'bots on Altmetrics scores should not unduly penalize the real human engagement that bots can facilitate, so I'd say it is fair to subtract the original FlyPapers tweets from an Altmetrics calculation, but retain the retweets created by humans.

One final consequence of putting all new Drosophila literature onto Twitter that I would not have anticipated is that some tweets have been picked up by other social media outlets, including disease-advocacy accounts that quickly pushed basic research findings out to their target audience:

This final point suggests that there may be wider impacts from having more research articles automatically injected into the Twitter ecosystem. Maybe those pesky twitterbots aren’t always so bad after all.

Twitter Tips for Scientific Journals

The growing influence of social media in the lives of Scientists has come to the forefront again recently with a couple of new papers that provide An Introduction to Social Media for Scientists and a more focussed discussion of The Role of Twitter in the Life Cycle of a Scientific Publication. Bringing these discussions into the traditional journal article format is important for spreading the word about social media in Science outside the echo chamber of social media itself. Perhaps more important, in my view, is that these papers reflect a desire for Scientists to participate, and to urge others to participate, in shaping a new space for scientific exchange in the 21st century.

Just as Scientists themselves are adopting social media, many scientific journals/magazines are as well. However, most discussions about the role of social media in scientific exchange overlook the question of how we Scientists believe traditional media outlets, like scientific journals, should engage in this new forum. For example, in the Darling et al. paper on The Role of Twitter in the Life Cycle of a Scientific Publication, little is said about the role of journal Twitter accounts in the life cycle of publications beyond noting:

…to encourage fruitful post-publication critique and interactions, scientific journals could appoint dedicated online tweet editors who can storify and post tweets related to their papers.

This oversight is particularly noteworthy for several reasons. First, it is a fact that many journals, and journal staff, play active roles in engaging with the scientific debate on social media and are not simply passive players in the new scientific landscape. Second, Scientists need to be aware that journals extensively monitor our discussions and activity on social media in ways that were not previously possible, and we need to consider how this affects the future of scientific publishing. Third, Scientists should see that social media represents an opportunity to establish new working relationships with journals, ones that break down the old models that increasingly seem to harm both Science and Scientists.

In the same way that we Scientists are offering tips/advice to each other for how to participate in the new media, I feel that this conversation should also be extended to what we feel are best practices for journals to engage in the scientific process through social media. To kick this off, I’d like to list some do’s and don’ts for how I think journals should handle their presence on Twitter, based on my experiences following, watching and interacting with journals on Twitter over the last couple of years.

  • Do engage with (and have a presence on) social media. Twitter is rapidly being adopted by scientists, and is the perfect forum to quickly transmit information to, and receive it from, your author pool and readership. In fact, I find it a little strange these days if a journal doesn't have a Twitter account.
  • Do establish a social media policy for your official Twitter account. Better yet, make it public, so Scientists know the scope of what we should expect from your account.
  • Don’t use information from Twitter to influence editorial or production processes, such as the acceptance/rejection of papers or choice of reviewers.  This should be an explicit part of your social media policy. Information on social media could be incorrect and by using unverified information from Twitter you could allow competitors/allies to block/promote each other’s work.
  • Don’t use a journal Twitter account as a table of contents for your journal. Email TOCs or RSS feeds exist for this purpose already.
  • Do tweet highlights from your journal or other journals. This is actually what I am looking for in a journal Twitter account, just as I am from the accounts of other Scientists.
  • Do use journal accounts to retweet unmodified comments from Scientists or other media outlets about papers in your journal. This is a good way for Scientists to find other researchers interested in a topic and know what is being said about work in your journal. But leave the original tweet intact, so we can trace it to the originator and so it doesn’t look like you have edited the sentiment to suit your interests.
  • Don't use a journal account to express personal opinions. I find it totally inappropriate when individuals at some journals hide behind the journal name and avatar and use the journal Twitter account as a soapbox for their personal opinions. This is a dangerous thing for a journal to do, since it reinforces stereotypes about the fickleness of editors who love to wield the power their journal provides them. It is also a bad idea because the opinions of one or a few people may unintentionally reflect on the journal or publisher as a whole.
  • Do encourage your staff to create personal accounts and be active on social media. Editors and other journal staff should be encouraged to express personal opinions about science, tweet their own highlights, etc. This is a great way for Scientists to get to know your staff (for better or worse) and build an opinion about who is handling our work at your journal. But it should go without saying that personal opinions should be made through personal accounts, so we can follow/unfollow these people like any other member of the community and so their opinions do not leverage the imprimatur of your journal.
  • Do use journal Twitter accounts to respond to feedback/complaints/queries. Directly replying to comments from the community on Twitter is a great way to build trust in your journal.  If you can’t or don’t want to reply to a query in the open, just reply by asking the person to email your helpdesk. Either way shows good faith that you are listening to our concerns and want to engage. Ignoring comments from Scientists is bad PR and can allow issues to amplify beyond your control, with possible negative impacts on your journal (image) in the long run.
  • Don’t use journal Twitter accounts to tweet from meetings. To me this is a form of expressing personal opinion that looks like you are endorsing certain Scientists/fields/meetings or, worse yet, that you are looking to solicit them to submit their work to your journal, which smacks of desperation and favoritism. Use personal accounts instead to tweet from meetings, since after all what is reported is a personal assessment.

These are just my first thoughts on this issue (anonymised to protect the guilty), which I hope will act as a springboard for others to comment below on how they think journals should manage their presence on Twitter for the benefit of the Scientific community.

Launch of the PLOS Text Mining Collection

Just a quick post to announce that the PLOS Text Mining Collection is now live!

This PLOS Collection arose out of a twitter conversation with Theo Bloom last year, and has come together through the hard work of the authors of the papers in the Collection, the PLOS Collections team (in particular Sam Moore and Jennifer Horsely), and my co-organizers Larry Hunter and Andrey Rzhetsky. Many thanks to all for seeing this effort to completion.

Because of the large body of work in the area of text mining published in PLOS, we struggled with how best to present all these papers in the collection without diluting the experience for the reader. In the end, we decided only to highlight new work from the last two years and major reviews/tutorials at the time of launch. However, as this is a living collection, new articles will be included in the future, and the aim is to include previously published work as well. We hope to see many more papers in the area of text mining published in the PLOS family of journals in the future.

An overview of the PLOS Text Mining Collection is below (cross-posted at the PLOS EveryONE blog), and a commentary on the Collection is available at the official PLOS Blog, entitled "A mine of information – the PLOS Text Mining Collection".

Background to the PLOS Text Mining Collection

Text Mining is an interdisciplinary field combining techniques from linguistics, computer science and statistics to build tools that can efficiently retrieve and extract information from digital text. Over the last few decades, there has been increasing interest in text mining research because of the potential commercial and academic benefits this technology might enable. However, as with the promises of many new technologies, the benefits of text mining are still not clear to most academic researchers.

This situation is now poised to change for several reasons. First, the rate of growth of the scientific literature has now outstripped the ability of individuals to keep pace with new publications, even in a restricted field of study. Second, text-mining tools have steadily increased in accuracy and sophistication to the point where they are now suitable for widespread application. Finally, the rapid increase in availability of digital text in an Open Access format now permits text-mining tools to be applied more freely than ever before.

To acknowledge these changes and the growing body of work in the area of text mining research, today PLOS launches the Text Mining Collection, a compendium of major reviews and recent highlights published in the PLOS family of journals on the topic of text mining. As one of the major publishers of the Open Access scientific literature, it is perhaps no coincidence that research in text mining is flourishing in PLOS journals. As noted above, the widespread application and societal benefits of text mining are most easily achieved under an Open Access model of publishing, where the barriers to obtaining published articles are minimized and the ability to remix and redistribute data extracted from text is explicitly permitted. Furthermore, PLOS is one of the few publishers actively promoting text mining research by providing an open Application Programming Interface (API) to mine their journal content.
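
As a small illustration of what this API makes possible, here is a sketch of a query against the PLOS search API for articles mentioning text mining. The endpoint, parameters and response fields reflect my understanding of the current Solr-based interface and may differ from the live service:

```python
import requests

# PLOS search API (Solr-based); heavy use may require registering for an API key
resp = requests.get(
    "http://api.plos.org/search",
    params={
        "q": '"text mining"',       # query term
        "fl": "id,title,journal",   # fields to return for each article
        "wt": "json",               # response format
        "rows": 10,                 # number of results
    },
)
resp.raise_for_status()
for doc in resp.json()["response"]["docs"]:
    print(doc.get("id"), "-", doc.get("title"))
```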

Text Mining in PLOS

Since virtually the beginning of its history [1], PLOS has actively promoted the field of text mining by publishing reviews, opinions, tutorials and dozens of primary research articles in this area in PLOS Biology, PLOS Computational Biology and, increasingly, PLOS ONE. Because of the large number of text mining papers in PLOS journals, we are only able to highlight a subset of these works in the first instance of the PLOS Text Mining Collection. These include major reviews and tutorials published over the last decade [1][2][3][4][5][6], plus a selection of research papers from the last two years [7][8][9][10][11][12][13][14][15][16][17][18][19] and three new papers arising from the call for papers for this collection [20][21][22].
The papers included in the collection at launch provide important overviews of the field and reflect many exciting contemporary areas of research in text mining, such as:

  • methods to extract textual information from figures [7];
  • methods to cluster [8] and navigate [15] the burgeoning biomedical literature;
  • integration of text-mining tools into bioinformatics workflow systems [9];
  • use of text-mined data in the construction of biological networks [10];
  • application of text-mining tools to non-traditional textual sources such as electronic patient records [11] and social media [12];
  • generating links between the biomedical literature and genomic databases [13];
  • application of text-mining approaches in new areas such as the Environmental Sciences [14] and Humanities [16][17];
  • named entity recognition [18];
  • assisting the development of ontologies [19];
  • extraction of biomolecular interactions and events [20][21]; and
  • assisting database curation [22].

Looking Forward

As this is a living collection, it is worth discussing two issues we hope to see addressed in articles added to the PLOS Text Mining Collection in the future: scaling up and opening up. While application of text-mining tools to the abstracts of all biomedical papers in the MEDLINE database is increasingly common, there have been remarkably few efforts to apply text mining to the entirety of the full-text articles in a given domain, even in the biomedical sciences [4][23]. Therefore, we hope to see more text-mining applications scaled up to use the full text of all Open Access articles. Scaling up will not only maximize the utility of text-mining technologies and their uptake by end users, but will also demonstrate that demand for access to full-text articles exists among the text-mining and wider academic communities.

Likewise, we hope to see more text-mining software systems made freely or openly available in the future. As an example of the state of affairs in the field, only 25% of the research articles highlighted in the PLOS text mining collection at launch provide source code or executable software of any kind [13][16][19][21]. The lack of availability of software or source code accompanying published research articles is, of course, not unique to the field of text mining. It is a general problem limiting progress and reproducibility in many fields of science, which authors, reviewers and editors have a duty to address. Making release of open source software the rule, rather than the exception, should further catalyze advances in text mining, as it has in other fields of computational research that have made extremely rapid progress in the last decades (such as genome bioinformatics).

By opening up the code base in text mining research, and deploying text-mining tools at scale on the rapidly growing corpus of full-text Open Access articles, we are confident this powerful technology will make good on its promise to catalyze scholarly endeavors in the digital age.

References

1. Dickman S (2003) Tough mining: the challenges of searching the scientific literature. PLoS Biol 1: e48. doi:10.1371/journal.pbio.0000048.
2. Rebholz-Schuhmann D, Kirsch H, Couto F (2005) Facts from Text—Is Text Mining Ready to Deliver? PLoS Biol 3: e65. doi:10.1371/journal.pbio.0030065.
3. Cohen B, Hunter L (2008) Getting started in text mining. PLoS Comput Biol 4: e20. doi:10.1371/journal.pcbi.0040020.
4. Bourne PE, Fink JL, Gerstein M (2008) Open access: taking full advantage of the content. PLoS Comput Biol 4: e1000037. doi:10.1371/journal.pcbi.1000037.
5. Rzhetsky A, Seringhaus M, Gerstein M (2009) Getting Started in Text Mining: Part Two. PLoS Comput Biol 5: e1000411. doi:10.1371/journal.pcbi.1000411.
6. Rodriguez-Esteban R (2009) Biomedical Text Mining and Its Applications. PLoS Comput Biol 5: e1000597. doi:10.1371/journal.pcbi.1000597.
7. Kim D, Yu H (2011) Figure text extraction in biomedical literature. PLoS ONE 6: e15338. doi:10.1371/journal.pone.0015338.
8. Boyack K, Newman D, Duhon R, Klavans R, Patek M, et al. (2011) Clustering More than Two Million Biomedical Publications: Comparing the Accuracies of Nine Text-Based Similarity Approaches. PLoS ONE 6: e18029. doi:10.1371/journal.pone.0018029.
9. Kolluru B, Hawizy L, Murray-Rust P, Tsujii J, Ananiadou S (2011) Using workflows to explore and optimise named entity recognition for chemistry. PLoS ONE 6: e20181. doi:10.1371/journal.pone.0020181.
10. Hayasaka S, Hugenschmidt C, Laurienti P (2011) A network of genes, genetic disorders, and brain areas. PLoS ONE 6: e20907. doi:10.1371/journal.pone.0020907.
11. Roque F, Jensen P, Schmock H, Dalgaard M, Andreatta M, et al. (2011) Using electronic patient records to discover disease correlations and stratify patient cohorts. PLoS Comput Biol 7: e1002141. doi:10.1371/journal.pcbi.1002141.
12. Salathé M, Khandelwal S (2011) Assessing Vaccination Sentiments with Online Social Media: Implications for Infectious Disease Dynamics and Control. PLoS Comput Biol 7: e1002199. doi:10.1371/journal.pcbi.1002199.
13. Baran J, Gerner M, Haeussler M, Nenadic G, Bergman C (2011) pubmed2ensembl: a resource for mining the biological literature on genes. PLoS ONE 6: e24716. doi:10.1371/journal.pone.0024716.
14. Fisher R, Knowlton N, Brainard R, Caley J (2011) Differences among major taxa in the extent of ecological knowledge across four major ecosystems. PLoS ONE 6: e26556. doi:10.1371/journal.pone.0026556.
15. Hossain S, Gresock J, Edmonds Y, Helm R, Potts M, et al. (2012) Connecting the dots between PubMed abstracts. PLoS ONE 7: e29509. doi:10.1371/journal.pone.0029509.
16. Ebrahimpour M, Putniņš TJ, Berryman MJ, Allison A, Ng BW-H, et al. (2013) Automated authorship attribution using advanced signal classification techniques. PLoS ONE 8: e54998. doi:10.1371/journal.pone.0054998.
17. Acerbi A, Lampos V, Garnett P, Bentley RA (2013) The Expression of Emotions in 20th Century Books. PLoS ONE 8: e59030. doi:10.1371/journal.pone.0059030.
18. Groza T, Hunter J, Zankl A (2013) Mining Skeletal Phenotype Descriptions from Scientific Literature. PLoS ONE 8: e55656. doi:10.1371/journal.pone.0055656.
19. Seltmann KC, Pénzes Z, Yoder MJ, Bertone MA, Deans AR (2013) Utilizing Descriptive Statements from the Biodiversity Heritage Library to Expand the Hymenoptera Anatomy Ontology. PLoS ONE 8: e55674. doi:10.1371/journal.pone.0055674.
20. Van Landeghem S, Bjorne J, Wei C-H, Hakala K, Pyysalo S, et al. (2013) Large-Scale Event Extraction from Literature with Multi-Level Gene Normalization. PLoS ONE 8: e55814. doi:10.1371/journal.pone.0055814.
21. Liu H, Hunter L, Keselj V, Verspoor K (2013) Approximate Subgraph Matching-Based Literature Mining for Biomedical Events and Relations. PLoS ONE 8: e60954. doi:10.1371/journal.pone.0060954.
22. Davis A, Wiegers T, Johnson R, Lay J, Lennon-Hopkins K, et al. (2013) Text mining effectively scores and ranks the literature for improving chemical-gene-disease curation at the Comparative Toxicogenomics Database. PLoS ONE 8: e58201. doi:10.1371/journal.pone.0058201.
23. Bergman CM (2012) Why Are There So Few Efforts to Text Mine the Open Access Subset of PubMed Central? http://caseybergman.wordpress.com/2012/03/02/why-are-there-so-few-efforts-to-text-mine-the-open-access-subset-of-pubmed-central/.

Accelerating Your Science with arXiv and Google Scholar

As part of my recent conversion to using arXiv, I've been struck by how well posting preprints to arXiv synergizes with Google Scholar. I've tried to make some of these points on Twitter and elsewhere, but I thought I'd summarize here what I see as a very powerful approach to accelerating Open Science using arXiv and several features of the Google Scholar toolkit. Part of the motivation for writing this post is that I've made this same pitch to several of my colleagues and wanted to be able to point them to a coherent version of the argument, which might be of use to others as well.

A couple of preliminaries. First, the main point of this post is not to convince people to post preprints to arXiv. The benefits of preprinting on arXiv are manifold (early availability of results, allowing others to build on your work sooner, prepublication feedback on your manuscript, feedback from many eyes rather than just 2-3 reviewers, availability of the manuscript in an open access format, a mechanism to establish scientific priority, the opportunity to publicize your work on blogs/Twitter, and a longer window over which your work can accrue citations) and have been ably summarized elsewhere. This post is specifically about how one can get the most out of preprinting on arXiv by using Google Scholar tools.

Secondly, it is important to make sure people are aware of two relatively recent developments in the Google Scholar toolkit beyond the basic Google Scholar search functionality — namely, Google Scholar Citations and Google Scholar Updates. Google Scholar Citations allows users to build a personal profile of their publications, which draws in citation data from the Google Scholar database, allowing you to “check who is citing your publications, graph citations over time, and compute several citation metrics”, which also will “appear in Google Scholar results when people search for your name.” While Google Scholar Citations has been around for a little over a year now, I often find that many Scientists are either not aware that it exists, or have not activated their profile yet, even though it is scarily easy to set up. Another more recent feature available for those with active Google Scholar Citations profiles is called Google Scholar Updates, a tool that can analyze “your articles (as identified in your Scholar profile), scan the entire web looking for new articles relevant to your research, and then show you the most relevant articles when you visit Scholar”. As others have commented, Google Scholar Updates provides a big step forward in sifting through the scientific literature, since it provides a tailored set of articles delivered to your browser based on your previous publication record.

With these preliminaries in mind, what I want to discuss now is how Google Scholar plays so well with preprints on arXiv to accelerate science when done in the open. By posting preprints to arXiv and activating your Google Scholar Citations profile, you immediately gain several advantages, including the following:

  1. arXiv preprints are rapidly indexed by Google Scholar (within 1-2 days in my experience) and thus can be discovered easily by others using a standard Google Scholar search.
  2. arXiv preprints are listed in your Google Scholar profile, so when people browse your profile for your most recent papers they will find arXiv preprints at the top of the list (e.g. see Graham Coop’s Google Scholar profile here).
  3. Citations to your arXiv preprints are automatically updated in your Google Scholar profile, allowing you to see who is citing your most recent work.
  4. References included in your arXiv preprints will be indexed by Google Scholar and linked to citations in other people’s Google Scholar profiles, allowing them to find your arXiv preprint via citations to their work.
  5. Inclusion of an arXiv preprint in your Google Scholar profile allows Google Scholar Updates to provide better recommendations for what you should read, which is particularly important when you are moving into a new area of research that you have not previously published on.
  6. [Update June 14, 2013] Once Google Scholar has indexed your preprint on arXiv it will automatically generate a set of Related Articles, which you can browse to identify previously published work related to your manuscript.  This is especially useful at the preprint stage, since you can incorporate related articles you may have missed before submission or during revision.

I have probably overlooked other benefits of the synergy between these two technologies, since they are only dawning on me as I become more familiar with these symbiotic scholarly tools myself. What's abundantly clear to me at this stage, though, is that embracing Open Science by posting preprints to arXiv and using Google Scholar puts you at a fairly substantial competitive advantage in terms of your scholarship, in ways that are simply not possible using the classical approach to publishing in biology.

Suggesting Reviewers in the Era of arXiv and Twitter

Along with many others in the evolutionary genetics community, I’ve recently converted to using arXiv as a preprint server for new papers from my lab. In so doing, I’ve confronted an unexpected ethical question concerning pre-printing and the use of social media, which I was hoping to generate some discussion about as this practice becomes more common in the scientific community. The question concerns the suggestion of reviewers for a journal submission of a paper that has previously been submitted to arXiv and then subsequently discussed on social media platforms like Twitter. Specifically put, the question is: is it ethical to suggest reviewers for a journal submission based on tweets about your arXiv preprint?

To see how this ethical issue arises, I’ll first describe my current workflow for submitting to arXiv and publicizing it on Twitter. Then, I’ll propose an alternative that might be considered to be “gaming” the system, and discuss precedents in the pre-social media world that might inform the resolution of this issue.

My current workflow for submission to arXiv and announcement on Twitter is as follows:

  1. submit manuscript to a journal with suggested reviewers based on personal judgement;
  2. deposit the same version of the manuscript that was submitted to journal in arXiv;
  3. wait until arXiv submission is live and then tweet links to the arXiv preprint.

From doing this a few times (as well as benefiting from additional Twitter exposure via Haldane's Sieve), I've realized that there can often be fairly substantive feedback about an arXiv submission on Twitter, in the form of who (re)tweets links to it and what people say about the manuscript. It doesn't take much thought to realize that this information could be used to influence a journal submission, in the form of which reviewers to suggest or oppose, using an alternative workflow:

  1. submit manuscript to arXiv;
  2. wait until arXiv submission is live and then tweet about it;
  3. monitor and assimilate feedback from Twitter;
  4. submit manuscript to journal with suggested and opposed reviewers based on Twitter activity.

This second workflow incidentally also arises under the first workflow if your initial journal submission is rejected, since there would naturally be a time lag in which it would be difficult to fully ignore activity on Twitter about an arXiv submission.

Now, I want to be clear that I haven't used the second workflow and don't intend to (yet), since I have not fully decided whether it is an ethical approach to suggesting reviewers. Nevertheless, I lean towards the view that it is no more or less ethical than the current mechanisms of selecting suggested reviewers based on: (1) perceived allies/rivals with relevant expertise or (2) informal feedback on the work in question presented at meetings.

In the former case of using who you perceive to be for or against your work, you are relying on personal experience and subjective opinions about researchers in your field, both good and bad, to inform your choice of suggested or opposed reviewers. This is in some sense qualitatively no different from using information on Twitter prior to journal submission, except that it is based on a closed network using past information, rather than an open network using information specific to the piece of work in question. The latter case of suggesting reviewers based on feedback from meeting presentations is perhaps more similar to the matter at hand, and I suspect would be considered by most scientists to be a perfectly valid mechanism to suggest or oppose reviewers for a journal submission.

Now, of course I recognize that suggested reviewers are just that, and editors can use or ignore these suggestions as they wish, so this issue may in fact be moot. However, based on my experience, suggested reviewers are indeed frequently used by editors (if not, why would they be there?). Thus resolving whether smoking out opinions on Twitter is considered “fair play” is probably something the scientific community should consider more thoroughly in the near future, and I’d be happy to hear what other folks think about this in the comments below.

On the Preservation of Published Bioinformatics Code on Github

A few months back I posted a quick analysis of trends in where bioinformaticians choose to host their source code. A clear trend emerging in the bioinformatics community is to use github as the primary repository of bioinformatics code in published papers.  While I am a big fan of github and I support its widespread adoption, in that post I noted my concerns about the ease with which an individual can delete a published repository. In contrast to SourceForge, where it is extremely difficult to delete a repository once files have been released and this can only be done by SourceForge itself, deleting a repository on github takes only a few seconds and can be done (accidentally or intentionally) by the user who created the repository.

Just to see how easy this is, I’ve copied the process for deleting a repository on github here:

  • Go to the repo's admin page
  • Click "Delete this repository"
  • Read the warnings and enter the name of the repository you want to delete
  • Click "I understand the consequences, delete this repository"

Given the increasing use of github in publications, I feel the issue of repository deletion needs to be discussed more by scientists and publishers in the context of the long-term maintenance of published code. The reason I see this as important is that most github repositories are published via individual user accounts, and thus only one person holds the keys to preservation of the published code. Furthermore, I suspect funders, editors, publishers and (most) PIs have no idea how easy it is under the current model to delete published code. Call me a bit paranoid, but I see it as my responsibility as a PI to ensure the long-term preservation of published code, since I'm the one who signs off on data/resource plans in grants and final reports. Better to be safe than sorry, right?

On this note, I was pleased to see a retweet in my stream this week (via C. Titus Brown) concerning news that the journal Computers & Geosciences has adopted an official policy for hosting published code on github:

The mechanism that Computers & Geosciences has adopted to ensure long-term preservation of code published in the journal is very simple: the editor forks code submitted by a github user into a journal organization (note: a similar idea was suggested independently by Andrew Perry in the comments to my previous post). As clearly stated in the github repository deletion mechanism, "Deleting a private repo will delete all forks of the repo. Deleting a public repo will not." Thus, once Computers & Geosciences has forked the code, the risk to the author, journal and community of a single point of failure is substantially reduced, with very little overhead for authors or publishers.
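
For anyone wanting to automate this kind of archival forking, a rough sketch using the GitHub REST API fork endpoint (which accepts an organization to receive the fork) might look like the following; the token, organization name and repository list are placeholders:

```python
import requests

TOKEN = "..."                  # personal access token allowed to fork into the org
ORG = "BioinformaticsArchive"  # organization that will hold the archival forks

# repositories named in published papers, as "owner/repo" strings (examples only)
published_repos = ["someauthor/sometool"]

for full_name in published_repos:
    # POST /repos/{owner}/{repo}/forks with an "organization" field forks the
    # repository into that organization rather than into the authenticated user
    r = requests.post(
        f"https://api.github.com/repos/{full_name}/forks",
        headers={"Authorization": f"token {TOKEN}"},
        json={"organization": ORG},
    )
    print(full_name, r.status_code)  # 202 means GitHub has queued the fork
```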

So what about the many other journals that have no such digital preservation policy but currently publish papers with bioinformatics code in github? Well, as a stopgap measure until other journals get on board with similar policies (PLOS & BMC, please lead the way!), I’ve taken the initiative to create a github organization called BioinformaticsArchive to serve this function. Currently, I’ve forked code for all but one of the 64 publications with github URLs in their PubMed record. One of the scary/interesting things to observe from this endeavor is just how fragile the current situation is. Of the 63 repositories I’ve forked, about 50% (n=31) had not been previously forked by any other user on github and could have been easily deleted, with consequent loss to the scientific community.

I am aware (thanks to Marc Robinson-Rechavi) that there are many more published github repositories cited in the full text of articles (including two from our lab), which I will endeavor to dig out and add to this archive as soon as possible. If anyone else would like to help out with this endeavor, or knows of published repositories that should be included, send me an email or tweet and I'll add them to the archive. Comments on how to improve the current state of preservation of published bioinformatics code on github, and on what can be learned from Computers & Geosciences' new model policy, are most welcome!

Why You Should Reject the “Rejection Improves Impact” Meme


Over the last two weeks, a meme has been making the rounds in the scientific twittersphere that goes something like "Rejection of a scientific manuscript improves its eventual impact". This idea is based on a recent analysis of patterns of manuscript submission reported in Science by Calcagno et al., which has been actively touted in the scientific press and seems to have touched a nerve with many scientists.

Nature News reported on this article on the first day of its publication (11 Oct 2012), with the statement that "papers published after having first been rejected elsewhere receive significantly more citations on average than ones accepted on first submission" (emphasis mine). The Scientist led its piece on the same day, entitled "The Benefits of Rejection", with the claim that "Chances are, if a researcher resubmits her work to another journal, it will be cited more often". Science Insider led the next day with the claim that "Rejection before publication is rare, and for those who are forced to revise and resubmit, the process will boost your citation record". Influential science media figure Ed Yong tweeted "What doesn't kill you makes you stronger – papers get more citations if they were initially rejected". The message from the scientific media is clear: submitting your papers to selective journals and having them rejected is ultimately worth it, since you'll get more citations when they are eventually published somewhere lower down the scientific publishing food chain.

I will take on faith that the primary result of Calcagno et al. underlying this meme is sound, since it has been vetted by the highest standard of editorial and peer review at Science magazine. However, I note that it is not possible to independently verify this result, since the raw data for the analysis were not made available at the time of publication (contravening Science's "Making Data Maximally Available Policy") and have not been made available even after being requested. What I want to explore here is why this meme is being so uncritically propagated in the scientific press and twittersphere.

As succinctly noted by Joe Pickrell, anyone who takes even a cursory look at the basis for this claim would see that it is at best a weak effect*, and is clearly being overblown by the media and scientists alike.

Taken at face value, the way I read this graph is that papers that are rejected and then published elsewhere have a median value of ~0.95 citations, whereas papers accepted by the first journal they are submitted to have a median value of ~0.90 citations. Although not explicitly stated in the figure legend or in the main text, I assume these results are on a natural log scale, since, based on the font and layout, this plot was most likely made in R and the natural scale is the default in R (also, the authors refer to the natural scale in a different figure earlier in the text). Thus, the median number of extra citations per article that rejection may provide an author is on the order of ~0.1. Even if this result is on the log10 scale, the difference translates to a boost of less than one citation. While statistically significant, this can hardly be described as a "significant increase" in citation. Still excited?
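
To make the back-of-envelope arithmetic explicit, here is a quick sketch, using my approximate readings of the two medians from the figure, of what that difference translates to on either scale:

```python
import math

# approximate median log-citation values read off the figure
rejected_then_published = 0.95
accepted_first_time = 0.90

# if the axis is natural log, back-transform to citation counts
natural_boost = math.exp(rejected_then_published) - math.exp(accepted_first_time)
print(f"natural log scale: ~{natural_boost:.2f} extra citations")  # ~0.13

# if the axis is log10, the boost is still less than one citation
log10_boost = 10 ** rejected_then_published - 10 ** accepted_first_time
print(f"log10 scale: ~{log10_boost:.2f} extra citations")          # ~0.97
```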

More importantly, the analysis of the effects of rejection on citation is univariate and ignores most other possible confounding explanatory variables. It is easy to imagine a large number of confounding effects that could produce this weak difference (number of reviews obtained, choice of original and final journals, number of authors, rejection rate/citation differences among disciplines or subdisciplines, etc.). In fact, in panel B of the same figure 4, the authors show a stronger effect of changing discipline on the number of citations of resubmitted manuscripts. Why a deeper multivariate analysis was not performed to back up the headline claim that "rejection improves impact" is hard to understand from a critical perspective. [UPDATE 26/10/2012: Bala Iyengar pointed out to me a page on the authors' website that discusses the effects of controlling for year and publishing journal on the citation effect, which led me to re-read the paper and supplemental materials more closely and see that these two factors are in fact controlled for in the main analysis of the paper. No other possible confounding factors are controlled for, however.]

So what is going on here? Why did Science allow such a weak effect with a relatively superficial analysis to be published in one of the supposedly most selective journals? Why are major science media outlets pushing this incredibly small boost in citations that is (possibly) associated with rejection? Likewise, why are scientists so uncritically posting links to the Nature and Scientist news pieces and repeating the "Rejection Improves Impact" meme?

I believe the answer to the first two questions is clear: Nature and Science have a vested interest in making the case that it is in the best interest of scientists to submit their most important work to (their) highly selective journals and risk having it be rejected. This gives Nature and Science first crack at selecting the best science and serves to maintain their hegemony in the scientific publishing marketplace. If this interpretation is true, it is an incredibly self-serving stance for Nature and Science to take, and one that may backfire since, on the whole, scientists are not stupid people who blindly accept nonsense. More importantly, using the pages of Science and Nature as a marketing campaign to convince scientists to submit their work to these journals risks their credibility as arbiters of "truth". If Science and Nature go so far as to publish and hype weak, self-serving scientometric effects to get us to submit our work there, what's to say they would not do the same for actual scientific results?

But why are scientists taking the bait on this one? This is more difficult to understand, but most likely has to do with the fact that most people repeating this meme have not read the paper. Topsy records over 700 tweets linking to the Nature News piece and over 150 linking to The Scientist piece, but only ~10 linking to the original article in Science. Taken at face value, roughly 80-fold more scientists are reading the news about this article than are reading the article itself. To be fair, this is due in part to the fact that the article is not open access and sits behind a paywall, whereas the news pieces are freely available**. But this is only the proximal cause. The ultimate cause is likely that many scientists are happy to receive (uncritically, it seems) any justification, however tenuous, for continuing to play the high-impact-factor journal sweepstakes. Now we have a scientifically valid reason to take the risk of being rejected by top-tier journals, even if it doesn't pay off. Right? Right?

The real shame in the “Rejection Improves Impact” spin is that an important take-home message of Calcagno et al. is that the vast majority of papers (>75%) are published in the first journal to which they are submitted.  As a scientific community we should continue to maintain and improve this trend, selecting the appropriate home for our work on initial submission. Justifying pipe-dreams that waste precious time based on self-serving spin that benefits the closed-access publishing industry should be firmly: Rejected.

Don’t worry, it’s probably in the best interest of Science and Nature that you believe this meme.

* To be fair, Science Insider does acknowledge that the effect is weak: “previously rejected papers had a slight bump in the number of times they were cited by other papers” (emphasis mine).

** Following a link available on the author’s website, you can access this article for free here.

References
Calcagno, V., Demoinet, E., Gollner, K., Guidi, L., Ruths, D., & de Mazancourt, C. (2012). Flows of Research Manuscripts Among Scientific Journals Reveal Hidden Submission Patterns. Science. DOI: 10.1126/science.1227833


The Cost to Science of the ENCODE Publication Embargo

The big buzz in the genomics twittersphere today is the release of over 30 publications on the human ENCODE project. This is a heroic achievement, both in terms of science and publishing, with many groundbreaking discoveries in biology and pioneering developments in publishing to be found in this set of papers. It is a triumph that all of these papers are freely available to read, and much is being said elsewhere in the blogosphere about the virtues of this project and the lessons learned from the publication of these data. I’d like to pick up here on an important point made by Daniel MacArthur in his post about the delays in the publication of these landmark papers that have arisen from the common practice of embargoing papers in genomics. To be clear, I am not talking about embargoing the use of data (which is also problematic), but embargoing the release of manuscripts that have been accepted for publication after peer review.

MacArthur writes:

Many of us in the genomics community were aware of the progress the [ENCODE] project had been making via conference presentations and hallway conversations with participants. However, many other researchers who might have benefited from early access to the ENCODE data simply weren’t aware of its existence until today’s dramatic announcement – and as a result, these people are 6-12 months behind in their analyses.

It is important to emphasize that these publication delays are by design, and are driven primarily by the journals that set the publication schedules for major genomics papers. I saw first-hand how Nature sets the agenda for major genomics papers and their associated companion papers as part of the Drosophila 12 Genomes Project. This insider’s view left a distinctly bad taste in my mouth about how much control a single journal has over some of the most important community resource papers that are published in Biology.  To give more people insight into this process, I am posting the agenda set by Nature for publication (in reverse chronological order) of the main Drosophila 12 Genomes paper, which went something like this:

7 Nov 2007: papers are published, embargo lifted on main/companion papers
28 Sept 2007: papers must be in production
21 Sept 2007: revised versions of papers received
17 Aug 2007: reviews are returned to authors
27 Jul 2007: papers are submitted

Not only was acceptance of the manuscript essentially assumed by the Nature editorial staff, the entire timeline was spelled out in advance, with an embargo built in to the process from the outset. Seeing this process unfold first hand was shocking to me, and has made me very skeptical of the power that the major journals have to dictate terms about how we, and other journals, publish our work.

Personally, I cannot see how this embargo system serves anyone in science other than the major journals. There is no valid scientific reason that major genome papers and their companions cannot be made available as online accepted preprints, as is now standard practice in the publishing industry. As scientists, we have a duty to ensure that the science we produce is released to the general public and the community of scientists as rapidly and openly as possible. We do not have a duty to serve the agenda of a journal to increase its cachet or revenue stream. I accept that we need to tolerate delays due to quality control via the peer review and publication process. But the delays due to the normal peer review process are bad enough, as ably discussed recently by Leslie Vosshall. Why on earth would we accept that journals build further unnecessary delays into the publication process?

This of course leads to the pertinent question: how harmful is this system of embargoes? Well, we can put an upper estimate on this* pretty easily from the submission/acceptance dates of the main and companion ENCODE papers (see table below). In general, most ENCODE papers were embargoed for a minimum of 2 months, but some were embargoed for nearly 7 months. Ignoring (unfairly) the direct impact that these delays may have had on the careers of the PhD students and post-docs involved, something on the order of 112 months of access to these important papers has been lost to all scientists by this single embargo. Put another way, up to* nearly 10 years of access time to these papers has been collectively lost to science because of the ENCODE embargo. To the extent that these papers are crucial for understanding the human genome, and the consequences this knowledge has for human health, this decade lost to humanity is clearly unacceptable. Let us hope that the ENCODE project puts an end to the era of journal-mandated embargoes in genomics.

DOI | Date received | Date accepted | Date published | Months in review | Months in embargo
nature11247 | 24-Nov-11 | 29-May-12 | 05-Sep-12 | 6.0 | 3.2
nature11233 | 10-Dec-11 | 15-May-12 | 05-Sep-12 | 5.1 | 3.6
nature11232 | 15-Dec-11 | 15-May-12 | 05-Sep-12 | 4.9 | 3.6
nature11212 | 11-Dec-11 | 10-May-12 | 05-Sep-12 | 4.9 | 3.8
nature11245 | 09-Dec-11 | 22-May-12 | 05-Sep-12 | 5.3 | 3.4
nature11279 | 09-Dec-11 | 01-Jun-12 | 05-Sep-12 | 5.6 | 3.1
gr.134445.111 | 06-Nov-11 | 07-Feb-12 | 05-Sep-12 | 3.0 | 6.8
gr.134957.111 | 16-Nov-11 | 01-May-12 | 05-Sep-12 | 5.4 | 4.1
gr.133553.111 | 17-Oct-11 | 05-Jun-12 | 05-Sep-12 | 7.5 | 3.0
gr.134767.111 | 11-Nov-11 | 03-May-12 | 05-Sep-12 | 5.6 | 4.0
gr.136838.111 | 21-Dec-11 | 30-Apr-12 | 05-Sep-12 | 4.2 | 4.1
gr.127761.111 | 16-Jun-11 | 27-Mar-12 | 05-Sep-12 | 9.2 | 5.2
gr.136101.111 | 09-Dec-11 | 30-Apr-12 | 05-Sep-12 | 4.6 | 4.1
gr.134890.111 | 23-Nov-11 | 10-May-12 | 05-Sep-12 | 5.5 | 3.8
gr.134478.111 | 07-Nov-11 | 01-May-12 | 05-Sep-12 | 5.7 | 4.1
gr.135129.111 | 21-Nov-11 | 08-Jun-12 | 05-Sep-12 | 6.5 | 2.9
gr.127712.111 | 15-Jun-11 | 27-Mar-12 | 05-Sep-12 | 9.2 | 5.2
gr.136366.111 | 13-Dec-11 | 04-May-12 | 05-Sep-12 | 4.6 | 4.0
gr.136127.111 | 16-Dec-11 | 24-May-12 | 05-Sep-12 | 5.2 | 3.4
gr.135350.111 | 25-Nov-11 | 22-May-12 | 05-Sep-12 | 5.8 | 3.4
gr.132159.111 | 17-Sep-11 | 07-Mar-12 | 05-Sep-12 | 5.5 | 5.9
gr.137323.112 | 05-Jan-12 | 02-May-12 | 05-Sep-12 | 3.8 | 4.1
gr.139105.112 | 25-Mar-12 | 07-Jun-12 | 05-Sep-12 | 2.4 | 2.9
gr.136184.111 | 10-Dec-11 | 10-May-12 | 05-Sep-12 | 4.9 | 3.8
gb-2012-13-9-r48 | 21-Dec-11 | 08-Jun-12 | 05-Sep-12 | 5.5 | 2.9
gb-2012-13-9-r49 | 28-Mar-12 | 08-Jun-12 | 05-Sep-12 | 2.3 | 2.9
gb-2012-13-9-r50 | 04-Dec-11 | 18-Jun-12 | 05-Sep-12 | 6.4 | 2.5
gb-2012-13-9-r51 | 23-Mar-12 | 25-Jun-12 | 05-Sep-12 | 3.0 | 2.3
gb-2012-13-9-r52 | 09-Mar-12 | 25-May-12 | 05-Sep-12 | 2.5 | 3.3
gb-2012-13-9-r53 | 29-Mar-12 | 19-Jun-12 | 05-Sep-12 | 2.6 | 2.5
Min | | | | 2.3 | 2.3
Max | | | | 9.2 | 6.8
Avg | | | | 5.1 | 3.7
Sum | | | | 152.7 | 112.1
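
The months-in-embargo figures above are simply the gaps between the acceptance and publication dates. As a sketch of how such numbers can be tallied, the following assumes a ~31-day month, which appears to reproduce the values in the table for two example papers:

```python
from datetime import datetime

def months_between(start, end, fmt="%d-%b-%y"):
    """Approximate difference in months, assuming a ~31-day month."""
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.days / 31

# (date accepted, date published) pairs for two example rows from the table
examples = {
    "nature11247": ("29-May-12", "05-Sep-12"),
    "gr.134445.111": ("07-Feb-12", "05-Sep-12"),
}

for doi, (accepted, published) in examples.items():
    print(doi, round(months_between(accepted, published), 1))
# nature11247 3.2
# gr.134445.111 6.8
# summing this quantity over all 30 papers gives the ~112 months quoted above
```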

Footnote:

* Based on a conversation on Twitter with Chris Cole, I've revised this estimate to reflect an upper bound, rather than a point estimate, of the time lost to science.

Announcing the PLoS Text Mining Collection

Based on a spur-of-the-moment tweet earlier this year, and a positive follow-up from Theo Bloom, I'm very happy to announce that PLoS has now put the wheels in motion to develop a Collection of articles that highlight the importance of Text Mining research. The Call for Papers was announced today, and I'm very excited to see this effort highlight the synergy between Open Access, Altmetrics and Text Mining research. I'm particularly keen to see someone take the reins on writing a good description of the API for PLoS (and other publishers). And a good lesson to all: be careful to watch what you tweet!

The Call for Papers below is cross-posted at the PLoS Blog.

Call for Papers: PLoS Text Mining Collection

The Public Library of Science (PLoS) seeks submissions in the broad field of text-mining research for a collection to be launched across all of its journals in 2013. All submissions submitted before October 30th, 2012 will be considered for the launch of the collection. Please read the following post for further information on how to submit your article.

The scientific literature is exponentially increasing in size, with thousands of new papers published every day. Few researchers are able to keep track of all new publications, even in their own field, reducing the quality of scholarship and leading to undesirable outcomes like redundant publication. While social media and expert recommendation systems provide partial solutions to the problem of keeping up with the literature, systematically identifying relevant articles and extracting key information from them can only come through automated text-mining technologies.

Research in text mining has made incredible advances over the last decade, driven through community challenges and increasingly sophisticated computational technologies. However, the promise of text mining to accelerate and enhance research largely has not yet been fulfilled, primarily since the vast majority of the published scientific literature is not published under an Open Access model. As Open Access publishing yields an ever-growing archive of unrestricted full-text articles, text mining will play an increasingly important role in drilling down to essential research and data in scientific literature in the 21st century scholarly landscape.

As part of its commitment to realizing the maximal utility of Open Access literature, PLoS is launching a collection of articles dedicated to highlighting the importance of research in the area of text mining. The launch of this Text Mining Collection complements related PLoS Collections on Open Access and Altmetrics (forthcoming), as well as the recent release of the PLoS Application Programming Interface, which provides an open API to PLoS journal content.

As part of this Text Mining Collection, we are making a call for high quality submissions that advance the field of text-mining research, including:

  • New methods for the retrieval or extraction of published scientific facts
  • Large-scale analysis of data extracted from the scientific literature
  • New interfaces for accessing the scientific literature
  • Semantic enrichment of scientific articles
  • Linking the literature to scientific databases
  • Application of text mining to database curation
  • Approaches for integrating text mining into workflows
  • Resources (ontologies, corpora) to improve text mining research

Please note that all manuscripts submitted before October 30th, 2012 will be considered for the launch of the collection (expected early 2013); submissions after this date will still be considered for the collection, but may not appear in the collection at launch.

Submission Guidelines
If you wish to submit your research to the PLoS Text Mining Collection, please consider the following when preparing your manuscript:

  • All articles must adhere to the submission guidelines of the PLoS journal to which you submit.
  • Standard PLoS policies and relevant publication fees apply to all submissions.
  • Submission to any PLoS journal as part of the Text Mining Collection does not guarantee publication.

When you are ready to submit your manuscript to the collection, please log in to the relevant PLoS manuscript submission system and mention the Collection’s name in your cover letter. This will ensure that the staff is aware of your submission to the Collection. The submission systems can be found on the individual journal websites.

Please contact Samuel Moore (smoore@plos.org) if you would like further information about how to submit your research to the PLoS Text Mining Collection.

Organizers
Casey Bergman (University of Manchester)
Lawrence Hunter (University of Colorado-Denver)
Andrey Rzhetsky (University of Chicago)


Did Finishing the Drosophila Genome Legitimize Open Access Publishing?

I’m currently reading Glyn Moody‘s (2003) “Digital Code of Life: How Bioinformatics is Revolutionizing Science, Medicine, and Business” and greatly enjoying the writing, as well as the whirlwind summary of the history of Bioinformatics and the (Human) Genome Project(s). Most of what Moody says that I am familiar with is quite accurate, and his scholarship is thorough, so I find his telling of the story compelling. One claim in this book that strikes me as new and curious concerns the sequencing of the Drosophila melanogaster genome, more precisely the “finishing” of this genome, and its impact on the legitimacy of Open Access publishing.

The sequencing of D. melanogaster was done as a collaboration between the Berkeley Drosophila Genome Project and Celera, as a test case to prove that whole-genome shotgun sequencing could be applied to large animal genomes. I won’t go into the details here, but it is widely accepted that the Adams et al. (2000) and Myers et al. (2000) papers in Science demonstrated the feasibility of whole-genome shotgun sequencing, while it was a lesser-known paper by Celniker et al. (2002) in Genome Biology, reporting the “finished” D. melanogaster genome, that proved the accuracy of whole-genome shotgun assembly. No controversy here.

More debatable is what Moody goes on to write about the Celniker et al. (2002) paper:

This was an important paper, then, and one that had a significance that went beyond its undoubted scientific value. For it appeared neither in Science, as the previous Drosophila papers had done, nor in Nature, the obvious alternative. Instead, it was published in Genome Biology. This describes itself as “a journal, delivered over the web.” That is, the Web is the primary medium, with the printed version offering a kind of summary of the online content in a convenient portable form. The originality of Genome Biology does not end there: all of its main research articles are available free online.

A description then follows of the history and virtues of PubMed Central and the earliest Open Access biomedical publishers BioMed Central and PLoS. Moody (emphasis mine) then returns to the issue of:

…whether a journal operating on [Open Access] principles could attract top-ranked scientists. This question was answered definitively in the affirmative with the announcement and analysis of the finished Drosophila sequence in January 2003. This key opening paper’s list of authors included not only [Craig] Venter, [Gene] Myers, and [Mark] Adams, but equally stellar representatives of the academic world of Science, such as Gerald Rubin, the boss of the fruit fly genome project, and Richard Gibbs, head of sequencing at Baylor College. Alongside this paper there were no less than nine other weighty contributions, including one on Apollo, a new tool for viewing and editing sequence annotation. For its own Drosophila extravaganza of March 2000, Science had marshalled seven papers in total. Clearly, Genome Biology had arrived, and with it a new commercial publishing model based on the latest way of showing the data.

This passage resonated with me since I was working at the BDGP when this special issue on the finishing of the Drosophila genome was published in Genome Biology, and I was personally introduced to Open Access publishing through this event. I recall Rubin walking the hallways of Building 64 on his periodic visits, promoting this idea and motivating us all to work hard to get our papers together by the end of 2002 for this unique opportunity. I also remember lugging around stacks of the printed issue at the Fly meeting in Chicago in 2003, plying unsuspecting punters with a copy of a journal most people had never heard of, and having some of my first conversations about Open Access as a consequence.

What Moody doesn’t capture in this telling is that Rubin’s decision to publish in Genome Biology almost surely owes itself to the influence that Mike Eisen had on Rubin and others in the Berkeley genomics community at the time. Eisen and Rubin had recently collaborated on a paper, Eisen had made inroads in Berkeley on the Open Access issue by actively recruiting signatories for the PLoS open letter the year before, and Eisen himself had published his first Open Access paper in Genome Biology in October 2002. Clearly, the idea of publishing in Open Access journals, and in Genome Biology in particular, was in the air at the time, so it may not have been as bold a step for Rubin to take as Moody implies.

Nevertheless, the point may have some truth to it, and I think it is interesting to consider whether the long-standing open-data philosophy of the Drosophila genetics community that led to the Genome Biology special issue was a key turning point in the widespread success of Open Access publishing over the next decade. Surely the movement would have taken off anyway at some point. But in late 2002, when the BioMed Central journals were the only place to publish gold Open Access articles, few people had tested the waters since the launch of the BMC journals in 2000. While we cannot replay the tape, Moody’s claim is plausible in my view, and it is interesting to ask whether widespread buy-in to Open Access publishing in biology might have been delayed if Rubin had not insisted that the efforts of the Berkeley Drosophila Genome Project be published under an Open Access model.

UPDATE 25 March 2012

After I tweeted this post, here is what Eisen and Moody had to say:

UPDATE 19 May 2012

It appears that the publication of another part of the Drosophila (meta)genome, its Wolbachia endosymbiont, played an important role in converting Jonathan Eisen to supporting Open Access. Read more here.