Incentivising open data & reproducible research through pre-publication private access to NGS data at EBI

Yesterday Ewan Birney posted a series of tweets expressing surprise that more people don’t take advantage of ENA’s programmatic access to submit and store next-generation sequencing (NGS) data at EBI, which I tried to respond to in broken Twitter English. This post attempts to clarify how I think ENA’s system could be improved in ways that would benefit both data archiving and reproducible research, and possibly increase uptake and sustainability of the service.

I’ve been a heavy consumer of NGS data from EBI for a couple of years, mainly thanks to their plain-vanilla fastq.gz downloads and clean REST interface for extracting NGS metadata. But I’ve only just recently gone through the process of submitting NGS data to ENA myself, first using their web portal and more recently taking advantage of REST-based programmatic access. Aside from the issue of how best to transfer many big files to EBI in an automatic way (which I’ve blogged about here), I’ve been quite impressed by how well-documented and efficient ENA’s NGS submission process is. For those who’ve had bad experiences submitting to SRA, I agree with Ewan that ENA provides a great service, and I’d suggest giving EBI a try.
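
To make the “clean REST interface” concrete, here is a minimal Python sketch of the kind of metadata query I mean. The endpoint, parameter names and field names are taken from ENA’s documented filereport service as I understand it, and may have changed since writing, so treat them as assumptions to check against the current ENA docs:

```python
from urllib.parse import urlencode

# ENA's "filereport" endpoint (assumed from ENA docs; verify before relying on it)
ENA_FILEREPORT = "https://www.ebi.ac.uk/ena/portal/api/filereport"

def filereport_url(accession, fields=("run_accession", "fastq_ftp")):
    """Build a URL that returns a tab-separated report of runs for an accession."""
    query = urlencode({
        "accession": accession,
        "result": "read_run",
        "fields": ",".join(fields),
    })
    return f"{ENA_FILEREPORT}?{query}"

def parse_filereport(tsv_text):
    """Parse the TSV response into one dict per run."""
    lines = [ln for ln in tsv_text.strip().splitlines() if ln]
    header = lines[0].split("\t")
    return [dict(zip(header, row.split("\t"))) for row in lines[1:]]
```

Fetching the resulting URL (e.g. with urllib.request.urlopen) returns a table whose fastq_ftp column points straight at the plain-vanilla fastq.gz files.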

In brief, the current ENA submission process entails:

  1. transfer of the user’s NGS data to EBI’s “dropbox”, which is basically a private storage area on EBI’s servers that requires user/password authentication (done by user);
  2. creation and submission of metadata files with information about runs and samples (done by user);
  3. validation of data/metadata and creation of accession numbers for the projects/experiments/samples/runs (done by EBI);
  4. conversion of the submitted NGS data to an EBI-formatted version, giving new IDs to each read and connecting the appropriate metadata to each NGS data file (done by EBI);
  5. public release of the accession-number-based annotated data (done by EBI on the user’s release date or after publication).
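
As an illustration of what step 2 looks like programmatically, here is a Python sketch that builds one of the metadata documents. The element names follow the SRA/ENA sample XML schema as I understand it, and the alias/taxon ID below are just examples, so validate against the current schema before submitting anything real:

```python
import xml.etree.ElementTree as ET

def sample_xml(alias, taxon_id, title):
    """Build a minimal SAMPLE_SET metadata document for an ENA submission."""
    root = ET.Element("SAMPLE_SET")
    sample = ET.SubElement(root, "SAMPLE", alias=alias)
    ET.SubElement(sample, "TITLE").text = title
    name = ET.SubElement(sample, "SAMPLE_NAME")
    ET.SubElement(name, "TAXON_ID").text = str(taxon_id)
    return ET.tostring(root, encoding="unicode")

# Example: a Drosophila melanogaster sample (NCBI taxon 7227)
print(sample_xml("dmel_sample_1", 7227, "D. melanogaster wild isolate"))
```

The submission itself is then an authenticated POST of this file (along with the study/experiment/run/submission XMLs) to your dropbox, which curl or any HTTP library can perform.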

Where I see the biggest room for improvement is in the “hupped” phase, when data has been submitted but remains private. During this phase, I can store data at EBI privately for up to two years, and thus keep a remote back-up of my data for free, which is great, but only in its original submitted format. I can’t, however, access the exact version of my data that will ultimately become public, i.e. via the REST interface, using what will be the published accession numbers, on data with converted read IDs. As a result, I can’t write pipelines that use the exact data that will be referenced in a paper, and thus I cannot fully verify that the results I publish can be reproduced by someone else. Nor can I “proof” what my submission looks like, so I have to wait until the submission is live to correct my data/metadata if they haven’t been converted as intended. As a workaround, I’ve been releasing data pre-publication, doing data checks and programming against the live data to ensure that my pipelines and results are reproducible. I suspect not all labs would be comfortable doing this, mainly for fear of getting scooped with their own data.

In experiencing ENA’s data submission system from the twin viewpoints of a data producer and consumer, I’ve had a few thoughts about how to improve the system that could also address the issue of wider community uptake. The first, and simplest, change I would suggest to EBI’s current service is to allow REST/browser access to a private, live version of the formatted NGS data/metadata during the “hupped” phase, with simple HTTP-based password authentication. This would allow users to submit and store their data privately, but also to have access to the “final” product prior to release. This small change could have many benefits, including:

  • incentivising submission of NGS data early in the life-cycle of a project rather than as an after-thought during publication,
  • reducing the risk of local data loss or failure to submit NGS data at the time of publication,
  • allowing distributed project partners to access big data files from a single, high-bandwidth, secure location,
  • allowing quality checks on final version of data/metadata prior to publication/data release, and
  • allowing analysis pipelines to use the final archived version of data/metadata, ensuring complete reproducibility and unified integration with other public datasets.
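
Technically there is nothing exotic about what I’m proposing: it is ordinary HTTP basic authentication layered over the existing REST interface. A Python sketch of the client side (the private URL below is hypothetical; no such endpoint currently exists):

```python
import base64
import urllib.request

def private_request(url, user, password):
    """Build a request for a hypothetical pre-release data URL, authenticated
    with the same user/password pair used for the submission dropbox."""
    req = urllib.request.Request(url)
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", f"Basic {token}")
    return req
```

Because accession numbers are assigned at submission time, pipelines written against such URLs would work unchanged on the day the data go public.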

A second change, which I suspect is more difficult to implement, would be to allow users to pay to store their data privately for longer than the fixed free period. I’d say two years is around the lower limit on the time from when data comes off a sequencer to a paper being published. Thus, I suspect there are many users who are reluctant to submit and store data at ENA prior to paper submission, since their data might be made public before they are ready to share it. But if users could pay a modest monthly/quarterly fee to store their data privately past the free period up until publication, this might encourage them to deposit early and gain the benefits of storing/checking/using the live data, without fear that their data will be released earlier than they would like. This change could also lead to a new, low-risk funding stream for EBI, since they would only be charging for extended private access to data that is already on disk.

The extended pay-for-privacy model works well for both the user and the community, and could ultimately encourage more early open data release. Paying users would benefit from replicated, offsite storage in publication-ready formats without fear of getting scooped, which should be a great help to the many users currently struggling with local NGS data storage. Reciprocally, the community benefits because contributors who pay for extended privacy end up supporting common infrastructure disproportionately more than those who release their data publicly early on. And since it becomes increasingly costly to keep your data private, there is ultimately an incentive to make your data public. This scheme would especially benefit preservation of the large amounts of usable data that currently go stale or never see the light of day because of delays or failures to write up, and thus never get submitted to ENA. And of course, once published, private data would be made openly available immediately, all in a well-formatted and curated manner that the community can benefit from. What’s not to like?

Thoughts on whether, or how, these half-baked ideas could be turned into reality are much appreciated in the comments below.

Simplifying Access to Paywalled Literature with Mobile Vouchers

Increasingly I read new scientific papers on a mobile device, often at home in the evening when I’m not on my university’s network. Most of the articles I read come from scientists on Twitter, Twitterbots or RSS feeds, which I try to read directly from my Twitter or RSS clients (Tweetbot and Feedly for iOS, respectively). Virtually every day, I hit paywalls trying to read non-open access papers from these sources, which aggravate me, waste my time, and require a variety of workarounds to (legally) access papers that differ depending on the publisher/journal.

For publishers that expose an obvious “Institutional login” option, I will typically try to log in using the UK Federation Shibboleth authentication system, which uses my university credentials. But Tweetbot and Feedly don’t store my Shibboleth user/pass, so for each article I either have to enter my user/pass manually, or open the page in Safari, where my Shibboleth user/pass are stored. This app switch breaks my flow and leads to tab proliferation, neither of which is optimal. Some journals that use an institutional login temporarily store my details for around a week so I don’t have to do this every time I read a paper, but I still find myself entering my details for the same journals over and over.

For journals that don’t have an institutional login option, or hide this option from plain view, I tend to switch from Twitter/RSS to my iPad’s Settings in order to log in to my university VPN. The VPN login on my iPad similarly does not store my password, requiring me to type in my university password over and over. This wouldn’t be such a big deal, but my university’s requirement of including one uppercase Egyptian hieroglyph and one lowercase Celtic rune makes entering my password with the iOS keyboard a hassle.

In going through this frustrating routine yet again today trying to access an article in Genetics, I stumbled on a nice feature that I hadn’t seen before called “Mobile Vouchers” that allows me to avoid this rigmarole in the future. As explained on the Genetics Mobile Voucher FAQ:

A voucher is a code that will tie your mobile device to your institution’s subscriptions. This voucher will grant you access to protected content while not on your institution’s network. Each mobile device must be vouched for individually and vouchers are only valid for the publisher for which it is issued.

Obtaining a voucher is super easy. If you are not on your university network, you first need to be logged into your VPN to obtain a voucher. Once on your university network, just visit http://www.genetics.org/voucher/get, enter your name/email address and then submit. This will issue a voucher that you can use immediately to authenticate your device (it will also email you with this information). Voilà, no paywalls for Genetics on your iPad for the next six months or so. In addition to decreasing frustration and increasing flow for scientists, I can see this technology being really useful for PhD students, postdocs and visiting scientists to retain access to the literature for a few months after the end of their positions.

I was surprised I hadn’t seen this before, since it eliminates one of my chronic annoyances as a consumer of the digital scientific literature. Maybe others would disagree, but I would say that publishers haven’t done a very good job of advertising this very useful feature. Googling around, I didn’t find much on mobile vouchers other than a 2011 SlideShare presentation from HighWire Press, which suggests the technology has been around for some time.

I also couldn’t find much information on which journals offer this service, but a few Google searches led me to the following list of publishers/journals that offer mobile vouchers. It appears that most of these journals use HighWire Press to serve their content, and that vouchers can operate at the publisher (e.g. Oxford University Press) or journal (e.g. Genetics, PNAS) scale. The OUP voucher is particularly useful since it covers Molecular Biology and Evolution and Bioinformatics, which (together with Genetics) are the journals where I hit paywalls most frequently. Since these vouchers do expire eventually, I thought it would be good to bookmark these links for future use and to highlight this very useful tech tip. Links to other publishers and any other information on mobile vouchers would be most welcome in the comments.

Oxford University Press
http://services.oxfordjournals.org/site/subscriptions/mobile-voucher-faq.xhtml

Royal Society
http://admincenter.royalsocietypublishing.org/cgi/voucher-use

Rockefeller Press
http://www.rupress.org/site/subscriptions/mobile-voucher-faq.xhtml

Lyell
http://www.lyellcollection.org/site/subscriptions/mobile-voucher-faq.xhtml

Sage
http://online.sagepub.com/site/subscriptions/mobile-voucher-faq.xhtml

BMJ
http://journals.bmj.com/site/subscriptions/mobile-voucher-faq.xhtml

AACR
http://www.aacrjournals.org/site/Access/mobile_vouchers.xhtml

Genetics
http://www.genetics.org/site/subscriptions/mobile-voucher-faq.xhtml

PNAS
http://www.pnas.org/site/subscriptions/mobile-voucher-faq.xhtml

JBC
http://www.jbc.org/site/subscriptions/mobile-voucher-faq.xhtml

Endocrine
http://www.eje-online.org/site/subscriptions/mobile-voucher-faq.xhtml

J. Neuroscience
http://www.jneurosci.org/site/subscriptions/mobile-voucher-faq.xhtml

GeoScienceWorld
http://www.geoscienceworld.org/site/subscriptions/mobile-voucher-faq.xhtml

Economic Geology
http://www.segweb.org/SEG/Publications/SEG/_Publications/Mobile_Vouchers.aspx

Multi-sample SNP calling circa 1994

Last November, when news of Fred Sanger‘s death was making its way around scientific circles, so too were many images of Sanger DNA sequencing reactions visualized as autoradiograms. These images brought back memories of a style of Sanger sequencing gel that I first saw in an undergraduate class on population genetics taught by Charles (“Chip”) Aquadro at Cornell University in the autumn of 1994, which left a deep impression on me. My personal photograph 51, if you will.

At the time, I was on course to become a high-school biology teacher, a plan that was scuppered by my introduction to the then-emerging field of molecular population genetics covered in Aquadro’s class. I distinctly remember Aquadro putting up a transparency on the overhead showing an image of a Sanger gel where the reactions for each of the four bases were run in sets that included every individual in the sample, allowing single nucleotide polymorphisms (“SNPs”) to be easily identified by eye. This image made an extremely strong impression on me, transforming the abstract A and a alleles typically discussed in population genetics into concrete molecular entities. Together with the rest of the material in Aquadro’s class, this image convinced me to pursue a career in evolutionary genetics.

I emailed Aquadro around that time last year to see if he had such an image digitized, and he said he’d try to dig one out. A few weeks ago he sent me the following image, which shows the state-of-the-art in multi-sample SNP calling circa 1994:

[Image: multi-sample Sanger sequencing gel (SangerSequencingGel-RosyDmel-Aquadro)]

Multi-sample Sanger sequencing gel of a fragment of the Drosophila melanogaster rosy (Xdh) gene (credit: Charles Aquadro). The first four lanes represent the four bases of the “reference” sequence, followed by four sets of lanes (one for each base) containing sequencing reactions for each individual in the sample. Notice how when a band is missing from a set for one individual, it is present in a different set for that same individual. This format allowed the position and identity of variable sites in a sample to be identified quickly, without having to read off the complete sequence for each individual.

For those of us who now perform multi-sample SNP calling at the whole-genome scale using something like an Illumina->BWA->SAMtools pipeline, it is sometimes hard to comprehend how far things have progressed technologically in the last 20 years.

Perhaps equally dramatic are the changes in the larger social and scientific value placed on the use of sequence analysis and the identification of variation in natural populations. At that time, the Aquadro lab was referred to in a friendly, if somewhat disparaging, way as the “Sequence and Think Lab” by others in the department (because “all they do in that lab is sequence and think”). As the identification of natural molecular variation in humans quickly becomes the basis for personalized medicine, and as next-generation sequencing is incorporated into more basic molecular biological techniques, it is impressive to see how quickly the “sequence and think” model has moved from a peripheral to a central role in modern biology.

Keeping Up with the Scientific Literature using Twitterbots: The FlyPapers Experiment

A year ago I created a simple “twitterbot” to stay on top of the Drosophila literature called FlyPapers, which tweets links to new abstracts in Pubmed and preprints in arXiv from a dedicated Twitter account (@fly_papers). While most ’bots on Twitter post spam or creative nonsense, an increasing number of people are exploring the use of twitterbots for more productive academic purposes. For example, Rod Page set up the @evoldir twitterbot way back in 2009 as an alternative to receiving email posts to the Evoldir mailing list, and likewise Gordon McNickle developed the @EcoLog_L twitterbot for the Ecolog-L mailing list. Similar to FlyPapers, others have established twitterbots for domain-specific literature feeds, such as @BioPapers for Quantitative Biology preprints on arXiv, @EcoEvoJournals for publications in the areas of Ecology & Evolution and @PlantEcologyBot for papers on Plant Ecology. More recently, Alberto Acerbi developed the @CultEvoBot to post links to blogs and new articles on the topic of cultural evolution. (I recommend reading posts by Rod, Gordon and Alberto for further insight into how and why they established these twitterbots.) One year in, I thought I’d summarize my thoughts on the FlyPapers experiment, and make good on a promise to describe my set-up in case others are interested.

First, a few words on my motivation for creating FlyPapers. I have been receiving a daily update of all papers in the area of Drosophila in one form or another for nearly 10 years. My philosophy is that it is relatively easy to keep up on a daily basis with what is being published, but it’s virtually impossible to catch up when you let the river of information flow for too long. I first started receiving daily email updates from NCBI, which cluttered up my inbox and often got buried. Then I migrated to using RSS on Google Reader, which led to a similar problem of many unread posts accumulating that needed to be marked as “read”. Ultimately, I realized that what I want from a personalized publication feed (a flow of links to articles that can be quickly scanned and clicked, but which requires no other action and can be ignored when I’m busy) is better suited to a Twitter client than an RSS reader. Moreover, in the spirit of “maximizing the value of your keystrokes”, it seemed that a feed that was useful for me might also be useful for others, and that Twitter was the natural medium to try sharing this feed, since many scientists already use Twitter to post links to papers. Thus FlyPapers was born.

Setting up FlyPapers was straightforward and required no specialist know-how. I first created a dedicated Twitter account with a “catchy” name. Next, I created an account with dlvr.it, which takes an RSS/Twitter/email feed as input and routes the output to the FlyPapers Twitter account. I then set up an RSS feed from NCBI based on a search for the term “Drosophila” and added this as a source to the dlvr.it route. Shortly thereafter, I added an RSS feed for preprints in arXiv using the same search term and added this to the same dlvr.it route. (Unfortunately, neither PeerJ Preprints nor bioRxiv currently have the ability to set up custom RSS feeds, and thus are not included in the FlyPapers stream.) NCBI and arXiv only push new articles once a day, and each article is posted automatically as a distinct tweet for ease of viewing, bookmarking and sharing. The only gotcha I experienced in setting the system up was making sure, when creating the Pubmed RSS feed, to set the “number of items displayed” high enough (=100). If the number of articles posted in one RSS update exceeds the limit you set when you create the Pubmed RSS feed, Pubmed will post a URL to a Pubmed query for the entire set of papers as one RSS item, rather than post links to each individual paper. (For Gordon’s take on how he set up his twitterbots, see this thread.) [UPDATE 25/2/14: Rob Lanfear has posted detailed instructions for setting up a twitterbot using the strategy I describe above at https://github.com/roblanf/phypapers. See his comment below for more information.]
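
For the curious, the transformation dlvr.it performs on each feed item is simple enough to sketch in Python. This is a simplified stand-in for what the service does, not its actual code, and it assumes the 140-character tweet limit:

```python
import xml.etree.ElementTree as ET

def rss_to_tweets(rss_xml, limit=140):
    """Convert the <item> entries of an RSS payload into tweet-sized strings
    (title + link), truncating long titles with an ellipsis to fit the limit."""
    tweets = []
    for item in ET.fromstring(rss_xml).iter("item"):
        title = (item.findtext("title") or "").strip()
        link = (item.findtext("link") or "").strip()
        room = limit - len(link) - 1  # one space between title and link
        if len(title) > room:
            title = title[: room - 1] + "\u2026"
        tweets.append(f"{title} {link}")
    return tweets
```

Running this over each daily NCBI/arXiv update and posting one tweet per returned string is, in essence, the whole bot.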

So, has the experiment worked? Personally, I am finding FlyPapers a much more convenient way to stay on top of the Drosophila literature than any previous method I have used. Apparently others are finding this feed useful as well.

One year in, FlyPapers now has 333 followers in 16 countries, which is a far bigger and wider following than I would have ever imagined. Some of the followers are researchers I am familiar with in the Drosophila world, but most are students or post-docs I don’t know, which suggests the feed is finding relevant target audiences via natural processes on Twitter. The account has now posted 3,877 tweets, or ~10-11 tweets per day on average, which gives a rough scale for the amount of research being published annually on Drosophila. Around 10% of tweeted papers are getting retweeted (n=386) or favorited (n=444) by at least one person, and the breadth of topics being favorited/retweeted spans virtually all of Drosophila biology. These facts suggest that developing a twitterbot for domain-specific literature can indeed attract substantial numbers of like-minded individuals, and that automatically tweeting links to articles enables a significant proportion of papers in a field to easily be seen, bookmarked and shared.

Overall, I’m very pleased with the way FlyPapers is developing. I had hoped that one of the outcomes of this experiment would be to help promote Drosophila research, and this appears to be working. I had not expected that it would act as a general hub for attracting Drosophila researchers who are active on Twitter, which is a nice surprise. One issue I hadn’t considered a year ago is the potential that ’bots like FlyPapers might have to “game” Altmetrics scores. Frankly, any metric that could be so easily gamed by a primitive bot like FlyPapers probably has no real intrinsic value. However, it is true that this bot does add +1 to the Twitter count for all Drosophila papers. My view is that any attempt to correct for the potential influence of ’bots on Altmetrics scores should not unduly penalize the real human engagement that bots can facilitate, so I’d say it is fair to -1 the original FlyPapers tweets in an Altmetrics calculation, but retain the retweets created by humans.

One final consequence of putting all new Drosophila literature onto Twitter that I would not have anticipated is that some tweets have been picked up by other social media outlets, including disease-advocacy accounts that quickly pushed basic research findings out to their target audience:

This final point suggests that there may be wider impacts from having more research articles automatically injected into the Twitter ecosystem. Maybe those pesky twitterbots aren’t always so bad after all.

UPDATE:

For those interested in setting up their own scientific twitterbot, see Rob Lanfear’s excellent and easy-to-follow instructions here. Peter Carlton has also outlined another method for setting up a twitterbot here, as has Sho Iwamoto here.

Battling Administrivia Using an Intramural Question & Answer Forum

The life of a modern academic involves juggling many disparate tasks, and like a computer that is using more memory than it physically has, swapping between tasks leads to inefficiency and poor performance in our jobs. Personally, the time fragmentation and friction induced by transitioning from task to task seems to be one of the main sources of stress in my work life. The main reason for this is that many daily tasks on my to-do list are essential but fiddly and time-consuming administrivia (placing orders, filling in forms, entering marks into a database) that prevent me from getting to the things that I enjoy about being an academic: doing research, interacting with students, reading papers, etc.

I would go so far as to say that the mismatch between the desires of most academics and the reality of their jobs is the main source of academic “burnout” and low morale in what otherwise should be an awesome profession. I would also venture that administrivia is one of the major sources of the long hours we endure, since after wading through the “chaff”, we will (dammit!) put in the time on nights and weekends for the things we are most passionate about, to sustain our souls. And based on the frequency of sentiments relating to this topic flowing through my Twitter feed, I’d say the negative impact of administrivia is a pervasive problem in modern academic life, not restricted to any one institute.

While it is tempting to propose ameliorating the administrivia problem by simply eliminating bureaucracy, the growth of the administrative sector in higher education makes this solution a virtual impossibility. I have ultimately become resigned to the fact that the fundamentally inefficient nature of university bureaucracy cannot be structurally reformed, and have begun to seek other solutions to make my work life better. In doing so, I believe I’ve hit on a simple solution to the administrivia problem that I’m hoping might help others as well. In fact, I’m now convinced this solution is simple and powerful enough to actually be effective.

Accepting that it cannot be fully eliminated, my view is that the key to reducing the time and morale burden of administrivia is to realize that most routine tasks in University life are just protocols that require some amount of tacit knowledge about policies or procedures. Thus, all that is needed to reduce the negative impact of administrivia to its lowest possible level is to develop a system whereby accurate and relevant protocols can be placed at one’s fingertips so that they can be completed as fast as possible. The problem is that such protocols either don’t exist, don’t exist in a written form, or exist as scattered documents across various filesystems and offices that you have to expend substantial time finding. So how do we develop such protocols without generating more bureaucracy and exacerbating the problem we are attempting to solve?

My source of inspiration for ameliorating administrivia with minimal overhead comes from the positive experiences I have had using online Question and Answer (Q & A) forums based on the Stack Exchange model (principally the BioStars site for answering questions about bioinformatics). For those not familiar with such systems, the Q & A model popularized by the Stack Exchange platform (and its clones) allows questions to be asked and answers to be voted on, moderated, edited and commented on in a very intuitive and user-friendly manner. For some reason I am not able to fully explain, the engineering behind the Q & A model naturally facilitates both knowledge exchange and community building in a way that is on the whole extremely positive, and seems to prevent the worst aspects of human nature commonly found on older internet forums and commenting systems.

So here is my proposal for battling the impact of academic administrivia: implement an intramural, University-specific Q & A forum for academic and administrative staff to pose and answer each other’s practical questions, converting tacit knowledge stored in people’s heads, inboxes and intranets into a single knowledge bank that can be efficiently used and re-used by others who have the same queries. The need for an “intramural” solution, and the reason this strategy cannot be applied globally as it has for Linux administration, Poker or Biblical Hermeneutics, is that Universities (for better or worse) have their own local policies and procedures that can’t easily be shared or benefit from general worldwide input.

We have been piloting the use of the Open Source Question Answer (OSQA) platform (a clone of Stack Exchange) among a subset of our faculty for about a year, with good uptake and virtually unanimous endorsement from everyone who has used it. We currently enforce a real-name policy for users, have limited the system to questions of procedure only, and have encouraged users to answer their own questions after solving burdensome tasks. To make things easy to administer technically, we are using an out-of-the-box virtual machine of OSQA provided by Bitnami. The anonymized screenshot below gives a flavor of the banal, yet time-consuming, queries that arise repeatedly in our institution and that such a system makes easier to deal with. I trust colleagues at other institutions will find similar tasks frustratingly familiar.

[Anonymized screenshot of example questions from our OSQA pilot]

The main reason I am posting this idea now is that I am scheduled to give a demo and presentation to my Dean and management team this week to propose rolling this system out to a wider audience. In preparation for this pitch, I’ve been trying to assemble a list of pros and cons that I am sure is incomplete and would benefit from the input of other people familiar with how Universities and Q & A platforms work.

The pros of an intramural Q & A platform for battling administrivia I’ve come up with so far include:

  • Increasing efficiency, leading to higher productivity for both academic and administrative staff;
  • Reducing the sense of frustration about bureaucratic tasks, leading to higher morale;
  • Improving sense of empowerment and community among academic and administrative staff;
  • Providing better documentation of procedures and policies;
  • Serving as an “aide memoire”;
  • Aiding the success of junior academic staff;
  • Ameliorating the effects of administrative turnover;
  • Providing a platform for people who may not speak up in staff meetings to contribute;
  • Allowing “best practice” to emerge through crowd-sourcing;
  • Identifying common problems that should be prioritized for improvement;
  • Identifying like-minded problem solvers in a big institution;
  • Integrating easily around existing IT platforms;
  • Being deployable at any scale (lab group, department, faculty, school, etc.);
  • Making information accessible 24/7, when administrative offices are closed (H/T @jdenavascues).

I confess to struggling to find true cons, but these might include (rejoinders in parentheses):

  • Security risks (can be solved with proper administration and authentication);
  • Inappropriate content (a real-name policy should minimize this; can be solved with proper moderation);
  • Answers might be “impolitic” (a real-name policy should minimize this; can be solved with proper moderation; H/T @DrLabRatOry);
  • Time wasting (unlikely, since the whole point is to enhance productivity);
  • Lack of uptake (even if the 90-9-1 rule applies, it is an improvement on the status quo);
  • Perceived threat to administrative staff (far from it: this approach benefits administrative staff as much as academic staff);
  • Information could become stale (can be solved with proper moderation and periodic updating).

I’d be very interested to get feedback from others about this general strategy (especially by Tues PM 17 Sep 2013), thoughts on related efforts, or how intramural Q & A platforms could be used in other ways in an academic setting beyond battling administrivia in the comments below.

On the 30th Anniversary of DNA Sequencing in Population Genetics

30 years ago today, the “struggle to measure genetic variation” in natural populations was finally won. In a paper entitled “Nucleotide polymorphism at the alcohol dehydrogenase locus of Drosophila melanogaster” (published on 4 Aug 1983), Martin Kreitman reported the first effort to use DNA sequencing to study genetic variation at the ultimate level of resolution possible. Kreitman (1983) was instantly recognized as a major advance and became a textbook example in population genetics by the end of the 1980s. John Gillespie refers to this paper as “a milestone in evolutionary genetics“. Jeff Powell in his brief history of molecular population genetics goes so far as to say “It would be difficult to overestimate the importance of this paper”.

Arguably, the importance of Kreitman (1983) is greater now than ever, in that it provides both the technical and conceptual foundations for the modern gold rush in population genomics, including important global initiatives such as the 1000 Genomes Project. However, I suspect this paper is less well known to the increasing number of researchers who have come to studying molecular variation via routes other than a training in population genetics. For those not familiar with this landmark paper, it is worth taking the time to read it or Nathan Pearson‘s excellent summary over on Genomena.

As with other landmark scientific efforts, I am intrigued by how such projects and papers come together. Powell’s “brief history” describes how Kreitman arrived at using DNA to study variation in Adh, including some direct quotes from Kreitman (p. 145). However, this account leaves out an interesting story about the publication of this paper that I had heard bits and pieces of over time. Hard as it may be to imagine in today’s post-genomic sequence-everything world, using DNA sequencing to study genetic variation in natural populations was not immediately recognized as being of fundamental importance, at least by the editors of Nature where it was ultimately published.

To better understand the events of the publication of this work, I recently asked Richard Lewontin, Kreitman’s PhD supervisor, to provide his recollections on this project and the paper. Here is what he had to say by email (12 July 2013):

Dear Casey Bergman

I am delighted that you are commemorating Marty’s 1983 paper that changed the whole face of experimental population genetics. The story of the paper is as follows.

It was always the policy in our lab group that graduate students invented their own theses. My view was (and still is) that someone who cannot come up with an idea for a research program and a plan for carrying it out should not be a graduate student. Marty is a wonderful example of what a graduate student can do without being told what to do by his or her professor. Marty came to us from a zoology background and one day not very long after he became a member of the group he came to me and asked how I would feel about his investigating the genetic variation in Drosophila populations by looking at DNA sequence variation rather than the usual molecular method of looking at proteins which then occupied our lab. My sole contribution to Marty’s proposal was to say “It sounds like a great idea.”  I had never thought of the idea before but it became immediately obvious to me that it was a marvelous idea.  So Marty went over on his own initiative, to Wally Gilbert’s lab and learned all the methodology from George Church who was then in the Gilbert lab.

After Marty’s work was finished and he was to get his degree, he wrote a paper based on his thesis and, with my encouragement, sent the paper to Nature. He offered to make me a co-author, but I refused on long-standing principle. Since the idea and the work were entirely his, he was the sole author, a policy that was general in our group. I had no doubt that it was the most important work done in experimental population genetics  in many years and Nature was an obvious choice for this pathbreaking work.

The paper was soon returned by the Editor saying that they were not interested because they already had so many papers that gave the DNA sequence of various genes that they really did not want yet another one! Obviously they missed the point. My immediate reaction was to have Marty send the paper to a leading influential British Drosophila geneticist who would obviously understand its importance, asking him to retransmit the paper to Nature with his recommendation. He did so and the Editor of Nature then accepted it for publication. The rest is history.

Our own lab very quickly converted from protein electrophoresis to DNA sequencing, and I spent a lot of time using and updating the computer interface with the gel reading process, starting from Marty’s original programs for reading gels and outputting sequences. We never went back to protein electrophoresis. While protein gel electrophoresis certainly revolutionized population genetics, Marty’s introduction of DNA sequencing as the method for evolutionary genetic investigation of population genetic issues was a much more powerful one and made possible the testing of a variety of evolutionary questions for which protein gel electrophoresis was inadequate. Marty deserves to be considered as one of the major developers of evolutionary and population genetics studies.

Yours,

Dick Lewontin

Some may argue that Kreitman (1983) did not reveal all forms of genetic variation at the molecular level (e.g. large-scale structural variants) and therefore does not truly represent the “end” of the struggle to measure variation. What is clear, however, is that Kreitman (1983) does indeed represent the beginning of the “struggle to interpret genetic variation” at the fundamental genetic level, a struggle that may ultimately take longer than measuring variation itself. According to Maynard Olson, interpreting (human) genomic variation will be a multi-generational effort “like building the European cathedrals”. 30 years in, Olson’s assessment is proving to be remarkably accurate. Here’s to Kreitman (1983) for laying the first stone!

Related Posts:

Calvin Bridges, Automotive Pioneer

Calvin Bridges in 1935 (Photo Credit: Smithsonian Institution Collections SIA Acc. 90-105 [SIA2008-0022])

Calvin Bridges (1889-1938) is perhaps best known as one of the original Drosophila geneticists in the world. As an original member of Thomas Hunt Morgan’s Fly Room at Columbia University, Bridges made fundamental contributions to classical genetics, notably contributing the first paper ever published in the journal Genetics. The historical record on Bridges is scant, since Morgan and Alfred Sturtevant destroyed Bridges’ papers after his death to preserve the good name of their dear friend, whose politics and attitudes to free love were radical in many ways. Morgan’s biographical memoir of Bridges, presented to the National Academy of Sciences in 1940, contains very little detail on Bridges’ life, and this historical black hole has piqued my curiosity for some time.

Recently, I stumbled across a listing in the New York Times for an exhibit in Brooklyn recreating the original Columbia Fly Room, which will be used as a set in an upcoming film of the same name directed by Alexis Gambis. Gambis’ film approaches the Fly Room from the perspective of a visit to the lab by one of Bridges’ children, Betsy Bridges. I recommend that other Drosophila enthusiasts check out The Fly Room website and follow @theflyroom and @alexisgambis on Twitter for updates about the project.

In digging around more about this project, I found a link to the Kickstarter page that was used to raise funds for the film. This page includes an amazing story about Bridges that I had never heard previously. Apparently, after Morgan and his group moved to Caltech in 1928, Bridges built from scratch a futuristic car of his own design called “The Lightning Bug”. This initially came as a big surprise to me, but on reflection it is in keeping with Bridges’ role as the main technical innovator for the original Drosophila group. For example, Bridges introduced the binocular dissecting scope, the etherizer, the controlled-temperature incubator, and agar-based fly food into the Drosophilist’s toolkit.

Here is a clipping from Modern Mechanix from Aug 1936 describing the Lightning Bug:

Coverage of Calvin Bridge’s Lightning Bug in Modern Mechanix (Aug 1936).

Bridges’ Lightning Bug was notable enough to be written up in Time Magazine in May 1936, which described his car as follows:

It is almost perfectly streamlined, even the license plates and tail-lamp being recessed into the body and covered with Pyralin windows flush with the streamlining. There are no door handles; the doors must be opened with special keys. Dr. Bridges pronounced the Lightning Bug crash-proof and carbon-monoxide-proof. “My whole aim,” said he, “was to show what could be done to attain safety, economy and roadability in a small car.”

Newshawks discovered that for months, when he got tired of looking at fruit flies, the geneticist had retired to a garage, put on a greasy jumper and worked on his car far into the night, hammering, welding, machining parts on a lathe. Now & then, the foreman reported, Dr. Bridges hit his thumb with a hammer. Once he had to visit a hospital to have removed some tiny bits of steel which flew into his eyes. It was Calvin Bridges’ splendid eyesight which first attracted Dr. Morgan’s interest in him when Bridges was a shaggy, enthusiastic student at Columbia.

Calvin Bridges next to the Lightning Bug (Time Magazine, 4 May 1936).

Gambis has also posted a video of the Lightning Bug being driven by Bridges, taken by Pathé News. Gambis estimates this clip is from around 1938, but it is probably from 1936/7: Bridges died in Dec 1938 and was already terminally ill by the time Ed Novitski started graduate school at Caltech in the autumn of 1938, yet he appears fit in this clip. The clip clearly shows that the design of Bridges’ Lightning Bug was years ahead of its time in comparison to the other cars in the background. I would also wager that this is the only video footage in existence of Calvin Bridges.

The only other information I could find on the web about the Lightning Bug was a small news clipping that was making the rounds in local news in April/May 1936:

News clipping about the Lightning Bug (April/May 1936).

Interestingly, the only mention I can find of this story in historical accounts of the Drosophila group is one parenthetical note by Shine and Wrobel in their 1976 biography of Morgan that had previously escaped my notice. On page 120, they discuss how Morgan handled the receipt of his 1933 Nobel Prize in Physiology or Medicine (emphasis mine):

…Morgan was very modest about the honor. He frequently pointed out that it was a tribute to experimental biology rather than to any one man….As Morgan acknowledged the joint nature of the work, he divided the tax-free $40,000 award equally among his own children and those of Bridges and Sturtevant (but not of Muller’s). He gave no reason; in the letter to Sturtevant, for example, he said merely “I’m enclosing some money for your children.” (Bridges, however, is said to have used his to build a new car.)

So there you have it: Calvin Bridges, Drosophila geneticist, was also an unsung automotive pioneer whose foray into designing futuristic cars was likely funded in part by the proceeds of the 1933 Nobel Prize!

Related Posts: