Archive for the 'history' Category

On the 30th Anniversary of DNA Sequencing in Population Genetics

30 years ago today, the “struggle to measure genetic variation” in natural populations was finally won. In a paper entitled “Nucleotide polymorphism at the alcohol dehydrogenase locus of Drosophila melanogaster” (published on 4 Aug 1983), Martin Kreitman reported the first effort to use DNA sequencing to study genetic variation at the ultimate level of resolution possible. Kreitman (1983) was instantly recognized as a major advance and became a textbook example in population genetics by the end of the 1980s. John Gillespie refers to this paper as “a milestone in evolutionary genetics”. Jeff Powell, in his brief history of molecular population genetics, goes so far as to say “It would be difficult to overestimate the importance of this paper”.

Arguably, the importance of Kreitman (1983) is greater now than ever, in that it provides both the technical and conceptual foundations for the modern gold rush in population genomics, including important global initiatives such as the 1000 Genomes Project. However, I suspect this paper is less well known to the increasing number of researchers who have come to the study of molecular variation by routes other than a training in population genetics. For those not familiar with this landmark paper, it is worth taking the time to read it or Nathan Pearson‘s excellent summary over on Genomena.

As with other landmark scientific efforts, I am intrigued by how such projects and papers come together. Powell’s “brief history” describes how Kreitman arrived at using DNA to study variation in Adh, including some direct quotes from Kreitman (p. 145). However, this account leaves out an interesting story about the publication of this paper that I had heard bits and pieces of over time. Hard as it may be to imagine in today’s post-genomic sequence-everything world, using DNA sequencing to study genetic variation in natural populations was not immediately recognized as being of fundamental importance, at least by the editors of Nature where it was ultimately published.

To better understand the events of the publication of this work, I recently asked Richard Lewontin, Kreitman’s PhD supervisor, to provide his recollections on this project and the paper. Here is what he had to say by email (12 July 2013):

Dear Casey Bergman

I am delighted that you are commemorating Marty’s 1983 paper that changed the whole face of experimental population genetics. The story of the paper is as follows.

It was always the policy in our lab group that graduate students invented their own theses. My view was (and still is) that someone who cannot come up with an idea for a research program and a plan for carrying it out should not be a graduate student. Marty is a wonderful example of what a graduate student can do without being told what to do by his or her professor. Marty came to us from a zoology background and one day not very long after he became a member of the group he came to me and asked how I would feel about his investigating the genetic variation in Drosophila populations by looking at DNA sequence variation rather than the usual molecular method of looking at proteins which then occupied our lab. My sole contribution to Marty’s proposal was to say “It sounds like a great idea.”  I had never thought of the idea before but it became immediately obvious to me that it was a marvelous idea.  So Marty went over on his own initiative, to Wally Gilbert’s lab and learned all the methodology from George Church who was then in the Gilbert lab.

After Marty’s work was finished and he was to get his degree, he wrote a paper based on his thesis and, with my encouragement, sent the paper to Nature. He offered to make me a co-author, but I refused on long-standing principle. Since the idea and the work were entirely his, he was the sole author, a policy that was general in our group. I had no doubt that it was the most important work done in experimental population genetics  in many years and Nature was an obvious choice for this pathbreaking work.

The paper was soon returned by the Editor saying that they were not interested because they already had so many papers that gave the DNA sequence of various genes that they really did not want yet another one! Obviously they missed the point. My immediate reaction was to have Marty send the paper to a leading influential British Drosophila geneticist who would obviously understand its importance, asking him to retransmit the paper to Nature with his recommendation. He did so and the Editor of Nature then accepted it for publication. The rest is history.

Our own lab very quickly converted from protein electrophoresis to DNA sequencing, and I spent a lot of time using and updating the computer interface with the gel reading process, starting from  Marty’s original programs for reading gels and outputting sequences. We never went back to protein electrophoresis. While protein gel electrophoresis certainly revolutionized population genetics, Marty’s introduction of DNA sequencing as the method for evolutionary genetic investigation of population genetic issues was a much more powerful one and made possible the testing of a  variety of evolutionary questions for which protein gel electrophoresis was inadequate. Marty deserves to be considered as one of the major developers of evolutionary and population genetics studies.

Yours,

Dick Lewontin

Some may argue that Kreitman (1983) did not reveal all forms of genetic variation at the molecular level (e.g. large-scale structural variants) and therefore does not truly represent the “end” of the struggle to measure variation. What is clear, however, is that Kreitman (1983) does indeed represent the beginning of the “struggle to interpret genetic variation” at the fundamental genetic level, a struggle that may ultimately take longer than measuring variation itself. According to Maynard Olson, interpreting (human) genomic variation will be a multi-generational effort “like building the European cathedrals”. 30 years in, Olson’s assessment is proving to be remarkably accurate. Here’s to Kreitman (1983) for laying the first stone!


Calvin Bridges, Automotive Pioneer

Calvin Bridges in 1935 (Photo Credit: Smithsonian Institution Collections SIA Acc. 90-105 [SIA2008-0022])

Calvin Bridges (1889-1938) is perhaps best known as one of the original Drosophila geneticists in the world. As an original member of Thomas Hunt Morgan’s Fly Room at Columbia University, Bridges made fundamental contributions to classical genetics, notably contributing the first paper ever published in the journal Genetics. The historical record on Bridges is scant, since Morgan and Alfred Sturtevant destroyed Bridges’ papers after his death to preserve the name of their dear friend, whose politics and attitudes to free love were radical in many ways. Morgan’s biographical memoir of Bridges presented to the National Academy of Sciences in 1940 contains very little detail on Bridges’ life, and this historical black hole has piqued my curiosity for some time.

Recently, I stumbled across a listing in the New York Times for an exhibit in Brooklyn recreating the original Columbia Fly Room, which will be used as a set in an upcoming film of the same name directed by Alexis Gambis. Gambis’ film approaches the Fly Room from the perspective of a visit to the lab by one of Bridges’ children, Betsy Bridges. I recommend that other Drosophila enthusiasts check out The Fly Room website and follow @theflyroom and @alexisgambis on Twitter for updates about the project.

In digging around more about this project, I found a link to the Kickstarter page that was used to raise funds for the film. This page includes an amazing story about Bridges that I had never heard previously. Apparently, after Morgan and his group moved to Caltech in 1928, Bridges built from scratch a futuristic car of his own design called “The Lightning Bug”. This initially came as a big surprise to me, but on reflection it is in keeping with Bridges’ role as the main technical innovator for the original Drosophila group. For example, Bridges introduced the binocular dissecting scope, the etherizer, the controlled temperature incubator, and agar-based fly food into the Drosophilist’s toolkit.

Here is a clipping from Modern Mechanix from Aug 1936 describing the Lightning Bug:

Coverage of Calvin Bridges’ Lightning Bug in Modern Mechanix (Aug 1936).

Bridges’ Lightning Bug was notable enough to be written up in Time Magazine in May 1936, which described his car as follows:

It is almost perfectly streamlined, even the license plates and tail-lamp being recessed into the body and covered with Pyralin windows flush with the streamlining. There are no door handles; the doors must be opened with special keys. Dr. Bridges pronounced the Lightning Bug crash-proof and carbon-monoxide-proof. “My whole aim,” said he, “was to show what could be done to attain safety, economy and roadability in a small car.”

Newshawks discovered that for months, when he got tired of looking at fruit flies, the geneticist had retired to a garage, put on a greasy jumper and worked on his car far into the night, hammering, welding, machining parts on a lathe. Now & then, the foreman reported, Dr. Bridges hit his thumb with a hammer. Once he had to visit a hospital to have removed some tiny bits of steel which flew into his eyes. It was Calvin Bridges’ splendid eyesight which first attracted Dr. Morgan’s interest in him when Bridges was a shaggy, enthusiastic student at Columbia.

Calvin Bridges next to the Lightning Bug (Time Magazine, 4 May 1936).

Gambis has also posted a video of the Lightning Bug being driven by Bridges, filmed by Pathé News. Gambis estimates this clip was from around 1938, but it is probably from 1936/7, since Bridges died in Dec 1938 and was terminally ill by the time Ed Novitski started graduate school at Caltech in the autumn of 1938, whereas he appears fit in this clip. This clip clearly shows that the design of Bridges’ Lightning Bug was years ahead of its time in comparison to the other cars in the background. I would also wager this is the only video footage in existence of Calvin Bridges.

The only other information I could find on the web about the Lightning Bug was a small news clipping that was making the rounds in local news in April/May 1936:


Interestingly, the only mention I can find of this story in historical accounts of the Drosophila group is one parenthetical note by Shine and Wrobel in their 1976 biography of Morgan that had previously escaped my notice. On page 120, they discuss how Morgan handled the receipt of his 1933 Nobel Prize in Physiology or Medicine (emphasis mine):

…Morgan was very modest about the honor. He frequently pointed out that it was a tribute to experimental biology rather than to any one man….As Morgan acknowledged the joint nature of the work, he divided the tax-free $40,000 award equally among his own children and those of Bridges and Sturtevant (but not those of Muller). He gave no reason; in the letter to Sturtevant, for example, he said merely “I’m enclosing some money for your children.” (Bridges, however, is said to have used his to build a new car.)

So there you have it: Calvin Bridges, Drosophila geneticist, was also an unsung automotive pioneer whose foray into designing futuristic cars was likely funded in part by the proceeds of the 1933 Nobel Prize!


Directed Genome Sequencing: the Key to Deciphering the Fabric of Life in 1993

Seeing the #AAASmtg hashtag flowing on my twitter stream over the last few days reminded me that my former post-doc advisor Sue Celniker must be enjoying her well-deserved election to the American Association for the Advancement of Science (AAAS). Sue has made a number of major contributions to Drosophila genomics, and I personally owe her for the chance to spend my journeyman years with her and so many other talented people in the Berkeley Drosophila Genome Project. I would even go so far as to say that it was Sue’s 1995 paper with Ed Lewis on the “Complete sequence of the bithorax complex of Drosophila” that first got me interested in “genomics.” I remember being completely in awe of the Genbank accession from this paper, which was over 300,000 bp long! Man, this had to be the future. (In fact the accession number for the BX-C region, U31961, is etched in my brain like some telephone numbers from my childhood.) By the time I arrived at BDGP in 2001, the sequencing of the BX-C was already ancient history, as was the directed sequencing strategy used for this project. These rapid changes made the discovery of a set of discarded propaganda posters collecting dust in Reed George’s office, made at the time (circa 1993) to extol the virtues of “Directed Genome Sequencing” as the key to “Deciphering the Fabric of Life”, all the more poignant. I dug up a photo I took of one of these posters today to commemorate the recognition of this pioneering effort (below). Here’s to a bygone era, and hats off to pioneers like Sue who paved the road for the rest of us in (Drosophila) genomics!

A “Directed Genome Sequencing: Deciphering the Fabric of Life” poster (BDGP, circa 1993).

From Electron to Retrotransposon: “-on” the Origin of a Common Suffix in Molecular Biology

Over the last year or so, I have become increasingly interested in understanding the origin of major concepts in genetics and molecular biology. This is driven by several motivating factors, primarily to cure my ignorance/satisfy my curiosity, but also to be able to answer student queries more authoritatively and unearth unsolved questions in biology. One of the more interesting stories I have stumbled across relates to why so many terms in molecular biology (e.g. codon, replicon, exon, intron, transposon, etc.) end with the suffix “-on”. While nowhere near as pervasive as the “-ome” suffix that has contaminated biological parlance of late, the suffix “-on” has clearly left its mark on some of the most frequently used terms in the lexicon of molecular biology.

According to Brenner (1995) [1], the common use of the suffix “-on” in molecular biological terms can be traced to Seymour Benzer’s dissection of the fine structure of the rII locus in bacteriophage T4, which overturned the classical idea that a gene is an indivisible unit:

To mark this new view of the gene, Seymour invented new terms for the now different units of mutation, recombination and function. As he was a physicist, he modelled his terms on those of physics and just as electrons, protons and neutrons replaced the once indivisible atom, so genes came to be composed of mutons, recons and cistrons. The unit of function, the cistron, was based on the cis–trans complementation test, of which only the trans part is usually done…Of these terms, only cistron came to be widely used. It is conjectured that the other two, the muton and the recon, disappeared because Seymour failed to follow the first rule for inventing new words, which is to check what they may mean in other languages…Seymour’s pioneering invention of units was followed by a spate of other new names not all of which will survive. One that seems to have taken root is codon, which I invented in 1957; and the terms intron and exon, coined by Walter Gilbert, are certain to survive as well. Operon is moot; it is still frequently used in prokaryotic genetics but as the weight of research shifts to eukaryotes, which do not have such units of regulation, it may be lost. Replicon, invented by Francis Jacob and myself in 1962, seems also to have survived, despite the fact that we paid insufficient attention to how it sounded in other languages.

Thus, the fact that many molecular biological terms end in “-on” (initiated by Benzer) owes its origin to patterns of nomenclature in chemistry/nuclear physics (which itself began with Stoney’s proposal of the term electron in 1894) and the desire to identify “fundamental units” of biological structure and function.

While Brenner’s commentary provides a crucial first-hand account to understand the origin of these terms, it does not provide any primary references concerning their coining. So I’ve spent some time digging out the original usage for a number of the more common molecular biology “-ons”, which I thought may be of use or interest to others.

The terms recon, muton and cistron were defined by Benzer (1957) [2] as follows:

  • Recon: “The unit of recombination will be defined as the smallest element in the one-dimensional array that is interchangeable (but not divisible) by genetic recombination. One such element will be referred to as a “recon.””
  • Muton: “The unit of mutation, the “muton” will be defined as the smallest element that, when altered, can give rise to a mutant form of the organism.”
  • Cistron: “A unit of function can be defined genetically, independent of biochemical information, by means of the elegant cis-trans comparison devised by [Ed] Lewis…Such a map segment, corresponding to a function which is unitary as defined by the cis-trans test applied to the heterocaryon, will be defined as a cistron.”

I have not been able to find a definitive first reference that defines the term codon, the fundamental unit of the genetic code. According to Brenner (1995) [1] and the US National Library of Medicine’s Profiles in Science webpage on Marshall Nirenberg [3], the term codon was introduced by Brenner in 1957 “to describe the fundamental units engaged in protein synthesis, even though the units had yet to be fully determined. Francis Crick popularized the term in 1959. After 1962, Nirenberg began to use “codon” to characterize the three-letter RNA code words” [3].

The term operon was introduced by Jacob et al. (1960) [4] and defined as follows (italics theirs):

  • Operon: “Celle-ci comprendrait des unités d’expression coordonnée (opérons) constituées par un opérateur et le groupe de gènes de structure coordonnés par lui.” [This would comprise units of coordinated expression (operons) made up of an operator and the group of structural genes it coordinates.]

The term replicon was introduced by Jacob and Brenner (1963) [5] and defined as follows (italics theirs):

  • Replicon: “Il est donc clair qu’un chromosome (de bactérie ou de phage) ou un épisome constitue une unité de réplication indépendante ou réplicon, dont la reproduction est régie par la présence et l’activité de certains déterminants qu’il porte. Les caractères des réplicons exigent qu’ils déterminent des systèmes spécifiques gouvernant leur propre réplication.” [It is thus clear that a chromosome (of a bacterium or phage) or an episome constitutes an independent unit of replication, or replicon, whose reproduction is governed by the presence and activity of certain determinants that it carries. The properties of replicons require that they determine specific systems governing their own replication.]

Near and dear to my heart is the term transposon, which was first introduced by Hedges and Jacob (1974) [7] (italics theirs):

  • Transposon: “We designate DNA sequences with transposition potential as transposons (units of transposition)”

The very commonly used terms intron and exon were defined by Gilbert (1978) [6] as follows:

  • Intron & Exon: “The notion of the cistron, the genetic unit of function that one thought to correspond to a polypeptide chain, must be replaced by that of a transcription unit containing regions which will be lost from the mature messenger – which I suggest we call introns (for intragenic regions) – alternating with regions which will be expressed – exons.”

And finally, Boeke et al. (1985) [8] defined the term retrotransposon in the following passage (italics theirs):

  • Retrotransposon: “These observations, together with the finding that introns are spliced out of the Ty upon transposition, suggest that reverse transcription is a step in the transposition of Ty elements…We therefore propose the term retrotransposon for Ty and related elements.”

So there you have it, from electron to retrotransposon in just a few steps. I’ve left out some lesser-used terms with this suffix for the moment (e.g. regulon, stimulon, modulon), so as not to let this post go -on and -on. If anyone has any major terms to add here or corrections to my reading of the tea leaves, please let me know in the comments below.
References:
[1] Brenner, S. (1995) “Loose end: Molecular biology by numbers… one.” Current Biology 5(8): 964.
[2] Benzer, S. (1957) “The Elementary Units of Heredity.” in Symposium on the Chemical Basis of Heredity p. 70–93. Johns Hopkins University Press
[3] US National Library of Medicine. “The Marshall W. Nirenberg Papers.” Profiles in Science (profiles.nlm.nih.gov).
[4] Jacob, F., et al. (1960) “L’opéron: groupe de gènes à expression coordonnée par un opérateur.” C.R. Acad. Sci. Paris 250: 1727-1729.
[5] Jacob, F., and S. Brenner. (1963) “Sur la régulation de la synthèse du DNA chez les bactéries: l’hypothèse du réplicon.” C.R. Acad. Sci. Paris 256: 298-300.
[6] Gilbert, W. (1978) “Why genes in pieces?.” Nature 271(5645): 501.
[7] Hedges, R. W., and A. E. Jacob. (1974) “Transposition of ampicillin resistance from RP4 to other replicons.” Molecular and General Genetics MGG 132(1): 31-40.
[8] Boeke, J.D., et al. (1985) “Ty elements transpose through an RNA intermediate.” Cell 40(3): 491.
Credits:
Jim Shapiro (University of Chicago) gave very helpful pointers to possible places where the term “transposon” might originally have been introduced.

On The Neutral Sequence Fallacy


Beginning in the late 1960s, Motoo Kimura overturned over a century of “pan-selectionist” thinking in evolutionary biology by proposing what has come to be called The Neutral Theory of Molecular Evolution. The Neutral Theory in its basic form states that the dynamics of the majority of changes observed at the molecular level are governed by the force of Genetic Drift, rather than Darwinian (i.e. Positive) Natural Selection. As with all paradigm shifts in Science, there was much controversy over the Neutral Theory in its early years, but nevertheless the Neutral Theory has firmly established itself as the null hypothesis for studies of evolution at the molecular level since the mid-1980s.

Despite its widespread adoption, over the last ten years or so there has been a worrying increase in abuse of terminology concerning the Neutral Theory, which I will collectively term here the “Neutral Sequence Fallacy” (inspired by T. Ryan Gregory’s Platypus Fallacy). The Neutral Sequence Fallacy arises when the distinct concepts of functional constraint and selective neutrality are conflated, leading to the mistaken description of functionally unconstrained sequences as being “Neutral”. The Fallacy, in short, is to assign the term Neutral to a particular biomolecular sequence.

The Neutral Sequence Fallacy now routinely causes problems in the fields of evolutionary and genome biology, both by generating conceptual muddles and by shifting the goalposts needed to reject the null model of sequence evolution. I have intended to write about this problem for years in order to put a halt to this growing abuse of Neutral terminology, but unfortunately never found the time. However, this issue has reared its head more strongly in the last few days with new forms of the Neutral Sequence Fallacy arising in the context of discussions about the ENCODE project, motivating a rough version of this critique to finally see the light of day. Here I will try to sketch out the origins of the Neutral Sequence Fallacy, in its original pre-genomic form that was debunked by Kimura while he was alive, and in its modern post-genomic form that has proliferated unchecked since the early comparative genomic era.

The Neutral Sequence Fallacy draws on several misconceptions about the Neutral Theory, and begins with the abbreviation of the theory’s name from its full form (The Neutral Mutation – Random Drift Hypothesis) to its colloquial form (The Neutral Theory). This abbreviation de-emphasizes that the concept of selective neutrality applies to mutations (i.e. variants, alleles), not biomolecular sequences (i.e. regions of the genome, proteins). Simply put, only variants of a sequence can be neutral or non-neutral, not sequences themselves.

The key misconception that permits the Neutral Sequence Fallacy to flourish is the incorrect notion that if a sequence is neutrally evolving, it implies a lack of functional constraint operating on that sequence, and vice versa. Other ways to state this misconception are: “a sequence is Neutral if it is under no selective constraint” or conversely “selective constraint rejects Neutrality”. This misconception arose originally in the 1970s, shortly after the proposal of The Neutral Theory when many researchers were first coming to terms with what the theory meant. This misconception became prevalent enough that it was the first to be addressed head-on by Kimura (1983) nearly 30 years ago in section 3.6 of his book The Neutral Theory of Molecular Evolution entitled “On some misunderstandings and criticisms” (emphasis is mine):

Since a number of criticisms and comments have been made regarding my neutral theory, often based on misunderstandings, I would like to take this opportunity to discuss some of them. The neutral theory by no means claims that the genes involved are functionless as mistakenly suggested by Zuckerkandl (1978). They may or may not be, but what the neutral theory assumes is that the mutant forms of each gene participating in molecular evolution are selectively nearly equivalent, that is, they can do the job equally well in terms of survival and reproduction of the individual. (p. 50)

As pointed out by Kimura and Ohta (1977), functional constraints are consistent with neutral substitutions within a class of mutants. For example, if a group of amino acids are constrained to be hydrophilic, there can be random changes within the codons producing such amino acids…There is, of course, negative selection against hydrophobic mutants in this region, but, as mentioned before, negative selection does not contradict the neutral theory.  (p. 53)

It is understandable how this misconception arises, because in the limit of zero functional constraint (e.g. in a non-functional pseudogene), all alleles become effectively equivalent to one another and are therefore selectively neutral. However, this does not mean that an unconstrained sequence is Neutral (unless we redefine the meaning of Neutrality, see below), because a sequence itself cannot be Neutral, only variants of a sequence can be Neutral with respect to each other.

It is crucial in this context to understand that the Neutral Theory accommodates all levels of selective constraint, and sequences under selective constraint can evolve Neutrally (see formal statement of this in Equation 5.1 of Kimura 1983). This point is often lost on many people. Until you get this, you don’t understand the Neutral Theory. A simple example shows how this is true. Consider a single codon in a protein-coding region that codes for an amino acid with degenerate codons. Deletion of the third codon position would create a frameshift, and thus a third-position “silent” site is indeed functional. However, alternative codons for this amino acid are functionally equivalent and evolve (close to) neutrally. The fact that these alternative alleles evolve neutrally has to do with their equivalence of function, not the degree of their functional constraint.
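To make the codon example concrete, here is a toy sketch in Python (my illustration, not Kimura’s; the codon table is a minimal subset of the standard genetic code and the sequences are invented):

```python
# Toy illustration (mine, not Kimura's): at a degenerate third codon position,
# substitutions among synonymous codons leave the protein unchanged (the
# variants are selectively equivalent), yet deleting that same position shifts
# the reading frame, so the site is functionally constrained even though its
# alternative alleles evolve neutrally.
from textwrap import wrap

# Minimal subset of the standard genetic code; just enough for this example.
CODON_TABLE = {"GCT": "Ala", "GCC": "Ala", "GCA": "Ala", "GCG": "Ala",
               "GAA": "Glu", "AAA": "Lys"}

def translate(dna):
    """Translate codon by codon; '?' marks incomplete or unknown codons."""
    return [CODON_TABLE.get(codon, "?") for codon in wrap(dna, 3)]

wild_type   = "GCTGAAAAA"  # Ala-Glu-Lys
silent_sub  = "GCCGAAAAA"  # T->C at the third position: still Ala-Glu-Lys
frame_shift = "GCGAAAAA"   # same position deleted: downstream codons change

print(translate(wild_type))   # ['Ala', 'Glu', 'Lys']
print(translate(silent_sub))  # ['Ala', 'Glu', 'Lys']  (equivalent variants)
print(translate(frame_shift)) # ['Ala', 'Lys', '?']    (the site is functional)
```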

~~~~

To demonstrate the Neutral Sequence Fallacy, I’d like to point out a few clear examples of this misconception in action. The majority of transgressions in this area come from the genomics community, where people may not have been formally trained in evolution, but I am sad to say that an increasing number of evolutionary biologists are also falling victim to the Neutral Sequence Fallacy these days. My reckoning is that the Neutral Sequence Fallacy gained traction again in the post-genomic era around the time of the mouse genome paper by Waterston et al. (2002). In this widely-read paper, putatively unconstrained ancestral repeats were referred to (incorrectly) as “neutrally evolving DNA”, and used to estimate the fraction of the human genome under selective constraint. This analysis culminated with the following question: “How can we cleanly separate neutral and selected sequences?”. Under the Neutral Theory, this question makes no sense. First, sequences cannot be neutral; and second, the framework used to detect functional constraints by comparative genomics assumes Neutral evolution of both classes of sites (unconstrained and constrained) – i.e. most changes between species are driven by Genetic Drift, not Positive Selection. The proper formulation of this question should have been: “How can we cleanly separate unconstrained and constrained sequences?”.

Here is another clear example of the Neutral Sequence Fallacy in action from Lunter et al. (2006):

Figure 5 from Lunter et al. (2006). Notice how in the top panel, regions of the genome are contrasted as being “Neutral” vs. “Functional”. Here the term “Neutral” is being used incorrectly to mean selectively unconstrained. The bottom panel shows how indels are suppressed in Functional regions leading to intergap segments.

Here are a couple of more examples of the Neutral Sequence Fallacy in action, right in the title of fairly high-profile comparative genomics papers:

Title from Elnitski et al. (2003). Notice that the functional class of “Regulatory DNA” is incorrectly contrasted as being the complement of nonfunctional “Neutral Sites”. In fact, both classes of sites are assumed to evolve neutrally in the authors’ model.

Title from Chin et al (2005). As above, notice how the concept of “Functionally conserved” is incorrectly stated to be the opposite of “Neutral sequence” and both classes of sites are assumed to evolve neutrally in the authors’ model.

I don’t mean to single these papers out, they just happen to represent very clear examples of the Neutral Sequence Fallacy in action. In fact, the Lunter et al. (2006) paper is one of my all time favorites, but it bugs the hell out of me when I have to unpick students’ misconceptions after they read it. Frustratingly, the list of papers repeating the Neutral Sequence Fallacy is long and growing. I have recently started to collect them as a citeulike library to provide examples for students to understand how not to make this common mistake. (If anyone else would like to contribute to this effort, please let me know — there is much work to be done to reverse this trend.)

~~~~

So what’s the big deal here? Some would argue that these authors actually know what they are talking about, but they just happen to be using the wrong terminology. I wish that this were the case, but very often it is not. In many papers that I read or review that perpetrate the Neutral Sequence Fallacy, I usually find further examples of seriously flawed evolutionary reasoning, suggesting that the authors do not have a deep understanding of the issues at hand. In fact, the Neutral Sequence Fallacy is usually a clear hallmark that a paper’s authors are practicing population genetics or molecular evolution without a license. This leads to a Neutral Sequence Fallacy of the 1st Kind: where authors do not understand the difference between the concepts of functional constraint and selective neutrality. The problems for the Neutral Theory caused by violations of the 1st Kind are deep and clear. Because the Neutral Theory is not fully understood, it is possible to construct a straw-man version of the null hypothesis of Neutrality that can easily be “rejected” simply by finding evidence of selective constraint. Furthermore, because selectively unconstrained sequences are asserted (incorrectly) to be “Neutral” without actually evaluating their mode of evolution, this conceptual error undermines the entire value of the Neutral Theory as a null hypothesis testing framework.

But some authors really do know the difference between these ideas, and just happen to be using the term “Neutral” as shorthand for the term “Unconstrained.” Increasingly, I see some of my respected peers, card-carrying molecular evolutionists who do know their stuff, making this mistake in print. In these cases what is happening is a Neutral Sequence Fallacy of the 2nd Kind: understanding the difference between functional constraint and selective neutrality, but using lazy terminology that confuses these ideas in print. This is most often found in the context of studies on noncoding DNA where, in the absence of the genetic code to conveniently constrain terminology, people use terms like “neutral standard” or “neutral region” or “neutral sites” or “neutral proxy” in place of “putatively unconstrained”. While violations of the 2nd Kind can be overlooked and parsed correctly by experts in molecular evolution (I hope), this sloppy language causes substantial confusion about the Neutral Theory among students and non-evolutionary biologists who are new to the field, and leads to whole swathes of subsequent violations of the 1st Kind. Moreover, defining sequences as Neutral serves those with an Adaptationist agenda: since a control region is defined as being Neutral, all mutations that occur in that region must therefore be neutral as well, and thus any potential complications of the non-neutrality of mutations in one’s control region are conveniently swept under the carpet. Violations of the 2nd Kind are often quite insidious, since they are generally perpetrated by people with some authority in evolutionary biology, often unaware of their misuse of terminology, who will vigorously deny that they are using terms which perpetuate a classical misconception laid to rest by Kimura 30 years ago.

~~~~

Which brings us to the most recent incarnation of the Neutral Sequence Fallacy in the context of the ENCODE project. In a companion post explaining the main findings of the ENCODE Project, Ewan Birney describes how the ENCODE Project reinforced recent findings that many biochemical events operate on the genome that are highly reproducible, but have no known function. In describing these events, Birney states:

I really hate the phrase “biological noise” in this context. I would argue that “biologically neutral” is the better term, expressing that there are totally reproducible, cell-type-specific biochemical events that natural selection does not care about. This is similar to the neutral theory of amino acid evolution, which suggests that most amino acid changes are not selected either for or against…Whichever term you use, we can agree that some of these events are “neutral” and are not relevant for evolution.

Under the standard view of the Neutral Theory, Birney misuses the term “Neutral” here to mean lack of functional constraint, repeating the classical form of the Neutral Sequence Fallacy. Because of this, I argue that Birney’s proposed terminology be rejected, since it will perpetuate a classic misconception in Biology. Instead, I propose the term “biologically inert”.

But wait a minute, you say, this is actually a transgression of the 2nd Kind. Really what is going on here is a matter of semantics. Birney knows the difference between functional constraint and selective neutrality. He is just formalizing the creeping misuse of the term Neutral to mean “Nonfunctional” that has been happening over the last decade.  If so, then I argue he is proposing to assign to the term Neutral the primary misconception of the Neutral Theory previously debunked by Kimura. This is a very dangerous proposal, since it will lead to further confusion in genomics arising from the “overloading” of the term Neutral (Kimura’s meaning: selectively equivalent; Birney’s meaning: no functional constraint). This muddle will subsequently prevent most scientists from properly understanding the Neutral Theory, and lead to many further examples of the Neutral Sequence Fallacy of both Kinds.

In my view, semantic switches like this are dangerous in Science, since they massively hinder communication and, therefore, progress. Semantic switches also lead to a distortion of understanding about key concepts in science. A famous case in point is Watson’s semantic switch of Crick’s term “Central Dogma”, which corrupted Crick’s beautifully crafted original concept into the watered-down textbook misinterpretation that is most often repeated: “DNA makes RNA makes protein” (see Larry Moran’s blog for more on this). Some may say this is the great thing about language, the same word can mean different things to different people. This view is best characterized in the immortal words of Humpty-Dumpty in Lewis Carroll’s Through the Looking Glass:

“When I use a word,” Humpty Dumpty said, in rather a scornful tone, “it means just what I choose it to mean — neither more nor less.”

Others, including myself, disagree and prefer to have fixed definitions for scientific terms.

In a second recent case of the Neutral Sequence Fallacy creeping into discussions in the context of ENCODE, Michael Eisen proposes that we develop a “A neutral theory of molecular function” to interpret the meaning of these reproducible biochemical events that have no known function. Inspired by the introduction of a new null hypothesis in evolutionary biology ushered in by the Neutral Theory, Eisen calls for a new “neutral null hypothesis” that requires the molecular functions to be proven, not assumed. I laud any attempt to promote the use of null models for hypothesis testing in molecular biology, and whole-heartedly agree with Eisen’s main message about the need for a null model for molecular function.

But I disagree with Eisen’s proposal for a “neutral null hypothesis”, which, from my reading of his piece, directly couples the null hypothesis for function with the null hypothesis for sequence evolution. By synonymizing the Ho of the functional model with the Ho of the evolutionary model, regions of the genome that fail to reject the null functional model (i.e. have no functional constraint) will be conflated with “being Neutral” (incorrect) or evolving neutrally (potentially correct), whereas those regions that reject the null functional model will be immediately considered as evolving non-neutrally (which may not always be the case, since functional regions can evolve neutrally). While I assume this is not what is intended by Eisen, this is almost inevitably the outcome of suggesting a “neutral null hypothesis” in the context of biomolecular sequences. A “neutral null hypothesis for molecular function” makes it all too easy to merge the concepts of functional constraint and selective neutrality, which will inevitably lead many to the Neutral Sequence Fallacy. As Kimura did, Eisen should formally decouple the concept of functional constraint on a sequence from the mode of evolution by which that sequence evolves. Eisen should instead be promoting a “null model of molecular function” that cleanly separates the concepts of function and evolution (an example of such a null model is embodied in Sean Eddy’s Random Genome Project). If not, I fear this conflation of concepts, like Birney’s semantic switch, will lead to more examples of the Neutral Sequence Fallacy of both Kinds.

~~~~

The Neutral Sequence Fallacy shares many sociological similarities with the chronic misuse of and misconceptions about the concept of Homology. As discussed by Marabotti and Facchiano in their article “When it comes to homology, bad habits die hard”, there was a peak of misuse of the term Homology in the mid-1980s, which led to a backlash of publications demanding more rigorous use of the term. Despite this backlash and the best efforts of many scientists to stem the tide of misuse of Homology, ~43% of abstracts surveyed in 2007 still used Homology incorrectly, down from 51% in 1986 before the assault on its misuse began. As anyone teaching the concept knows, unpicking misconceptions about Homology vs. Similarity is crucial for getting students to understand evolutionary theory. I argue that the same is true for the distinction between Functional Constraint and Selective Neutrality. When it comes to Functional Constraints on biomolecular sequences, our choice of terminology should be anything but Neutral.

References:

Chin CS, Chuang JH, & Li H (2005). Genome-wide regulatory complexity in yeast promoters: separation of functionally conserved and neutral sequence. Genome research, 15 (2), 205-13 PMID: 15653830

Elnitski L, Hardison RC, Li J, Yang S, Kolbe D, Eswara P, O’Connor MJ, Schwartz S, Miller W, & Chiaromonte F (2003). Distinguishing regulatory DNA from neutral sites. Genome research, 13 (1), 64-72 PMID: 12529307

Lunter G, Ponting CP, & Hein J (2006). Genome-wide identification of human functional DNA using a neutral indel model. PLoS computational biology, 2 (1) PMID: 16410828

Marabotti A, & Facchiano A (2009). When it comes to homology, bad habits die hard. Trends in biochemical sciences, 34 (3), 98-9 PMID: 19181528

Waterston RH, Lindblad-Toh K, Birney E, Rogers J, Abril JF, et al. (2002). Initial sequencing and comparative analysis of the mouse genome. Nature, 420 (6915), 520-62 PMID: 12466850

Credits:

Thanks to Chip Aquadro for originally pointing out to me when I perpetrated the Neutral Sequence Fallacy (of the 1st Kind!) during a journal club as an undergraduate in his lab. I can distinctly recall the hot embarrassment of the moment while being schooled in this important issue by a master. Thanks also to Alan Moses, who was the first of many people I converted to the light on this issue, and who has encouraged me since to write this up for a wider audience. Thanks also to Douda Bensasson for putting up with me ranting about this issue for years, and for helpful comments on this post.


The Logistics of Scientific Growth in the 21st Century


Over the last few months, I’ve noticed a growing number of reports about declining opportunities and increasing pressure for early stage academic researchers (Ph.D. students, post-docs and junior faculty). For example, the Washington Post published an article in early July about trends in the U.S. scientific job market entitled “U.S. pushes for more scientists, but the jobs aren’t there.” This post generated over 3,500 comments on the WaPo website alone and was highly discussed in the twittersphere. In mid July, Inside Higher Ed reported that an ongoing study revealed a recent, precipitous drop in the interest of STEM (Science/Technology/Engineering/Mathematics) Ph.D. students wishing to pursue an academic tenure-track career. These results confirmed those published in PLoS ONE in May, which showed that the interest of STEM students surveyed in 2010 in pursuing an academic career declined during the course of their Ph.D. studies:

Figure 1. Percent of STEM Ph.D. students judging a career to be “extremely attractive”. Taken from Sauermann & Roach (2012).

Even for those lucky enough to get an academic appointment, the bad news seems to be that it is getting harder to establish a research program.  For example, the average age for a researcher to get their first NIH grant (a virtual requirement for tenure for many biologists in the US) is now 42 years old. National Public Radio quips “50 is the new 30, if you’re a promising scientist.”

I’ve found these reports very troubling since, after nearly fifteen years of slogging it out since my undergrad to achieve the UK equivalent of a “tenured” academic position, I am acutely aware of how hard the tenure track is for junior scientists at this stage in history. On a regular basis I see how the current system negatively affects the lives of talented students, post-docs and early-stage faculty. I have for some time wanted to write about my point of view on this issue since I see these trends as indicators of bigger changes in the growth of science than individuals may be aware of. I’ve finally been inspired to do so by a recent piece by Euan Ritchie and Joern Fischer published in The Conversation entitled “Cracks in the ivory tower: is academia’s culture sustainable?”, which I think hits the nail on the head about the primary source of the current problems in academia: the deeply flawed philosophy that “more is always better”.

My view is that the declining opportunities and increasing malaise among early-stage academics are a by-product of the fact that the era of exponential growth in academic research is over. That’s nonsense, you say, the problems we are experiencing now are because of the current global economic downturn. What’s happening now is a temporary blip, things will return to happier days when we get back to “normal” economic growth and governments increase investment in research. Nonsense, I say. This has nothing to do with the current economic climate and instead has more to do with long term trends in the growth of scientific activity over the last three centuries.

My views are almost entirely derived from a book written by Derek de Solla Price entitled Little Science, Big Science. Price was a scientist cum historian who published this slim tome in 1963 based on a series of lectures given at Brookhaven National Lab in 1962. It was a very influential book in the 1960s and 1970s, since it introduced citation analysis to a wide audience. Along with Eugene Garfield of ISI/Impact Factor fame (or infamy, depending on your point of view), Price is credited as being one of the founding fathers of Scientometrics. Sadly, this important book is now out of print, the Wikipedia page on this book is a stub with no information, and Google books has not scanned it into their electronic library, showing just how far the ideas in this book are out of the current consciousness. I am not the first to lament that Price’s writings have been ignored in recent years.

In a few short chapters, Price covers large-scale trends in the growth of science and the scientific literature from its origins in the 17th century, which I urge readers to explore for themselves. I will focus here only on one of his key points that relates to the matter at hand — the pinch we are currently feeling in science. Price shows that as scientific disciplines matured in the 20th century, they achieved a characteristic exponential growth rate, which appears linear on a logarithmic scale. This can be seen in terms of both the output of scientific papers (Figure 2) and of scientists themselves (Figure 3).

Figure 2. Taken from de Solla Price 1963.


Figure 3. Taken from de Solla Price 1963.

Price showed that there was a roughly constant doubling time for different forms of scientific output (number of journals, number of papers, number of scientists, etc.) of about 10-15 years. That is, the amount of scientific output at a given point in history is twice as large as it was 10-15 years before. This incessant growth is why we all feel like it is so hard to keep up on the literature (and incidentally why I believe that text mining is now an essential tool). And these observations led Price to make the famous claim that “Eighty to 90 per cent of all the scientists who have ever lived are alive now”.

Crucially, Price pointed out that the doubling time of the number of scientists is much shorter than the doubling time of the overall human population (~50 years). Thus, the proportion of scientists relative to the total human population has been increasing for decades, if not centuries. Price makes the startling but obvious outcomes of this observation very clear: either everyone on earth will be a scientist one day, or the growth rate of science must decrease from its previous long term trends. He then goes on to argue that the most likely outcome is the latter, and that scientific growth rates will change from exponential to logistic growth and reach saturation sometime within 100 years from the publication of his book in 1963 (Figure 4):

Figure 4. A model of logistic growth for Science in the late 20th and early 21st century (taken from de Solla Price 1963).
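To see the arithmetic behind Price’s argument, here is a back-of-the-envelope sketch in Python. The 15-year and 50-year doubling times come from the text above; the 1963 baseline counts are my own illustrative assumptions, not Price’s figures:

```python
# Back-of-the-envelope version of Price's argument. The doubling times come
# from the text above; the baseline counts are illustrative assumptions.
import math

scientists_1963 = 1e6     # assumed number of scientists in 1963 (illustrative)
population_1963 = 3.2e9   # approximate world population in 1963

r_sci = math.log(2) / 15  # growth rate implied by a 15-year doubling time
r_pop = math.log(2) / 50  # growth rate implied by a 50-year doubling time

# If both exponential trends held, the time until everyone on Earth is a scientist:
years = math.log(population_1963 / scientists_1963) / (r_sci - r_pop)
print(f"absurdity reached in ~{years:.0f} years")  # ~250 years

# Price's resolution: the exponential curve must bend into a logistic one,
# N(t) = K / (1 + ((K - N0) / N0) * math.exp(-r * t)), saturating at a ceiling K.
```

Whatever baseline one assumes, the conclusion is the same: the two exponentials cross within a few centuries, so one of them has to bend.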

So maybe the bad news circulating in labs, coffee rooms and over the internet is not a short term trend based on the current economic downturn, but instead reflects the product of a long term trend in the history of science?  Perhaps the crunch that we are currently experiencing in academic research now is the byproduct of the fact that we are in Price’s transition from exponential to logistic growth in science? If so, the pressures we are experiencing now may simply reflect that the current rate of production of scientists is no longer matched to the long-term demand for scientists in society.

Whether or not this model of growth in science is true is clearly debatable (please do so below!). But if we are in the midst of making the transition from exponential to logistic growth in science, then there are a number of important implications that I feel scientists at all stages of their careers should be aware of:

1) For PhD students and post-docs: you have every right to be feeling like the opportunities in science may not be there for you as they were for your supervisors and professors. This message sucks, I know, but one important take-home message from this is that it may not have anything to do with your abilities; it may just have to do with when you came along in history. I am not saying that there will be no opportunities in the future, just fewer as a proportion of the total number of jobs in society relative to current levels. I’d argue that this is a cautiously optimistic view, since anticipating the long-term trends will help you develop more realistic and strategic approaches to making career choices.

2) For early-stage academics: your career trajectory is going to be more limited than you anticipated going into this gig. Sorry mate, but your lab is probably not going to be as big as you might think it should be, you will probably get fewer grants, and you will have more competition for resources than you witnessed in your PhD or post-doc supervisor’s lab. Get used to it. If you think you have it hard, see point 1). You are lucky to have a job in science. Also bear in mind that the people judging your career progression may hold expectations that are no longer relevant, and as a result you may have more conflict with senior members of staff during the earlier phases of your career than you expect. Most importantly, if you find that this new reality is true for you, then do your best to adjust your expectations for PhD students and post-docs as well.

3) For established academics: you came up during the halcyon days of growth in science, so bear in mind that you had it easy relative to those trying to make it today. So when you set your expectations for your students or junior colleagues in terms of performance, recruitment or tenure, be sure to take on board that they have it much harder now than you did at the corresponding point in your career [see points 1) and 2)]. A corollary of this point is that anyone actually succeeding in science now and in the future is (on average) probably better trained and works harder than you did at the corresponding point in your career, so on the whole you are probably dealing with someone who is more qualified for their job than you would be. So don’t judge your junior colleagues by out-of-date standards (that you might not be able to meet yourself in the current climate) or promote values from a bygone era of incessant growth. Instead, adjust your views of success for the 21st century and seek to promote a sustainable model of scientific career development that will fuel innovation for the next hundred years.

References

de Solla Price D (1963) Little Science, Big Science. New York: Columbia University Press.

Kealey T (2000). More is less. Economists and governments lag decades behind Derek Price’s thinking. Nature, 405 (6784) PMID: 10830939

Sauermann H, & Roach M (2012). Science PhD career preferences: levels, changes, and advisor encouragement. PloS one, 7 (5) PMID: 22567149


Where Do Bioinformaticians Host Their Code?

A while back I was piqued by a discussion on BioStar about “Where would you host your open source code repository today?”, which got me thinking about the relative merits of the different sites for hosting bioinformatics software. I am not an evangelist for any particular version control system or hosting site, and I leave it to readers to have a look into these systems themselves or at the BioStar thread for more on the relative merits of major hosting services, such as Sourceforge, Google Code, github and bitbucket. My aim here is not to advocate any particular system (although as a lab head I have certain predilections*), but to answer the straightforward empirical question: where do bioinformaticians host their code?

To do this, I’ve queried PubMed for keywords in the URLs of the four major hosting services listed above to get estimates of their uptake in biomedical publications. This simple analysis clearly has some caveats, including the fact that many publications link to hosting services in sections of the paper outside the abstract, and that many bioinformaticians (frustratingly) release code via institutional or personal webpages. Furthermore, the various hosting services arose at different times in history, so it is also important to interpret these data in a temporal context. These (and other caveats) aside, the following provides an overview of how the bioinformatics community votes with their feet in terms of hosting their code on the major repository systems…
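For anyone who wants to reproduce or extend this survey, below is a minimal sketch of the kind of query involved, using NCBI’s E-utilities esearch endpoint (the exact search terms behind the counts reported here may have differed, so treat the query strings as assumptions):

```python
# Minimal sketch of the PubMed survey (the query terms are assumptions, not
# necessarily the exact ones behind the table below): count records whose
# title/abstract mentions a code-hosting domain, per year, via NCBI E-utilities.
import json
import urllib.parse
import urllib.request

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
HOSTS = ["sourceforge.net", "code.google.com", "github.com", "bitbucket.org"]

def pubmed_count(term, year):
    """Number of PubMed records matching `term` in title/abstract for `year`."""
    params = urllib.parse.urlencode({
        "db": "pubmed",
        "term": f'"{term}"[tiab]',   # [tiab] restricts to title/abstract
        "datetype": "pdat",          # filter on publication date
        "mindate": str(year),
        "maxdate": str(year),
        "retmode": "json",
    })
    with urllib.request.urlopen(f"{EUTILS}?{params}") as response:
        return int(json.load(response)["esearchresult"]["count"])

for host in HOSTS:
    yearly = {year: pubmed_count(host, year) for year in range(2002, 2013)}
    print(host, yearly)
```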

First of all, the bad news: of the many thousands of articles published in the field of bioinformatics, as of Dec 31 2012 just under 700 papers (n=681) have easily discoverable code linked to a major repository in their abstract. The totals for each repository system are: 446 on Sourceforge, 152 on Google Code, 78 on github and only 5 on bitbucket. So, by far, the majority of authors have chosen not to host their code on a major repository. But for the minority of authors who have chosen to release their code via a stable repository system, most use Sourceforge (which is the oldest and most established source code repository) and effectively nobody is using bitbucket.

The first paper to link published code to a major repository system appeared only a decade ago, in 2002, and a breakdown of the growth in code hosting since then looks like this:

Year    Sourceforge   Google Code   github
2002          4             0           0
2003          3             0           0
2004         10             0           0
2005         21             1           0
2006         24             0           0
2007         30             1           0
2008         30            10           0
2009         48            10           0
2010         69            21           8
2011         94            46          18
2012        113            63          52
Total       446           152          78

Trends in bioinformatics code repository usage 2002-2012.

A few things are clear from these results: 1) there is a steady upward trend in biomedical researchers hosting their code on major repository sites, 2) Sourceforge has clearly been the dominant player in the biomedical code repository game to date, but 3) the current growth rate of github appears to be outstripping both Sourceforge and Google Code. Furthermore, it appears that github is not experiencing any lag in uptake, as was observed in the 2002-2004 period for Sourceforge and the 2006-2009 period for Google Code. It is good to see that new players in the hosting market are being accepted at a quicker rate than they were a decade ago.

Hopefully the upward trend of bioinformaticians releasing their code via a major hosting service will continue (keep up the good work, brothers and sisters!), ultimately creating a snowball effect such that it is no longer acceptable to publish bioinformatics software without releasing it openly into the wild.


  • As a lab head I prefer to use Sourceforge for our published work, since Sourceforge has a very draconian policy when it comes to deleting projects, which prevents accidental or willful deletion of a repository. In my opinion, Google Code and (especially) github are too permissive in allowing projects to be deleted. I see it as my duty to ensure the long-term preservation of published code above all other considerations. I am aware that there are mechanisms to protect against deletion of repositories on github and Google Code, but I suspect that most lab heads do not use them, and that a substantial fraction of published academic code is one click away from deletion.

Will the Democratization of Sequencing Undermine Openness in Genomics?

It is no secret, nor is it an accident, that the success of genome biology over the last two decades owes itself in large part to the Open Science ideals and practices that underpinned the Human Genome Project. From the development of the Bermuda principles in 1996 to the Ft. Lauderdale agreement in 2003, leaders in the genomics community fought for rapid, pre-publication data release policies that have (for the most part) protected the interests of genome sequencing centers and the research community alike.

As a consequence, progress in genomic data acquisition and analysis has been incredibly fast, leading to major basic and medical breakthroughs, thousands of publications, and ultimately to new technologies that now permit extremely high-throughput DNA sequencing. These technologies give individual groups sequencing capabilities that were previously achievable only by large sequencing centers. This development makes it timely to ask: how do the data release policies for primary genome sequences apply in the era of next-generation sequencing (NGS)?

My reading of the history of genome sequence release policies condenses the key issues as follows:

  • The Bermuda Principles say that assemblies of primary genomic sequences of human and other organisms should be released publicly within 24 hours of their production.
  • The Ft. Lauderdale Agreement says that whole-genome shotgun reads should be deposited in public repositories within one week of generation. (The agreement also encouraged applying these principles to other types of data from “community resource projects”, defined as research projects specifically devised and implemented to create a set of data, reagents or other material whose primary utility will be as a resource for the broad scientific community.)

Thus, the agreed standard in the genomics field is that raw sequence data from the primary genomic sequence of organisms should be made available within a week of generation. In my view this also applies to so-called “resequencing” efforts (like the 1000 Genomes Project), since genomic data from a new strain or individual is actually a new primary genome sequence.

The key question concerning genomic data release policies in the NGS era, then, is: do these policies apply only to sequencing centers, or to any group producing primary genomic data? Now that you too are effectively a sequencing center, are you bound by the obligations that sequencing centers have followed for a decade or more? This issue is important to discuss for its own sake in order to promote Open Science, but also for the conundrums it throws up about data release in genomics. For example, if individual groups sequencing genomes are not bound by the same policies as sequencing centers, then a group at e.g. Sanger or Baylor working on a genome is now actually at a competitive disadvantage in the NGS era, because they would be forced to release their data.

I argue that if the wider research community does not abide by the current practice of early data release in genomics, the democratization of sequencing will lead to the slow death of openness in genomics. We could very well see a regression to the mean behavior of data hoarding (I sometimes call this “data mine, mine, mining”) that is sadly characteristic of most of the biological sciences. In turn this could decelerate progress in genomics, leading to a backlog of terabytes of un(der)analyzed data rotting on disks around the world. Are you prepared to stand by, do nothing and bear witness to this bleak future? ; )

While many individual groups collecting primary genomic sequence data may hesitate to embrace pre-publication data release, it should be noted that there is a standard procedure in place to protect the data producer’s right to have the first chance to publish (or co-publish) large-scale analyses of the data, while permitting the wider research community early access. The Ft. Lauderdale agreement recognized that:

…very early data release model could potentially jeopardize the standard scientific practice that the investigators who generate primary data should have both the right and responsibility to publish the work in a peer-reviewed journal. Therefore, NHGRI agreed to the inclusion of a statement on the sequence trace data permitting the scientific community to use these unpublished data for all purposes, with the sole exception of publication of the results of a complete genome sequence assembly or other large-scale analyses in advance of the sequence producer’s initial publication.

This type of data producer protection proviso has been taken up by some community-led efforts to release large amounts of primary sequence data prior to publication, as laudably done by the Drosophila Population Genomics Project (Thanks Chuck!).

While the Ft. Lauderdale agreement in principle tries to balance the interests of the data producers and consumers, it is not without failings. As Mike Eisen points out on his blog:

In practice [the Ft. Lauderdale proviso] has also given data producers the power to create enormous consortia to analyze data they produce, effectively giving them disproportionate credit for the work of large communities. It’s a horrible policy that has significantly squelched the development of a robust genome analysis community that is independent of the big sequencing centers.

Eisen rejects the Ft. Lauderdale agreement in favor of a new policy he entitles the Batavia Open Genomic Data License. The Batavia License does not require an embargo period, nor that data users inform producers of how they intend to use the data, as is expected under the Ft. Lauderdale agreement; but it does require that groups using the data publish in an open access journal. Therefore the Batavia License is not truly open either, and I fear that it imposes unnecessary restrictions that will prevent its widespread uptake. The only truly Open Science policy for data release is a Creative Commons (CC-BY or CC-Zero) style license with no restrictions other than attribution, a precedent established last year for the E. coli TY-2482 genome sequence (BGI you rock!).

A CC-style license will likely be too liberal for most labs generating their own data, and thus I argue we may be better off pushing for individual groups to adopt a Ft. Lauderdale style agreement, so that the (admittedly less than optimal) status quo is taken up by the wider community. Another option is for researchers to release their data early via “data publications”, such as those being developed by journals like GigaScience and F1000 Reports.

Whatever the mechanism, I join Eisen in calling for wider participation from the research community in releasing primary genomic sequence data. Indeed, it would be a truly sad twist of fate if the wider research community failed to follow, in the post-NGS era, the data release policies that were put in place in the pre-NGS era to protect its interests. I for one will do my best in the coming years to reciprocate the generosity that has made the Drosophila genomics community so great (in the long tradition of openness dating back to the Morgan school), by releasing any primary sequence data produced by my lab prior to publication. Watch this space.

Did Finishing the Drosophila Genome Legitimize Open Access Publishing?

I’m currently reading Glyn Moody‘s (2003) “Digital Code of Life: How Bioinformatics is Revolutionizing Science, Medicine, and Business” and greatly enjoying the writing, as well as the whirlwind summary of the history of bioinformatics and the (Human) Genome Project(s). Most of what Moody says that I am familiar with is quite accurate, and his scholarship is thorough, so I find his telling of the story compelling. One claim in the book that I find new and curious is in his discussion of the sequencing of the Drosophila melanogaster genome, more precisely the “finishing” of this genome, and its impact on the legitimacy of Open Access publishing.

The sequencing of D. melanogaster was done as a collaboration between the Berkeley Drosophila Genome Project and Celera, as a test case to prove that whole-genome shotgun sequencing could be applied to large animal genomes. I won’t go into the details here, but it is widely accepted that while the Adams et al. (2000) and Myers et al. (2000) papers in Science demonstrated the feasibility of whole-genome shotgun sequencing, it was a lesser-known paper by Celniker et al. (2002) in Genome Biology, reporting the “finished” D. melanogaster genome, that proved the accuracy of whole-genome shotgun assembly. No controversy here.

More debatable is what Moody goes on to write about the Celniker et al. (2002) paper:

This was an important paper, then, and one that had a significance that went beyond its undoubted scientific value. For it appeared neither in Science, as the previous Drosophila papers had done, nor in Nature, the obvious alternative. Instead, it was published in Genome Biology. This describes itself as “a journal, delivered over the web.” That is, the Web is the primary medium, with the printed version offering a kind of summary of the online content in a convenient portable form. The originality of Genome Biology does not end there: all of its main research articles are available free online.

A description then follows of the history and virtues of PubMed Central and the earliest Open Access biomedical publishers BioMed Central and PLoS. Moody (emphasis mine) then returns to the issue of:

…whether a journal operating on [Open Access] principles could attract top-ranked scientists. This question was answered definitively in the affirmative with the announcement and analysis of the finished Drosophila sequence in January 2003. This key opening paper’s list of authors included not only [Craig] Venter, [Gene] Myers, and [Mark] Adams, but equally stellar representatives of the academic world of Science, such as Gerald Rubin, the boss of the fruit fly genome project, and Richard Gibbs, head of sequencing at Baylor College. Alongside this paper there were no less than nine other weighty contributions, including one on Apollo, a new tool for viewing and editing sequence annotation. For its own Drosophila extravaganza of March 2000, Science had marshalled seven papers in total. Clearly, Genome Biology had arrived, and with it a new commercial publishing model based on the latest way of showing the data.

This passage resonated with me, since I was working at the BDGP when this special issue on the finishing of the Drosophila genome was published in Genome Biology, and I was personally introduced to Open Access publishing through this event. I recall Rubin walking the hallways of building 64 on his periodic visits, promoting this idea and motivating us all to work hard to get our papers together by the end of 2002 for this unique opportunity. I also remember lugging around stacks of the printed issue at the Fly meeting in Chicago in 2003, plying unsuspecting punters with copies of a journal that most people had never heard of, and having some of my first conversations with people on Open Access as a consequence.

What Moody doesn’t capture in this telling is that Rubin’s decision to publish in Genome Biology almost surely owes itself to the influence Mike Eisen had on Rubin and others in the Berkeley genomics community at the time. Eisen and Rubin had recently collaborated on a paper; Eisen had made inroads in Berkeley on the Open Access issue by actively recruiting signatories for the PLoS open letter the year before; and Eisen himself had published his first Open Access paper in Oct 2002 in Genome Biology. Clearly the idea of publishing in Open Access journals, and in Genome Biology in particular, was in the air at the time, so it may not have been as bold a step for Rubin to take as Moody implies.

Nevertheless, Moody’s point may have some truth to it, and it is interesting to consider whether the long-standing open data philosophy of the Drosophila genetics community that led to the Genome Biology special issue was a key turning point in the widespread success of Open Access publishing over the next decade. Surely the movement would have taken off anyway at some point. But in late 2002, when the BioMed Central journals were the only place to publish gold Open Access articles, few people had tested the waters since the launch of the BMC journals in 2000. While we cannot replay the tape, Moody’s claim is plausible in my view, and it is interesting to ask whether widespread buy-in to Open Access publishing in biology might have been delayed had Rubin not insisted that the efforts of the Berkeley Drosophila Genome Project be published under an Open Access model.

UPDATE 25 March 2012

After tweeting this post, here is what Eisen and Moody have to say:

UPDATE 19 May 2012

It appears that the publication of another part of the Drosophila (meta)genome, its Wolbachia endosymbiont, played an important role in converting Jonathan Eisen to supporting Open Access. Read more here.

The Roberts/Ashburner Response

A previous post on this blog shared a helpful boilerplate response to editors for politely declining to review for non-Open Access journals, which I originally received from Michael Ashburner. During a quick phone chat today, Ashburner told me that he in fact inherited a version of this response from Nobel laureate Richard Roberts, co-discoverer of introns, lead author on the Open Letter to Science calling for a “GenBank” of the scientific literature, and long-time editor of Nucleic Acids Research, one of the first classical journals to move to a fully Open Access model. So, to give credit where it is due, I’ve updated the title of the “Just Say No” post to make the attribution of this letter clearer. We owe both Roberts and Ashburner many thanks for paving the way to a better model of scientific communication and for leading by example.

