The Logistics of Scientific Growth in the 21st Century

Over the last few months, I’ve noticed a growing number of reports about declining opportunities and increasing pressure for early-stage academic researchers (Ph.D. students, post-docs and junior faculty). For example, the Washington Post published an article in early July about trends in the U.S. scientific job market entitled “U.S. pushes for more scientists, but the jobs aren’t there.” This post generated over 3,500 comments on the WaPo website alone and was widely discussed in the twittersphere. In mid July, Inside Higher Ed reported that an ongoing study revealed a recent, precipitous drop in the interest of STEM (Science/Technology/Engineering/Mathematics) Ph.D. students in pursuing an academic tenure-track career. These results confirmed those published in PLoS ONE in May, which showed that among STEM students surveyed in 2010, interest in pursuing an academic career declined over the course of their Ph.D. studies:

Figure 1. Percent of STEM Ph.D. students judging a career to be “extremely attractive”. Taken from Sauermann & Roach (2012).

Even for those lucky enough to get an academic appointment, the bad news is that it seems to be getting harder to establish a research program. For example, the average age at which a researcher gets their first NIH grant (a virtual requirement for tenure for many biologists in the US) is now 42. National Public Radio quips “50 is the new 30, if you’re a promising scientist.”

I’ve found these reports very troubling since, after nearly fifteen years of slogging it out since my undergrad to achieve the UK equivalent of a “tenured” academic position, I am acutely aware of how hard the tenure track is for junior scientists at this stage in history. On a regular basis I see how the current system negatively affects the lives of talented students, post-docs and early-stage faculty. I have for some time wanted to write about my point of view on this issue, since I see these trends as indicators of bigger changes in the growth of science than individuals may be aware of. I’ve finally been inspired to do so by a recent piece by Euan Ritchie and Joern Fischer published in The Conversation entitled “Cracks in the ivory tower: is academia’s culture sustainable?“, which I think hits the nail on the head about the primary source of the current problems in academia: the deeply flawed philosophy that “more is always better”.

My view is that the declining opportunities and increasing malaise among early-stage academics are a by-product of the fact that the era of exponential growth in academic research is over. That’s nonsense, you say: the problems we are experiencing now are because of the current global economic downturn. What’s happening now is a temporary blip; things will return to happier days when we get back to “normal” economic growth and governments increase investment in research. Nonsense, I say. This has nothing to do with the current economic climate and everything to do with long-term trends in the growth of scientific activity over the last three centuries.

My views are almost entirely derived from a book written by Derek de Solla Price entitled Little Science, Big Science. Price was a scientist-cum-historian who published this slim tome in 1963, based on a series of lectures given at Brookhaven National Lab in 1962. It was a very influential book in the 1960s and 1970s, since it introduced citation analysis to a wide audience. Along with Eugene Garfield of ISI/Impact Factor fame (or infamy, depending on your point of view), Price is credited as one of the founding fathers of scientometrics. Sadly, this important book is now out of print, the Wikipedia page on it is a stub with no information, and Google Books has not scanned it into their electronic library, which shows just how far the ideas in this book have fallen out of the current consciousness. I am not the first to lament that Price’s writings have been ignored in recent years.

In a few short chapters, Price covers large-scale trends in the growth of science and the scientific literature from its origins in the 17th century, which I urge readers to explore for themselves. I will focus here on only one of his key points that relates to the matter at hand: the pinch we are currently feeling in science. Price shows that as scientific disciplines matured in the 20th century, they achieved a characteristic exponential growth rate, which appears linear on a logarithmic scale. This can be seen in terms of both the output of scientific papers (Figure 2) and the number of scientists themselves (Figure 3).

Figure 2. Exponential growth in the output of scientific papers (taken from de Solla Price 1963).

Figure 3. Exponential growth in the number of scientists (taken from de Solla Price 1963).

Price showed that there was a roughly constant doubling time for different forms of scientific output (number of journals, number of papers, number of scientists, etc.) of about 10-15 years. That is, the amount of scientific output at a given point in history is twice as large as it was 10-15 years earlier. This incessant growth is why we all feel like it is so hard to keep up with the literature (and, incidentally, why I believe that text mining is now an essential tool). These observations led Price to make the famous claim that “Eighty to 90 per cent of all the scientists who have ever lived are alive now”.

Crucially, Price pointed out that the doubling time of the number of scientists is much shorter than the doubling time of the overall human population (~50 years). Thus, the proportion of scientists relative to the total human population has been increasing for decades, if not centuries. Price makes the startling but inescapable consequences of this observation very clear: either everyone on earth will one day be a scientist, or the growth rate of science must decrease from its previous long-term trend. He then argues that the most likely outcome is the latter, and that scientific growth rates will change from exponential to logistic growth and reach saturation sometime within 100 years of the publication of his book in 1963 (Figure 4):

Figure 4. A model of logistic growth for Science (taken from de Solla Price 1963).
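To make the contrast between these two growth regimes concrete, here is a minimal sketch in Python. The specific numbers are illustrative assumptions of mine (a 15-year doubling time from the top of Price’s range, an arbitrary starting value and an arbitrary saturation ceiling); Price’s argument is about the shape of the curves, not any particular parameter values.

```python
import math

DOUBLING_TIME = 15.0                    # years; Price estimates 10-15 years
RATE = math.log(2) / DOUBLING_TIME      # continuous growth rate giving that doubling time

def exponential(n0, t):
    """Unconstrained exponential growth after t years."""
    return n0 * math.exp(RATE * t)

def logistic(n0, t, ceiling):
    """Logistic growth: initially exponential, then saturating at the ceiling."""
    a = (ceiling - n0) / n0
    return ceiling / (1 + a * math.exp(-RATE * t))

# Illustrative only: start at 1 arbitrary unit of "science", cap logistic growth at 100 units.
for t in range(0, 151, 30):
    print(f"year {t:3d}:  exponential = {exponential(1, t):7.1f}   logistic = {logistic(1, t, 100):5.1f}")
```

Run over a 150-year span, the two curves are nearly indistinguishable at first and then diverge sharply as the logistic curve bends towards saturation, which is precisely the transition Price argued science would have to pass through.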

So maybe the bad news circulating in labs, coffee rooms and over the internet is not a short-term trend driven by the current economic downturn, but instead reflects a long-term trend in the history of science? Perhaps the crunch we are currently experiencing in academic research is a byproduct of the fact that we are in Price’s transition from exponential to logistic growth in science? If so, the pressures we are experiencing may simply reflect that the current rate of production of scientists is no longer matched to the long-term demand for scientists in society.

Whether or not this model of growth in science is true is clearly debatable (please do so below!). But if we are in the midst of making the transition from exponential to logistic growth in science, then there are a number of important implications that I feel scientists at all stages of their careers should be aware of:

1) For PhD students and post-docs: you have every right to feel that the opportunities in science may not be there for you as they were for your supervisors and professors. This sucks, I know, but one important take-home message is that it may not have anything to do with your abilities; it may just have to do with when you came along in history. I am not saying that there will be no opportunities in the future, just fewer as a proportion of the total number of jobs in society relative to current levels. I’d argue that this is a cautiously optimistic view, since anticipating the long-term trends will help you develop more realistic and strategic approaches to making career choices.

2) For early-stage academics: your career trajectory is going to be more limited than you anticipated going into this gig. Sorry mate, but your lab is probably not going to be as big as you might think it should be, you will probably get fewer grants, and you will have more competition for resources than you witnessed in your PhD or post-doc supervisor’s lab. Get used to it. If you think you have it hard, see point 1): you are lucky to have a job in science. Also bear in mind that the people judging your career progression may hold expectations that are no longer relevant, and as a result you may have more conflict with senior members of staff during the earlier phases of your career than you expect. Most importantly, if you find that this new reality is true for you, then do your best to adjust your expectations for PhD students and post-docs as well.

3) For established academics: you came up during the halcyon days of growth in science, so bear in mind that you had it easy relative to those trying to make it today. When you set your expectations for your students or junior colleagues in terms of performance, recruitment or tenure, be sure to take on board that they have it much harder now than you did at the corresponding point in your career [see points 1) and 2)]. A corollary of this point is that anyone actually succeeding in science now and in the future is (on average) probably better trained and harder working than you were at the corresponding point in your career, so on the whole you are probably dealing with someone who is more qualified for their job than you would be. So don’t judge your junior colleagues against out-of-date standards (which you might not be able to meet yourself in the current climate) or promote values from a bygone era of incessant growth. Instead, adjust your views of success for the 21st century and seek to promote a sustainable model of scientific career development that will fuel innovation for the next hundred years.

References

de Solla Price D (1963) Little Science, Big Science. New York: Columbia University Press.

Kealey T (2000) More is less: economists and governments lag decades behind Derek Price’s thinking. Nature 405(6784). PMID: 10830939

Sauermann H, Roach M (2012) Science PhD career preferences: levels, changes, and advisor encouragement. PLoS ONE 7(5). PMID: 22567149


Where Do Bioinformaticians Host Their Code?

A while back my interest was piqued by a discussion on BioStar about “Where would you host your open source code repository today?“, which got me thinking about the relative merits of the different sites for hosting bioinformatics software. I am not an evangelist for any particular version control system or hosting site, and I leave it to readers to look into these systems themselves, or to consult the BioStar thread, for more on the relative merits of the major hosting services: Sourceforge, Google Code, github and bitbucket. My aim here is not to advocate any particular system (although as a lab head I have certain predilections*), but to answer a straightforward empirical question: where do bioinformaticians host their code?

To do this, I’ve queried PubMed for keywords in the URLs of the four major hosting services listed above to get estimates of their uptake in biomedical publications. This simple analysis clearly has some caveats, including the fact that many publications link to hosting services in sections of the paper outside the abstract, and that many bioinformaticians (frustratingly) release code via institutional or personal webpages. Furthermore, the various hosting services arose at different times, so it is also important to interpret these data in a temporal context. These caveats (and others) aside, the following provides an overview of how the bioinformatics community votes with its feet in terms of hosting code on the major repository systems…
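For the record, the counts below came from simple PubMed keyword searches on the hosting-service URLs. The snippet that follows is a minimal sketch of that kind of query using Biopython’s Entrez module; the search strings and email address are illustrative assumptions on my part, not necessarily my exact queries.

```python
from Bio import Entrez  # Biopython

Entrez.email = "you@example.org"  # placeholder; NCBI asks for a contact address

# Illustrative search strings restricted to title/abstract ([tiab]); not my exact queries.
REPOSITORIES = {
    "Sourceforge": "sourceforge.net[tiab]",
    "Google Code": "code.google.com[tiab]",
    "github":      "github.com[tiab]",
    "bitbucket":   "bitbucket.org[tiab]",
}

for name, term in REPOSITORIES.items():
    handle = Entrez.esearch(db="pubmed", term=term, retmax=0)  # retmax=0: we only need the count
    record = Entrez.read(handle)
    handle.close()
    print(f"{name}: {record['Count']} papers")
```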

First of all, the bad news: of the many thousands of articles published in the field of bioinformatics, as of Dec 31, 2012 just under 700 papers (n=676) have easily discoverable code linked to a major repository in their abstract. The totals for each repository system are: 446 on Sourceforge, 152 on Google Code, 78 on github and only 5 on bitbucket. So, by far, the majority of authors have chosen not to host their code on a major repository. But for the minority of authors who have chosen to release their code via a stable repository system, most use Sourceforge (which is the oldest and most established source code repository) and effectively nobody is using bitbucket.

The first paper linking published code to a major repository system appeared only a decade ago, in 2002, and a breakdown of the growth in code hosting since then looks like this:

Year    Sourceforge    Google Code    github
2002    4              0              0
2003    3              0              0
2004    10             0              0
2005    21             1              0
2006    24             0              0
2007    30             1              0
2008    30             10             0
2009    48             10             0
2010    69             21             8
2011    94             46             18
2012    113            63             52
Total   446            152            78

Trends in bioinformatics code repository usage 2002-2012.

A few things are clear from these results: 1) there is an upward trend in biomedical researchers hosting their code on major repository sites, 2) Sourceforge has clearly been the dominant player in the biomedical code repository game to date, but 3) the current growth rate of github appears to be outstripping both Sourceforge and Google Code. Furthermore, it appears that github is not experiencing any lag in uptake, as was observed in the 2002-2004 period for Sourceforge and the 2006-2009 period for Google Code. It is good to see that new players in the hosting market are being accepted more quickly than they were a decade ago.
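The per-year breakdown above can be reproduced by adding a publication-date filter to each query. Again, this is a hedged sketch assuming Biopython, with an illustrative search string rather than necessarily my exact one:

```python
from Bio import Entrez  # Biopython

Entrez.email = "you@example.org"  # placeholder contact address

def count_for_year(term, year):
    """Count PubMed records matching `term` with a publication date in `year`."""
    query = f"({term}) AND {year}[dp]"  # [dp] = date of publication
    handle = Entrez.esearch(db="pubmed", term=query, retmax=0)
    record = Entrez.read(handle)
    handle.close()
    return int(record["Count"])

# For example, one column of the table above:
print([count_for_year("github.com[tiab]", year) for year in range(2002, 2013)])
```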

Hopefully the upward trend for bioinformaticians to release their code via a major code hosting service will continue (keep up the good work, brothers and sisters!), and this will ultimately create a snowball effect such that it is no longer acceptable to publish bioinformatics software without releasing it openly into the wild.


  • As a lab manager I prefer to use Sourceforge in our published work, since Sourceforge has a very draconian policy when it comes to deleting projects, which prevents accidental or willful deletion of a repository. In my opinion, Google Code and (especially) github are too permissive in terms of allowing projects to be deleted. As a lab head, I see it as my duty to ensure the long-term preservation of published code above all other considerations. I am aware that there are mechanisms to protect against deletion of repositories on github and Google Code, but I suspect that most lab heads do not use them and that a substantial fraction of published academic code is one click away from deletion.

Will the Democratization of Sequencing Undermine Openness in Genomics?

It is no secret, nor is it an accident, that the success of genome biology over the last two decades owes itself in large part to the Open Science ideals and practices that underpinned the Human Genome Project. From the development of the Bermuda principles in 1996 to the Ft. Lauderdale agreement in 2003, leaders in the genomics community fought for rapid, pre-publication data release policies that have (for the most part) protected the interests of genome sequencing centers and the research community alike.

As a consequence, progress in genomic data acquisition and analysis has been incredibly fast, leading to major basic and medical breakthroughs, thousands of publications, and ultimately to new technologies that now permit extremely high-throughput DNA sequencing. These new technologies give individual groups sequencing capabilities that were previously achievable only by large sequencing centers. This development makes it timely to ask: how do the data release policies for primary genome sequences apply in the era of next-generation sequencing (NGS)?

My reading of the history of genome sequence release policies condenses the key issues as follows:

  • The Bermuda Principles say that assemblies of primary genomic sequences of human and other organisms should be made available within 24 hrs of their production
  • The Ft. Lauderdale Agreement says that whole genome shotgun reads should be deposited in public repositories within one week of generation. (This agreement was also encouraged to be applied to other types of data from “community resource projects” – defined as research projects specifically devised and implemented to create a set of data, reagents or other material whose primary utility will be as a resource for the broad scientific community.)

Thus, the agreed standard in the genomics field is that raw sequence data from the primary genomic sequence of organisms should be made available within a week of generation. In my view this also applies to so-called “resequencing” efforts (like the 1000 Genomes Project), since genomic data from a new strain or individual is actually a new primary genome sequence.

The key question concerning genomic data release policies in the NGS era, then, is: do these policies apply only to sequencing centers, or to any group producing primary genomic data? Now that you are a sequencing center, are you also bound by the obligations that sequencing centers have followed for a decade or more? This is an important issue to discuss for its own sake in order to promote Open Science, but also for the conundrums it throws up about data release policies in genomics. For example, if individual groups sequencing genomes are not bound by the same data release policies as sequencing centers, then a group at e.g. Sanger or Baylor working on a genome is actually put at a competitive disadvantage in the NGS era, because they would be forced to release their data.

I argue that if the wider research community does not abide by the current practices of early data release in genomics, the democratization of sequencing will lead to the slow death of openness in genomics. We could very well see a regression to the mean behavior of data hoarding (I sometimes call this “data mine, mine, mining”) that is sadly characteristic of most of the biological sciences. In turn this could decelerate progress in genomics, leading to a backlog of terabytes of un(der)analyzed data rotting on disks around the world. Are you prepared to stand by, do nothing and bear witness to this bleak future? ; )

While many individual groups collecting primary genomic sequence data may hesitate to embrace the idea of pre-publication data release, it should be noted that there is also a standard procedure in place for protecting the interests of the data producer, namely the right to the first chance to publish (or co-publish) large-scale analyses of the data, while permitting the wider research community early access. The Ft. Lauderdale agreement recognized that:

…very early data release model could potentially jeopardize the standard scientific practice that the investigators who generate primary data should have both the right and responsibility to publish the work in a peer-reviewed journal. Therefore, NHGRI agreed to the inclusion of a statement on the sequence trace data permitting the scientific community to use these unpublished data for all purposes, with the sole exception of publication of the results of a complete genome sequence assembly or other large-scale analyses in advance of the sequence producer’s initial publication.

This type of data producer protection proviso has been taken up by some community-led efforts to release large amounts of primary sequence data prior to publication, as laudably done by the Drosophila Population Genomics Project (Thanks Chuck!).

While the Ft. Lauderdale agreement in principle tries to balance the interests of the data producers and consumers, it is not without failings. As Mike Eisen points out on his blog:

In practice [the Ft. Lauderdale proviso] has also given data producers the power to create enormous consortia to analyze data they produce, effectively giving them disproportionate credit for the work of large communities. It’s a horrible policy that has significantly squelched the development of a robust genome analysis community that is independent of the big sequencing centers.

Eisen rejects the Ft. Lauderdale agreement in favor of a new policy he calls The Batavia Open Genomic Data License. The Batavia License does not require an embargo period, nor does it require data users to inform data producers of how they intend to use the data, as is expected under the Ft. Lauderdale agreement; however, it does require that groups using the data publish in an open access journal. Therefore the Batavia License is not truly open either, and I fear that it imposes unnecessary restrictions that will prevent its widespread uptake. The only truly Open Science policy for data release is a Creative Commons (CC-BY or CC-Zero) style license with no restrictions other than attribution, a precedent that was established last year for the E. coli TY-2482 genome sequence (BGI you rock!).

A CC-style license will likely be too liberal for most labs generating their own data, and thus I argue we may be better off pushing for individual groups to adopt a Ft. Lauderdale-style agreement, encouraging the (admittedly less than optimal) status quo to be taken up by the wider community. Another option is for researchers to release their data early via “data publications”, such as those being developed by journals like GigaScience and F1000 Reports.

Whatever the mechanism, I join with Eisen in calling for wider participation from the research community in releasing their primary genomic sequence data. Indeed, it would be a truly sad twist of fate if the wider research community did not follow, in the post-NGS era, the genomic data release policies that were put in place in the pre-NGS era to protect their interests. I for one will do my best in the coming years to reciprocate the generosity that has made the Drosophila genomics community so great (in the long tradition of openness dating back to the Morgan school), by releasing any primary sequence data produced by my lab prior to publication. Watch this space.

Did Finishing the Drosophila Genome Legitimize Open Access Publishing?

I’m currently reading Glyn Moody’s (2003) “Digital Code of Life: How Bioinformatics is Revolutionizing Science, Medicine, and Business” and greatly enjoying the writing, as well as the whirlwind summary of the history of bioinformatics and the (Human) Genome Project(s). Most of what Moody says that I am familiar with is quite accurate, and his scholarship is thorough, so I find his telling of the story compelling. One claim in this book that I find new and curious is in his discussion of the sequencing of the Drosophila melanogaster genome, more precisely the “finishing” of this genome, and its impact on the legitimacy of Open Access publishing.

The sequencing of D. melanogaster was done as a collaboration between the Berkeley Drosophila Genome Project and Celera, as a test case to prove that whole-genome shotgun sequencing could be applied to large animal genomes. I won’t go into the details here, but it is widely accepted that the Adams et al. (2000) and Myers et al. (2000) papers in Science demonstrated the feasibility of whole-genome shotgun sequencing, while it was a lesser-known paper by Celniker et al. (2002) in Genome Biology, reporting the “finished” D. melanogaster genome, that proved the accuracy of whole-genome shotgun sequence assembly. No controversy here.

More debatable is what Moody goes on to write about the Celniker et al. (2002) paper:

This was an important paper, then, and one that had a significance that went beyond its undoubted scientific value. For it appeared neither in Science, as the previous Drosophila papers had done, nor in Nature, the obvious alternative. Instead, it was published in Genome Biology. This describes itself as “a journal, delivered over the web.” That is, the Web is the primary medium, with the printed version offering a kind of summary of the online content in a convenient portable form. The originality of Genome Biology does not end there: all of its main research articles are available free online.

A description then follows of the history and virtues of PubMed Central and the earliest Open Access biomedical publishers BioMed Central and PLoS. Moody (emphasis mine) then returns to the issue of:

…whether a journal operating on [Open Access] principles could attract top-ranked scientists. This question was answered definitively in the affirmative with the announcement and analysis of the finished Drosophila sequence in January 2003. This key opening paper’s list of authors included not only [Craig] Venter, [Gene] Myers, and [Mark] Adams, but equally stellar representatives of the academic world of Science, such as Gerald Rubin, the boss of the fruit fly genome project, and Richard Gibbs, head of sequencing at Baylor College. Alongside this paper there were no less than nine other weighty contributions, including one on Apollo, a new tool for viewing and editing sequence annotation. For its own Drosophila extravaganza of March 2000, Science had marshalled seven papers in total. Clearly, Genome Biology had arrived, and with it a new commercial publishing model based on the latest way of showing the data.

This passage resonated with me since I was working at the BDGP at the time this special issue on the finishing of the Drosophila genome in Genome Biology was published, and was personally introduced to Open Access publishing through this event.  I recall Rubin walking the hallways of building 64 on his periodic visits promoting this idea, motivating us all to work hard to get our papers together by the end of 2002 for this unique opportunity. I also remember lugging around stacks of the printed issue at the Fly meeting in Chicago in 2003, plying unsuspecting punters with a copy of a journal that most people had never heard of, and having some of my first conversations with people on Open Access as a consequence.

What Moody doesn’t capture in this telling is that Rubin’s decision to publish in Genome Biology almost surely owes itself to the influence that Mike Eisen had on Rubin and others in the Berkeley genomics community at the time. Eisen and Rubin had recently collaborated on a paper, Eisen had made inroads in Berkeley on the Open Access issue by actively recruiting signatories for the PLoS open letter the year before, and Eisen himself had published his first Open Access paper in Genome Biology in October 2002. So clearly the idea of publishing in Open Access journals, and in Genome Biology in particular, was in the air at the time, and it may not have been as bold a step for Rubin to take as Moody implies.

Nevertheless, Moody’s point may have some truth to it, and I think it is interesting to consider whether the long-standing open data philosophy of the Drosophila genetics community that led to the Genome Biology special issue was indeed a key turning point in the widespread success of Open Access publishing over the next decade. Surely the movement would have taken off anyway at some point. But in late 2002, when the BioMed Central journals were the only place to publish gold Open Access articles, few people had tested the waters since the launch of the BMC journals in 2000. While we cannot replay the tape, Moody’s claim is plausible in my view, and it is interesting to ask whether widespread buy-in to Open Access publishing in biology might have been delayed if Rubin had not insisted that the efforts of the Berkeley Drosophila Genome Project be published under an Open Access model.

UPDATE 25 March 2012

After tweeting this post, Eisen and Moody both responded on Twitter [embedded tweets not preserved here].

UPDATE 19 May 2012

It appears that the publication of another part of the Drosophila (meta)genome, its Wolbachia endosymbiont, played an important role in converting Jonathan Eisen to supporting Open Access. Read more here.

The Roberts/Ashburner Response

A previous post on this blog shared a helpful boilerplate response to editors for politely declining to review for non-Open Access journals, which I received originally from Michael Ashburner. During a quick phone chat today, Ashburner told me that he in fact inherited a version of this response originally from Nobel laureate Richard Roberts, co-discoverer of introns, lead author on the Open Letter to Science calling for a “Genbank” of the scientific literature, and long-time editor of Nucleic Acids Research, one of the first classical journals to move to a fully Open Access model. So to give credit where it is due, I’ve updated the title of the “Just Say No” post to make the attribution of this letter more clear. We owe both Roberts and Ashburner many thanks for paving the way to a better model of scientific communication and leading by example.

Time Management Tips from Francis Crick

Academic researchers nowadays are asked to participate in a multitude of tasks outside the core remit of a scholar, which have succinctly been summarized by Scott Hawley as “to learn, to write and to teach.” Deftly handling requests for participation in “non-core” activities is an art, but is essential if one wishes to maintain an active research programme. While it is clear that email and computers have made things worse, this problem is indeed not new and we can look to history for good strategies to cope with it. Chris Beckett writes of:

A response strategy [Francis] Crick adopted in the 1960s to cope with an enormous post and to make a serious point playfully was the occasional use of a pre-printed postcard offering a number of reply options. The seventeen listed (see Figure 3) are a faithful reflection of the requests he regularly received.

Francis Crick's all-purpose reply card.

While I don’t expect to have the opportunity to use many of these myself in the future, and there are some that I don’t agree with rejecting outright (e.g. reading your manuscript), this list serves as a checklist of non-core academic activities and a useful reminder of what we didn’t go into science for in the first place.