Earlier this year Hanson, Sugden and Alberts [1] argued in a piece entitled “Making Data Maximally Available” that journals like Science play a crucial role in making scientific data “publicly and permanently available” and that efforts to improve the standard of supporting online materials will increase their utility and the impact of their associated publications. While I whole-heartedly agreed with their view that improving supplemental materials is a better solution to the current disorganized [2] and impermanent [3] state of affairs (as opposed to the unwise alternative of discarding them altogether [4]), there were a few things about this piece that really irked me. I had intended to write a letter to the editor about them with a colleague, but unfortunately that never materialized, so I thought I’d post my comments here.
First, the authors make an artificial distinction between the supporting online materials associated with a paper and the contents of the paper itself. Clearly the most important data in a scientific report is in the full text of the article, and thus if making data in supporting online materials “maximally available” is a goal, surely so must be making data in the full-text article itself. Second, in the context of the wider discussion on “big data” in which these points are made, it must be noted that maximal availability is only one step towards maximal utility, the other being maximal access. As the entire content of Science magazine is not available for unrestricted download and re-use from PubMed Central’s Open Access repository, maximal utility of data in the full text or supplemental materials of articles published in Science is currently fettered because it is not available for bulk text mining or data mining. Amazingly, this is true even for Author-deposited manuscripts in PubMed Central, which are not currently included in the PubMed Central Open Access subset and are therefore not available for bulk download and re-use.
Therefore it seems imperative that, in addition to making a clarion call for the improved availability of data, code and references in supplemental materials, the Editors of Science should issue a clear policy statement about the use of full-text articles and supplemental online materials that are published in Science for text and data mining research. At a minimum, Science should join other high-profile journals such as Nature [5] in clarifying how Author-deposited manuscripts in PubMed Central, which funding body mandates require to be deposited for these very purposes, can be used for text and data mining. Additionally, Science should make a clear statement about the copyright and re-use policies for the supporting online materials of all published articles, which are freely available for download without a Science subscription and currently fall in the grey area between restricted and open access.
As we move firmly into the era of big data, where issues of access and re-use of data are becoming increasingly acute, Science, as the representative publication of the world’s largest general scientific society, should take the lead in opening its content for text and data mining, to the mutual benefit of authors, researchers and the AAAS.
1. Hanson et al. (2011) Making Data Maximally Available. Science 331:649
2. Santos et al. (2005) Supplementary data need to be kept in public repositories. Nature 438:738
3. Anderson et al. (2006) On the persistence of supplementary resources in biomedical publications. BMC Bioinformatics 7:260
4. Journal of Neuroscience policy on Supplemental Material
5. Nature Press release on data- and text-mining of self-archived manuscripts
- Watts Up with That – An Open Letter to Bruce Alberts of Science Magazine
- Stoat – Nah, don’t believe it
- Kitware Blog – AAAS: Your Paper MUST include Data, MUST include Code.
- NIF Blog – Supplementary material: Can NIF bring order to the netherworld of publishing?
- The Tree of Life – Calling on AAAS to Deposit all Archives of Science in Pubmed Central
- NeuroDojo – Occupy Science (the journal)