This week Science magazine has an interesting special issue on scientific data, covering a variety of topics from data backup and data visualization to open data. It seems these contributions are free to access for registered users of their website, and it is certainly worthwhile to have a look.
The editorial in particular lays out Science’s policy on open data. Sharing scientific results is of course a motivation for publishing a paper in the first place. And to allow for independent verification of scientific results, the data underlying a publication has to be available and shared with other scientists. This sharing has to be done in a permanent way that guarantees access to the archives in the future as well.
Is the data analysis traceable?
However, there is another point that hasn’t come across that strongly in this special issue, but one that I also consider very important: data processing itself needs to be tracked. By this I mean that the steps from the raw scientific data as measured, all the way to the plots in a scientific paper, need to be traceable.
Logging data manipulation is important not only to prevent fraud but also to allow the data to be re-analysed if needed, for example to uncover errors in the analysis. That of course implies that custom computer code is preserved. But it also means that any significant intermediate data generated during the analysis is preserved, much in the same way that any edit in Wikipedia can be traced back step by step.
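To make this concrete, here is a minimal sketch, in Python, of what such step-by-step traceability could look like. The two processing steps (stripping comment lines, sorting values) are hypothetical examples, not any particular lab’s workflow; the idea is simply that each step records a checksum of its input and output, so the chain from raw data to final result can be verified later.

```python
import hashlib
import json

def checksum(data: bytes) -> str:
    """Return a SHA-256 hex digest identifying a data snapshot."""
    return hashlib.sha256(data).hexdigest()

def run_step(name, func, data: bytes, log: list) -> bytes:
    """Apply one processing step and append a provenance record to the log."""
    result = func(data)
    log.append({
        "step": name,
        "input_sha256": checksum(data),
        "output_sha256": checksum(result),
    })
    return result

# Hypothetical two-step analysis of a tiny raw data file.
raw = b"# raw measurement\n3\n1\n2\n"
log = []

cleaned = run_step(
    "strip_comments",
    lambda d: b"\n".join(l for l in d.splitlines()
                         if not l.startswith(b"#")) + b"\n",
    raw, log)

final = run_step(
    "sort_values",
    lambda d: b"\n".join(sorted(d.split())) + b"\n",
    cleaned, log)

# The log is the audit trail: every step, with checksums linking them.
print(json.dumps(log, indent=2))
```

Because each record’s output checksum must match the next record’s input checksum, any undocumented manipulation between steps becomes detectable, which is exactly the property an archive of intermediate data would need.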
Practical issues
This kind of data archival, from preserving software and hardware to keeping multiple versions of large data sets, is probably something that can easily go beyond the capabilities of smaller research groups, and much more discussion is needed on how such archiving of data (and sharing of it) can be facilitated. This raises questions such as whether commercial lab software could achieve this task, or whether central facilities at a university or national level are better suited, and, of course, what the role of journals should be in that process.
Judging from my experience visiting research labs, data safeguarding and archiving may be on many researchers’ minds, but clearly there is still a long way to go before we are really able to establish the kind of curation of scientific data that is fit for purpose. Putting the issue on the agenda is certainly an important step.
February 11, 2011 at 14:39
I am somewhat surprised that you mention the role journals might have in this archival process, yet do not mention the possibility of partnering with libraries (at the university and national laboratory level) to create institutional repositories of data. Librarians have expert skill at classifying, providing access to, storing, and preserving data in many formats, including electronic/computer files.
February 11, 2011 at 14:46
Yes, good point! My feeling is generally that ultimately such archiving should be done in a more centralized fashion, i.e. beyond backup disks created by a student, for example. When I wrote “on a national level”, I was thinking of national archives, but libraries certainly should have been mentioned explicitly; thanks for pointing this out.
February 11, 2011 at 15:24
How would such a system be run? You’ll have trouble getting private companies to pony up for this kind of a database, because while they don’t necessarily want to store the information themselves, they’re not going to want others to have access to what they deem proprietary information. Their bottom line will also come up, and data storage is not “sexy” on a budget, however necessary. Your best shot would probably involve universities and governmental institutions. Libraries have developed a lot more sophisticated systems than just sitting a student down with a disk or CD burner. Your information is useless without a way to access it. Libraries would actually be very useful in developing a system to actually retrieve the information once it’s in the archive.
February 11, 2011 at 15:42
Well, I think this is a pretty complex issue. Before talking about structures for archiving, I think one also needs to talk about data formats, how to preserve software, etc. By commercial lab software I mean the commercial lab-software packages already on the market that also create data backups. That is archiving at a very localized level, which could be one possibility. But it doesn’t address compatibility issues (also a problem for vendors of scientific instruments) or more centralized data handling, of course. So to me, the issue needs to be seen across the entire system, from data creation to archiving and retrieval, e.g. via libraries as you suggest. But at this stage I think no one has a clear answer. And that is the problem we should attempt to solve.
PS: on the data backup by students: there I meant research students within research groups, not at libraries. Well, that is at least how I backed up my data 10 years ago; a lot has probably changed…