This post is preparatory to a section in the developing sustainability reference model. We welcome comments on this approach. Is this a sufficient set of questions to consider sustainability in particular (general) contexts? Are there major issues that have been forgotten in this scenario (bearing in mind that it is always a generalisation)?
The Blue Ribbon Task Force looked at four different sustainability contexts: Scholarly discourse, Research Data, Commercially-owned cultural content, and Collectively produced web content. We thought it worth revisiting those contexts in the light of our developing reference model. The first of these is research data.
It should be said at the outset that “research data” is an extremely broad term, with very different meanings across the various scholarly disciplines, and even within the same discipline given different approaches. Nevertheless, we believe there are useful things to say.
The analysis will address 7 important questions:
- Who benefits from use?
- Who selects what is kept?
- Who owns the resource?
- Who preserves (or manages) the resource?
- Who pays?
- What are key attributes specific to this context?
- What are the key risks?
Who benefits from use of research data?
In the short term, the main beneficiary of good management and preservation of research data is the researcher herself; well-managed data are a great help in carrying out the research.
Once papers are written and in the publishing process, having research data accessible to reviewers can be helpful. Likewise, when new project proposals are written, having data accessible gives referees greater confidence.
Some research funders require a search process for pre-existing data before certain kinds of research are undertaken. Archiving data can help prevent unnecessary experiments being run.
In some areas of observational research archiving data is a requirement, as observations cannot be made at a later date under the same conditions. This is particularly true of environmental and social sciences. Medical sciences also have strong requirements for research data to be archived. In all these cases the archived research data can be seen as another research instrument with data from the past available for re-use. Some researchers build their careers on re-analysis of existing research data.
Finally, the public at large has an interest in research data. There may be few who can make sense of it, but there are many examples where the public has made good use of data archives. It is worth remembering that many researchers enter “the public” on leaving the research area, and some of these individuals have as much skill in making sense of existing data as current researchers.
Who selects what is kept?
In the first instance the researchers themselves select the data they wish to keep, and their purpose is to further the research itself. These selections should be influenced by a data management plan (although we recognise that such plans are not yet common).
Researchers also select the data they wish to use as the basis for the argument written up in their publications. This selection may be influenced by reviewers or referees. These data should be kept available in a static state (if possible) in order to be accessible to readers who might want to use them to validate the publications’ conclusions.
When (or if) the data move from the researchers’ control into a data archive or data repository, the selection will be done by the relevant data archivist (or equivalent), informed by selection criteria or collection guidelines. In some cases the selection process may be influenced by (or carried out by) a peer-review panel of experts.
Who owns the resource?
This can be a seriously difficult question for much research data. The answer may depend on the nature of the data and the legal jurisdiction you live in. It may also depend on any prior agreements written into project proposals, memoranda of understanding, data management plans, etc.
Roughly speaking a fact is not copyrightable, but the expression of the fact may be. From this stems a great deal of complexity. In practice, you may never be certain who owns research data in (for example) a multi-group, multi-institution, international collaboration.
In practice, most research data is treated as being owned by the researcher or research group who generated it and has custody of it. Many researchers regard their data as being advantageous to them and do not wish to share it, holding it almost as a trade secret. There are also moves suggesting that researchers should explicitly disavow ownership of their research data after an initial period, using the Creative Commons CC0 tool to put data into the public domain, so far as is possible in their legal jurisdiction. This reduces problems in re-using such data.
Data archives will usually not claim to take ownership of data, but operate under a licence from the researchers to preserve the data and make it available.
Who preserves (or manages) the resource?
Again, in the first instance the researcher manages the resource. Practice here will vary widely. Some laboratory-based science requires data management according to strict protocols, using lab notebooks to record details. Much individual scholarship will be based entirely on the chosen practices of that individual. Many research groups will be unspoken amalgams of individual practice.
For data associated with a publication, in many instances the publication (e.g. a journal) will have mechanisms such as supplemental data for holding data that support an item but do not fit within the rhetorical text. In many cases that supplemental data will be frozen as tables or images in a PDF file, and so not at all re-usable. It would be better for such data to be deposited in an institutional (or department, or research group, or subject) data repository in a machine-processable form.
Data with an expected longer life should not be left attached to a personal or even departmental web page, but should be moved to some kind of repository or data service with a more sustainable future.
Who pays?
The $64,000 question! Those who fund research are clearly prepared to pay for the management of the data required for the research during the project duration. Some contributions to this funding may come from other funding streams, such as institutional infrastructure funds.
A research group that has some longevity greater than a single research project (or a department of researchers in related areas) will usually need some mechanism for retaining data in a re-usable form, as it forms part of their intellectual capital.
At some point, however, data of sufficient value should be handed off elsewhere. If a relevant subject or discipline data archive or service exists, then that should always be the first choice. If not, an institutional data archive would be helpful. If one does not exist, chivvy your librarian or institution’s research director to get one established!
One factor limiting the spread of subject data archives (and the sustainability of those that exist) is that money for infrastructural support such as data archives is in direct competition with money for more research. This is a real limiting factor on the sustainability of data archives, and is one of the reasons behind the infamous demise of the UK Arts and Humanities Data Service.
What are key attributes specific to this context?
In 2007 the late Jim Gray from Microsoft invented the term “the 4th paradigm” to represent research based on masses of collected data (see Hey, Tansley, & Tolle (Eds.) (2009). The Fourth Paradigm: Data-Intensive Scientific Discovery. Redmond: Microsoft Research. Retrieved from http://research.microsoft.com/fourthparadigm).
The idea of using other people’s data is as old as research but has changed greatly in modern times. The internet and the Web especially have made data sharing feasible in ways that were unimaginable before. However, data sharing has still not migrated into common practice amongst many researchers (particularly since many senior researchers started their careers well before current capabilities came into being).
It is also worth noting in passing that it is entirely wrong to treat research data as if they were mere extensions of familiar text objects. Research data vary dramatically in scale in at least 5 different ways: size of data object (from tiny to enormous), numbers of objects (to the billions and beyond), rates of deposit, rates of change (yes, change), and rates of access. These and many other features make research data potentially entirely unlike data or text that have come before.
What are the key risks?
Risks for research data management, curation and preservation are legion.
At the point of capture, there are risks of poor data management, of poor context capture (i.e. metadata and pals), and of course downright fraud.
Throughout data processing (which may extend for years, of course), there are all the risks above (since new data products will be generated during processing). There is also the risk that the “computational lineage” will not be (adequately) captured (see Bose & Frew (2005). Lineage retrieval for scientific data processing: a survey. ACM Computing Surveys, 37(1), 1-28. doi:10.1145/1057977.1057978).
At the stage of archiving, perhaps the principal risk is that the focus of interest has moved on. The Principal Investigator is writing the grant to follow up on the follow-up project. The staff have mostly moved on. The PhD students are in the last stages of writing up and have other priorities.
These risks are compounded if the PI is inclined to hoard data for some supposed advantage (the suggestion that “the coolest thing to do with your data will be thought of by someone else”, attributed to Rufus Pollock, doesn’t please some people). There will likely be some uncertainty about the rights or permissions needed. There may be privacy or ethical issues. These and similar arguments will decrease the desire to archive data.
Many feel that taking a short-term position (“the data are on my web site”) is enough.
And to top this, in many disciplines and institutions, there will be very limited options for long-term archiving anyway. It’s not surprising that the default option is… do nothing.
Once archived, if the archive itself comes under threat (and given the competition for funding for infrastructure versus research mentioned above, this is always a possibility) there are very limited handoff options.