ADMIRAL Benefits Case Study
ADMIRAL (A Data Management Infrastructure for Research Across the Life Sciences) [1] is a UK JISC-funded project to facilitate the capture of research data and their subsequent publication via an institutional repository. It is being conducted by the Image Bioinformatics Research Group in the Zoology Department of Oxford University, the Oxford Bodleian Library Service, and the British Library. The goal is to make it easy for researchers to collect, curate, submit, publish and review datasets in support of conventional paper publications. (We refer to this as "sheer curation" [2].)
Data storage within the ADMIRAL Project is a two-stage process, first to a local private file store, and then of selected datasets to a data repository, from which they may be published. The main day-to-day interface between researchers and the ADMIRAL system is a shared file system implemented using common open source software (Linux, Samba, etc.), which is easily accessed from most personal computer systems without requiring installation of additional software. Researchers are initially encouraged to use this for keeping copies of their work-in-progress datasets, by provision of automatic daily backups to a University-managed backup facility. The shared file system is overlaid with Web access capabilities; i.e. data in the shared file system can also be accessed and updated using HTTP and WebDAV protocols, using the same access credentials. We have also developed additional services for processing and presentation of data in the ADMIRAL file system; one such service permits packaging and submission of selected files as a dataset (or 'research object') to the Oxford Databank, a data repository run by the university library service. Other Web services, yet to be developed, may facilitate augmentation of local data with information from global repositories (semantic enhancement), or running the data through externally provided analytical services or work flows.
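As an illustration of the Web access layer described above, the following minimal Python sketch shows how a file might be written to the shared file system over WebDAV, where storing a file is an ordinary HTTP PUT made with the researcher's usual credentials. The host name, path layout, credentials and function names are hypothetical and do not come from the ADMIRAL project. The sketch also assumes HTTP Basic authentication, which the text above does not specify.

```python
import base64
import urllib.parse
import urllib.request


def build_put_request(base_url, path, data, user, password):
    """Build (but do not send) an HTTP PUT request that uploads one file."""
    # Percent-encode the path; "/" is left alone so the directory layout survives.
    url = base_url.rstrip("/") + "/" + urllib.parse.quote(path.lstrip("/"))
    request = urllib.request.Request(url, data=data, method="PUT")
    # Assumption: the server accepts HTTP Basic authentication.
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request.add_header("Authorization", "Basic " + token)
    return request


def upload(request):
    """Send a prepared request and return the HTTP status code."""
    with urllib.request.urlopen(request) as response:
        return response.status


# Hypothetical host, path and credentials, for illustration only.
req = build_put_request(
    "https://admiral.example.org/data/silk",
    "tensile tests/run 01.csv",
    b"strain,stress\n0.01,12.5\n",
    "researcher",
    "secret",
)
print(req.get_method(), req.full_url)
```

Building the request separately from sending it keeps the example runnable without a server; calling `upload(req)` against a real WebDAV endpoint would perform the transfer.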
ADMIRAL is working with four diverse research groups within the Zoology Department: the Silk Group that studies the biophysical properties of spider silk, with a view to developing new biomimetic materials having similar properties; the Evolutionary Development group that studies the evolution of genes affecting organism development, particularly those determining the body plan of invertebrates; the Behavioural Ecology Group whose members are, among other things, studying tool-making behaviour in crows and risk-taking behaviour in starlings; and the Elephant Group, whose members work alongside the Save The Elephants Charity in Kenya to track elephant movements and to develop strategies for reducing conflict between wild elephant populations and human settlements.
The ADMIRAL Project PI is David Shotton, Reader in Image Bioinformatics in the Zoology Department at Oxford University, a "lapsed cell biologist" who is now working on generic ways to preserve and publish biological data, enabling reuse for purposes not necessarily envisaged within the original research project. The ADMIRAL project manager is Graham Klyne, Computing Officer and lead developer within the Image Bioinformatics Research Group in the Zoology Department. He has previously helped to develop a variety of Internet and web standards, has built software systems for scientific and technical applications, and, with David Shotton, has been closely involved in a number of preceding JISC projects (Defining Image Access, FlyWeb, Shuffl, MILARQ).
Established Practice and Challenges
From our initial interviews and data audits with researchers, we found their current practice for research data management to be somewhat ad hoc, with extreme variability between research groups and even between researchers within a single group. Research group leaders had some awareness of the need to manage and preserve data, but for other group members (post-docs, graduate students) this issue was very much secondary to their immediate research interests and their goal of publishing papers in high-impact journals.
Little thought was given to the eventual publication of their own research datasets. This was, in part, due to there being little recognition among researchers that research data are, in their own right, serious academic outputs, and also because there is a tendency among some researchers to regard their data as a personal resource for their own exploitation, rather than something to be shared.
This attitude is changing slowly - and we are attempting to catalyse this sea-change in attitudes through the ADMIRAL Project itself - as research councils and other funders publish data sharing policies and require that research data management plans are part of any submitted research proposal. For example, the BBSRC have published a data sharing policy [4] that states:
- "BBSRC expects research data generated as a result of BBSRC support to be made available with as few restrictions as possible in a timely and responsible manner to the scientific community for subsequent research"
- "All applications seeking research grant funding from BBSRC must submit a statement on data sharing. This should include concise plans for data management and sharing as part of research grant proposal or provide explicit reasons why data sharing is not possible or appropriate."
- "the BBSRC 'Safeguarding Good Scientific Practice' document states that it expects primary data to be securely held for a period of ten years after completion of a research project". (See also: BBSRC Statement on Safeguarding Good Scientific Practice [5].)
Our work in the ADMIRAL Project is part of much wider e-research activities to improve the management and status of research datasets as bona fide academic outputs, activities shared by many other JISC and non-JISC projects, including the DataCite, Open Citations, Dryad-UK, myExperiment and Wf4Ever projects with which we are personally involved. The broad case for sharing research data, and the impediments to realizing the benefits of such sharing, were documented in some detail in a recent Nature article, Empty archives, by Bryn Nelson [3], which noted the following particular problems among others:
- Finding and organizing data: "When the time came, scientists couldn't find their data, or didn't understand how to use the archive, or lamented that they just didn't have any more hours left in the day to spend on this business"
- Fear of abuse: "But in practice those advantages often fail to outweigh researchers' concerns. What will keep work from being scooped, poached or misused? What rights will the scientists have to relinquish?"
- Effort required to present data: "Where will they get the hours and money to find and format everything?"
- Uncoordinated data storage: "All too many observations lie isolated and forgotten on personal hard drives and CDs, trapped by technical, legal and cultural barriers"
- Interpretation of shared data: "How data should be shared is also a substantial problem. A prime example is the issue of data standards: the conventions that spell out exactly how the digital information is formatted, and exactly how the contextual information (metadata) is listed."
- Citing shared data: "Another issue facing journals and data banks is how to ensure proper citations for data sets."
A common theme of these particular problems is that they illustrate the challenges of data preservation and sharing within researchers' existing work flows. We and others have noted that researchers are unlikely to adopt new practices unless they are (a) compatible with their current practices, (b) easy to implement, and (c) provide some immediate benefit in their day-to-day work.
Benefits from project
While the small scale of the ADMIRAL Project is such that it can't hope to meet all of the indicated data sharing challenges on its own, it does address several of the issues described in the Empty archives article:
- Finding and organizing data:
- ADMIRAL aims to make acquisition of data at source as easy as possible for the researchers ("sheer curation" [2]). We will show progress on this front by having real research datasets, from small research groups for whom the obstacles described are particularly acute, routinely saved to and securely backed up in our local ADMIRAL file system, and by having selected datasets, deemed worthy of publication by the researchers who generated the data, lodged in our Oxford Databank institutional repository for long-term archiving and Web publication, after an optional embargo period. Further, application of the notion of "curation by addition" [9] allows researchers to lodge intermediate results in the ADMIRAL filestore, and migrate these to the Oxford Databank, progressively improving them through their day-to-day work to the point of being ready for open publication.
- We do not yet have hard evidence of this taking place, but one researcher from the Silk Group indicated that he would like to have all his datasets submitted to the Databank repository. (This has not yet happened because, within a few weeks of starting to use the ADMIRAL system, he had completely filled up his allocated storage capacity, and then stopped using the system without telling us! We are now working with departmental IT support staff to link ADMIRAL to an expandable Departmental networked storage facility, so that researchers can expand their available storage capacity on a cost recovery basis [6]. Very soon after this has been accomplished, we anticipate further file submissions to ADMIRAL and subsequent deposition of datasets from ADMIRAL to Databank.)
- Fear of abuse:
- ADMIRAL aims to address this by allowing researchers to retain control over exactly what is published when. Each of the local ADMIRAL instances, one for each research group, is entirely private to that group. User-selected datasets can then be submitted from there to the Oxford Databank for preservation and publication, with the researchers maintaining personal control, enabling them to specify embargo periods prior to data publication, so that they have time to publish their own results based on the data as research articles in leading journals, without being scooped.
- The general theme here is that we retain researchers' trust by not wresting control of their data away from them. We will show this by having research data sets approved for publication, with or without an embargo specified. However, it is hard to expect this to be demonstrated within the short time-frame of the ADMIRAL project, except for a few exemplar datasets. The previous example of the Silk Group researcher is an indication of his willingness in principle to share specified datasets in due course - even for data that is currently commercially sensitive.
- Effort required to present data:
- ADMIRAL aims to make progress in this direction by offering tools that give useful views over the data and simultaneously automate acquisition of the information that makes such views possible (an aspect of "curation by addition" [9]). Progress towards this goal will be demonstrated by having examples of additional metadata lodged with datasets that have been collected as the researchers analyse and organize their data to support journal publications. This is a large topic, and any progress in this area cannot be more than indicative.
- Uncoordinated data storage:
- ADMIRAL does not directly address legal or cultural barriers, but progress on the technical barriers is being demonstrated through the use of common, shared storage mechanisms and systems.
- Interpretation of shared data:
- ADMIRAL does not address the issue of standards for primary data - indeed, we believe that researchers should be free to use formats that best serve their research needs. But it does use standard formats and vocabularies for the descriptive metadata that accompany the datasets (currently we are using RDF/XML, Dublin Core and OAI-ORE). The benefits of this will be demonstrated through the deposited datasets having associated metadata that allows a single query to retrieve information about multiple deposited datasets.
- The ADMIRAL metadata format explicitly allows RDF metadata employing new vocabulary/schema standards to be adopted as needed - this is important, especially for research data, as the appropriate standards are not always clear at the outset of a research programme. In our past work with Drosophila genomic researchers, the appropriate metadata standards for published data became apparent only after much of the original data had been collected and initially annotated. We are currently in negotiation with the library services to allow customized metadata indexes to be configured in the Databank service on a per-silo basis, allowing queries over domain-specific metadata (e.g. species name or gene identifier).
- Citing shared data:
- ADMIRAL addresses this by working with the British Library to provide for DOIs to be assigned to published datasets, through its engagement with the DataCite project. This will be demonstrated by having datasets published with assigned DOIs that can be cited within research papers and other datasets. Even better would be to have examples of citations of such Oxford Databank datasets, but that is unlikely to happen within the time frame of the current JISC ADMIRAL project.
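The point made under "Interpretation of shared data", that common metadata vocabularies allow a single query to span several deposited datasets, can be illustrated with a small Python sketch. It is not the Databank query interface: the dataset identifiers, the sample manifests and the `find_datasets` function are invented for illustration, and the sketch only reads Dublin Core terms written as element content in RDF/XML. A real deployment would more plausibly load the metadata into an RDF store and query it there.

```python
import xml.etree.ElementTree as ET

DCTERMS = "http://purl.org/dc/terms/"

# Made-up manifests standing in for the metadata deposited with three datasets.
TEMPLATE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dcterms="http://purl.org/dc/terms/">
  <rdf:Description rdf:about="">
    <dcterms:title>{title}</dcterms:title>
    <dcterms:subject>{subject}</dcterms:subject>
  </rdf:Description>
</rdf:RDF>"""

MANIFESTS = {
    "silk-001": TEMPLATE.format(title="Tensile tests", subject="spider silk"),
    "silk-002": TEMPLATE.format(title="Spinning conditions", subject="spider silk"),
    "elephant-001": TEMPLATE.format(title="Collar tracks", subject="elephant movement"),
}


def find_datasets(manifests, term, value):
    """Return the ids of datasets whose metadata has dcterms:<term> equal to value."""
    hits = []
    for dataset_id, rdf_xml in manifests.items():
        root = ET.fromstring(rdf_xml)
        # Because every manifest uses the same vocabulary, one lookup fits all.
        for element in root.iter(f"{{{DCTERMS}}}{term}"):
            if (element.text or "").strip() == value:
                hits.append(dataset_id)
                break
    return sorted(hits)


print(find_datasets(MANIFESTS, "subject", "spider silk"))
```

The same function works unchanged for any other Dublin Core term, which is the practical benefit of agreeing on the descriptive vocabulary rather than on the primary data format.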
Summary and key points
ADMIRAL aims to promote data sharing by reducing the barriers to sharing faced by researchers. Our starting point is a shared file system that researchers can use immediately with little or no introduction, and which provides an immediate benefit of daily data backup. This addresses one of the most widely recognized data management problems - data security - that researchers identified at the outset of the ADMIRAL project.
The key demonstrable outcome from the ADMIRAL Project will be a system that researchers are able and willing to use on a routine basis, which closes the loop between researchers' experimental data and data repositories. The most compelling evidence for the benefits of this approach, unlikely to be seen within the lifetime of this JISC project, will be concrete examples of data that has been published by one researcher and subsequently used and cited by another.
Overlaying the file system with a Web access layer (HTTP and WebDAV) allows us to provide additional services that can be used to "improve" the data in small ways. The first such service is an easy-to-use service that allows selected datasets to be lodged with the Oxford Databank, our university's data repository service, from where the datasets can be published for wider, long-term access. (There is as yet an unresolved issue of who foots the bill for providing such access in the long term, but having a system with demonstrable utility prepares the ground for such a debate.)
There is an important aspect of data re-use currently being addressed by a wider community of researchers that is concerned with the exchange of data as "Research Objects" between e-Research supporting systems [7]. Our ADMIRAL experiences are part of these discussions [8], with the package format used for ADMIRAL repository submission (ZIP, RDF, OAI-ORE and DC, loosely based on BagIt) meeting many of the minimum requirements that have been proposed for research objects. This is an indication that our ADMIRAL work can play a continuing part in a wider landscape of research data management and sharing.
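To make the package format mentioned above more concrete, here is a minimal Python sketch of a ZIP file that carries its own RDF/XML manifest, using Dublin Core terms for description and the OAI-ORE `aggregates` relation to list the packaged files. It is an illustration under stated assumptions, not the actual ADMIRAL package layout: the manifest file name `manifest.rdf`, the function names and the sample content are hypothetical, and none of the BagIt conventions (such as checksum manifests) are reproduced.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DCTERMS = "http://purl.org/dc/terms/"
ORE = "http://www.openarchives.org/ore/terms/"


def build_manifest(title, creator, file_names):
    """Return RDF/XML bytes describing the package and the files it aggregates."""
    ET.register_namespace("rdf", RDF)
    ET.register_namespace("dcterms", DCTERMS)
    ET.register_namespace("ore", ORE)
    root = ET.Element(f"{{{RDF}}}RDF")
    # An empty rdf:about refers to the package itself (an assumption of this sketch).
    description = ET.SubElement(root, f"{{{RDF}}}Description", {f"{{{RDF}}}about": ""})
    ET.SubElement(description, f"{{{DCTERMS}}}title").text = title
    ET.SubElement(description, f"{{{DCTERMS}}}creator").text = creator
    for name in sorted(file_names):
        ET.SubElement(description, f"{{{ORE}}}aggregates", {f"{{{RDF}}}resource": name})
    return ET.tostring(root, encoding="utf-8")


def build_package(files, title, creator):
    """Zip the given {name: bytes} mapping together with a generated manifest."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, data in sorted(files.items()):
            archive.writestr(name, data)
        # Hypothetical manifest name; the real package layout may differ.
        archive.writestr("manifest.rdf", build_manifest(title, creator, files))
    return buffer.getvalue()


# Made-up dataset content, for illustration only.
package = build_package(
    {"data/run1.csv": b"a,b\n1,2\n", "README.txt": b"notes"},
    "Silk tensile tests",
    "A. Researcher",
)
with zipfile.ZipFile(io.BytesIO(package)) as archive:
    print(archive.namelist())
```

Keeping the manifest inside the archive means the description travels with the data, so a receiving system can read the package without any prior knowledge of the originating file system.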
Additionally, our existing method of deploying separate independent ADMIRAL instances, configured using VMware as virtual machines within a common physical server and storage environment, bodes well for the potential future deployment of ADMIRAL systems and services more widely in other environments, including cloud-based environments, for the benefit of the broader research community.
References and notes
1. ADMIRAL project: http://imageweb.zoo.ox.ac.uk/wiki/index.php/ADMIRAL
2. Sheer curation: http://en.wikipedia.org/wiki/Digital_curation#Sheer_curation
3. Data Sharing: Empty Archives, Bryn Nelson, Nature 461, 160-163 (2009), doi:10.1038/461160a: http://www.nature.com/news/2009/090909/full/461160a.html
4. BBSRC Data sharing policy: http://www.bbsrc.ac.uk/organisation/policies/position/policy/data-sharing-policy.aspx
5. BBSRC Statement on Safeguarding Good Scientific Practice: http://www.bbsrc.ac.uk/web/FILES/Policies/good_scientific_practice.pdf
6. Data Storage Costs: http://admiral-announce.blogspot.com/2011/01/data-storage-costs.html
7. Why Linked Data is Not Enough for Scientists, 2010, http://eprints.ecs.soton.ac.uk/21587/
8. ADMIRAL data packages: http://imageweb.zoo.ox.ac.uk/wiki/index.php/ADMIRAL_data_packages
9. Modelling and storing a phonetics database inside a store - introduces the notion of "curation by addition": http://oxfordrepo.blogspot.com/2008/10/modelling-and-storing-phonetics.html