FlyData/DataManagementRequirements


Towards a generic laboratory data management platform

Data management for the long tail of science research (this needs justification based on statistics of research funding for small vs large projects and groups).

This page collects user requirements and example activities that motivate designs for a generic web-based platform for laboratory data management in small research groups.


Overview

TODO: turn this condensed summary of benefits and requirements into proper English

Benefits

  • improve quality and usability of data from which research conclusions are drawn, by gathering and maintaining ancillary details alongside the raw data
  • provide access to tools that help researchers better understand and present the implications of their data
  • facilitate comparison and composition with other researchers' data
  • make data ready for publication or sharing with other researchers
  • facilitate publication of results by traditional and emerging means

Requirements:

  • accessible for small research groups
  • enhance existing practices
  • work with researchers' existing tools
  • researchers maintain full control over the form and use of their data
  • facilitate exploitation of the wealth of tools available on the web

Intended users

In a Nature commentary (Nature, Vol 410, April 2001, p1023), Tim Berners-Lee and Jim Hendler discuss sharing of research data and the "eternal conflict between working rapidly as a small group and taking the time communicate more widely". They further argue that "the world works as a spectrum between these extremes, with a tendency to start small [...] and filter over time to a wider commonality of concept". Our approach is intended to address just this: providing facilities to help small groups manage their data without imposing overbearing overhead, yet retaining enough structure to facilitate migration and linking to a wider web of data.

We are looking to address the data management requirements of small research groups, which, unlike many larger projects, do not have dedicated IT resources to look after data management concerns and infrastructure. To this end, we anticipate a web-based platform that can be run as an externally hosted service, or installed and used "out of the box" as a locally hosted service with a standard configuration that can, if necessary, be supported by a remote team.

The requirements discussed here have been suggested by our work with Helen White-Cooper's team of Drosophila gene function researchers, whom we regard as archetypal potential users for this system. Our work with Helen's team is used to illustrate how the proposed system might be of value to researchers, though our goal is to create a more widely applicable data management framework. Hence, these requirements will in due course need to be tested with other potential users.

[[[workflow diagram here - from poster - cite it]]]

Working with Helen's team, we have observed that the software tools they use on a day-to-day basis for acquisition, analysis and presentation of data are:

  • Microsoft Excel spreadsheet
  • Web browser (for accessing bioinformatics information sources)
  • Apple Macintosh desktop (file and data organization)
  • Microsoft Word and PowerPoint (results and data presentation)

TODO: review and confirm this list with Helen

We have also noted similar emphasis on use of spreadsheet software for experimental data capture in discussions with colleagues from environmental sciences (see NEBC–EBI Workshop notes).

We have also seen concern for data availability in another area of research, infectious disease epidemiology. At a June 2008 meeting held in Oxford to discuss improving data publication for infectious disease epidemiology research, Angela MacLean observed that improved linking would help to avoid "trawling through the literature in dusty libraries". She went on to say that for one particular study, it took 3-4 months of researcher effort just to find the papers for an HIV re-analysis of 21 original studies, and that this is a situation that needs to be improved. It was also claimed that data burying is wasting "UK plc" man-hours, and that funding bodies are therefore not getting best return on their investment - man-years of researcher effort are being wasted, and work is duplicated if existing work is not well-referenced.

Philosophy and principles

The approach to this project is directed by the following principles:

  • Addressing the "long tail" of science research - providing facilities to support data acquisition, management and publication for small research groups who do not have dedicated information technology and data handling support staff; also, facilitating eventual submission to institutional, national or other repositories for preservation. [[[justify long tail]]]
  • Providing a service that makes it easier for researchers to exploit the wealth of systems, services and supporting data that are available through the Web and Internet, rather than trying to provide all facilities within the service. For example, by providing a service for data hosting that uses simple web URIs for accessing data, the system can access data stored in different services that are similarly accessible on the Web.
  • Aim to create a system that augments current methods and is easy to use, but which also promotes (if not enforces) practices that avoid some of the more haphazard approaches to research data management that have been observed.
  • Simplify access to data by providing personalized entry points that can exploit a record of a user's recent activities to facilitate access to recently-used data sets, and maintain personalized bookmarks and annotations to other data and services. (For example, consider how systems like Flickr and Del.icio.us provide personalized access to data, along with a framework for exploiting personalized annotations.)
  • Deliver the advantages of linked data, by designing a system consisting of largely independent elements that can work together through Web links to supporting services and data. This in turn allows research teams to link their data to information stored in widely-referenced central databases, and to coordinate different datasets within a single research project (e.g., in the FlyTED work, combining Affymetrix microarray data, FlyBase gene information, real-time PCR observations and in-situ gene expression images with researcher annotations at all stages). External linkage should apply not only to data, but also annotations, processes, services, and more.
  • For longer-term sustainability, software should be open source, in anticipation that enough people will find it sufficiently useful to contribute resources to maintaining and developing the system. At the level of a particular research group, support-related concerns would be mitigated by having a common, open-source software platform, and integrating with existing components for which support is more readily available; research group data managers can then focus on their group's particular requirements.

User requirements by functional area

Acquisition

It must be easy for the system to acquire research group data, and its use should fit with existing data processing practices - cf. "sheer curation" [[[ref]]]. In this context, "research group data" includes not only the raw data sets (numbers, images, etc.), but also any additional descriptive material (preferably in a machine processable format) needed to interpret the data. (It is the inclusion of machine processable descriptions that makes this a semantic data publishing platform.)

For example: Helen White-Cooper's research group perform initial data collection using Excel spreadsheets, a tool with which they are familiar, and which is well-suited to this task. The data management system must easily ingest these data, and allow immediate benefits in other functional areas from so doing, yet retain the option to perform further data processing using familiar spreadsheet tools. An example of an immediate additional benefit might be the ability to display locally obtained microarray data for the same gene in different tissue types obtained from FlyAtlas.

Multi-file datasets

The above example dataset consists of just one file. More complex data ingest problems arise when a data set consists of multiple files. An example is the acquisition of in-situ gene expression images, along with researchers' annotations describing the gene expression patterns shown by the images. Helen White-Cooper's research group use Excel spreadsheets containing links to local image files alongside image annotations.

We would like to be able to perform all data acquisition using just a web browser, but for complex multi-file datasets it may be desirable to provide a special tool - unlike a common web browser, any such tool is likely to be somewhat tailored to the data being collected.

Data types

In the first instance, we propose to focus on ingest of tabular (spreadsheet) data and images, but many other kinds of data might be candidates for further consideration; e.g. network structures, audio, video, multidimensional microscopy images. But bear in mind that existing systems may be better suited for acquiring such data (e.g. the Open Microscopy Environment (OME) [[[ref]]] for multidimensional images), and that the laboratory data management system should permit annotations of and access to data stored elsewhere.

Retrieval

Having collected data into the data management system, it must of course be possible to get it back again.

Software tools commonly used by research biologists are spreadsheet programs (Excel, etc., noted above) and web browsers (for accessing online bioinformatics databases). To work smoothly alongside existing practices, using the same tools to retrieve and review data is highly desirable. Thus, we want data to be retrievable using standard web protocols as supported by browsers and other common tools.

On the Web, retrieving information requires use of the appropriate URI - one needs to know a URI that references the desired information. When storing data, each element must be allocated a URI that can be used to subsequently retrieve that data. But allocating URIs for datasets is only part of this requirement. It is also important that researchers can browse their data without a priori knowledge of what URIs are available for accessing datasets, and search for datasets using a range of criteria.

For example, when a file of microarray results has been uploaded, the system should automatically record what it knows about the upload (the date, type of data, who uploaded it), and also allow the submitter to indicate additional information (e.g. details of the strain of organism used to obtain the microarray data) and store that alongside a record of the dataset's URI. A user should then be able to retrieve a list of datasets they have uploaded, or search for them by a variety of annotation criteria.
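As a rough illustration of the upload record described above, the following Python sketch stores the automatically captured details next to submitter-supplied annotations, and supports listing by uploader and searching by annotation. The names DatasetRecord and Catalogue, the example.org URI scheme and the sample annotations are all hypothetical; this is a minimal in-memory sketch, not a proposed design.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class DatasetRecord:
    uri: str                    # URI allocated to the dataset (hypothetical scheme)
    data_type: str              # e.g. "microarray"
    uploaded_by: str            # taken from the identified user
    uploaded_at: datetime       # captured automatically at registration
    annotations: dict = field(default_factory=dict)  # supplied by the submitter


class Catalogue:
    """In-memory stand-in for the system's record of uploaded datasets."""

    def __init__(self, base_uri):
        self.base_uri = base_uri.rstrip("/")
        self.records = []

    def register(self, name, data_type, user, **annotations):
        record = DatasetRecord(
            uri=f"{self.base_uri}/{name}",
            data_type=data_type,
            uploaded_by=user,
            uploaded_at=datetime.now(timezone.utc),
            annotations=dict(annotations),
        )
        self.records.append(record)
        return record

    def uploaded_by(self, user):
        # "Retrieve a list of datasets they have uploaded"
        return [r for r in self.records if r.uploaded_by == user]

    def search(self, **criteria):
        # Match records whose annotations contain every given key/value pair
        return [
            r for r in self.records
            if all(r.annotations.get(k) == v for k, v in criteria.items())
        ]


catalogue = Catalogue("http://example.org/datasets")
catalogue.register("microarray-01", "microarray", "alice", strain="wild type")
catalogue.register("microarray-02", "microarray", "bob", strain="mutant-a")
```

A real system would persist these records and would probably express the annotations in a machine processable format rather than a plain dictionary.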

(Role for faceted browse here?)

Retrieving elements within datasets

For large datasets, it may be desirable to access parts of the data individually. For example, Affymetrix microarray data may contain results for about 15,000 gene probes. A researcher may subsequently wish to access data for a specific probe. While it is possible to retrieve all the data and then pick out the required element, it may be faster and more convenient to have a separate URI to retrieve data individually for each microarray probe.

A more compelling example concerns the capture of multiple in situ gene expression images and their annotations: the ingested data - a collection of images and annotations - may be treated as a single dataset. But for comparison and review of the observations, applications used by researchers almost certainly want to be able to access the images individually, and separately from their annotations. Not being able to do this would greatly complicate the job of any software used for reviewing the data. Thus, each individual image, and possibly each individual annotation, should be retrievable using an individual URI.

When URIs are allocated to access parts of the dataset, they should be recorded in a way that is subsequently browseable and searchable by researchers.
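One simple way to allocate element URIs, sketched here in Python, is to append a row's natural key to the dataset URI. The URI layout, the element_uris helper, the second probe identifier and the signal values are assumptions made for illustration only.

```python
from urllib.parse import quote


def element_uris(dataset_uri, rows, key="probe_id"):
    """Map each row's key value to a URI nested under the dataset URI."""
    base = dataset_uri.rstrip("/")
    # quote() percent-encodes any characters that are not safe in a path segment
    return {row[key]: f"{base}/{quote(str(row[key]), safe='')}" for row in rows}


# Placeholder rows; the signal values are invented
rows = [
    {"probe_id": "149083_at", "signal": 812.5},
    {"probe_id": "141200_at", "signal": 97.0},
]
index = element_uris("http://example.org/datasets/microarray-01", rows)
```

The resulting mapping is the kind of record that could then be exposed through the browse and search mechanisms.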

Version identification

In general, acquired data will be subject to revision, and a data management system must be able to deal with multiple revisions of a dataset, and present them in a useful fashion. For example, when referencing a dataset it will generally be desirable to access its current version, but where specific results have been obtained through data analysis or interpretation, it may be necessary to refer to a specific version of the dataset to support repeatability of the analysis.

For example, the annotation of in situ images by Helen White-Cooper's research team has undergone several revisions to progressively align the recorded annotations with developmental terminology from existing ontologies, and has in turn been used to propose new or revised ontology terms. Thus, at various stages of the project, the local annotation data and the reference ontology used by FlyBase have undergone revisions. The results of a search for annotation terms against FlyBase or some other bioinformatic database will depend on the particular versions of local and external data against which the search is performed.

Thus, the system must be able to record the fact that some dataset is a revision of some other dataset, and maintain this revision information. It should also be able to present information about datasets that is not over-burdened with revision information, while providing access to older revisions when needed.

When dealing with image annotations, each revision should be noted as such when it is submitted to the system (e.g. by a user interface that provides options to upload a new dataset or to upload a revision to an existing dataset). Subsequently, the system may preferentially display links to just the latest versions of each dataset, but provide options through its browse and search mechanisms to locate older versions. (The web interfaces of software version management systems, provided by services such as SourceForge, show how this might be achieved.)
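The revision behaviour described above can be sketched with a single "is a revision of" link per dataset. The RevisionStore class and the dataset identifiers below are hypothetical, and the sketch assumes each revision replaces exactly one earlier dataset.

```python
class RevisionStore:
    """Records which dataset each submission revises (None for a new dataset)."""

    def __init__(self):
        self.revision_of = {}  # dataset URI -> URI of the dataset it revises

    def add(self, uri, revises=None):
        if uri in self.revision_of:
            raise ValueError(f"{uri} is already registered")
        if revises is not None and revises not in self.revision_of:
            raise KeyError(f"unknown dataset {revises}")
        self.revision_of[uri] = revises

    def latest(self):
        # Datasets that no other dataset revises, in submission order
        superseded = {old for old in self.revision_of.values() if old is not None}
        return [uri for uri in self.revision_of if uri not in superseded]

    def history(self, uri):
        # Walk back from a dataset through its earlier revisions, newest first
        chain = []
        while uri is not None:
            chain.append(uri)
            uri = self.revision_of[uri]
        return chain


store = RevisionStore()
store.add("annotations/v1")
store.add("annotations/v2", revises="annotations/v1")
store.add("microarray/v1")
```

Here latest() gives the uncluttered default view, while history() provides access to older revisions when a specific version must be cited.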

Revision of elements within a dataset

We note that combining dataset element access with revision tracking can raise some tricky choices. If a revision involves adding a new row to a table, does it make sense to ask about older revisions of a specific row? It might depend on how the row is identified. In the case of Affymetrix data, each row is distinguished by a unique probe identifier, and this could reasonably be used for meaningfully accessing a particular row of data across revisions. But sample data used to create a scatter plot may not present a useful way to track changes to a particular datum.

These requirements currently take no view about how these issues should be resolved: revisions should be tracked and discoverable at the level of entire datasets, but whether to do so at the level of individual data elements is subject to individual circumstances and implementation considerations.
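Where rows do carry a stable identifier, as in the Affymetrix case, comparing two revisions element by element is straightforward. The following sketch assumes each revision is available as a list of dictionaries sharing a key column; the diff_by_key helper and the sample rows are hypothetical.

```python
def diff_by_key(old_rows, new_rows, key):
    """Compare two revisions of a table whose rows share a stable key column."""
    old = {row[key]: row for row in old_rows}
    new = {row[key]: row for row in new_rows}
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }


# Placeholder rows with invented values
revision_1 = [{"probe_id": "p1", "signal": 1.0}, {"probe_id": "p2", "signal": 2.0}]
revision_2 = [{"probe_id": "p1", "signal": 1.5}, {"probe_id": "p3", "signal": 3.0}]
changes = diff_by_key(revision_1, revision_2, "probe_id")
```

No such comparison is possible for the scatter plot case, where rows have no identity beyond their position, which is one reason the choice is left to individual circumstances.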

(see also Jun/Al's paper? I'm not sure how it is relevant)

Sharing, access control and personalization

Having acquired data, it should be visible by other research group members, and equally accessible from different computers.

Capture and initial annotation of in situ gene expression images was performed using a computer situated beside the microscope and connected to the microscope camera; further annotation and review may be performed using researchers' personal computers, located in separate offices. In this environment, access to data depends on local network file sharing solutions, and the data would not generally be accessible from outside the laboratory network (for example, from a researcher's home computer, or by a collaborator at another institution).

Using the web to collect and retrieve data means that data sharing capabilities come "for free", but some mechanisms will be needed to manage those with whom the data is shared.

The practice in Helen White-Cooper's research group, based on local network file sharing, has been to use a single network logon for all research group members, which results in the loss of provenance information and is somewhat contrary to what are considered normal information security practices. This approach is viable for a small, intimate research group where data is not widely accessible. In a larger group a more formal mechanism for controlling such sharing is highly desirable.

We assume that research groups wish to restrict access to their data until such time as they are satisfied that it is fit for wider circulation. Sharing within a research group using open networking technology thus requires identification of users, so that research group members can be recognized and permitted access to the data.

Identification of users should support more than just access control: for each user, there will be datasets that are currently in use, and these should be readily accessible whenever they connect to the system. User identification is also an essential piece of provenance information about a data item and/or its annotations.

For example, after a series of in situ images has been acquired using a microscope-connected computer in a laboratory area, the researcher may then move to their own personal computer to perform review and further annotation. On identifying themselves to the data management system, the researcher should immediately be presented with a list of the recently captured images. This user interface model is similar to that provided by widely used systems such as Flickr (http://www.flickr.com) and Delicious (http://del.icio.us).

Note that, for the purpose of discussing user requirements, we distinguish here between shared access within the research group and wider publication and sharing, which is discussed later, even though some of the same mechanisms may be involved.

Provenance

In gathering research data, we also wish to capture information about its provenance: how, when and by whom it was created; when and by whom it was submitted to the system; and the filtering and transformation to which it has been subjected. This is particularly germane for research data, as these are factors that may affect the validity or acceptance of any conclusions drawn from the data.

Provenance information may be a useful aid to interpretation of data. For example, a researcher reviewing data may not understand why some annotation has been applied; if the person making the annotation is known, they can be asked for further explanation.

The data management system should facilitate the collection of such information, where possible capturing it automatically.

For example, the date and time of submission of any data or annotations are easily captured, and the person submitting new information should be captured by way of the identified user. Discoverable metadata such as EXIF information in JPEG images may be made visible as additional annotations of the image data.

Where the system has knowledge of processes applied to the original data (e.g. see Processing below), this information should be recorded.
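A minimal sketch of such a provenance record is given below. It assumes a simple in-memory log keyed by dataset name; the Provenance fields, the record and origins helpers and the dataset names are hypothetical, chosen only to show automatic capture of submitter and time together with an optional process description.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional, Tuple


@dataclass(frozen=True)
class Provenance:
    dataset: str
    submitted_by: str                      # from the identified user
    submitted_at: datetime                 # captured automatically
    process: Optional[str] = None          # description of processing applied, if known
    derived_from: Tuple[str, ...] = ()     # source datasets, if any


def record(log, dataset, user, process=None, derived_from=()):
    """Add a provenance entry, stamping it with the current UTC time."""
    entry = Provenance(dataset, user, datetime.now(timezone.utc),
                       process, tuple(derived_from))
    log[dataset] = entry
    return entry


def origins(log, dataset):
    """Return the datasets with no recorded source from which `dataset` derives."""
    entry = log[dataset]
    if not entry.derived_from:
        return {dataset}
    found = set()
    for source in entry.derived_from:
        found |= origins(log, source)
    return found


log = {}
record(log, "raw-microarray", "alice")
record(log, "filtered-genes", "bob",
       process="statistical filter", derived_from=["raw-microarray"])
```

With entries of this kind, a reviewer can trace a derived dataset back to the original observations and see who to ask about each step.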

Working with and linking to externally stored data

In addition to acquiring datasets within the data management system, it must be possible to link to external web-accessible data sets. Use of remote data sets should be as nearly as possible identical to locally collected datasets: instead of uploading a new dataset, a URI is provided indicating where the data can be found. Additional information and annotations are supported as for local data. The version identification mechanisms should be applicable to remote data. The system may provide an option to make a local copy of the data as a hedge against failure of the original web site.
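The following sketch shows one way a remote dataset might be referenced by URI while keeping an optional local copy as a fallback. The RemoteDataset class is hypothetical, and the fetch function is passed in as a parameter so that the example does not depend on any particular HTTP library or live web site.

```python
class RemoteDataset:
    """A dataset held elsewhere on the web, referenced by its URI."""

    def __init__(self, uri, fetch, keep_copy=True):
        self.uri = uri
        self._fetch = fetch          # callable taking a URI and returning bytes
        self._keep_copy = keep_copy
        self._copy = None            # last successfully retrieved content

    def read(self):
        try:
            data = self._fetch(self.uri)
        except OSError:
            # Original site unavailable: fall back to the local copy if we have one
            if self._copy is None:
                raise
            return self._copy
        if self._keep_copy:
            self._copy = data
        return data


# Stand-in fetcher that succeeds once and then fails, to exercise the fallback
attempts = {"count": 0}


def flaky_fetch(uri):
    attempts["count"] += 1
    if attempts["count"] > 1:
        raise OSError("site unavailable")
    return b"probe,signal\n"


dataset = RemoteDataset("http://example.org/remote.csv", flaky_fetch)
first = dataset.read()
second = dataset.read()  # served from the local copy
```

In practice the fetcher could be something like `lambda uri: urllib.request.urlopen(uri).read()` from the Python standard library, whose URLError is a subclass of OSError.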

For example, our local research group wishes to compare locally gathered testis gene expression data from various Drosophila strains with multi-tissue data for the same genes from an external source, FlyAtlas (http://www.flyatlas.org). In addition to a web site for human browsing, FlyAtlas provides access to its underlying dataset via the web, and it is this that can be combined with local data for easy comparison.

Work our group has done on a PLoS infectious disease paper is yielding some compelling examples of the value of being able to overlay geographical images - comparing topography, development, disease incidence, etc.

Presentation and paper writing support

Having acquired data and provided mechanisms to retrieve it, mechanisms are required to create useful displays of the data. This is a very open-ended requirement, and one that will never be fully, or even substantially, satisfied by a single system. The data management platform must be capable of using external web-based services to create data displays; the API used should be very simple, allowing such services to be implemented in a variety of ways, including simple CGI scripts. (See also Processing below).

Some examples from our exemplar project are:

  • image gallery displays of in situ images for a particular gene and Drosophila strain
  • image galleries showing in situ images of a particular gene in different strains
  • image galleries of in situ images of a particular gene in different strains along with researchers' annotations and comments about that gene
  • with real-time PCR data, graphical plots of PCR product levels against PCR cycle number provide additional information that serves as a cross-check on the microarray data

More useful and compelling displays (indicated to us by Helen White-Cooper's team) combine data from multiple sources:

  • local microarray data alongside data from FlyAtlas, allowing comparison of locally observed testis gene expression levels with those seen in other tissue types
  • image galleries of in situ images of a particular gene in different strains along with FlyBase summary information about the selected gene
  • FlyAtlas tissue-specific expression level histograms displayed on a 'gbrowse'-style gene map

These visualizations should be generated in forms that can usefully be captured for inclusion in a paper, along with metadata that can be used to link from a graphic in a paper back to the data upon which the graphic is based (see also Publication and dissemination).

Principles of analytical design

In creating data presentations, the following principles of analytical design from Beautiful Evidence by Edward Tufte [[[ref]]] may be useful to bear in mind:

  1. Show comparisons, contrasts, differences
  2. Show causality, mechanism, structure, explanation
  3. Multivariate analysis
  4. Integration of evidence (words, numbers, images, diagrams)
  5. Documentation (author, sources, scales)
    • What is it?
    • Who did it?
    • Who's that?
    • Where and when was it done?
    • What sources?
    • Assumptions?
    • Scales of measurement?
    • Who published (and printed) the work?
    • Who sponsored the work?
  6. Content matters most - quality, integrity, relevance

Processing

I have a note to "develop requirement to explore the original data for some published result" - what's required here?

In many cases, we find that data is not readily visualized without some additional processing. We may wish to convert raw observation data into a frequency distribution, or perform a BLAST search on PCR primer sequence data, and use the result to identify a particular gene whose product is tagged by the PCR product. We may also find that data from different sources is not comparable without some additional processing. For example, in examining data about age-related incidence of Leptospirosis, we have found that different studies group data into different age bands: with some processing the data could be converted into a directly comparable form.
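The age band conversion can be illustrated with a short sketch. It assumes that every source band falls entirely inside one of the common target bands; where bands straddle a boundary, some modelling assumption would be needed instead. The regroup function and all counts below are invented for illustration and are not taken from any Leptospirosis study.

```python
def regroup(counts, target_bands):
    """Sum counts from narrow age bands into wider common bands.

    counts maps (low, high) inclusive age ranges to case counts;
    target_bands is a list of (low, high) ranges to report against.
    """
    result = {band: 0 for band in target_bands}
    for (low, high), n in counts.items():
        for t_low, t_high in target_bands:
            if t_low <= low and high <= t_high:
                result[(t_low, t_high)] += n
                break
        else:
            # No target band fully contains this source band
            raise ValueError(f"band {low}-{high} does not fit inside any target band")
    return result


# Two hypothetical studies using different age bands (made-up counts)
study_a = {(0, 9): 4, (10, 19): 7, (20, 39): 12, (40, 59): 5}
study_b = {(0, 19): 10, (20, 29): 6, (30, 39): 8, (40, 59): 3}
common = [(0, 19), (20, 39), (40, 59)]

comparable_a = regroup(study_a, common)
comparable_b = regroup(study_b, common)
```

After regrouping, both studies report against the same three bands and can be compared directly.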

The data management platform must be capable of submitting datasets to arbitrary processing by external services, and recording the results along with a description of the process performed. The interface used for accessing processing services should be very simple, allowing such services to be implemented as simple CGI scripts, and should not preclude the use of workflow services (e.g. myGrid [[[ref]]], myExperiment [[[ref]]]) to coordinate complex processing.
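To suggest how little a processing service need involve, here is a sketch of a frequency distribution service of the kind that could sit behind a CGI script: CSV text in, CSV text out. The column name "pattern", the sample data and the respond helper are assumptions; the only CGI convention relied upon is that the response is written as headers, a blank line, then the body.

```python
import csv
import io
from collections import Counter


def frequency_distribution(csv_text, column):
    """Count occurrences of each value in one column of CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = Counter(row[column] for row in reader)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([column, "count"])
    for value, n in sorted(counts.items()):
        writer.writerow([value, n])
    return out.getvalue()


def respond(request_body, out, column="pattern"):
    """Write a CGI-style response: header lines, a blank line, then the result."""
    out.write("Content-Type: text/csv\r\n\r\n")
    out.write(frequency_distribution(request_body, column))


# Hypothetical annotation data submitted for processing
sample = "gene,pattern\na,apical\nb,none\nc,apical\n"
result = frequency_distribution(sample, "pattern")
```

A script deployed as CGI would call respond(sys.stdin.read(), sys.stdout); the platform would then store the returned table together with a description of the process that produced it.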

In the case of our Drosophila gene expression studies, the initial microarray screening of 6 Drosophila strains is subjected to statistical analysis which is used to narrow the range of genes considered for more detailed study and in situ imaging. The results of this analysis should be generated and captured within the data management environment so that it is available for later review, and to maintain the possibility of re-running the analysis on new microarray data, or comparing the outcome with alternative statistical approaches.

Coreference

A particular case of processing that may be required when comparing data from different sources is coreference determination. Different data sources may use different identifiers or keys to access what is (for some purpose) the same subject.

For example, our Drosophila gene researchers have needed to compare local gene expression data with similar data from different tissues provided by the FlyAtlas [[[ref]]] database. In this case, the local data has been obtained using an Affymetrix "Drosophila 1" microarray, where the FlyAtlas data has been obtained using an Affymetrix "Drosophila 2" microarray. For many genes, these microarrays use different probe sequences having different probe identifiers to detect the product expressed by a particular gene. The correspondence between the different microarray probes (which is approximate in many cases) can be obtained by reference to a separately published mapping table (or alternatively by reference to other bioinformatics databases). Finding this correspondence, preliminary to matching results for a particular gene, could be provided as an external processing service.
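The matching step can be sketched as a lookup through a mapping table. All probe identifiers and expression values below are placeholders rather than real Affymetrix identifiers, and the match_probes function is hypothetical. The mapping allows one local probe to correspond to several remote probes, reflecting the approximate correspondence noted above.

```python
def match_probes(local, remote, mapping):
    """Pair local and remote values whose probe identifiers are mapped together.

    local and remote map probe identifiers to expression values;
    mapping maps each local probe identifier to a list of remote identifiers.
    """
    matches = []
    for local_id, local_value in local.items():
        for remote_id in mapping.get(local_id, ()):
            if remote_id in remote:
                matches.append((local_id, remote_id, local_value, remote[remote_id]))
    return matches


# Placeholder identifiers and invented values
local = {"d1_probe_a": 5.2, "d1_probe_b": 1.1, "d1_probe_c": 0.4}
remote = {"d2_probe_x": 4.8, "d2_probe_y": 0.9, "d2_probe_z": 1.3}
mapping = {
    "d1_probe_a": ["d2_probe_x"],
    "d1_probe_b": ["d2_probe_y", "d2_probe_z"],  # approximate, one-to-many
}
matches = match_probes(local, remote, mapping)
```

Local probes with no entry in the mapping table (d1_probe_c here) are simply left unmatched; a service might report these separately so that researchers can resolve them by hand.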

Publication and dissemination

The data management platform should make it easy to gather for publication some or all of the data created or used by a research project, along with comparisons and presentations that researchers have drawn upon for inspiration and forming a project's conclusions. This means exporting raw data and selected visualizations of that data in formats that facilitate republication via institutional or national repositories (e.g. University repositories such as ORA, or national repositories such as [eCrystals]).

A principal motivation for development of this data management platform is to facilitate publication and sharing of data. Many commentators have observed that vast amounts of research funding are not fully exploited because the resulting data are lost once they have been used to provide the raw material for an academic paper or two [[[refs?]]]. Although we are aware that some researchers are wary of having their raw data be freely available, we see a general trend toward more open publication of research data, and against this background we see that a significant impediment is the effort needed to capture the data in a form suitable for publication and sharing.

We have created the FlyTED database (http://www.fly-ted.org), which contains in situ hybridization images of gene expression in testes for about 1000 genes and 6 different Drosophila strains (wild type and 5 sterile mutants), along with associated annotations and other metadata recorded by researchers as they created the images. Each image is the result of painstaking preparation, involving the dissection of the organ to be examined, staining with a specially manufactured "tag" that indicates presence of a selected gene product, transfer to a microscope slide, image capture and recording of the salient details. Capturing such images for about 1000 genes and 6 strains has occupied a team of 3-4 researchers for 2-3 years. In creating the FlyTED database, we have manually gathered the images and annotations from spreadsheet files, and written custom software to convert them for storage in an adapted ePrints database. Overall, this has been a laborious process, which might have been greatly eased by availability of tools that allow researchers to capture and collate the relevant data as the project proceeds, and then to make the organized data available for subsequent publication.

Discovery

Published data is of little use if it is not discoverable. The data management platform should expose key metadata in ways that are accessible by conventional web search engines (e.g. Google) and also by an emerging breed of semantic web search tools (e.g. Swoogle, Sindice).

For example, a researcher should be able to find our Drosophila in situ gene expression images using a Google search along the lines of "drosophila testis gene expression images". Currently, this search provides a link to our database server as its 9th entry, and this reference is to the methods page rather than to the front page of the database. This is not bad, but given the specificity of the search, we might do better.

A more detailed search would be for in situ images of a particular gene, such as "drosophila testis gene expression image schuy": this search currently yields a prominent link to the FlyTED "Browse by Gene Name" page, but not to the page showing an actual image of this gene's expression.

Also, as already mentioned, it is important to be able to perform searches within a data collection, to locate an item of interest. Thus, with FlyTED we can search directly for gene "schuy" and immediately see a page of gene expression images for this gene. But it could be useful also to be able to search using other key values, such as microarray probe identifier '149083_at', to retrieve similar results.
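Searching by more than one key can be sketched as a single index built over several fields. The build_index helper and the page path are hypothetical, and the pairing of gene "schuy" with probe identifier '149083_at' in the sample record is assumed here purely for illustration.

```python
def build_index(records, key_fields):
    """Index each record's page under every one of its key field values."""
    index = {}
    for record in records:
        for field_name in key_fields:
            value = record.get(field_name)
            if value:
                # Lower-case keys so that searches are case-insensitive
                index.setdefault(value.lower(), []).append(record["page"])
    return index


def search(index, term):
    return index.get(term.lower(), [])


# Hypothetical record; the gene/probe pairing is assumed for illustration
records = [{"gene": "schuy", "probe_id": "149083_at", "page": "/gene/schuy"}]
index = build_index(records, ["gene", "probe_id"])
```

With such an index, a search on either the gene name or the probe identifier leads to the same page of images.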


Peer review, trust, annotation

A concern commonly expressed about openly published data is a lack of quality control or peer review. How are secondary researchers using this data to know that it has been reliably observed and faithfully reported? Papers are not published in academic journals until they have passed peer review, which imposes a basic standard of credibility on academic publication. Publishing data (or any information) to the web does not require this basic standard of review.

Several solutions to this problem have been proposed; most commonly deployed on the web are "reputation systems", which essentially allow web users to rate each other. The most obvious examples of this are eBay and Amazon, where buyers and sellers submit "votes of confidence" in each other, which in turn are aggregated by the system. Another example is Digg (http://digg.com/), a system that allows users to rate other published content on the web. What these systems all provide is some mechanism whereby authenticated users can record approval or disapproval about other users or their published content. This might even be regarded as a kind of open peer review environment.

For our laboratory data management platform, the concerns of quality and review can be addressed at several levels:

  • clear documentation of the process and chain of reasoning from original observations to published data. The data management platform should be able to record derivative relationships between datasets, and references to information about how the resulting dataset was obtained.
  • reference to the database from peer reviewed academic papers. The data management platform should be able to record and display annotations that indicate reviewed papers that refer to the database.
  • reference to the database from specific peer-reviewed data publication journals. For example, the FlyTED database is described by a short paper submitted for publication in "Nucleic Acid Research (?????)".
  • hosting or sponsoring by a reputable institution and attributed to researchers with publication records. For example, our FlyTED database is hosted by Oxford University's Zoology department, and is associated with published papers by the projects' principal investigators. In due course, we hope this can be hosted by an Oxford University institutional repository. The data management platform should facilitate migration of data to such a repository.
  • publicly visible third party comments and analyses associated with published datasets.

The common strand here is inclusion of ancillary information to allow a reader to judge, directly or indirectly, the quality of a dataset. The data management platform should be able to capture and present authenticated annotations of datasets. The exact nature and form of these annotations cannot be prescribed, as these depend on the nature of the datasets.
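
To make the idea of derivative relationships and annotations more concrete, here is a minimal Python sketch. The class names, field names, and annotation kinds are assumptions invented for illustration, and real authentication of annotation authors is left out of scope; the sketch only refuses annotations that carry no author identity at all.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Annotation:
    author: str  # identity of the annotator (assumed already authenticated)
    kind: str    # free-form label, e.g. "comment" or "cited-by-paper"
    body: str


@dataclass
class Dataset:
    name: str
    derived_from: List["Dataset"] = field(default_factory=list)
    method: Optional[str] = None  # how this dataset was obtained
    annotations: List[Annotation] = field(default_factory=list)

    def annotate(self, author: str, kind: str, body: str) -> None:
        """Attach an annotation; anonymous annotations are rejected."""
        if not author:
            raise ValueError("annotations must carry an author identity")
        self.annotations.append(Annotation(author, kind, body))

    def lineage(self) -> List[str]:
        """Return the names of all ancestor datasets, nearest first."""
        seen, order, queue = set(), [], list(self.derived_from)
        while queue:
            parent = queue.pop(0)
            if parent.name in seen:
                continue
            seen.add(parent.name)
            order.append(parent.name)
            queue.extend(parent.derived_from)
        return order


# Hypothetical three-step chain from raw observations to published data.
raw = Dataset("raw-images")
processed = Dataset("processed-images", derived_from=[raw],
                    method="hypothetical processing step")
published = Dataset("published-set", derived_from=[processed],
                    method="hypothetical selection step")
published.annotate("reviewer@example.org", "comment", "Example third party comment.")
```

Walking the `derived_from` links gives a reader the chain from a published dataset back to the original observations, while the annotation kinds are deliberately left open because, as noted above, their form depends on the datasets concerned.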

The issue of trust in open data was discussed at a June 2008 meeting held in Oxford to discuss improving data publication for infectious disease epidemiology research. A number of views were expressed, including skepticism about the relevance of "trust networks". The importance of provenance information to establishing trust was noted by several attendees, including stable identifiers (like DOIs) for authors and commenters. It was suggested that publication of review comments or "semantic annotations" by journal publishers would give them more credibility, but would this mean the publishers had to review all such annotations? Finally, it was proposed that availability of raw data would improve trust by facilitating stronger peer review: a reviewer with access to the data and mathematical model used would be able to re-run an analysis.

Preservation

Another concern expressed about web publication is its longevity and the dependability of continued access. These are fundamentally organizational problems, and need to be addressed by institutions or coordinating bodies that have the will and resources to ensure continued accessibility. In Oxford, we see a move toward extending the institutional ePrints repository for papers to also include data. Our data management platform should facilitate the transfer and migration of data and metadata from local research group custody to an institutional or national repository for long term preservation.

Probably the main hurdle to achieving this is capturing appropriate data and metadata in the first place, so that when it comes to preserving it, possibly after many of the original researchers have moved on, the required information is all available.
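
One way to picture capturing data and metadata together, ready for later transfer, is a manifest that lists every file with its size and checksum alongside descriptive metadata. The Python sketch below is only an illustration under stated assumptions: the manifest layout and the metadata keys are invented for the example and do not correspond to the ingest format of any particular repository.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def build_manifest(data_dir, metadata):
    """Describe every file under data_dir with its size and SHA-256 checksum."""
    root = Path(data_dir)
    files = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        files.append({
            "path": path.relative_to(root).as_posix(),
            "bytes": path.stat().st_size,
            "sha256": hashlib.sha256(path.read_bytes()).hexdigest(),
        })
    return {"metadata": metadata, "files": files}


# Demonstration on a throwaway directory holding one small example file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "readme.txt").write_bytes(b"example data\n")
    manifest = build_manifest(
        tmp,
        # Hypothetical metadata keys, chosen for illustration only.
        {"title": "Example collection", "creator": "A. Researcher"},
    )

print(json.dumps(manifest, indent=2))
```

The checksums let a receiving repository confirm that files arrived intact, and keeping the descriptive metadata in the same document means it travels with the data even after the original researchers have moved on.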


The role of data management in research (OeRCD ODIT report)

In May 2008, Luis Martinez at Oxford University conducted a scoping study for the management of research data generated by Oxford researchers [1]. The study was intended to identify top requirements from researchers for services to help them manage their data more effectively, and also to form a basis for the Oxford case study for the UK Research Data Service feasibility study for a national shared data service.

The study is based substantially on interviews with Oxford researchers, and a large number of the documented findings are particularly appropriate to inform requirements to be addressed by a laboratory data management system. As such, the report represents a rich seam of independently documented researchers' concerns, many of which reinforce or augment the conclusions drawn above from our own experience of working with a small group of researchers:

General (2.2)

  • Overall, the vast majority of researchers interviewed thought that there are potential services that could help them manage their data more effectively

Top infrastructure issues (2.2.6)

  • A secure and user-friendly solution that allows storage of large volumes of data and sharing of these in a controlled way, with fine-grained access control mechanisms.
  • A sustainable infrastructure that allows publication and long-term preservation of research data for those disciplines not currently served by domain specific services such as the UK Data Archive, NERC Data Centres, European Bioinformatics Institute and others.

Collection (2.2.2)

  • The collection of data occurs in supercomputers generating huge simulations; in laboratories with sophisticated instruments such as microscopes or scanners producing large scale images; from field work in Social Science interviewing human subjects or archaeological excavations generating pictures, annotations and maps; from health professionals filling paper questionnaires when examining their patients; from research with manuscripts undertaken in libraries and museums worldwide; etc.
  • There is significant work done by researchers collecting data from published articles and books. This data collection involves manually inputting several datasets included in the publications into spreadsheets to then analyze them.
  • The collection of data can be highly expensive. A particular project claimed to spend around £500K collecting data that are unique and of high value to other researchers.
  • The reproducibility of the data varies too: some data cannot be collected again, or it may be too costly to do so. In other cases, like simulations, the algorithm and the input files are more important than the actual data generated by them.
  • A variety of types of data are collected. These can be nicely grouped using the Research Information Network data typology: data from observations that capture some measurement of a phenomenon in a location at a given point in time; data generated by computer simulations to test models; data from experiments in scientific laboratories using sophisticated equipment like microscopes or scanners; derived data resulting from manipulation and processing of primary data; and canonical or reference data like those relating to gene sequences or literary texts.
  • Research data produced by Oxford researchers comes in a variety of formats that include text (flat text files, MS Word, WordPerfect), numerical (MS Excel, MS Access, MySQL, SPSS, STATA), video, audio, images (jpeg, tiff). Some of these data come in proprietary formats that can only be used with specific software products. There is a wide range of sizes, from a few megabytes in disciplines like Humanities to terabytes generated by simulations in Medical Sciences and MPLS.
  • The usefulness of the data to other researchers varies depending on the discipline. Some of the data created in fast moving scientific disciplines may be of value for the next five years whilst data created in areas such as Social Sciences or Art and Humanities may be useful indefinitely. The raw data are not always useful to other researchers.
  • Secondary data is used by most of the researchers interviewed. These data are found through either well-known discipline specific data repositories or obtained on an informal basis through networks of contacts from conferences or similar events.

Processing and annotation (2.2.3)

  • Data are commonly stored on personal computers or departmental servers. The data produced as part of computing simulations, mostly in the divisions of MPLS and Medical Sciences, is enormous in size and researchers struggle to find secure alternatives. As one of the interviewees stated: "there is nowhere in Oxford where people running simulations can store their data".
  • The data tends to be so precious to researchers that almost all of them have back-up strategies and mostly use OUCS services for this purpose. Nonetheless, some research groups use CDs and DVDs to back up and archive their data:
    "... he is storing all of his data for the last fifteen years on DVDs and he has tens of terabytes. They didn’t think about it at the outset and they are not technical. There are horror stories around the university about how the data is stored"
  • Annotation of the data to record its provenance and content takes place mostly by including the information within the data, using hierarchical folder structure with file names and readme files in some cases. This information helps the researcher using the data while working with them but it would be hard for others to make sense of it. After some time even the researcher that created the data in the first place would find it difficult to reuse the data, as one of them stated: "in five years time I wouldn’t know what was going on".
  • In some big projects where data and data curation have played an important role, high quality metadata has been created to describe the data produced. Nonetheless, very few researchers are aware of existing standards to describe their data.
  • In research projects or units that generate sensitive data, policies and procedures are in place to store and access these data securely. Research groups that work with sensitive data have experience in anonymizing these.
  • Most researchers share the data they work with and this again happens in many ways. Some will use email if the files are not too big, they will upload it into a website and share the URL, copy it into portable media to then mail them, etc. The problem arises when the size of the data does not allow any of these methods or when the data is so sensitive that ethics approval is needed. During the interviews a situation came up where researchers had to copy the data from one storage device to another and physically take it to the collaborators.
  • Some of the tools for analyzing the data include R, SPSS, STATA, MATLAB, MaxQDA, user developed algorithms and open source visualization tools.


Publication, preservation and ethical concerns (2.2.4)

  • Although some researchers have used different national and international data centres to deposit their data (this is mostly the case of Social Science researchers depositing data at the Economic and Social Data Service) the vast majority of researchers interviewed had never published data in a domain specific data archive.
  • Researchers in Oxford are in many cases using departmental websites to publish their data but: at the end of the projects, who owns that data, who takes care of it, do we just destroy it?
  • Most of the participants in the interviews agreed that if data is produced with public money it should be made publicly available so that others can take advantage of it. They were mostly aware of how expensive data collection can be and how data reuse is the green way of doing things.
  • In most cases, researchers interviewed saw the usefulness of depositing the summary data embedded in published articles in a format that allows manipulation.
  • There are several reasons for researchers to not deposit their data but the main ones are that this would involve some extra work and that funding agencies do not require it. In addition to this, researchers can be very attached to their data and not very keen on sharing it openly: "Data feels quite personal because of the effort that it takes to collect it, I am happy to share with serious people who are going to do good work, not willing to share with everyone."
  • Publishing some types of data poses many challenges. An example is the ethical clearance needed for accessing some of the medical data produced. As one participant argued: "Who takes responsibility to deal with the ethical clearance?"
  • Some researchers expressed a need to get advice on their IP rights when publishing their data and specifically when the data is published for a fee to those in for-profit organizations.

Support (2.2.5)

  • Researchers felt that little or no support is provided to help them manage their data. The support available mostly comes from technical officers in the departments or members of OUCS and involves advice on options for storage and sharing and in some cases database design. In many cases, the support received depends on the contacts within the institution: "the support I get is because I happen to know people that are responsible of different services but there is nothing available, if I was new to Oxford I wouldn’t have a clue".
  • Few research groups have dedicated Data Managers or IT specialists who take responsibility for multiple aspects of data management like designing interview questionnaires, databases, data input tools and looking after the secure storage and access to the data.
  • When looking for help to manage their data researchers tend to go to technical staff rather than librarians as they see their problems as mainly technical. Nonetheless, in some cases researchers using secondary data would welcome assistance to locate and access data resources.
  • Certain research groups saw it as extremely important to have whatever support is provided for data management embedded in their group, so that the specific requirements of the researchers in the group can be understood. Not only that, the person responsible for managing their data will require relevant experience in their research area to be able to make any sense of them.

Related work

Data hosting services

  • Infochimps
  • SourceForge
  • Talis platform
  • Flickr

Data visualization services

  • Swivel, Many Eyes
  • Openfindings (Lisa)

Data hosting systems

  • OME
  • Fedora, ORA

Data handling workflow

  • Taverna (myGrid)
  • myExperiment

Annotation systems

  • Connotea
  • Delicious


--GrahamKlyne 09:24, 7 August 2008 (BST)
