ADMIRAL Technical Overview

ADMIRAL Technical Overview and Rationale

DRAFT FOR DISCUSSION AND COMMENT - 2011-06-21

High level goals

For many years, academic support services have been promoting the idea that research datasets should be published and openly accessible, to better support the research process through improved verifiability of results and re-use of hard-won experimental observations. Many early attempts to create data repositories have met with no more than modest success. The exceptions include fields in which the sharing of data is essential for meaningful progress to be made, such as pooling of gene sequence information for biological research into the function of genes. The successful data repositories are generally backed by high levels of centralized funding, and a general community expectation that certain kinds of data will be submitted to these repositories. But there remains a "long tail" of smaller scale research, often very labour-intensive, whose datasets rarely make it beyond the original researchers' hard disks. Part of the reason for this is that hard-pressed researchers in small groups without dedicated IT or data management support find the data deposit process too burdensome.

The ADMIRAL project aims to redress this problem through application of two key principles:

  • Sheer curation - making data management and deposit a seamless part of normal day-to-day research activity. This requires us to focus on extending the reach of existing tools that researchers choose to employ, without imposing a burden of new learning or significantly altered practices. To be effective, it must be possible to capture the research data via any system, without installation of special software. Researchers often use a variety of machines, many of which are associated with particular experimental equipment (microscopes, sample analysis, etc.), and we aim to make it trivially easy to collect data from all these sources. Collectively, researchers commonly use a range of Windows, MacOS and Linux systems upon which data may originate.
  • Curation by addition - don't require researchers to have everything in place all at once, or in any particular sequence. Rather, take what they've got and then allow for it to be improved until it's ready for publication. Research is often an incremental process, and the exact shape of its later stages may not be fully apparent at the outset. An incremental, unconstrained approach to data gathering permits users to focus on what's important to their research at any time. Much of the information required for publication is present at some stage - so we want to make sure it remains accessible up to the time that it is finally arranged and prepared for publication.

Overview design

ADMIRAL consists of two main parts: a locally managed data storage and staging facility (which we refer to as an ADMIRAL server), and a data preservation and publication platform created and managed by the Bodleian Library Service as part of the Oxford Research Archive initiative (referred to as the Databank server).

  • An ADMIRAL server is deployed on a per-research-group basis, and provides a shared file space for researchers to collect, prepare and share data sets, accessible in the normal way as a network-shared file system.
  • Databank provides a long-term repository and publication platform for selected research data.

ADMIRAL server

An ADMIRAL server presents a network-shared file system, accessed using common desktop system facilities, and usable in the same way as the system's local file system. By presenting a standard file system for data storage, no constraints (other than storage capacity) are placed on the kinds of existing data (e.g. spreadsheets, documents, images, videos, etc.) that can be stored.

The file system is overlaid with web access, which allows:

  1. selective sharing of data with colleagues who cannot conveniently use the normal file system mechanisms. This may be particularly useful for sharing with colleagues from outside institutions.
  2. web-based support services that can perform data handling tasks, such as visualization and analysis, to improve the value of the data. A particular such service implemented by an ADMIRAL server is selection and submission of data to the Databank repository.

It is also possible to access data in an ADMIRAL server by other means, such as SSH/SCP, SFTP, etc.

An ADMIRAL server is built from Ubuntu Linux, a freely available open source operating system. Ubuntu has all the basic capabilities needed, is widely used and well maintained. If necessary, additional features can be added in the form of new programs, but we try to avoid doing that as much as possible. Rich seams of additional applications are available and can be added to the basic platform, providing clear potential for long-term evolution of the ADMIRAL platform capabilities for diverse research groups.

File sharing is provided by Samba software, which implements the standard Windows file sharing protocols (SMB, also known as CIFS). These protocols are widely used and well supported, and client software is available on all of the target desktop systems we have considered.

Web access is provided by Apache web server software, which likewise is widely used, well supported, flexible and hardened for use on the open World Wide Web.


Databank server

Databank is built on an object store abstraction, and presents a REST style web API and a simple web interface. Most of the technical implementation details are hidden behind this interface. As such, Databank is a pure web application, with a focus on doing one thing well (i.e. data preservation and publication).

Data stores and silos. A Databank instance can support multiple silos. Each silo corresponds to a separate administrative domain, with its own set of authorized users. Within a silo, data package identifiers, which are simple name strings, are required to be unique. In due course, each silo may be configurable to the needs of a particular department, research group or domain.

The REST API has evolved through interaction with the ADMIRAL server development team, and supports the functions listed below. There is also a basic web interface for accessing these functions, but for the most part they are presented to ADMIRAL users via facilities on the ADMIRAL server.

  • Package handling: uploading, unpacking, publishing and updating composite data packages.
  • Package metadata: uploaded data packages may provide an RDF file of metadata. Databank itself generates a range of metadata about submitted packages, and the combined metadata is included in a data package manifest file.
  • Databank performs basic version management, both at the package level and at the individual file level. By default, just the most recent version of any data package or file is presented, but older versions may be accessed by adding a version query option to the URI.
  • Basic mechanisms for embargo management are provided. A package may be embargoed (blocked from access by non-owners), or published. At the present time, there is no framework in place for embargo policy management or policy-based embargo control.
  • Databank can request and assign DOIs for published datasets. This feature is not yet exposed at the user interface, and we do not yet have a policy framework in place for issuing DOIs. The issue of DOIs is related to but separate from the issue of dataset embargo.
  • A SOLR index of metadata terms is maintained to assist discovery. It is intended that the index may be configured on a per-silo basis.
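
The silo, naming and versioning rules described above can be illustrated with a small in-memory model. This is only a sketch of the behaviour as described in this document, not Databank's implementation or its REST API: the class names, method names and the choice of numbering versions from 1 are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Package:
    """Hypothetical model of a data package with a linear version history."""
    name: str
    versions: list = field(default_factory=list)  # each entry maps file name -> content
    embargoed: bool = True  # assumption: new packages start out embargoed

    def add_version(self, files):
        """Store a new version and return its number (1-based in this sketch)."""
        self.versions.append(dict(files))
        return len(self.versions)

    def get(self, version=None):
        """Return the most recent version unless an older one is requested."""
        if not self.versions:
            raise LookupError("package has no versions yet")
        if version is None:
            return self.versions[-1]
        if not 1 <= version <= len(self.versions):
            raise LookupError(f"no such version: {version}")
        return self.versions[version - 1]


@dataclass
class Silo:
    """Hypothetical model of a silo: one administrative domain with its own users."""
    name: str
    users: set = field(default_factory=set)
    packages: dict = field(default_factory=dict)

    def create_package(self, package_name):
        """Create a package, enforcing name uniqueness within this silo only."""
        if package_name in self.packages:
            raise ValueError(f"{package_name!r} already exists in silo {self.name!r}")
        package = Package(package_name)
        self.packages[package_name] = package
        return package
```

In this model two silos may each hold a package called "fly-images", but a second "fly-images" in the same silo is rejected, and `get()` returns the latest version unless a version number is supplied.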

Databank software installs into a standard Debian or Ubuntu operating system, and possibly other common Linux platforms.

ADMIRAL server base system and deployment

The base operating system used for the ADMIRAL server is Ubuntu 9.10, which was current at the time the project was started, but which has since been superseded. Currently, our preferred version would be 10.04 LTS, which has longer term security patch support, but we were unable to update as the VMBuilder tool used to build an ADMIRAL server instance was broken in version 10.04.

We use a separate virtual-machine-hosted system for each research group, for separation of administrative domains. The hosting environment we currently use is VMWare ESXi4, chosen on the advice of our local Zoology department IT support. Choosing a hosting environment that IT support are comfortable with was seen as an important step toward longer term sustainability of the ADMIRAL servers within the Zoology department.

We use the VMBuilder tool to substantially automate creation of ADMIRAL server system images. Many of the required packages are preinstalled in the image, but it turns out that several still need to be installed following the first system boot, for a variety of reasons. We now think VMBuilder was a poor choice, mainly because we are now stuck on Ubuntu version 9.10. Also, the initial system build is not as completely automated as we thought would be achieved; in particular, the initial OpenLDAP configuration is complex and error-prone. Subsequent discussions led us to the view that standard Debian packaging, with "pre-seeding" of configuration parameters, would be a better way to proceed.

We use separate file system volumes for the base system and user data. Ideally, we want to be able to swap the base system without disturbing user data or other user options, but this has proven tricky to get right. We need to ensure that a copy of all user-dependent information is on the data volume, but the working copy (e.g. user accounts, network details) needs to be present on the base system root file system. This means that some ad-hoc scripts are required to reinstate user configuration options on a new system.

ADMIRAL uses LVM for flexible storage expansion (but not relocation). Moving to a new base system has been tricky, as user identifiers need to be mapped and propagated to the new data volume. Conceivably, if we kept the LDAP database (or a dump thereof) on the user data volume, we could update the base system but preserve the ADMIRAL user identifiers.

When one research group asked for more storage than we could support on our VMWare hosting server, we worked with departmental IT support to move their data to a volume on a separate iSCSI-connected storage array. Initially, this worked well, but following a power failure in the department we had a series of problems that led to corruption of one of the virtual machine system volumes. We understand this problem was caused, in part, by contention for access to network connectivity, and the virtual machine management interface has been moved to a separate physical interface. We have also experienced problems where a virtual machine gets "stuck", and we are unable to recover it without re-booting the entire virtual hosting environment, with attendant disruption to the other virtual machines. As part of the Dataflow project, we intend to deploy a second virtual hosting environment very similar to the current ADMIRAL environment for testing purposes, but also to be configured in such a way that we can migrate virtual machines between the servers, thereby allowing us to perform maintenance if we encounter similar problems in future.

File access

The ADMIRAL server underlying file system is a standard Ext3 or Ext4 Linux file system, with ACL support. In principle, any file system could be used as long as Linux ACLs are supported.
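
As an illustration of the kind of ACL entry this relies on, the following sketch builds (but does not run) a setfacl command that grants one extra account access to a directory tree, together with a matching default ACL so that newly created files inherit the entry. The function name, the example path and the use of Ubuntu's default Apache account www-data are assumptions for illustration, not a record of the actual ADMIRAL scripts.

```python
def acl_grant_command(path, account, perms="rwX", recursive=True):
    """Build a setfacl argument list granting `account` access to `path`.

    The "d:" copy of the entry sets the default ACL, which is inherited
    by files and directories created later beneath `path`.
    """
    entry = f"u:{account}:{perms}"
    args = ["setfacl"]
    if recursive:
        args.append("-R")  # apply to the existing tree as well
    args += ["-m", f"{entry},d:{entry}", path]
    return args


# Example (hypothetical path); pass the list to subprocess.run() to apply it.
print(" ".join(acl_grant_command("/home/data/alice", "www-data")))
```

Returning the argument list rather than executing it keeps the sketch free of side effects and makes the generated command easy to inspect before it is applied.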

We have experienced some strange problems with ACL support, possibly as a result of interactions with Samba. It seems that file permissions are not always correctly generated from the ACLs for empty files. We also have experienced a problem when attempting to create a general collaboration area in which new files created via Samba were created *without owner read access* from the Linux host - we have not been able to resolve this problem.

As well as access via CIFS/Samba and HTTP/Apache, we have also found it useful to allow file access via SSH/SCP. This is especially useful for remote users for whom the CIFS protocol is not available or is inappropriate (being essentially a local area protocol). In such circumstances, WebDAV is a possible file access protocol, but our experience with WebDAV file system clients is that they generally do not perform well for remote connections. By comparison, SCP with an appropriate desktop client (e.g. WinSCP for Windows or Cyberduck for MacOS) performs well and has well-integrated, well-tested security.

ADMIRAL file server Web access overlay

As noted above, overlaying the shared file system with web access is a key feature of the ADMIRAL server, enabling a range of additional services to be used over ADMIRAL server data. This has been achieved using the Apache web server with mod_dav (to support HTTP reading and writing of files).

For this to work, the Apache web server must be able to read and write all of the user file areas. But it runs as a separate user, not as the authenticated user, so this level of access for the web server has to be permitted by the file system. Achieving this is not possible using the standard Unix user/group access control model, so we use Linux Access Control Lists to allow such access. Because the web server itself can access all file areas, it in turn must enforce the security policy for its clients, based on HTTP-authenticated user identifiers. This is achieved through script-generated Apache configuration files, one for each ADMIRAL user. While we have managed to get this to work, it is inflexible (being based on a specific file layout with default ACLs inherited from higher-level directories), and on occasion has been a little fragile.
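
A minimal sketch of what script-generated per-user configuration might look like is given below. It is not the actual ADMIRAL script: the data root path, the realm name and the function name are hypothetical, and the directives that select the authentication provider (LDAP in our case) are deliberately omitted, so the fragment is incomplete as a working Apache configuration.

```python
from string import Template

# One <Directory> block per user; only the named user may access the area.
USER_CONF = Template("""\
# Generated for ADMIRAL user $user (illustrative only)
<Directory "$data_root/$user">
    Dav On
    AuthType Basic
    AuthName "ADMIRAL"
    Require user $user
</Directory>
""")


def render_user_conf(user, data_root="/home/data"):
    """Render an Apache configuration fragment for a single user."""
    # Refuse names that could escape the data root or break the config syntax.
    if not user.isalnum():
        raise ValueError(f"unsafe user name: {user!r}")
    return USER_CONF.substitute(user=user, data_root=data_root)


print(render_user_conf("alice"))
```

Because the access rule is tied to a fixed directory layout, any change to that layout means regenerating every fragment, which is one source of the inflexibility noted above.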


Authentication, access control, security

Our main requirements here are:

  1. common authentication credentials for all modes of file access (CIFS, HTTP, SSH/SCP, local login, etc.)
  2. common access control settings for all modes of file access.

Our initial aim was to link with the University SSO for authentication, which is Kerberos-based, but that proved problematic for file system protocols. Generally, we needed Kerberos-aware desktop client software to properly authenticate via SSO for all access modes, and such clients were either unavailable or buggy in many cases. Our fallback was to use OpenLDAP, currently as a local instance on the ADMIRAL server. We were in any case expecting to use LDAP for access control, so it was a small step to use it for authentication as well. This means that users need new account names and passwords on the local ADMIRAL server, separate from their common University credentials. Setting up LDAP is quite complex, and it is hard to fully understand all the details.

It is worth noting that we were able to set up WebAuth easily enough for Kerberos-based authentication of web access via SSO, but that would mean using different credentials for HTTP access vs CIFS, SSH and local login access. We also tried Kerberos-enabled WebDAV clients on Linux, Windows and MacOS, but encountered problems with Apache Kerberos authentication, and the MacOS Kerberized WebDAV client crashed the system, so we gave up on this approach.

We also ended up with LDAP for user access controls (via group membership), with parallel configuration of file system ACLs and Apache authentication and access control, which is inflexible and fragile. Samba is configured to authenticate against LDAP, so special non-standard "smbldap-*" utilities are used for user registration and maintenance. All of this is handled via custom scripts, with additional data held on the user data file system to facilitate replication of users on another system. The user management provided in this way is very user-unfriendly - we are working on a web interface for the scripts, but would prefer to not require this. We would prefer to re-use available user management tools.

For browser and Web service API access we use standard HTTP authentication mechanisms (HTTP basic authentication, following a redirect of all requests to use HTTPS), which results in:

  • Confusing use of ADMIRAL and Databank credentials.
  • No way to provide user logout from web session (as the browser remembers credentials with no standard way to programmatically flush them).
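
The following sketch shows what HTTP basic authentication actually sends, which helps explain both points. The credentials are simply base64-encoded into a header that the client attaches to every request, so there is no server-side session to end, and the encoding is trivially reversible, which is why all requests are redirected to HTTPS. The function names are our own, and the example credentials are the well-known ones from the HTTP basic authentication specification.

```python
import base64


def basic_auth_header(username, password):
    """Return the Authorization header a client sends with each request."""
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return {"Authorization": f"Basic {token}"}


def decode_basic_auth(header_value):
    """Recover (username, password) from a Basic Authorization header value."""
    scheme, _, token = header_value.partition(" ")
    if scheme != "Basic":
        raise ValueError("not a Basic authorization header")
    username, _, password = base64.b64decode(token).decode("utf-8").partition(":")
    return username, password


header = basic_auth_header("Aladdin", "open sesame")
print(header)
print(decode_basic_auth(header["Authorization"]))
```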


Data package submission and viewing interface

A data package is submitted from a directory tree on the ADMIRAL server file store. The user selects the root directory for a package, and all sub-directories are assumed to be part of that package. Metadata that accompanies a data package submission is stored as an RDF file in the data package root directory on the ADMIRAL server, and supplies defaults for subsequent re-submissions of the same data package.

The data package submission and viewing interface uses a combination of Python CGI and Javascript to drive the web interface. We originally chose not to use a framework, for simplicity, but this meant we were dependent on standard HTTP authentication mechanisms, and required use of HTTP (reverse) proxying to support AJAX requests directed to Databank. Used together, these can create confusing authentication requests for users. Also, this means there is no way to log out from a session.

For new features we are moving to use of WebPy/mod-wsgi for server-side logic (currently for user admin interface). This is based on a WSGI version of the CherryPy dispatcher, which at one time was part of the Turbogears framework.

Code (HTML, CSS, Javascript, Python and shell scripts) is currently supplied from a Mercurial repository on each ADMIRAL server, to facilitate updating of support software without regenerating the whole system.

The ADMIRAL server provides a view on Databank contents, intended to be customizable to local needs. It is also possible to deploy local customization of ADMIRAL and Databank views through additional web pages and/or Javascript, which may be served from within an ADMIRAL file system or published data package, but to date we have not implemented any feature in this way.


Data packaging for databank submission

A data package is wrapped in a format based on BagIt for transmission to the Databank server. This is not a full BagIt implementation: it is essentially a ZIP file with an RDF metadata file in its root directory. We have encountered problems with empty data packages when using the Python ZIP library. The structure can be awkward or inflexible in some situations (a metadata subdirectory in the package root might be more flexible), but this has not yet been a problem for us.
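
As a rough sketch of this packaging step, the function below zips a directory tree and insists that a metadata file is present in the package root. It is not the ADMIRAL packaging code: the function name and the metadata file name manifest.rdf are assumptions for illustration. Requiring the metadata file is one possible way to ensure that an archive with no entries is never produced.

```python
import os
import zipfile


def build_package(root_dir, zip_path, metadata_name="manifest.rdf"):
    """Zip the tree under root_dir, with paths relative to the package root.

    zip_path should lie outside root_dir, otherwise the archive would be
    picked up while it is being written.
    """
    if not os.path.isfile(os.path.join(root_dir, metadata_name)):
        raise FileNotFoundError(f"{metadata_name} missing from package root")
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for dirpath, dirnames, filenames in os.walk(root_dir):
            dirnames.sort()  # visit sub-directories in a stable order
            for name in sorted(filenames):
                full_path = os.path.join(dirpath, name)
                # Store each file under its path relative to the package root.
                zf.write(full_path, os.path.relpath(full_path, root_dir))
    return zip_path
```

All sub-directories of the selected root are included, matching the rule that everything beneath the package root is assumed to be part of the package.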

The metadata is included as an RDF/XML file. Both Javascript and Python have reasonably easy-to-use parsers.
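
For example, simple literal properties can be read from an RDF/XML file using only the Python standard library, as sketched below. The sample document and the choice of Dublin Core terms are assumptions made for illustration (the actual vocabulary used in ADMIRAL metadata is not described here), and a general XML parser like this only copes with simple, flat RDF/XML; a dedicated RDF parser would be needed for arbitrary RDF.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DCTERMS_NS = "http://purl.org/dc/terms/"

# Hypothetical metadata file content, for illustration only.
SAMPLE = """\
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dcterms="http://purl.org/dc/terms/">
  <rdf:Description rdf:about="">
    <dcterms:title>Example dataset</dcterms:title>
    <dcterms:creator>A. Researcher</dcterms:creator>
  </rdf:Description>
</rdf:RDF>
"""


def literal_properties(rdf_xml):
    """Map each property URI to its literal text, for flat rdf:Description blocks."""
    root = ET.fromstring(rdf_xml)
    properties = {}
    for description in root.iter(f"{{{RDF_NS}}}Description"):
        for prop in description:
            if prop.text and prop.text.strip():
                # ElementTree tags look like "{namespace}local"; join them into a URI.
                namespace, _, local = prop.tag[1:].partition("}")
                properties[namespace + local] = prop.text.strip()
    return properties


metadata = literal_properties(SAMPLE)
print(metadata[DCTERMS_NS + "title"])
```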


Databank deployment

In our current deployment, a separate silo is allocated for each research group. An alternative choice would have been a single silo for all research groups in the whole department. Our choice is mainly so that different groups do not need to avoid data package names used by other groups, but it also means that each research group sees only its own packages when listing Databank contents, and that each silo is a separate administrative domain with its own set of users.

Within a silo, researchers must ensure that submitted data packages have unique names. The alternative (and original design) was that an arbitrary number would be included as part of the URI for each data package, but feedback from users and other experts argued strongly that the entire URI should consist of elements that are meaningful from a user's perspective.

Databank offers the potential for per-silo SOLR index configuration, which can be used for customized metadata indexing for data discovery. E.g. a bioinformatics group might include gene and species name information in their searchable metadata. To date this feature has not been used or tested.

All data package submissions by ADMIRAL users are handled via the ADMIRAL server system. Databank was originally envisaged to include a direct web interface for this purpose, and may yet do so, but to date we have not tried to use this facility. One advantage of handling submissions via the ADMIRAL server is that we can capture and store submission metadata locally, making subsequent update submissions easier to perform.
