ADMIRAL LSDS requirements and survey
Introduction
Fundamental to our early goals is the establishment of an easy-to-use local data sharing service (LSDS).
Requirements
We need to select a file-sharing system that meets the following requirements (there is some overlap here, though each reflects a different use-case):
- Rock-solid reliability and availability
- File system clients for Windows, Linux and MacOS
- Competent access control, preferably with some level of RBAC and delegation
- Files remotely readable and writable, preferably using normal file access mechanisms
- Readable and writable using a simple HTTP-based protocol (e.g. WebDAV)
- Files accessible (readable) as ordinary web resources
- Locally or cloud hostable
- Can be augmented with additional local services for researchers (backup, annotation, packaging, visualization, etc.)
Also desirable:
- Synchronization with local or offline working copies
- File versioning
Candidates
- Linux with CIFS, Apache and mod_dav (WebDAV). Ubiquitous, free, reliable local file sharing, and a good platform for additional services. Remote access is less clear. Local access control is easy enough, but doesn't always play well with WebDAV (e.g. we want common authentication and access control for CIFS and WebDAV).
- Xythos (http://www.xythos.com/products/webfile_server.html) - solid, functional WebDAV server. Comprehensive access control. Focussed on document management, and not obviously separable from that. Commercial and rather expensive.
- DropBox (https://www.dropbox.com/), was GetDropBox.com. Has a really nice interface, but is commercial and is only available as a hosted service, which also makes it hard to add additional services. Good example of the kind of facilities we'd like to see.
- ExtremeFS ...
- SOLR / Lucene indexing for search and discovery
- http://milton.ettrema.com/index.html - Milton WebDAV library, apparently can be used to wrap an arbitrary resource (database, file system, etc.)
- http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VOSDAV - Virtuoso OpenLink software also includes a WebDAV server
- http://fuse.sourceforge.net/sshfs.html - SSH file system client for Linux
- http://www.cam.ac.uk/cs/pwf/remote/ - Cambridge University file access facilities
Commentary
A preliminary search for "WebDAV server software" (and a number of variants) and "File synchronization software" does not reveal any candidate systems that are obviously superior to a standard Linux platform with CIFS.
The Milton WebDAV server library looks interesting, but mainly as a way of fronting data servers of various kinds, and will probably involve some programming to use.
The Virtuoso OpenLink server support for WebDAV is also interesting, particularly in light of our intended use of RDF, but it's not obvious how we gain near-term benefit from this.
Tentative proposal
In the absence of an obviously superior system, I propose to use a standard Linux system (Ubuntu) with CIFS as the main file sharing mechanism. CIFS is a dialect of Microsoft's SMB file sharing protocol, and has the advantage of very widely used clients for Windows, Linux and Mac platforms, among others. The disadvantage of CIFS is that it is mainly a local area file access protocol, and not really suitable for remote access.
Remote access can be provided by a combination of complementary mechanisms, many of which are supported by standard Linux server applications:
- SSH/SFTP (with WinSCP client for Windows, Cyberduck client for MacOS, FileZilla client for all major platforms). None of the clients just mentioned fully integrate with the file system. There is also sshfs for Linux (http://fuse.sourceforge.net/sshfs.html).
- WebDAV server via Apache HTTPD and mod_dav. Windows, MacOS and Linux have file system clients using WebDAV. Historically there have been compatibility problems, but I'd hope that, by now, these have been ironed out.
- NFS. Historically based on UDP and most suitable for local deployment, but it can also run over TCP and be used for remote access. I don't know about the quality of current cross-platform NFS implementations.
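To make the WebDAV option above a little more concrete, here is a minimal sketch of how a script might list a remote directory over plain HTTP using only the Python standard library. The host name, path and credentials are hypothetical placeholders, and the sketch assumes the server accepts HTTP Basic authentication (which should only be used over HTTPS). It builds the request without sending it.

```python
import base64
import urllib.request

# Minimal PROPFIND body asking the server for all properties of each resource.
PROPFIND_ALLPROP = (
    b'<?xml version="1.0" encoding="utf-8"?>'
    b'<D:propfind xmlns:D="DAV:"><D:allprop/></D:propfind>'
)


def build_propfind(url, user, password, depth="1"):
    """Build (but do not send) a WebDAV PROPFIND request for a collection.

    depth="1" asks for the collection itself plus its immediate members.
    """
    # Assumption: the server is configured for HTTP Basic authentication.
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        data=PROPFIND_ALLPROP,
        method="PROPFIND",
        headers={
            "Depth": depth,
            "Content-Type": "application/xml; charset=utf-8",
            "Authorization": f"Basic {token}",
        },
    )


# Hypothetical server URL and account, for illustration only.
request = build_propfind("https://lsds.example.org/dav/data/", "alice", "secret")
```

Passing the request to urllib.request.urlopen should return the server's 207 Multi-Status XML listing. Ordinary reads and writes are plain GET and PUT requests against the same URLs, which is what makes WebDAV attractive for the "simple HTTP-based protocol" requirement.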
Access control
(Need a plan: use Linux authentication or something else? How to integrate with SSO for CIFS access?)
(Is a Linux PAM module the answer?)
Synchronization
A basic file access protocol won't perform synchronization. Although this isn't an immediate "must have", it's clear that it could be useful for certain researchers, for example those that collect data in the field, away from Internet access.
A possible approach is to use DropBox.com in conjunction with a conventional file system: by synchronizing all or part of a user's LSDS file area via DropBox.com, and also synchronizing the same data with space on some other machine (e.g. one used for offline working), the desired synchronization should be achieved automatically. The possible drawback is that the user must be prepared to trust the DropBox.com service as a staging area for their data.
Versioning / history
A basic file access protocol won't provide versioning either. This, too, isn't an immediate "must have", but it has come up in several contexts that this could be a very compelling feature for researchers. ("My data was all good yesterday, but I just messed it up. Can I get it back?" Hmm... that use-case is probably best handled by the system backup, but maybe "My data was good yesterday, but now the analysis is giving strange results - what has changed?")
I would guess that access to past versions doesn't have to be transparent: a researcher is probably prepared to pay some attention and spend 15 minutes to recover half a day's data.
A solution I have in mind is to use a Subversion (or Mercurial or Git) repository for the history recording, with a daily cron job to record a new copy. (I do this for my Linux home server configuration.) A simple web interface and CGI script can be used for saving specially tagged versions, and for retrieving old versions of data.
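The daily recording step might look something like the following sketch. It assumes Git rather than Subversion, assumes the shared data directory has already been initialised as a Git repository, and uses hypothetical function names and paths. It is not a description of the existing home server setup.

```python
import datetime
import subprocess


def snapshot_commands(data_dir, when):
    """Return the git commands that record the current state of data_dir."""
    message = "Daily snapshot " + when.strftime("%Y-%m-%d")
    return [
        # Stage every addition, modification and deletion under data_dir.
        ["git", "-C", data_dir, "add", "--all"],
        # --allow-empty keeps the job from failing on days with no changes.
        ["git", "-C", data_dir, "commit", "--quiet", "--allow-empty", "-m", message],
    ]


def take_snapshot(data_dir):
    """Run the snapshot commands, raising CalledProcessError if one fails."""
    for argv in snapshot_commands(data_dir, datetime.date.today()):
        subprocess.run(argv, check=True)
```

A cron entry would then call take_snapshot once a day on the shared area (for example a hypothetical /srv/lsds/data). Because every snapshot is an ordinary commit, the "what has changed?" question becomes a diff between two dated commits.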
We should look for possible use-cases when performing the data audits.

