Meetings/20070503/DefiningImageAccess-SERS-Oxford
Meeting held at OULS, Osney Mead, on 3 May 2007
Present
- Sally Rumsey
- Neil Jefferies
- Alexander Huber (ODL)
- Jun Zhao
- Graham Klyne
Summary
Some key facts emerging from this discussion:
- ORA is a strategic service initiative with a view to very long term preservation. As yet, there are no image collections.
- We are not aware of any OAI-PMH accessible image collections currently online at Oxford.
- Domain-specific metadata will probably be best handled through the content model and composite formats. Fedora has composite object support that supports a general graph model rather than nested components (think: RDF vs XML?).
Repository software system
- What are the merits and drawbacks of your chosen repository software system(s)?
The original Oxford repository, developed as part of the SHERPA project, used ePrints software, which has limitations when creating a strategic repository design. ePrints required certain key decisions to be made at install time, which is not desirable in an environment intended to support long term preservation: flexibility to change and evolve is needed. The ePrints software is focused on the ePrints (e.g. publication and thesis) model, though recent developments are looking toward arbitrary multimedia.
DSpace is less architectural, more pragmatic, and also requires metadata assumptions stated at setup time.
Fedora has an architecture that doesn't require a priori assumptions; it will survive, evolve and scale. But it has no user interface -- rather it is a pure infrastructure element consisting of an interface definition and an implementation.
OULS are supplementing Fedora with VTLS VITAL and VALET, plus some home-developed custom extensions. (VALET is the open source subset of VITAL.) VITAL provides extra support for the workflow interfaces, on top of the functionality provided by Fedora.
Oxford Research Archive (ORA) services are already live and accepting ePrints and eTheses. ORA is a service (NOT a project) - see the ORA home page [3]. There is a very good summary of the ORA service features here [4].
Scale and infrastructure
- What is the planned scale of SERS repository services (e.g. how many distinctly identifiable objects are anticipated)?
One driving project is Google Books: about a million volumes, accessible at the book level, with an estimated average 300 pages/book.
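As a rough back-of-envelope check, the two estimates quoted above imply the following number of pages behind the book-level objects. This is only a sketch: the variable names are illustrative, and treating each page as one stored image is an assumption rather than something stated at the meeting.

```python
# Estimates quoted in the discussion (not measurements).
volumes = 1_000_000          # "about a million volumes"
avg_pages_per_volume = 300   # "estimated average 300 pages/book"

# Objects accessible at the book level, versus the pages they contain.
# Assumption: one page image per page.
book_level_objects = volumes
page_images = volumes * avg_pages_per_volume

print(f"book-level objects: {book_level_objects:,}")
print(f"page images:        {page_images:,}")
```

On these figures, access at the book level means about a million identifiable objects, while the underlying page count is of the order of 300 million.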
Another repository will be OUDL collections: about 10 TB of data, but no clear idea about the number of identifiable objects.
Currently OULS are working with Sun to replace the underlying file system with a dedicated object store and to enhance the object store's underlying metadata formats. Would like built-in XML parsing, arbitrary XML support, etc. No plans to expose direct web access to content, but will use Fedora and supporting software to control access and presentation. Will probably bypass Fedora for ingest - Fedora can build repository structure from data in the object store.
ORA sits alongside the Google Books repository as a separate archive on the same underlying infrastructure. Likely a large number of archives/collections. Quantity is potentially all research publications (including journals, conference papers, reports, etc.) produced by the University. The raw research data are not yet included in the archive. Consider number of researchers in university and their total output. Not necessarily core datasets, image sets, etc., which will be handled by the "repository link" project.
Repository link project: repository tie-up with the "OERC storage grid" for storage of raw data. The repository will hold metadata records for each of the datasets, which can be accessed via the data grid Storage Resource Broker (SRB). Can link to different "content" models. Goal is that everything will have DC records, but derived by processing from canonical source metadata. These source metadata could be based on heterogeneous schemas, which are then mapped to the universal DC model.
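The idea of deriving DC records from heterogeneous source metadata can be sketched as a per-schema crosswalk table. This is a minimal illustration, not the tooling discussed at the meeting: the schema names, the source field names and the `to_dublin_core` helper are all invented for the example.

```python
# Hypothetical crosswalks: source-schema field -> Dublin Core element.
# Schema names and source field names are invented for illustration.
CROSSWALKS = {
    "thesis_schema": {"thesis_title": "title", "candidate": "creator", "awarded": "date"},
    "dataset_schema": {"name": "title", "owner": "creator", "collected_on": "date"},
}


def to_dublin_core(schema, record):
    """Derive a simple DC record from a canonical source record."""
    mapping = CROSSWALKS[schema]
    dc = {}
    for field, value in record.items():
        element = mapping.get(field)
        if element is not None:
            # DC elements are repeatable, so values are collected in lists.
            dc.setdefault(element, []).append(value)
        # Source fields with no mapping stay in the source record only.
    return dc


dataset = {
    "name": "Example image set",
    "owner": "A. Researcher",
    "collected_on": "2007-05-03",
    "instrument": "microscope",  # no DC equivalent in this crosswalk
}
dc_record = to_dublin_core("dataset_schema", dataset)
print(dc_record)
```

The source record remains canonical; the DC record is a derived view, so adding a new source schema only requires a new crosswalk entry.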
Note that the term "institutional repository" is misleading -- we need to think in terms of multiple independent repository systems (even if some run on common infrastructure), each containing multiple collections.
Image collections at Oxford
- a brief survey of image collections in Oxford repositories, particularly with a view to identifying any with good metadata and used for research. We would like to come away with specific live OAI-PMH endpoint URIs so we can subsequently explore the metadata in greater detail. How many collections consist of or contain images? What is typical number of records and images?
No harvestable image collections at OULS yet. Don't know of any OAI repositories in Oxford. OERC may have large collections in their data grid Storage Resource Broker (SRB) store.
Image repository metadata standards are not well defined as yet, and OULS are looking for guidance in this area.
For images, there are well-organized collections in the Classics department (e.g. Beazley archive), but these are not in an OAI repository.
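Once a live endpoint is identified, exploring its metadata would typically start with a few standard OAI-PMH requests. The sketch below only builds the request URLs. The verbs and the `oai_dc` metadata prefix come from the OAI-PMH specification, but the endpoint address is a placeholder and the `oai_request_url` helper is a hypothetical name.

```python
from urllib.parse import urlencode


def oai_request_url(base_url, verb, **arguments):
    """Build an OAI-PMH request URL for the given verb and arguments."""
    query = urlencode({"verb": verb, **arguments})
    return f"{base_url}?{query}"


# Placeholder endpoint: not a real Oxford service.
BASE_URL = "http://repository.example.org/oai"

identify_url = oai_request_url(BASE_URL, "Identify")
formats_url = oai_request_url(BASE_URL, "ListMetadataFormats")
records_url = oai_request_url(BASE_URL, "ListRecords", metadataPrefix="oai_dc")

for url in (identify_url, formats_url, records_url):
    print(url)
```

`ListMetadataFormats` is the request that would show whether an endpoint offers anything beyond simple Dublin Core.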
Organization of collections
- how are collections organized -- e.g. is there a grouping by subject or content type?
Considering two OULS collections: ePrints (journal articles, conference papers, etc.) and eTheses. These are essentially flat collections, but can be grouped by subject type based on subject metadata; arbitrary "overlay" groupings may be determined by other metadata. Could end up with collections by subject with views that differ from the default view. (Hmmm... what does "view" mean in this context? I think it may have to do with how the metadata is presented in a web browser interface.) Different applications (metadata presentations?) can be layered over the repository. It is not a requirement of Fedora, but currently collections tend to have a uniform content model - this makes for a richer possible display and use of metadata.
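One way to picture "overlay" groupings over a flat collection is as indexes computed from item metadata, leaving the stored items untouched. The following is a sketch under assumed data: the item records, field names and the `overlay_by` helper are hypothetical and do not reflect how Fedora or VITAL implement this.

```python
from collections import defaultdict

# Hypothetical flat collection: no hierarchy, only per-item metadata.
items = [
    {"id": "item-1", "type": "eprint", "subjects": ["Classics"]},
    {"id": "item-2", "type": "ethesis", "subjects": ["Zoology", "Computing"]},
    {"id": "item-3", "type": "eprint", "subjects": ["Zoology"]},
]


def overlay_by(items, key):
    """Group item identifiers by the value(s) of one metadata field."""
    groups = defaultdict(list)
    for item in items:
        values = item[key]
        if isinstance(values, str):
            values = [values]  # treat a single value as a one-element list
        for value in values:
            groups[value].append(item["id"])
    return dict(groups)


by_subject = overlay_by(items, "subjects")
by_type = overlay_by(items, "type")
print(by_subject)
print(by_type)
```

An item with several subjects appears in several groups, which a strictly nested collection structure could not express without duplication.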
"Content model" - determines what metadata goes with an object, how it is displayed, what is entered when an object is deposited. Fedora and associated software may apply mappings to the metadata determined by a content model to provide a consistent interface; e.g. mapping to basic Dublin Core.
Ben (?) has generated a generic metadata mapping tool. LoC have done work on mapping metadata standards. Possibility we can learn from their work. There may be some design notes available on the web, or from OULS - these would be most interesting to read.
ACTION: Neil - to send copy of proposal for BID (Bridging the Interoperability Divide) - joint OERC, libraries, and OUCS. Different contributors have different storage resources: OERC have SRB (grid storage) which can be enhanced with a friendly user interface and metadata harvesting; libraries have ORA, which can be enhanced with links to objects on grid SRB; OUCS have the ASK learning objects repository which can be linked via other services.
Jun asked about identity issues - Neil responded that all elements will be WebAuthed. But the question was about persistent identifiers. It is not currently clear to what extent OERC and OUCS identifiers are persistent. OULS are looking at applying a query-by-description (or content) approach to these data, which should be resilient to any change of local object identifier - e.g. moving to a different base URI. OULS are considering having OAI-PMH harvesting of ORA actually return results from across the three repositories - in this way, persistent object identifiers assigned for ORA can be exploited for access to the other data sources.
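The query-by-description idea can be illustrated with a small sketch in which a persistent identifier is bound to a description of the object rather than to its current URI. Everything here is assumed for illustration: the identifier syntax, the `DESCRIPTIONS` table, the catalogue structure and the `resolve` helper are hypothetical, and exact-match comparison stands in for whatever query mechanism would actually be used.

```python
# Hypothetical table: persistent identifier -> description of the object.
DESCRIPTIONS = {
    "ora:example-1": {"title": "Leaf survey images", "creator": "A. Researcher"},
}


def resolve(persistent_id, catalogue):
    """Find the object's current URI by matching its stored description."""
    wanted = DESCRIPTIONS[persistent_id]
    for entry in catalogue:
        metadata = entry["metadata"]
        if all(metadata.get(key) == value for key, value in wanted.items()):
            return entry["uri"]
    return None  # no object currently matches the description


metadata = {
    "title": "Leaf survey images",
    "creator": "A. Researcher",
    "format": "image/tiff",
}
# The same object before and after a move to a different base URI.
catalogue_before = [{"uri": "http://old.example.org/data/17", "metadata": metadata}]
catalogue_after = [{"uri": "http://new.example.org/objects/17", "metadata": metadata}]

uri_before = resolve("ora:example-1", catalogue_before)
uri_after = resolve("ora:example-1", catalogue_after)
print(uri_before)
print(uri_after)
```

The persistent identifier and its description stay fixed; only the catalogue held by the remote store changes when the object moves.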
Domain specific metadata
- domain specific metadata - is it supported, and does support for a new schema require repository administrator action? Discussion of a proposal for domain specific metadata that has been made by the ePrints software developer at Southampton [1].
This raises the question of whether OAI-PMH will be used for anything other than DC? Not sure at this time - may punt the question as an object composition issue. Patterns for delivering domain specific metadata based on object composition are expected to come out of the ongoing work of ORE, which should be easily supportable by Fedora. Also, uncertainty about the quality of domain-specific metadata submitted by researchers is a cause for concern.
Fedora content model: the implementation of OAI-PMH is "just" another expression of a content model through an interface.
Storage of arbitrary metadata content is not a problem, and one can pull a complete object, but could not necessarily present it through standard interfaces. Just about anything can be ingested, but dissemination is not really worked out for all content types. Dissemination here is about discovery rather than physical access; associated content model may define minimal metadata. But with a given handle it will always be possible to retrieve the raw content.
Planned image deposition activities
- what are current and planned image deposition activities?
Nothing really at present - so have not really encouraged people to come up with image collections.
Will take references to images using SRB, in the expectation that images and other large datasets will be stored in the storage grid. The repository may hold metadata records for such data. The OULS repository will hold Google and ODL images.
- are there any particular facilities for depositing and/or accessing images that are different from other kinds of submission?
Not yet.
Granularity of access
- what is the granularity of access and/or discovery for deposited items (e.g. paper, image collection, image, etc.)?
This corresponds to the level at which persistent identifiers are allocated.
Depends on content model; paper, thesis, "sensible grouping". Also depends to some extent on depositor.
The repository is very unlikely to hold metadata at the per-image level. This level of metadata is expected to be disseminated by the SRB repository hosted by OERC.
Recommendations and policies
- what procedures, recommendations and/or policies do you have for getting good metadata for deposited items? How do policies vary across current collections?
Sally is working on policies. Self-archive can't demand too much of depositors, and will be asking for minimal mandatory fields. Controlled vocabularies are used where possible. Technical metadata (rights, etc) is automatically generated. Just four mandatory fields have been identified: title, publication status, peer reviewed and copyright status.
OULS plan that everything deposited will be double-checked manually, but are not sure yet that this approach can be adequately scaled.
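A check on the four mandatory fields named above could be automated ahead of the manual review. The sketch below is illustrative only: the field keys, the deposit record layout and the `missing_mandatory_fields` helper are assumptions, not part of the ORA deposit interface.

```python
# The four mandatory fields identified in the discussion.
# The key spellings are assumed for this example.
MANDATORY_FIELDS = ("title", "publication_status", "peer_reviewed", "copyright_status")


def missing_mandatory_fields(deposit):
    """Return the mandatory fields that are absent or empty in a deposit."""
    # A value of False (e.g. "not peer reviewed") counts as present.
    return [field for field in MANDATORY_FIELDS if deposit.get(field) in (None, "")]


deposit = {
    "title": "An example paper",
    "publication_status": "published",
    "peer_reviewed": False,
}
missing = missing_mandatory_fields(deposit)
print(missing)
```

A submission form could refuse to proceed while this list is non-empty, leaving the manual check to concentrate on the quality of what was entered.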
Copyright and intellectual property
- how do you deal with copyright and other intellectual property issues?
This is a big area. The repository deposit agreement addresses IPR issues. Embargo will be possible. Dark archives will be possible. There is not currently a notion of deposit purely for preservation.
Rights metadata is still being considered. There will be a takedown policy in the event of infringement notifications. (Will this be removal or hiding of content?)
End-use restrictions can be specified. Thesis copyright is often owned by author, and there is a need to be wary of permission conflicts: mandatory deposit of theses requires sound policies in place to deal with all the ramifications.
(University believes it owns rights to all employee IPR. There are several waivers and exceptions.)
Encapsulation standards
- is there a planned role for digital object encapsulation/composition standards like METS, MPEG-21, etc?
Yes, there is a role, but this is work in progress. Different repositories may have different conventions. Some research grants mandate particular forms - required metadata, etc. Composite object formats will be important in realizing these requirements, but the end result of such considerations is not yet entirely clear.
(cf. ORE [8], [9] work on abstract models that can be realized in different ways.)
(new ePrints software has a pluggable interface for ingest/export.)
Bottom line: still looking for good answers and defined standards.
Ingest procedures
- what ingest procedures do you support or prefer? (e.g. batch submission, direct web submission, special submission client, etc.)
Like batch submission (e.g. from existing repository).
For normal use, direct web submission is anticipated. Want to do everything via the web.
Don't want users let loose with the Fedora administrative client.
- do you have a view of any larger workflows within which the repository plays a part?
(Management flows: Overall steering group - Oxford Digital Repositories.)
OULS would like to have research life-cycle involvement. Cites example of BVREH [5] (or was that IBVRE [6]?). Would like to collect metadata automatically. Involvement in the startup phase of a research project, possible involvement in searches for existing work, and in determining deposit requirements in grant proposals, would all be useful.
Separately, involvement in the RAE submission process - tag submissions and check that RAE conditions are met. Do RAE submission entirely within the repository system (e.g. Southampton ePrints RAE submission).
At the other end of the research process, pushing data out to departmental web sites, personal web pages. e.g. RSS feed of papers from repository. Control of involvement in release of materials via VLE.
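The RSS feed idea can be sketched with the Python standard library. This is a minimal illustration and not the ORA implementation: the `papers_to_rss` helper, the paper records and the URLs are placeholders, and the feed carries only the basic RSS 2.0 channel and item elements.

```python
import xml.etree.ElementTree as ET


def papers_to_rss(feed_title, feed_link, papers):
    """Serialize a list of paper records as a minimal RSS 2.0 document."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    # RSS 2.0 requires title, link and description on the channel.
    ET.SubElement(channel, "title").text = feed_title
    ET.SubElement(channel, "link").text = feed_link
    ET.SubElement(channel, "description").text = f"Recent deposits: {feed_title}"
    for paper in papers:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = paper["title"]
        ET.SubElement(item, "link").text = paper["link"]
    return ET.tostring(rss, encoding="unicode")


# Placeholder records and URLs, for illustration only.
papers = [
    {"title": "First example paper", "link": "http://repository.example.org/objects/1"},
    {"title": "Second example paper", "link": "http://repository.example.org/objects/2"},
]
feed_xml = papers_to_rss("Example department", "http://dept.example.org/", papers)
print(feed_xml)
```

A departmental or personal web page could then embed the feed without needing any direct access to the repository itself.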
Repository access
- how many people access deposited items? Do you know who they are (e.g. Oxford, other University researchers, general public)?
OULS have only been monitoring access for a month. They vaguely know who is accessing the data - people all around the world. OULS are probably doing more to monitor usage patterns than, say, Cambridge. Resources are heavily used from outside the university - 80-90% of accesses are external.
OULS view the repository as a service, not just an archive. One possible use is to generate an income stream from the digitized material - as an integral part of the service rather than as a third party selling exercise. E.g. Bodleian shop [7].
Data webs approach
- feedback on our proposed approach to handling research images across multiple repositories (illustration at [2]). Do any particular pitfalls come to mind? What are the currently available options for locating images in repositories?
The immediate response to this question was that we are essentially trying to do for images what OAI-PMH + OAI harvesters do for ePrints, etc. It is the case that OULS are planning to provide functions within the repository service that parallel dataweb functions (e.g. metadata adaptation for presentation).
If we believe we are offering more, we need to be clear what that may be. Or maybe, in the context of repositories, if our proposals address the same problems in different ways, we should articulate the advantages of our approach in a wider context. Possible points to consider include: a generic integration solution that works for any OAI repositories, integrating OAI repositories with other web resources, working with domain-specific metadata, adapting semantics to particular communities of interest.
References
[2] http://imageweb.zoo.ox.ac.uk/drupal/files/20070416-SoftwareOverview.png
[3] http://www.ouls.ox.ac.uk/ora
[4] http://www.ouls.ox.ac.uk/__data/assets/pdf_file/0010/14689/ORA_key_facts.pdf
[5] http://bvreh.humanities.ox.ac.uk/
[6] http://www.vre.ox.ac.uk/ibvre/
[7] http://shop.bodley.ox.ac.uk/
[8] http://www.openarchives.org/ore/
[9] http://www.ukoln.ac.uk/repositories/digirep/index/OAI-ORE

