DefiningImageAccess/Report
NOTE: this page is now obsolete - the actual project final report is at: http://imageweb.zoo.ox.ac.uk/pub/2007/DefiningImageAccess/FinalReport/
Defining Image Access Final Report DRAFT
See also: DefiningImageAccess/Presentation
Title page
JISC Defining Image Access Project Final Report
A report from:
The Image Bioinformatics Research Group
Department of Zoology and Oxford e-Research Centre
University of Oxford
Authors:
Graham Klyne <graham.klyne@zoo.ox.ac.uk>
Julie Allinson <j.allinson@ukoln.ac.uk> (now at ***)
Jun Zhao <jun.zhao@zoo.ox.ac.uk>
David Shotton <david.shotton@zoo.ox.ac.uk>
Versions:
This version (11 July 2007): http://imageweb.zoo.ox.ac.uk/wiki/index.php/DefiningImageAccess/Report
Previous version (date): http://www.***
Latest version (date): http://imageweb.zoo.ox.ac.uk/wiki/index.php/DefiningImageAccess/Report
Final JISC version (date): http://www.***
Project title:
Defining Image Access: Requirements for interoperable discovery and delivery of image data stored in DSpace, EPrints and Fedora-based institutional repositories using a data web approach
Project purpose:
The Defining Image Access Project was a requirements analysis project to investigate the feasibility of establishing a data web that would permit cross-searchable integration of institutional repository image collections using light-weight Semantic Web techniques.
Project consultant partners:
Cambridge University Repository, DSpace@Cambridge (http://www.dspace.cam.ac.uk/).
CCLRC e-Science Centre (http://www.e-science.clrc.ac.uk/).
Dan Brickley, an independent Semantic Web consultant (http://danbri.org/).
Imperial College: Internet Centre (http://www.internetcentre.imperial.ac.uk) and Library.
Oxford University: Library Services (http://www.ouls.ox.ac.uk/), Computing Services (http://www.oucs.ox.ac.uk/) and Oxford e-Research Centre (http://www.oerc.ox.ac.uk/).
Southampton University: School of Electronics and Computer Science (http://www.ecs.soton.ac.uk/) and e-Prints software team (http://www.eprints.org/software/).
UKOLN, Digital Repositories Programme Support Team (http://www.ukoln.ac.uk/repositories/digirep/).
Project duration: January to June, 2007. Project funding: £62,991. Funded from the Discovery to Delivery strand of the JISC Repositories and Preservation Programme (http://www.jisc.ac.uk/whatwedo/programmes/programme_rep_pres.aspx). JISC programme manager: Balviar Notay (b.notay@jisc.ac.uk).
[Note: In this report, as in common usage, the word 'metadata' is used as a collective noun and takes the singular form of verbs. We apologize in advance to Latin purists to whom this may cause offence.]
Copyright © 2007 University of Oxford.
Published under the Creative Commons Attribution-Noncommercial-Share Alike 3.0 License (http://creativecommons.org/about/licenses/meet-the-licenses).
Key Messages
[Not part of the report content per se, but output from a brainstorming exercise to try and better organize the materials]
Outcomes we want to convey
David:
- Repositories currently do not have sufficient image or metadata content to be valuable in terms of our original vision of image discovery
- Our review of tools and standards encourages us that Data Webs can be created substantially with available web software tools
- Through our survey work, we have adapted our view of data webs to be a smaller part of a framework linking repositories to a wider landscape, including systems used during the conduct of bench- or field-research
Graham, add:
- Adopting the principle of using the Web as the primary platform for integrating data sources means that many of the needed developments are already available as web software systems.
Jun, add:
- Maintaining the origins of all data: linking not copying.
- Local harvesting as opposed to centralized harvesting supports this view
- Building systems bottom-up rather than top-down: start by targeting specific domains and applications and looking to generalize those solutions, rather than trying to build a fully generalized system from the outset. One key advantage of this is "knowing if/when you've succeeded". Our philosophy is very much inspired by the principles of agile programming.
- Also "think globally, act locally" (cf. http://en.wikipedia.org/wiki/Frank_Feather, http://www.w3.org/QA/2003/07/LocalAction, http://www.manageability.org/blog/stuff/why-rest-part-3/view, lots of references but I can't find one I'd call definitive).
Recommendations
David:
- Implement SPARQL endpoints as an alternative to OAI-PMH
- (discussion of this: SPARQL isn't so much an alternative but a different part, a way to provide searching, and a way to link OAI repositories to the wider web. We probably need a short piece about SPARQL and its role in this landscape.)
- Active consideration of how to acquire, harvest and curate non-textual information (images and other data)
- Build bridges to link repositories to other sources of data on the web.
Graham:
- More focus on the web as a platform
- Think in terms of an ecosystem of repositories: local <-> global, short-term <-> long-term (or: timely access <-> preservation), domain-specific <-> universal knowledge.
- Is it time to revisit the JISC Information Architecture as articulated by Andy Powell, and consider how to present it as an overlay on the web rather than a framework in which IE components appear to relate only to other IE components?
- this is an aspect of David's "build bridges"
- Develop tooling that supports existing research practices (and concerns) while also capturing metadata that can be published later.
- Make submission to a repository much easier, overcome the hurdles for researchers to make such submissions
- Overcome the tensions about where to deposit: make it easy to deposit anywhere or everywhere (JISC are already pursuing this through BID, SWORD; MURDER submission by OULS/ORA).
Schema Alignment vs Coreference
We make much of separating schema alignment from coreference, without being entirely clear about what we mean by this. Partly, the distinction isn't always clear but, roughly, what this means is distinguishing between ontology- or schema-alignment and detecting references to the same object or instance in different data sources.
Work on relational database alignment has an easier time of it: the distinction between schema and table data is clear. This is discussed in a recent CACM article: Semantic Matching Across Heterogeneous Data Sources, Huimin Zhao, January 2007 (http://portal.acm.org/citation.cfm?id=1188913.1188916).
For semantic web data, the distinction is sometimes less clear, as there can be some lack of clarity about whether a term is a class (an ontological or schema element) or an instance -- e.g., see http://www.w3.org/TR/swbp-classes-as-values/. Some examples:
- animals vs plants - these would usually be recognized as classes of objects
- Dolly the sheep - an instance of a sheep
- Hydrogen - the class of hydrogen atoms, but statements about Hydrogen might be treating Hydrogen as an instance.
- A strain, or genetic line, of drosophila: a subclass of all drosophila, but gene expression observations would likely relate to this as an instance.
On reflection, I think the distinction may be fairly easy: if a term appears in instance data, then recognizing different terms meaning the same thing in different instance data is a coreference problem. If different instance stores describe the same attributes using different schematic structures, then schema alignment is needed. This distinction isn't unambiguous, but I think it serves as a starting point.
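The distinction can be illustrated with a minimal Python sketch. The two record sets, their field names, the mapping table and the coreference table are all hypothetical, invented for illustration; they are not taken from any repository we surveyed. Schema alignment renames attributes; coreference resolution rewrites values that denote the same thing.

```python
# Hypothetical records from two instance stores describing the same subject.
repo_a = [{"organism": "Drosophila melanogaster", "gene": "aly"}]
repo_b = [{"species": "D. melanogaster", "expressed_gene": "always early"}]

# Schema alignment: different field names describe the same attribute.
# (Assumed mapping from repo_b's schema onto repo_a's schema.)
SCHEMA_MAP_B_TO_A = {"species": "organism", "expressed_gene": "gene"}

# Coreference: different terms in instance data denote the same thing.
# (Assumed equivalences, for illustration only.)
COREFERENCE = {
    "D. melanogaster": "Drosophila melanogaster",
    "always early": "aly",
}


def align_schema(record, schema_map):
    """Rename fields according to schema_map; unknown fields pass through."""
    return {schema_map.get(name, name): value for name, value in record.items()}


def resolve_coreferences(record, coref):
    """Replace values with their preferred term; unknown values pass through."""
    return {name: coref.get(value, value) for name, value in record.items()}


# The two steps are independent and can be applied by separate sub-systems.
aligned = [
    resolve_coreferences(align_schema(record, SCHEMA_MAP_B_TO_A), COREFERENCE)
    for record in repo_b
]
```

Keeping the two tables separate reflects the point made above: either one can be maintained, replaced or extended without touching the other.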
Comparing SPARQL and OAI-PMH
When OAI-PMH is the widely deployed protocol for accessing repository data and metadata, why are we recommending a different one?
- SPARQL provides query, or selection by metadata term values; OAI-PMH does not.
- SPARQL links to the wider semantic web
- SPARQL can be mapped to provide queries against arbitrary data sources (e.g. D2R server for relational data).
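The difference in request shape can be sketched as follows. This is an illustration only: the two endpoint URLs are hypothetical, and the query assumes that a repository exposes simple Dublin Core properties as RDF, which is not something we have confirmed for any particular repository. An OAI-PMH ListRecords request selects records by metadata format and, optionally, set; a SPARQL request carries a query that selects by metadata values.

```python
from urllib.parse import urlencode

# Hypothetical endpoints, for illustration only.
OAI_BASE = "http://repository.example.org/oai"
SPARQL_ENDPOINT = "http://repository.example.org/sparql"


def oai_list_records_url(base, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL.

    The request can name a metadata format and a set, but it cannot
    select records by the values of their metadata fields.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec is not None:
        params["set"] = set_spec
    return base + "?" + urlencode(params)


def sparql_query_url(endpoint, query):
    """Build a SPARQL protocol GET request URL carrying the query text."""
    return endpoint + "?" + urlencode({"query": query})


# Assumed data shape: images described with Dublin Core subject and title.
QUERY = """
PREFIX dc: <http://purl.org/dc/elements/1.1/>
SELECT ?image ?title
WHERE {
  ?image dc:subject "Drosophila melanogaster" ;
         dc:title ?title .
}
"""

oai_url = oai_list_records_url(OAI_BASE)
sparql_url = sparql_query_url(SPARQL_ENDPOINT, QUERY)
```

With OAI-PMH, any selection by subject has to happen on the client after harvesting; with SPARQL, the selection is part of the request.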
Successes and Failures
- Successes
- we learnt a lot about the repositories and repository systems
- we learnt a lot about web-based tools
- we clarified our vision that a data web approach would be feasible
- we validated the idea of Web as a platform
- we brought new knowledge to our project, such as SPARQL
- we built contacts with other people, organized successful meetings and got people to know us better (such as the Google retrievals)
- we validated the wiki philosophy
- we achieved a risk assessment against the project goals
- we have been writing up our work as the project goes on
- Failures
- could have done more surveys
- could have got to know more people
- lack of manpower
- unable to demonstrate one of the key goals, namely searching across the repositories
Executive summary
[One page only - expand in later sections]
Much of what follows is predicated on our belief that domain-specific metadata is essential for discovery and interpretation of research image data in repositories; generic metadata (e.g. people, institutions, projects) for image collections is of limited use beyond well-connected research communities.
Survey findings
Concerning image collections:
- There are not many image collections in institutional repositories. At this level, the dominant focus is on electronic forms of written papers (e-prints and e-theses); specialized support needed for image and other media collections is mostly absent.
- For those image collections we did find in institutional repositories, with the exception of SERPENT at Southampton University (http://serpent.eprints.org/), there is very little domain-specific metadata of the kind that would indicate the content of an image, or provide a context for its interpretation (e.g. in gene expression images, information about the expressed gene, type of organism, anatomical region and more are all vital for the image to be usefully interpreted).
- For software systems, images can be treated pretty much as any form of data. But the visual nature of images means that user interfaces for dealing with images should take into account the ways in which people can use and work with images - e.g. conventional search interfaces are not good for discovering images.
Concerning repository systems:
- Support for OAI-PMH is ubiquitous in institutional repositories, but is not a panacea for accessing domain metadata:
- it is administratively difficult, where it is possible at all, to deploy OAI-PMH to access varieties of domain-specific metadata in an institutional repository,
- OAI-PMH cannot perform discovery based on metadata values (the repository community model seems to be to use a separate service like OAIster for such operations).
Concerning metadata:
- Domain-specific metadata can reasonably be handled as part of the data with modest repository support: e.g. for ePrints: add a couple of structural metadata fields; for Fedora: define an appropriate content model and external tool support based on OAI-PMH for basic access.
- Where available, metadata quality is variable; even generic metadata (basic Dublin Core, etc.) is not always consistently provided within a single collection.
- To achieve the goals we have set ourselves for image data webs based on existing repositories, we need to address some issues of metadata capture and/or creation.
- In related discussions with some scientific researchers, we have noticed that useful metadata is often collected in a usable machine-processable form, at a time when researchers may be reluctant to make it public. Later, when papers have been published, the metadata has been lost, or the effort required to gather the metadata for publication exceeds any benefit to the researcher for so doing. This suggests that we need to tackle metadata capture separately from, and well in advance of, its publication.
- Pre-defined ontologies are best for re-use of associated data and images, but it is often very difficult to get researchers to agree in advance exactly what information is needed. We should seek out approaches that allow consensus to be codified yet do not prohibit recording of additional information that has not yet found community acceptance.
Conclusions
For image collections, more attention needs to be given to creation, capture and handling of domain-specific metadata. But the CLiC project has found that researchers will not, in general, enter more than 3 elements of metadata when submitting an image to a collection. Thus, for image data webs based on existing repositories, we need to find more successful ways of capturing such metadata. Some approaches come to mind:
- post hoc annotation of published images, often by researchers other than those who created the images. For existing image collections, this may be the only option,
- develop tooling to capture information earlier in the research cycle in ways that support and enhance existing research practices rather than work against them, [[[later, note role of collection organizing tooling and reference the list of tools reported by OxCLIC]]],
- develop techniques for storing and retrieving arbitrary metadata, in addition to the Dublin Core and other common elements that repositories currently handle.
Recommendations
To support publication and re-use of image collections, we recommend:
- deployment of publication frameworks to be researchers' trusted friend, to keep data safe and fully address concerns for confidentiality of data relating to work-in-progress,
- adaptation of repository ingress and publication tooling to support existing research processes, and timely dissemination of research data within a community of interest,
- engagement with repository object content model developments (e.g. ORE) to facilitate the migration of data publication within a community to global publication and preservation facilities.
For alignment of available metadata, we propose:
- separation of schema- and instance-level alignment sub-systems,
- adoption of a rule-based modular framework for schema alignment, with initial focus on simple forms of alignment but with options for more complex future additions, and
- engagement with metadata registry activities (e.g. IEMSR) to facilitate the re-use of common schema elements and mappings where appropriate.
Proposals for future work
We propose a software framework that:
- addresses local capture of idiosyncratic private metadata generated by a research project, and its subsequent migration to an institutional repository (i.e. recognizing that researchers and librarians bring complementary skills and concerns to bear on publication of data, initially focusing on capturing the value that researchers can bring through their intimate knowledge of the subject matter, later delivering this to institutional librarians who better understand issues surrounding global publication, discovery and preservation),
- exploits available tools where available and appropriate to the task at hand,
- provides for post hoc annotation of existing image collections as a route to metadata creation,
- uses the web as its primary framework for combining tools and information,
- exposes all metadata from diverse sources in a common format (RDF, accessed using SPARQL, RSS and/or Atom) that can be combined and queried as a distributed data source,
- uses the common format as a basis for a range of supporting tools for browsing, searching, further annotation, personalized access and more topic-specific services,
- provides a decentralized, subscription-based mechanism for aligning diverse schemas (vocabularies, ontologies) so that independently published resources describing common subjects can be cross-linked and composed, and
- incorporates personalization and trust elements in ways that allow researchers to adopt differing views on published data and annotations.
For implementation, we envisage that selected components may be independently implemented to serve the needs of specific end-user driven projects, avoiding the risks inherent in creating a large and complex software system that stands or falls on the successful integration of all components, whether needed or not.
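As a minimal sketch of what exposing metadata in a common format might involve, the following Python fragment serializes a simple metadata record as N-Triples, one of the plain-text RDF syntaxes. The record, its field names and the subject URI are hypothetical, and mapping every field directly onto a Dublin Core element is an assumption made for brevity; a real converter would need a per-repository mapping.

```python
# Dublin Core element set namespace.
DC = "http://purl.org/dc/elements/1.1/"


def escape_literal(value):
    """Escape backslashes, double quotes and newlines for an N-Triples literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def record_to_ntriples(subject_uri, record):
    """Return one N-Triples line per metadata field, in field-name order.

    Assumes every field name is a Dublin Core element name and every
    value is a plain ASCII string.
    """
    return [
        f'<{subject_uri}> <{DC}{field}> "{escape_literal(value)}" .'
        for field, value in sorted(record.items())
    ]


# Hypothetical record for illustration.
example = record_to_ntriples(
    "http://repository.example.org/image/1",
    {"title": "Testis, aly expression", "creator": "A. Researcher"},
)
```

Because each line is a self-contained statement, output from several repositories can be concatenated and loaded into a single RDF store for querying.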
Overview of project
Project goals
Data webs are based on distributed web publication of data and accompanying metadata, together with lightweight harvesting and aggregation of the metadata to a form that can be browsed and searched from a single point of entry, with direct links back to the original data sources for content delivery. In this view, a data web provides a focused service to meet defined data publication, access, integration and meta-research needs of a specific community of interest. Thus, a particular data web may embody domain-specific capabilities or knowledge, even though its software components may be quite generic.
We set ourselves the task of determining what is required to expose appropriate image metadata that can be harvested and marshalled to a point where it can be indexed and made cross-searchable, with links back to the repositories for retrieval/delivery of the original images. For example, it should be possible to search several repositories for images showing expression of the aly gene in Drosophila melanogaster at various anatomical loci, returning links to the images themselves and annotations about observed phenotypical variations.
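The intended behaviour can be sketched in a few lines of Python. This is a toy model under stated assumptions, not a design: the repository names, URLs, field names and records are all hypothetical, and the harvested metadata is assumed to have been aligned to a common set of field names already. The point it illustrates is that the aggregated index holds only metadata and a link, and a search returns links back to the source repositories rather than copies of the images.

```python
from dataclasses import dataclass, field


@dataclass
class ImageRecord:
    """Harvested metadata for one image, plus a link back to its source."""
    source_url: str
    metadata: dict = field(default_factory=dict)


def aggregate(harvests):
    """Merge per-repository harvests into a single searchable list."""
    index = []
    for records in harvests.values():
        index.extend(records)
    return index


def search(index, **criteria):
    """Return source URLs of records whose metadata matches every criterion."""
    return [
        record.source_url
        for record in index
        if all(record.metadata.get(key) == value for key, value in criteria.items())
    ]


# Hypothetical harvests from two repositories.
harvests = {
    "repo-a": [
        ImageRecord(
            "http://repo-a.example.org/images/17",
            {"organism": "Drosophila melanogaster", "gene": "aly", "anatomy": "testis"},
        ),
    ],
    "repo-b": [
        ImageRecord(
            "http://repo-b.example.org/items/204",
            {"organism": "Drosophila melanogaster", "gene": "aly", "anatomy": "wing disc"},
        ),
        ImageRecord(
            "http://repo-b.example.org/items/205",
            {"organism": "Mus musculus", "gene": "Pax6", "anatomy": "eye"},
        ),
    ],
}

index = aggregate(harvests)
hits = search(index, organism="Drosophila melanogaster", gene="aly")
```

Exact matching on field values is the simplest possible criterion; partial matching across differing schemas is where the alignment work discussed elsewhere in this report comes in.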
This project lays foundations for an anticipated JISC follow-up project, in which the vision expressed here for image search across institutional repositories can be realized, and also for continuing work on our broader vision of data webs to render published research images in institutional repositories interoperable and cross-searchable with those held in on-line journals, museum collections and elsewhere.
Aims and objectives
We set out to develop strategies and software designs for an image web to discover published images of all types and subjects across heterogeneous repositories, based on examination of institutional repositories at Oxford, Southampton, Cambridge and Imperial College (based variously on Fedora, ePrints and DSpace).
We have studied the existing repository software and metadata schemata to learn how they handle images and image metadata, with particular reference to granularity and detail of available information and how they are exposed for potential harvesting. We anticipate that our design will use Semantic Web techniques to uncover co-references in heterogeneous repositories, and use these as a basis for building cross-references between them.
Some initial guiding principles for our approach were:
- To work with existing repository and metadata formats as we find them.
- To design for use of existing software tools wherever possible.
- To design for use of existing metadata standards wherever possible.
- To avoid creating new information resources when adequate resources already exist.
- To leave control of repository content and access firmly with the publishing institution. We aim to work with whatever information they choose to make publicly available, allowing each publisher to balance their costs and benefits of making metadata freely available.
- To maintain full visibility of the existing repositories, leading users back to the sources rather than acting as a viewport through which they may be accessed.
- To minimize the amount of metadata that is harvested and aggregated, consistent with achieving our goals.
- To work as far as possible within the World Wide Web architectural framework.
- To design around the use of lightweight web application technologies with loose coupling between existing systems and maximum opportunity to replace or update any element of the technology used.
- To look toward further developments (not within the scope of this project) to create data webs by means of which images held within institutional repositories can be made cross-searchable with locally published research images and with images held by journal publishers, museums, other institutions and public databases (e.g. bioinformatics databases).
Motivation
Our image web work has been motivated by problems in post-genomic life science research. However, this JISC funded project has not been so limited, and is intended in part to explore what kinds of image resources may be available from institutional repositories. Through separate projects, we are also pursuing data web activities related to online journals and other digital collections, laboratory research user requirements, capturing image metadata as an integral part of research workflows and cooperative construction of metadata schema (ontologies); these other activities are outside the scope of the Defining Image Access project.
Gene expression experiments involve creating microscopic images of parts of organisms to which genetic techniques have been applied to make visible those regions in which a chosen gene is active (expressed). This information, in turn, helps researchers to understand how the genetic mechanisms interact with other life processes to guide the development of an organism, for example by showing how gene expression patterns vary between different mutations of a given reference organism. Producing these in situ gene expression images can be very time consuming, and a single image might be used in different lines of research. We anticipate that in the near future, such images generated in the course of a research project will be published online with sufficient metadata to interpret the image (e.g. organism, genetic strain, observed gene, developmental stage, etc.). Indeed, we have learned of a Cambridge-led multi-site project that aims to do exactly this for thousands of genes expressed at a range of different anatomical sites and stages of development in Drosophila.
A researcher exploring factors causing sterility in Drosophila may create a number of in situ images of gene expression in the testes, with a primary goal of studying sperm development. Another researcher may be interested in distribution patterns of gene expression products within a cell for study of internal cell transport mechanisms, for which some of the same images might contain useful information. But how is such a researcher to discover that the image even exists? Our Image Data Web aims to support using a single search operation to find, say, images of expression patterns for aly genes in Drosophila melanogaster that may be stored in various repositories with different but overlapping descriptive metadata. To do this requires some mechanisms to access and partially match metadata from different image sources, and locate those images that have associated metadata meeting some given criteria.
There are many unknowns about the ways in which institutional repositories deal with image collections and associated metadata; our work in this project is to survey a sampling of such image collections and associated software systems, and draw some conclusions about what can be achieved now and in the foreseeable future to find useful research images from available repositories.
Approaches
Our project is predicated on the notion that by using widely deployed web software components and ideas, we can short-circuit many of the development complexities that escalate the cost and limit the deployability of many information-sharing systems. We are entering a time of availability of several highly developed and work-hardened Semantic Web software tools, in addition to the many widely used and scalable web server and content management systems. We aim to focus our design efforts on information design and software selection rather than software design.
To achieve our goals, we need an understanding of existing open source repository systems (which are themselves examples of established web-based data management systems), with particular attention paid to their data handling capabilities, and also to the specific metadata schemas that are used by deployed systems. We also surveyed software tools that could be used in the construction of an image data web. The information we have gathered, and our interpretations of it, have been lodged in a publicly accessible wiki, to which we intend to continue adding information as our understanding grows through work in follow-on projects.
Based on this information, we have proposed a design and an implementation plan to meet our original goals.
Alongside the primary means noted, we have experimented with some collaborative web-based tools for managing the project, and gathering and publishing information. This has provided insights into the possibilities for new web-based styles of collaborative working that can inform our application of web ideas to repository access. (The tools selected to date include Semantic Media Wiki, the Drupal web content management system and WebCalendar.)
Important issues to be addressed include:
- Minimizing the technical impact that a data web system will have on the systems currently used by repository providers or their users (though we do, of course, aim to influence their work patterns for the better)
- Understanding the metadata available from the repositories surveyed
- Identifying technical mechanisms for working with established repositories
- Selecting or designing components that are easily adapted to specific information requirements
- Adopting an approach that is compatible with widely deployed web software systems
- Designing a core metadata schema (ontology) that can be used to plan and answer queries that operate across the repositories considered.
The current project has been scoped to explore discovery of and access to images in repositories operated by the project's repository partners, by surveying software systems, available access/query mechanisms and metadata schemas used. We do not aim to conduct an exhaustive survey of software tools, the goal being to identify some that are suitable, and why, rather than declaring an overall winner. While we may undertake some limited piloting as part of our evaluation work, a demonstrable pilot system has not been a goal for this project.
Critical success factors include:
- Understanding the capabilities of the designated repository software systems.
- Understanding the metadata schemas used and presented by the selected repositories.
- Understanding the specific mechanisms available for querying and accessing content in the selected repositories.
- Identifying actual and potential use of metadata standards.
- Identifying open standards and open source tools that can handle the "heavy lifting" for an image data web implementation.
- Devising a core metadata schema that can support a range of identified queries across multiple repositories.
- Devising a costable plan for implementing an image data web.
- Ensuring consistency of our recommendations with the JISC's strategy for repository development.
Project outputs
The Defining Image Access Project’s deliverable is this Project Report that (a) details the findings and conclusions from our investigations, (b) recommends practices that should be supported by the JISC and adopted to enhance image interoperability between institutional repositories, (c) provides implementation guidelines for the creation of data webs, for use by those running institutional repositories, and (d) identifies existing open source software systems that can provide elements of the desired data web functionality.
At the outset, this final report was expected to contain recommendations and observations concerning:
- Repository software systems.
- Metadata standards.
- Metadata publication for image discovery.
- Web standards.
- What to harvest versus what to access on-demand from source repositories.
- An outline proposal for the instantiation and ongoing support of our design for such an image web, to be made freely available for the benefit of the UK academic community.
We have also created a project web site, a wiki and an e-mail discussion list, to keep all interested parties fully informed of the project’s work and progress as it develops. As part of the project, we plan to hold workshops at which we can bring experts together for seminars and more informal exchanges, and we anticipate that such seminar presentations will, with the authors' permissions, be made available on the wiki.
Background
The data web philosophy
(David to add)
Previous approaches to the integration of heterogeneous data resources
(David to add)
Related activities
[[[Should this be a sub-section of the Background section above?]]]
[[[Need to merge Julie's contribution into this section.]]]
JISC projects
A list of JISC projects we have surveyed can be found here in our Semantic Media Wiki (this links to a dynamic search page, and the results may change as new information is added to the wiki).
SWORD
Using Atom and the Atom Publishing Protocol to support repository ingest. This approach sits very well with our philosophy of using lightweight Web protocols to support loosely coupled components. In particular, SWORD may provide the ideal mechanisms for moving images and metadata from local research group storage to a public-facing repository.
Also, it may be helpful if we feed back requirements coming from our work directly with researchers, though the remaining project duration is probably too short for this to have any significant impact.
CAIRO
This project seems to address some of the same areas as SWORD, but is more focused on repository user interfaces for interacting with complex repository content. No specific contribution from this work to our plans is perceived at this stage, but we should watch for future developments, and maybe feed back requirements coming from our work directly with researchers.
Rich Tags
A recurring theme in our work is how to move from researchers' informal descriptions of their observations to more formally codified records of those observations. Rich Tags is one of a few projects we have encountered that seems to be addressing this particular issue, and we should be eager to incorporate any findings or tooling from this project in the systems we develop to capture researchers' image description data.
Dictate
Having concluded that we wish to enable post-deposit collection of image metadata, and also that we intend to use ePrints as the basis for a research group image publication repository, it is natural that we should explore the role of Dictate to allow additional image annotations from arbitrary third parties to be captured. It is not obvious at this time how well the Dictate/Connotea tagging approach will play with more formal ontologically organized data. Interaction with the Rich Tags project (also at Southampton) might produce some interesting outcomes.
DExT
Data exchange tools - depending upon what this project actually produces, these tools might be useful to us.
eBank, R4L, SPECTRa
- DefiningImageAccess/Project/eBank
- http://www.ukoln.ac.uk/projects/ebank-uk/
- http://r4l.eprints.org/
- http://www.jisc.ac.uk/whatwedo/programmes/programme_digital_repositories/project_spectra.aspx
- http://www.lib.cam.ac.uk/spectra/
eBank was a seminal project in the area of linking publications to research data. We are hoping to draw advice from participants in this project in our own future work. R4L and SPECTRa are continuation and related work.
This early work was grounded in the field of chemistry and chemical crystallography, with widely used and well structured formats for many aspects of the data. Part of our challenge is to see if we can apply the same ideas to the somewhat messier area of life science research and bioinformatics, particularly where recorded images represent a fundamental element of the scientific record.
CLADDIER and StORe
These projects have many goals similar to our longer-term goals, notably linking research papers to raw research data.
At this time, it looks as if tools from the CLADDIER project may not be directly helpful to us, but that their experiences working with atmospheric data may be very relevant. Currently, the project outcomes are not visible, but hopefully they will be so in time to influence future work.
The StORe project has conducted background survey work into the nature of scientific research and publication, and examines some of the implications of including data publication in this process. They are also working on a prototype system based on "Web 2.0"-style tools, modified to include access control to allow managed dissemination of material. The executive summary of the survey phase of the report makes for interesting reading concerning the researchers' attitudes to repositories:
The StORe work appears to validate many of the conclusions we have reached. In a paper to the 2nd International Digital Curation Conference, Glasgow, 21-22 November 2006, Attitudes and aspirations in a diverse world: the Project StORe perspective on scientific repositories (http://jiscstore.jot.com/WikiHome/DisseminationPages/IDCC2%20-%20Paper.doc?cacheTime=1165415205802), Graham Pryor says:
Cultural and organisational barriers prevail in all disciplines, which serve to deter the deposit of research data in repositories, and an inherent culture of self-sufficiency in the generation and organisation of data militates against what might be viewed as prescriptive intervention by knowledge management professionals.
Later in the same paper, exploring working practices and concerns of practicing researchers, he suggests "the results from the StORe survey do imply that a step change is necessary". While the findings resonate with what our researchers tell us, I would question that what is needed is a "step change". Rather, I would say that we need to build on researchers' current practices and take account of their concerns, building trust and showing that use of repository tools is augmenting rather than distracting or threatening their research.
JISC CRIG
Common repository interface working group
We have had some discussions with some members of this group about use of SPARQL as a common mechanism for accessing repository metadata. Other areas of interest we have in common with this group include: ORE, Deposit API/SWORD, bulk ingest.
JISC IEMSR
JISC Information Environment Metadata Schema Registry
- http://www.jisc.ac.uk/whatwedo/programmes/programme_rep_pres/shared_services/project_mregistry.aspx
- http://www.ukoln.ac.uk/projects/iemsr/
Our proposals include a schema registry and alignment service, which will grow through a mechanism whereby new data sources are "subscribed" to a data web. The notion of registry suggests some overlap in purpose, though the expected circumstances of use are very different. One possible area of cooperation that we have identified is the use of IEMSR to hold schema alignment rules for commonly related schema.
See also:
The NSDL metadata registry has also been mentioned in connection with schema registry issues.
Intute Repository Search
There is clearly some overlap of purpose between our data webs and Intute Repository Search, though the approaches are quite different. We can see a number of ways in which our services might usefully interact (e.g. data webs providing information to Intute Repository Search, or data webs providing a programmatic interface to information sources including Intute Repository Search).
Pathways and OAI-ORE
The Pathways project led to the current Object Re-use and Exchange (ORE) project, which is a multinational project led by well known experts in the repositories field and with involvement from JISC. The ability to re-use and exchange repository data and metadata is central to our goals, and it seems clear that we should track a project of this significance.
Further, a recent white paper from the project indicates a direction of travel that includes RDF named graphs to describe compound objects. This approach, it seems, will sit comfortably with our plans to capture RDF domain-specific metadata about the content of an image. This is an issue it would be good to explore with participants in this project.
Digital Curation Centre SCARP project
The Image Store project, a sub-project within SCARP, is exploring animal behaviour researcher attitudes and requirements with regard to re-use of image and video data that form the basis for scientific investigations. We are intimately involved with this project, which we expect to raise new requirements and ideas to guide our ongoing work on image webs for biological research.
Project activities
Project meetings
We organized and hosted four project meetings: a kick-off meeting to solicit ideas and pointers from project participants, meetings to discuss tools and technologies and interactions with other JISC activities, and a final project meeting to present our draft findings and solicit comments on proposed future directions.
The main project page, at Defining_Image_Access#Project_meetings, contains links to notes for the various project-wide meetings.
- Project Kick-off Meeting, 5 January 2007; (Meetings/20070105/DefiningImageAccess-KickOff).
- Second Project Meeting, 9 February 2007: Tools and Technologies for Semantic Interoperability Across Scholarly Repositories; (Meetings/20070209/DefiningImageAccess-ToolsAndTechnologies).
- Third Project Meeting, 9 March 2007: JISC Interactions meeting; (Meetings/20070309/JISC-Interactions).
- Final Project Meeting, 22 June 2007: Images and Repositories, the way forward; (Meetings/20070622/DefiningImageAccess-FinalMeeting).
Survey work
We undertook a survey of related projects, standards, software tools, services and other reports for items relevant to creation of an image data web. The initial list of topics to survey came from our own prior knowledge and suggestions made by participants at the project kick-off meeting. Additional topics have been added through the life of the project as we became aware of them and their significance.
The results of this survey work are all recorded into the wiki, linked from the page at DefiningImageAccess/RelatedWork (also accessible via the "related work" link in the web page sidebar, under "defining image access").
In the project page section Defining_Image_Access#Technical_notes there are also references to some other background information that has not been incorporated into the related work pages:
- DefiningImageAccess/Resources - contains links to other related resources that have not yet been actively surveyed.
- DefiningImageAccess/Articles - contains links to some further related reports and articles.
An overview of the information we sought in the survey of related projects, tools, and standards is provided at the schema page.
Meetings with repository partners
We also visited project partner institutions for one-on-one meetings with providers of their institutional repositories, to learn about their repository systems, image collections, metadata usage and deployment, and other aspects of their operations. Links to notes from these meetings can be found at Defining_Image_Access#Repository_Meetings.
Links and findings from our survey of repository systems, based on these meetings and also our own investigations, are at DefiningImageAccess/RepositorySurvey. The main part of the repository survey work has been based on the Cambridge and Southampton collections, to which we had early access. The Oxford and Imperial College discussions served mainly to confirm the paucity or fragmented nature of provision for image collections.
- DSpace@Cambridge: Meetings/20070122/Dspace@Cambridge - meeting with Patricia Killiard and Tom De Mulder at Cambridge University Library.
- ePrints@Southampton: Meetings/20070416/DefiningImageAccess-ECS-Southampton - meetings with various people at ECS, Southampton University.
- Repositories of the Oxford University Library Service: Meetings/20070503/DefiningImageAccess-SERS-Oxford - meeting with Sally Rumsey, Neil Jefferies and Alexander Huber of OULS/SERS to discuss the Oxford Research Archive and image collections.
- Imperial College: Meetings/20070615/DefiningImageAccess-ImperialCollege - meeting to discuss Imperial College image collection and repository related projects.
Software evaluation
The number of potentially useful software packages turned out to be many times greater than we could reasonably subject to meaningful hands-on evaluation. The choice of packages evaluated was based on what were perceived to be key components of an image web deployment.
Early survey results suggested that we would have little need to interact closely with repository software, but would be able to use OAI-PMH with all repositories, so we made an early decision not to perform hands-on evaluation of the repository software systems themselves. Furthermore, it transpired that the actual deployment of these as institutional repositories was typically quite different from the kind of deployment upon which we might base our evaluation. Later in the project, a different possible role for repository software at the research group level became apparent; at this stage we decided to evaluate just one system, ePrints, as it had already been shown to provide the kinds of facilities we sought. Fedora was generally acknowledged to be an "architectural" component of a larger deployment, and as such unlikely to be easily deployed for a single research group. We are advised that DSpace is currently in transition between version 1, which is a large monolithic system, and version 2 which is intended to be more modular and flexible.
The choice of other software components for evaluation was based on our perception of specific components that might form part of an image web deployment. Central to this is software that can be used to construct a SPARQL endpoint for an OAI-PMH repository. For this purpose, Sesame and Joseki were evaluated, both being systems that combine a deployable server with an underlying programmers' toolkit, and both being widely used and actively supported. In the end, we prefer Joseki for our purposes, because its query engine architecture has already been used to support distributed query, and because it supports relational database storage of RDF graphs. We also evaluated Allegro and Virtuoso, both of which appear to be powerful candidate systems, but did not take them further, mainly because we are not aware of any significant use of these in the open source Semantic Web community. Either of these might be considered a candidate in the future if we run into a need for a high-performance SPARQL endpoint server.
We searched for software that might present an OAI-PMH repository as a SPARQL endpoint. The closest we found was "OAI-PMH RDFizer" from the Simile project. Shortcomings of RDFizer for our purposes are that it does not of itself create a queryable interface for accessing the metadata, and it is not clear to us how well it can deal with non-standard metadata, especially if it is not all presented using OAI-PMH metadata access mechanisms (bearing in mind that we are considering using repository data streams for storing domain-specific metadata). It might prove to be a useful tool, but is not sufficient for all our repository metadata presentation goals.
Connotea was evaluated as a possible annotation service. We are also in the process of evaluating Dictate, the JISC-funded adaptation of Connotea to work as part of an ePrints repository. Using Connotea provided us with one route for linking image data with other external resources (e.g. publications about Mollusca are also linked with the Piglet Squid images in SERPENT). At this stage, we are not clear if Connotea might be adapted to annotation with controlled vocabularies.
We had hoped to evaluate mSpace as a tool for browsing information presented by a SPARQL endpoint, but so far have failed to obtain a copy of the software. There is also a software product, jSpace, which claims to be a clone of mSpace, and which we are considering evaluating. The importance of such systems is that they may provide us with a way of quickly deploying a web interface to browse the content of a SPARQL endpoint or a data web. While this may not generally represent the best possible user interface for exploring information about a given domain, it represents an important tool for quickly delivering new image discovery options based on the availability of published image annotations.
For more details of the software evaluation planned and conducted, see:
- DefiningImageAccess/SoftwareEvaluation - notes for software evaluation
Technical design
The Defining Image Access project plan called for construction of a technical design to implement a data web for image collections in institutional repositories, with the intention of addressing in particular the relationship between image metadata and other relevant information available on the web. Content alignment between heterogeneous data sources, in the general case, is known to be a difficult problem. It was our hypothesis that study of information design in particular subject domains would reveal patterns that could be easily exploited for quick gains in coordinating and cross-referencing information from different sources.
Our hypothesis has not been discounted, but the lack of image metadata uncovered by our survey work leaves this particular goal unrealized. Rather, our design efforts have focused more on requirements and techniques for capturing available metadata and exposing it through repository publication, and in particular to technology deployments that bridge the gap between research practice and requirements for publication and preservation. To this end, we are adapting an instance of ePrints software for use as a research group image publication platform, replacing the web form ingest process with tools that perform bulk ingest of images and metadata from available spreadsheet data. This approach promises two key advantages: (a) building on an existing technology platform allows us to rapidly create a local publication tool for our research clients, and (b) using an existing repository platform may ease the transition to long-term storage in an institutional repository. This has become a key element of our proposed technical strategy: collect and publish locally data from researchers' existing workflows, then look to migration of selected collections to a more formally controlled repository environment.
Concerning the promised advantages, we already have some evidence of (a): in just 2 weeks we have deployed an ePrints system that publishes a substantial proportion of our local research team's collected images and metadata - this work was demonstrated by Jun Zhao at the final project meeting. The detail of presentation leaves much to be desired, but progress to improve it is quite rapid, thanks in large part to working with standard web technologies and simple interfaces. Concerning purported advantage (b), we have less evidence. A range of ongoing work, such as ORE, BID, SOURCE and SWORD, suggests that automated migration and distribution between repository systems is a real future possibility.
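The bulk ingest step described above, reading image metadata from researchers' existing spreadsheets, might be sketched along the following lines. This is a minimal illustration rather than the project's actual tooling: the column names (filename, gene, genotype, date), the sample values and the function name are all hypothetical, and the step that deposits the resulting records into ePrints is not shown.

```python
import csv
import io

# Hypothetical spreadsheet export (CSV). The column names and values are
# invented for illustration; a real sheet would use the researchers' headings.
SAMPLE_SHEET = """filename,gene,genotype,date
img_0001.tif,geneA,wild type,2006-11-02
img_0002.tif,geneA,mutant,2006-11-02
img_0003.tif,,mutant,2006-11-03
"""


def read_image_records(sheet_text):
    """Return (records, problems) for a CSV export of the image spreadsheet.

    records  - one dict of cleaned metadata per row that names an image file
    problems - human-readable notes about rows that need a researcher's attention
    """
    records, problems = [], []
    reader = csv.DictReader(io.StringIO(sheet_text))
    # Row 1 of the sheet is the header, so the first data row is row 2.
    for row_number, row in enumerate(reader, start=2):
        cleaned = {key: (value or "").strip() for key, value in row.items()}
        if not cleaned["filename"]:
            # Without a file name there is nothing to deposit.
            problems.append(f"row {row_number}: no image filename")
            continue
        missing = [key for key, value in cleaned.items() if not value]
        if missing:
            # Keep the record, but report the gaps rather than guessing values.
            problems.append(f"row {row_number}: empty {', '.join(missing)}")
        records.append(cleaned)
    return records, problems


records, problems = read_image_records(SAMPLE_SHEET)
```

Reporting incomplete rows instead of rejecting them reflects the strategy of collecting data from existing workflows first and improving its quality later.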
For pan-repository access based on Semantic Web standards, we have identified that Jena and Joseki provide most of the tooling we need to create a SPARQL front-end for an OAI-PMH repository. For ePrints, we need to provide a module that incrementally exports image metadata to a common format (RDF/XML, Turtle, or similar), and a periodically-executed script to activate the Jena bulk loader to populate the Joseki database. In this way, Joseki serves as a per-repository query endpoint (contrast with systems like OAIster and Intute Repository Search that provide a search interface across many repositories). Such tooling creates a standard access mechanism for repository data, and the possibility of deploying common browse tools (e.g. mSpace) and common search tools across multiple repositories.
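As a rough sketch of the export step, the following converts a single Dublin Core record, of the kind returned by OAI-PMH with the oai_dc metadata prefix, into N-Triples lines that an RDF bulk loader could read. The sample record, the example.org URIs and the function names are assumptions made for illustration; invoking the Jena loader and handling domain-specific metadata are not shown.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

# Invented sample record in the oai_dc format; not taken from any repository.
SAMPLE_RECORD = """<oai_dc:dc
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Example image</dc:title>
  <dc:subject>example subject</dc:subject>
  <dc:identifier>http://repository.example.org/123/</dc:identifier>
</oai_dc:dc>"""


def escape_literal(text):
    """Escape characters that may not appear raw inside an N-Triples literal."""
    return (text.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
                .replace("\r", "\\r"))


def dc_record_to_ntriples(record_xml, subject_uri):
    """Return one N-Triples line per non-empty Dublin Core element in the record."""
    root = ET.fromstring(record_xml)
    lines = []
    for element in root:
        # ElementTree writes qualified names as "{namespace}localname".
        if not element.tag.startswith("{" + DC_NS + "}"):
            continue  # skip anything that is not a Dublin Core element
        value = (element.text or "").strip()
        if not value:
            continue
        local_name = element.tag.split("}", 1)[1]
        lines.append(
            f'<{subject_uri}> <{DC_NS}{local_name}> "{escape_literal(value)}" .'
        )
    return lines


for line in dc_record_to_ntriples(SAMPLE_RECORD, "http://repository.example.org/123/"):
    print(line)
```

A periodically-executed script could write such lines to a file and pass that file to the loader, which would keep the repository itself unchanged.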
Although we have not been able to test our ideas for content alignment, our ideas in this area have matured in several ways: possible use of CIDOC CRM Core as a common structural framework for experimental data; separation of schema-level and instance-level alignment of content; decentralized dataweb extension through subscription to a core ontology. (These ideas derive in part from private discussions with Brian Fuchs of the Imperial College Internet Centre, and Martin Doerr of ICS FORTH.)
See also:
Dissemination activities
[[[David: I think you may be better placed to expand this section - #g.]]]
- Presentations: see http://imageweb.zoo.ox.ac.uk/wiki/index.php/Defining_Image_Access#Presentations
- OeRC symposium "The Digital Image": 16 Mar 2007. See also:
Findings
[[[David, I notice you have created new top-level headings for each of the areas of findings. It seems to me more useful to have a single top-level "Findings" section with subsections for each of the areas. I've made this change for now to facilitate document editing. #g.]]]
Related projects
We briefly surveyed a number of related projects. Those that were relevant to our goals for creating an image web are described above, in the section Related activities. To see a complete list of the projects surveyed, refer to:
Partner repositories
We did not find many image collections in the institutional repositories we surveyed. At the institutional level, the dominant repository focus is on electronic forms of written papers (e-prints and e-theses); specialized support needed for image and other media collections is mostly absent. There are a few image collections in the Cambridge DSpace repository, but the available metadata is largely confined to the commonly found Dublin Core values. Other image collections we learned of at Southampton and Imperial College [[[this from memory - need to check Dolores' slides when available]]] are created and held by groups or departments, not managed by institutional library or IT services.
Where available, metadata quality is variable; even generic metadata (basic Dublin Core, etc.) is not always consistently provided. For the image collections we examined, with the exception of SERPENT (see below), there is very little domain-specific metadata of the kind that would indicate the content of an image, or provide a context for its interpretation (e.g. in gene expression images, information about the expressed gene, type of organism, anatomical region and more are all vital for the image to be usefully interpreted).
From our discussions with institutional repository managers, we judge that current attention is focused on the needs of papers, theses and articles, which seem to be well-served by Dublin Core metadata terms, and little thought has yet been given to the additional metadata requirements for discovery of images and other forms of "opaque" data that may be stored.
Among the image collections we surveyed, the exception that proves the rule is SERPENT, an exemplary specialized repository at Southampton University (http://serpent.eprints.org/, http://imageweb.zoo.ox.ac.uk/wiki/index.php/DefiningImageAccess/Repository/SERPENT). Although this site uses ePrints, one of the institutional repository software systems studied, it is operated as a specialized project repository rather than an institutional repository, and some customization has been applied to handle domain-specific metadata. As an exemplar, SERPENT has had considerable influence on our thinking about repository-based image collections.
See also:
Repository software systems
Our survey focused on just three repository software systems: DSpace, EPrints and Fedora. Of these, we have also installed ePrints and begun experimenting with it for image acquisition and publication.
Support for OAI-PMH is ubiquitous in repository systems (indeed, some commentators suggest that OAI-PMH support is what defines a repository), but is not a panacea for accessing domain metadata:
- it is administratively difficult, when possible, to deploy OAI-PMH to access varieties of domain specific metadata in an institutional repository,
- OAI-PMH cannot perform discovery based on metadata values (the repository community model seems to be to use a separate service like OAIster for such operations).
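The second limitation can be seen in the shape of an OAI-PMH harvesting request. The sketch below builds a ListRecords request URL; the base URL is a placeholder and the function name is our own, but the request arguments are those defined by the protocol. Selection is possible only by datestamp range and set membership, so any filtering on metadata values has to happen after harvesting.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc",
                     from_date=None, until_date=None, set_spec=None):
    """Build an OAI-PMH ListRecords request URL.

    The protocol offers selective harvesting by datestamp ("from", "until")
    and by set ("set"). There is no argument for selecting records by the
    value of a metadata field such as dc:subject.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date
    if until_date:
        params["until"] = until_date
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


# Placeholder base URL, for illustration only.
print(list_records_url("http://repository.example.org/oai", from_date="2007-01-01"))
```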
The current version (V1) of DSpace software is a large monolithic program, which we understand has become difficult to maintain and adapt in its current form. The software is being restructured and/or rewritten for the next version, to be more modular and adaptable: it is rather early to tell how successful this effort may be.
Version 3 of the EPrints software has recently been released, and is claimed to have a very modular, extensible architecture compared with earlier versions. It is widely deployed and easy to set up - our own initial installation of ePrints was installed from scratch, without prior experience, in just a few days. Customization consists in many cases of adding or editing Perl software scripts, or configuration files. Although ePrints is relatively easy to install and customize, it does not provide much in the way of specific support for deposits other than conventional publications (papers, theses, etc.), and does not have in-built support for compound content models (beyond multiple data streams associated with a single deposit record).
Fedora is generally agreed to be the most flexible of the repository systems, at the expense of not providing specific support for any particular mode of operation. Neil Jefferies (of Oxford University Library Service) described Fedora as an architectural product rather than just a software application; that is, a suitable element of a long-term preservation strategy, rather than a complete system to solve all today's problems. At Oxford, the Fedora repository core software is being deployed along with commercial software to handle deposit and access functions. Fedora supports a concept of content models, for which special handling (including metadata presentation) is handled by a content model implementation module that is installed alongside the main repository management software.
Repository image holdings
(Need to add text)
[[[I'm not sure what needs to be added here, that hasn't already been covered above in section Partner repositories; I see little value in enumerating here all the image collections found - the interesting ones for us have already been discussed. #g.]]]
Metadata standards
Apart from OAI-PMH (which has been covered pretty much in the sections on repository systems), our survey of metadata related standards included:
- Structural and object packaging standards, such as MPEG-21 DID and METS,
- Structural metadata relating to images and media files (EBU Core, VRA Core, Z39.87)
- General publications (Dublin Core, Qualified Dublin Core, EPrints profile of Dublin Core),
- Higher-level elements with potential for applicability across domain content (CIDOC CRM Core, INDECS, SKOS, AKT, FRBR)
- Other more specialized metadata (CCLRC Scientific Metadata Model, PREMIS data dictionary for preservation metadata, Semantic Web Research Community, SCORM learning objects).
We fully expect to add more to this list over time, as they come to our attention. It is difficult to evaluate the various standards in isolation from a specific application, so at this stage, we regard awareness of the various metadata standards as the main result of this survey.
One metadata standard which has struck us as particularly relevant for scientific observations is the CIDOC CRM Core, which is a collection of about 20 or so metadata terms capable of capturing complex relationships between agents and entities through mediating "events". This seems to us to be a clean metadata model that is both very flexible and just a little more complex than Dublin Core. Further, we understand that Dublin Core (or, indeed, the various ways in which Dublin Core is used) can be represented using CIDOC CRM terms through the introduction of existential entities (which, in RDF terms, would be represented as blank nodes or uniquely-minted URIs).
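The idea of expanding a flat Dublin Core statement into an event-mediated structure can be sketched as follows. The property and class labels used here (crm:Creation, crm:has_created, crm:carried_out_by) are simplified placeholders modelled loosely on CIDOC CRM, not its exact identifiers, and the ex: names are invented; the point is only that the creation event becomes an existential entity, represented as a blank node.

```python
import itertools

# Simple generator of blank node labels (_:b1, _:b2, ...).
_blank_ids = itertools.count(1)


def new_blank_node():
    """Return a fresh blank node label for an entity that has no URI of its own."""
    return f"_:b{next(_blank_ids)}"


def expand_dc_creator(resource, creator):
    """Re-express (resource, dc:creator, creator) through a mediating creation event.

    The property names are illustrative placeholders, not the official
    CIDOC CRM identifiers.
    """
    event = new_blank_node()
    return [
        (event, "rdf:type", "crm:Creation"),
        (event, "crm:has_created", resource),
        (event, "crm:carried_out_by", creator),
    ]


for triple in expand_dc_creator("ex:image1", "ex:researcher1"):
    print(triple)
```

Once the event exists as a node, further statements (when, where, using which instrument) can be attached to it without changing the description of the image itself.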
We see similar attempts to regularize use of Dublin Core using its current vocabulary through initiatives like the Dublin Core Profile for Scholarly Works, previously known as "ePrints profile". The influence of FRBR is clear to see in this and other proposals, although FRBR itself goes into far too much detail to be viable as a metadata standard in its own right: the information landscape that FRBR attempts to describe changes too fast for the vocabulary to keep up; an example of this is that FRBR proposes terms for describing the groove pitch or rotational speed of gramophone records, but does not have any way to describe the bit rate of an MP3 data file. We feel an important lesson to be taken from this is that common metadata schemes should try to describe the enduring properties of the domain, and leave more evolutionary matters to specialized schemas.
[[[What follows may overlap/duplicate discussions elsewhere]]]
[[[Link/forward reference to section(s) discussing metadata acquisition]]]
We did not survey any domain-specific metadata schemes (e.g. Gene Ontology and MIBBI for bioinformatics), though as far as we can determine, it is exactly this kind of metadata that will be important for discovering and interpreting images. Further, it appears that having support for specific and evolving domain-specific metadata provided by institutional repositories is not a realistic prospect, so we need to design alternative ways to handle domain information within the common repository structures and metadata currently available.
Domain-specific metadata can reasonably be handled as part of the data with modest repository support:
- for ePrints: add a couple of structural metadata fields,
- for Fedora: define an appropriate content model and external tool support based on OAI-PMH for basic access.
There are possibilities to capture a small amount of useful structured domain-specific metadata in a dc:subject field (this being well-supported by ePrints), but not really enough to capture context for interpretation of an image. A dc:description field can be used to convey free-text commentary on the content, which might be useful if used judiciously. dc:type can provide additional clues and cues concerning the nature of the content. dc:coverage can provide topic or spatial information about the content. But, even within collections, these fields are rarely used consistently and do not offer a practical general approach for image discovery based on image subjects. (See the repository survey for comments and links to individual collections.)
Web and Internet standards
Today's World Wide Web is built mainly on a tripod of core underlying standards:
- URIs for identification (RFC 3986)
- HTTP for data access and transfer (RFC 2616)
- HTML, XHTML, CSS, PNG, JPEG as common formats for representing documents, and XML-based languages for representing data, though a variety of other formats are also commonly used (e.g. PDF, Word, RSS, Javascript, etc.).
Of these, the one element that cannot reasonably be replaced is the use of URIs: a web without URIs would be essentially different from and non-interoperable with the Web we use today. Other data transfer protocols and information formats can be and are used without fundamentally changing the nature of the Web. These standards are not, of themselves, sufficient to define a working World Wide Web in all its richness and diversity, but they do form a core that is common to a very high proportion of actual web activity.
The ImageWeb activity is particularly concerned with standards for the description of image content. We choose W3C-recommended RDF as our underlying basis for this [1]; in particular, we focus on the RDF abstract syntax [2], recognizing that there may be different ways of giving this a concrete representation (e.g. RDF/XML [3], TRiX [4], Notation3 [5], Atom [6], GRDDL [7]).
There are two distinct (but not mutually exclusive) approaches to building data-sharing systems, or distributed applications, on the Web:
- those based on a Web Services protocol stack consisting of SOAP, WSDL and sometimes UDDI (the wikipedia article [8] is a useful background), which is essentially a web-based remote procedure call framework;
- those using a REST approach (Representational State Transfer, a term introduced by Roy Fielding in his PhD Thesis [9]), which uses a small number of primitive operations to access and manipulate representations of the state of resources. A REST style can be implemented using a SOAP-based Web Service stack, but is more commonly associated with use of bare HTTP to access and manipulate resources.
The term Service Oriented Architecture (SOA) is sometimes mentioned as an alternative in connection with Web Services and REST. It is difficult to be definitive about SOA, as there are several extant definitions, but a key element seems to be a focus on resource sharing between administrative domains by interacting with defined services, hiding implementation details. As such, SOA can be implemented as a Web Service RPC style, or using a REST approach.
For creating data webs, we are particularly concerned with the ease of independent development and evolution of software components, this being key to accessing resources that already exist on the Web. In this, the REST approach offers a key advantage by investing all of the application-specific semantics in data formats rather than in a protocol specification. The basic operations in a REST framework are common to all resources (commonly: GET, PUT, POST, DELETE; or CREATE, READ, UPDATE, DELETE), and hence are domain independent. All domain-specific semantics, therefore, must be conveyed in the resource state representation data that is exchanged. A data format can be constructed or interpreted by one application at a time, and some simple rules (e.g. ignore any parts that are not recognized) allow providing and consuming applications to evolve independently. Protocol operations involve two (or more) communicating parties, and require a degree of cooperation to handle changes to the protocol operations; this can mean that the communicating components must be updated in lock-step.
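The "ignore any parts that are not recognized" rule can be illustrated with a small consumer that reads a resource representation and keeps only the fields it understands. JSON is used here purely for brevity, and the field names and sample documents are invented; the same rule applies to RDF or XML representations.

```python
import json

# The fields this particular consumer understands (hypothetical names).
KNOWN_FIELDS = {"title", "creator", "subject"}


def read_description(document_text):
    """Parse a representation, silently ignoring any fields that are not recognized."""
    data = json.loads(document_text)
    return {key: value for key, value in data.items() if key in KNOWN_FIELDS}


# An older provider sends two fields; a newer one has added a third that this
# consumer has never heard of. Both are handled without any change to the consumer.
older = '{"title": "Image 1", "creator": "A. Researcher"}'
newer = '{"title": "Image 1", "creator": "A. Researcher", "stain": "example stain"}'

print(read_description(older) == read_description(newer))
```

Because the consumer never fails on unfamiliar content, the provider can extend its data format without coordinating a simultaneous update of every consumer.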
We propose use of SPARQL and possibly ATOM as mechanisms for communicating image metadata. SPARQL is a W3C-recommended domain-neutral query language and protocol for accessing RDF data, which can also be used to access data stored in other formats (e.g. D2R Server [10] for accessing relational databases). ATOM [11] is related to RSS, and can be used as a carrier syntax for (slightly restricted) RDF.
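As an illustration of how a client might ask a SPARQL endpoint for images by metadata value, the sketch below prepares a query and the corresponding HTTP GET request URL, using the "query" parameter defined by the SPARQL protocol. The endpoint URL, the subject value and the function name are hypothetical, and the request is only constructed, not sent.

```python
from urllib.parse import urlencode

# Example query: find images with a given dc:subject and return their titles.
# The subject value is a placeholder.
QUERY = """
PREFIX dc: <http://purl.org/dc/elements/1.1/>
SELECT ?image ?title
WHERE {
  ?image dc:subject "example subject" ;
         dc:title ?title .
}
LIMIT 10
"""


def sparql_get_url(endpoint, query):
    """Return the URL for a SPARQL protocol query sent via HTTP GET."""
    return endpoint + "?" + urlencode({"query": query.strip()})


# Hypothetical endpoint address, for illustration only.
print(sparql_get_url("http://imagerepo.example.org/sparql", QUERY))
```

The same request shape works against any endpoint regardless of the subject domain, since the domain-specific content lives entirely in the query text and the returned data.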
See also:
- DefiningImageAccess/Resource/SparqlEndpoints - describes some experiments done for Gene Expression Image and Gene Ontology SPARQL endpoints. This page describes a sequence of SPARQL queries that combine information from the two endpoints.
Software tools
See:
We surveyed some 35 software tools with a view to identifying components from which an image data web could be constructed. Of those, the following are particularly relevant to our purposes; the comments are with reference to the proposed technical design that is outlined later in this report.
- EPrints DefiningImageAccess/Tool/Eprints - to implement a research group image repository, which can be used to record and present image metadata. See more detailed description below.
- Joseki and Jena DefiningImageAccess/Tool/Joseki - to perform local harvesting and indexing of image metadata in repositories supporting OAI-PMH, and presenting a SPARQL endpoint for querying that metadata. Also, in conjunction with work done at HP Labs on distributed queries (DefiningImageAccess/Tool/DARQ) for constructing a data web aggregator component.
- D2R server DefiningImageAccess/Tool/D2RServer - for presenting relational database information as a SPARQL endpoint.
- Connotea/DICTATE DefiningImageAccess/Tool/Connotea, DefiningImageAccess/Project/Dictate - for providing third party annotation and tagging of images. DICTATE is a version of Connotea that is integrated with EPrints software.
- mSpace/jSpace DefiningImageAccess/Tool/mSpace, DefiningImageAccess/Tool/JSpace - to provide a user interface for browsing metadata and images presented by a SPARQL endpoint.
These tool selections are not definitive or final, but current judgements based on a range of technical and non-technical considerations. The selections are somewhat biased towards solutions that will help us get up-and-running as quickly as possible, not necessarily those with the greatest potential for long-term scalability.
At this time, there are no clear candidates for building the schema registry and coreference service elements of the data web component. The technical specification of these components is not yet clear, and we have not identified any tools that obviously address their particular requirements. A likely implementation strategy would be based on a lightweight web application framework (e.g. Python TurboGears or Ruby on Rails), starting with a simple term mapping mechanism and extending this to support more complex rules as the need arises.
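To make the idea of a "simple term mapping mechanism" more concrete, here is a minimal sketch, not a design for the schema registry itself. It assumes that metadata is held as (subject, predicate, object) triples and that a mapping table translates repository-specific property names to shared ones; all of the `example.org` terms are hypothetical.

```python
# Hypothetical mapping from repository-specific terms to shared terms.
TERM_MAP = {
    "http://example.org/lab/geneName": "http://example.org/core/gene",
    "http://example.org/lab/imgTitle": "http://purl.org/dc/elements/1.1/title",
}


def map_term(term, term_map=TERM_MAP):
    """Return the shared term for a local term, or the term unchanged
    if no mapping has been registered for it."""
    return term_map.get(term, term)


def map_triples(triples, term_map=TERM_MAP):
    """Rewrite the predicate of each (subject, predicate, object) triple."""
    return [(s, map_term(p, term_map), o) for s, p, o in triples]


local = [
    ("http://example.org/img/1", "http://example.org/lab/geneName", "geneA"),
    ("http://example.org/img/1", "http://example.org/lab/unmapped", "x"),
]
mapped = map_triples(local)
print(mapped)
```

Unmapped terms pass through unchanged, so information that has no agreed shared term is not lost. More complex rules could later replace the plain dictionary lookup.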
Omission of other tools covered in our survey from the list above does not mean they have been dismissed from consideration, but that no specific role for them has yet been identified.
EPrints for capturing images and metadata
[[[Jun to provide overview of experimental work for Drosophila gene expression metadata capture, also noting the user input leading to the metadata choices]]]
Our project findings show that current institutional repositories hold only limited image collections. Moreover, with the exception of the Southampton Serpent repository (http://archive.serpentproject.com/), these images are rarely described using domain-specific metadata, which conveys the context users need to interpret them. To gain practical experience of building and publishing image collections using domain-specific metadata, we conducted an experiment in setting up an image repository using the EPrints 3.0 software system.
The images are in situ gene expression images, used by life science researchers to study the factors causing sterility in Drosophila. Currently, thousands of images, accumulated over three years of hard work, are kept in a file system, organized by the date they were created and described using an Excel spreadsheet. It is extremely difficult for researchers to locate images in the file directories by their domain-specific metadata, such as the gene name associated with an image or the mutant shown in it. The researchers need an image repository to assist them in uploading, storing, searching and publishing images with the domain-specific metadata. We adapted the EPrints 3.0 software system both to solve the scientists’ real problems and to gain an understanding of how to build such an image repository enhanced with domain-specific metadata.
EPrints 3.0 was chosen for our experiment for the following reasons: 1) it is one of the well-established software systems in the digital library community for archiving digital items, including theses, reports, publications, etc.; 2) it has built-in support for the OAI-PMH protocol, which allows our image repository to be harvested by other parties; 3) it has a built-in user interface, which makes it fairly quick to set up the repository and present it to users; and finally 4) it has been adapted by the Serpent project to publish images using domain-specific metadata.
Because the EPrints system is targeted at digital archives, it supports describing and searching digital items using Dublin Core (DC) metadata. However, in order to use EPrints to store and publish our images with domain-specific metadata, we needed to perform the following two kinds of customization:
- customization of the underlying database, in order to store domain-specific metadata along with images. This customization is realized in three steps:
- extending the database schema with the extra domain-specific metadata fields;
- during batch upload, uploading each image along with a metadata file, and describing the image using the domain-specific metadata held in that file;
- exposing the domain-specific metadata stored in the repository through the OAI-PMH protocol, along with the DC metadata.
- customization of the user interface, in order to describe and search for images using domain-specific metadata. Many of the DC metadata fields presented by the native EPrints user interface were either moved to a secondary position or removed, to let users concentrate on the domain-specific image metadata.
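The batch-upload step described above, in which each image is paired with domain-specific metadata from an accompanying file, might look roughly like the following sketch. It does not use the EPrints API. The CSV layout, the column names (`filename`, `gene_name`, `mutant`) and the sample values are all assumptions for illustration; the project's actual metadata file format is not described here.

```python
import csv
import io

# Hypothetical metadata file content; column names and values are made up.
SAMPLE = """filename,gene_name,mutant
img_0001.tif,geneA,mutantX
img_0002.tif,geneB,mutantY
"""

# Assumed names of the extra domain-specific fields added to the schema.
DOMAIN_FIELDS = ("gene_name", "mutant")


def read_records(text):
    """Turn each row of the metadata file into one record per image,
    keeping generic DC fields separate from domain-specific fields."""
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        records.append({
            "file": row["filename"],
            "dc": {"title": row["filename"], "type": "Image"},
            "domain": {name: row[name] for name in DOMAIN_FIELDS},
        })
    return records


records = read_records(SAMPLE)
for record in records:
    print(record["file"], record["domain"])
```

Keeping the DC and domain-specific fields in separate groups mirrors the two kinds of metadata that are later exposed through OAI-PMH.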
The installation and initial customization of the EPrints system took around three weeks. This hands-on experience shows that: 1) it is feasible to adapt an existing digital archive software system to publish images, especially images with domain-specific metadata; 2) it is possible to publish the domain-specific metadata not only to end users but also to computers via the OAI-PMH protocol; and finally 3) the built-in user interface support from EPrints makes it fairly quick to set up the repository and present it to users.
This repository has been presented to the researchers to gather feedback. The initial feedback is positive. Further user interface customization is needed, such as highlighting the domain-specific metadata and presenting images in a way that is more convenient for users. Researchers also raised the requirement to integrate this image repository with images from external repositories, such as FlyBase (http://www.ebi.ac.uk/flybase/) from Cambridge and FruitFly (http://www.fruitfly.org/) from U.C. Berkeley. Further study is needed to identify best practice for integrating the heterogeneous resources held in these repositories, as proposed by the software framework in our future work.
Areas requiring further evaluation
- To achieve the goals we have set ourselves for image data webs based on existing repositories, we need to address some issues of metadata capture and/or creation. Two areas have so far come to mind:
- In related discussions with some scientific researchers, we have noticed that useful metadata is often collected in a usable machine-processable form, at a time when researchers may be reluctant to make it public. Later, when papers have been published, the metadata has been lost, or the effort required to gather the metadata for publication exceeds any benefit to the researcher of doing so. This suggests that we need to tackle metadata capture separately from, and well in advance of, its publication.
- For all repository software systems, images can be treated pretty much as any form of data, and it is very tempting to just say that images are a form of data. But the visual nature of images means that user interfaces for dealing with images should take into account the ways in which people can use and work with images - e.g. conventional search interfaces are not good for discovering images, and visually-oriented tools can be very helpful when trying to locate a particular image in a collection. This is an area that needs special consideration when designing or selecting tooling for image collection repositories.
- Pre-defined ontologies are best for re-use of associated data and images, but it is often very difficult to get researchers to agree in advance exactly what information is needed. We should seek out approaches that allow consensus to be codified yet do not prohibit recording of additional information that has not yet found community acceptance.
- Post-hoc annotation of images [[[details...]]]
- Schema mapping
- Identifier co-reference
- Metadata cleaning
- Semantic reasoning
- Users' requirement study
Project self-evaluation
[[[This needs fleshing out; GK or DS?]]]
From the project plan, critical success factors were:
- Understanding the capabilities of the designated repository software systems - see section Repository capabilities
- Understanding the metadata schemas used and presented by the selected repositories - see section Metadata schemas used
- Understanding the specific mechanisms available for querying and accessing content in the selected repositories - see section Repository access
- Identifying actual and potential use of metadata standards - see section Choosing metadata standards
- Identifying open standards and open source tools that can handle the "heavy lifting" for an image data web implementation - see section Choosing software tools
- Devising a core metadata schema that can support a range of identified queries across multiple repositories - see section Core metadata schema
- Devising a costable plan for implementing an image data web - see section Implementation plan
- Ensuring consistency of our recommendations with the JISC's strategy for repository development - see section JISC strategy
We have also collected some post-implementation review notes in wiki page DefiningImageAccess/PostProjectReview.
Repository capabilities
[[[Evaluation notes here]]]
Metadata schemas used
[[[Evaluation notes here]]]
Repository access
[[[Evaluation notes here]]]
Choosing metadata standards
[[[Evaluation notes here]]]
Choosing software tools
[[[Evaluation notes here]]]
Core metadata schema
[[[Evaluation notes here]]]
Implementation plan
[[[Evaluation notes here]]]
JISC strategy
[[[Evaluation notes here]]]
Project conclusions
[[[How do "project conclusions" differ from recommendations? - #g.]]]
Repository image holdings
- For image collections, more attention needs to be given to creation, capture and handling of domain-specific metadata. But the CLiC project has found that researchers will not, in general, enter more than 3 elements of metadata when submitting an image to a collection. Thus, for image data webs based on existing repositories, we need to find more successful ways of capturing such metadata. Some approaches come to mind:
- post hoc annotation of published images, often by researchers other than those who created the images. For existing image collections, this may be the only option,
- try to capture information earlier in the research cycle in ways that support and enhance existing research practices rather than work against them - this may mean capturing metadata well before the point at which the image (and metadata) may be published,
- make the repository become the researchers' trusted friend, to keep their data safe and fully address concerns for confidentiality of data relating to work-in-progress,
- develop techniques for storing and retrieving arbitrary metadata, in addition to the Dublin Core and other common elements that repositories currently handle.
Project recommendations
[[[David to add stakeholder analysis]]]
[[[The following section headings were added by David, but I'm not sure what should go in them]]]
To the JISC
To managers of institutional repositories
To image creators and depositors
Concerning Web standards
- RDF(S) and OWL
- OAI-PMH
- SPARQL
Concerning metadata standards
- CIDOC-CRM Core
Concerning metadata exposure
- OAI
- SPARQL Endpoint
- D2R and D2RQ
- Joseki
- Sesame
- Virtuoso
- AllegroGraph
- OAI-ORE
Future work: A design outline for an image data web
[[[Hmmm... the design work, as far as it goes, is part of the current project, indicating a direction for future work - #g.]]]
Scope
(David to expand: not image-in-institutional-repository to image-in-institutional-repository, but paper-in-institutional-repository to image-in-other-resource)
Software framework
We propose a software framework that:
- addresses local capture of idiosyncratic private metadata generated by a research project, and its subsequent migration to an institutional repository,
- exposes all metadata from diverse sources in a common format (RDF, accessed using SPARQL) that can be combined and queried as a distributed data source,
- uses the common format as a basis for a range of supporting tools for browsing, searching, further annotation, personalized access and more topic-specific services.
- provides for post hoc annotation of existing image collections as a route to metadata creation,
- provides a decentralized, subscription-based mechanism for aligning diverse schemas (vocabularies, ontologies) so that independently published resources describing common subjects can be cross-linked and composed.
- See this PowerPoint for an outline of the technical design.
[Figure: Diagram of image publication framework]
Implementation plan
For implementation, we envisage that selected components may be independently implemented for end-user driven systems, avoiding the risks inherent in creating large and complex software systems that stand or fall on the successful integration of all components, whether needed or not.
This framework is envisaged as a collection of loosely coupled components that can for the most part be developed, deployed, evolved and valued independently of each other. Within this framework the common element for interoperability is the use of SPARQL to access data presented using the RDF graph data model. (We also envisage that some sources may be presented using a syndication format such as RSS and/or ATOM, which can also be used as a carrier syntax for RDF data, and would enable linkage to mashup services such as Yahoo Pipes - http://pipes.yahoo.com/pipes/.)
We do not propose that this framework be implemented as a single monolithic entity. Rather, we imagine that specific user-centred projects can be identified that use one or more of these components in combination, and thus drive a component-wise development of the framework based on specific use-cases. This allows the component implementations to be responsive to actual needs and, in the style of agile development, avoids the complication of meeting specifications that are never required. A further benefit of this component-driven approach is testability: because the components can be deployed separately, they can be tested separately. A major problem with complex monolithic web-based systems is fragility, particularly when they are to be deployed in diverse environments or as parts of subsuming software systems - loose coupling with independent and evolving test regimes can go some way to mitigating this fragility.
The component-based approach also makes it easier to incorporate existing tools without modification, either directly or with some lightweight shimming of their interface. An example of this lightweight approach is our plan to create a SPARQL endpoint for an OAI-PMH repository. The Jena system provides two complete applications, Joseki and the Jena model loader, which can run against a common database. We would develop a simple, free-standing local harvester that uses OAI-PMH to collect common and domain-specific metadata and writes it to a simple file format (N-Triples), together with a script that periodically runs the harvester followed by the Jena model loader. The Joseki component then provides a SPARQL endpoint for the harvested repository content. Thus, with a small software development effort, we provide a mechanism to search the metadata content of a repository, a feature that is not provided by OAI-PMH.
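The core of such a harvester could be sketched as follows: parse an OAI-PMH `ListRecords` response carrying `oai_dc` metadata and write each Dublin Core element as one N-Triples line. This is a minimal sketch under stated assumptions, not the project's harvester. The sample response is made up, the OAI identifier is used directly as the subject IRI (an assumption), and fetching over HTTP, resumption tokens and domain-specific metadata formats are all omitted.

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Made-up sample of an OAI-PMH ListRecords response, for illustration only.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example "in situ" image</dc:title>
          <dc:type>Image</dc:type>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""


def escape_literal(value):
    """Escape characters that may not appear raw in an N-Triples literal."""
    return (value.replace("\\", "\\\\").replace('"', '\\"')
                 .replace("\n", "\\n").replace("\r", "\\r"))


def records_to_ntriples(xml_text):
    """Return one N-Triples line per Dublin Core element in the response."""
    root = ET.fromstring(xml_text)
    lines = []
    for record in root.iter(OAI + "record"):
        identifier = record.findtext(OAI + "header/" + OAI + "identifier")
        metadata = record.find(OAI + "metadata")
        if identifier is None or metadata is None:
            continue  # skip records without metadata (e.g. deleted records)
        for element in metadata.iter():
            text = (element.text or "").strip()
            if element.tag.startswith("{" + DC_NS + "}") and text:
                # "{namespace}local" -> "namespacelocal" gives the property IRI
                predicate = element.tag[1:].replace("}", "")
                lines.append('<%s> <%s> "%s" .'
                             % (identifier, predicate, escape_literal(text)))
    return lines


lines = records_to_ntriples(SAMPLE_RESPONSE)
print("\n".join(lines))
```

The resulting lines could be written to a file and passed to the model loader by the periodic script described above.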
Abbreviations and acronyms
to be populated from preceding text
See also: DefiningImageAccess/Glossary. The resulting glossary should be copied back there.
References
(To be added for formal publications, even where these are linked to directly in the text above, to provide a browsable list)
I also suggest copying the final reference list back to a new page, DefiningImageAccess/References, where we can continue to add to it after the final report has been delivered. This would be a good opportunity to use Semantic MediaWiki features, I think.


