DefiningImageAccess/Components
Data web components
See also: DefiningImageAccess/SoftwareDesign
Identifiers in a data web
A data web is conceived as a part of the World Wide Web, so ultimately all identifiers are mapped into URIs, which are the fundamental means of identifying any resource on the web [1]. It is considered good web practice for a URI to be usable on the web to directly access information about the object it denotes, but this is not a requirement and valid identifiers (even those beginning with "http:") are not _required_ to be usable as web addresses.
It is generally a simple matter to take an arbitrary textual identifier label, such as an accession number, and create a unique URI. Typically, this will involve using a base URI derived from or chosen by the repository (e.g. http://www.beazley.ox.ac.uk/Gems/Scarabs/Script/Danicourt%20collection.htm/) and combining this in a defined fashion with the accession number (e.g. 94). Thus, http://www.beazley.ox.ac.uk/Gems/Scarabs/Script/Danicourt%20collection.htm/D.094# might be used to identify a Roman gem inscribed with fighting cocks, whose image can be seen at http://www.beazley.ox.ac.uk/Gems/Scarabs/Images/Danicourt%20collection%204/D.094m.jpg.
TODO: substitute an example based on repository data?
The example cited above exposes a related issue of avoiding confusion between the identifier for a given entity and identifiers for related artifacts. In this example, the image of the gem has a different URI from the gem itself, and the URI of the image can be used directly in a web browser to view that image. By comparison, the URI for the gem itself cannot be similarly used directly to retrieve a web representation.
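The construction of a URI from a repository base URI and a local identifier label, as described above, can be sketched as follows. This is a minimal illustration only: the function name mint_uri and the example.org base URI are assumptions made for the example, and a real repository would choose its own combining rule.

```python
from urllib.parse import quote


def mint_uri(base_uri: str, local_id: str) -> str:
    """Combine a repository base URI with a local identifier label.

    Hypothetical helper: the combining rule (base + "/" + encoded label)
    is an assumption, not a rule defined by any repository.
    """
    # Ensure exactly one "/" separates the base from the label.
    base = base_uri.rstrip("/") + "/"
    # Percent-encode every character that is not safe in a path segment,
    # so that labels containing spaces or slashes still give a valid URI.
    return base + quote(local_id.strip(), safe="")


if __name__ == "__main__":
    print(mint_uri("http://example.org/gems", "D.094"))
```

The resulting URI identifies the entity itself; a URI for an image or other related artifact would be minted separately, so the two are not confused.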
Target system components
We seek tools for implementing the following system components:
OAI-PMH to SPARQL adapter
This component accesses metadata in an OAI-PMH repository, and makes it accessible as a SPARQL endpoint for consumption by the data web aggregator. This is a protocol adapter and syntactic transformation: no attempt is made to perform semantic mapping or alignment of terms used.
A key function of this adapter is to map all entity identifiers in the OAI-PMH metadata into URIs.
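As a rough sketch of this identifier mapping, the following takes a single OAI-PMH record carrying unqualified Dublin Core metadata and produces subject/property/value triples whose subject is a URI. The function name, the sample record and the example.org base URI are assumptions made for illustration; the transformation is purely syntactic and leaves the Dublin Core property names unchanged.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}
DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical sample record, invented for this sketch.
SAMPLE_RECORD = """<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <header><identifier>oai:example.org:94</identifier></header>
  <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>Gem inscribed with fighting cocks</dc:title>
      <dc:creator>Unknown</dc:creator>
    </oai_dc:dc>
  </metadata>
</record>"""


def record_to_triples(record_xml: str, base_uri: str):
    """Return (subject URI, property URI, literal) triples for one record."""
    record = ET.fromstring(record_xml)
    local_id = record.findtext("oai:header/oai:identifier", namespaces=OAI_NS)
    if local_id is None:
        raise ValueError("record has no OAI-PMH identifier")
    # Map the repository's own identifier into a URI (assumed rule).
    subject = base_uri + quote(local_id, safe="")
    triples = []
    for elem in record.iter():
        # Keep only Dublin Core elements; no semantic mapping is attempted.
        if elem.tag.startswith("{" + DC_NS + "}") and elem.text:
            prop = DC_NS + elem.tag.split("}", 1)[1]
            triples.append((subject, prop, elem.text.strip()))
    return triples


if __name__ == "__main__":
    for triple in record_to_triples(SAMPLE_RECORD, "http://example.org/id/"):
        print(triple)
```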
OAI-PMH to ATOM adapter
The purpose of this component is to make it easy for the aggregator system to learn about changes to a repository. OAI-PMH supports a form of request that returns recently updated information. The requirement for this component is tentative, and offers a constrained form of the facilities provided by the OAI-PMH to SPARQL adapter described above.
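The OAI-PMH request for recently updated information is a selective harvest: a ListRecords request carrying a "from" datestamp. A minimal sketch of building such a request URL is shown below; the function name and the example.org endpoint are assumptions, and the sketch does not fetch or parse the response.

```python
from urllib.parse import urlencode


def recent_changes_url(base_url: str, since: str,
                       metadata_prefix: str = "oai_dc") -> str:
    """Build an OAI-PMH ListRecords URL for records changed since a date.

    `since` is a UTC datestamp such as "2007-01-01"; the repository
    endpoint passed as `base_url` is hypothetical in this sketch.
    """
    params = {
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        "from": since,
    }
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    print(recent_changes_url("http://example.org/oai", "2007-01-01"))
```

An adapter could poll this URL periodically and republish the returned records as feed entries.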
ATOM is a syndication data format, easily mapped to RDF, that is being developed as an IETF standard to succeed the earlier RSS formats.
(The choice of ATOM format is not definite, but it is attractive to use a format that can be accessed using a simple HTTP request, and which can be processed directly by most browsers. Alternatives might be SPARQL or RSS 1.0, the latter being itself a form of RDF.)
Data web core metadata aggregator
A basic functional requirement of a data web is to combine information from multiple data providers for consumption by data users. The aggregation service does this, using information from the schema mapping and coreference services, which in turn receive information from data repository registrations. Apart from a cache of core metadata elements, the service is essentially stateless, and may be replicated as required.
The aggregator collects and caches core metadata from a number of repositories and/or collections, and provides a basic service for programs to access or query the aggregated sources. The main service provided will be a SPARQL query handler that operates across the various source repositories, accepting queries based on core metadata elements.
To perform data aggregation, information from the various data providers is mapped to the core schema. The aggregator thus performs automated instance-level alignment on the basis of hand-crafted schema-level alignment and identifier mapping information lodged with the schema mapping and coreference services.
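The following sketch illustrates this kind of instance-level alignment under simplifying assumptions: each provider's schema mapping is a plain dictionary from local terms to core terms, and coreference information is a dictionary from provider identifiers to a preferred identifier. All names and data here are hypothetical, and a real aggregator would obtain both mappings from the schema mapping and coreference services.

```python
def to_core(record: dict, mapping: dict) -> dict:
    """Rewrite one provider record so that it uses core schema terms."""
    core = {}
    for term, values in record.items():
        core_term = mapping.get(term)
        if core_term is None:
            continue  # no registered mapping: the term is not core metadata
        core.setdefault(core_term, []).extend(values)
    return core


def aggregate(records_by_provider: dict, mappings: dict, coref: dict) -> dict:
    """Merge records from several providers into one core-schema view."""
    merged = {}
    for provider, records in records_by_provider.items():
        for entity_id, record in records.items():
            # Replace the provider's identifier by the preferred one, if known.
            canonical = coref.get(entity_id, entity_id)
            target = merged.setdefault(canonical, {})
            for term, values in to_core(record, mappings[provider]).items():
                bucket = target.setdefault(term, [])
                for value in values:
                    if value not in bucket:  # drop exact duplicates
                        bucket.append(value)
    return merged


if __name__ == "__main__":
    records = {
        "A": {"a:1": {"dc:creator": ["Smith"]}},
        "B": {"b:9": {"ex:maker": ["Smith", "J. Smith"]}},
    }
    mappings = {
        "A": {"dc:creator": "core:creator"},
        "B": {"ex:maker": "core:creator"},
    }
    print(aggregate(records, mappings, {"b:9": "a:1"}))
```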
A secondary function of the aggregator service could be to present information from a single data source mapped to the core schema. In principle, it is possible that the aggregator service may also serve information using non-core schema, but it remains a question for ongoing research to determine whether this is a useful design option.
It is conceivable that information composition could be performed by client software systems, and the data web aggregator service development may yield libraries that can be used for this. But we envisage that it will be easier and more manageable to develop this logic as a separate service.
Schema mapping service
This service accepts registration of schema mapping descriptions, and provides a number of basic services to allow applications to perform functions across multiple data sources.
A data web is intended to work with data providers' existing information, which will, in general, be disparate. To make sense of such information we need the ability to align the metadata schemas used by different providers. Further, some data consuming services may be developed that work with different metadata schemas than those offered by the data providers. Thus, we need to be able to record and find correspondences between different metadata schemas.
For example, the Dublin Core term "creator" corresponds approximately to the CIDOC CRM term "P94B (was created by)".
The schema mapping service records such relationships, and responds to queries that allow applications to map expressions using one metadata schema to equivalent or related expressions using a different metadata schema. A data web is typically seeded with a set of core metadata terms, which are lodged with the schema mapping service; these are taken to be central concerns for the data web.
Details of metadata schemas that are used locally by the various data providers are registered, with information about how they relate to the core metadata terms. This registration could be submitted by the providers themselves, a data web coordinating entity, or by third parties. They may also register details of other metadata elements used with information about their relation to other terms, or with just textual descriptions for human consumption.
We assume that all metadata schemas are described using Semantic Web standards (RDFS and/or OWL), and hence that all metadata elements are identified by URIs (Uniform Resource Identifiers, the standard form for identifying resources on the web [1]).
The exact mechanisms to be used for schema mapping are a matter for continuing research, but we anticipate some combination of the following techniques may be included:
- registration of equivalent metadata terms
- registration of equivalent assertions using different metadata schema (two-way inference)
- registration of inferrable assertions using different metadata schema (one-way inference)
- ontological descriptions from which class- and property-subsumption relations can be inferred
- a register of related metadata terms; e.g. an assertion using the Dublin Core term "dc:Creator" might be part of a premise from which a new statement using the CIDOC CRM term "cidoc:E65.Creation" can be inferred. Thus, "dc:Creator" would be related to "cidoc:E65.Creation", even though one is a relation and the other is a class. The exact logical nature of the relation need not be defined. The intent here is that a service that depends on assertions using one metadata term can be guided in finding premises from which such an assertion can be inferred.
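To make the distinction between two-way and one-way registrations concrete, here is a minimal in-memory sketch. It covers only term-level equivalence and one-way inference, not full assertion patterns or ontological reasoning; the class name, method names and the "ex:" terms are all hypothetical.

```python
from collections import defaultdict


class SchemaMappingRegistry:
    """Toy register of term-level mappings (an assumed design)."""

    def __init__(self):
        # _implies[a] holds every term directly inferable from term a.
        self._implies = defaultdict(set)

    def register_equivalent(self, a: str, b: str) -> None:
        """Two-way inference: a implies b and b implies a."""
        self._implies[a].add(b)
        self._implies[b].add(a)

    def register_inferable(self, premise: str, conclusion: str) -> None:
        """One-way inference: premise implies conclusion only."""
        self._implies[premise].add(conclusion)

    def map_term(self, term: str) -> set:
        """Return all terms inferable from `term`, followed transitively."""
        found, stack = set(), [term]
        while stack:
            current = stack.pop()
            for nxt in self._implies.get(current, ()):
                if nxt != term and nxt not in found:
                    found.add(nxt)
                    stack.append(nxt)
        return found


if __name__ == "__main__":
    registry = SchemaMappingRegistry()
    registry.register_equivalent("dc:creator", "ex:author")
    registry.register_inferable("ex:author", "ex:contributor")
    print(sorted(registry.map_term("dc:creator")))
```

Because the second registration is one-way, a query starting from "ex:contributor" yields nothing, while a query starting from "dc:creator" reaches both other terms.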
We do not propose automatic determination of schema mapping for this development. That is a long-standing and difficult problem that has already attracted considerable research. Although some promising results have been reported, fully generalized automatic schema mapping for use as a basis of sound logical reasoning is probably unachievable. Also, whether two assertions are treated as equivalent may depend on the context in which they are to be used, and may require an element of expert judgement. As such, we feel that automated ontology alignment techniques may have a role in helping experts to construct mappings, but do not of themselves form a part of the system we propose.
Thus, the central schema mapping service will accept and return information about schema mappings. Using this information may involve varying degrees of computational inference which can be handled more scalably by separate, distributed services (e.g. data users, aggregation and discovery services).
Coreference service
A data web is intended to work with data providers' existing information. This means that we must bring together different data providers' information about the same entity into a combined description.
Coreference determination may occur at several levels:
- the simplest case occurs when different data providers use a common URI to refer to an entity. Significant operational and efficiency advantages can be realized when this degree of correspondence is arranged, but we cannot depend on this happy circumstance applying for all data providers.
- where data providers construct different URIs using common entity identifiers, coreference may be determined by examination of the URIs concerned.
- where data providers use the same metadata schema with entity identifiers as attribute values, coreference determination is a fairly straightforward task of identity reasoning.
- where data providers use different metadata schemas but use the same entity identifiers within those schemas, then coreferences may be discovered by a combination of schema mapping (sometimes called "ontology alignment") and identity reasoning.
- when data providers use different entity identifiers, then a (thesaurus-like?) enumerated mapping of individual identifiers may be required, possibly in conjunction with schema mapping.
The applicable form of coreference determination in any particular circumstance will depend on the data providers involved. The coreference service will allow a scheme of recognized identifiers to be registered for any data provider, along with a description of mapping to any other scheme of identifiers. Preferably, a data web will identify a preferred form of identifier, and all registrations will provide a mapping to that form. The mapping may use any of the mechanisms outlined above. Based on this information, the coreference service will respond to queries containing a registered form of identifier with a list of all other known identifiers that denote the same entity.
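The query behaviour described above, returning every other known identifier that denotes the same entity, can be sketched with a union-find structure over registered identifier pairs. This is one possible approach under the assumption that mappings are supplied as explicit pairs; the class name, method names and example URIs are hypothetical.

```python
class CoreferenceService:
    """Toy coreference register based on union-find (an assumed design)."""

    def __init__(self):
        self._parent = {}

    def _find(self, identifier: str) -> str:
        """Return the representative identifier of an equivalence class."""
        self._parent.setdefault(identifier, identifier)
        while self._parent[identifier] != identifier:
            # Path halving keeps later lookups short.
            self._parent[identifier] = self._parent[self._parent[identifier]]
            identifier = self._parent[identifier]
        return identifier

    def register_same(self, a: str, b: str) -> None:
        """Record that identifiers a and b denote the same entity."""
        root_a, root_b = self._find(a), self._find(b)
        if root_a != root_b:
            self._parent[root_b] = root_a

    def equivalents(self, identifier: str) -> list:
        """List all other known identifiers for the same entity."""
        if identifier not in self._parent:
            return []
        root = self._find(identifier)
        return sorted(
            other for other in list(self._parent)
            if other != identifier and self._find(other) == root
        )


if __name__ == "__main__":
    service = CoreferenceService()
    service.register_same("http://a.example/gem/94", "http://b.example/D.094")
    service.register_same("http://b.example/D.094", "http://c.example/obj?id=94")
    print(service.equivalents("http://a.example/gem/94"))
```

Equivalence registered in this way is transitive, so two identifiers never paired directly are still reported together when they share a common partner.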
Depending on the results of user trials of the system, an advanced form of this service may also return identifiers that are closely related (but not equivalent) to a given term. E.g. a query for a named gene might return a list of names for homologues of that gene.
Discovery service
The purpose of the discovery service is to locate information about topics of interest across all of the data providers for a data web.
In some senses, this service is similar to the data aggregation service, in that it creates the appearance of an aggregated information source, but in other ways it is complementary as it maps information provided by a user to locate matching information from a data provider. A discovery service may attempt to apply some notion of relevance when selecting or ordering the results of a search.
Initial developments will focus on simple keyword searches (like Google) or simple attribute+value searches expressed with reference to the core metadata schema. User testing of these initial facilities (especially in use by academic researchers), and development of other services that use the discovery service, may inform requirements for more refined forms of discovery.
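The two initial search styles can be sketched over an in-memory collection of core-schema records. The record layout (a dictionary from entity URI to lists of values keyed by core term), the function names and the sample data are assumptions made for illustration; no relevance ordering is attempted.

```python
def keyword_search(records: dict, keyword: str) -> list:
    """Return entity identifiers whose values contain the keyword."""
    needle = keyword.lower()
    return [
        entity for entity, record in records.items()
        if any(needle in str(value).lower()
               for values in record.values() for value in values)
    ]


def attribute_search(records: dict, attribute: str, value: str) -> list:
    """Return entity identifiers with an exact attribute+value match."""
    return [
        entity for entity, record in records.items()
        if value in record.get(attribute, [])
    ]


if __name__ == "__main__":
    # Hypothetical core-schema records.
    records = {
        "http://example.org/id/1": {"core:title": ["Gem with fighting cocks"]},
        "http://example.org/id/2": {"core:title": ["Scarab"],
                                    "core:creator": ["Unknown"]},
    }
    print(keyword_search(records, "gem"))
    print(attribute_search(records, "core:creator", "Unknown"))
```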
The results of a discovery service request may be returned as an aggregated results feed, or as a list of requests that may be submitted to the data web aggregator service to access aggregated information, or requests that may be submitted to individual data sources to access raw information. Further research and user feedback will be needed to understand which of these are most useful in practice.
The discovery service may also provide a facility to navigate the data web's core metadata schema. This may, for example, be used by client services that construct forms for search interfaces. Similarly, access to the coreference service may be provided to answer questions like "find the set of alternative identifiers for X".
Annotation service
Our examination of repository metadata indicates that, at the current state of deployment, available metadata may be insufficient to achieve any useful level of cross-referencing between different repositories. To overcome this, we propose to incorporate an annotation service in our development that will allow third-party annotations to be recorded and used as an input to a data web. In this way, we hope to enable and encourage the creation of new metadata that can be used to cross-reference repository data in ways for which the primary metadata is insufficient.
An annotation tool will need to provide:
- a data store for annotations
- a programmatic interface for automated submission of annotation data
- a browser interface for manual submission of annotation data
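A minimal sketch of the first two items, a data store and a programmatic submission interface, is given below. It keeps annotations in memory only and omits the browser interface; the class names, field names and example values are hypothetical and do not describe any existing annotation tool.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Annotation:
    target: str    # URI of the entity being annotated
    prop: str      # metadata term used for the annotation
    value: str     # annotation value (a literal or another URI)
    author: str    # who submitted the annotation
    created: str   # submission time, ISO 8601 in UTC


class AnnotationStore:
    """In-memory annotation store (an assumed design for illustration)."""

    def __init__(self):
        self._items = []

    def submit(self, target: str, prop: str, value: str,
               author: str) -> Annotation:
        """Programmatic interface: record one third-party annotation."""
        created = datetime.now(timezone.utc).isoformat()
        annotation = Annotation(target, prop, value, author, created)
        self._items.append(annotation)
        return annotation

    def for_target(self, target: str) -> list:
        """Return all annotations recorded about one entity."""
        return [item for item in self._items if item.target == target]


if __name__ == "__main__":
    store = AnnotationStore()
    store.submit("http://example.org/id/1", "core:relatedTo",
                 "http://example.org/id/2", "reviewer-1")
    print(store.for_target("http://example.org/id/1"))
```

Recording the author and submission time with each annotation lets a data web distinguish third-party statements from a repository's primary metadata.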
Data web browser
Although we anticipate that a data web will ultimately be used mainly as part of a larger computer-mediated workflow, there will always be a need for manual examination of the composed information. In particular, when constructing a new data web, or adding data sources to a data web, tools will be needed to allow developers and researchers to explore the information delivered.
The data web browser fulfils this role, providing options to explore the information space created and to perform searches for information about a designated subject or entity. We anticipate that the browser will be a service accessed using a normal web browser, and that it will present an interface capable of faceted browsing (i.e. allowing a user to browse from one entity to others that are related in a number of defined ways allowed by the data web's core schema).

