This document describes the use of existing mechanisms for accessing and querying provenance data about resources on the web.

Introduction

@@TODO

Accessing provenance data

A general expectation is that web applications may access provenance information in the same way as any web resource, by dereferencing its URI. Typically, this will be by performing an HTTP GET operation. Thus, any provenance information may be associated with a URI, and may be accessed by dereferencing that URI using normal web mechanisms.
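
As a minimal sketch of this access pattern, the following Python fragment dereferences a provenance URI with an ordinary HTTP GET. The function names and the timeout value are illustrative assumptions, not part of this specification.

```python
import urllib.request


def build_provenance_request(provenance_uri):
    """Build (but do not send) an HTTP GET request for a provenance URI."""
    return urllib.request.Request(provenance_uri, method="GET")


def fetch_provenance(provenance_uri, timeout=10.0):
    """Dereference the provenance URI and return the response body as bytes.

    The timeout is an arbitrary value chosen for this sketch.
    """
    request = build_provenance_request(provenance_uri)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

Nothing here is specific to provenance: the same code would retrieve any web resource, which is the point of the recommendation.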

The problem of accessing some required provenance information then reduces to the problem of finding its URI, which is dealt with separately in the section "Locating provenance data".

This specification thus RECOMMENDS that if a publisher wishes to make provenance information available, it is published as a normal web resource, and provision is made for the URI of the provenance to be discoverable.

This presumption of using web retrieval to access provenance does not preclude use of other mechanisms. In particular, alternative mechanisms may be needed if there is no URI associated with some particular provenance data. One such mechanism is suggested in the section "Querying provenance data".

Locating provenance data

On the presumption that provenance data is a resource that can be accessed using normal web retrieval, one needs to know a URI to dereference. The provenance URI may be known in advance, in which case there is nothing more to specify. If the provenance URI is not known, then a mechanism to discover a provenance URI must be based on some information that is available to the would-be accessor. We also wish to allow that provenance information could be provided by parties other than the provider of the original resource. Indeed, provenance data for a resource may be provided by several different parties, each with different concerns.

We start by considering mechanisms for the resource provider to also indicate a provenance URI. Because the resource provider controls the response when the resource is accessed, this allows for direct indication of a provenance URI. Three mechanisms are described here: one for a resource accessed by HTTP, one for a resource presented as HTML, and one for a resource presented as RDF.

These particular cases are selected as corresponding to the primary current web protocol and data formats.

Resource accessed by HTTP

For a document accessible using HTTP, [[POWDER-DR]] describes a mechanism for associating metadata with a resource by adding an HTTP Link header field to the HTTP response to a GET or HEAD operation (other HTTP operations are not excluded, but are not considered here). Since the POWDER specification was published, the HTTP linking draft has been approved by the IETF as [[LINK-REL]] (http://tools.ietf.org/html/rfc5988).

The same basic mechanism can be used for referencing provenance data, for which a new link relation type, "provenance", is to be registered (see IANA considerations):

                Link: <provenance-URI>; rel="provenance"
              
When used in conjunction with an HTTP success response code (2xx), this HTTP header indicates that provenance-URI is the URI of a provenance resource for the resource about which information is returned. At this time, the meaning of provenance links returned with other HTTP response codes is not defined: future revisions of this specification may define interpretations for these.

An HTTP response MAY include multiple provenance link headers, indicating a number of different resources that are known to the responding server, each providing provenance about the accessed resource.
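
A client might collect the provenance URIs from a response along the following lines. This is a simplified sketch rather than a complete Link header parser: it assumes the [[LINK-REL]] syntax in which the target URI is enclosed in angle brackets, and it does not handle parameters without values or a "rel=" that appears inside another quoted parameter. The function name is hypothetical.

```python
import re

# One link-value: <target> followed by zero or more ;name=value parameters.
_LINK_VALUE = re.compile(
    r'<([^>]*)>((?:\s*;\s*[^;,="\s]+\s*=\s*(?:"[^"]*"|[^;,"\s]+))*)'
)
# The rel parameter, quoted or unquoted.
_REL_PARAM = re.compile(r';\s*rel\s*=\s*(?:"([^"]*)"|([^;,"\s]+))', re.IGNORECASE)


def provenance_uris(link_header_values):
    """Return target URIs of all links whose rel includes "provenance".

    `link_header_values` is a list with one string per Link header field.
    """
    uris = []
    for header_value in link_header_values:
        for target, params in _LINK_VALUE.findall(header_value):
            match = _REL_PARAM.search(params)
            if match is None:
                continue
            # rel may hold several space-separated relation types.
            rel_types = (match.group(1) or match.group(2) or "").lower().split()
            if "provenance" in rel_types:
                uris.append(target)
    return uris
```

Because several provenance links may be present, the function returns a list; the caller decides which of the indicated resources to retrieve.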

Open issues

Are the provenance resources indicated in this way to be considered authoritative? I.e. if the client trusts information returned by the server (e.g. is prepared to act on inferences based on the returned data), should it also trust the provenance data, or should trust in the linked provenance data be determined separately? If the linked data is to be trusted, then the data from multiple linked provenance resources MUST be consistent if it is to be meaningful. I favour an approach whereby trust in the provenance resources is established independently, which is similar to the situation for any other resource; e.g. based on the domain that serves it, or an associated digital signature.

Resource presented as HTML

For a document presented as HTML or XHTML, without regard for how it has been obtained, [[POWDER-DR]] describes a mechanism for associating metadata with a resource by adding a <link> element to the HTML <head> section.

The same basic mechanism can be used for referencing provenance data, for which a new link relation type is registered according to the template in @@CHECK USE OF LINK RELATION REGISTRY IS OK HERE@@:

                <html xmlns="http://www.w3.org/1999/xhtml">
                   <head>
                      <meta name="wdr.issuedby" content="http://authority.example.org/company.rdf#me"/>
                      <link rel="provenance" href="provenance-URI"/>
                      <title>Welcome to example.com</title>
                   </head>
                   <body>
                      ...
                   </body>
                </html>
              
This element indicates that provenance-URI is the URI of a provenance resource for the containing document.

An HTML document header MAY include multiple provenance link elements, indicating a number of different resources that are known to the creator of the document, each providing provenance about the document resource.
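
The following sketch shows one way a client could extract these links using Python's standard HTML parser. The class and function names are hypothetical, and for brevity the sketch accepts <link> elements anywhere in the document rather than only inside <head>.

```python
from html.parser import HTMLParser


class ProvenanceLinkParser(HTMLParser):
    """Collect href values of <link> elements whose rel includes "provenance"."""

    def __init__(self):
        super().__init__()
        self.provenance_uris = []

    def handle_starttag(self, tag, attrs):
        # Also invoked for self-closing tags such as <link ... />.
        if tag != "link":
            return
        attributes = dict(attrs)
        rel_types = (attributes.get("rel") or "").lower().split()
        href = attributes.get("href")
        if "provenance" in rel_types and href:
            self.provenance_uris.append(href)


def provenance_uris_from_html(html_text):
    """Return all provenance URIs linked from the given HTML text."""
    parser = ProvenanceLinkParser()
    parser.feed(html_text)
    parser.close()
    return parser.provenance_uris
```

Relative href values would still need to be resolved against the document's base URI before being dereferenced.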

Open issues

@@TODO - use of link relation registry with HTML <link> elements

@@TODO - The POWDER specification also adds: Documents MAY also include any of the attribution data from the POWDER document in meta tags. In particular, the issuedby field is likely to be useful to user agents deciding whether or not to fetch the full POWDER document. Any attribution data encoded in meta tags within an HTML document should be the same as that in the POWDER document. In case of discrepancy, the POWDER document should be taken as more authoritative. Is there a parallel we should add here for provenance?

Resource presented as RDF

If a resource is presented as RDF (in any of its recognized syntaxes, including RDFa), it may contain references to its own provenance using additional RDF statements.

For this purpose a new RDF property, prov:hasProvenance, is defined as a relation between two resources, where the object of the property is a resource that provides provenance data about the subject resource. Multiple prov:hasProvenance assertions may be made about a subject resource.
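
To illustrate how such statements might be consumed, the sketch below models RDF statements as plain (subject, predicate, object) tuples of URI strings and selects the provenance resources asserted for a subject. The namespace URI is a placeholder invented for this example, since none has yet been assigned, and the function name is likewise hypothetical.

```python
# Placeholder namespace: no namespace has been assigned to prov:hasProvenance yet.
PROV_NS = "http://example.org/ns/prov#"
HAS_PROVENANCE = PROV_NS + "hasProvenance"


def provenance_of(triples, subject_uri):
    """Return the objects of every prov:hasProvenance statement about subject_uri."""
    return [
        obj
        for (subj, pred, obj) in triples
        if subj == subject_uri and pred == HAS_PROVENANCE
    ]


# A resource with two provenance assertions, plus an unrelated statement.
example_triples = [
    ("http://example.org/resource", HAS_PROVENANCE, "http://example.org/prov/a"),
    ("http://example.org/resource", HAS_PROVENANCE, "http://example.org/prov/b"),
    ("http://example.org/resource", "http://purl.org/dc/terms/title", "Example"),
]
```

With these statements, provenance_of(example_triples, "http://example.org/resource") yields both provenance URIs, reflecting that multiple assertions may be made about one subject.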

@@TODO: example

@@TODO: document namespace. Check naming style. Use provenance model namespace? Define as part of model?

Third party services

The mechanisms for provenance discovery described above have all assumed the provenance URI is being supplied by the provider of the original resource. Where provenance information is provided by a third party without any collaboration from the original resource provider, the provenance link cannot be provided directly and a different approach must be considered.

We assume that the application or person requesting provenance information has the URI or other unique identification of the resource for which provenance is required, and also has a URI for a third-party service that provides a provenance information service. Specifically, the third party service URI is the URI of a SPARQL endpoint which is queried for the desired provenance information.

If the requester has a URI for the original resource, they simply issue a SPARQL query for the URI(s) of any associated provenance data; e.g., if the original resource has URI http://example.org/resource:

                PREFIX prov: <@@TBD>
                SELECT ?provenance_uri WHERE
                {
                  <http://example.org/resource> prov:hasProvenance ?provenance_uri
                }
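
A client could assemble such a query and the corresponding request URL roughly as follows. This sketch assumes the endpoint accepts queries in a "query" parameter of an HTTP GET request, as in the SPARQL Protocol; the namespace URI and the function names are placeholders invented for the example.

```python
from urllib.parse import urlencode

# Placeholder namespace: the prov: namespace is still to be decided.
PROV_NS = "http://example.org/ns/prov#"


def provenance_query(resource_uri):
    """Return a SPARQL SELECT query for the provenance URIs of resource_uri."""
    # Reject characters that SPARQL does not allow inside <...>
    # (control characters are not checked in this sketch).
    if any(ch in resource_uri for ch in '<>"{}|^`\\ '):
        raise ValueError("not usable as a SPARQL IRI reference: %r" % resource_uri)
    return (
        "PREFIX prov: <%s>\n"
        "SELECT ?provenance_uri WHERE {\n"
        "  <%s> prov:hasProvenance ?provenance_uri\n"
        "}" % (PROV_NS, resource_uri)
    )


def query_url(endpoint_uri, query):
    """Build the GET URL that submits `query` to a SPARQL endpoint."""
    return endpoint_uri + "?" + urlencode({"query": query})
```

The check on the resource URI guards against a malformed or hostile identifier changing the structure of the query.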
              

If the requester has identifying information that is not the URI of the original resource, then they will need to construct a more elaborate query to locate the target resource and obtain its provenance URI(s). The nature of identifying information that can be used in this way will depend upon the third party service used, further definition of which is out of scope for this specification. For example, a query for a document identified by a DOI, say 1234.5678, might look like this:

                PREFIX prov: <@@TBD>
                PREFIX idservice: <@@TBD>
                SELECT ?provenance_uri WHERE
                {
                  [ idservice:hasDOI "1234.5678" ] prov:hasProvenance ?provenance_uri
                }
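
Whichever query is used, the service returns bindings for ?provenance_uri. Assuming the endpoint can return results in the SPARQL Query Results JSON format, a client might read them as sketched below; the function name is hypothetical.

```python
import json


def provenance_uris_from_results(results_json):
    """Extract ?provenance_uri values from a SPARQL JSON results document."""
    document = json.loads(results_json)
    uris = []
    for binding in document["results"]["bindings"]:
        term = binding.get("provenance_uri")
        # Keep only URI terms; literals or blank nodes cannot be dereferenced.
        if term is not None and term.get("type") == "uri":
            uris.append(term["value"])
    return uris
```

An empty list means the service knows of no provenance for the resource, which is distinct from the resource having none.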
              

The mechanisms described here focus on finding the URI(s) for provenance information. Below, we consider access to provenance information for which there is no separate URI.

Querying provenance data

(This section will describe the use of a SPARQL endpoint service to obtain provenance information directly from a service provider. No new protocol or vocabulary elements are defined: the mechanisms used are those described above, coupled with possible use of provenance vocabulary terms in a SPARQL query.)

@@TODO

Provenance service discovery

(How to discover provenance services. There is nothing particular about provenance in this respect, and this section will discuss some of the available options without adding any new normative specification.)

@@TODO

IANA considerations

@@TODO - registration of "provenance" link relation, per http://tools.ietf.org/html/rfc5988#section-6.2.1.

Security considerations

Provenance is central to establishing trust in data. If provenance information is corrupted, it may lead agents (human or software) to draw inappropriate and possibly harmful conclusions. Therefore, care is needed to ensure that the integrity of provenance data is maintained.

When using HTTP to access provenance information, or to determine a provenance URI, secure HTTP (https) SHOULD be used.
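
A client wishing to follow this advice could check the URI scheme before dereferencing, as in this small sketch (the function name is hypothetical):

```python
from urllib.parse import urlsplit


def uses_secure_http(uri):
    """Return True if the URI uses the https scheme."""
    return urlsplit(uri).scheme.lower() == "https"
```

Since secure HTTP is a SHOULD rather than a MUST, a client might warn about, rather than refuse, a provenance URI for which this check fails.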

When retrieving a provenance URI from a document, steps SHOULD be taken to ensure the document itself is an accurate copy of the original whose author is being trusted (e.g. signature checking, or verifying its checksum against an author-provided secure web service).
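
The checksum variant of this check might look like the following sketch. It assumes the author publishes a hex-encoded SHA-256 digest; this specification does not mandate any particular algorithm, and the function name is hypothetical.

```python
import hashlib
import hmac


def matches_checksum(document_bytes, expected_sha256_hex):
    """Return True if the document's SHA-256 digest equals the expected value.

    SHA-256 is an assumption of this sketch, not a requirement of the text.
    """
    actual = hashlib.sha256(document_bytes).hexdigest()
    # Constant-time comparison; surrounding whitespace and letter case are ignored.
    return hmac.compare_digest(actual, expected_sha256_hex.strip().lower())
```

The expected digest must itself be obtained over a trusted channel, otherwise the comparison proves nothing about the document's origin.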

@@TODO ... privacy, access control to provenance (from Edinburgh meeting). In particular, note that the fact that a resource is openly accessible does not mean that its provenance information should also be.

@@TODO ... more, probably

Acknowledgements

Many thanks to Robin Berjon for making our lives so much easier with his cool ReSpec tool.