ADMIRAL Databank submission requirements

From ImageWeb


This page explores the metadata and mechanisms we are adopting for submitting datasets to OULS Databank for long-term preservation.


Minimum information for Databank submission

This comment from Ben O'Steen is about the minimum metadata requirements for submission to Databank:

Add in the metadata that you wish to be available. (via root.rdf and manifest.rdf files - see below) Everything else is automatically gathered or generated, ergo, there is no minimum requirement. If you don't add a title, one is generated by the folder name. If you don't add a creator, the name associated with the account is used instead. I have been suggesting BagIt as a format for submission, and folder hierarchy is retained, which means that relative links work. The minimum metadata we gather automatically and hold for each item is:
  • submitter - (we get that via authentication mech/source)
  • various dates (created, modified, etc)
  • file hierarchy structure and mimetypes (autogen'd)
  • Checksums (via the Checkm specification)
If you wish there to be RDF metadata about the item added to the index then place it at the top of the item's folder hierarchy in a file called root.rdf.

So the question becomes: what is the minimum metadata that we should collect locally and include in the package submission? The touchstone here is to select those data that we can provide more accurately than Databank might guess, with an acceptable level of user interaction. Our proposal is thus:

  • Name of person who created the data, on the basis that the actual creator may be different from the person or agent doing the submission.
  • Title of dataset: I imagine any user will be happy to provide a better title than a guess based on a filename. A local file/directory name can be offered as a default.
  • Date of creation and/or revision. We are likely to have better information locally than is likely to be available to Databank. Default values based on the LSDS file system should be offered.
  • Copyright owner - default to person who creates the dataset; later maybe have some local default policy.
  • License terms - this is likely to be a default value, specified per-user and/or research group.
  • Embargo policy - this is likely to be a default value, specified per-user and/or research group.
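
The proposed fields could be written to root.rdf along the following lines. This is only a sketch: the choice of Dublin Core properties is our assumption rather than something Databank has specified, the `ex:` namespace and its `embargoedUntil` term are hypothetical placeholders, and the empty `rdf:about` (meaning "this package") is likewise a guess at what Databank expects.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"
DCTERMS = "http://purl.org/dc/terms/"
# Hypothetical namespace for the embargo term; not a Databank vocabulary.
EX = "http://example.org/admiral/terms#"

for prefix, uri in (("rdf", RDF), ("dc", DC), ("dcterms", DCTERMS), ("ex", EX)):
    ET.register_namespace(prefix, uri)


def build_root_rdf(creator, title, created, rights_holder, license_terms, embargo_until):
    """Return RDF/XML text describing the package as a whole."""
    root = ET.Element(f"{{{RDF}}}RDF")
    # Assumption: an empty rdf:about refers to the package itself.
    desc = ET.SubElement(root, f"{{{RDF}}}Description", {f"{{{RDF}}}about": ""})
    fields = [
        (DC, "creator", creator),
        (DC, "title", title),
        (DCTERMS, "created", created),
        (DCTERMS, "rightsHolder", rights_holder),
        (DCTERMS, "license", license_terms),
        (EX, "embargoedUntil", embargo_until),  # hypothetical property
    ]
    for namespace, name, value in fields:
        ET.SubElement(desc, f"{{{namespace}}}{name}").text = value
    return ET.tostring(root, encoding="unicode")
```

Local defaults (directory name for the title, file system dates, per-group license and embargo policy) would be filled in before calling the function, so the user only corrects what is wrong.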

Other metadata we might wish to consider including:

  • Email of creator
  • Email of submitter
  • Type(s) of data (image, video, numeric, etc.)
  • Publication reference

The data will be submitted as a zip file, per the BagIt specification (https://confluence.ucop.edu/display/Curation/BagIt). The actual data to be included, and its organization, is something we'll need to work out with users. Any explicitly provided metadata is supplied through the files root.rdf and manifest.rdf. In principle, root.rdf appears just once, in the root of the package, and provides metadata about the package as a whole, while manifest.rdf may be provided per directory and provides information about individual files within the package. In practice, these are stored as named graphs within a single triple store. The files RELS_EXT and RELS_INT are artifacts of the FEDORA back-end and may be ignored for our purposes.
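
As a rough illustration of the packaging step, the sketch below builds a zip with a minimal BagIt-style layout: a bag declaration, payload files under data/, and a payload manifest. The function name, the BagIt version string, the choice of MD5, and the placement of root.rdf at the top of the bag are all assumptions to be checked against what Databank actually accepts.

```python
import hashlib
import io
import zipfile


def build_bag_zip(bag_name, payload, root_rdf=None):
    """Return the bytes of a zip file holding a minimal BagIt-style bag.

    payload maps relative paths (e.g. "ej/raw.wav") to file contents (bytes).
    """
    buffer = io.BytesIO()
    manifest_lines = []
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        # Bag declaration; the version number here is an assumption.
        archive.writestr(
            f"{bag_name}/bagit.txt",
            "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n",
        )
        for rel_path, content in sorted(payload.items()):
            # Payload goes under data/, preserving the folder hierarchy.
            archive.writestr(f"{bag_name}/data/{rel_path}", content)
            digest = hashlib.md5(content).hexdigest()
            manifest_lines.append(f"{digest}  data/{rel_path}\n")
        archive.writestr(f"{bag_name}/manifest-md5.txt", "".join(manifest_lines))
        if root_rdf is not None:
            # Assumption: root.rdf sits at the top of the package.
            archive.writestr(f"{bag_name}/root.rdf", root_rdf)
    return buffer.getvalue()
```

Because relative paths inside data/ are kept as supplied, relative links between files continue to work after submission.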

If supplied, a Checkm (https://wiki.ucop.edu/display/Curation/Checkm) file will be used to validate resource contents against supplied digest values.
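
The same check can be run locally before submission. The sketch below assumes the pipe-separated Checkm line layout (path | algorithm | digest | ...) with `#` comment lines, and reads only the first three fields; the function names and the `read_bytes` callback are our own, and entries with no digest or an algorithm unknown to hashlib are simply skipped.

```python
import hashlib


def parse_checkm(text):
    """Yield (path, algorithm, digest) for each verifiable manifest line."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        fields = [field.strip() for field in line.split("|")]
        if len(fields) < 3 or not fields[2]:
            continue  # no digest supplied, nothing to verify
        yield fields[0], fields[1].lower(), fields[2].lower()


def validate(entries, read_bytes):
    """Return the paths whose computed digest differs from the supplied one.

    read_bytes is a caller-supplied function mapping a path to its content.
    """
    failures = []
    for path, algorithm, digest in entries:
        if algorithm not in hashlib.algorithms_available:
            continue  # not a digest algorithm hashlib can compute
        computed = hashlib.new(algorithm, read_bytes(path)).hexdigest()
        if computed != digest:
            failures.append(path)
    return failures
```

An empty result from `validate` means every listed file matched its supplied digest.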

Submission protocol

  • HTTP POST, with digest authentication. (Using SSO credentials? Not yet finalized.)
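
A client-side sketch of such a submission, using only the Python standard library, might look like this. The endpoint URL, the `application/zip` content type, and the function name are assumptions, since the protocol details are not yet finalized; the function only prepares the request and leaves sending to the caller.

```python
import urllib.request


def prepare_submission(url, zip_bytes, username, password):
    """Return (opener, request) for POSTing a zipped package with digest auth."""
    password_manager = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    # None means "use these credentials for any realm at this URL".
    password_manager.add_password(None, url, username, password)
    opener = urllib.request.build_opener(
        urllib.request.HTTPDigestAuthHandler(password_manager)
    )
    request = urllib.request.Request(
        url,
        data=zip_bytes,
        headers={"Content-Type": "application/zip"},  # assumed content type
        method="POST",
    )
    return opener, request
```

Calling `opener.open(request)` would then perform the POST, with the handler answering the server's digest challenge automatically.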

Retrieval

Retrieving root.rdf or manifest.rdf files will return the exact metadata supplied. Using the Databank API (or possibly content negotiation) on a particular resource will return combined supplied+constructed metadata for a resource.

We have discussed extending the API so a copy of the original ZIP file can be retrieved.

Schemas used

DC:

RELS-EXT (auto-generated?):

RELS-INT (auto-generated?):

Links and examples

See:

DC example for phonetic data:

<oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
  <dc:title>Dataset: Tick1 audio corpus</dc:title>
  <dc:creator>Greg Kochanski</dc:creator>
  <dc:rights>Unestablished</dc:rights>
  <dc:identifier>dataset:1</dc:identifier>
</oai_dc:dc>

RELS-EXT for audio data:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="info:fedora/dataset:1">
    <rdf:type rdf:resource="http://vocab.ouls.ox.ac.uk/dataset/scheme#DataSet"></rdf:type>
    <rdfs:label xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">/</rdfs:label>
    <dataset:physicalPath xmlns:dataset="http://vocab.ouls.ox.ac.uk/dataset/scheme#">/</dataset:physicalPath>
    <dcterms:created xmlns:dcterms="http://purl.org/dc/terms/">2006-2-24</dcterms:created>
    <dcterms:isReferencedBy xmlns:dcterms="http://purl.org/dc/terms/" rdf:resource="http://ora.ouls.ox.ac.uk/objects/uuid:1999c687-49a0-4808-9a50-2f82ab66d96f"></dcterms:isReferencedBy>
    <dcterms:isReferencedBy xmlns:dcterms="http://purl.org/dc/terms/" rdf:resource="http://kochanski.org/gpk/papers/2007/icphs.pdf"></dcterms:isReferencedBy>
  </rdf:Description>
</rdf:RDF>

This example looks like a RELS-INT, taken in this case from a subdirectory of the overall dataset.

<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
   xmlns:dataset="http://vocab.ouls.ox.ac.uk/dataset/scheme#"
   xmlns:dc="http://purl.org/dc/elements/1.1/"
   xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
>
  <rdf:Description rdf:about="info:fedora/uuid:88aa7df3-2282-11de-9609-000e2ed68b2b/TEXT_DESC">
    <dc:title>TEXT_DESC</dc:title>
    <dc:description>Data description</dc:description>
  </rdf:Description>
  <rdf:Description rdf:about="info:fedora/uuid:1edb97d5-2283-11de-9609-000e2ed68b2b/raw.wav">
    <dc:title>raw.wav</dc:title>
    <dc:description>the original recording, in Microsoft WAV format.
It is a two-channel file.  One channel contains the
recorded speech, and the other channel contains either
metronome ticks or an audio channel from a microphone
positioned to pick up finger taps.   (The subject's finger
tapped on a hardcover book about 2cm from the microphone.)
The finger tap channel will pick up some speech, but faintly,
and the speech channel will pick up some finger tap sounds.
However, metronome ticks were coupled in electronically and
are completely isolated from the speech channel.</dc:description>
    <dataset:physicalPath>/ej/ej_rep12_m96/raw.wav</dataset:physicalPath>
  </rdf:Description>
  <rdf:Description rdf:about="info:fedora/uuid:1edb97d5-2283-11de-9609-000e2ed68b2b/irr.dat">
    <dc:title>irr.dat</dc:title>
    <dc:description>An irregularity measure that separates voiced speech
from unvoiced.   It quantifies speech that is not fully voiced.</dc:description>
    <dataset:physicalPath>/ej/ej_rep12_m96/irr.dat</dataset:physicalPath>
  </rdf:Description>
  <rdf:Description rdf:about="info:fedora/uuid:1edb97d5-2283-11de-9609-000e2ed68b2b/raw.ue">
    <dc:title>raw.ue</dc:title>
    <dc:description></dc:description>
    <dataset:physicalPath>/ej/ej_rep12_m96/raw.ue</dataset:physicalPath>
  </rdf:Description>
  <rdf:Description rdf:about="info:fedora/uuid:1edb97d5-2283-11de-9609-000e2ed68b2b/loud.dat">
    <dc:title>loud.dat</dc:title>
    <dc:description>The perceptual loudness.</dc:description>
    <dataset:physicalPath>/ej/ej_rep12_m96/loud.dat</dataset:physicalPath>
  </rdf:Description>
  <rdf:Description rdf:about="info:fedora/uuid:1edb97d5-2283-11de-9609-000e2ed68b2b/ue.lbl">
    <dc:title>ue.lbl</dc:title>
    <dc:description>These are the start and end-points of the speech in the
utterance, automatically generated but checked for accuracy
by a human.   A small amount of silence (probably &lt;100ms)
is included within
the marked endpoints on either side of the utterance.</dc:description>
    <dataset:physicalPath>/ej/ej_rep12_m96/ue.lbl</dataset:physicalPath>
  </rdf:Description>
  <rdf:Description rdf:about="info:fedora/uuid:1edb97d5-2283-11de-9609-000e2ed68b2b/pdur.dat">
    <dc:title>pdur.dat</dc:title>
    <dc:description>A measure of duration for the current syllable.
Essentially, it measures how far one can go (in time)
before the spectrum changes substantially.</dc:description>
    <dataset:physicalPath>/ej/ej_rep12_m96/pdur.dat</dataset:physicalPath>
  </rdf:Description>
  <rdf:Description rdf:about="info:fedora/uuid:1edb97d5-2283-11de-9609-000e2ed68b2b/rms.dat">
    <dc:title>rms.dat</dc:title>
    <dc:description>The RMS (intensity or power).</dc:description>
    <dataset:physicalPath>/ej/ej_rep12_m96/rms.dat</dataset:physicalPath>
  </rdf:Description>
  <rdf:Description rdf:about="info:fedora/uuid:1edb97d5-2283-11de-9609-000e2ed68b2b/m.dat">
    <dc:title>m.dat</dc:title>
    <dc:description>This file contains computed tick or tap locations.
It is meaningful only for metronome data, where it simply
marks the metronome ticks.</dc:description>
    <dataset:physicalPath>/ej/ej_rep12_m96/m.dat</dataset:physicalPath>
  </rdf:Description>
  <rdf:Description rdf:about="info:fedora/uuid:1edb97d5-2283-11de-9609-000e2ed68b2b/repeats.lab">
    <dc:title>repeats.lab</dc:title>
    <dc:description></dc:description>
    <dataset:physicalPath>/ej/ej_rep12_m96/repeats.lab</dataset:physicalPath>
  </rdf:Description>
  <rdf:Description rdf:about="info:fedora/uuid:1edb97d5-2283-11de-9609-000e2ed68b2b/sss.dat">
    <dc:title>sss.dat</dc:title>
    <dc:description>A measurement of the average slope of the speech spectrum.</dc:description>
    <dataset:physicalPath>/ej/ej_rep12_m96/sss.dat</dataset:physicalPath>
  </rdf:Description>
  <rdf:Description rdf:about="info:fedora/uuid:1edb97d5-2283-11de-9609-000e2ed68b2b/raw.tap">
    <dc:title>raw.tap</dc:title>
    <dc:description>This file contains experimental tick or tap events.
For the metronome data, it contains the times at which
metronome ticks occur.   For the "tick" data, if it
exists, it lists the times at which the subject's finger
tapped to mark a stressed syllable.
This is computed from one of the channels of the raw.wav file,
but manually checked.</dc:description>
    <dataset:physicalPath>/ej/ej_rep12_m96/raw.tap</dataset:physicalPath>
  </rdf:Description>
  <rdf:Description rdf:about="info:fedora/uuid:1edb97d5-2283-11de-9609-000e2ed68b2b/f0.dat">
    <dc:title>f0.dat</dc:title>
    <dc:description>A standard computation of the speech fundamental frequency.</dc:description>
    <dataset:physicalPath>/ej/ej_rep12_m96/f0.dat</dataset:physicalPath>
  </rdf:Description>
</rdf:RDF>
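
Per-file metadata in this shape is straightforward to read back. The sketch below, which assumes well-formed RDF/XML using the same three namespaces as the example, lists the title, physical path, and whitespace-normalized description of each described resource; the function name is our own.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dataset": "http://vocab.ouls.ox.ac.uk/dataset/scheme#",
}


def list_files(rdf_xml):
    """Return (title, physical path, description) for each rdf:Description."""
    root = ET.fromstring(rdf_xml)
    rows = []
    for desc in root.findall("rdf:Description", NS):
        title = desc.findtext("dc:title", default="", namespaces=NS)
        path = desc.findtext("dataset:physicalPath", default="", namespaces=NS)
        description = desc.findtext("dc:description", default="", namespaces=NS)
        # Collapse the line breaks and runs of spaces used in the descriptions.
        rows.append((title, path, " ".join(description.split())))
    return rows
```

Entries with no physical path or an empty description, such as TEXT_DESC or raw.ue, come back with empty strings rather than raising an error.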