Meetings/20070426/MartinDoerr
Martin Doerr is the lead designer of the CIDOC CRM metadata scheme for cultural heritage (see DefiningImageAccess/Standard/CIDOC), and has considerable experience with mapping native metadata schemas to CIDOC. The following notes are distilled from part of a telephone conversation with him about designing schema mappings for museum data.
Discussion
The amount of time needed to define a mapping depends on the richness of the metadata involved: e.g. MIDESS(?) is relatively rich compared with Dublin Core. (Later: I think he meant MIDAS; see http://heritage-standards.org/.)
Pre-conversion vs mediated approach.
We are very keen on a mediated approach. Martin points out that this may need a very powerful mapping engine. He also points out that the "integration effect" for classifying objects can be low because the data can be very opinionated. (Something about ... questions at the categorical level rather than ... objects, people, places ... different forms.) The safest way is to physically extract the metadata and preprocess it for alignment. Consolidation of "Factual identifiers" (what are these?) is particularly difficult.
We talked around this issue, and despite very different starting positions, I believe we are actually in broad agreement about most of the issues. While we are designing around a mediated approach, that approach does not preclude pre-harvesting or caching of results. Martin mentioned designing a special language to define mappings, an approach which very closely parallels my own thoughts. (I'm considering a functional language approach, driving functional composition to create a kind of pre-compiled mapping function; cf. Swish, functional combinator parsing, Parsec, PyParsing, etc.)
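To make the functional-composition idea a little more concrete, here is a minimal sketch in Python. It is only an illustration of the general technique, not the mapping language Martin described and not our actual design. The combinator names (`rename`, `transform`, `combine`), the source field names and the target property names are all hypothetical; in particular, the target names are not real CIDOC CRM identifiers.

```python
from typing import Any, Callable, Dict

Record = Dict[str, Any]
Mapping = Callable[[Record], Record]


def rename(src: str, dst: str) -> Mapping:
    """Copy a source field to a target property, if the field is present."""
    def apply(record: Record) -> Record:
        return {dst: record[src]} if src in record else {}
    return apply


def transform(src: str, dst: str, fn: Callable[[Any], Any]) -> Mapping:
    """Like rename, but pass the value through fn on the way."""
    def apply(record: Record) -> Record:
        return {dst: fn(record[src])} if src in record else {}
    return apply


def combine(*mappings: Mapping) -> Mapping:
    """Compose several field mappings into a single record mapping."""
    def apply(record: Record) -> Record:
        out: Record = {}
        for mapping in mappings:
            out.update(mapping(record))
        return out
    return apply


# Hypothetical source fields and target properties (assumptions for
# illustration only). The composed function is built once and can then
# be applied to every record, whether on the fly or when pre-harvesting.
museum_to_target = combine(
    rename("title", "has_title"),
    rename("maker", "was_produced_by"),
    transform("date", "has_time_span", lambda s: s.strip()),
)

result = museum_to_target(
    {"title": "Vase", "maker": "Unknown", "date": " 1850 ", "shelf": "B3"}
)
print(result)
```

Source fields with no mapping (here `shelf`) are simply dropped. Because the composed mapping is an ordinary function, the same definition could serve both a mediated query path and a caching or pre-harvesting step.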
Martin mentions that he has studied "Local as view" ("LAV") approaches to schema mapping. (Local schema is subset of global schema ... seen through global schema.) Martin will send references to papers about this approach.
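As a rough aid to my own understanding of LAV, here is a heavily simplified sketch in Python. Each source is described as a "view" over a global schema (here just a class name plus the subset of global properties it supplies), and a query phrased against the global schema is routed to the sources whose view covers it. The schema, source names and the `sources_for` helper are all assumptions made for illustration; real LAV systems rewrite queries over views and can combine several sources, which this sketch does not attempt.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

# Hypothetical global schema: class name -> properties defined on it.
GLOBAL_SCHEMA: Dict[str, Set[str]] = {
    "Object": {"title", "maker", "date", "place"},
}

# Each local source is described as a view over the global schema:
# the global class it populates and the properties it can supply.
SOURCE_VIEWS: Dict[str, Tuple[str, FrozenSet[str]]] = {
    "museum_a": ("Object", frozenset({"title", "maker"})),
    "archive_b": ("Object", frozenset({"title", "date", "place"})),
}

# Sanity check: every view must stay within the global schema.
for name, (cls, props) in SOURCE_VIEWS.items():
    assert props <= GLOBAL_SCHEMA[cls], f"{name} goes beyond the global schema"


def sources_for(cls: str, wanted: Set[str]) -> List[str]:
    """Return the sources whose view alone covers all wanted properties."""
    return sorted(
        name
        for name, (view_cls, props) in SOURCE_VIEWS.items()
        if view_cls == cls and wanted <= props
    )


print(sources_for("Object", {"title"}))
print(sources_for("Object", {"title", "date"}))
print(sources_for("Object", {"maker", "place"}))
```

The last query finds no single source, because neither view on its own supplies both `maker` and `place`.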
Common structural elements
(The following note was added rather later, so I hope my memory is not playing up.)
Martin also mentioned that, from a study of many diverse systems with overlapping subject material, he has identified a core of "structural" vocabulary elements that could describe all the situations he had surveyed. Beyond this, the schema elements would include domain-specific classes, all of whose possible relations are defined in terms of the core structural elements, and which as such can generally be handled by simple mappings within the structural framework mentioned. I speculate that this thinking is reflected in a paper describing a schema mapping language:
- http://www.ics.forth.gr/isl/publications/paperlink/Mapping_TR385_December06.pdf - Haridimos Kondylakis, Martin Doerr and Dimitris Plexousakis, Mapping Language for Information Integration. Describes a mapping language for schema alignment that captures a number of commonly needed alignment patterns.
Subsequent thoughts
Later, some references discovered by Googling for references to LAV:
- http://www.almaden.ibm.com/cs/people/fagin/inverse.pdf
- http://www.dcs.bbk.ac.uk/~ap/talks/AUEBJan2006.ppt
- http://ilps.science.uva.nl/Teaching/II0607/twiki/pub/Main/CourseSchedule/ii-0607-Week11-Data-integration-2.pdf
- http://www.cs.wisc.edu/~anhai/papers/si-survey-db-community.pdf
I'm not sure our approach is strictly LAV as I think Martin described it (see above), as we also expect to be able to use source data for which there is no global schema ... clearly, the integrative effect here is non-existent, except that the user annotations may come into play here.

