Meetings/20070122/Dspace@Cambridge

Defining Image Access

DSpace@Cambridge

Meeting with Patricia Killiard and Tom de Mulder 22 January 2007 at the University Library, Cambridge University

General

Reflection about our project and data web ideas: Not dismayed by it - it fits in with our agenda. We have large legacy collections, some already in DSpace. We are aware of other digitization projects within the University, and are involved in some.

What value do you see in linking to other sources, repositories, NHM, publishers? See a demand for cross-searching image collections, particularly within particular communities. Would like to be able to search all universities' images.

What kind of cross-repository search facilities are already available? Peter Morgan knows about other cross searching, e.g. ROAR, the Registry of Open Access Repositories (http://roar.eprints.org; Southampton University). ROAR has a content search facility that searches across a range of repositories using Google Co-op (http://www.google.com/coop/; "Google Co-op is a platform that enables you to customize the web search experience for users of both Google and your own website"). This gives information about repositories, but does not seem to permit one to search their holdings directly, and does not distinguish images from other types of holding.


Digitization activities

Have own image services unit - 3 staff full-time digitizing, and 3 others dealing with microfilming.

JISC have just funded a major digitization project at the Scott Polar Institute. In addition, Corpus Christi College Parker Library has Mellon Foundation money to fund additional University Library staff to digitize ~200-400K mediaeval manuscript pages within the college library, creating page images that will then be served to the Web at Stanford University.

Other images are being created for teaching (SAKAI project), and museum images are being created by the Fitzwilliam Museum - of events and museum objects, that will not be made public. Margaret Grieves, Deputy Director of the Fitzwilliam Museum, is trying to formalize procedures for digital assets and would be interested to learn of our EC Beazley Project.


Getting good metadata

Mostly down to policy: most successful digitization projects are those that stick to simple guidelines.

The Corpus Christi (Stanford) digitization work has very clear policies, with two people who just look at metadata, plus quality control of the images.

Tom: Secret of getting good metadata during image capture is to get involved in the projects early - at the grant-writing stage.

Ingestion into DSpace is just a (final?) stage of a wider process of image management.

Patricia: When setting up DSpace, we took whatever people gave us - people were concerned about submitting images for their long-term preservation. Often files were not ready for DSpace because metadata had not been taken care of, and it is very hard to retrofit metadata.

Also, from the project kick-off meeting: they use MPEG-21 to embed metadata with content. Modified DC metadata only, using library application profile. -- GK

Other organizations

Recommended the Society for Imaging Science and Technology (http://www.imaging.org/) - top end of technical stuff, with very good intense conferences - see http://www.imaging.org/conferences/.

I noticed Larry Masinter, an old colleague from Web standardization activities and editor of early URI specifications, and now working for Adobe, was presenting at their 2006 conference - GK.


Ingesting, storing and serving images

Prefer batch submission, since submission via web browser is very inefficient.

Protocols for automated remote submission are not possible for DSpace - problem of authentication. So Tom runs the batch upload on the local DSpace server, using a new script written for each batch.

Cambridge CARET project run a development DSpace system, and are working on a batch deposit tool for SAKAI (also for deposit to Eprints, Fedora?).


Metadata requirements for new ingest

No mandating of metadata! Tom would discuss with submitters, and then map whatever they have, in whatever form, onto Qualified DC, or find an existing XML metadata schema to cover this type of data, since DSpace can support several metadata standards. This is done individually for each new collection, with creation of a new bespoke batch submission tool.
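A minimal sketch of the per-collection mapping step described above, assuming a submitter supplies flat field/value records. The submitter field names on the left are invented for illustration, and the Qualified DC targets on the right are one plausible choice, not the mapping actually used at Cambridge.

```python
# Hypothetical mapping from one submitter's field names to Qualified DC.
# Keys are invented for illustration; values use the dotted
# "element.qualifier" style and are an assumption, not the real mapping.
FIELD_MAP = {
    "photographer": "contributor.author",
    "caption": "title",
    "taken_on": "date.created",
    "keywords": "subject",
    "place": "coverage.spatial",
}

def to_qualified_dc(record, field_map=FIELD_MAP):
    """Split a submitter record into mapped DC fields and leftovers.

    Returns (mapped, unmapped): mapped is {dc_field: [values]},
    unmapped keeps any fields that still need a human decision.
    """
    mapped = {}
    unmapped = {}
    for key, value in record.items():
        dc_field = field_map.get(key)
        if dc_field is None:
            unmapped[key] = value
        else:
            mapped.setdefault(dc_field, []).append(value)
    return mapped, unmapped
```

Returning the unmapped fields separately reflects the point that each new collection needs discussion with the submitter: anything not covered by the mapping is surfaced rather than silently dropped.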


Accessing image content

The Scott Polar Institute wants to create its own web site, but doesn’t want to store the images, so it will send SOAP queries to DSpace. Will be able to support queries of the nature: "Give me all images and metadata in the collection with subject X for date Y".

We are not normally the main method of distribution: users prefer to pull images to their own sites for republication. They don't want to redirect browsers to DSpace (unless the images are very large), and prefer their own web sites, which:

  • maintain contextual information for their users (e.g. the topic of the web site, which is lost when a user lands in DSpace),
  • can present metadata in meaningful and helpful ways, and
  • can provide browsable indexes with sensible topic groupings

(these all being difficult for a generic repository to achieve).

DSpace can browse by common fields such as author, subject, date.

The DSpace repository lacks collection-level metadata.

For humanities and social science users, browsing is very important. Tailored categories are not feasible. For what?

A SOAP interface was chosen since the people developing subject-specific web sites know PHP, which supports SOAP, and don't want to learn OAI-PMH, which is seen as a specialized "library" standard that most people don't understand, while all the world understands PHP. On the other side are the librarians who know OAI. Tom's role is to mediate understanding between users and DSpace. Cannot expect users to become technical experts - just want something that works. "Just call these functions" without having to understand OAI-PMH. More detailed probing of this point revealed that the real goal was to provide a higher-level API for easily performing the kinds of queries the users specifically wanted, without having to understand how the metadata is organized within DSpace. (For all its simplicity, OAI-PMH does push some need to understand the metadata onto the formulator of a query.) SOAP was the available mechanism for realizing this API, and is not made visible to users.
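To illustrate what "just call these functions" might look like, here is a sketch of one higher-level query function of the kind requested ("all images in the collection with subject X for date Y"). It works over in-memory records and is not the Cambridge SOAP API: the Item class, the function name and the assumption that dates are stored as ISO strings under "date.created" are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Item:
    handle: str      # repository identifier (values used below are made up)
    metadata: dict   # Qualified DC field name -> list of string values

def images_by_subject_and_date(items, subject, year):
    """Return items tagged with `subject` and created in `year`.

    The caller never names a metadata field: knowledge of how the
    metadata is organized stays inside this function.
    """
    wanted = subject.lower()
    results = []
    for item in items:
        subjects = [s.lower() for s in item.metadata.get("subject", [])]
        dates = item.metadata.get("date.created", [])
        # Assumes ISO-style dates such as "1911-12-14".
        if wanted in subjects and any(d.startswith(str(year)) for d in dates):
            results.append(item)
    return results
```

The design point is the one made in the notes: the web site developer supplies a subject and a date, and the mapping onto metadata fields is hidden behind the function.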


Future of DSpace@Cambridge

DSpace@Cambridge is the biggest DSpace installation anywhere. Users (data submitters) want DSpace to be centrally funded, and usable by people who know nothing about metadata standards, and who just want a pretty web site with safe contents.

Will be competing with Google - which gives poor results because of poor metadata, but is popular because of its supreme simplicity.

Tom: People have to agree about metadata fields, but this is hard to achieve. So if DSpace is going to pull in lots of metadata, we need software to handle ambiguities.

Primary purpose of DSpace: Preservation of data in a long-term secure environment. For data distribution, people likely to pull images from DSpace into their own web sites, or keep local image copies for use while archiving to DSpace.

E.g. embed a video in a web site, while DSpace keeps a copy of the content for safekeeping. End users would not be pointed to DSpace; rather all the contextual information would be on the specialist web site.

Where are you going wrt metadata in 2 years? Enhancements to software. More EPrints content. Limited at present by lack of liaison post to talk to depositors and help them with metadata completion.

Would like to store theses and EPrints. Images embedded within text are not likely to be dealt with as discrete objects (unless Google does this) - i.e. this is NOT an anticipated DSpace service.

PANDIS(?) experiment with link back to front organization to purchase controlled content.

2-year goal: something like RCS(?) in a METS wrapper.


Strengths and weaknesses of DSpace

Fedora model better layered - more flexible. Most DSpaces round the world are very small, but since the Cambridge one is very large, it stretches its capabilities and shows up its lack of flexibility. DSpace 2 is coming slowly and will be better, but the rest of the DSpace community is showing very little interest in this, since the existing system works OK for their small holdings.

Current DSpace version is 1.4, soon will be 1.5. This has very limited upgradability. MANAKIN is a response to this (MANAKIN is a DSpace-like system based on Apache HTTPD and Cocoon; uses XML and XSLT to deliver web content from an internal store).

DSpace V1 is very monolithic in structure, with too much back-end logic in the web serving application, which limits the ability to upgrade or add new features. DSpace V2 aims to implement core repository and basic data management (cf. Fedora) separately from web serving interfaces. DSpace@Cambridge are active in requirements gathering for V2, especially from their perspective of needing scalability.

DSpace is run on one or more Linux systems. A basic experimental system can be set up in a day.

DSpace is weak in the area of access control.

Tom: would like a system of metadata acquisition that works like a wiki, with complete recording of change history, so that a community approach to metadata gathering can be tried.

DSpace may be the only repository system that has any demonstrated level of scalability. (I don't know who said that, but it's not entirely true: we heard at the MPEG-21 tutorial run by UKOLN that LANL run aDORe with millions of stored digital objects - GK).


Recommended software systems

An Image Management System is required for the University. Intrallect (http://www.intrallect.com/) is a learning object repository system that is NOT suitable.

Can't DSpace do that for you? Tom: DSpace is used as the final step when images are ready - but it does not let you do change management, so need something to manage the steps (image capture, editing, metadata addition) prior to DSpace submission. Need a workflow management tool. [FlyData??]

No simple software recommendation.


Users

DSpace@Cambridge "users" are taken to be the data depositors; for access to content, it's just "on the web". Hence Tom has little information about the end user "consumers" of the DSpace@Cambridge holdings, who are left very much to their own devices. While Web usage logs exist, he has not had time to analyse them, so he has no estimates of end-user access.

Can't foresee how end users will use metadata for browsing. Uses of holdings are often surprising - schools in India and students in USA requiring access to finish term papers - learn of use mostly through complaints when the system goes down!

Can also revert to search - use Google, since Google indexes all metadata fields as if they are all equivalent.


Access to Collections

Image collections are either completely "open" or completely "dark", but this is changing (e.g. low-res images made public while hi-res images are kept private). Thumbnailing is not currently activated. Fine-grained access control is not easy to achieve.

Open collections show just metadata, not thumbnails. Image files can be directly downloaded, but not previewed before download to permit the end-user to check that this is what they want.

"Dark" collections are just archived to ensure preservation, and neither the images nor their metadata are available. One example of this is graduation photos, where releasing metadata - names of graduands - would contravene the Data Protection Act.

The system can be set up to permit the metadata to be seen, but not to permit file download. However, this is not used in Cambridge.

At present, cannot log changes in metadata, and users cannot submit comments. They see a role for user annotation, but think that DSpace is not the place for it. Rather they suggest using a Wiki page to give greater structure, including a change history. Can always revert to an earlier version if hacked. Could run this in harmony with DSpace, since each DSpace entry has a unique identifier.
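A sketch of the wiki-like behaviour suggested above: every metadata edit is kept as a revision keyed to the item's identifier, and an earlier revision can be restored without losing the record of what happened. The class and method names are hypothetical, and this is an in-memory illustration rather than anything DSpace or a particular wiki provides.

```python
class MetadataHistory:
    """Append-only revision log for one item's metadata."""

    def __init__(self, identifier, initial):
        self.identifier = identifier                 # e.g. a DSpace handle
        self._revisions = [(None, dict(initial))]    # (editor, snapshot)

    @property
    def current(self):
        """Copy of the latest metadata snapshot."""
        return dict(self._revisions[-1][1])

    def edit(self, editor, changes):
        """Apply changes as a new revision; return its revision number."""
        snapshot = self.current
        snapshot.update(changes)
        self._revisions.append((editor, snapshot))
        return len(self._revisions) - 1

    def revert(self, revision, editor):
        """Restore an earlier revision by appending it as a new one."""
        snapshot = dict(self._revisions[revision][1])
        self._revisions.append((editor, snapshot))

    def log(self):
        """List of (revision number, editor) pairs, oldest first."""
        return [(i, editor) for i, (editor, _) in enumerate(self._revisions)]
```

Reverting appends rather than deletes, so the vandalised revision stays in the log, which is the property that makes a community approach to metadata gathering auditable.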


Suggested requirements or policies for new collections

Determine what is the nature of the content, then work out how that can be described. Use Dublin Core if possible, otherwise look for an existing metadata schema.


Specific Collections

The CamRAD Rock Art images http://www.dspace.cam.ac.uk/handle/1810/62 is the first image collection in DSpace, with "poor" metadata (Authors unknown, etc). Getting the GPS coordinates munged into DSpace was difficult but done.

See the Royal Commonwealth Society Photography Project (http://www.lib.cam.ac.uk/rcs_photo_project/homepage.html): Provides browsable index, with images grouped by topic. RCS shows topic by country, religion and belief, etc. DSpace would not necessarily permit this degree of sophistication, but rather would only offer very generic SEARCH by title, date or author, rather than a topic-specific BROWSE. Could not browse by species, for example. Rachel Rowe is the Royal Commonwealth Society Librarian.

Another image collection in the University Library but not in DSpace is the Conrad Martens sketchbooks from the voyage of the Beagle (http://www.lib.cam.ac.uk/ConradMartens/) - these have place information, etc.

Three collections that would be good as exemplars for Defining Image Access are

  1. RCS Photographic Collection with qualified DC - conforms to standards and is an example of how DSpace would like all its collections. Free. Uses controlled vocabulary for metadata: Name authorities in the RCS Photographic Collection conform to the National Council on Archives' Rules for the construction of personal, place and corporate names. Location names have been taken from the Getty Thesaurus of Geographical Names. The descriptions in this collection conform to the International Council on Archives' General International Standard on Archival Description (ISAD(G)).
  2. Anthropological ancestors (http://www.dspace.cam.ac.uk/handle/1810/25): From Social Anthropology. High value video content of unique interviews with anthropologists made by Alan MacFarlane, with good DC metadata. Free and most visited. MPEG4 format container containing QuickTime videos. Does not use controlled vocabulary for metadata.
  3. The CamRAD Rock Art images. Sample metadata at http://www.dspace.cam.ac.uk/handle/1810/111?mode=full shows geographical coordinates (latitude, longitude, altitude, bearing, inclination) that would permit a nice mashup with Google Earth.
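As a sketch of the Google Earth mashup mentioned for the Rock Art collection, the coordinates in an item's metadata could be written out as KML placemarks. The function names and the sample values in the usage note are invented for illustration; only the KML element layout (longitude first inside coordinates) is standard.

```python
from xml.sax.saxutils import escape

def kml_placemark(name, latitude, longitude, altitude=0.0):
    """Return one KML Placemark element as a string.

    KML orders coordinates as longitude,latitude,altitude.
    """
    return (
        "<Placemark>"
        f"<name>{escape(name)}</name>"
        f"<Point><coordinates>{longitude},{latitude},{altitude}</coordinates></Point>"
        "</Placemark>"
    )

def kml_document(placemarks):
    """Wrap placemark strings in a minimal KML document."""
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<kml xmlns="http://www.opengis.net/kml/2.2"><Document>'
        + "".join(placemarks)
        + "</Document></kml>"
    )
```

For example, kml_document([kml_placemark("Panel A", 54.0, -1.5, 300.0)]) yields a file that Google Earth can open; bearing and inclination have no direct equivalent in a simple Point and are left out of this sketch.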

Content description is textual only; no domain-specific controlled vocabularies are used.

OAI-PMH can pull out all metadata - just give appropriate identifier in URL.
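A sketch of building such a request URL. The verb, identifier and metadataPrefix parameters are the standard OAI-PMH GetRecord arguments, and the base URL is the OAI endpoint recorded at the end of these notes; the identifier format shown in the usage note is an assumption about how this repository names its records.

```python
from urllib.parse import urlencode

# OAI endpoint as recorded in these notes.
BASE_URL = "http://www.dspace.cam.ac.uk/dspace-oai/request"

def get_record_url(identifier, metadata_prefix="oai_dc", base_url=BASE_URL):
    """Build an OAI-PMH GetRecord request URL for one item."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"
```

Usage, with a hypothetical identifier: get_record_url("oai:www.dspace.cam.ac.uk:1810/111"). The oai_dc prefix requests unqualified Dublin Core, which every OAI-PMH repository is required to support.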


Identifiers

Handles (http://www.handle.net/) are used throughout DSpace. Tom has a problem with this - CNRI reinvention is worse than the original wheel ("What if URLs go away" was an insufficient justification - what if CNRI goes away? Perhaps more likely!)

Tom would like URLs to convey information about what they represent, rather than being semantically meaningless numbers; although that runs counter to URL philosophy, it would be useful to human users. It is not counter to URI philosophy as long as programs are not expected to interpret the internal structure of URIs, and the form of allowable URI is not constrained.

Would like mapping from DSpace URL to title made URL-friendly (as for WordPress blog; http://wordpress.org/).

[NB WordPress has new information about images and tagging them: http://codex.wordpress.org/Using_Images]
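A sketch of the title-to-URL mapping Tom describes, in the style of blog "slugs". The friendly_url pattern (handle followed by a slug) is hypothetical and not something DSpace offers; the slug is purely decorative for human readers, so the handle still identifies the item.

```python
import re
import unicodedata

def slugify(title):
    """Turn a title into a lowercase, hyphen-separated ASCII slug."""
    # Decompose accented characters, then drop anything non-ASCII.
    ascii_title = (
        unicodedata.normalize("NFKD", title)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Collapse every run of other characters into a single hyphen.
    return re.sub(r"[^a-z0-9]+", "-", ascii_title.lower()).strip("-")

def friendly_url(handle, title, base="http://www.dspace.cam.ac.uk/handle"):
    """Hypothetical URL form: the handle plus a readable slug."""
    return f"{base}/{handle}/{slugify(title)}"
```

Keeping the handle in the URL means a program never has to interpret the slug, which fits the point above about not relying on the internal structure of URIs.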

Need a clever URL resolver of the type that HP used to have on its website www.hp.com/go/keyword, that would resolve to the appropriate resource. [This does not now exist]


Smart searches and tagging

Tom would like Web2-type suggestion of keywords "You have typed this, would you like that as a keyword"

Problem: most Web2 stuff not accessible to disabled people e.g. blind people reliant on auditory interface.

Graham: DRUPAL does have this in keyword tagging field, as does del.icio.us (http://del.icio.us/). Also auto-completion using AJAX.
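A sketch of the suggestion logic behind such a tagging field: given what the user has typed so far, offer existing keywords that start with it, most-used first. This is a simple prefix match written for illustration, not how Drupal or del.icio.us implement it, and the function name is hypothetical.

```python
from collections import Counter

def suggest_keywords(typed, existing_keywords, limit=5):
    """Suggest already-used keywords that start with the typed text."""
    prefix = typed.strip().lower()
    if not prefix:
        return []
    # Count how often each keyword has been used, ignoring case.
    counts = Counter(k.lower() for k in existing_keywords)
    matches = [k for k in counts if k.startswith(prefix) and k != prefix]
    # Most frequently used first, then alphabetical for stable output.
    matches.sort(key=lambda k: (-counts[k], k))
    return matches[:limit]
```

Steering depositors towards keywords that are already in use is one lightweight way to reduce the field ambiguity discussed earlier; the suggestions are plain text, so they could also be offered through a non-visual interface.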

MODES is a commercial image management application arising from the museum community that is specific for Windows, lightweight but does what it says - very good.


Additional things to check?

  • Z39.87 Technical metadata for still images
  • METS
  • Vernon - integrated catalog and image management system
  • DC Collection Description Application Profile

SMW Relations and Attributes (added later)

dia:OAI::http://www.dspace.cam.ac.uk/dspace-oai/request
