Meetings/20070615/DefiningImageAccess-ImperialCollege
Imperial College meeting to discuss images and data in repositories
Introduction - agenda
1. Introductions
2. Alexei Yavlinski presentation - view of image retrieval and local work - what is happening with image collections - integrative biology - confocal microscopy. (Tom - report on college position)
3. Ashok Argent-Katwala presentation on local repository for papers and bibliography.
4. Dolores Iorizzo - talk about various repositories at imperial
5. Tom Wheel - talks about Imperial's Oracle-based repository plans
6. Henry Rzepa, Omer Casher - talk about semantic technologies for chemical data, and the "SemanticEye" system for metadata capture
7. Matt Harvey - brief discussion of HPC/Grid perspective
8. Demonstration Microsoft Photosynth preview (downstairs)
Introductions
- Brian Fuchs - coordinating multiple projects
- Joao Magalhaes - dept computing - multimedia information systems and IR
- Alexei Yavlinski - multimedia info systems group - finished PhD - talking about image indexing & retrieval, and outline of group
- Dan Brickley
- Graham Klyne
- Jun Zhao
- Omer Casher - GSK clinical imaging centre, IT infrastructure and data management. IC/GSK/MRC @ Hammersmith Hospital - officially opened 2 days ago. Focus on deployment of Siemens data management system. Big challenge ... "Images without metadata are totally meaningless" (unlike documents that are self describing.) One fruit is Semantic Eye repository.
- Henry Rzepa - department of chemistry, Imperial. May 1994 "life changed for ever" - workshop at CERN - chemists used pictures; loss of information was/is enormous. With PMR, devised better ways to handle chemical data than exchanging pictures. "Chemical semantic web". Led to early data markup languages?
- Tom Wheel - representing college IT division - working on Oracle technology, with leeway to experiment. Looking at Oracle RDF store. Looking at splicing Web 2.0 applications into the server.
- Dolores Iorizzo
- Ashok Argent-Katwala - stochastic models, Markov chains, stochastic process algebras. Also needs to create collaborative systems. Made publications repository for department (spare time job!).
Alexei Yavlinski
"Image indexing and retrieval using automated annotation"
Approximate automated annotation, to gather "an idea of what's in an image".
MMIS group
- Founded by Stefan Rueger. Spans Imperial and OU KMI. Fast image retrieval (Peter Howarth)
- Statistical multimodal fusion for semantic multimedia c
- Geo
- Image and video browsing
- ...
(Stefan Rueger - was at Imperial but left to found Knowledge Media Institute (KMI))
Automated image annotation
Challenges of photo search
- Rapid growth of image content - flickr, WWW, personal collections, Youtube, ...
- Organizing photos: naming files, tagging.
- On WWW Google indexes more than 1Bn images, ALT tags rarely used, images opaque, indexed based on surrounding web page text. 26% ALT tags, 30% with more than 1 word.
Dolores: Have to do more statistical studies of image metadata in the wild (ALT tags, words, etc.)
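A study of the kind Dolores suggests could start from a simple count of ALT usage over a sample of pages. The following is only a minimal sketch using the Python standard library; the class and function names are made up for illustration, and it says nothing about how the figures quoted above were obtained.

```python
from html.parser import HTMLParser


class AltTagCounter(HTMLParser):
    """Counts <img> elements and how many carry usable alt text."""

    def __init__(self):
        super().__init__()
        self.total = 0        # all <img> elements seen
        self.with_alt = 0     # images with non-empty alt text
        self.multi_word = 0   # images whose alt text has more than one word

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        self.total += 1
        # The attribute value is None for a bare "alt" with no value.
        alt = (dict(attrs).get("alt") or "").strip()
        if alt:
            self.with_alt += 1
            if len(alt.split()) > 1:
                self.multi_word += 1


def alt_statistics(html):
    """Return (total images, images with alt text, images with multi-word alt)."""
    counter = AltTagCounter()
    counter.feed(html)
    return counter.total, counter.with_alt, counter.multi_word
```

Running alt_statistics over each page of a crawl and summing the three counts would give the proportions to compare against the figures above.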
Challenges: lack of relevant information, image remains opaque, lack of obvious descriptions, alternative method is highly desirable.
Shows example of idiosyncratic results.
Automated annotation as an alternative
- Unlabelled image, Bayesian statistical inference, multiple feature vectors (e.g. RGB histograms).
- Machine learning approach: training sets to tune model.
- Approaches to feature modelling: segmentation into regions, object detection. Alternative: global feature extraction - characteristic statistics.
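As an illustration of the "global feature extraction" alternative, a colour histogram over the whole image can serve as one feature vector. This is a sketch only, assuming pixels arrive as (r, g, b) tuples in the range 0..255; the function name and the choice of four bins per channel are assumptions, not details from the talk.

```python
def rgb_histogram(pixels, bins=4):
    """Return a normalised global colour histogram as a flat feature vector.

    pixels: iterable of (r, g, b) tuples with channel values in 0..255.
    The vector has 3 * bins entries: one histogram per channel, concatenated.
    """
    counts = [0] * (3 * bins)
    n = 0
    for pixel in pixels:
        for channel, value in enumerate(pixel):
            # Map 0..255 onto bucket indices 0..bins-1.
            bucket = min(value * bins // 256, bins - 1)
            counts[channel * bins + bucket] += 1
        n += 1
    if n == 0:
        raise ValueError("image has no pixels")
    return [c / n for c in counts]
```

No segmentation or object detection is involved: the vector summarises the image as a whole, which is what makes this family of features cheap to compute.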
Existing annotation methods
- "Region growing algorithm" to segment region. Look at summary properties of regions (dominant colour, texture, size, etc.) Calculate translation table to keywords, then use table to predict keywords.
- Problems: generic segmentation is unsolved problem. Better results are computationally expensive.
- Object detection: Train statistical classifier to recognize configuration of pixels that represents some kind of object. Then autocorrelation to detect selected objects in images.
- Drawbacks: large training set; manual markup of where objects are in images... "Label me" collaborative project at MIT. Sensitivity to orientation.
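For reference, the region-growing step mentioned above can be sketched as a breadth-first flood from a seed pixel. This is a generic textbook version on a greyscale grid, assuming 4-connectivity and a fixed tolerance around the seed value; it is not the specific algorithm shown in the presentation, and the function names are illustrative.

```python
from collections import deque


def grow_region(image, seed, tolerance):
    """Return the set of (row, col) pixels connected to seed whose intensity
    is within tolerance of the seed pixel's intensity (4-connectivity)."""
    rows, cols = len(image), len(image[0])
    seed_value = image[seed[0]][seed[1]]
    region = {seed}
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < rows and 0 <= nc < cols and (nr, nc) not in region:
                if abs(image[nr][nc] - seed_value) <= tolerance:
                    region.add((nr, nc))
                    queue.append((nr, nc))
    return region


def region_summary(image, region):
    """Summary properties of a region: its size and mean intensity."""
    values = [image[r][c] for r, c in region]
    return {"size": len(values), "mean": sum(values) / len(values)}
```

The summaries per region (here just size and mean intensity) are the kind of values that would feed the keyword translation table.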
Alternative approach
- Global feature extraction. Also takes account of the way images are taken (e.g. aesthetic concerns). Faster, less susceptible to illumination and pose variation.
- More limited with respect to working in face of contextual variation.
- Result of analysis is a probability distribution for words associated with an image - role for probabilistically applied tags in metadata.
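One way to arrive at such a probability distribution over words is a kernel density estimate per keyword combined with Bayes' rule. The sketch below assumes a Gaussian kernel, a hand-picked bandwidth and a single-label model; these are illustrative choices, not a description of the prototype's actual model.

```python
import math


def word_probabilities(features, training, bandwidth=0.2):
    """Estimate P(word | features) with a Gaussian kernel density per word.

    training: list of (feature_vector, set_of_words) pairs.
    Summing kernel weights over the images labelled with a word is
    proportional to P(features | word) * P(word); normalising over words
    gives a distribution whose values sum to 1.
    """
    scores = {}
    for vector, words in training:
        dist2 = sum((a - b) ** 2 for a, b in zip(features, vector))
        weight = math.exp(-dist2 / (2 * bandwidth ** 2))
        for word in words:
            scores[word] = scores.get(word, 0.0) + weight
    total = sum(scores.values())
    if total == 0:
        raise ValueError("no training evidence for this feature vector")
    return {word: score / total for word, score in scores.items()}
```

The returned dictionary could be stored directly as probabilistic tags: each word carries a weight rather than a yes/no judgement.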
Demo of prototype
Image searching engine (beholdsearch.com)
- Search for keywords by image analysis.
- Search by flickr tags
- Can combine searches with some interesting results (e.g. tagged with dog, looks like face: better than tagging alone at finding dog faces). Selection for search display based on ranking, not threshold.
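The combined search can be sketched as a tag filter followed by ranking on the classifier's score for a visual concept. The record layout ('id', 'tags', 'scores') and the function name are assumptions made for this example, not the structure used by the demo.

```python
def combined_search(images, tag, concept, top_k=3):
    """Images carrying a user tag, ranked by classifier score for a concept.

    images: list of dicts with 'id', 'tags' (set of str) and 'scores'
    (dict mapping a visual concept to a probability).
    The best top_k are returned; no score threshold is applied.
    """
    tagged = [img for img in images if tag in img["tags"]]
    ranked = sorted(tagged,
                    key=lambda img: img["scores"].get(concept, 0.0),
                    reverse=True)
    return [img["id"] for img in ranked[:top_k]]
```

Because selection is by rank, a query always returns the best available candidates even when every score is low, which a fixed threshold would not.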
Related work finds "vocabulary of visual features" whose probabilities constitute a feature vector for classification (as opposed to keywords suggested by people). Harder to train, but faster to classify.
Dan suggests that an image uploader (e.g. flickr uploader) can contain an image classifier, so that processing load is distributed among participants.
When searching on the web as a whole, problem with "anticipating negatives" in the training set.
Dan asks about whether images are considered independently, or take account of membership in collections.
Suggest: similar approach for detecting developmental stages in gene expression in Drosophila spermatogenesis. To talk about later.
Ashok Argent-Katwala
1500 publications, about half with attached PDF.
Was a drive for submission of bibliographic information to support RAE.
Home grown software.
As well as papers: presentations, extended versions.
This is a repository that works at the departmental level. Cite this as example of local needs (Dolores notes)
Dolores Iorizzo
List of "repositories" within the college.
- (see presentation from Dolores)
SHERPA Leap
Note how subject links show how library organizes materials within the college. Links to web sites in departments.
Biosciences -> Molecular biology databases -> external sources.
Jena - Macromolecules (!)
Library-maintained list of resources, vs publication of university materials?
Try search for "image repositories" in Imperial library site.
Look for "Image repository topics"
Tom Wheel speaks to Dolores notes ...
Oracle deployed primarily for payroll, is very expensive. Also want some value for academics. Robustness of Oracle, some working on clustering (RACs?), geographically split nodes, not very scalable (c. 6-7 nodes). Looking for technologies to distribute operations.
Heffalump trap: handling logs, bug lost data, managed to recover from logs. Oracle doesn't care about actual binary content (BLOBs). Dolores' "world of Thomas Aquinas" - one system for everyone. Doesn't agree: academics say library services must meet their requirements, not vice versa. Multiple interfaces, web services, database interfaces, XML loading, ...
For images: Oracle doesn't care. What is a worry: Oracle has reputation of being difficult (technically?). Tom picked up Oracle by working with the GUI tools: "Application Express", basic app development framework; local database can be interfaced to larger instances of Oracle database.
More interesting, outsiders find IC web site un-navigable. Looking at improved navigation, semantic searching. Have working prototypes of XML searching (Autonomy) and a "semantic navigator", based on subject classification schemes. Prototype system allowed similar (not synonymous) terms. Then can use search terms from different classification schemes. Mappings are currently made manually, which is not scalable - both subject scheme mappings and individual instance annotations. Would like library to be involved in the former, automate the latter.
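The idea behind the "semantic navigator" (a search term from one classification scheme also reaching similar terms in another) can be sketched as a lookup in a mapping table. The pair-list format and any term names used with it are hypothetical, and only one hop is followed, since the mapped terms are described as similar rather than synonymous.

```python
def expand_query(term, scheme_mappings):
    """Return the term plus every term directly mapped to it.

    scheme_mappings: iterable of (term_a, term_b) pairs asserting that two
    terms, possibly from different classification schemes, are similar.
    Similarity is treated as symmetric but is not followed transitively.
    """
    related = {term}
    for a, b in scheme_mappings:
        if a == term:
            related.add(b)
        elif b == term:
            related.add(a)
    return related
```

The mapping table is the part that is maintained by hand in the prototype; the lookup itself is trivial once the table exists.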
Henry: "Greasemonkey". Royal Society of Chemistry: Project Prospect is an implementation.
Oracle 10G has RDF repository. Not complete. 11G in beta is purported to be more complete RDF.
Omer: operating system? Solaris for big database. Experimental on Windows (not recommended) and Red Hat Enterprise Linux.
Dan: Named graphs? Yes, but by a complicated roundabout way.
Dan: SPARQL support? Oracle makes promises for 11G in the future, not sanguine that these will be fully realized.
Basing system on finance system technology is helpful with audit trail production when storing personal data and images.
Dolores: core facilities in Imperial for (medical?) imaging; collaboration with SKB. Worried about metadata requirements for multimedia content, hijacked by video interests. Wants to have metadata for media and text working together. (Paola Hobson's W3C group.) CISBIC (http://www3.imperial.ac.uk/cisbic) - specializes in medical images - fragmented knowledge within this group. Cross-disciplinary work; e.g. geolocating Charles Darwin's and mashup with modern sonar surveys.
Dolores: sees the potential of linking together the datasets from the Earth Science, Medical department and BioScience.
Richness of images: back to Dolores' notes - medical imaging, FILM (light confocal microscopy), protocols for creation not re-use.
I think, and this is reinforced by Dolores' comments: local groups will do their own thing, whatever senior management may say. The most productive way forward would be to work with this work rather than against it, and find ways to exploit this effort to wider benefit.
Is it appropriate to build a long-term preservation archive on closed software from a commercial organization (Oracle)?
Relation of image analysis to annotation: I see these as separate issues: analysis can create metadata, search/discovery based on metadata.
Henry Rzepa, Omer Casher
Henry - reflections on chemistry publishing
Imagine semantic wiki used for presentation. What do we in this meeting have in common. Written-up metadata could tell us about this.
Thesis: if available information is properly written up and documented, much knowledge
"Chemical machine vision": can we recognize molecules? Gabor Wavelets.
"Chemical datument" (data+document)
Emphasizes importance of both humans and machines as consumers of resources.
Teaching and research.
"Citizendium" - always have authentication for submission: full provenance trail.
New scientific discoveries? Mention Katy Wolstencroft work in using semantic reasoning to classify protein sequences, verify and detect errors in existing classifications.
Supercomputer workflows: automated publication of results through job submission/review interface.
SPECTRa digital repository project. Shibboleth federation for collaborative wiki. DSpace software used, but not an institutional repository. spectradspace.lib.ic.ac.uk:8443
Expose via DSpace and SPARQL. Building a SPARQL endpoint for ???
Mauveine: (www.ch.ic.ac.uk/wiki2/index.php/mauveine) semantically expressed article.
- Role of images for human interpretation (cf. "The digital image" symposium talks).
- enable detecting of inconsistent statements in the article by applying semantic reasoning
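A very small instance of detecting inconsistent statements: if a property is declared to take at most one value per subject, two different values for the same subject signal a conflict. This is only a sketch over plain (subject, property, value) tuples with invented identifiers; real semantic reasoning over an article would use an ontology and a reasoner.

```python
def find_conflicts(triples, functional_properties):
    """Report subjects that have more than one value for a property declared
    functional (at most one value per subject).

    Returns a dict mapping (subject, property) to the set of clashing values.
    """
    seen = {}
    for subject, prop, value in triples:
        if prop in functional_properties:
            seen.setdefault((subject, prop), set()).add(value)
    return {key: values for key, values in seen.items() if len(values) > 1}
```

For example, if the text of an article and its attached data each assert a different value for the same single-valued property of a compound, the pair shows up in the result.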
What makes an institutional repository
I think the following ideas from Henry's group's work are very interesting to our work:
- using DOIs to identify and provide references to these publications on the web pages.
- providing semantically expressed articles in order to infer inconsistent statements in the article.
- enthusiasm in using Semantic Web technologies to express their experimental data, link and verify these data, e.g. making a repository understandable not only for humans but also for computers
Omer - SemanticEye
"SemanticEye: A Semantic Web Application to Rationalize and Enhance Chemical Electronic Publishing" (http://dx.doi.org/10.1021/ci060139e)
RDF repository; several problem spaces: electronic journals, metadata from multiple imaging sources.
"Medical imaging research not done anywhere else in the pharma industry"
2nd largest GSK research investment. Siemens fMRI scanners.
Metadata embedded in DICOM images, and managed separately.
Aim to manage images and their metadata in a way similar to how iTunes manages MP3s.
Integration challenges - only way to address these challenges long-term is by standards.
(CLIC.)
High value content - drives learning curve of using sweb.
XMP, InChI, DOI, WebDAV, Sesame, SPARQL (Currently there is no support for retrieving publication metadata for a DOI, although this work is under way.)
RDF captured on creation of manuscript. Adobe XMP, but no publishers are publishing XMP files and it is not compatible with the metadata kept by Microsoft Word.
Omer "if you want to make $1M, solve part (b) of their diagram: capturing the metadata"
Dolores: how does this work with ePrints? Omer: see this as sitting "on top of" ePrints, handling just the metadata and referencing material in the repository.
("Semantic Eye Controller" - "agency" on diagram)
InChI provides domain-specific metadata.
SemanticEye ontology.
Advantages of "schema free" development (?)
Jun: SeRQL "constructs", performs very badly.
Basically, many similarities with our approach, but involves total aggregation of RDF metadata, and has no real notion of schema alignment, beyond use of SeRQL CONSTRUCT. Also dependence on Sesame for UI.
Main difference is maybe our attempt to work with what's out there
Also, translating all metadata to RDF, rather than just making it queryable
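The contrast between translating all metadata and just making it queryable can be sketched with plain tuples. Both functions below are illustrative assumptions: property_map stands in for whatever a SeRQL CONSTRUCT would express, and the names are not taken from SemanticEye or from our own code.

```python
def materialise(triples, property_map):
    """Translate everything up front: rewrite each mapped property into the
    target vocabulary and return the new set of triples."""
    return [(s, property_map.get(p, p), o) for s, p, o in triples]


def query(triples, prop, property_map):
    """Leave the data as it is: answer a query phrased in the target
    vocabulary by also matching source properties that map onto it."""
    sources = {src for src, target in property_map.items() if target == prop}
    sources.add(prop)
    return [(s, o) for s, p, o in triples if p in sources]
```

The first approach needs a full copy of the aggregated metadata; the second keeps the original records untouched and pays the mapping cost at query time.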
Gaps to be resolved.
Tools: OSCAR (an e-Thesis project with text mining and metadata management).
Dan mentions Exhibit from Simile - a kind of faceted browser, using an in-browser copy of the database.
Issues
- Information liberation - one prison to another (paper -> Acrobat)
- Applications don't share metadata
- e-theses, students capture data and metadata
- Tooling
- relationship between subject and institutional repositories
- DSpace is not semantic web enabled
- DSpace not designed for programmatic access
Matt Harvey
HPC perspective
Loss of context associated with data and results of HPC.
Portal (looks maybe like uportal) enforces context for a job. Enforces entry of metadata for access to HPC resource.
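The enforcement idea is simple to sketch: a job is only accepted if its required metadata fields are filled in. The field names, exception class and in-memory queue below are all hypothetical; the actual portal's fields and submission mechanism were not recorded in these notes.

```python
# Hypothetical set of fields a portal might insist on before accepting a job.
REQUIRED_FIELDS = ("project", "owner", "description")


class MissingMetadata(ValueError):
    """Raised when a job is submitted without its required context."""


def submit_job(command, metadata, queue):
    """Append the job to the queue only if every required field is non-empty."""
    missing = [field for field in REQUIRED_FIELDS
               if not str(metadata.get(field, "")).strip()]
    if missing:
        raise MissingMetadata("missing metadata: " + ", ".join(missing))
    record = {"command": command, "metadata": dict(metadata)}
    queue.append(record)
    return record
```

Since the metadata travels with the job record, results produced later can be traced back to the context supplied at submission.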
Hidden workflows support many other tasks.
Hmmm... without the carrot of HPC access, how to engage users? SABRE tries a different approach.
Photosynth preview demonstration
Photosynth takes a collection of overlapping photographs of some space and (presumably using photogrammetric autocalibration techniques, maybe in conjunction with GPS data) positions each one in 3D position and orientation. It then provides a very impressive user interface for navigating the 3D space of images thus created. The basic idea is not unlike Katie Portwin et al's quakr system, but (as befits an organization like Microsoft compared with a couple of people in their spare time over a few weekends) the technology is vastly more polished - in terms of both the 3D positioning technology and the user interface for navigating them.
The demonstration data was presumably "cleaned up", but impressive nonetheless.

