FlyWeb/ProjectPlan

FlyWeb Project Plan

FlyWeb: Data Web for Linking Laboratory Image Data with Repository Publications.

Project duration: 1 October 2007 to 31 May 2009.

JISC project page: [[[TBD]]]

This document: http://imageweb.zoo.ox.ac.uk/wiki/index.php?title=FlyWeb/ProjectPlan

Project plan submitted to the JISC on 28 Nov 2007: http://imageweb.zoo.ox.ac.uk/wiki/index.php?title=FlyWeb/ProjectPlan&oldid=5258 (See page history tab for subsequent changes.)

Building on the findings of our data web requirements analysis project, Defining Image Access, the purpose of the FlyWeb Project is to implement a proof-of-concept data web to integrate research image data from the FlyTED project with related data from publication repositories, the Berkeley Drosophila Genome Project, FlyBase (the database of Drosophila genomics data), and other sources. This work will draw on our collaboration with local Drosophila researchers to derive specific requirements and to evaluate results.

This project is funded by the JISC and follows on from our previous Defining Image Access work in the "Discovery to delivery" strand of the Repositories and Preservation Programme - see http://www.jisc.ac.uk/whatwedo/programmes/programme_rep_pres.aspx.


Overview of project

A copy of the proposal for this project can be seen at: http://bioimage.ontonet.org/moin/PublicResources?action=AttachFile&do=get&target=FlyWeb_proposal.pdf


Background

The Image Bioinformatics Research Group of the University of Oxford has recently (as of September 2007) completed the Defining Image Access Project, a six-month requirements analysis project (January to June 2007) funded from the Discovery to Delivery strand of the JISC Repositories and Preservation Programme. The purpose of that project was to investigate the feasibility of creating data webs that would permit subject-specific search integration of institutional repository image collections using Semantic Web techniques.

During the Defining Image Access Project, we also developed an image repository based on EPrints for the Drosophila gene expression images created by our departmental research colleagues. While evaluating this, these researchers raised the need to integrate their own images with images held in external repositories such as the Berkeley Drosophila Genome Project (http://www.fruitfly.org/) – in other words to create an image web for Drosophila gene expression images. The primary motivation for this follow-on Project is to meet that articulated need.

Conclusions and recommendations of the Defining Image Access Project

The conclusions of the Defining Image Access Project Final Report, and five of its recommendations, R6 to R10, are all of direct relevance to FlyWeb (see http://imageweb.zoo.ox.ac.uk/drupal/files/DIA_FinalReport_2007-08-17.pdf).

The gist of these recommendations is:

  • Building an Information Environment Architecture as an overlay on the Web, with lightweight service-oriented architectures that employ the Web as the primary underlying platform, with use of Semantic Web and Web 2.0 technologies where appropriate.
  • Ensuring that the Application Profile for Images is not limited to Dublin Core/FRBR-type metadata, but also includes metadata for IPR, file structure, versioning, provenance and, most importantly from our viewpoint, semantic metadata describing the content, meaning and significance of the images.
  • Seeking innovative interactions between existing JISC projects related to research data and images, in particular to engage researchers and research tool developers, in addition to members of the repository development and library communities.
  • Creation of SPARQL endpoints for all commonly used institutional repository software and related data management systems.
  • Pursuing projects that assist researchers in capturing descriptive information about research datasets (i.e. domain-specific metadata) as early as possible in their workflows in ways that enhance existing research practices, and that facilitate the submission of such semantically enhanced research datasets to open access repositories. These can in turn lead to systems that promote accessibility to and reuse of research data, through interoperability between institutional repositories and third-party data resources, and that enhance the links between research publications and the primary datasets upon which they are based.

What is a data web?

As a potential solution to the problem of locating data scattered across heterogeneous resources, we have proposed the development of subject-specific data webs (http://imageweb.zoo.ox.ac.uk/wiki/index.php, http://www.rin.ac.uk/data-webs) that use the Web as their native platform and enable integrated access to images or other datasets relating to these particular subjects. Within each data web, loosely coupled software tools will be used to combine metadata describing research images in a manner that permits discovery and provide links back to the original data sources to allow data delivery.

More information is in the proposal for this project (http://bioimage.ontonet.org/moin/PublicResources?action=AttachFile&do=get&target=FlyWeb_proposal.pdf).

Aims and objectives

Our work in this project is to implement a proof-of-concept data web to integrate research data resulting from gene expression experiments in the fruit fly Drosophila melanogaster with related data, drawing on our related work with local Drosophila researchers to derive specific requirements and to evaluate results.


Motivation

Our image web work is immediately motivated by problems in post-genomic life science research, though the technologies used are quite generic and could be applied in other domains.

Gene expression experiments involve creating microscopic images of parts of organisms to which genetic techniques have been applied to make visible those regions in which a chosen gene is active (expressed). This information, in turn, helps researchers to understand how the genetic mechanisms interact with other life processes to guide the development of an organism, for example by showing how gene expression patterns vary between different mutations of a given reference organism. Producing these in situ gene expression images can be very time consuming, and a single image might be used in different lines of research. We anticipate that in the near future, such images generated in the course of a research project will be published online with sufficient metadata to interpret the image (e.g. organism, genetic strain, observed gene, developmental stage, etc.).

A researcher exploring factors causing sterility in Drosophila may create a number of in situ images of gene expression in the testes, with a primary goal of studying sperm development. Another researcher may be interested in distribution patterns of gene expression products within a cell for the study of internal cell transport mechanisms, for which some of the same images might contain useful information. But how is such a researcher to discover that the images even exist? Our Image Data Web aims to support using a single search operation to find, say, images of expression patterns for aly genes in Drosophila melanogaster that may be stored in various repositories with different but overlapping descriptive metadata. This requires mechanisms to access and partially match metadata from different image sources, and to locate those images whose associated metadata meet some given criteria.

We have a further motivating example from the study of cilia evolution in various organisms. Here we anticipate that annotations (e.g. captured in Connotea) of published literature and electron micrographs can be correlated with gene sequences from various organisms with distinctive cilia development, in order to identify evolutionary genetic developments that affect cilia development.

Overall approach

Our project is predicated on the notion that by using widely deployed web software components and ideas, we can short-circuit many of the development complexities that increase the cost and hinder the deployment of many information-sharing systems. Several highly developed and work-hardened Semantic Web software tools are now becoming available, in addition to the many widely used and scalable web server and content management systems. We aim to focus our design efforts on information design and software selection rather than software design. We aim to carry out a user-led design, with research user evaluation and feedback throughout cycles of iterative development.

Some specific tools we have identified as being relevant to creating an image data web include:

  • Jena and Joseki (SPARQL implementation)
  • DARQ (a distributed query handler for Jena; this is a relatively immature software tool, and may prove to be less robust than we need).
  • D2R server (SPARQL endpoint for relational data)
  • mSpace, JSpace, Exhibit (semantic web information browsers)

We also expect to write some new tools to handle semantic integration of information from diverse sources, probably using Python or a similar programming language for rapid and flexible exploration of ideas. We aim to use ‘agile’ development techniques, developing simple testable prototypes early and expanding their functionality incrementally by iterative improvement.

Project outputs

  • Image data web software tools
  • Documentation for using the software tools
  • Documentation for deploying, configuring, maintaining and extending the software tools
  • Software tool testing framework
  • Web site
  • An understanding of the requirements for data alignment
  • Use cases for the demonstration data web
  • Publications

Project outcomes

We aim for a deployed image data web that is used and valued by our Drosophila research partners, built using technical elements that can readily be adapted for other research application areas, within and beyond life science research. We aim to learn and publish lessons concerning use of semantic web technologies in a research setting.

Beyond the data integration tools created by this project, we aim to set the stage for exploration of issues related to open data publication:

  • how can peer review be applied to published data and, more generally, how can unconstrained open access publication be reconciled with requirements for content quality assurance?
  • can we promote an environment in which published papers (which are fundamentally rhetorical in nature) are underpinned by published data (which are more objectively descriptive)?
  • to what extent can open data publication underpin new forms of in silico research in which new knowledge is derived from extant published data?

Critical success factors

Critical success factors include:

  • Creation of SPARQL adapters for the various information sources accessed
  • Selection and adaptation of semantic browsers to provide user-friendly access to the SPARQL-adapted sources
  • Creation of a user-friendly interface for constructing queries to select desired information from multiple sources
  • Creation of tools to align information from multiple data sources
  • Selection and adaptation, or creation, of tools to distribute a query to multiple information sources
  • Research user acceptance of the tools we provide
  • Capture and understanding of user requirements based on feedback from using demonstrators
  • Publication and dissemination of lessons from our mechanisms for exposing research observation data

Stakeholder analysis

Stakeholder | Interest / stake | Importance
Researchers using images of Drosophila gene expression | Ability to easily locate images and related information stored in a range of institutional repositories and other sources, with sufficient supporting information to properly interpret their content | Allowing new lines of research based on existing published observations and related information from various sources; reducing future research costs by re-using hard-won observational data
Researchers publishing images of Drosophila gene expression | Online publication of images | Facilitating re-use of observations for the independent verification of conclusions drawn, and as a basis for additional lines of research, enhancing the value and reputation of the original work. This project aims to enable the use by researchers of published images in conjunction with institutional repositories and a variety of other information sources.
Institutional repository managers and operators | Providing an effective and economic service that is responsive to the needs of research users; creating a framework for long-term preservation of research image data. | Improving the cost/benefit in repository provision; enhancing institutional visibility by providing additional access to valued resources. The particular significance of this project is to enable the use by researchers of institutional repository sources as part of a wider information landscape.
Journal publishers | Linking journal publications to underlying data | Increasing the value to researchers of consulting and citing journal sources

With reference to stakeholders, it was observed during the Defining Image Access project that not everything happens in isolation. Value may be derived by working across narrow interests. More generally, the ability to create new facilities (e.g. mashups) may be of value to a wider audience - as-yet unidentified stakeholders developing new and emerging applications.

Risk analysis

Identified risk factors include the following:

Risk Probability Severity Description and mitigation
Staffing - recruitment and retention Med High The project is very much dependent on recruiting a competent project manager and developer who has enthusiasm for the ideas we aim to develop. We are exploring a number of alternative plans that we might adopt if our present round of recruitment does not yield a suitable addition to our team.

Once the project is under way, using our web facilities to capture and lock in knowledge gained as we conduct the project reduces the risk of losing all associated knowledge in the event of departure of a team member. The agile development approach will lock in knowledge and successes in the form of unit tests and software versioning, while pair-programming will promote sharing of expertise and understanding.

Organisational Low Low The core team is small and co-located, so we don't anticipate organizational problems there. We are dependent on input from our Drosophila research partners, and to gain this we are working to involve them early in the project and to be responsive to their interests and concerns. We already control most of the technical resources we need to conduct this project, other than those (e.g. network, power, etc.) whose failure would have implications far more serious than for just our project.
External suppliers Low Low Identifying and applying externally developed software systems is central to our development strategy. We have already identified a number of open source systems that appear to provide the functionality we need. We already have hands-on experience of using many of these. We are most exposed in the area of software systems for displaying and browsing complex, open-ended experimental data: so far we have worked with two externally developed "semantic browser" systems (JSpace and Exhibit), both of which provide significant capabilities but also have significant shortcomings. If these fail to be adequate to deliver all or part of our goals, and if we do not identify any other more capable software components, we may have to fall back on in-house development of a more primitive user interface based on conventional web technologies.
Legal Low High We don't anticipate any legally related risks. We will, for the most part, be working with information that our partners choose to make publicly available, we are using open source software for most of our project activities, and it is our goal that the results of our work will not be subject to any confidentiality agreements. There is no foreseeable risk to life, limb or well-being of any person in the work we are proposing to undertake.
Office move Med Low There is a foreseeable possibility that we will be required to relocate some or all of our office accommodation during the course of this project, which could cost us 2-4 weeks of lost productivity. We are already planning to move our servers to a shared space in our department. By comparison, relocating people and personal facilities should be relatively easy.
Equipment failure Low Low We are planning to use available equipment (mainly computers and network connections) for this project, though our budget has some allowance for reliability-enhancing upgrades or replacements. Failure of any of the key systems could expose us to delays or worse. The main web server is running on a virtual machine, which we have already proven can be transferred to another available host with minimal loss or disruption to information, though there would be some loss of productivity in such a transition. The server is automatically backed up daily (though we have not yet attempted a restoration from these backups). Facilities provided by the other main project server (mainly version management and ticketing software) could reasonably easily be re-installed on another available server, or to a rented virtual system provided by Oxford University Computing Services.
Agile programming Med Low While we believe an agile approach is particularly appropriate for use in an academic environment, we have limited experience of agile programming techniques. The project advisor (GK) has some experience of agile software development, but mostly on smaller-scale projects; other team members lack or are likely to lack such experience. In mitigation, we have budgeted and planned for all project developers to participate in a training course on agile programming techniques, and we will endeavour to form links with the local agile programming interest group. Our experiences in this area may in themselves create valuable lessons for the JISC and the academic software development community. Ultimately, we have the option to fall back to more conventional development scheduling.

In advance of project commencement, the greatest uncertainty introduced by the agile development approach is understanding how to ensure that we have some meaningful mechanism to assess progress when the nature of this approach makes it difficult to set measurable development targets for project tasks in the distant future. This risk is mitigated by having a sufficiently small project team that affords reasonable visibility of goals and actual progress to all members. (See also the detailed planning section of this plan.)

Technical - Underestimation of technical challenges Med Med The six months of requirements analysis undertaken during the Defining Image Access Project have prepared us to undertake this follow-on project to build a demonstrator data web, so we believe that the project as described is feasible. Adoption of the incremental development approach will create a usable system with limited capabilities relatively early in the project, which will then be progressively enhanced. Since we will rely substantially on proven third-party Web and Semantic Web tools such as Jena/Joseki to do the "heavy lifting", we do not anticipate having to write major software applications.

The greatest technical challenges are distributed query (addressed below) and semantic alignment of heterogeneous data sources. In both cases, our strategy will be to start with simple, easily analyzed cases and build up from there, focusing on specific user requirements that we encounter. This allows us to avoid solving a fully general alignment problem. There is a significant body of existing research, from communities including database, semantic data integration and ontology alignment, that we intend to draw upon for developing schema alignment mechanisms.

Technical - Scalability of RDF stores Med Low RDF databases are known to be less efficient than relational databases, which have been honed by decades of development. We do not anticipate this will be a problem in the demonstrator, since the volumes of stored RDF data will be relatively low. In production data webs, most storage of RDF data will be localized to the distributed data sources, so the problem of searching across a single massive integrated RDF triplestore will be avoided.
Technical - Reliance on external tools Med Med The principle of our software development is to reuse existing tools as much as possible, to reduce development costs. Such tools include semantic faceted browsing tools (e.g. jSpace, Exhibit, etc.), software frameworks (e.g. D2R, Joseki, etc.) and RDF databases. Potential problems with relying on these tools include incompatibilities between them, their performance and robustness when dealing with large amounts of data, and their software sustainability. We have been building close collaborations with the tool providers mentioned above, who are keen to receive feedback from real users. We expect to maintain responsive communication with these tool providers about any technical challenges throughout the duration of software development. Loose coupling in our own system design reduces dependency on any single tool.
Technical - Distributed query performance Med Med This is known to be a hard problem to solve in a general fashion. Our lightweight web approach will start with simple, easily analyzed queries for which scalable solutions are easily developed, and will build incrementally from this point. We will aim to exploit existing work from both semantic web and relational database communities in the development of our solutions.
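
As a rough illustration of the "start simple" strategy for distributed query and semantic alignment described in the risk table above, the following Python sketch sends one set of search criteria to two toy sources whose records use different field names, and aligns the results to a shared vocabulary. Everything in it (the source names, field names, records, and the `align` and `search` helpers) is hypothetical and invented for illustration. It is not the FlyWeb implementation, and real sources would be queried over HTTP rather than held in memory.

```python
"""Sketch: send one query to several sources and align their metadata.

Every source name, field name, and record below is hypothetical.
"""

# Two toy "repositories" describing images with different field names.
SOURCE_A = [
    {"id": "a1", "gene": "aly", "organism": "Drosophila melanogaster"},
    {"id": "a2", "gene": "comr", "organism": "Drosophila melanogaster"},
]
SOURCE_B = [
    {"image": "b7", "gene_symbol": "aly", "species": "Drosophila melanogaster"},
    {"image": "b8", "gene_symbol": "aly", "species": "Drosophila simulans"},
]
SOURCES = {"source_a": SOURCE_A, "source_b": SOURCE_B}

# Per-source mapping from local field names to a shared vocabulary.
ALIGNMENT = {
    "source_a": {"id": "image_id", "gene": "gene", "organism": "organism"},
    "source_b": {"image": "image_id", "gene_symbol": "gene", "species": "organism"},
}


def align(record, mapping):
    """Rename the fields of one record into the shared vocabulary."""
    return {mapping[key]: value for key, value in record.items() if key in mapping}


def search(sources, criteria):
    """Apply the same criteria to every source and merge the matches."""
    matches = []
    for name, records in sources.items():
        for record in records:
            aligned = align(record, ALIGNMENT[name])
            if all(aligned.get(key) == value for key, value in criteria.items()):
                matches.append({"source": name, **aligned})
    return matches


results = search(SOURCES, {"gene": "aly", "organism": "Drosophila melanogaster"})
for hit in results:
    print(hit["source"], hit["image_id"])
```

Keeping the alignment as a per-source mapping, separate from the query logic, means a new source can be added by supplying one more mapping, without solving a fully general alignment problem first.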

Standards

It is our firm intention that image data webs will be based to the maximum extent possible on open standards, and in particular we aim to maximize our use of widely deployed web standards.

Some specific standards that we expect to figure in our analysis include:

Specification | Version | Notes
HTTP | 1.1 | http://www.ietf.org/rfc/rfc2616.txt
XML | 1.1 | http://www.w3.org/TR/REC-xml/, http://www.w3.org/TR/xml11/
RDF(S) | 1.0 | http://www.w3.org/2001/sw/RDFCore/ (links to specification set)
OWL | 1.0 | http://www.w3.org/2004/OWL/
SPARQL | - | http://www.w3.org/TR/rdf-sparql-query/
ATOM | RFC 4287 | Atom Syndication Format - see http://www.intertwingly.net/wiki/pie/FrontPage
APP | RFC 5023 | Atom Publishing Protocol
OAI-PMH | 2.0 | http://www.openarchives.org/OAI/openarchivesprotocol.html
METS | 1.6 | The Metadata Encoding and Transmission Standard - http://www.loc.gov/standards/mets/, http://www.loc.gov/standards/mets/METS%20Documentation%20final%20070930%20msw.pdf
DC | - | Dublin Core - http://dublincore.org/
SWAP | - | Scholarly Works Application Profile of Dublin Core: http://www.ariadne.ac.uk/issue50/allinson-et-al/

Technical development

For software components that we develop, we intend to use an agile development approach based on Extreme Programming (XP), some basic tenets of which are (a) incremental development in small demonstrable steps, and (b) use of automated unit tests to capture required functionality and avoid regression when code is changed.
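
To make tenet (b) concrete, here is a minimal sketch of an automated unit test using Python's standard `unittest` module. The helper `clean_label` and its expected behaviour are assumptions invented for this example and are not part of any FlyWeb component.

```python
import unittest


def clean_label(text):
    """Hypothetical helper: collapse runs of whitespace in a metadata label."""
    return " ".join(text.split())


class CleanLabelTest(unittest.TestCase):
    """Tests that record the required behaviour and guard against regression."""

    def test_collapses_internal_whitespace(self):
        self.assertEqual(
            clean_label("Drosophila   melanogaster"), "Drosophila melanogaster"
        )

    def test_strips_leading_and_trailing_whitespace(self):
        self.assertEqual(clean_label("  testis \n"), "testis")


# Run the tests explicitly so the sketch can also be executed as a plain script.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(CleanLabelTest)
outcome = unittest.TextTestRunner(verbosity=0).run(suite)
```

Each small development step would add or extend tests like these before changing the code, so that a later change that breaks existing behaviour is caught immediately.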

Our strategy of using existing tools wherever possible strongly suggests a software environment that combines programs written in a variety of programming languages, which in turn suggests that components communicate by some form of inter-process communication, of which the most prominent form will be HTTP. In this way, we aim to allow any component to be easily replaced by a new component that performs the same functions, and maybe more, in improved fashion.

For software components that we develop ourselves, we anticipate using a lightweight "scripting" language like Python or Ruby, for which there exist highly developed web protocol communication and server libraries.
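
As a small sketch of what HTTP-based communication between components could look like, the following Python builds a SPARQL query request as an HTTP GET URL, using the `query` parameter defined by the SPARQL protocol. The endpoint address, the `ex:` vocabulary, and the `sparql_get_url` helper are all hypothetical. The sketch only constructs the URL and does not contact any server.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical endpoint; any SPARQL-protocol service could stand in here.
ENDPOINT = "http://example.org/sparql"

# Hypothetical vocabulary (ex:), used only to give the query a shape.
QUERY = """
PREFIX ex: <http://example.org/vocab#>
SELECT ?image WHERE {
  ?image ex:gene "aly" ;
         ex:organism "Drosophila melanogaster" .
}
"""


def sparql_get_url(endpoint, query):
    """Build a GET URL carrying the query in the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})


url = sparql_get_url(ENDPOINT, QUERY)
print(url)

# Decoding the URL again recovers the original query text.
decoded = parse_qs(urlsplit(url).query)["query"][0]
print(decoded == QUERY)
```

Because the request is just a URL, a component written in any language can issue it, and the service behind the endpoint can be swapped for another that accepts the same request.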

Intellectual property rights

We intend that this project should create information and software tools that will be freely available for the benefit of the UK academic community. Our purpose is to facilitate sharing of academic research data, a goal that is likely to be compromised if the tools to achieve this are not themselves open and freely sharable.

We intend that any software development will be open-sourced.

An as-yet unanswered question is whether we should prefer a "viral" open source licence like the GNU General Public License (GPL), or a more flexible form of licence that permits closed-source derivative works. Fundamentally, we don't want licensing concerns to get in the way of our goal of improving the sharing of research images. The form of licence may be determined by the choice of software used to build an image data web. In considering the possible licences, we will consult with JISC's OSS-Watch advisory service, with whom we have good contacts. The conclusions of such consideration will be included in our final report.

Project Resources

Project Staff

David Shotton (Principal Investigator).

Graham Klyne (Project Advisor).

Jun Zhao (Research Officer).

Project Manager, to be recruited.

All are members of the Image Bioinformatics Research Group, Department of Zoology, University of Oxford, and have email addresses of the form <firstname.lastname@zoo.ox.ac.uk>.

Postal address: Department of Zoology, University of Oxford, South Parks Road, Oxford OX1 3PS, United Kingdom.

Drosophila and Other Research Partners

We will be working closely with Drosophila gene function researchers in the Zoology department at Oxford University, notably Helen White-Cooper's group investigating gene expression in Drosophila testes, with whom we have a well-established relationship.

We will be working to expand our links with Drosophila gene researchers at Imperial College.

We will also strengthen and expand our informal links with other Semantic Web software developers, through contacts with HP Labs in Bristol, ILRT also in Bristol, contacts through Oxford's Semantic Web interest group (with participants from the University and local technology companies) and the growing Semantic Web research presence in Oxford University's computing laboratory.

Project management

This is a project undertaken by a small core team, so formalization of reporting and decisions is unlikely to be helpful. Where progress is not achieved by team consensus, definitive direction will come from the PI; all core team members are contractually reporting to the PI.

The framework for conduct and management of the project is intended to facilitate free and open communication between the core project team and the consultant partners using Web-based tools. To this end, we start with a dedicated system (http://imageweb.zoo.ox.ac.uk/) whose role is to capture and communicate information about the project itself and the topics of investigation. Initial facilities provided for this purpose include a semantic wiki, a mailing list system with web-accessible message archives, a Subversion repository and a TRAC ticketing and issue tracking system. These facilities may be expanded as needs arise.

The main project will be structured around a number of development, or release, cycles, typically of up to a month's duration. The functionality to be addressed in each cycle will be decided at the commencement of that cycle, taking into account the progress in previous cycles and user feedback about the developments delivered to date. The ticketing system will be used to record goals and achievements, and will provide the basis for recording progress toward intended goals.

Training needs consist primarily of all team developers attending the same agile development course, so that we all start with a shared understanding or expectation of how the development process can work for us.

For project control and recording purposes, the detailed project plan has a number of milestones at defined dates. In keeping with the agile development approach, we don't attempt to fix in advance exactly what functionality is intended to be implemented at each milestone, as this is expected to be determined as the project progresses. Rather, we set some very broad schedule expectations, will assign targets to milestones as the project progresses, and will use the issue tracking system to record the extent to which those targets are satisfied.

Programme support

JISC Programme Manager: Balviar Notay (mailto:b.notay@jisc.ac.uk)

This project has been conceived from a position of exploiting common and widely deployed Web technologies to improve access to research image data. We are aware that there is much work by members of the JISC community that addresses access to image data, emphasizing database and library system technologies, with which we have become familiar through our Defining Image Access project work. We aim to use Web techniques to improve access to these more traditional repository systems, and we will look to devise approaches for exposing them to a wider web audience.

No areas of formal support from the JISC Programme Manager have been identified.

Members of the project team are also taking part in activities to create an Image Application Profile for Dublin Core (http://www.ukoln.ac.uk/repositories/digirep/index/Images_Application_Profile), and involvement in this activity will be considered to be an element of this project's work.

Budget

Start date: 1-Oct-2007, Duration: 20 months

Project costs | Oct 07 - Jul 08 | Aug 08 - May 09 | Total
Staff | £67,150 | £48,160 | £115,310
Travel | £1,490 | £900 | £2,390
Consumables | £760 | £940 | £1,700
Hardware | £7,000 | - | £7,000
Training | £3,600 | - | £3,600
Directly Incurred | £80,000 | £50,000 | £130,000
PI allocation | £2,726 | £2,832 | £5,558
Estates | £24,730 | £15,834 | £40,564
Indirect costs | £59,787 | £38,280 | £98,067
TOTAL | £167,243 | £106,946 | £274,189

Detailed project planning

To support the intended agile development approach, milestones have been separated from work package descriptions: typically, each milestone will represent a crystallization of activities in more than one work package. We need to maintain flexibility to evolve detailed functional requirements as the project progresses. The work package duration and task breakdown details are intended as an initial guide indicating where we expect the bulk of effort to be directed as the project progresses; detailed project scheduling and tracking will be through the assignment of issue tickets to each milestone.

Summary

A PDF file with this diagram can be found at http://imageweb.zoo.ox.ac.uk/drupal/files/FlyWeb-ProjectPlan.pdf.

Summary of FlyWeb project work packages, milestones and approximate schedule

Milestones

Project management and administration

Milestone Date Responsibility Description
PM1 2007-11-30 GK Project plan in wiki
PM2 2007-11-30 DS (pp GK) Initial project plan submitted to JISC
PM3 2007-11-30 GK Technical infrastructure (Wiki, TRAC, Subversion, etc) operational
PM4 2007-12-31 GK Initial development milestone details and targets recorded in TRAC
PM5 2008-07-31 (est.) DS Submission of interim progress report to JISC
PM6 2009-04-30 DS Submit draft of final report to JISC
PM7 2009-05-31 DS Submit completion report to the JISC
PM8 2009-05-31 DS Submit final report to the JISC

Software development

Milestone Date Responsibility Description
SM1 2007-12-31 GK Initial development phase targets in TRAC
SM2 2008-02-29  ?? Initial development phase implemented and tested; TRAC updated for next phase.
SM3 2008-03-31  ?? Development phase 2 implemented and tested; TRAC updated for next phase.
SM4 2008-04-30  ?? Development phase 3 implemented and tested; TRAC updated for next phase.
SM5 2008-05-31  ?? Development phase 4 implemented and tested; TRAC updated for next phase.
SM6 2008-06-30  ?? Development phase 5 implemented and tested; TRAC updated for next phase.
SM7 2008-07-31  ?? Development phase 6 implemented and tested; TRAC updated for next phase.
SM8 2008-08-31  ?? Development phase 7 implemented and tested; TRAC updated for next phase.
SM9 2008-09-30  ?? Development phase 8 implemented and tested; TRAC updated for next phase.
SM10 2008-10-31  ?? Development phase 9 implemented and tested; TRAC updated for next phase.
SM11 2008-11-30  ?? Development phase 10 implemented and tested; TRAC updated for next phase.
SM12 2008-12-31  ?? Development phase 11 implemented and tested; TRAC updated for next phase.
SM13 2009-01-31  ?? Development phase 12 implemented and tested; TRAC updated for final phase.
SM14 2009-02-28  ?? Software development finalized and stabilized: final functionality implemented and tested.
SM15 2009-03-31  ?? Software development installation kit and documentation finalized for general distribution. (Software kits will be released throughout the development, but this milestone represents finalization of the distribution package. This may involve transitioning to a public open source environment, such as SourceForge, if not already done.)

Dissemination

Milestone Date Responsibility Description
DM1 2008-05-31 JZ Submit image ingest and metadata query work from WP2 for publication (e.g. in DLib journal)
DM2 2008-07-31  ?? Publish or submit for publication work on the query rewriting from WP3
DM3 2008-08-31  ?? Publish or submit for publication work on aligning FlyTED and BDGP to satisfy real use cases from WP4 and WP5
DM4 2008-09-30  ?? Publish or submit for publication work on aligning FlyTED/BDGP and FlyBase from WP4 and WP5
DM5 2008-10-30  ?? Publish or submit for publication work on aligning FlyTED/BDGP/FlyBase, PubMed and PubMed Central from WP4 and WP5

Workpackages

Note that dates given below are intended to convey a sense of the developing project focus, and are not intended to be used as project control - for project control, refer to the milestones described above.


WP1: Project management and administration

Infrastructure for project activities, communication between project partners and visibility of progress with respect to intended goals.

Activity Start End Who Outputs
(a) Project plan 2007-10 2007-11 GK, JZ Prepare project plan
(b) Project management 2007-10 2009-05 GK, ?? Ongoing at 1 day/week.
(c) Project infrastructure 2007-10 2007-11 GK We intend to continue using the Defining Image Access wiki for internal project notes, and the existing group web calendar, and will also set up a Subversion repository and TRAC issue tracking system.
(d) Related work survey 2007-10 2008-12 ALL Previous Defining Image Access work provides the basis of survey work that underpins this project, but we will need to maintain ongoing awareness of new standards (e.g. Image Profile work) and software systems that may facilitate our work. Ongoing at 1 day/week.
(e) Images app profile 2007-10 2008-02 DS, GK, JZ Participation in, and review of, work in the JISC Images Application Profile working group (http://www.ukoln.ac.uk/repositories/digirep/index/Images_Application_Profile)
(f) Agile Development Training 2008-01 2008-02 GK, JZ, ?? All developers should attend the same course (TBD) so that we start with some common understanding of the agile development process and procedures.
(g) Prepare interim progress report 2008-06 2008-07 DS, ?? Preparation of progress report for the JISC.
(h) Evaluation 2009-01 2009-04  ?? Evaluate results - collect and organize user feedback recorded throughout the project
(i) Draft project report 2009-04 2009-05  ?? Prepare initial draft of final report. Circulate for review by project partners. Collect feedback and update draft report as appropriate.
(j) Draft completion report 2009-05 2009-05  ?? Draft completion report for the JISC

WP2: SPARQL endpoint for OAI-PMH

Building a SPARQL endpoint over an EPrints repository using Joseki/Jena toolkits.

Activity Start End Who Outputs
(a) Develop EPrints harvester 2007-10 2008-03 JZ A harvester that collects metadata from EPrints through the OAI-PMH protocol, uploads it to the Jena RDF repository using the Jena model loader, and provides SPARQL access using Joseki. The initial development will be based on batch upload and focused closely on our specific metadata, but later iterations will be more generic, and will use the SPARQL update protocol or other mechanisms to perform incremental updates to the Joseki database.
(b) METS-based image ingest for EPrints 2008-03 2008-05 JZ An improved image ingest script that associates images with domain-specific metadata using the METS schema. This is a necessary step prior to building a generic local harvester script.
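
As an illustration of the harvesting step in activity (a), the following minimal sketch converts Dublin Core records from an OAI-PMH ListRecords response into N-Triples lines of the kind that could then be loaded into an RDF store. It is not the project's Jena/Joseki implementation: Python is used purely for illustration, the sample response and its identifier (oai:example.org:42) are invented, and a real harvester would fetch responses over HTTP and handle resumption tokens.

```python
import xml.etree.ElementTree as ET

# XML namespaces defined by the OAI-PMH 2.0 and Dublin Core specifications.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}
DC_BASE = "http://purl.org/dc/elements/1.1/"

# Hypothetical ListRecords response (invented for this sketch).
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:42</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example expression image</dc:title>
          <dc:creator>A. Researcher</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def escape_literal(text):
    """Escape a string for use inside an N-Triples literal."""
    return (text.replace("\\", "\\\\").replace('"', '\\"')
                .replace("\n", "\\n").replace("\r", "\\r"))


def records_to_ntriples(xml_text):
    """Convert oai_dc records in a ListRecords response to N-Triples lines."""
    root = ET.fromstring(xml_text)
    lines = []
    for record in root.iterfind(".//oai:record", NS):
        identifier = record.findtext("oai:header/oai:identifier", namespaces=NS)
        dc_root = record.find("oai:metadata/oai_dc:dc", NS)
        if identifier is None or dc_root is None:
            continue  # e.g. deleted records carry a header but no metadata
        for element in dc_root:
            # Tags look like "{namespace}localname"; keep Dublin Core only.
            if not element.tag.startswith("{" + DC_BASE + "}"):
                continue
            local_name = element.tag.split("}", 1)[1]
            value = (element.text or "").strip()
            if value:
                lines.append('<%s> <%s%s> "%s" .' % (
                    identifier, DC_BASE, local_name, escape_literal(value)))
    return lines


for line in records_to_ntriples(SAMPLE_RESPONSE):
    print(line)
```

The sketch uses the OAI identifier directly as the RDF subject; whether that, or a repository page URL, is the better subject is a design decision left open here.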

WP3: Accessing other data sources

Provide at least two types of services to create SPARQL endpoints over other data sources: through metadata harvesting and translation, and by SPARQL query rewriting.

Activity Start End Who Outputs
(a) BDGP D2R server 2008-01 2008-02 JZ Set up and publish a locally hosted SPARQL endpoint for the Berkeley Drosophila Genome Project (BDGP) data resource using D2R server.
(b) Survey other data sources 2007-12 2008-05  ?? Produce a document that summarizes the access mechanisms for the data and metadata resources of BDGP, FlyBase, PubMed and PubMed Central
(c) Access other data sources 2008-03 2008-11  ?? Develop a software framework for accessing the FlyBase, PubMed and PubMed Central resources using query rewriting and/or local harvesting, as appropriate.
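
To make the idea of SPARQL query rewriting concrete, here is a deliberately simple sketch that replaces IRIs from a shared schema with the corresponding IRIs used by one data source. All IRIs in the mapping are hypothetical, and Python is used only for illustration. The approach is purely textual: it does not handle prefixed names, and it would also rewrite a mapped IRI that happened to appear inside a string literal, so a production service would need a proper SPARQL parser.

```python
import re

# Hypothetical mapping from shared-schema terms to one source's own terms.
CORE_TO_SOURCE = {
    "http://example.org/flyweb-core#geneName":
        "http://example.org/source#gene_symbol",
    "http://example.org/flyweb-core#depicts":
        "http://example.org/source#image_of",
}

# Matches an IRI written in angle brackets, e.g. <http://example.org/x>.
IRI_PATTERN = re.compile(r"<([^<>\s]+)>")


def rewrite_query(query, mapping):
    """Return the query with every mapped IRI replaced by its target IRI."""
    def substitute(match):
        iri = match.group(1)
        # Unmapped IRIs are passed through unchanged.
        return "<%s>" % mapping.get(iri, iri)
    return IRI_PATTERN.sub(substitute, query)


original = ("SELECT ?g WHERE "
            "{ ?img <http://example.org/flyweb-core#geneName> ?g }")
print(rewrite_query(original, CORE_TO_SOURCE))
```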

WP4: Core data web services

Implement the schema alignment service, co-reference service and query handling service for each of the distributed data resources.

Activity Start End Who Outputs
(a) Coreference service 2008-01 2008-11  ?? Design and implement a service that aligns the data identities from different resources
(b) SPARQL query service 2008-01 2008-11  ?? Design and implement a service that performs SPARQL queries over resources on the data web
(c) FlyTED and BDGP alignment: initial implementation 2008-02 2008-04  ?? Design, implement and evaluate services and software framework to align FlyTED and BDGP
(d) FlyTED and BDGP alignment: refine implementation 2008-05 2008-06 Identify unsatisfied requirements and iterative refinements for aligning FlyTED and BDGP
(e) FlyBase alignment: implementation 2008-05 2008-08  ?? Design, implement, evaluate and refine services and software framework to align FlyTED, BDGP and FlyBase
(f) PubMed alignment: implementation 2008-08 2008-12  ?? Design, implement, evaluate and refine services and software framework to align FlyTED/BDGP/FlyBase, PubMed and PubMed Central
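
One possible core for the co-reference service in activity (a) is an index that groups identifiers asserted to denote the same entity. The sketch below uses a union-find structure for this. It is an assumption about how such a service might work, not a description of the project's design, and the identifiers in the example are invented.

```python
class CoreferenceIndex:
    """Groups identifiers that are asserted to denote the same entity."""

    def __init__(self):
        # Maps each identifier to its parent in a union-find forest.
        self._parent = {}

    def _find(self, identifier):
        """Return the representative identifier of the group."""
        self._parent.setdefault(identifier, identifier)
        root = identifier
        while self._parent[root] != root:
            root = self._parent[root]
        # Path compression: point every visited node directly at the root.
        while self._parent[identifier] != root:
            next_id = self._parent[identifier]
            self._parent[identifier] = root
            identifier = next_id
        return root

    def add_equivalence(self, first, second):
        """Record that two identifiers refer to the same entity."""
        root_first, root_second = self._find(first), self._find(second)
        if root_first != root_second:
            self._parent[root_second] = root_first

    def equivalents(self, identifier):
        """Return all identifiers known to be equivalent, sorted."""
        root = self._find(identifier)
        return sorted(i for i in list(self._parent) if self._find(i) == root)


index = CoreferenceIndex()
# Hypothetical identifiers, invented for this example.
index.add_equivalence("flyted:gene/1", "bdgp:CG0001")
index.add_equivalence("bdgp:CG0001", "flybase:FBgn0000001")
print(index.equivalents("flyted:gene/1"))
```

Because equivalence is transitive, the third identifier is grouped with the first even though the two were never linked directly.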

WP5: Core schema and mapping rules

Based on an understanding of our Drosophila-relevant data sources, create a core schema to be used as the basis for aligning the individual schemas from each data source, and define the schema mapping and identity co-reference rules to link assertions from these diverse data resources.

Activity Start End Who Outputs
(a) Initial core schema 2008-01 2008-03  ?? Create an initial core schema to describe images held in FlyTED and publications, leading to a first draft schema in RDF and OWL formats.
(b) FlyTED and BDGP alignment: requirements analysis 2008-01 2008-03  ?? Gather initial requirements and use cases for aligning resources from FlyTED and BDGP
(c) FlyTED/BDGP mapping definition 2008-01 2008-03  ?? Define the initial mapping from the FlyTED and BDGP schemas to FlyWeb Core.
(d) FlyBase alignment: requirements analysis 2008-04 2008-06  ?? Gather requirements and use cases for aligning resources from FlyTED, BDGP and FlyBase
(e) FlyBase schema mapping definition 2008-04 2008-08  ?? Define a mapping from FlyBase to FlyWeb Core, updating FlyWeb core schema as necessary, leading to a revised FlyWeb Core schema and mapping rules from FlyBase.
(f) PubMed alignment: requirements analysis 2008-07 2008-09  ?? Gather requirements and use cases for aligning resources from FlyTED/BDGP/FlyBase, PubMed and PubMed Central
(g) PubMed schema mapping definition 2008-07 2008-11  ?? Define a mapping from PubMed and PubMed Central to FlyWeb Core, updating FlyWeb core schema as necessary, leading to a revised FlyWeb Core schema and mapping rules from PubMed and PubMed Central.

WP6: Search, browse and other services

Explore adopting existing searching and browsing tools from the Semantic Web community for use with FlyWeb and, if time permits, explore adopting existing tools to enrich annotations of images on the FlyWeb.

Activity Start End Who Outputs
(a) Browse FlyTED and BDGP 2008-03 2008-06  ?? Explore use of available faceted browser and data display tools for displaying information from FlyTED and BDGP
(b) Extend browsing to cover FlyBase 2008-06 2008-09  ?? Extend use of faceted browser and data display tools to additionally display information from FlyBase
(c) Extend browsing to cover PubMed 2008-09 2008-12  ?? Extend use of faceted browser and data display tools to additionally display information from PubMed and PubMed Central.
(d) Search facilities 2008-06 2009-03  ?? Explore using existing semantic search facilities, such as Sindice, SemanticBank, Swoogle, etc., to perform document searches over the data resources on the FlyWeb. This should complement the database-like queries provided by the SPARQL query service package.
(e) Link data to journal publications 2008-11 2009-03  ?? Explore using Connotea to annotate publications in order to build a link between local research data and journal publications. Experiment with publishing a SPARQL endpoint of Connotea annotations and/or local EndPoint bibliography resources.

WP7: Dissemination activities

Disseminate project activities on the local wiki, calendar and mailing list, and publish project findings in national/international conference proceedings and journals. These dissemination activities will not only share our findings with peer communities, but also ensure regular communication between our local team members in support of agile project management practice.

Activity Start End Who Outputs
(a) Project ongoing activities 2007-10-01 2009-05-31 Team Publish project ongoing activities regularly in the project wiki and/or tracking system
(b) Project plans, designs, evaluation results, etc on the project wiki 2007-10-01 2009-05-31 Team Publish any project plans and milestone details and progress against milestones on the project wiki and/or tracking system.
(c) Project meetings 2007-10-01 2009-05-31  ?? Minute project meetings monthly on the project wiki. We'll aim to hold minuted meetings regularly, as circumstances dictate, and at least monthly.
(d) Technical results 2007-10-01 2009-05-31 Team Publish any technical designs, evaluation results, etc. on the project wiki
(e) Publications 2007-10-01 2009-05-31 Team Publish significant project findings and results as scientific papers, conference papers.
(f) Informal presentations 2007-10-01 2009-05-31 Team Disseminate project findings to local/national/international research communities, through informal presentations to partners, colleagues, user groups, etc.

Evaluation plan

The main output from this project is a demonstration software system and supporting documentation. We also aim to publish papers about key elements of our work.

Factor to Evaluate Questions to Address Method(s) Measure of Success
Software development Did we create a useful and functional software system? Examine feature requests in TRAC history Successful implementation of a significant number of user-requested features. (Significant here needs some interpretation.)
Software usability Can the software be installed and used easily by others? Try out software with a new user Can a new user, ab initio, successfully install and use the software using instructions provided?
Software stability Is the software stable and usable in the longer term? Can the components used be upgraded independently while maintaining functionality of the system as a whole? Survey interfaces between components. Consider real experience of upgrading component elements. The best indicator here is real experience of upgrading or replacing a system component: was this achieved with no or minimal changes to other components? Are the interfaces mostly based on simple HTTP requests? Can services be explored using just a browser, or cURL?
Use of existing software Did we manage to make extensive use of available software tools? Summary of functionality from existing tools vs new implementation Have we provided significant functionality without developing major software components? Could the functionality of anything we implemented be provided by an existing component? If so, were there germane reasons we did not use that?
Adaptability to new application areas Have we produced a system that is adaptable to handle new applications beyond the original demonstration exemplar? Review of technical design, identify elements that are specific to exemplar application. Assess effort applied that is readily re-applied in relation to effort that is specific to exemplar.
Consistency with JISC strategies Do our proposals advance JISC strategies for repositories? Review the extent to which our demonstrator works with existing JISC systems; gather feedback from other developers in the JISC programmes about the ease of integrating our ideas with existing systems, or evolving existing systems to take advantage of our work. Assess likely level of effort required to integrate existing JISC systems with other web information sources.
Peer-reviewed publication Have we gained peer acceptance for our approach? Acceptance of peer-reviewed conference or journal publications may indicate that our ideas are being perceived as useful developments. Note acceptance of conference papers, posters, and journal publications.


Quality plan

Agile development makes extensive use of automated tests, especially unit tests that verify the function of isolated portions of software, and automated or semi-automated integration tests that verify performance of the system as a whole. The intent is that each software release will always pass all current unit and integration tests - this gives an assurance of basic code quality.

We also need to work to ensure that unit tests capture the actual user requirements in a live deployment. This can be addressed in a number of ways, including:

  • record any problems that are noted in integration testing (which presumes all unit tests are passed), and try to devise unit tests that detect these problems.
  • capture stated user requirements as testable assertions, and in turn create integration and/or unit tests to test these assertions.
  • capture all user performance feedback and enhancement requests in a ticketing system, and keep track of the number of such tickets that have not been positively resolved.
  • develop unit tests to validate the target installation environment as well as the code under development.
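
As a sketch of the last point, unit tests can check the installation environment in the same way that they check code. The example below uses Python's unittest module for illustration only; the minimum interpreter version and the list of required modules are assumptions made for the example, not actual project requirements.

```python
import importlib.util
import sys
import unittest

MINIMUM_PYTHON = (3, 6)  # assumed requirement, for illustration only
REQUIRED_MODULES = ["json", "xml.etree.ElementTree"]  # assumed dependencies


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except ModuleNotFoundError:
            # Raised when a parent package of a dotted name is absent.
            found = False
        if not found:
            missing.append(name)
    return missing


class EnvironmentTest(unittest.TestCase):
    """Validates the target installation environment, not the code itself."""

    def test_python_version(self):
        self.assertGreaterEqual(sys.version_info[:2], MINIMUM_PYTHON)

    def test_required_modules_present(self):
        self.assertEqual(missing_modules(REQUIRED_MODULES), [])
```

Saved as a test module, this can be run with python -m unittest alongside the ordinary unit tests, so that a failed installation is reported in the same way as a code defect.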

Documentation quality is a factor in overall system quality: when the system does effectively solve a user requirement, it is also important that a user can easily learn how to do this. We will use a wiki to capture information about how to solve particular problems or achieve particular goals using the delivered software; the key here is to make capturing information as easy as possible. For users who are not comfortable using a wiki, we can also provide other routes (e.g. a mailing list) for users to easily submit descriptions that we can transfer to a wiki.

The intent is that software releases will be small and frequent, providing plenty of opportunity for constructive user feedback.

Dissemination plan

Project outcomes will be disseminated by the following means:

Timing Dissemination activity Audience Purpose Key message
Project lifetime User interactions Users Providing tangible benefits to users, who will hopefully tell others Demonstrable utility of our ideas and software
Project lifetime Peer-reviewed publication Technology developers and research scientists Communicating our technical ideas to other developers, and promoting use of web-based data publication and discovery tools to research scientists. Detail technical aspects of our use of Web systems, describe general mechanisms for open data publication on the web, report user feedback from use of our systems.
Project lifetime, and beyond Web site Users, software designers, anyone else interested Allowing others to understand our goals, approach and achievements. Project documentation and communication of experiences
Project lifetime, and beyond Public repository Users, software designers, anyone else interested Allowing others to replicate our results Published software, ontologies and data
End of project Final report JISC, developers, operators, users Guidance for use of our work, indications of when it may be applicable, and suggestions for future development work Summary of project activities, results and recommendations; references to supporting documentation, software and other resources.
Future Developed software Repository operators and users Delivery of working data webs Working software building upon the results of this project's work


Exit and sustainability plans

Take-up and embedding of project outputs:

Project outputs Action for take-up & embedding Action for exit
(a) Final report A final report and supporting materials will provide a basis for helping others to understand the nature of our work, its benefits and disadvantages, and how to apply it to different application areas. Publication of the final report and supporting materials
(b) Software tools and design patterns Release of new software tools and compositions with existing tools that can be used as a basis for providing data access and discovery functionality in a number of environments, and from diverse web information sources. Web distribution of our software, test cases, developer documentation and user documentation.
(c) Service usage models Publication of patterns or recipes for using our data web ideas in a range of different contexts, and providing indications or contraindications of when this approach may be applicable. Publication of Service Usage Models (SUM) documentation [detail ... JISC repository?]
(d) Reviewed publications Communicating relevant results to wider developer or user communities Conference and/or journal publication of specific technical results, developments and experiences
(e) Standards feedback Incorporation of relevant experiences into standards development Feed back relevant experiences to standards development bodies

Project outputs that may have potential to live on and continue development after the project ends:

Project outputs Why sustainable Scenarios for taking forward Issues to address
(a) Web site containing resources and information pertaining to institutional repositories, repository access and metadata aggregation tools, and proposals and plans for implementing an image data web. Our wiki web site will be taken over from a previous JISC project (Defining Image Access), which has already been acknowledged as a useful resource concerning the topics surveyed, and refined and expanded to capture our expanding knowledge and experience. As such, it will provide a place for capturing information and usage notes about the software we develop and the tools we use. The tools chosen are all designed to support user contribution of content, with relatively little input required from a supervising "webmaster". The site itself is hosted on a VMware virtual machine, allowing it to be moved relatively easily to a new hosting environment. Other projects continue to use the same web site, or, failing that, transfer of web site hosting to new ownership Organization of the content could be improved, especially as its range grows. We are taking action to generally improve our research group's web presence, which will take account of ease of access to a range of published information.
(b) Image data web implementation The software tools we develop will initially focus on our exemplar application of Drosophila gene expression images and related information, but the ideas should be applicable in other applications. We are already exploring possible applications in the areas of animal behaviour, and cilia development and associated abnormalities.

The ultimate goal of our work is to deploy systems that improve the capabilities and productivity of life science researchers, particularly where the key scientific observations used are images, videos, or similar.

Development of generalized software and hostable services that are easily accessed and used by researchers to publish their image data and associated information, and place such publication in a wider context of web-published information.

In connection with this, we are also working on a BBSRC-funded project to create a complementary laboratory data management system that integrates semantic data management into laboratory research, both by facilitating access to semantic data on the web (such as would be provided by a data web), and by capturing research images and supporting metadata in a form suitable for publication to an image data web.

We have also adapted a conventional repository system (EPrints) for publication of Drosophila gene expression images and associated annotations, which is providing one of the key information sources for the current project.

This scenario demonstrates the wider context in which we aim to place our data web work, covering all phases of the research information life cycle from initial capture and discovery through to publication.

(No additional issues currently noted)
Impact on related JISC project work Promoting use of ideas and mechanisms that are already finding use in broader communities. Scenarios for taking forward Issues to address

Additional material - input from other image-related projects

People and projects it would be good to contact or monitor in the course of the project include:
