FlyData project


FlyData: Decision support and semantic organization of laboratory data in Drosophila gene expression experiments

The FlyData Project, funded by BBSRC (May 2007-Dec 2008), is being undertaken by Graham Klyne (Project Manager), Jun Zhao and David Shotton (PI) of the Image Bioinformatics Research Group of the University of Oxford. It has been conceived as a very focused development that provides tools to enhance the work of a specific group of researchers, and thereby helps us to better understand the requirements of post-genomic researchers in general when building semantic data management tools. The project does not aim to produce a generalized tool suitable for all kinds of lab-based research; rather, by developing a specific tool, it aims to move toward formulating and articulating the principles upon which such a generalized tool might be based.

We are building a laboratory information management and decision support system to handle experimental data from two funded BBSRC projects investigating gene expression in the fruit fly Drosophila melanogaster. The system will capture and manage primary laboratory data up to and including the in situ gene expression images, and will integrate third-party information from FlyBase and elsewhere, providing access to a comprehensive record of our research. The FlyData project will complement our ImageWeb work by developing systems to facilitate the capture of images from laboratory experiments, and related contextual data, in a fashion that supports their use during the course of a research study, and also makes them accessible for publication via one or more data webs. The initial work will be in support of research into gene expression in Drosophila spermatogenesis undertaken in the Oxford laboratory of Dr Helen White-Cooper.

The FlyData Project's purpose is as follows:

  • To organize all the laboratory data arising from our Drosophila gene expression experiments in semantically meaningful ways.
  • To relate them to online information.
  • To provide different semantic views into this multidimensional information using a standard web browser, by means of a set of carefully designed graphical user interfaces.
  • To record our research decisions and their provenance.
  • To automate the subsequent publication of our results to the public Drosophila Testis Gene Expression Database (http://www.fly-ted.org).

The range of data that we need to manage includes:

  • Affymetrix array gene expression results and their analyses.
  • Annotations of selected genes with data from FlyBase (http://www.flybase.org) and the literature, and reasons for selecting genes for further study.
  • PCR primer design criteria and sequences.
  • Real-time PCR results and conclusions drawn.
  • In situ hybridization probe sequences and in situ images.
  • Expert interpretations of the in situ images made at the microscope.

We are using an agile programming approach, focusing on the most immediate needs of our research biologists, with a view to iterative enhancement to meet further requirements. By responding to researchers’ needs, this approach will crystallize the requirements for such a system, and provide a significant step toward creating a more generic system for use in other systems biology projects that may have different detailed requirements and data structures. We shall use available lightweight open-source tools as far as possible, directing our activities to meeting biological researchers’ key requirements rather than constructing a complex software system.

We hope our work will seed the creation of an open source laboratory image and information management system and mutually-supporting community of users and developers.


Project goals

These goals reflect revisions made in light of developments since the original proposal:

  • To improve the efficiency of capturing primary laboratory data and metadata flowing from investigations of gene expression in Drosophila, prior to and including the acquisition of annotated images showing mRNA in situ expression patterns and the localization patterns of GFP-tagged proteins.
  • To combine locally gathered information with external sources, such as the Affymetrix database relating array location to gene identity, and FlyBase information about the identified genes.
  • To support decision-making upon which research progress depends (for example, in the choice of genes to select for in situ gene expression imaging) by providing the ability to view all current data in integrated ways.
  • To provide a more comprehensive research record, supporting review of key decisions, by recording supporting data and provenance (i.e. who made what data entry, annotation or decision, when and why).
  • To improve the sharing of research data within our research group (enabling better integration of individuals' results into a common "collective lab consciousness"), sharing with colleagues in other institutions, and facilitating eventual dissemination in published reports and databases. This is supported by collecting related information into a semantically organized environment, employing a Drosophila data ontology. In particular:
    • to facilitate dispatch of images showing GFP-tagged protein localization patterns and their supporting metadata to collaborators in Cambridge who have supplied us with flies bearing the GFP constructs,
    • to fulfil our goal of sharing our research results with the scientific community through the public-facing Drosophila Testis Gene Expression Database (Fly-TED), by automating the export of selected images and their supporting metadata to that repository,
    • to capture and present metadata needed to link research data sets and other published materials (cf. FlyWeb proposal).
  • To lay groundwork for future projects that address data handling requirements of a wider post-genomic research and systems biology community; for example, enhancements for use with Arabidopsis or Trypanosoma, which have different requirements and data structures.

Project focus

The project aims to develop and evaluate tools covering the following areas:

  • User requirements
  • Data modelling (schema and/or ontology)
  • Data acquisition and provenance (from multiple sources, including researchers' preferred tools)
  • Data presentation using Semantic Web standards
  • Responsive web-based user interface

Possible additional areas:

  • Data annotation, with provenance

Functional goals

  • Web-based system: primary access using a Web Browser
  • Assisting with data entry and/or ingest tasks
  • Easy access to related data (needs more detail)
  • Search over multiple facets of complex, loosely structured data
  • Viewing selected gene data in a single integrated interface
  • Automated or assisted publication of data
  • Support for LSIDs to identify all relevant entities (http://lsids.sourceforge.net/)
  • Supporting evolution of Fly-TED to support publication of more complete data

Data sources

Currently, an investigator must integrate information about a gene arising from:

  • DNA array assays
  • Real-time PCR measurements
  • In situ hybridization images
  • Purported gene identity for each microarray probe and other data from the Affymetrix microarray database
  • Annotated genomics information from FlyBase
  • PubMed literature references from NCBI
  • Other on-line information

Components and capabilities

Interoperability will be supported by providing data mappings to RDF, and using an OWL ontology to describe the diverse information gathered by FlyData. Data can be exported to other systems using existing Semantic Web tools and technologies.
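
As an illustration of what a data mapping to RDF might look like, the following minimal sketch serializes one gene record as N-Triples using only the Python standard library. The namespace URIs, the property names and the gene identifier are all hypothetical placeholders, since the actual DRODO vocabulary is not given on this page, and the code is written for current Python 3 rather than the 2.5 release mentioned elsewhere on this page.

```python
from urllib.parse import quote

# Hypothetical namespaces: the real DRODO and FlyData URIs are not given here.
DRODO = "http://example.org/drodo#"
GENE_BASE = "http://example.org/flydata/gene/"


def escape_literal(value):
    """Escape a string for use inside an N-Triples literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def gene_to_ntriples(gene_id, properties):
    """Return N-Triples lines describing one gene record.

    `properties` maps a (hypothetical) property name to a string value.
    """
    subject = "<%s%s>" % (GENE_BASE, quote(gene_id, safe=""))
    lines = []
    for name in sorted(properties):
        predicate = "<%s%s>" % (DRODO, name)
        literal = '"%s"' % escape_literal(properties[name])
        lines.append("%s %s %s ." % (subject, predicate, literal))
    return lines


if __name__ == "__main__":
    # Illustrative record only, not real FlyData content.
    record = {"expressedIn": "primary spermatocytes", "expressionLevel": "high"}
    for line in gene_to_ntriples("gene_A", record):
        print(line)
```

Output in this form can be loaded by most RDF tools, which is one way the export to other Semantic Web systems could be approached.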

We will also provide tools to populate the public-facing Drosophila Testis Gene Expression Database (Fly-TED) from data recorded by FlyData.

The FlyData system will comprise the following components and capabilities:

  • An extensible ontology, the Drosophila Data Ontology (DRODO), that will describe the Drosophila experiments and the data they produce. This ontology will underpin our data structures.
  • Access to the system via standard web browsers and tools
  • User interfaces designed to assist biological researchers with input and exploration of data and annotations managed by FlyData.
  • Links between our laboratory data and relevant on-line information from FlyBase and other online data sources.
  • The ability to export data to RDF (via SPARQL?)

The system will be constructed as a lightweight framework of software components, loosely coupled and with low maintenance costs, for user interaction and data storage and retrieval, each dealing with data formally or informally related to DRODO.

FlyData stands in contrast to many other excellent web bioinformatics systems by being focussed on the organization of local laboratory results in relation to external data.

We will provide facilities to review the stored data. The details of such access will be based on researchers’ ongoing and evolving needs. Initial discussions suggest interrogating the data along a number of semantic dimensions, to identify and compare genes sharing common properties; e.g., to find all genes that:

  • are expressed in the same anatomical region of the testis, for example only in the primary spermatocytes;
  • have high, medium or low degrees of wild-type expression;
  • are up-regulated or down-regulated in the strains relative to the wild type, for instance being at about twice the wild type level in the aly meiotic arrest mutant strain;
  • have discordant results from the independent Affymetrix arrays and rt-PCR measurements of their gene expression;
  • have unusual spatial expression patterns, such as comets and cups (Fig.1);
  • are spatially adjacent in the genome, for example on the short arm of Chromosome 4;
  • are duplicates or pseudogenes;
  • have homologues of known function in other model organisms;
  • have translation products that share particular protein domains, such as membrane-spanning alpha helices;
  • have translation products that are known to interact, for example as members of the same signalling pathway;

or certain combinations of these features.
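
Queries of this kind amount to combining independent filters ("facets") over the gene records. The sketch below shows the idea in plain Python under stated assumptions: the record fields, the facet functions and the three gene records are invented for illustration and are not FlyData data or its actual data model.

```python
# Made-up records for illustration only; these are not FlyData results.
GENES = [
    {"id": "gene_A", "region": "primary spermatocytes",
     "array_change": "up", "pcr_change": "up"},
    {"id": "gene_B", "region": "primary spermatocytes",
     "array_change": "up", "pcr_change": "down"},
    {"id": "gene_C", "region": "spermatids",
     "array_change": "down", "pcr_change": "up"},
]


def expressed_in(region):
    """Facet factory: gene is expressed in the named anatomical region."""
    return lambda gene: gene["region"] == region


def discordant(gene):
    """Facet: array and real-time PCR results disagree."""
    return gene["array_change"] != gene["pcr_change"]


def select(genes, *facets):
    """Return the ids of genes that satisfy every supplied facet."""
    return [g["id"] for g in genes if all(f(g) for f in facets)]


if __name__ == "__main__":
    # Genes in primary spermatocytes whose array and PCR results disagree.
    print(select(GENES, expressed_in("primary spermatocytes"), discordant))
```

Because each facet is an ordinary predicate, new dimensions (genomic position, protein domains, and so on) could be added without changing `select`.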

Finally, an integrated view into all these data for each single gene will enable users to see and relate these data in a single convenient interface.

Further work

Questions

  • Initial phase requirements?
  • How to store raw data and images?
  • How to store metadata, and other ontologically structured information?

Project activity and planning

See https://milos2.zoo.ox.ac.uk/trac/FlyData/wiki/WikiStart

Project meetings

Technical notes

Software development approach

We will undertake lightweight software development, and will adopt an agile programming approach (http://en.wikipedia.org/wiki/Agile_programming) of doing the simple things first and building incrementally. The functionality will be defined by test cases agreed with the biological researchers. User requirements will be captured in the form of test cases, and automated regression testing will be used to confirm that subsequent developments of the system continue to satisfy all the identified requirements.
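
To make the idea of requirements captured as test cases concrete, here is a minimal sketch using Python's standard `unittest` module. The requirement chosen (every research decision records who made it, when and why) is taken from the project goals, but the `record_decision` helper and its field names are hypothetical and do not describe the actual FlyData code.

```python
import unittest
from datetime import datetime, timezone


def record_decision(gene_id, decision, who, why, when=None):
    """Hypothetical helper: build a decision record carrying its provenance."""
    if not who or not why:
        raise ValueError("a decision needs both an author and a reason")
    return {
        "gene": gene_id,
        "decision": decision,
        "who": who,
        "why": why,
        "when": when or datetime.now(timezone.utc),
    }


class DecisionProvenanceRequirement(unittest.TestCase):
    """Requirement: every research decision records who made it, when and why."""

    def test_decision_carries_provenance(self):
        rec = record_decision(
            "gene_A", "select for in situ", "researcher_1", "high array signal"
        )
        self.assertEqual(rec["who"], "researcher_1")
        self.assertEqual(rec["why"], "high array signal")
        self.assertIsInstance(rec["when"], datetime)

    def test_decision_without_reason_is_rejected(self):
        with self.assertRaises(ValueError):
            record_decision("gene_A", "select for in situ", "researcher_1", "")
```

Saved in a module, tests like these can be run with `python -m unittest`, and rerunning them after each change gives the automated regression check described above.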

Web application framework

A key element of this work will be to choose appropriate tools to create (and test) user interfaces for importing, inputting, editing and presenting a variety of data elements. We intend to employ a lightweight full-stack application framework (e.g. TurboGears (http://www.turbogears.org/), based on the Python programming language (http://www.python.org/), or Ruby on Rails (http://www.rubyonrails.org/)), to build the FlyData system. Experience also suggests that, while tools like these are very helpful, they may not be enough to maintain flexibility when weaving applications that combine data access and query, server side processing and dynamic web pages (dynamic HTML, or AJAX). The Links project (see below) suggests a complementary approach drawing on ideas from functional programming; Links introduces a new programming language for web applications.

TurboGears is an integration of "best of breed" Python components that creates a Model–View–Controller (MVC) framework (http://java.sun.com/blueprints/patterns/MVC.html), a successful design pattern for interfacing users with underlying data structures that decouples data access, application logic, user interactions and data presentation:

  • The Controller, CherryPy (http://www.cherrypy.org/), is the central organizing component of TurboGears, and provides a simple lightweight web server to deal with user interactions;
  • The Model in TurboGears is an object-relational mapper that works with its own internal relational database. However, the model component interface is quite visible, and it is very easy to use other data storage elements, or several in concert. We anticipate that simple files and web resources will form an important part of the FlyData model component;
  • View elements (Web pages) are provided by the Python templating language KID (http://kid.lesscode.org/language.html).
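
The separation of concerns that MVC provides can be sketched without any framework. The example below uses only the Python 3 standard library and deliberately does not use the TurboGears, CherryPy or KID APIs; the class names, the record layout and the page template are assumptions made for illustration.

```python
from html import escape
from string import Template


class GeneModel:
    """Model: owns the data. A dict stands in for a database or files."""

    def __init__(self, records):
        self._records = dict(records)

    def get(self, gene_id):
        return self._records.get(gene_id)


# View: turns a record into HTML and knows nothing about storage.
PAGE = Template("<h1>$gene_id</h1><p>Expressed in: $region</p>")


def render_gene(gene_id, record):
    return PAGE.substitute(
        gene_id=escape(gene_id), region=escape(record["region"])
    )


class GeneController:
    """Controller: maps a request onto model access and a view."""

    def __init__(self, model):
        self.model = model

    def show(self, gene_id):
        record = self.model.get(gene_id)
        if record is None:
            return 404, "<p>Unknown gene</p>"
        return 200, render_gene(gene_id, record)


if __name__ == "__main__":
    # Illustrative record only, not real FlyData content.
    model = GeneModel({"gene_A": {"region": "primary spermatocytes"}})
    status, body = GeneController(model).show("gene_A")
    print(status, body)
```

Swapping the dict for files or web resources would only touch `GeneModel`, which is the flexibility the model component is expected to offer.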

TurboGears also works with MochiKit (http://www.mochikit.com/), a lightweight but powerful AJAX client-side JavaScript library. AJAX (Asynchronous JavaScript And XML; http://en.wikipedia.org/wiki/AJAX) permits JavaScript in browsers to interact with a Web server in response to user input without refreshing the complete Web page, making it faster and smoother to display updated information, as users of Flickr, GMail or Google Maps will have experienced.

The Links project explores unified programming language design for use across model, controller and view elements of a web application, including code that runs in a browser. I would like to explore similar ideas through a metaprogramming approach (cf. Paul Graham in "Beating the Averages", http://www.paulgraham.com/avg.html). I have done some private work on unit-testing of asynchronous JavaScript browser code, and used a monad-like construction to assemble a complete test case from a number of asynchronous elements, so I think there is promise here.

Continuation and sustainability

We seek the creation of an Open Source laboratory data management and decision support system, which can be freely distributed to the BBSRC community and beyond. Our vision is for a lightweight system that is easily adapted to local requirements, encouraging a mutually supporting community of developers and users. Data representation using Web standards and formal ontology descriptions provides a sound basis for creating tools to facilitate data publication and exchange between research groups.

Software notes

Running the software

  • Install Python (currently 2.5.2 - should be able to use 2.4 or later)
  • Add Python scripts directory to the system PATH environment variable (e.g. C:\Dev\Python25\Scripts)
  • Install TurboGears, using tgsetup.py
  • Install SQLAlchemy and the SQLite Python interface:
    • easy_install SQLAlchemy
    • easy_install pysqlite
  • Download the SQLite3 libraries and command line program. For Windows, these are sqlitedll-3_x_x.zip and sqlite-3_x_x.zip.
    • http://www.sqlite.org/download.html
    • Extract .dll and .def file to Python DLL directory (e.g. C:\Dev\Python25\DLLs).
    • Optionally, extract command line utility to Python scripts directory (e.g. C:\Dev\Python25\Scripts)
  • Check out the software from subversion...
mkdir /usr/local/FlyData
cd /usr/local/FlyData
svn co svn+ssh://graham@milos1.zoo.ox.ac.uk/var/svn/ImageWeb/FlyData/Trunk/
  • You may need to change the server port number, and open that port in the firewall (reverse proxying via Apache, described below, is an alternative to opening a port through the firewall). On rodos.zoo.ox.ac.uk, port 8080 (the default for CherryPy) is used by FlyWeb/Tomcat, so I have changed the configuration to use port 8081.
    • Port configuration is in /usr/local/FlyData/Trunk/FlyDataServer/dev.cfg
  • You may need to change the FlyDataServer configuration to create new database tables:
    • Value LoadNewData in /usr/local/FlyData/Trunk/FlyDataServer/flydataserver/ConfigSettings.py controls the reloading of database tables.
    • Do this once only to create the tables, then reset the value to False, as it currently takes about 10 minutes to load new database tables.
  • To run the server, execute the file /usr/local/FlyData/Trunk/FlyDataServer/start-flydataserver.py, e.g.:
cd /usr/local/FlyData/Trunk/FlyDataServer
./start-flydataserver.py
    • Later, we will want to choose a way to start this automatically
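
Before settling on a port number for dev.cfg, it can help to check which local ports already have a listener. The following is a small sketch for that purpose, written for current Python 3 rather than the 2.5 release listed above; the function name and the choice of ports to probe are illustrative and not part of the FlyData software.

```python
import socket


def port_in_use(port, host="127.0.0.1", timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    # 8080 is the CherryPy default; 8081 is the alternative used on rodos.
    for port in (8080, 8081):
        state = "in use" if port_in_use(port) else "free"
        print("port %d is %s" % (port, state))
```

The same check, run after starting the server, gives a quick confirmation that it is listening on the configured port.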

Running under Apache with reverse proxying for password protected access

wget http://apache.webthing.com/mod_proxy_html/mod_proxy_html.tar.bz2
bunzip2 mod_proxy_html.tar.bz2
tar xfv mod_proxy_html.tar
cd mod_proxy_html
ls
less proxy_html.conf 
apxs -c -I/usr/include/libxml2 -i mod_proxy_html.c
  • The final step above builds and installs the new module, assuming libxml2 is available at the location given (which it is for Scientific Linux 4.4)
  • Configure reverse-proxy support for the FlyData server, using an /etc/httpd/conf.d/Flydata.conf file along these lines:
# Reverse-proxy path /flydata to the flydata server on port 8081
LoadFile    /usr/lib/libxml2.so
LoadModule proxy_html_module modules/mod_proxy_html.so
#ProxyHTMLLogVerbose On
#LogLevel Info
#or:
#LogLevel Debug

SetOutputFilter proxy-html
ProxyRequests Off

<Proxy *>
Order deny,allow
Allow from all
</Proxy>

ProxyPass        /flydata/ http://rodos.zoo.ox.ac.uk:8081/
ProxyHTMLURLMap http://rodos.zoo.ox.ac.uk:8081 /flydata
<Location /flydata/>
   ProxyPassReverse /
   ProxyHTMLURLMap / /flydata/
</Location>

# End.
  • Create password file in /etc/httpd/conf.d, e.g.:
htpasswd -cb flydataaccess flydata affyatlas
  • Finally, edit /etc/httpd/conf.d/Flydata.conf so that the /flydata/ section looks something like this:
<Location /flydata/>
   ProxyPassReverse /
   ProxyHTMLURLMap / /flydata/
   AuthName "Access to flydata (affyatlas) server is restricted"
   AuthType Basic
   AuthUserFile /etc/httpd/conf.d/flydataaccess
   AuthGroupFile /dev/null
   require valid-user
</Location>

When FlyDataServer is running, access should now be possible at http://rodos.zoo.ox.ac.uk/flydata/, using a username/password combination saved in /etc/httpd/conf.d/flydataaccess by running htpasswd.
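
For scripted access through the proxy, a client has to send the HTTP Basic credentials itself. The sketch below, in current Python 3, shows how the `Authorization` header is formed and attached to a request; the helper names are illustrative, the credentials are placeholders, and the request is built but not sent.

```python
import base64
import urllib.request


def basic_auth_header(username, password):
    """Return the value of an HTTP Basic Authorization header."""
    pair = ("%s:%s" % (username, password)).encode("utf-8")
    return "Basic " + base64.b64encode(pair).decode("ascii")


def build_request(url, username, password):
    """Build (but do not send) a GET request carrying Basic credentials."""
    request = urllib.request.Request(url)
    request.add_header("Authorization", basic_auth_header(username, password))
    return request


if __name__ == "__main__":
    # Placeholder credentials; substitute a pair stored with htpasswd.
    req = build_request("http://rodos.zoo.ox.ac.uk/flydata/", "user", "pass")
    print(req.get_header("Authorization"))
    # To send it, pass the request to urllib.request.urlopen(req).
```

Basic authentication only encodes the credentials and does not encrypt them, so this is best used over a trusted network or an HTTPS front end.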

Public server deployment

There is a password-protected remotely accessible demonstration version of the FlyData server running at:

 http://rodos.zoo.ox.ac.uk/flydata/localatlas/select

It is not intended to be completely public, and the access details needed are:

 username: flydata
 password: (ask us)

This service uses the Python server described above running behind an Apache HTTP server using reverse proxying. Configuration for the reverse proxy looks like this (see also /etc/httpd/conf.d/flydata.conf):

# Proxy requests for FlyData server on local port 8081
#

# Reverse-proxy path /flydata to the flydata server on port 8081

LoadFile    /usr/lib/libxml2.so
LoadModule proxy_html_module modules/mod_proxy_html.so

ProxyHTMLLogVerbose On
LogLevel Info
#or:
#LogLevel Debug

SetOutputFilter proxy-html

ProxyRequests Off

<Proxy *>
Order deny,allow
Allow from all
</Proxy>

ProxyPass       /flydata/ http://rodos.zoo.ox.ac.uk:8081/
ProxyHTMLURLMap http://rodos.zoo.ox.ac.uk:8081 /flydata

<Location /flydata/>
    ProxyPassReverse /
    ProxyHTMLURLMap / /flydata/
    AuthName "Access to flydata server is restricted"
    AuthType Basic
    AuthUserFile /etc/httpd/conf.d/flydataaccess
    AuthGroupFile /dev/null
    require valid-user
</Location>

# End.


FlyAtlas Ontology

This is currently work-in-progress. This table is automatically generated from a separately constructed ontology file.

Other notes and resources

Software to look at

Collaborations:

  • CombeChem (?)

Other resources:

Links



-- GrahamKlyne 16:00, 11 September 2007 (BST)
