FlyData project
FlyData: Decision support and semantic organization of laboratory data in Drosophila gene expression experiments
The FlyData Project, funded by BBSRC (May 2007-Dec 2008), is being undertaken by Graham Klyne (Project Manager), Jun Zhao and David Shotton (PI) of the Image Bioinformatics Research Group of the University of Oxford. It has been conceived as a very focused development that provides tools to enhance the work of a specific group of researchers, and thereby helps us to better understand the requirements of post-genomic researchers in general when building semantic data management tools. The project does not aim to produce a generalized tool suitable for all kinds of lab-based research; rather, by developing a specific tool, it aims to move toward formulating and articulating the principles upon which such a tool might be based.
We are building a laboratory information management and decision support system to handle experimental data from two funded BBSRC projects investigating gene expression in the fruit fly Drosophila melanogaster. This will capture and manage primary laboratory data up to and including the in situ gene expression images, and integrate third-party information from FlyBase and elsewhere, providing access to a comprehensive record of our research. The FlyData project will complement our ImageWeb work by developing systems to facilitate the capture of images from laboratory experiments, and related contextual data, in a fashion that supports their use during the course of a research study, and also makes them accessible for publication via one or more data webs. The initial work will be in support of research into gene expression in Drosophila spermatogenesis undertaken in the Oxford laboratory of Dr Helen White-Cooper.
The FlyData Project's purpose is as follows:
- To organize all the laboratory data arising from our Drosophila gene expression experiments in semantically meaningful ways.
- To relate them to online information.
- To provide different semantic views into this multidimensional information using a standard web browser, by means of a set of carefully designed graphical user interfaces.
- To record our research decisions and their provenance.
- To automate the subsequent publication of our results to the public Drosophila Testis Gene Expression Database (http://www.fly-ted.org).
The range of data that we need to manage includes:
- Affymetrix array gene expression results and their analyses.
- Annotations of selected genes with data from FlyBase (http://www.flybase.org) and the literature, and reasons for selecting genes for further study.
- PCR primer design criteria and sequences.
- Real-time PCR results and conclusions drawn.
- In situ hybridization probe sequences and in situ images.
- Expert interpretations of the in situ images made at the microscope.
We are using an agile programming approach, focusing on the most immediate needs of our research biologists, with a view to iterative enhancement to meet further requirements. By being responsive to researchers’ needs, this approach will crystallize requirements for such a system, and provide a significant step toward creating a more generic system for use in other systems biology projects that may have different detailed requirements and data structures. We shall use available lightweight open-source tools as far as possible, directing our activities to meeting biological researchers’ key requirements rather than constructing a complex software system.
We hope our work will seed the creation of an open source laboratory image and information management system and mutually-supporting community of users and developers.
Project goals
Reflecting revisions in light of developments since the original proposal:
- To improve the efficiency of capturing primary laboratory data and metadata flowing from investigations of gene expression in Drosophila, prior to and including the acquisition of annotated images showing mRNA in situ expression patterns and the localization patterns of GFP-tagged proteins.
- To combine locally gathered information with external sources, such as the Affymetrix database relating array location to gene identity, and FlyBase information about the identified genes.
- To support decision-making upon which research progress depends (for example, in the choice of genes to select for in situ gene expression imaging) by providing the ability to view all current data in integrated ways.
- To provide a more comprehensive research record, supporting review of key decisions, by recording supporting data and provenance (i.e. who made what data entry, annotation or decision, when and why).
- To improve the sharing of research data within our research group (enabling better integration of individuals' results into a common "collective lab consciousness"), sharing with colleagues in other institutions, and facilitating eventual dissemination in published reports and databases. This is supported by collecting related information into a semantically organized environment, employing a Drosophila data ontology. In particular:
- to facilitate dispatch of images showing GFP-tagged protein localization patterns and their supporting metadata to collaborators in Cambridge who have supplied us with flies bearing the GFP constructs,
- to fulfil our goal of sharing our research results with the scientific community through the public-facing Drosophila Testis Gene Expression Database (Fly-TED), by automating the export of selected images and their supporting metadata to that repository,
- to capture and present metadata needed to link research data sets and other published materials (cf. FlyWeb proposal).
- To lay groundwork for future projects that address data handling requirements of a wider post genomic research and systems biology community; for example, enhancements for use with Arabidopsis or Trypanosoma, which have different requirements and data structures.
Project focus
The project aims to develop and evaluate tools covering the following areas:
- User requirements
- Data modelling (schema and/or ontology)
- Data acquisition and provenance (from multiple sources, including researchers' preferred tools)
- Data presentation using Semantic Web standards
- Responsive web-based user interface
Possible additional areas:
- Data annotation, with provenance
Functional goals
- Web-based system: primary access using a Web Browser
- Assisting with data entry and/or ingest tasks
- Easy access to related data (needs more detail)
- Search over multiple facets of complex, loosely structured data
- Viewing selected gene data in a single integrated interface
- Automated or assisted publication of data
- Support for LSIDs to identify all relevant entities (http://lsids.sourceforge.net/)
- Supporting evolution of Fly-TED to support publication of more complete data
Data sources
Currently, an investigator must integrate information about a gene arising from:
- DNA array assays
- Real time PCR measurements
- In situ hybridization images
- Purported gene identity for each microarray probe and other data from the Affymetrix microarray database
- Annotated genomics information from FlyBase
- PubMed literature references from NCBI
- Other on-line information
Components and capabilities
Interoperability will be supported by providing data mappings to RDF, and using an OWL ontology to describe the diverse information gathered by FlyData. Data can be exported to other systems using existing Semantic Web tools and technologies.
We will also provide tools to populate the public-facing Drosophila Testis Gene Expression Database (Fly-TED) from data recorded by FlyData.
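As a rough illustration of the kind of RDF mapping described above, the following sketch serializes one gene record as N-Triples using only the Python standard library. The namespace, property names and gene identifier are hypothetical placeholders, not terms from the project's ontology, and only plain literal values are handled.

```python
# Sketch only: the namespace and property names below are hypothetical,
# not terms taken from the project's ontology.
BASE = "http://example.org/flydata/"


def escape_literal(value):
    """Escape the characters that N-Triples does not allow raw in a literal."""
    text = str(value)
    text = text.replace("\\", "\\\\").replace('"', '\\"')
    return text.replace("\n", "\\n").replace("\r", "\\r")


def to_ntriples(gene_id, properties):
    """Serialize one gene record as N-Triples lines (literal values only)."""
    subject = "<%sgene/%s>" % (BASE, gene_id)
    lines = []
    for name, value in sorted(properties.items()):
        predicate = "<%sterms/%s>" % (BASE, name)
        lines.append('%s %s "%s" .' % (subject, predicate, escape_literal(value)))
    return "\n".join(lines)


# "CG0000" is a made-up identifier used purely for illustration.
record = {"expressionRegion": "primary spermatocytes", "wildTypeLevel": "high"}
print(to_ntriples("CG0000", record))
```

A real mapping would more likely use an RDF library and typed values; the point here is only that each recorded field becomes one subject-predicate-object statement.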
The FlyData system will comprise the following components and capabilities:
- An extensible ontology, the Drosophila Data Ontology (DRODO), that will describe the Drosophila experiments and the data they produce. This ontology will underpin our data structures.
- Access to the system via standard web browsers and tools
- User interfaces designed to assist biological researchers with input and exploration of data and annotations managed by FlyData.
- Links between our laboratory data and relevant on-line information from FlyBase and other online data sources.
- The ability to export data to RDF (via SPARQL?)
The system will be constructed as a lightweight framework of software components, loosely coupled and with low maintenance costs, for user interaction and data storage and retrieval, each dealing with data formally or informally related to DRODO.
FlyData stands in contrast to many other excellent web bioinformatics systems by being focussed on the organization of local laboratory results in relation to external data.
We will provide facilities to review the stored data. The details of such access will be based on researchers’ ongoing and evolving needs. Initial discussions suggest interrogating the data along a number of semantic dimensions, to identify and compare genes sharing common properties; e.g., to find all genes that:
- are expressed in the same anatomical region of the testis, for example only in the primary spermatocytes;
- have high, medium or low degrees of wild-type expression
- are up-regulated or down-regulated in the strains relative to the wild type, for instance being at about twice the wild type level in the aly meiotic arrest mutant strain;
- have discordant results from the independent Affymetrix arrays and rt-PCR measurements of their gene expression;
- have unusual spatial expression patterns, such as comets and cups (Fig.1);
- are spatially adjacent in the genome, for example on the short arm of Chromosome 4;
- are duplicates or pseudogenes;
- have homologues of known function in other model organisms;
- have translation products that share particular protein domains, such as membrane-spanning alpha helices;
- have translation products that are known to interact, for example as members of the same signalling pathway;
or certain combinations of these features.
Finally, an integrated view into all these data for each single gene will enable users to see and relate these data in a single convenient interface.
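The following is a minimal sketch of how two of the query dimensions listed above (anatomical region, and discordance between array and rt-PCR results) might be combined. The record layout, gene names, values and the discordance threshold are all assumptions made for illustration; they do not describe the FlyData data model.

```python
from collections import namedtuple

# Hypothetical record layout; field names and values are illustrative only.
GeneRecord = namedtuple(
    "GeneRecord", "gene_id region array_fold_change pcr_fold_change")


def expressed_in(records, region):
    """Genes whose in situ annotation names the given anatomical region."""
    return [r.gene_id for r in records if r.region == region]


def discordant(records, tolerance=2.0):
    """Genes whose array and rt-PCR fold changes (mutant/wild type) disagree.

    Two measurements count as discordant when their ratio exceeds
    `tolerance` in either direction. The threshold is an assumption,
    and fold changes are assumed to be positive.
    """
    result = []
    for r in records:
        ratio = r.array_fold_change / r.pcr_fold_change
        if ratio > tolerance or ratio < 1.0 / tolerance:
            result.append(r.gene_id)
    return result


records = [
    GeneRecord("gene-A", "primary spermatocytes", 2.0, 2.1),
    GeneRecord("gene-B", "primary spermatocytes", 4.0, 0.9),
    GeneRecord("gene-C", "spermatids", 0.5, 0.5),
]

# Combining dimensions is a set intersection over the per-dimension results.
both = set(expressed_in(records, "primary spermatocytes")) & set(discordant(records))
print(sorted(both))
```

Keeping each dimension as a separate function means that "certain combinations of these features" reduce to set operations on the individual results.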
Further work
- FlyData/DataManagementRequirements - developing requirements for a generic data management platform
- FlyData/DataManagementUserStories - user stories relating to lab data management
- FlyData/DataManagementVocabulary - work-in-progress vocabulary for data management platform, based upon SIOC.
Questions
- Initial phase requirements?
- How to store raw data and images?
- How to store metadata, and other ontologically structured information?
Project activity and planning
See https://milos2.zoo.ox.ac.uk/trac/FlyData/wiki/WikiStart
- Phase 1
- Starting September 2007, aim for 2-3 month (max) timeframe
- FlyData/Activity/Phase1 - display local Affymetrix gene expression data alongside corresponding data from flyatlas.org (from meeting Meetings/20070817/FlyData-review).
Project meetings
- Meetings/20070817/FlyData-review - initial meeting with Helen White-Cooper
- Summary: the initial goal will be to display Helen's Affymetrix data together with corresponding data from flyatlas.org.
- Meetings/20080115/FlyData-review - review meeting with Helen White-Cooper, with GK, DS and JZ.
Technical notes
Software development approach
We will undertake lightweight software development, and will adopt an agile programming approach (http://en.wikipedia.org/wiki/Agile_programming) of doing the simple things first and building incrementally. The functionality will be defined by test cases agreed with the biological researchers. User requirements will be captured in the form of test cases, and automated regression testing will be used to confirm that subsequent developments of the system continue to satisfy all the identified requirements.
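To make the idea of requirements captured as test cases concrete, here is a small sketch using Python's standard unittest module. The helper merge_by_probe and the probe identifiers are hypothetical; they stand in for whatever code eventually pairs local expression values with reference values.

```python
import unittest


def merge_by_probe(local, atlas):
    """Pair local and reference expression values that share a probe ID.

    Hypothetical helper: both arguments map probe ID -> expression value.
    Probes missing from the reference data are kept, with None as the value.
    """
    return dict((probe, (value, atlas.get(probe)))
                for probe, value in local.items())


class TestSideBySideDisplay(unittest.TestCase):
    """Requirement: local results appear alongside the reference values."""

    def test_shared_probe_is_paired(self):
        merged = merge_by_probe({"p1": 10.0}, {"p1": 12.5})
        self.assertEqual(merged["p1"], (10.0, 12.5))

    def test_probe_without_reference_is_kept(self):
        merged = merge_by_probe({"p2": 3.0}, {})
        self.assertEqual(merged["p2"], (3.0, None))
```

Saved in a module, tests like these can be run with `python -m unittest`, and re-running them after each change provides the regression check.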
Web application framework
A key element of this work will be to choose appropriate tools to create (and test) user interfaces for importing, inputting, editing and presenting a variety of data elements. We intend to employ a lightweight full-stack application framework (e.g. TurboGears (http://www.turbogears.org/), based on the Python programming language (http://www.python.org/), or Ruby on Rails (http://www.rubyonrails.org/)), to build the FlyData system. Experience also suggests that, while tools like these are very helpful, they may not be enough to maintain flexibility when weaving applications that combine data access and query, server side processing and dynamic web pages (dynamic HTML, or AJAX). The Links project (see below) suggests a complementary approach drawing on ideas from functional programming; Links introduces a new programming language for web applications.
TurboGears is an integration of "best of breed" Python components that creates a Model–View–Controller (MVC) framework (http://java.sun.com/blueprints/patterns/MVC.html), a successful design pattern for interfacing users with underlying data structures that decouples data access, application logic, user interactions and data presentation:
- The Controller, CherryPy (http://www.cherrypy.org/), is the central organizing component of TurboGears, and provides a simple lightweight web server to deal with user interactions;
- The Model in TurboGears is an object-relational mapper that works with its own internal relational database. But, the model component interface is quite visible and it is very easy to use other data storage elements, or several in concert. We anticipate that simple files and web resources will form an important part of the FlyData model component;
- View elements (Web pages) are provided by the Python templating language KID (http://kid.lesscode.org/language.html).
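The separation of roles described above can be sketched in plain Python. This is not TurboGears, CherryPy or KID code; it is a standard-library stand-in, with made-up class names and data, that only shows how model, view and controller stay decoupled.

```python
from string import Template


# Model: holds the data. A dict stands in for the database here;
# the probe IDs and values are made up for illustration.
class ExpressionModel(object):
    def __init__(self, values):
        self._values = dict(values)

    def lookup(self, probe_id):
        return self._values.get(probe_id)


# View: turns a plain dict of values into HTML and knows nothing about storage.
PAGE = Template("<p>Probe $probe: expression $value</p>")


def render(context):
    return PAGE.substitute(context)


# Controller: maps a request onto a model query and hands the result to a view.
class Controller(object):
    def __init__(self, model):
        self.model = model

    def show(self, probe_id):
        value = self.model.lookup(probe_id)
        if value is None:
            value = "not recorded"
        return render({"probe": probe_id, "value": value})


controller = Controller(ExpressionModel({"probe-1": 7.5}))
print(controller.show("probe-1"))
```

Because the controller only sees the model's lookup method, the dict could be replaced by files or web resources without touching the view.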
TurboGears also works with MochiKit (http://www.mochikit.com/), a lightweight but powerful AJAX client-side Javascript library. AJAX (Asynchronous Javascript And XML; http://en.wikipedia.org/wiki/AJAX) permits Javascript in browsers to interact with a Web server in response to user input without refreshing the complete Web page, making it faster and smoother to display updated information, as users of Flickr, GMail or Google Maps will have experienced.
The Links project explores unified programming language design for use across model, controller and view elements of a web application, including code that runs in a browser. I would like to explore similar ideas using a metaprogramming approach (cf. Paul Graham in "Beating the Averages", http://www.paulgraham.com/avg.html). I have done some private work on unit-testing of asynchronous Javascript browser code, and used a monad-like construction to assemble a complete test case from a number of asynchronous elements, so I think there is promise here.
- MSpace - DefiningImageAccess/Tool/mSpace, http://www.mspace.fm/
- JSpace - DefiningImageAccess/Tool/JSpace, http://clarkparsia.com/projects/code/jspace/
- SWED faceted browser - DefiningImageAccess/Tool/Semantic_Portal
- Links - http://groups.inf.ed.ac.uk/links/
- Gloze - DefiningImageAccess/Tool/Gloze
- Monad comprehensions for queries - http://www.inf.uni-konstanz.de/dbis/publications/download/monad-comprehensions.pdf
- Can we use monad comprehensions for SPARQL queries?
- Scala, a functional-inspired language that runs in the JVM - http://www.scala-lang.org/
- Haskell, a pure functional programming language - http://www.haskell.org/
- Continuations for web interactions:
- http://www.cs.chalmers.se/~rjmh/Papers/arrows.pdf
- http://portal.acm.org/citation.cfm?doid=772970.772977
- http://www.ccs.neu.edu/scheme/pubs/interactive06-sk-mf.pdf
- More Scheme papers - continuations seem to be a recurring theme: http://www.ccs.neu.edu/scheme/pubs/
- http://pyds.muensterland.org/wiki/continuationbasedserver.html
- http://www.seaside.st/
- http://rifers.org/
- Is there a web continuation framework for Python, Ruby, etc.?
- http://www.cs.cmu.edu/~rwh/courses/mobile/Join/Fournet-join-tutorial.pdf - Join calculus
- http://research.microsoft.com/~nick/polyphony/redmondPolyphony.ppt - slides describing Polyphonic C#
Continuation and sustainability
We seek the creation of an Open Source laboratory data management and decision support system, which can be freely distributed to the BBSRC community and beyond. Our vision is for a lightweight system that is easily adapted to local requirements, encouraging a mutually supporting community of developers and users. Data representation using Web standards and formal ontology descriptions provides a sound basis for creating tools to facilitate data publication and exchange between research groups.
Software notes
Running the software
- Install Python (currently 2.5.2 - should be able to use 2.4 or later)
- Add Python scripts directory to the system PATH environment variable (e.g. C:\Dev\Python25\Scripts)
- Install Turbogears, using tgsetup.py
- http://docs.turbogears.org/1.0/Install
- This also installs the easy_install facility...
- Install SQLAlchemy and the SQLite Python interface:
easy_install SQLAlchemy
easy_install pysqlite
- Download SQLite3 libraries and command line program. For Windows, these are sqlitedll-3_x_x.zip and sqlite-3_x_x.zip.
- http://www.sqlite.org/download.html
- Extract .dll and .def file to Python DLL directory (e.g. C:\Dev\Python25\DLLs).
- Optionally, extract command line utility to Python scripts directory (e.g. C:\Dev\Python25\Scripts)
- Check out the software from Subversion:
mkdir /usr/local/FlyData
cd /usr/local/FlyData
svn co svn+ssh://graham@milos1.zoo.ox.ac.uk/var/svn/ImageWeb/FlyData/Trunk/
- You may need to change the server port number, and open that port in the firewall (reverse proxying via Apache, described below, is an alternative to opening a port through the firewall). On rodos.zoo.ox.ac.uk, port 8080 (the default for CherryPy) is used by FlyWeb/Tomcat, so I have changed the configuration to use port 8081.
- Port configuration is in /usr/local/FlyData/Trunk/FlyDataServer/dev.cfg
- You may need to change the FlyDataServer configuration to create new database tables:
- Value LoadNewData in /usr/local/FlyData/Trunk/FlyDataServer/flydataserver/ConfigSettings.py controls the reloading of database tables.
- Do this once only to create the tables, then reset the value to False, as it currently takes about 10 minutes to load new database tables.
- To run the server, execute the file /usr/local/FlyData/Trunk/FlyDataServer/start-flydataserver.py, e.g.:
cd /usr/local/FlyData/Trunk/FlyDataServer
./start-flydataserver.py
- Later, we will want to choose a way to start this automatically
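The one-time table loading controlled by LoadNewData can be pictured with the sketch below. Only the name LoadNewData comes from the notes above; the table layout, the sample rows and the function prepare_database are hypothetical, and how the real server reads and applies the setting is not shown here.

```python
import sqlite3

# Stand-in for the LoadNewData value in ConfigSettings.py; how the real
# server reads and applies that setting is not described in these notes.
LoadNewData = True


def prepare_database(conn, load_new_data, rows):
    """Create and fill the table only when asked to, otherwise leave it alone.

    Returns the number of rows loaded (0 when loading is skipped).
    """
    if not load_new_data:
        return 0
    conn.execute("DROP TABLE IF EXISTS expression")
    conn.execute(
        "CREATE TABLE expression (probe_id TEXT PRIMARY KEY, value REAL)")
    conn.executemany("INSERT INTO expression VALUES (?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
sample = [("probe-1", 7.5), ("probe-2", 1.25)]  # made-up rows
loaded = prepare_database(conn, LoadNewData, sample)
# A later start with the flag reset to False keeps the existing tables.
skipped = prepare_database(conn, False, sample)
count = conn.execute("SELECT COUNT(*) FROM expression").fetchone()[0]
print(loaded, skipped, count)
```

Guarding the load behind a flag is what makes it safe to reset the value to False after the first run: later starts reuse the tables instead of repeating the slow load.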
Running under Apache with reverse proxying for password protected access
- Download and unpack mod_proxy_html from http://apache.webthing.com/, and build with apxs, starting in a working/download directory:
wget http://apache.webthing.com/mod_proxy_html/mod_proxy_html.tar.bz2
bunzip2 mod_proxy_html.tar.bz2
tar xfv mod_proxy_html.tar
cd mod_proxy_html
ls
less proxy_html.conf
apxs -c -I/usr/include/libxml2 -i mod_proxy_html.c
- The final step above builds and installs the new module, assuming libxml2 is available at the location given (which it is for Scientific Linux 4.4)
- Configure reverse-proxy support for the FlyData server, with something like this /etc/httpd/conf.d/Flydata.conf file:
# Reverse-proxy path /flydata to the flydata server on port 8081
LoadFile /usr/lib/libxml2.so
LoadModule proxy_html_module modules/mod_proxy_html.so
#ProxyHTMLLogVerbose On
#LogLevel Info
#or:
#LogLevel Debug
SetOutputFilter proxy-html
ProxyRequests Off
<Proxy *>
Order deny,allow
Allow from all
</Proxy>
ProxyPass /flydata/ http://rodos.zoo.ox.ac.uk:8081/
ProxyHTMLURLMap http://rodos.zoo.ox.ac.uk:8081 /flydata
<Location /flydata/>
ProxyPassReverse /
ProxyHTMLURLMap / /flydata/
</Location>
# End.
- Create password file in /etc/httpd/conf.d, e.g.:
htpasswd -cb flydataaccess flydata affyatlas
- Finally, edit /etc/httpd/conf.d/Flydata.conf so that the /flydata/ section looks something like this:
<Location /flydata/>
ProxyPassReverse /
ProxyHTMLURLMap / /flydata/
AuthName "Access to flydata (affyatlas) server is restricted"
AuthType Basic
AuthUserFile /etc/httpd/conf.d/flydataaccess
AuthGroupFile /dev/null
require valid-user
</Location>
When FlyDataServer is running, access should now be possible at http://rodos.zoo.ox.ac.uk/flydata/, using a username/password combination saved in /etc/httpd/conf.d/flydataaccess by running htpasswd.
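One way to check the password-protected proxy from a script is sketched below. It is written for a current Python 3 (urllib.request) rather than the Python 2.5 used to run the server, and the function names are made up for this example. The network call is left commented out because it needs access to the server and real credentials.

```python
import base64
import urllib.request


def basic_auth_header(username, password):
    """Build the value of an HTTP Basic Authorization header."""
    token = base64.b64encode(("%s:%s" % (username, password)).encode("utf-8"))
    return "Basic " + token.decode("ascii")


def fetch_status(url, username, password, timeout=10):
    """Request the URL with Basic credentials and return the HTTP status code."""
    request = urllib.request.Request(url)
    request.add_header("Authorization", basic_auth_header(username, password))
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Example (not run here, since it needs network access and real credentials):
# fetch_status("http://rodos.zoo.ox.ac.uk/flydata/", "flydata", "<password>")
```

A 200 status would indicate that both the reverse proxy and the credentials are working; urlopen raises urllib.error.HTTPError for a 401 when the credentials are rejected.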
Public server deployment
There is a password-protected remotely accessible demonstration version of the FlyData server running at:
http://rodos.zoo.ox.ac.uk/flydata/localatlas/select
It is not intended to be completely public, and the access details needed are:
username: flydata password: (ask us)
This service uses the python server described above running behind an Apache HTTP server using reverse proxying. Configuration for the reverse proxy looks like this (see also /etc/httpd/conf.d/flydata.conf):
# Proxy requests for FlyData server on local port 8081
#
# Reverse-proxy path /flydata to the flydata server on port 8081
LoadFile /usr/lib/libxml2.so
LoadModule proxy_html_module modules/mod_proxy_html.so
ProxyHTMLLogVerbose On
LogLevel Info
#or:
#LogLevel Debug
SetOutputFilter proxy-html
ProxyRequests Off
<Proxy *>
Order deny,allow
Allow from all
</Proxy>
ProxyPass /flydata/ http://rodos.zoo.ox.ac.uk:8081/
ProxyHTMLURLMap http://rodos.zoo.ox.ac.uk:8081 /flydata
<Location /flydata/>
ProxyPassReverse /
ProxyHTMLURLMap / /flydata/
AuthName "Access to flydata server is restricted"
AuthType Basic
AuthUserFile /etc/httpd/conf.d/flydataaccess
AuthGroupFile /dev/null
require valid-user
</Location>
# End.
FlyAtlas Ontology
This is currently work-in-progress. This table is automatically generated from a separately constructed ontology file.
- FlyAtlas/Schema - summary of terms
Other notes and resources
- Overlap with SABRE proposal:
- Possible use of shared drive (WebDAV?)
- Draw upon ideas from NERC meeting - Meetings/20070220/NEBC-Workshop
Software to look at
- http://www.biomart.org/
- http://www.intermine.org/
- http://www.cs.kent.ac.uk/projects/vital/overview/index.html
- http://www.openoffice.org/
- http://www.mygrid.org.uk/ - Taverna/myGrid
- http://www.hpl.hp.com/semweb/tools.htm - Jena family tools:
- http://jena.sourceforge.net/ - Jena
- http://www.joseki.org/ - Joseki
Collaborations:
- CombeChem (?)
Other resources:
- http://www.flybase.org/static_pages/allied-data/external_resources5.html - under "atlases and images" you will find links to many Drosophila image databases.
- http://www.fruitfly.org/cgi-bin/ex/insitu.pl - Berkeley Drosophila Gene expression Database (BDGB)
- Martlet: A Scientific Work-Flow Language for Abstracted Parallisation, Daniel Goodman, http://www.softeng.ox.ac.uk/daniel.goodman/martlet.pdf.
- Functional Genomics Experiment (FuGE): http://fuge.sourceforge.net/. A data model to be adopted.
- Pedro: http://pedrodownload.man.ac.uk/ and Pierre: http://pierredownload.man.ac.uk/. Tools.
- Ontology for Biomedical Investigations: http://obi.sourceforge.net/. A general ontology to follow.
- Minimum Information Specification For In Situ Hybridization and Immunohistochemistry Experiments (MISFISHIE): http://scgap.systemsbiology.net/standards/misfishie/. Highly relevant to Helen's work.
- Proteomics Standards Initiative: http://psidev.sourceforge.net/ and Minimum Information About a Proteomics Experiment (MIAPE): http://psidev.sourceforge.net/miape/. More general info.
- Model-driven user interfaces for bioinformatics data resources: regenerating the wheel as an alternative to reinventing it, Kevin Garwood, Christopher Garwood, Cornelia Hedeler, Tony Griffiths, Neil Swainston, Stephen G Oliver and Norman W Paton; BMC Bioinformatics 2006, 7:532 doi:10.1186/1471-2105-7-532; the electronic version of this article is complete and online at: http://www.biomedcentral.com/1471-2105/7/532. Relates to FuGE, Pedro and Pierre.
Links
- Meetings/20070220/NEBC-Workshop - NEBC/EBI workshop
- http://darwin.nerc-oxford.ac.uk/pgp-wiki/index.php/Main_Page
- http://www.ebi.ac.uk/net-project/
- http://mibbi.sourceforge.net/ - MIBBI web site
- http://www.cisban.ac.uk/cisbanDPI.html
- http://fuge.sourceforge.net/
- http://www.andromda.org/
- http://www.ncbi.nlm.nih.gov/sites/entrez?cmd=Retrieve&db=PubMed&dopt=AbstractPlus&list_uids=17087822
- http://nebc.nox.ac.uk/public_catalogue.php - Example database/portal?
- http://www.megx.net/ - Example database/portal?
- http://www.biomart.org/
- Functional reactive programming (noted here because it seems that a CPS approach to creating web applications may be appropriate for maintaining agility when developing FlyData web interfaces):
- Some links that should be followed-up more closely
- http://www.red-bean.com/ospowiki/LondonOpenSourceJam05Talks - some Open Source Jam talks about the web
- http://duncan-cragg.org/blog/post/google-micro-conference/
- http://the-u-web.org/ - I think some of the ideas here are very close to elements of our thinking
- http://code.google.com/p/webdriver/ - possible testing tools?
-- GrahamKlyne 16:00, 11 September 2007 (BST)

