FlyWeb/DataAccessSurvery
A Survey of Data Access to resources in the Fly Web
Access to BDGP
- The database of the Berkeley Drosophila Genome Project (BDGP) contains gene expression images of Drosophila embryos. The gene expression patterns are annotated using a controlled vocabulary. Images, microarray data and annotation data are stored in a modified version of the Gene Ontology database, and the entire dataset is available on the web in browsable and searchable form [1].
Data Access
- The MySQL dump of the gene expression images and their annotations can be downloaded from ftp://ftp.fruitfly.org/pub/exgomysqldump/archive/current/exgopub-20070309.dump.gz [cited December 2007].
- A SPARQL endpoint over a subset of this dataset was available at http://spade.lbl.gov:2021/sparql and http://spade.lbl.gov:2020/sparql, but neither link was active at the time of testing [15:14 GMT, 03/12/2007].
- In this release, 3,418 genes (out of a total of 6,138 genes) were documented with 74,612 digital photographs [1]. According to the database records, however, there are 89,981 images for 7,703 genes. The whole database is ~328.5 MB.
Data Identifier
Each gene is identified by three types of identifiers:
- A FlyBase identifier, e.g. FBgn0038166, which links to FlyBase gene report
- An EST identifier
- etc
Schema
- Each image is assigned to a specific developmental stage range and typically multiple images are captured for each stage range [1].
- Images of each stage are described by one or more controlled terms, created by the BDGP team in collaboration with Volker Hartenstein and Michael Ashburner [2]. This controlled vocabulary has little overlap with the OBO FlyBase Anatomy Ontology used to describe our Drosophila testis gene expression images on FlyTED.
- Each gene is associated with some GO function terms
- An image may carry free-text descriptions through the Note or Comment property. The Note property contains textual annotation (maternal, no staining, ubiquitous), generally only when no images are available; the Comment property holds the annotator's free-text comments.
- More for the schema alignment package
Experiment
- Download the MySQL dump from ftp://ftp.fruitfly.org/pub/exgomysqldump/archive/current/exgopub-20070309.dump.gz to Andros.
- Install phpMyAdmin to explore the dataset.
- Todo: it should be possible to install a D2R server over this dataset in order to expose a SPARQL endpoint over it.
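Before loading the dump into MySQL, a quick look at which tables it defines can help when planning a D2R mapping. The following is a minimal sketch, assuming the file is a standard mysqldump output containing `CREATE TABLE` statements; the function name `list_tables` and the local file name in the usage comment are illustrative, not part of any BDGP tooling.

```python
import gzip
import re

# mysqldump writes one "CREATE TABLE `name` (" line per table.
CREATE_TABLE = re.compile(
    r"^CREATE TABLE (?:IF NOT EXISTS )?`?(\w+)`?", re.IGNORECASE
)


def list_tables(dump_path):
    """Return the table names found in a gzip-compressed MySQL dump, in file order."""
    tables = []
    # Read line by line so the whole dump never has to fit in memory.
    # latin-1 is an assumption; it decodes any byte sequence without errors.
    with gzip.open(dump_path, "rt", encoding="latin-1") as dump:
        for line in dump:
            match = CREATE_TABLE.match(line)
            if match:
                tables.append(match.group(1))
    return tables


# Usage (assumed local copy of the dump named in the text):
#     for name in list_tables("exgopub-20070309.dump.gz"):
#         print(name)
```

This only reads the schema statements; it does not replace loading the data into MySQL for real queries.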
Access to FlyBase
FlyBase is the primary source of genetic and genomic information about fruit flies.
Data Access
Data can be downloaded via the Release menu of the FlyBase web site, e.g. ftp://ftp.flybase.net/releases/. In the FlyWeb proposal, we wrote: "We already have some experience of automated extraction of selected information from FlyBase in XML format, and we propose to use this method to extract relevant data for the FlyWeb data web, parsing it into RDF on the fly using XSLT." I am not sure what this means or whether anybody in the group has experimented with it.
Interestingly, there are links out to PubMed, UniProtKB/Swiss-Prot and UniProtKB/TrEMBL in the gene report of FlyBase.
Information about FlyBase identifiers can be found at: http://flybase.org/static_pages/docs/refman/refman-F.html
Experiment
- Download to Andros server the XML format gene annotation report of the current release from ftp://ftp.flybase.net/releases/current/reporting-xml/FBgn.xml.gz [16:06GMT 03/12/2007]
- The XML report is around 1,734 MB
- Is there any other way to access the XML format gene report on FlyBase?
- todo: discuss how to manipulate and query this big XML dump
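One possible starting point for handling a dump of this size is to stream it rather than load it whole. The sketch below counts how often each XML tag occurs, which gives a first picture of the report's structure without assuming anything about the FlyBase schema. The function name `count_tags` is illustrative, and the approach (discarding each top-level record once it has been read) is an assumption about what is useful here, not a tested recipe for FBgn.xml.gz.

```python
import gzip
import xml.etree.ElementTree as ET
from collections import Counter


def count_tags(xml_gz_path):
    """Stream a gzip-compressed XML file and count occurrences of each tag."""
    counts = Counter()
    depth = 0
    root = None
    with gzip.open(xml_gz_path, "rb") as handle:
        for event, elem in ET.iterparse(handle, events=("start", "end")):
            if event == "start":
                if root is None:
                    root = elem  # the first start event is the root element
                counts[elem.tag] += 1
                depth += 1
            else:
                depth -= 1
                if depth == 1:
                    # A direct child of the root has just closed:
                    # drop it so memory use stays roughly constant.
                    root.clear()
    return counts


# Usage (assumed local copy of the report named in the text):
#     for tag, n in count_tags("FBgn.xml.gz").most_common(20):
#         print(tag, n)
```

The same loop could be adapted to extract selected fields per record once the relevant tag names are known.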
Accessing XML format FlyBase Records
FlyBase used to provide a retrieval facility that allowed users to retrieve FlyBase records in different formats, including XML and CSV. A quick test by Graham and me in January 2008 showed that this facility is no longer exposed on the current live FlyBase site.
The tutorial at http://chervil.bio.indiana.edu:7092/docs/lk/refman/refman-D.html#D13 works in FlyBase r4.3, which is hosted at http://chervil.bio.indiana.edu:7092. To retrieve a CSV format FlyBase record, one needs to do the following:
- Find the query bookmark manually: http://chervil.bio.indiana.edu:7092/.bin/asksrs.html?%5BFBgn-org%3ADmel%5D%26%5BFBgn-cla%3Apseudogene%5D
- Then edit the URL to perform a batch download of data:
- for Tabbed format, add '-gvtext/tsv&' after '?'
- for Comma format, add '-gvtext/csv&' after '?'
- for Acode format, add '-gvtext/acode&' after '?'
- for XML format, add '-gvtext/xml&' after '?'
- for plain text format, add '-gvtext/plain&' after '?'
- for HTML format, add '-gvtext/html&' after '?'
- for all results in one batch, add '-m99999&' after '?' (default is 20 matches/page)
- To have CSV format record output, we should have:
- To have XML format record output, we should have:
http://chervil.bio.indiana.edu:7092/.bin/asksrs.html?-gvtext/xml&%5BFBgn-org%3ADmel%5D%26%5BFBgn-cla%3Apseudogene%5D
However, this no longer works in FlyBase r4.3.
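The URL-editing rule listed above (insert an option string directly after the '?') can be expressed as a small helper. This is a sketch only: the function name `add_srs_options` is hypothetical, and chaining a format option with '-m99999&' in one URL is an assumption, since the tutorial steps list the options separately.

```python
def add_srs_options(bookmark_url, fmt=None, all_matches=False):
    """Insert output options directly after the '?' of a query bookmark URL.

    fmt is one of the format names from the list above
    ("tsv", "csv", "acode", "xml", "plain", "html").
    """
    base, sep, query = bookmark_url.partition("?")
    if not sep:
        raise ValueError("bookmark URL has no '?' query part")
    options = ""
    if fmt is not None:
        options += "-gvtext/" + fmt + "&"
    if all_matches:
        # Assumption: this option may follow a format option in the same URL.
        options += "-m99999&"
    return base + "?" + options + query


bookmark = (
    "http://chervil.bio.indiana.edu:7092/.bin/asksrs.html"
    "?%5BFBgn-org%3ADmel%5D%26%5BFBgn-cla%3Apseudogene%5D"
)
print(add_srs_options(bookmark, fmt="csv"))
print(add_srs_options(bookmark, fmt="xml"))
```

With `fmt="xml"` the helper reproduces the XML URL given above; the CSV variant is built in the same way under the stated rule.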
Access to PubMed and UK PubMed Central
PubMed is the universal repository for bibliographic information and abstracts of biomedical literature, and is the first port of call for any scientist seeking to discover what has been published in her field. Currently it references 61,041 papers on Drosophila and 17,703 papers on gene expression in Drosophila.
In addition, PubMed Central is a free digital archive of full-text biomedical and life sciences journal literature, currently listing 37,491 papers on Drosophila and 18,722 papers on gene expression in Drosophila (why this figure is larger than the equivalent figure in PubMed is unexplained).
Data Access
Programmatic access to both PubMed and PubMed Central is enabled by eUtils (Entrez Utilities; http://eutils.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html). An eUtils query is essentially an HTTP call to the Entrez databases using a formatted URL (although a SOAP version is also available), and it returns results in XML format. Additionally, PubMed Central permits harvesting of full-text articles using OAI-PMH, which returns the records in XML format, or alternatively through an FTP service that provides .nxml XML or PDF files.
Two things to note about eUtils:
- It can only access information in Entrez. Not everything at NCBI is in Entrez, but the PubMed database that interests us is.
- It uses UIDs for both query input and output. UIDs can be GI numbers for Nucleotide and Protein, PMIDs for PubMed, or MMDB-IDs for Structure. An application needs to understand these UIDs in order to reach desired data [3].
Understanding the Entrez utilities [4]:
1. EGQuery, ESearch, and ESummary
- ESearch returns a list of UIDs that match a text query in a given Entrez database.
- ESummary returns DocSums that match a list of input UIDs.
- EGQuery is a global version of ESearch that searches all Entrez databases simultaneously.
2. EInfo, EFetch*, and ELink
- EInfo provides detailed information about a given database, including lists of the indexing fields in the database and the available links to other Entrez databases.
- EFetch generates formatted output for a list of input UIDs.
- ELink generates a list of UIDs in a specified Entrez database that are linked to a set of input UIDs.
* EFetch is currently supported only in the following databases: PubMed, PubMed Central, Journals, Nucleotide, Protein, Genome, Gene, SNP, PopSet, and Taxonomy.
Guidelines for Constructing URLs
- Use lowercase characters for all parameters except &WebEnv
- Avoid placing spaces in the URLs. If a space is required, use a plus sign (+) instead of a space.
- Special characters should be represented by their URL encodings (e.g. %23 for #)
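These guidelines map directly onto standard URL encoding. As a small illustration, Python's `urllib.parse.urlencode` replaces spaces with plus signs and percent-encodes special characters. The helper name `esearch_url` is hypothetical; the base URL is the ESearch address used in the experiment below.

```python
from urllib.parse import urlencode

ESEARCH = "http://www.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(term, db="pubmed"):
    """Build an ESearch URL with lowercase parameter names and an encoded term."""
    # urlencode turns spaces into '+' and characters such as '#' into '%23'.
    return ESEARCH + "?" + urlencode({"db": db, "term": term})


print(esearch_url("drosophila gene expression"))
# http://www.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=drosophila+gene+expression
```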
Experiment
- Look for citations about Drosophila
http://www.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=drosophila
This returns a list of IDs (PMIDs), which can then be passed to the EFetch command.
- The Perl script to search for reports/abstracts about Drosophila can be found on the Andros server.
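The ESearch-to-EFetch hand-off described above can be sketched as follows. This is not the Perl script on Andros; it is an illustrative Python equivalent. The sample XML and its two IDs are made up for the example, the function names are hypothetical, and the EFetch base URL is assumed to sit alongside the ESearch URL used above.

```python
import xml.etree.ElementTree as ET

EFETCH = "http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def parse_id_list(esearch_xml):
    """Extract the UIDs from the <IdList> of an ESearch XML result."""
    root = ET.fromstring(esearch_xml)
    return [id_elem.text for id_elem in root.findall("./IdList/Id")]


def efetch_url(ids, db="pubmed"):
    """Build an EFetch URL that requests the given UIDs as XML."""
    return EFETCH + "?db=" + db + "&id=" + ",".join(ids) + "&retmode=xml"


# Illustrative ESearch result; the IDs are placeholders, not real citations.
sample = """<eSearchResult>
  <Count>2</Count>
  <IdList>
    <Id>11111111</Id>
    <Id>22222222</Id>
  </IdList>
</eSearchResult>"""

ids = parse_id_list(sample)
print(efetch_url(ids))
```

In practice the XML would come from fetching the ESearch URL, for example with `urllib.request.urlopen`, before being passed to `parse_id_list`.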
4DXpress
http://ani.embl.de/4DXpress/ (This link was forwarded by D.S.)
"This database provides a platform to query and compare gene expression data during the development of the major model animals (zebrafish, drosophila, medaka, mouse)".
To access information from 4DXpress programmatically, we can use the following ids:
- internal ids: http://ani.embl.de/4DXpress/reg/all/search/bquery.do?id=CSEPD:GE0047765
- Ensembl ids: http://ani.embl.de/4DXpress/reg/all/search/bquery.do?id=ENSMUSG00000036026
- FlyBase ids: http://ani.embl.de/4DXpress/reg/all/search/bquery.do?id=FBgn0003900
- MGI ids: http://ani.embl.de/4DXpress/reg/all/search/bquery.do?id=MGI:96204
- gene symbols: http://ani.embl.de/4DXpress/reg/all/search/bquery.do?id=Mef2
The same link form can be used to link not just to genes but also to expression patterns, developmental stages, and anatomy terms.
- internal expression pattern: http://ani.embl.de/4DXpress/reg/all/search/bquery.do?id=CSEPD:XP0023526
- internal developmental stage: http://ani.embl.de/4DXpress/reg/all/search/bquery.do?id=CSEPD:ST0000006
This site would help us access gene expression data for different species programmatically. They handle gene name disambiguation using the Ensembl Compara database.
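Since every example above uses the same bquery.do address with a single id parameter, the links can be generated from any supported identifier. A minimal sketch follows; the function name `fourdxpress_url` is hypothetical, and leaving the colon unescaped is a choice made only so that the output matches the link form shown above.

```python
from urllib.parse import quote

BQUERY = "http://ani.embl.de/4DXpress/reg/all/search/bquery.do"


def fourdxpress_url(identifier):
    """Build a 4DXpress query link for a gene, pattern, stage or symbol id."""
    # Keep ':' as-is so ids such as 'MGI:96204' look like the examples above.
    return BQUERY + "?id=" + quote(identifier, safe=":")


for identifier in ["FBgn0003900", "MGI:96204", "CSEPD:ST0000006", "Mef2"]:
    print(fourdxpress_url(identifier))
```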
Reference
[1] Patterns of gene expression in Drosophila embryogenesis. http://www.fruitfly.org/cgi-bin/ex/insitu.pl, March 2007 [cited December 2007].

