Flyweb/Gene Name Synonyms

From ImageWeb

Jump to: navigation, search

Contents

Finding Gene Name Synonyms

This page contains notes of investigations for determining the various synonyms used to iudentify a given Drosophila gene in data sources. This probably represents a typical coreference problem.

Mining FlyBase

Public access is available to FB_CHADO, the master database for FlyBase. Details, which should be usable to harvest the data which you used to get through URLs from the old FlyBase, are posted here:

Generic Chado specs and related info can be found here:

Phenotype specific generic specs can be found here:

Some mapping tables detailing the quirks of implementation of the chado schema at FlyBase. To access these you will need to sign up to my trac wiki (this will probably be useful for other purposes), which you can do here:

Mapping tables are here:

If you need any help with SQL queries regarding phenotypic data, just let me know. For details about mining expression data, you should ask Andy Schroeder. If you have general questions about options for mining FlyBase, you should ask Josh Goodman. (Email addresses for the above are in an email from David Shotton, 23 April 2008 15:01)

Chado

FlyBase uses a database schema called Chado, which is part of GMOD. GMOD is the Generic Model Organism Database project, a collaboration of several model organism database groups, including FlyBase, to develop a set of open-source software for managing model organism data. You can use it to create a small laboratory database of genome annotations, or a large web-accessible community database. GMOD tools are in use at many large and small community databases.

GMOD is a collaboration of several model organism database groups, including FlyBase, to develop a set of open-source software for managing model organism data.

From http://www.gmod.org/wiki/index.php/Chado_-_Getting_Started:

Chado is a relational database schema that underlies many GMOD installations. It is capable of representing many of the general classes of data frequently encountered in modern biology such as sequence, sequence comparisons, phenotypes, genotypes, ontologies, publications, and phylogeny. It has been designed to handle complex representations of biological knowledge and should be considered one of the most sophisticated relational schemas currently available in molecular biology. The price of this capability is that the new user must spend some time becoming familiar with its fundamentals.

Chado is, in some respects, very RDF-like in its approach. Consider this:

This use of a controlled vocabulary instead of explicit table columns allows new properties to be added easily and without any disruption to the schema or any software that uses the schema.
-- http://dx.doi.org/10.1093%2Fbioinformatics%2Fbtm189

See also:

Gene synonyms

From: http://www.gmod.org/wiki/index.php/Chado_Sequence_Module#Feature_Synonyms:

Feature Synonyms

In addition to having a name or symbol, it is common for features such as genes to have multiple synonyms or aliases. These synonyms may exist due to different publications referring to the same gene with different symbols, or because one gene was once believed to be two or more separate genes. A common curation operation on genes is splitting and merging, which results in the creation of synonyms.

This is modelled in Chado with a synonym table and a feature_synonym linking table; thus multiple features can potentially share the same, and a single feature can be have multiple synonyms. Use of a synonym in the literature is indicated with a pub_id foreign key referencing the pub table (see the publications module), indicating historical provenance for the use of a synonym.

Feature synonyms are found by joining to feature_synonym and synonym. For example, here is a query to find gene by name or synonym:

SELECT feature_id FROM feature
WHERE name = 'name of interest'
union SELECT feature_id
FROM feature_synonym fs, synonym s
WHERE fs.synonym_id = s.synonym_id
AND s.name = 'name of interest'
AND fs.is_current;

Also, from the Bioinformatics paper (http://dx.doi.org/10.1093%2Fbioinformatics%2Fbtm189):

4.2.5 Feature synonyms Features can have multiple names and synonyms. This is modeled in Chado with the synonym table, which links to features via the feature_synonym linking table. All feature_synonym links have an is_current Boolean attribute, which distinguishes between names in current usage and alternate or obsolete names. Depending on the experimental history of a feature, multiple features can potentially share a common synonym, and a single feature can have multiple synonyms. The provenance of a particular synonym is indicated using the pub_id foreign key, which references the pub table.

Accessing FlyBase Chado database

FlyBase is pleased to announce direct read only access to our Chado database.

hostname: flybase.org
port: 5432
username: flybase
password: (no password)
database name: flybase

e.g. Using the PostgreSQL client

psql -h flybase.org -U flybase flybase

The release available via this method will match the current version displayed on our web site. Prior to each release there will be a planned database service outage of ~5 hours while the new data is loaded. These will occur approximately once a month. Please check back here for updates on when outages will take place.

Also, if you would like a local copy of our database you can always download our PostgreSQL dumps from: ftp://ftp.flybase.net/releases/current/psql

The following notes show SQL queries that we used in our initial exploration of the FlyBase Chado database:

// Find feature, returning internal feature_id (primary key of feature table) using any term explicitly designated a synonym:
SELECT feature_id FROM feature WHERE name = 'comr' union 
SELECT feature_id FROM feature_synonym fs, synonym s WHERE fs.synonym_id = s.synonym_id AND s.name = 'comr' AND fs.is_current;

// For 'aly', the above query returns 3101873:  show all details of this feature:
SELECT * FROM feature where feature_id = 3101873;

// Try similar using one of the synonyms.  This query returns nothing, as the corresponding is_current value is flase.
SELECT feature_id FROM feature WHERE name = 'CG2075' union 
SELECT feature_id FROM feature_synonym fs, synonym s WHERE fs.synonym_id = s.synonym_id AND s.name = 'CG2075' AND fs.is_current;

SELECT * FROM feature where feature_id = 3094203;

SELECT * FROM feature_synonym WHERE feature_id = 3101873;

SELECT * FROM synonym WHERE synonym_id IN (310376,1310376,310377,1181155,1310378,1310377,1310379);

SELECT * FROM feature_synonym fs, synonym s where fs.feature_id = 3101873 and fs.synonym_id = s.synonym_id;

SELECT DISTINCT * FROM feature_synonym fs, synonym s where fs.feature_id = 3101873 and fs.synonym_id = s.synonym_id;

SELECT DISTINCT name, fs.feature_id, fs.synonym_id, fs.is_current FROM feature_synonym fs, synonym s where fs.feature_id = 3101873 and fs.synonym_id = s.synonym_id;

// Find feature using any synonym - these queries all work
SELECT feature_id FROM feature WHERE name = 'aly' union 
SELECT feature_id FROM feature_synonym fs, synonym s WHERE fs.synonym_id = s.synonym_id AND s.name = 'aly' AND fs.is_current;

SELECT feature_id FROM feature WHERE name = '143450_at' union
SELECT feature_id FROM feature_synonym fs, synonym s WHERE fs.synonym_id = s.synonym_id AND s.name = '143450_at';

SELECT feature_id FROM feature WHERE name = 'CG2075' union
SELECT feature_id FROM feature_synonym fs, synonym s WHERE fs.synonym_id = s.synonym_id AND s.name = 'CG2075';

SELECT feature_id FROM feature WHERE name = 'always early' union
SELECT feature_id FROM feature_synonym fs, synonym s WHERE fs.synonym_id = s.synonym_id AND s.name = 'always early';

Links to follow up

Programmatic access to FlyBase:

  • http://flybase.bio.indiana.edu/forums/viewtopic.php?f=10&t=35 - discussion thread on the FlyBase Community Forum. Basically, the developer Josh Goodman says the old functionality will probably be restored soon, and is currently available on their Chado Database (http://flybase.org/forums/viewtopic.php?f=4&t=26), and is asking whether people want a REST interface instead, or a BioMart interface. I think we should jump into this discussion string and ask about the possibility of a SPARQL Endpoint over FlyBase. (taken from an email by David Shotton)

Gene browse interface:

User Requirement

Requirement (from Alistair) is an interface which can respond to query of the form:

 SELECT ?feature
        ?symbol
        ?annotationno
        ?flybaseid
        ?synonym (s?)
 WHERE{
  ?feature(gene) label ?label
 }

"label" means any of feature,symbol,annotationno,id,synonym.

ie: A. Query for gene via any of its labels (union many queries?) B. return all its "labels"


Fields of interest in flybase rdb

name on HTML page http://flybase.org/reports/FBgn0004372.html / http://flybase.org/reports/FBgn0000180.html example database field notes SQL Results of SQL
Symbol Dmel\aly Feature.name select * from feature where feature.uniquename = 'FBgn0000180'; / select * from feature where feature.uniquename = 'FBgn0004372';
 select feature.feature_id from feature 
 where feature.name = 'aly';
3101873
Name always early Synonym.name where Synonym.type_id=60037(fullname) Note that this is a subset of the Symbol Synonym data

select synonym.*, cvterm.* from synonym join feature_synonym on synonym.synonym_id=feature_synonym.synonym_id join feature on feature.feature_id=feature_synonym.feature_id join cvterm on synonym.type_id = cvterm.cvterm_id where feature.uniquename = 'FBgn0004372' and cvterm.name='fullname';

 select feature.feature_id from synonym 
  join feature_synonym on synonym.synonym_id = feature_synonym.synonym_id 
  join feature on feature.feature_id=feature_synonym.feature_id 
  join cvterm on synonym.type_id = cvterm.cvterm_id
  where cvterm.name='fullname' 
  and synonym.name = 'always early'; 
301873,301873,301873 (why duplicates?)
Annotation symbol CG2075 Dbxref.accession select dbxref.* from feature join feature_dbxref on feature.feature_id = feature_dbxref.feature_id join dbxref on dbxref.dbxref_id=feature_dbxref.dbxref_id join db on db.db_id=dbxref.db_id where db.name = 'FlyBase Annotation IDs' and feature.uniquename = 'FBgn0004372';
select feature.feature_id from feature 
  join feature_dbxref on feature.feature_id=feature_dbxref.feature_id 
  join dbxref on dbxref.dbxref_id=feature_dbxref.dbxref_id 
  join db on db.db_id=dbxref.db_id 
  where db.name = 'FlyBase Annotation IDs' 
  and dbxref.accession = 'CG2075';
3101873
FlyBase ID FBgn004372 Feature.Uniquename select * from feature where feature.uniquename = 'FBgn0004372';
 select feature.feature_id from feature 
 where feature.uniquename = 'FBgn0004372';
3101873
Symbol Synonym 143450_at, aly, CG2075, ms(3)s, ms(3)ry2 Synonym.name select synonym.* from synonym join feature_synonym on synonym.synonym_id = feature_synonym.synonym_id join feature on feature.feature_id=feature_synonym.feature_id where feature.uniquename = 'FBgn0004372';
select feature.feature_id from synonym 
  join feature_synonym on synonym.synonym_id = feature_synonym.synonym_id 
  join feature on feature.feature_id=feature_synonym.feature_id 
  where synonym.name = 'aly'; 
3101873,3101873,3101873,3101873,3101873,3101873,3101873,3101873,3101873,23209212,23209018,3101873 (why so many?)

Query design

A. Query for a gene id via any of its labels

 select feature.feature_id from feature 
  where feature.name = 'CG2075'
union
 select feature.feature_id from feature 
  join feature_dbxref on feature.feature_id=feature_dbxref.feature_id 
  join dbxref on dbxref.dbxref_id=feature_dbxref.dbxref_id 
  join db on db.db_id=dbxref.db_id 
  where db.name = 'FlyBase Annotation IDs' 
  and dbxref.accession = 'CG2075'
union
 select feature.feature_id from feature 
  where feature.uniquename = 'CG2075'
union
 select feature.feature_id from synonym 
  join feature_synonym on synonym.synonym_id = feature_synonym.synonym_id 
  join feature on feature.feature_id=feature_synonym.feature_id 
  where synonym.name = 'CG2075'; 

Result=3101873 NB can replace CG2075 (in all four places) with 'always early', 'aly' or 'FBgn0004372'

B. Select all "labels" of a gene, given its id

 select 'Symbol', feature.name as t from feature 
  where feature.feature_id=3101873
union 
 select 'Name', synonym.name as t from synonym 
  join feature_synonym on synonym.synonym_id = feature_synonym.synonym_id 
  join feature on feature.feature_id=feature_synonym.feature_id
  join cvterm on synonym.type_id = cvterm.cvterm_id 
  where cvterm.name='fullname' 
  and feature.feature_id=3101873
union
 select 'Annotation symbol', dbxref.accession as t from feature 
  join feature_dbxref on feature.feature_id=feature_dbxref.feature_id 
  join dbxref on dbxref.dbxref_id=feature_dbxref.dbxref_id 
  join db on db.db_id=dbxref.db_id 
  where db.name = 'FlyBase Annotation IDs' 
  and feature.feature_id=3101873
union
 select 'Flybase ID', feature.uniquename as t from feature 
  where feature.feature_id=3101873
union
 select 'synonym', synonym.name as t from synonym 
  join feature_synonym on synonym.synonym_id = feature_synonym.synonym_id 
  join feature on feature.feature_id=feature_synonym.feature_id 
  where feature.feature_id=3101873;

Result=

"Annotation symbol";"CG2075"
"Flybase ID";"FBgn0004372"
"Name";"always early"
"Symbol";"aly"
"synonym";"143450_at"
"synonym";"CG2075"
"synonym";"always early"
"synonym";"aly"
"synonym";"ms(3)2"
"synonym";"ms(3)ry2"


.. Can we add an EA E-R diagram covering the elements used here? (#g)

.. how to put them together!

.. maybe, need to create a single big join of all the relevant tables, pick out rows based on any of the possible match points, and finally select all the required results? (#g)

Personal tools
Oxford DMP online
MIIDI
Claros