Papers on the publication, management and integration of Drosophila genomic and associated data
FlyBase: integration and improvements in query tools (Wilson et al. 2008)
Nucleic Acids Research
Paper on recent developments in FlyBase user functionality, including QueryBuilder, Interactions Browser and enhanced GBrowse for multi-species comparison.
Abstract: FlyBase (http://flybase.org) is the primary resource for molecular and genetic information on the Drosophilidae. The database serves researchers of diverse backgrounds and interests, and offers several different query tools to provide efficient access to the data available and facilitate the discovery of significant relationships within the database. Recently, FlyBase has developed Interactions Browser and enhanced GBrowse, which are graphical query tools, and made improvements to the search tools QuickSearch and QueryBuilder. Furthermore, these search tools have been integrated with Batch Download and new analysis tools through a more flexible search results list, providing powerful ways of exploring the data in FlyBase.
REDfly 2.0: an integrated database of cis-regulatory modules and transcription factor binding sites in Drosophila (Halfon et al. 2008)
Nucleic Acids Research
Paper describing integration of two databases on Drosophila gene regulation. The integrated system includes expression pattern annotations for regulatory elements using the Drosophila anatomy ontology.
Abstract: The identification and study of the cis-regulatory elements that control gene expression are important areas of biological research, but few resources exist to facilitate large-scale bioinformatics studies of cis-regulation in metazoan species. Drosophila melanogaster, with its well-annotated genome, exceptional resources for comparative genomics and long history of experimental studies of transcriptional regulation, represents the ideal system for regulatory bioinformatics. We have merged two existing Drosophila resources, the REDfly database of cis-regulatory modules and the FlyReg database of transcription factor binding sites (TFBSs), into a single integrated database containing extensive annotation of empirically validated cis-regulatory modules and their constituent binding sites. With the enhanced functionality made possible through this integration of TFBS data into REDfly, together with additional improvements to the REDfly infrastructure, we have constructed a one-stop portal for Drosophila cis-regulatory data that will serve as a powerful resource for both computational and experimental studies of transcriptional regulation. REDfly is freely accessible at http://redfly.ccr.buffalo.edu.
Automated annotation of Drosophila gene expression patterns using a controlled vocabulary (Ji et al. 2008)
Paper evaluating approaches to automatic annotation of images of in situ mRNA hybridisation in embryos (from BDGP) with expression pattern terms from a controlled vocabulary; building on prior work on FlyExpress.
Abstract: MOTIVATION: Regulation of gene expression in space and time directs its localization to a specific subset of cells during development. Systematic determination of the spatiotemporal dynamics of gene expression plays an important role in understanding the regulatory networks driving development. An atlas for the gene expression patterns of fruit fly Drosophila melanogaster has been created by whole-mount in situ hybridization, and it documents the dynamic changes of gene expression pattern during Drosophila embryogenesis. The spatial and temporal patterns of gene expression are integrated by anatomical terms from a controlled vocabulary linking together intermediate tissues developed from one another. Currently, the terms are assigned to patterns manually. However, the number of patterns generated by high-throughput in situ hybridization is rapidly increasing. It is, therefore, tempting to approach this problem by employing computational methods. RESULTS: In this article, we present a novel computational framework for annotating gene expression patterns using a controlled vocabulary. In the currently available high-throughput data, annotation terms are assigned to groups of patterns rather than to individual images. We propose to extract invariant features from images, and construct pyramid match kernels to measure the similarity between sets of patterns. To exploit the complementary information conveyed by different features and incorporate the correlation among patterns sharing common structures, we propose efficient convex formulations to integrate the kernels derived from various features. The proposed framework is evaluated by comparing its annotation with that of human curators, and promising performance in terms of F1 score has been reported.
Global analysis of patterns of gene expression during Drosophila embryogenesis (Tomancak et al. 2007)
Paper on analysis of BDGP mRNA in situ and time course microarray data for Drosophila embryogenesis.
Abstract: BACKGROUND: Cell and tissue specific gene expression is a defining feature of embryonic development in multi-cellular organisms. However, the range of gene expression patterns, the extent of the correlation of expression with function, and the classes of genes whose spatial expression are tightly regulated have been unclear due to the lack of an unbiased, genome-wide survey of gene expression patterns. RESULTS: We determined and documented embryonic expression patterns for 6,003 (44%) of the 13,659 protein-coding genes identified in the Drosophila melanogaster genome with over 70,000 images and controlled vocabulary annotations. Individual expression patterns are extraordinarily diverse, but by supplementing qualitative in situ hybridization data with quantitative microarray time-course data using a hybrid clustering strategy, we identify groups of genes with similar expression. Of 4,496 genes with detectable expression in the embryo, 2,549 (57%) fall into 10 clusters representing broad expression patterns. The remaining 1,947 (43%) genes fall into 29 clusters representing restricted expression, 20% patterned as early as blastoderm, with the majority restricted to differentiated cell types, such as epithelia, nervous system, or muscle. We investigate the relationship between expression clusters and known molecular and cellular-physiological functions. CONCLUSION: Nearly 60% of the genes with detectable expression exhibit broad patterns reflecting quantitative rather than qualitative differences between tissues. The other 40% show tissue-restricted expression; the expression patterns of over 1,500 of these genes are documented here for the first time. Within each of these categories, we identified clusters of genes associated with particular cellular and developmental functions.
FlyMine: an integrated database for Drosophila and Anopheles genomics (Lyne et al. 2007)
Paper describing FlyMine functionality and overview of system architecture.
Abstract: FlyMine is a data warehouse that addresses one of the important challenges of modern biology: how to integrate and make use of the diversity and volume of current biological data. Its main focus is genomic and proteomics data for Drosophila and other insects. It provides web access to integrated data at a number of different levels, from simple browsing to construction of complex queries, which can be executed on either single items or lists.
A Chado case study: an ontology-based modular schema for representing genome-associated biological information (Mungall et al. 2007)
Paper describing the Chado schema, core modules, functions and design rationale.
Abstract: Chado is a relational database schema now being used to manage biological knowledge for a wide variety of organisms, from human to pathogens, especially the classes of information that directly or indirectly can be associated with genome sequences or the primary RNA and protein products encoded by a genome. Biological databases that conform to this schema can interoperate with one another, and with application software from the Generic Model Organism Database (GMOD) toolkit. Chado is distinctive because its design is driven by ontologies. The use of ontologies (or controlled vocabularies) is ubiquitous across the schema, as they are used as a means of typing entities. The Chado schema is partitioned into integrated subschemas (modules), each encapsulating a different biological domain, and each described using representations in appropriate ontologies. To illustrate this methodology, we describe here the Chado modules used for describing genomic sequences.
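Chado's central design idea, typing every entity with a term from an ontology, can be sketched with a heavily simplified subset of its sequence module. This is an illustrative sketch only: the column names (`cvterm`, `feature`, `type_id`) follow Chado conventions, but the real schema has many more tables and constraints, and the FlyBase identifier used here is just an example.

```python
import sqlite3

# Minimal, illustrative subset of Chado's sequence module: rows in `feature`
# are typed by ontology terms stored in `cvterm` (in real Chado, typically
# Sequence Ontology terms). The schema here is heavily reduced.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE cvterm (
    cvterm_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL                      -- e.g. a Sequence Ontology term
);
CREATE TABLE feature (
    feature_id INTEGER PRIMARY KEY,
    uniquename TEXT NOT NULL,               -- e.g. a FlyBase gene identifier
    type_id INTEGER NOT NULL REFERENCES cvterm(cvterm_id)
);
""")
cur.execute("INSERT INTO cvterm (cvterm_id, name) VALUES (1, 'gene')")
cur.execute("INSERT INTO feature (uniquename, type_id) VALUES ('FBgn0000490', 1)")

# Because types are ontology terms rather than hard-coded columns, one can
# query by biological kind with an ordinary join:
cur.execute("""
    SELECT f.uniquename FROM feature f
    JOIN cvterm t ON f.type_id = t.cvterm_id
    WHERE t.name = 'gene'
""")
print(cur.fetchall())  # [('FBgn0000490',)]
```

The design choice this illustrates is why Chado can cover new data types without schema changes: adding a new kind of feature means adding a `cvterm` row, not a table.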
Using FlyAtlas to identify better Drosophila melanogaster models of human disease (Chintapalli et al. 2007)
Paper describing results and a high-level analysis of FlyAtlas tissue-specific microarray data for 18,880 probe sets (covering ~13,500 genes).
Abstract: FlyAtlas, a new online resource, provides the most comprehensive view yet of expression in multiple tissues of Drosophila melanogaster. Meta-analysis of the data shows that a significant fraction of the genome is expressed with great tissue specificity in the adult, demonstrating the need for the functional genomic community to embrace a wide range of functional phenotypes. Well-known developmental genes are often reused in surprising tissues in the adult, suggesting new functions. The homologs of many human genetic disease loci show selective expression in the Drosophila tissues analogous to the affected human tissues, providing a useful filter for potential candidate genes. Additionally, the contributions of each tissue to the whole-fly array signal can be calculated, demonstrating the limitations of whole-organism approaches to functional genomics and allowing modeling of a simple tissue fractionation procedure that should improve detection of weak or tissue-specific signals.
Global Analysis of mRNA Localization Reveals a Prominent Role in Organizing Cellular Architecture and Function (Lécuyer et al. 2007)
Paper describing a high-resolution FISH study of 3,370 genes during early Drosophila embryogenesis, with analysis and interpretation of patterns of sub-cellular mRNA localisation; describes the data underlying Fly-FISH.
Abstract: Although subcellular mRNA trafficking has been demonstrated as a mechanism to control protein distribution, it is generally believed that most protein localization occurs subsequent to translation. To address this point, we developed and employed a high-resolution fluorescent in situ hybridization procedure to comprehensively evaluate mRNA localization dynamics during early Drosophila embryogenesis. Surprisingly, of the 3370 genes analyzed, 71% of those expressed encode subcellularly localized mRNAs. Dozens of new and striking localization patterns were observed, implying an equivalent variety of localization mechanisms. Tight correlations between mRNA distribution and subsequent protein localization and function, indicate major roles for mRNA localization in nucleating localized cellular machineries. A searchable web resource documenting mRNA expression and localization dynamics has been established and will serve as an invaluable tool for dissecting localization mechanisms and for predicting gene functions and interactions.
DroSpeGe: rapid access database for new Drosophila species genomes (Gilbert 2007)
Nucleic Acids Research
Paper describing the DroSpeGe online database providing access to sequence data for 12 Drosophila genomes; based on GMOD components including Chado and BioMart.
Abstract: The Drosophila species comparative genome database DroSpeGe (http://insects.eugenes.org/DroSpeGe/) has provided genome researchers with rapid, usable access to 12 new and old Drosophila genomes since its inception in 2004. Scientists can use, with minimal computing expertise, the wealth of new genome information for developing new insights into insect evolution. New genome assemblies provided by several sequencing centers have been annotated with known model organism gene homologies and gene predictions to provide basic comparative data. TeraGrid supplies the shared cyberinfrastructure for the primary computations. This genome database includes homologies to Drosophila melanogaster and eight other eukaryote model genomes, and gene predictions from several groups. BLAST searches of the newest assemblies are integrated with genome maps. GBrowse maps provide detailed views of cross-species aligned genomes. BioMart provides for data mining of annotations and sequences. Common chromosome maps identify major synteny among species. Potential gain and loss of genes is suggested by Gene Ontology groupings for genes of the new species. Summaries of essential genome statistics include sizes, genes found and predicted, homology among genomes, phylogenetic trees of species and comparisons of several gene predictions for sensitivity and specificity in finding new and known genes.
FlyBase: anatomical data, images and queries (Grumbling et al. 2006)
Nucleic Acids Research
Short paper describing phenotypic and expression pattern data in FlyBase, annotated with anatomy and developmental ontologies.
Abstract: FlyBase (http://flybase.org/) is a database of genetic and genomic data on the model organism Drosophila melanogaster and the entire insect family Drosophilidae. The FlyBase Consortium curates, annotates, integrates and maintains a wide variety of data within this domain. Access to the data is provided through graphical and textual user interfaces tailored to particular types of data. FlyBase data types include maps at the cytological, genetic and sequence levels, genes and alleles including their products, functions, expression patterns, mutant phenotypes and genetic interactions as well as aberrant chromosomes, annotated genomes, genetic stock collections, transposons, transgene constructs and insertions, anatomy and images, bibliographic data, and community contact information.
FlyExpress: an image-matching web-tool for finding genes with overlapping patterns of expression in Drosophila embryos (Van Emden et al. 2006)
Citation for the FlyExpress web tool, which supports searching BDGP in situ images by visual expression pattern.
FLIGHT: database and tools for the integration and cross-correlation of large-scale RNAi phenotypic datasets (Sims et al. 2006)
Nucleic Acids Research
Paper describing the FLIGHT database, a repository for the results of RNAi screens. FLIGHT also integrates microarray expression data for Drosophila cell lines commonly used in RNAi screens; gene identifiers from FlyBase, WormBase, SGD, MGD and Genew; protein data from UniProt; homology data from HomoloGene and InParanoid; and protein and gene interaction data from several sources, including FlyBase. Also described are phenotype annotation and clustering tools.
Abstract: FLIGHT (www.flight.licr.org) is a new database designed to help researchers browse and cross-correlate data from large-scale RNAi studies. To date, the majority of these functional genomic screens have been carried out using Drosophila cell lines. These RNAi screens follow 100 years of classical Drosophila genetics, but have already revealed their potential by ascribing an impressive number of functions to known and novel genes. This has in turn given rise to a pressing need for tools to simplify the analysis of the large amount of phenotypic information generated. FLIGHT aims to do this by providing users with a gene-centric view of screen results and by making it possible to cluster phenotypic data to identify genes with related functions. Additionally, FLIGHT provides microarray expression data for many of the Drosophila cell lines commonly used in RNAi screens. This, together with information about cell lines, protocols and dsRNA primer sequences, is intended to help researchers design their own cell-based screens. Finally, although the current focus of FLIGHT is Drosophila, the database has been designed to facilitate the comparison of functional data across species and to help researchers working with other systems navigate their way through the fly genome.
A database and tool, IM Browser, for exploring and integrating emerging gene and protein interaction data for Drosophila (Pacifico et al. 2006)
Paper describing incorporation of gene and protein interaction data from 6 sources into a single database, and development of a network visualisation tool.
Abstract: BACKGROUND: Biological processes are mediated by networks of interacting genes and proteins. Efforts to map and understand these networks are resulting in the proliferation of interaction data derived from both experimental and computational techniques for a number of organisms. The volume of this data combined with the variety of specific forms it can take has created a need for comprehensive databases that include all of the available data sets, and for exploration tools to facilitate data integration and analysis. One powerful paradigm for the navigation and analysis of interaction data is an interaction graph or map that represents proteins or genes as nodes linked by interactions. Several programs have been developed for graphical representation and analysis of interaction data, yet there remains a need for alternative programs that can provide casual users with rapid easy access to many existing and emerging data sets. DESCRIPTION: Here we describe a comprehensive database of Drosophila gene and protein interactions collected from a variety of sources, including low and high throughput screens, genetic interactions, and computational predictions. We also present a program for exploring multiple interaction data sets and for combining data from different sources. The program, referred to as the Interaction Map (IM) Browser, is a web-based application for searching and visualizing interaction data stored in a relational database system. Use of the application requires no downloads and minimal user configuration or training, thereby enabling rapid initial access to interaction data. IM Browser was designed to readily accommodate and integrate new types of interaction data as it becomes available. Moreover, all information associated with interaction measurements or predictions and the genes or proteins involved are accessible to the user. This allows combined searches and analyses based on either common or technique-specific attributes. 
The data can be visualized as an editable graph and all or part of the data can be downloaded for further analysis with other tools for specific applications. The database is available at http://proteome.wayne.edu/PIMdb.html CONCLUSION: The Drosophila Interactions Database described here places a variety of disparate data into one easily accessible location. The database has a simple structure that maintains all relevant information about how each interaction was determined. The IM Browser provides easy, complete access to this database and could readily be used to publish other sets of interaction data. By providing access to all of the available information from a variety of data types, the program will also facilitate advanced computational analyses.
FlyBase: genes and gene models (Drysdale et al. 2005)
Nucleic Acids Research
Short paper describing state of FlyBase in 2004, including classes of data, and different gene reports available.
Abstract: FlyBase (http://flybase.org) is the primary repository of genetic and molecular data of the insect family Drosophilidae. For the most extensively studied species, Drosophila melanogaster, a wide range of data are presented in integrated formats. Data types include mutant phenotypes, molecular characterization of mutant alleles and aberrations, cytological maps, wild-type expression patterns, anatomical images, transgenic constructs and insertions, sequence-level gene models and molecular classification of gene product functions. There is a growing body of data for other Drosophila species; this is expected to increase dramatically over the next year, with the completion of draft-quality genomic sequences of an additional 11 Drosophila species.
Recent advances in Drosophila genomics (Davis and White 2004)
Meeting report on the 45th annual Drosophila Research Conference. Discusses progress in genome annotation, systematic gene disruption; also highlights challenges of working across databases, and mentions progress on databases and data integration, highlighting FlyMine and FlyExpress.
Systematic determination of patterns of gene expression during Drosophila embryogenesis (Tomancak et al. 2002)
Paper describing initial study of expression of 2,179 genes in Drosophila embryogenesis, using in situ mRNA hybridisation complemented by time-course DNA microarrays; and development of online database for publication of images and associated data.
Abstract: (Background) Cell-fate specification and tissue differentiation during development are largely achieved by the regulation of gene transcription. (Results) As a first step to creating a comprehensive atlas of gene-expression patterns during Drosophila embryogenesis, we examined 2,179 genes by in situ hybridization to fixed Drosophila embryos. Of the genes assayed, 63.7% displayed dynamic expression patterns that were documented with 25,690 digital photomicrographs of individual embryos. The photomicrographs were annotated using controlled vocabularies for anatomical structures that are organized into a developmental hierarchy. We also generated a detailed time course of gene expression during embryogenesis using microarrays to provide an independent corroboration of the in situ hybridization results. All image, annotation and microarray data are stored in a publicly available database. We found that the RNA transcripts of about 1% of genes show clear subcellular localization. Nearly all the annotated expression patterns are distinct. We present an approach for organizing the data by hierarchical clustering of annotation terms that allows us to group tissues that express similar sets of genes as well as genes displaying similar expression patterns. (Conclusions) Analyzing gene-expression patterns by in situ hybridization to whole-mount embryos provides an extremely rich dataset that can be used to identify genes involved in developmental processes that have been missed by traditional genetic analysis. Systematic analysis of rigorously annotated patterns of gene expression will complement and extend the types of analyses carried out using expression microarrays.
Gene Expression During the Life Cycle of Drosophila melanogaster (Arbeitman et al. 2002)
Paper presenting data on the levels and profiles of expression of 4,028 genes across 66 sequential time periods beginning at fertilisation and spanning the embryonic, larval and pupal periods and the first 30 days of adulthood.
Abstract: Molecular genetic studies of Drosophila melanogaster have led to profound advances in understanding the regulation of development. Here we report gene expression patterns for nearly one-third of all Drosophila genes during a complete time course of development. Mutations that eliminate eye or germline tissue were used to further analyze tissue-specific gene expression programs. These studies define major characteristics of the transcriptional programs that underlie the life cycle, compare development in males and females, and show that large-scale gene expression data collected from whole animals can be used to identify genes expressed in particular tissues and organs or genes involved in specific biological and biochemical processes.
Standardizing data (NCB 2008)
Nature Cell Biology
Editorial describing need for greater funding and effort in data standardisation and interoperability.
Excerpt: Biological research is benefiting from an explosion of data. There is an urgent need to invest in bioinformatic infrastructure and education to interpret this data and guarantee its archiving. High-throughput research has helped fuel scientific progress at an unprecedented pace and left vast amounts of digital data in its wake. Even traditional hypothesis-driven research is now published at a rate that prohibits individuals from retaining the necessary overview. Bibliographic databases, such as PubMed, are key tools to navigate the information, but do not provide access to the primary data. The value of data is only as good as its annotation and accessibility: it must be properly curated and stored in machine-readable form in public databases. Indeed, the utility of data in the high-throughput age will depend on the establishment and long-term funding of an interlinked database infrastructure. It will equally depend on researchers contributing to and using these tools, as well as developing and adopting community-wide data standards. Several recently launched projects that aim to improve the value of data in the digital era are discussed below.
State of the nation in data integration for bioinformatics (Goble and Stevens 2008)
Paper providing an overview of current issues and trends; discusses approaches including SOA, link integration, data warehousing, view integration, model-driven SOA, integration applications, workflows, mashups and semantic mashups ("smashups"); emphasises identity management as the biggest challenge.
Abstract: Data integration is a perennial issue in bioinformatics, with many systems being developed and many technologies offered as a panacea for its resolution. The fact that it is still a problem indicates a persistence of underlying issues. Progress has been made, but we should ask "what lessons have been learnt?", and "what still needs to be done?" Semantic Web and Web 2.0 technologies are the latest to find traction within bioinformatics data integration. Now we can ask whether the Semantic Web, mashups, or their combination, have the potential to help. This paper is based on the opening invited talk by Carole Goble given at the Health Care and Life Sciences Data Integration for the Semantic Web Workshop collocated with WWW2007. The paper expands on that talk. We attempt to place some perspective on past efforts, highlight the reasons for success and failure, and indicate some pointers to the future.
Towards a cyberinfrastructure for the biological sciences: progress, visions and challenges (Stein 2008)
Nature Reviews Genetics
Review setting out the requirements for a "cyberinfrastructure" comprising data, computational, communication and human infrastructure; discusses the state of the art in current foci, including centralised online databases (e.g. GenBank), community annotation hubs (e.g. GeneRIF), bioinformatics toolkits (e.g. GMOD), web service links (e.g. BioMoby) and the Semantic Web (e.g. SSWAP); also discusses major projects including BIRN, caBIG and iPlant.
Abstract: Biology is an information-driven science. Large-scale data sets from genomics, physiology, population genetics and imaging are driving research at a dizzying rate. Simultaneously, interdisciplinary collaborations among experimental biologists, theorists, statisticians and computer scientists have become the key to making effective use of these data sets. However, too many biologists have trouble accessing and using these electronic data sets and tools effectively. A 'cyberinfrastructure' is a combination of databases, network protocols and computational services that brings people, information and computational tools together to perform science in this information-driven world. This article reviews the components of a biological cyberinfrastructure, discusses current and pending implementations, and notes the many challenges that lie ahead.
Semantic mashup of biomedical data (Cheung et al. 2008)
Journal of Biomedical Informatics
Guest editorial providing an introduction to the idea of "semantic mashups" and an overview of the articles in a special issue of the Journal of Biomedical Informatics on applications of Web 2.0 and Semantic Web ideas to biological data integration.
Integrating biological databases (Stein 2003)
Review of approaches to data integration; defines approaches including link integration, view integration, data warehousing; also discusses Web services, ontologies and globally unique identifiers, and proposes a novel "knuckles and nodes" approach.
Abstract: Recent years have seen an explosion in the amount of available biological data. More and more genomes are being sequenced and annotated, and protein and gene interaction data are accumulating. Biological databases have been invaluable for managing these data and for making them accessible. Depending on the data that they contain, the databases fulfil different functions. But, although they are architecturally similar, so far their integration has proved problematic.
Technologies for integrating biological data (Wong 2002)
Paper stating requirements for general data integration platform and reviewing some existing platforms including Ensembl, GenoMax, SRS, DiscoveryLink, OPM, Kleisli and XML.
Abstract: The process of building a new database relevant to some field of study in biomedicine involves transforming, integrating and cleansing multiple data sources, as well as adding new material and annotations. This paper reviews some of the requirements of a general solution to this data integration problem. Several representative technologies and approaches to data integration in biomedicine are surveyed. Then some interesting features that separate the more general data integration technologies from the more specialised ones are highlighted.
Creating a bioinformatics nation (Stein 2002)
Short commentary article discussing the pains of screen scraping, the virtues of the Bio* projects, and the promise of Web services; also proposes a pragmatic code of conduct for biological data providers.
Papers on multi-organism and non-Drosophila bioinformatics databases, data warehousing, data mining and data integration
NCBI GEO: archive for high-throughput functional genomic data (Barrett et al. 2009)
Nucleic Acids Research
Paper on GEO, largely the same as Barrett et al. (2007) but emphasising the new range of data types being submitted beyond gene expression levels, including aCGH, ChIP-chip and high-throughput second-generation sequencing data.
Abstract: The Gene Expression Omnibus (GEO) at the National Center for Biotechnology Information (NCBI) is the largest public repository for high-throughput gene expression data. Additionally, GEO hosts other categories of high-throughput functional genomic data, including those that examine genome copy number variations, chromatin structure, methylation status and transcription factor binding. These data are generated by the research community using high-throughput technologies like microarrays and, more recently, next-generation sequencing. The database has a flexible infrastructure that can capture fully annotated raw and processed data, enabling compliance with major community-derived scientific reporting standards such as ‘Minimum Information About a Microarray Experiment’ (MIAME). In addition to serving as a centralized data storage hub, GEO offers many tools and features that allow users to effectively explore, analyze and download expression data from both gene-centric and experiment-centric perspectives. This article summarizes the GEO repository structure, content and operating procedures, as well as recently introduced data mining features. GEO is freely accessible at http://www.ncbi.nlm.nih.gov/geo/.
Database resources of the National Center for Biotechnology Information (Sayers et al. 2009)
Nucleic Acids Research
Paper giving an overview of current databases and services provided by NCBI, including Entrez, BLAST, PubMed, Entrez Gene, UniGene, HomoloGene, dbGaP, dbSNP, GEO and GENSAT. Very similar to Wheeler et al. 2007.
Abstract: In addition to maintaining the GenBank® nucleic acid sequence database, the National Center for Biotechnology Information (NCBI) provides analysis and retrieval resources for the data in GenBank and other biological data made available through the NCBI web site. NCBI resources include Entrez, the Entrez Programming Utilities, MyNCBI, PubMed, PubMed Central, Entrez Gene, the NCBI Taxonomy Browser, BLAST, BLAST Link (BLink), Electronic PCR, OrfFinder, Spidey, Splign, RefSeq, UniGene, HomoloGene, ProtEST, dbMHC, dbSNP, Cancer Chromosomes, Entrez Genomes and related tools, the Map Viewer, Model Maker, Evidence Viewer, Clusters of Orthologous Groups (COGs), Retroviral Genotyping Tools, HIV-1/Human Protein Interaction Database, Gene Expression Omnibus (GEO), Entrez Probe, GENSAT, Online Mendelian Inheritance in Man (OMIM), Online Mendelian Inheritance in Animals (OMIA), the Molecular Modeling Database (MMDB), the Conserved Domain Database (CDD), the Conserved Domain Architecture Retrieval Tool (CDART) and the PubChem suite of small molecule databases. Augmenting many of the web applications is custom implementation of the BLAST program optimized to search specialized data sets. All of the resources can be accessed through the NCBI home page at www.ncbi.nlm.nih.gov.
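The Entrez Programming Utilities listed in the abstract are accessed over plain HTTP: each tool (ESearch, EFetch, ESummary, ...) takes a database name and query parameters in the URL. A minimal sketch of building an ESearch request follows; the base URL and the `db`/`term`/`retmax` parameter names are from NCBI's E-utilities interface, while the example search term is just an illustration.

```python
from urllib.parse import urlencode

# NCBI E-utilities base URL; each tool is a CGI endpoint beneath it.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build an ESearch request URL for Entrez database `db` and query `term`."""
    params = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{EUTILS_BASE}/esearch.fcgi?{params}"


# Example: search Entrez Gene for Drosophila melanogaster records.
# (Fetching the URL returns XML containing matching UIDs, which can then be
# passed to efetch.fcgi or esummary.fcgi for full records.)
url = esearch_url("gene", "Drosophila melanogaster[Organism]")
print(url)
```

In practice the returned UID list is piped into a second request, which is how the Entrez services compose into simple retrieval pipelines.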
BioMart - biological queries made easy (Smedley et al. 2009)
Paper describing functionality of BioMart for queries across genomic databases; describes use of the MartView Web interface, the MartServices REST/XML protocol, and other integrations.
Abstract: (Background) Biologists need to perform complex queries, often across a variety of databases. Typically, each data resource provides an advanced query interface, each of which must be learnt by the biologist before they can begin to query them. Frequently, more than one data source is required and for high-throughput analysis, cutting and pasting results between websites is certainly very time consuming. Therefore, many groups rely on local bioinformatics support to process queries by accessing the resource's programmatic interfaces if they exist. This is not an efficient solution in terms of cost and time. Instead, it would be better if the biologist only had to learn one generic interface. BioMart provides such a solution. (Results) BioMart enables scientists to perform advanced querying of biological data sources through a single web interface. The power of the system comes from integrated querying of data sources regardless of their geographical locations. Once these queries have been defined, they may be automated with its "scripting at the click of a button" functionality. BioMart's capabilities are extended by integration with several widely used software packages such as BioConductor, DAS, Galaxy, Cytoscape, Taverna. In this paper, we describe all aspects of BioMart from a user's perspective and demonstrate how it can be used to solve real biological use cases such as SNP selection for candidate gene screening or annotation of microarray results. (Conclusions) BioMart is an easy to use, generic and scalable system and therefore, has become an integral part of large data resources including Ensembl, UniProt, HapMap, Wormbase, Gramene, Dictybase, PRIDE, MSD and Reactome. BioMart is freely accessible to use at www.biomart.org.
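The MartService protocol mentioned above accepts an XML query document submitted as a form parameter. A minimal sketch of building such a document in Python (element and attribute names follow published BioMart examples; the dataset, filter and attribute values are illustrative assumptions, and no request is actually sent):

```python
import urllib.parse
import xml.etree.ElementTree as ET

# Build a BioMart-style XML query document. Element/attribute names follow
# published BioMart examples; the specific dataset, filter and attribute
# names below are illustrative assumptions.
def biomart_query_xml(dataset, filters, attributes):
    query = ET.Element("Query", virtualSchemaName="default",
                       formatter="TSV", header="0")
    ds = ET.SubElement(query, "Dataset", name=dataset)
    for name, value in filters.items():
        ET.SubElement(ds, "Filter", name=name, value=value)
    for name in attributes:
        ET.SubElement(ds, "Attribute", name=name)
    return ET.tostring(query, encoding="unicode")

xml = biomart_query_xml(
    "hsapiens_gene_ensembl",
    {"chromosome_name": "21"},
    ["ensembl_gene_id", "hgnc_symbol"],
)
# The document would be POSTed as a 'query=<xml>' form parameter
# to a martservice endpoint; here we only prepare the payload.
payload = urllib.parse.urlencode({"query": xml})
print(payload[:40])
```

The same query-document shape underlies scripted access from the packages listed in the abstract (Galaxy, Taverna and so on), which is what makes "scripting at the click of a button" possible.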
GEOmetadb: powerful alternative search engine for the NCBI Gene Expression Omnibus (Zhu et al. 2008)
Bioinformatics (data and text mining)
Short note on work to make GEO metadata queryable from Bioconductor via a SQLite export of the parsed GEO metadata, together with a Web search interface to the same functionality.
Abstract: The NCBI Gene Expression Omnibus (GEO) represents the largest public repository of microarray data. However, finding data in GEO can be challenging. We have developed GEOmetadb in an attempt to make querying the GEO metadata both easier and more powerful. All GEO metadata records as well as the relationships between them are parsed and stored in a local MySQL database. A powerful, flexible web search interface with several convenient utilities provides query capabilities not available via NCBI tools. In addition, a Bioconductor package, GEOmetadb, that utilizes a SQLite export of the entire GEOmetadb database is also available, rendering the entire GEO database accessible with full power of SQL-based queries from within R. Availability: The web interface and SQLite databases are available at http://gbnci.abcc.ncifcrf.gov/geo/. The Bioconductor package is available via the Bioconductor project. The corresponding MATLAB implementation is also available at the same website.
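A rough sketch of the kind of SQL query the SQLite export enables, here run against a tiny in-memory stand-in rather than the real GEOmetadb.sqlite dump (the `gse` table and its columns follow the published schema, but treat them as assumptions):

```python
import sqlite3

# In-memory stand-in for the GEOmetadb SQLite export; with the real dump
# you would connect to the downloaded GEOmetadb.sqlite file instead.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE gse (gse TEXT, title TEXT, pubmed_id TEXT)")
con.executemany(
    "INSERT INTO gse VALUES (?, ?, ?)",
    [("GSE1000", "Yeast heat shock time course", "15000000"),
     ("GSE1001", "Mouse brain expression atlas", None)],
)

def series_matching(con, keyword):
    """Return (accession, title) for series whose title contains keyword."""
    cur = con.execute(
        "SELECT gse, title FROM gse WHERE title LIKE ?", (f"%{keyword}%",)
    )
    return cur.fetchall()

print(series_matching(con, "brain"))  # [('GSE1001', 'Mouse brain expression atlas')]
```

This is the query style that is awkward through the NCBI web tools but trivial once the metadata sit in a relational store.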
NCBI GEO: mining tens of millions of expression profiles--database and tools update (Barrett et al. 2007)
Nucleic Acids Research
Paper giving overview of the NCBI GEO database, and update on new features. Tools for retrieving and visualising data include a profile chart for expression of each gene within a dataset, and a precomputed interactive hierarchical cluster heat map for each dataset record.
Abstract: The Gene Expression Omnibus (GEO) repository at the National Center for Biotechnology Information (NCBI) archives and freely disseminates microarray and other forms of high-throughput data generated by the scientific community. The database has a minimum information about a microarray experiment (MIAME)-compliant infrastructure that captures fully annotated raw and processed data. Several data deposit options and formats are supported, including web forms, spreadsheets, XML and Simple Omnibus Format in Text (SOFT). In addition to data storage, a collection of user-friendly web-based interfaces and applications are available to help users effectively explore, visualize and download the thousands of experiments and tens of millions of gene expression patterns stored in GEO. This paper provides a summary of the GEO database structure and user facilities, and describes recent enhancements to database design, performance, submission format options, data query and retrieval utilities. GEO is accessible at http://www.ncbi.nlm.nih.gov/geo/
Database resources of the National Center for Biotechnology Information (Wheeler et al. 2007)
Nucleic Acids Research
Paper giving overview of all NCBI services as of 2007.
Abstract: In addition to maintaining the GenBank® nucleic acid sequence database, the National Center for Biotechnology Information (NCBI) provides analysis and retrieval resources for the data in GenBank and other biological data made available through NCBI's Web site. NCBI resources include Entrez, the Entrez Programming Utilities, My NCBI, PubMed, PubMed Central, Entrez Gene, the NCBI Taxonomy Browser, BLAST, BLAST Link(BLink), Electronic PCR, OrfFinder, Spidey, Splign, RefSeq, UniGene, HomoloGene, ProtEST, dbMHC, dbSNP, Cancer Chromosomes, Entrez Genome, Genome Project and related tools, the Trace and Assembly Archives, the Map Viewer, Model Maker, Evidence Viewer, Clusters of Orthologous Groups (COGs), Viral Genotyping Tools, Influenza Viral Resources, HIV-1/Human Protein Interaction Database, Gene Expression Omnibus (GEO), Entrez Probe, GENSAT, Online Mendelian Inheritance in Man (OMIM), Online Mendelian Inheritance in Animals (OMIA), the Molecular Modeling Database (MMDB), the Conserved Domain Database (CDD), the Conserved Domain Architecture Retrieval Tool (CDART) and the PubChem suite of small molecule databases. Augmenting many of the Web applications are custom implementations of the BLAST program optimized to search specialized data sets. These resources can be accessed through the NCBI home page at www.ncbi.nlm.nih.gov.
Mining Microarray Data at NCBI's Gene Expression Omnibus (GEO) (Barrett and Edgar 2006)
Methods in Molecular Biology
Paper describing GEO, similar to Barrett et al. 2007 but goes into more detail on search, visualisation and analysis tools, e.g. profile neighbours, query mean group A vs B, cluster heat maps.
Abstract: The Gene Expression Omnibus (GEO) at the National Center for Biotechnology Information (NCBI) has emerged as the leading fully public repository for gene expression data. This chapter describes how to use Web-based interfaces, applications, and graphics to effectively explore, visualize, and interpret the hundreds of microarray studies and millions of gene expression patterns stored in GEO. Data can be examined from both experiment-centric and gene-centric perspectives using user-friendly tools that do not require specialized expertise in microarray analysis or time-consuming download of massive data sets. The GEO database is publicly accessible through the World Wide Web at http://www.ncbi.nlm.nih.gov/geo.
NCBI GEO standards and services for microarray data (Edgar and Barrett 2006)
Letter to the editor, describing use and implementation of MIAME standard in the NCBI GEO database.
EnsMart: a generic system for fast and flexible access to biological data (Kasprzyk et al. 2004)
Paper describing the EnsMart data warehouse system, and its application to the integration of data from Ensembl databases, OMIM, dbSNP, FlyBase and others. Data is first imported into a staging area in its native schema, then transformed to a common schema via Perl scripts. The MartView Web interface provides a REST API for URL-based queries.
Abstract: The EnsMart system (www.ensembl.org/EnsMart) provides a generic data warehousing solution for fast and flexible querying of large biological data sets and integration with third-party data and tools. The system consists of a query-optimized database and interactive, user-friendly interfaces. EnsMart has been applied to Ensembl, where it extends its genomic browser capabilities, facilitating rapid retrieval of customized data sets. A wide variety of complex queries, on various types of annotations, for numerous species are supported. These can be applied to many research problems, ranging from SNP selection for candidate gene screening, through cross-species evolutionary comparisons, to microarray annotation. Users can group and refine biological data according to many criteria, including cross-species analyses, disease links, sequence variations, and expression patterns. Both tabulated list data and biological sequence output can be generated dynamically, in HTML, text, Microsoft Excel, and compressed formats. A wide range of sequence types, such as cDNA, peptides, coding regions, UTRs, and exons, with additional upstream and downstream regions, can be retrieved. The EnsMart database can be accessed via a public Web site, or through a Java application suite. Both implementations and the database are freely available for local installation, and can be extended or adapted to 'non-Ensembl' data sets.
Gene Expression Omnibus: NCBI gene expression and hybridization array data repository (Edgar et al. 2002)
Nucleic Acids Research
Paper on original design of GEO service; provides good description of underlying data model, and three main entity classes: platform, sample and series.
Abstract: The Gene Expression Omnibus (GEO) project was initiated in response to the growing demand for a public repository for high-throughput gene expression data. GEO provides a flexible and open design that facilitates submission, storage and retrieval of heterogeneous data sets from high-throughput gene expression and genomic hybridization experiments. GEO is not intended to replace in house gene expression databases that benefit from coherent data sets, and which are constructed to facilitate a particular analytic method, but rather complement these by acting as a tertiary, central data distribution hub. The three central data entities of GEO are platforms, samples and series, and were designed with gene expression and genomic hybridization experiments in mind. A platform is, essentially, a list of probes that define what set of molecules may be detected. A sample describes the set of molecules that are being probed and references a single platform used to generate its molecular abundance data. A series organizes samples into the meaningful data sets which make up an experiment. The GEO repository is publicly accessible through the World Wide Web at http://www.ncbi.nlm.nih.gov/geo.
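The platform/sample/series model described in the abstract can be sketched as plain data structures (accessions and field names here are illustrative, not GEO's actual schema):

```python
from dataclasses import dataclass, field

# Sketch of GEO's three central entities: a platform lists the probes an
# array can detect, each sample references a single platform, and a series
# groups samples into an experiment. Field names are assumptions.

@dataclass
class Platform:
    accession: str            # e.g. "GPL96"
    probes: list[str]         # probe identifiers on the array

@dataclass
class Sample:
    accession: str            # e.g. "GSM1234"
    platform: Platform        # each sample references exactly one platform
    values: dict[str, float]  # probe id -> measured abundance

@dataclass
class Series:
    accession: str            # e.g. "GSE42"
    samples: list[Sample] = field(default_factory=list)

gpl = Platform("GPL1", ["probe_a", "probe_b"])
gsm = Sample("GSM1", gpl, {"probe_a": 7.2, "probe_b": 5.1})
gse = Series("GSE1", [gsm])
assert gse.samples[0].platform.accession == "GPL1"
```

The one-platform-per-sample reference and the series-as-grouping are the two constraints the paper calls out as designed around gene expression and genomic hybridization experiments.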
Web Services, Work Flows, ...
Moby and Moby 2: Creatures of the Deep (Web) (Vandervalk et al. 2009)
Briefings in Bioinformatics
Briefing paper describing problems with current approaches to data integration based on Web services, and presenting Moby 2 approach using SPARQL to virtualise access to a range of computing resources.
Abstract: Facile and meaningful integration of data from disparate resources is the ‘holy grail’ of bioinformatics. Some resources have begun to address this problem by providing their data using Semantic Web standards, specifically the Resource Description Framework (RDF) and the Web Ontology Language (OWL). Unfortunately, adoption of Semantic Web standards has been slow overall, and even in cases where the standards are being utilized, interconnectivity between resources is rare. In response, we have seen the emergence of centralized ‘semantic warehouses’ that collect public data from third parties, integrate it, translate it into OWL/RDF and provide it to the community as a unified and queryable resource. One limitation of the warehouse approach is that queries are confined to the resources that have been selected for inclusion. A related problem, perhaps of greater concern, is that the majority of bioinformatics data exists in the ‘Deep Web’—that is, the data does not exist until an application or analytical tool is invoked, and therefore does not have a predictable Web address. The inability to utilize Uniform Resource Identifiers (URIs) to address this data is a barrier to its accessibility via URI-centric Semantic Web technologies. Here we examine ‘The State of the Union’ for the adoption of Semantic Web standards in the health care and life sciences domain by key bioinformatics resources, explore the nature and connectivity of several community-driven semantic warehousing projects, and report on our own progress with the CardioSHARE/Moby-2 project, which aims to make the resources of the Deep Web transparently accessible through SPARQL queries.
Interoperability with Moby 1.0--it's better than sharing your toothbrush! (BioMoby Consortium 2008)
Paper describing first major release of a platform for semi-automatic Web service composition.
Abstract: The BioMoby project was initiated in 2001 from within the model organism database community. It aimed to standardize methodologies to facilitate information exchange and access to analytical resources, using a consensus driven approach. Six years later, the BioMoby development community is pleased to announce the release of the 1.0 version of the interoperability framework, registry Application Programming Interface and supporting Perl and Java code-bases. Together, these provide interoperable access to over 1400 bioinformatics resources worldwide through the BioMoby platform, and this number continues to grow. Here we highlight and discuss the features of BioMoby that make it distinct from other Semantic Web Service and interoperability initiatives, and that have been instrumental to its deployment and use by a wide community of bioinformatics service providers. The standard, client software, and supporting code libraries are all freely available at http://www.biomoby.org/.
Performing statistical analyses on quantitative data in Taverna workflows: an example using R and maxdBrowse to identify differentially-expressed genes from microarray data (Li et al. 2008)
Paper describing use of Taverna to construct a workflow combining Web services with R scripts to fetch and analyse microarray data on yeast.
Abstract: BACKGROUND: There has been a dramatic increase in the amount of quantitative data derived from the measurement of changes at different levels of biological complexity during the post-genomic era. However, there are a number of issues associated with the use of computational tools employed for the analysis of such data. For example, computational tools such as R and MATLAB require prior knowledge of their programming languages in order to implement statistical analyses on data. Combining two or more tools in an analysis may also be problematic since data may have to be manually copied and pasted between separate user interfaces for each tool. Furthermore, this transfer of data may require a reconciliation step in order for there to be interoperability between computational tools. RESULTS: Developments in the Taverna workflow system have enabled pipelines to be constructed and enacted for generic and ad hoc analyses of quantitative data. Here, we present an example of such a workflow involving the statistical identification of differentially-expressed genes from microarray data followed by the annotation of their relationships to cellular processes. This workflow makes use of customised maxdBrowse web services, a system that allows Taverna to query and retrieve gene expression data from the maxdLoad2 microarray database. These data are then analysed by R to identify differentially-expressed genes using the Taverna RShell processor which has been developed for invoking this tool when it has been deployed as a service using the RServe library. In addition, the workflow uses Beanshell scripts to reconcile mismatches of data between services as well as to implement a form of user interaction for selecting subsets of microarray data for analysis as part of the workflow execution. A new plugin system in the Taverna software architecture is demonstrated by the use of renderers for displaying PDF files and CSV formatted data within the Taverna workbench. 
CONCLUSION: Taverna can be used by data analysis experts as a generic tool for composing ad hoc analyses of quantitative data by combining the use of scripts written in the R programming language with tools exposed as services in workflows. When these workflows are shared with colleagues and the wider scientific community, they provide an approach for other scientists wanting to use tools such as R without having to learn the corresponding programming language to analyse their own data.
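The fetch, analyse, annotate shape of such a workflow can be sketched as composed functions (the expression values, cutoff and annotation table below are toy stand-ins for maxdBrowse, the R analysis and the annotation services used in the actual workflow):

```python
import statistics

# Toy pipeline mirroring the workflow's stages. All data and thresholds
# are invented for illustration.
def fetch_expression():
    # gene -> (control replicates, treatment replicates), log-scale values
    return {
        "YFL039C": ([5.0, 5.2, 5.1], [5.1, 5.0, 5.2]),
        "YGR192C": ([4.0, 4.1, 3.9], [8.0, 8.2, 7.9]),
    }

def differentially_expressed(data, min_log_ratio=1.0):
    """Naive stand-in for the statistical test run in R."""
    hits = []
    for gene, (ctrl, treat) in data.items():
        diff = statistics.mean(treat) - statistics.mean(ctrl)
        if abs(diff) >= min_log_ratio:
            hits.append(gene)
    return hits

def annotate(genes, table={"YGR192C": "glycolysis"}):
    """Stand-in for the annotation step linking genes to processes."""
    return {g: table.get(g, "unknown") for g in genes}

hits = differentially_expressed(fetch_expression())
print(annotate(hits))  # {'YGR192C': 'glycolysis'}
```

Taverna's contribution is that each of these stages can be a remote service or an RShell script, with Beanshell glue reconciling data formats between them rather than hand-written plumbing like this.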
myGrid and UTOPIA: An Integrated Approach to Enacting and Visualising in Silico Experiments in the Life Sciences (Pettifer et al. 2007)
Paper describing combined use of Taverna work flow design and enactment environment and UTOPIA visualisation system to determine candidate genes for further study of Graves Disease, via microarray data; emphasises role of myGrid ontology in interoperability between the two systems.
Abstract: In silico experiments have hitherto required ad hoc collections of scripts and programs to process and visualise biological data, consuming substantial amounts of time and effort to build, and leading to tools that are difficult to use, are architecturally fragile and scale poorly. With examples of the systems applied to real biological problems, we describe two complementary software frameworks that address this problem in a principled manner: myGrid/Taverna, a workflow design and enactment environment enabling coherent experiments to be built, and UTOPIA, a flexible visualisation system to aid in examining experimental results.
Taverna: a tool for building and running workflows of services (Hull et al. 2006)
Nucleic Acids Research
Short paper describing Taverna: overview of limitations of Web services; example workflow for DNA sequence analysis using GenBank, RepeatMasker, GenScan and BLASTp.
Abstract: Taverna is an application that eases the use and integration of the growing number of molecular biology tools and databases available on the web, especially web services. It allows bioinformaticians to construct workflows or pipelines of services to perform a range of different analyses, such as sequence analysis and genome annotation. These high-level workflows can integrate many different resources into a single analysis. Taverna is available freely under the terms of the GNU Lesser General Public License (LGPL) from http://taverna.sourceforge.net/.
UTOPIA - user-friendly tools for operating informatics applications (Pettifer et al. 2004)
Comparative and Functional Genomics
Paper on UTOPIA workbench, server and development kit; use case for sequence-based informatics.
Abstract: Bioinformaticians routinely analyse vast amounts of information held both in large remote databases and in flat data files hosted on local machines. The contemporary toolkit available for this purpose consists of an ad hoc collection of data manipulation tools, scripting languages and visualization systems; these must often be combined in complex and bespoke ways, the result frequently being an unwieldy artefact capable of one specific task, which cannot easily be exploited or extended by other practitioners. Owing to the sizes of current databases and the scale of the analyses necessary, routine bioinformatics tasks are often automated, but many still require the unique experience and intuition of human researchers: this requires tools that support real-time interaction with complex datasets. Many existing tools have poor user interfaces and limited real-time performance when applied to realistically large datasets; much of the user's cognitive capacity is therefore focused on controlling the tool rather than on performing the research. The UTOPIA project is addressing some of these issues by building reusable software components that can be combined to make useful applications in the field of bioinformatics. Expertise in the fields of human computer interaction, high-performance rendering, and distributed systems is being guided by bioinformaticians and end-user biologists to create a toolkit that is both architecturally sound from a computing point of view, and directly addresses end-user and application-developer requirements.
Semantic Web, RDF, OWL, SPARQL, HCLSIG ...
A prototype knowledge base for the life sciences (Marshall et al. 2008)
W3C Interest Group Note
Note describing use of RDF, OWL and SPARQL to construct a prototype knowledge base for the HCLSIG, integrating data relevant to Alzheimer's Disease, including the Allen Brain Atlas (image database of gene expression in the mouse brain), NCBI gene info and GO annotations, HomoloGene and PubMed.
Abstract: The prototype we describe is a biomedical knowledge base, constructed for a demonstration at Banff WWW2007 , that integrates 15 distinct data sources using currently available Semantic Web technologies such as the W3C standard Web Ontology Language [OWL] and Resource Description Framework [RDF]. This report outlines which resources were integrated, how the knowledge base was constructed using free and open source triple store technology, how it can be queried using the W3C Recommended RDF query language SPARQL [SPARQL], and what resources and inferences are involved in answering complex queries. While the utility of the knowledge base is illustrated by identifying a set of genes involved in Alzheimer's Disease, the approach described here can be applied to any use case that integrates data from multiple domains.
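The triple-store-plus-SPARQL pattern underlying the knowledge base can be illustrated with a toy matcher over (subject, predicate, object) triples (all identifiers below are invented for illustration; a real deployment would use an RDF triple store and a SPARQL engine):

```python
# Facts as (subject, predicate, object) triples, standing in for RDF.
triples = {
    ("gene:APP",   "associatedWith", "disease:Alzheimers"),
    ("gene:PSEN1", "associatedWith", "disease:Alzheimers"),
    ("gene:APP",   "expressedIn",    "region:hippocampus"),
}

def match(pattern, triples):
    """Return variable bindings; variables are strings starting with '?'."""
    results = []
    for triple in triples:
        binding = {}
        ok = True
        for want, got in zip(pattern, triple):
            if want.startswith("?"):
                binding[want] = got
            elif want != got:
                ok = False
                break
        if ok:
            results.append(binding)
    return results

# Analogue of: SELECT ?g WHERE { ?g associatedWith disease:Alzheimers }
genes = {b["?g"] for b in match(("?g", "associatedWith", "disease:Alzheimers"), triples)}
print(sorted(genes))  # ['gene:APP', 'gene:PSEN1']
```

The prototype's complex queries are essentially conjunctions of such patterns spanning the 15 integrated sources, plus OWL-backed inference that a hand-rolled matcher like this cannot provide.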
Bio2RDF: Towards a mashup to build bioinformatics knowledge systems (Belleau et al. 2008)
Journal of Biomedical Informatics
Paper describing the conversion of various data sources to RDF (including MGI, HGNC, Kegg, Entrez Gene, OMIM, PDB, ChEBI, Reactome, Prosite, PubMed, GenBank, PubChem), the publication of these data on the Web as linked data, and the use of a data crawler to discover genes potentially relevant to Parkinson's disease.
Abstract: Presently, there are numerous bioinformatics databases available on different websites. Although RDF was proposed as a standard format for the web, these databases are still available in various formats. With the increasing popularity of the semantic web technologies and the ever growing number of databases in bioinformatics, there is a pressing need to develop mashup systems to help the process of bioinformatics knowledge integration. Bio2RDF is such a system, built from rdfizer programs written in JSP, the Sesame open source triplestore technology and an OWL ontology. With Bio2RDF, documents from public bioinformatics databases such as Kegg, PDB, MGI, HGNC and several of NCBI's databases can now be made available in RDF format through a unique URL in the form of http://bio2rdf.org/namespace:id. The Bio2RDF project has successfully applied the semantic web technology to publicly available databases by creating a knowledge space of RDF documents linked together with normalized URIs and sharing a common ontology. Bio2RDF is based on a three-step approach to build mashups of bioinformatics data. The present article details this new approach and illustrates the building of a mashup used to explore the implication of four transcription factor genes in Parkinson's disease. The Bio2RDF repository can be queried at http://bio2rdf.org.
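The normalized URI scheme described in the abstract is simple enough to sketch directly (the helper name and the example identifier are illustrative):

```python
# Sketch of Bio2RDF's normalized addressing, where a record from any
# integrated source is reachable as http://bio2rdf.org/namespace:id.
# The function name and example values are illustrative.
def bio2rdf_uri(namespace: str, identifier: str) -> str:
    return f"http://bio2rdf.org/{namespace}:{identifier}"

print(bio2rdf_uri("omim", "168600"))  # http://bio2rdf.org/omim:168600
```

Sharing one URI shape across all rdfized sources is what lets the crawler follow links between documents from otherwise unrelated databases.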
An ontology-driven semantic mashup of gene and biological pathway information: application to the domain of nicotine dependence (Sahoo et al. 2008)
Journal of Biomedical Informatics
Paper on construction of a small knowledge base of pathway-related information for a set of genes implicated in nicotine dependence, integrating data via RDF from Entrez Gene, HomoloGene, Kegg, Reactome and BioCyc; emphasises value of reasoning.
Abstract: OBJECTIVES: This paper illustrates how Semantic Web technologies (especially RDF, OWL, and SPARQL) can support information integration and make it easy to create semantic mashups (semantically integrated resources). In the context of understanding the genetic basis of nicotine dependence, we integrate gene and pathway information and show how three complex biological queries can be answered by the integrated knowledge base. METHODS: We use an ontology-driven approach to integrate two gene resources (Entrez Gene and HomoloGene) and three pathway resources (KEGG, Reactome and BioCyc), for five organisms, including humans. We created the Entrez Knowledge Model (EKoM), an information model in OWL for the gene resources, and integrated it with the extant BioPAX ontology designed for pathway resources. The integrated schema is populated with data from the pathway resources, publicly available in BioPAX-compatible format, and gene resources for which a population procedure was created. The SPARQL query language is used to formulate queries over the integrated knowledge base to answer the three biological queries. RESULTS: Simple SPARQL queries could easily identify hub genes, i.e., those genes whose gene products participate in many pathways or interact with many other gene products. The identification of the genes expressed in the brain turned out to be more difficult, due to the lack of a common identification scheme for proteins. CONCLUSION: Semantic Web technologies provide a valid framework for information integration in the life sciences. Ontology-driven integration represents a flexible, sustainable and extensible solution to the integration of large volumes of information. Additional resources, which enable the creation of mappings between information sources, are required to compensate for heterogeneity across namespaces. RESOURCE PAGE: http://knoesis.wright.edu/research/lifesci/integration/structured_data/JBI-2008/
Identifying disease-causal genes using Semantic Web-based representation of integrated genomic and phenomic knowledge (Gudivada et al. 2008)
Journal of Biomedical Informatics
Paper describing use of graph-theoretic algorithm similar to pagerank to prioritise candidate genes for various human diseases; RDF was used to create integrated graph of genomic and phenotypic data from GO, KEGG, BioCarta, BioCyc, Reactome, MGI, OMIM and Entrez Gene.
Abstract: Most common chronic diseases are caused by the interactions of multiple factors including the influences and responses of susceptibility and modifier genes that are themselves subject to etiologic events, interactions, and environmental factors. These entities, interactions, mechanisms, and phenotypic consequences can be richly represented using graph networks with semantically definable nodes and edges. To use this form of knowledge representation for inferring causal relationships, it is critical to leverage pertinent prior knowledge so as to facilitate ranking and probabilistic treatment of candidate etiologic factors. For example, genomic studies using linkage analyses detect quantitative trait loci that encompass a large number of disease candidate genes. Similarly, transcriptomic studies using differential gene expression profiling generate hundreds of potential disease candidate genes that themselves may not include genetically variant genes that are responsible for the expression pattern signature. Hypothesizing that the majority of disease-causal genes are linked to biochemical properties that are shared by other genes known to play functionally important roles and whose mutations produce clinical features similar to the disease under study, we reasoned that an integrative genomics–phenomics approach could expedite disease candidate gene identification and prioritization. To approach the problem of inferring likely causality roles, we generated Semantic Web methods-based network data structures and performed centrality analyses to rank genes according to model-driven semantic relationships. Our results indicate that Semantic Web approaches enable systematic leveraging of implicit relations hitherto embedded among large knowledge bases and can greatly facilitate identification of centrality elements that can lead to specific hypotheses and new insights.
The Semantic Web in Action (Feigenbaum et al. 2007)
Popular article describing current applications of Semantic Web standards in various sectors, with case studies in drug discovery (prioritising genes involved in cardiovascular disease) and health care (monitoring public health).
Advancing translational research with the Semantic Web (Ruttenberg et al. 2007)
Paper promoting the use of semantic web standards (RDF, OWL, SKOS, SPARQL) in translational biomedical research, and describing use cases and activities within the W3C Semantic Web HCLSIG, with a particular focus on neurodegenerative disorders e.g. Alzheimer's Disease.
Abstract: (Background) A fundamental goal of the U.S. National Institute of Health (NIH) "Roadmap" is to strengthen Translational Research, defined as the movement of discoveries in basic research to application at the clinical level. A significant barrier to translational research is the lack of uniformly structured data across related biomedical domains. The Semantic Web is an extension of the current Web that enables navigation and meaningful use of digital resources by automatic processes. It is based on common formats that support aggregation and integration of data drawn from diverse sources. A variety of technologies have been built on this foundation that, together, support identifying, representing, and reasoning across a wide range of biomedical data. The Semantic Web Health Care and Life Sciences Interest Group (HCLSIG), set up within the framework of the World Wide Web Consortium, was launched to explore the application of these technologies in a variety of areas. Subgroups focus on making biomedical data available in RDF, working with biomedical ontologies, prototyping clinical decision support systems, working on drug safety and efficacy communication, and supporting disease researchers navigating and annotating the large amount of potentially relevant literature. (Results) We present a scenario that shows the value of the information environment the Semantic Web can support for aiding neuroscience researchers. We then report on several projects by members of the HCLSIG, in the process illustrating the range of Semantic Web technologies that have applications in areas of biomedicine. (Conclusion) Semantic Web technologies present both promise and challenges. Current tools and standards are already adequate to implement components of the bench-to-bedside vision. On the other hand, these technologies are young. 
Gaps in standards and implementations still exist and adoption is limited by typical problems with early technology, such as the need for a critical mass of practitioners and installed base, and growing pains as the technology is scaled up. Still, the potential of interoperable knowledge sources for biomedicine, at the scale of the World Wide Web, merits continued work.
Web 2.0, Mashups ...
HCLS 2.0/3.0: Health care and life sciences data mashup using Web 2.0/3.0 (Cheung et al. 2008)
Paper describing prototype use of Web 2 mashup tools like Dapper, Yahoo! Pipes and GeoCommons to integrate biological data.
Abstract: We describe the potential of current Web 2.0 technologies to achieve data mashup in the health care and life sciences (HCLS) domains, and compare that potential to the nascent trend of performing semantic mashup. After providing an overview of Web 2.0, we demonstrate two scenarios of data mashup, facilitated by the following Web 2.0 tools and sites: Yahoo! Pipes, Dapper, Google Maps and GeoCommons. In the first scenario, we exploited Dapper and Yahoo! Pipes to implement a challenging data integration task in the context of DNA microarray research. In the second scenario, we exploited Yahoo! Pipes, Google Maps, and GeoCommons to create a geographic information system (GIS) interface that allows visualization and integration of diverse categories of public health data, including cancer incidence and pollution prevalence data. Based on these two scenarios, we discuss the strengths and weaknesses of these Web 2.0 mashup technologies. We then describe Semantic Web, the mainstream Web 3.0 technology that enables more powerful data integration over the Web. We discuss the areas of intersection of Web 2.0 and Semantic Web, and describe the potential benefits that can be brought to HCLS research by combining these two sets of technologies.
Post-meiotic transcription in Drosophila testes (Barreau et al. 2008)
Paper reporting discovery that several genes (including schuy) are expressed in "comet" and "cup" patterns in spermatids.
Abstract: Post-meiotic transcription was accepted to be essentially absent from Drosophila spermatogenesis. We identify 24 Drosophila genes whose mRNAs are most abundant in elongating spermatids. By single-cyst quantitative RT-PCR, we demonstrate post-meiotic transcription of these genes. We conclude that transcription stops in Drosophila late primary spermatocytes, then is reactivated by two pathways for a few loci just before histone-to-transition protein-to-protamine chromatin remodelling in spermiogenesis. These mRNAs localise to a small region at the distal elongating end of the spermatid bundles, thus they represent a new class of sub-cellularly localised mRNAs. Mutants for a post-meiotically transcribed gene (scotti), are male sterile, and show spermatid individualisation defects, indicating a function in late spermiogenesis.
Emerging technologies for gene manipulation in Drosophila melanogaster (Venken and Bellen 2005)
Abstract: The popularity of Drosophila melanogaster as a model for understanding eukaryotic biology over the past 100 years has been accompanied by the development of numerous tools for manipulating the fruitfly genome. Here we review some recent technologies that will allow Drosophila melanogaster to be manipulated more easily than any other multicellular organism. These developments include the ability to create molecularly designed deletions, improved genetic mapping technologies, strategies for creating targeted mutations, new transgenic approaches and the means to clone and modify large fragments of DNA.
- Drosophila as a model system to study human neurodegenerative diseases -- lecture (~40 mins) by Frank Hirth giving introduction to basic aspects of Drosophila biology and genetics.