COMPUTATIONAL GENOMICS
THE "HOW" OF SCIENCE
DNA DATABASES
WHAT IS A DATABASE?
Where do you store all of this annotated information once the computer has analyzed it? This is where databases come in. Arguably one of the essential components of bioinformatics and computational biology, databases provide an easy way to communicate the results of experiments to the research community. Without databases, information would not be easily available. With huge online resources such as the National Center for Biotechnology Information (NCBI), however, locating the genetic information needed to understand a disorder such as diabetes is easy and efficient. There are many types of databases, each of which contains a certain type of biological information, such as DNA databases and RNA databases. There are also databases that contain genomic information about a specific organism, such as Mouse Genome Informatics for many types of mice and WormBase for specific types of worm genomes. With the emergence of private corporations that provide bioinformatics services, like 23andMe, private databases that correspond to specific products have also been developed.
NATIONAL CENTER FOR BIOTECHNOLOGY INFORMATION (NCBI)
The NCBI is a branch of the United States National Library of Medicine, which is part of the National Institutes of Health. Founded in 1988, the NCBI is based in Bethesda, Maryland. David Lipman, who also worked on the BLAST sequence alignment program, directed work on the NCBI [1].
The NCBI is an umbrella organization for a series of databases that pertain to biotechnology and biomedicine. Major databases like GenBank and PubMed provide relevant information for researchers across the world. Every database is easily located and accessed using a universal search tool.
One of the major databases found under NCBI is GenBank. GenBank contains the sequenced DNA information produced from more than 100,000 different organisms in laboratories across the globe. The database was started in 1982 by Walter Goad. The amount of data contained in GenBank has grown exponentially in the 36 years since its inception, doubling roughly every 18 months.
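To get a feel for what doubling every 18 months means over 36 years, the arithmetic can be sketched in a few lines of Python. This is only an idealized illustration that takes the two figures in the text at face value and assumes perfectly steady doubling; the function name is made up for the example.

```python
def fold_increase(years: float, doubling_time_years: float = 1.5) -> float:
    """Return how many times larger a quantity becomes after `years`
    if it doubles once every `doubling_time_years` (assumed constant)."""
    return 2 ** (years / doubling_time_years)


# 36 years at one doubling per 18 months (1.5 years) is 24 doublings.
doublings = 36 / 1.5
growth = fold_increase(36)
print(f"{doublings:.0f} doublings -> about {growth:,.0f}-fold growth")
```

Under these assumptions, 24 doublings correspond to growth by a factor of 2 to the 24th power, or roughly 16.8 million.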
Other projects housed in NCBI besides GenBank include the Molecular Modeling Database, dbSNP (a collection of single-nucleotide polymorphisms), a map of the human genome, a taxonomy browser, and projects coordinated with the National Cancer Institute.
The NCBI contains cutting-edge software tools including BLAST for sequence similarity analysis. BLAST has become so efficient that it can perform sequence comparisons with GenBank in under 15 seconds [1].
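BLAST itself is a far more elaborate heuristic than can be shown here, but the basic idea of looking for short words shared between a query and a database sequence can be sketched as follows. This is a simplified illustration and not the BLAST implementation; the function names, the word length of 4, and the example sequences are all assumptions made for the sketch.

```python
def kmers(seq: str, k: int) -> dict[str, list[int]]:
    """Map each length-k word in seq to the positions where it starts."""
    index: dict[str, list[int]] = {}
    for i in range(len(seq) - k + 1):
        index.setdefault(seq[i:i + k], []).append(i)
    return index


def shared_seeds(query: str, subject: str, k: int = 4) -> list[tuple[int, int]]:
    """Return (query_pos, subject_pos) pairs where both sequences share a k-mer."""
    subject_index = kmers(subject, k)
    hits = []
    for q_pos in range(len(query) - k + 1):
        word = query[q_pos:q_pos + k]
        for s_pos in subject_index.get(word, []):
            hits.append((q_pos, s_pos))
    return hits


# Made-up sequences, for illustration only.
query = "ACGTTGCA"
subject = "TTACGTTGAA"
print(shared_seeds(query, subject))
```

Indexing the subject once and then looking up each query word is what keeps this kind of search fast: the cost of a lookup does not grow with the length of the subject sequence.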
This image shows the logo for the NCBI [1].
This image shows the exponential growth of NCBI in terms of number of base pairs of DNA contained inside its databases [2].
EUROPEAN BIOINFORMATICS INSTITUTE
The EMBL-EBI grew out of the first nucleotide sequence database, which was established in 1980 at the European Molecular Biology Laboratory (EMBL) in Heidelberg, Germany. The goal was to develop a central computer database of DNA sequences that could act as a supplement for sequence-based research published in journals. Obtaining and organizing sequence data soon became essential with the growing popularity of genome projects. As a result, the database grew in scale, with a large volume of electronic data submissions that needed to be organized and stored [2].
To this end, the EMBL established the European Bioinformatics Institute at the Wellcome Trust Genome Campus in Hinxton, Cambridgeshire, so that it could be close to sequencing laboratories such as the Wellcome Trust Sanger Institute. In its new home, the EMBL-EBI established two databases: EMBL-Bank for nucleotide sequences and UniProt for protein sequences.
EMBL-EBI YOUTUBE CHANNEL
BioBeat15: Professor Dame Janet Thornton, Director Emeritus, EMBL-EBI
TEDxPrague 2013 - Nick Goldman (interview)
Ewan Birney What's the next big thing in biology
The database provides extensive resources including programs such as
-
ArrayExpress- gene expression experiments archived for general use
-
BioModels Database- a collection of computational models of life science processes
-
BioStudies- an archive of datasets pertaining to a broad variety of biomolecules
-
Chemical Entities of Biological Interest- a collection of information about molecules
-
European Nucleotide Archive- a general collection of sequencing information about nucleotides
-
Ensembl project- results of a genome project that established and stored genomic information about a wide variety of vertebrates and other eukaryotic species
-
Europe PubMed Central- a selection of biomedical literature offered for free
-
Expression Atlas- a summary of specific information about the conditions under which genes are expressed
-
Gene Ontology- a structured description of gene functions and processes that is freely available to the general public
-
InterPro- database detailing protein function
-
Proteomics Identifications Database- a collection of proteomics data gathered through mass spectrometry
-
UniProt- a collection detailing protein sequence and function established through collaboration with the Swiss Institute of Bioinformatics and Protein Information Resource
Through all of the programs offered by the EMBL-EBI, this database has become an extensive resource for a variety of biotechnology and biological research. It has become a staple in the bioinformatics community because of the breadth of information available throughout its libraries [2].
DISEASE GENE DATABASE
The Disease Gene Database, or DisGeNET, includes information gathered from various other sources (some mentioned above) in order to provide a comprehensive and specific archive of disease-associated genes. The principle behind the database is to mine information about human genetic diseases from various other resources so that the information is easily accessible in one place. The integration of this information into a comprehensive gene-disease map uses the DisGeNET association-type ontology [3].
The original information is derived from pre-curated data from sources including
-
UniProt- proteins associated with diseases are obtained from this database and are classified as "Genetic Variation" in the DisGeNET association-type ontology.
-
Comparative Toxicogenomics Database- focuses on relationships between genes and diseases, with attention to the effects of chemicals found in the environment on human genetic health. This information is classified under the "Biomarker" or "Therapeutic" classes in the system
-
Orphanet- contains literature on rare diseases and orphan drugs (drugs that remain undeveloped because the disorders they aim to treat are rare)
-
GWAS Catalog- contains a selection of over 100,000 SNPs from genome-wide association studies performed in order to better understand the cause and treatment of genetic disorders
It also includes data gathered from animal models including
-
Mouse Genome Database- provides detailed genomic information about the laboratory mouse. The disease information provided in the database can provide information about associations between human and mouse diseases, which can be immensely helpful in understanding causes and effects of diseases.
-
Rat Genome Database- provides the results of a collaborative project between research institutions to curate annotated data about rat genes and genome. Scientists are then able to understand the association between rat genomic disorders and genetic causes of disease
The database also draws on literature about genetic diseases, including
-
Genetic Association Database- a collection of human genetic association studies of common complex diseases. It includes information collected from papers published in journals, including GWAS.
-
Literature-Derived Human Gene-Disease Network- uses a text-mining algorithm to collect gene-disease relations from the GeneRIF (Gene Reference Into Function) database [3].
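The idea of pulling gene-disease associations from several sources into one map can be sketched in Python. This is a minimal illustration and not DisGeNET's actual pipeline: the mapping from source to association type is a simplified assumption based on the classes named above (the Comparative Toxicogenomics Database, abbreviated CTD here, is mapped to "Biomarker" only), and the gene and disease names are invented.

```python
from collections import defaultdict

# Hypothetical, simplified mapping from source to association type.
SOURCE_TO_TYPE = {
    "UniProt": "Genetic Variation",
    "CTD": "Biomarker",
}


def integrate(records):
    """Merge (gene, disease, source) records into one map keyed by gene-disease pair."""
    merged = defaultdict(lambda: {"sources": set(), "types": set()})
    for gene, disease, source in records:
        entry = merged[(gene, disease)]
        entry["sources"].add(source)
        # Sources without a known class are marked "Unclassified" in this sketch.
        entry["types"].add(SOURCE_TO_TYPE.get(source, "Unclassified"))
    return dict(merged)


# Invented example records, for illustration only.
records = [
    ("GENE_A", "disease X", "UniProt"),
    ("GENE_A", "disease X", "CTD"),
    ("GENE_B", "disease Y", "Orphanet"),
]
for pair, entry in integrate(records).items():
    print(pair, sorted(entry["sources"]), sorted(entry["types"]))
```

Keying the map on the gene-disease pair means that the same association reported by two sources ends up as a single entry that records both sources, which is the "one place" benefit the database aims for.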
This image shows the logo for the Disease Gene Database [3].
This image shows the logo for the Disease Gene Database [4].
COMBINED DNA INDEX SYSTEM
The United States FBI uses the Combined DNA Index System (CODIS) to organize and store information about missing persons, convicted offenders, and forensic samples from crime scenes. CODIS contains information on the local, state, and national levels, allowing laboratories across the country to share, compare, and collaborate on DNA information. Each state has different laws for how the information is collected, uploaded, and analyzed within each distinct database. It is, however, federal law that the CODIS database does not contain personal information such as the name associated with the genetic profile [4].
The database is organized into different indexes that sort DNA profiles for easy storage. For criminal investigations, DNA information can fall into three categories: the forensic index (DNA information gathered from the scene of the crime), the arrestee index, and the offender index. Other indexes cover unidentified human remains, missing persons, and relatives of missing persons. Further categories exist for samples obtained in other legal contexts, such as staff indexes or multi-allelic offenders.
Identifications made using CODIS rely on short tandem repeats (STRs), short repeating sequences that are scattered throughout the human genome. The DNA analysis is performed at specific sections of the DNA strand (each known as a locus), and the number of repeats at each locus forms the DNA profile that identifies the individual. Mitochondrial DNA profiles can also be used for the missing persons index, since mitochondrial DNA can be obtained from a living maternal relative.
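Counting the repeats at a locus can be illustrated with a short Python sketch. Real forensic typing measures fragment lengths in the laboratory and does not scan raw sequence text like this; the function name and the example fragment are assumptions made for illustration.

```python
def max_tandem_repeats(sequence: str, motif: str) -> int:
    """Return the longest run of back-to-back copies of motif in sequence."""
    if not motif:
        raise ValueError("motif must not be empty")
    best = 0
    step = len(motif)
    for start in range(len(sequence)):
        run = 0
        pos = start
        # Keep stepping forward while the next chunk is another copy of the motif.
        while sequence[pos:pos + step] == motif:
            run += 1
            pos += step
        best = max(best, run)
    return best


# Invented fragment: some flanking bases around four copies of "GATA".
fragment = "TTA" + "GATA" * 4 + "CC"
print(max_tandem_repeats(fragment, "GATA"))
```

For this fragment the function reports 4, the number of consecutive copies of the motif. In a profile, a number like this is recorded for each locus.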
Once the CODIS software establishes a match between the target DNA and the DNA records, the laboratories verify the match and coordinate correspondence between the agency performing the match and the original agency running the test. If the match corresponds with a forensic investigation, the DNA match may be used as evidence concerning the suspect. The law enforcement agency may also use the verified match as documentation to authorize the collection of another sample from the offender. The laboratory overseeing the casework can then perform DNA analysis on a larger scale so that the analysis can be presented as evidence in court.
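The comparison step can be pictured as checking two profiles locus by locus. The sketch below is a simplified illustration and not the CODIS software's actual matching rules, which the text does not describe; the locus names and repeat counts are hypothetical.

```python
def profiles_match(profile_a, profile_b):
    """Compare two STR profiles locus by locus.

    Each profile maps a locus name to the pair of repeat counts observed there.
    Returns (matching_loci, compared_loci), using only loci typed in both profiles.
    """
    shared = sorted(set(profile_a) & set(profile_b))
    matching = [
        locus for locus in shared
        # Sort each pair so that (12, 14) and (14, 12) count as the same result.
        if sorted(profile_a[locus]) == sorted(profile_b[locus])
    ]
    return matching, shared


# Hypothetical locus names and repeat counts, for illustration only.
crime_scene = {"locus_1": (12, 14), "locus_2": (8, 8), "locus_3": (15, 17)}
offender = {"locus_1": (14, 12), "locus_2": (8, 8), "locus_3": (15, 16)}

matching, compared = profiles_match(crime_scene, offender)
print(f"{len(matching)} of {len(compared)} loci match: {matching}")
```

In this made-up example the two profiles agree at two of three loci, so they would not be treated as coming from the same person. A candidate match like the ones described above would still be verified by the laboratories involved.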
As of February 2017, the national database holds over 12 million offender profiles, 2.5 million arrestee profiles, and 750 thousand forensic profiles. CODIS has helped produce evidence in over 350 thousand investigations. The effectiveness of the database is one of the reasons that it has grown into the largest biological database in the world.
This image shows the logo for the Combined DNA Index System [5].
This picture shows the chromosomes that are used for DNA profiling for the FBI [6].
This picture shows a biological sample being tested for a DNA profile [7].
This diagram shows the different layers of the CODIS system [8].
MOUSE GENOME DATABASE
The Mouse Genome Database is a valuable resource that provides genetic, genomic, and biological data from research using the mouse as a model for understanding how the human body works and responds to disease. The data in the database includes a gene catalog for the mouse, protein and nucleotide sequences, gene function information, an archive of mutant alleles, and associations between mutant genotypes and phenotypes, both in mice and as they relate to humans [5].
The database also provides a genetic map of the mouse genome, a genome browser that allows the mouse genome to be easily viewed and navigated, and SNPs and related data. The data is integrated with other parts of the Mouse Genome Informatics resource, including the Mouse Tumor Biology Database, the International Mouse Strain Resource, and the Gene Expression Database.
Biologists who work for Mouse Genome Informatics update the database on a weekly basis with information gathered from recent scientific publications. This data includes sequences from GenBank, gene models from the NCBI and Ensembl databases, and information from the International Knockout Mouse Consortium. This information is combined with the following features already provided in the database:
-
Gene and DNA markers including descriptions and annotations
-
Mouse genetic relationships between the genome
-
Mouse phenotypes
-
SNPs as they relate to the mouse genome
-
Vertebrate data on homology
-
Sequencing Data for nucleotides in the mouse genome
-
Maps for both physical and genetic information
-
References that supplement all the information in the database [5].
This image shows that mice are studied for the Mouse Genome Database [9].
Many different mouse species were studied and their genomes were compared in the Mouse Genome Database [10].
This is the logo for Mouse Genome Informatics [11].