
GENE ANNOTATION

WHAT IS GENE ANNOTATION?

Once scientists have sequenced a genome, they need a way to understand the complex sequence of nucleotides that they have uncovered. This is where genome annotation comes in. Genome annotation is the process of identifying the genes in a sequence and explaining what they do. An annotation, much like the comment that your teacher writes in the margin of your essay, is the way that scientists explain the data that they have sequenced in terms of its function [1].

Just as an editor uses a red pen to leave notes that help her understand the text, scientists annotate genome sequences so that they can understand what the specific sequences mean [2].

 

 

HOW DOES GENOME ANNOTATION WORK?

This image diagrams a chromosome and shows an example of both coding and non-coding regions that may need to be marked during annotation [3].

This diagram shows the many pieces of information that may be included in the annotation of a Heliconius erato genome [4].

There are three main objectives in the process of annotating a gene.

  1. identifying non-coding regions, such as introns, in the DNA sequence

  2. predicting the specific functions of each part of the DNA code

  3. understanding the biological processes that correlate with the specific DNA sequences.

 

Increasingly, software is used to automate gene annotation so that less of the work has to be done by hand. Automating the process also saves time and energy [4].

 

One of the major methods for automated gene annotation is a homology search: software compares a new sequence against known sequences to find homologous (similar) genes. The results of the search can then be used to mark the similarities between genes in an annotation [1].
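
As a rough illustration of the idea of searching by similarity, the sketch below scores a query sequence against a small set of known sequences and reports the closest one. It is a toy example, not how real homology search tools work: it assumes the sequences are already the same length and simply counts matching positions. The function names, gene names, and sequences are all made up for illustration.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent of positions that match between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    if not a:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / len(a)


def best_match(query: str, known: dict) -> tuple:
    """Return (name, score) of the known sequence most similar to the query."""
    scored = {name: percent_identity(query, seq) for name, seq in known.items()}
    name = max(scored, key=scored.get)
    return name, scored[name]


# Hypothetical sequences, invented only for this example.
known_genes = {
    "gene_A": "ATGGCCATTG",
    "gene_B": "ATGTTTCCCG",
}
query = "ATGGCCATAG"

name, score = best_match(query, known_genes)
print(f"{name}: {score:.0f}% identical")
```

In this made-up case the query differs from gene_A at one position out of ten, so gene_A is reported as the closest match. Real tools also handle sequences of different lengths by aligning them first, which this sketch leaves out.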

 

Some annotations also include context information for the genome, scores that rate the degree of similarity between genes, and data from experiments that help distinguish between similar genes.

 

Many gene databases find a middle ground between relying completely on automatic annotation software and annotating completely by hand. For example, Ensembl is a database that relies on both hand-curated data sources and a range of automatic software tools [2].

WHAT DOES ANNOTATED CODE LOOK LIKE?

Unannotated Code

This picture shows what the DNA data looks like straight after being sequenced. No annotations have been made, so scientists have no easy way of knowing which parts of the DNA sequence perform which function [5].

Annotated Code

In this picture, you can see where the scientists have marked specific genes and their functions for future reference. Annotations are needed so that scientists can interact with the sequence data and figure out what it means [6].
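
The difference between unannotated and annotated code can also be sketched as data. In the example below, the raw sequence is just a string of letters, while the annotation is a list of labeled regions that point back into that string. This is a minimal sketch with an invented sequence and invented labels; the `Feature` class and its fields are assumptions for illustration and do not follow any particular database format.

```python
from dataclasses import dataclass


@dataclass
class Feature:
    """One annotated region; start is 0-based and end is exclusive."""
    start: int
    end: int
    kind: str   # for example "exon" (coding) or "intron" (non-coding)
    note: str   # free-text description of the region


# Unannotated: a made-up sequence with no labels at all.
raw_sequence = "ATGGCCGTAAGTTTCAGGATTGA"

# Annotated: the same sequence plus hypothetical labeled regions.
features = [
    Feature(0, 6, "exon", "start of a hypothetical gene"),
    Feature(6, 17, "intron", "non-coding region"),
    Feature(17, 23, "exon", "end of the hypothetical gene"),
]


def coding_sequence(sequence: str, features: list) -> str:
    """Join the exon regions, skipping everything marked as non-coding."""
    exons = sorted((f for f in features if f.kind == "exon"), key=lambda f: f.start)
    return "".join(sequence[f.start:f.end] for f in exons)


for f in features:
    print(f"{f.start:>2}-{f.end:<2} {f.kind:<6} {raw_sequence[f.start:f.end]}  ({f.note})")

print("coding sequence:", coding_sequence(raw_sequence, features))
```

With the labels in place, a program (or a person) can pull out just the coding parts of the sequence, which is not possible from the raw string alone.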

WHERE IS THE INFORMATION STORED?

The annotated genetic code is stored in databases where it is easily accessible to researchers and scientists. Examples include Mouse Genome Informatics, FlyBase, WormBase, and Ensembl, as well as many other genomic databases [3].

 

Most of these databases provide this annotated information so that researchers and scientists can easily understand the function, structure, and correlations between genes and can efficiently move on to larger purposes in their experimentation.

This image depicts the finished annotation of a single gene in a database [9].

This picture shows a compilation of some of the known genome databases that store annotated code [7].

This image shows what the search engine on the FlyBase gene database looks like [8].

WHAT TECHNOLOGY IS USED IN GENE ANNOTATION?

If annotation were done completely by hand, it would take an enormous amount of time for a person to annotate each gene for its function and its similarities to other genes. For this reason, technology is essential to the annotation process. Several programs attempt to automate this time-consuming process.

GENE QUIZ

This graphic shows the workflow of gene annotating technology [10].

Gene Quiz is the first of many programs that attempt to automate the gene annotation process, saving crucial time. Gene Quiz is the only automated program that is currently available to the public.

Gene Quiz works by comparing sequences against several databases, including PDB, SWISS-PROT, PIR, PROSITE, and TrEMBL. Using an algorithm, the program analyzes similarities in the data in order to predict functions, and it clusters proteins accordingly. The results are presented in a table that provides predictions of protein function, gene names, and links to databases that contain more information.
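
The general idea of predicting a function from a database match and then grouping proteins by that prediction can be sketched as follows. This is not Gene Quiz's actual algorithm, which is not described here. The protein names, hit identifiers, scores, and the 50% cutoff are all hypothetical values chosen for illustration.

```python
from collections import defaultdict

# Hypothetical search results:
# (query protein, best database hit, percent identity, known function of the hit)
hits = [
    ("prot1", "P001", 92.0, "kinase"),
    ("prot2", "P002", 35.0, "transporter"),
    ("prot3", "P003", 88.0, "kinase"),
]

MIN_IDENTITY = 50.0  # assumed cutoff, chosen only for this example


def predict_functions(hits, min_identity=MIN_IDENTITY):
    """Copy the hit's function to the query when the match is strong enough."""
    table = []
    for query, hit_id, identity, function in hits:
        predicted = function if identity >= min_identity else "unknown"
        table.append(
            {"query": query, "hit": hit_id, "identity": identity, "function": predicted}
        )
    return table


def cluster_by_function(table):
    """Group query proteins that share the same predicted function."""
    clusters = defaultdict(list)
    for row in table:
        clusters[row["function"]].append(row["query"])
    return dict(clusters)


table = predict_functions(hits)
for row in table:
    print(f"{row['query']:<6} {row['hit']:<5} {row['identity']:>5.1f}%  {row['function']}")

print(cluster_by_function(table))
```

A weak match (here, prot2) is left as "unknown" rather than given a function, which is one simple way a tool might avoid overconfident predictions.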

Although the creators of the program claim that it should be 95% accurate, some users have reported that only 8 of 21 new functional predictions (about 38%) proved to be relevant and accurate. For this reason, technology like Gene Quiz still has a long way to go before gene annotation can be completely automated [2].

While Gene Quiz is the only program currently available to the public, other programs serve similar purposes and produce information that can be useful in genome annotation.

The above image is the logo for the Pedant annotation tool. It shows the five tasks that the technology can automatically perform [11].

This image shows what manual annotation looks like using one of the above technologies [12].

PEDANT AND ERGO

  • Pedant automatically analyzes genome sequences for protein function. It predicts cellular roles and functions by cross-referencing the sequence information with the Pfam, BLOCKS, and PROSITE databases. Protein functions are manually assigned to categories based on the Functional Catalogue (FunCat).

 

  • ERGO is a genome annotation platform in which annotation is completed by combining next-generation sequencing data and genome assembly to draft a sequence of genes. The ERGO platform is reported to be highly accurate because it structures the entire genome instead of only looking for sequence similarities to predict protein functions [2].
