FlyBase:Annotation Evidence Score

From FlyBase Wiki

What does the annotation evidence score mean?

The current implementation of the evidence scoring system assesses three different classes of evidence used to inform transcript annotations:

  1. gene prediction algorithms,
  2. aligned nucleotide sequences, and
  3. overlapping regions of protein similarity.

Note that, in the future, we plan to refine this scoring metric to include support based on RNAseq data and potentially other classes of supporting evidence.

Each transcript's score is the sum of the points earned in the following categories:

1 point if one or more aligned EST sequences are fully consistent with the annotated transcript.
2 points if an annotated exon intersects a region of aligned protein similarity (note that similarity to self is excluded).
4 points if there is any gene prediction that is fully consistent with the annotated transcript.
8 points if one or more aligned cDNAs are fully consistent with the annotated transcript.

Because the point values are distinct powers of two, each possible combination of supporting evidence types yields a unique score, so the score alone tells you easily and unambiguously which types of evidence support a particular transcript annotation.
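
The scoring arithmetic can be sketched in a few lines of Python. This is only an illustration of the sum described above, not FlyBase's own code; the names `POINTS`, `evidence_score` and `decode_score` are hypothetical.

```python
# Point values taken from the list above; the dictionary keys are
# illustrative labels, not official FlyBase identifiers.
POINTS = {
    "EST": 1,
    "protein_similarity": 2,
    "gene_prediction": 4,
    "cDNA": 8,
}

def evidence_score(supported):
    """Sum the points for each supporting evidence type (duplicates ignored)."""
    return sum(POINTS[name] for name in set(supported))

def decode_score(score):
    """Recover the evidence types from a score by testing each point value as a bit."""
    return [name for name, pts in POINTS.items() if score & pts]

print(evidence_score(["EST", "gene_prediction"]))
print(decode_score(13))
```

Since every point value occupies its own bit, `decode_score` inverts `evidence_score` for any score from 0 to 15.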

For example, to identify all transcripts with cDNA support, one would look for all transcripts with a score greater than or equal to 8. If instead you wanted to identify transcripts with no aligned nucleotide support, you would search for transcripts with scores of 0, 2, 4 or 6. And to identify those transcripts with both supporting ESTs and gene prediction support, but without a full-length cDNA or protein similarity, you would search for transcripts with a score equal to 5.
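
The three searches in this example could be written as simple filters. The sketch below assumes the scores are already available as a mapping from transcript ID to score; the IDs and scores shown are made up for illustration, and the function names are hypothetical.

```python
# Hypothetical transcript IDs and scores, for illustration only.
scores = {"tx-A": 13, "tx-B": 6, "tx-C": 5, "tx-D": 0, "tx-E": 8}

def with_cdna(scores):
    """Transcripts with cDNA support: score of 8 or more."""
    return sorted(t for t, s in scores.items() if s >= 8)

def without_nucleotide_support(scores):
    """Transcripts with neither EST nor cDNA support: score of 0, 2, 4 or 6."""
    return sorted(t for t, s in scores.items() if s in (0, 2, 4, 6))

def est_and_prediction_only(scores):
    """Transcripts supported by ESTs and a gene prediction and nothing else."""
    return sorted(t for t, s in scores.items() if s == 5)

print(with_cdna(scores))
print(without_nucleotide_support(scores))
print(est_and_prediction_only(scores))
```

The first two filters are equivalent to bit tests on the score: `s >= 8` is the same as `s & 8 != 0`, and membership in 0, 2, 4, 6 is the same as `s & 9 == 0`.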

The Annotation Evidence Rank is simply a text tag binning scores into three groups. Highly Supported indicates a score of 9 or higher, Moderately Supported indicates a score between 5 and 8, inclusive, and Weakly Supported indicates a score of 4 or less.
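
As a minimal sketch, the binning can be expressed as a single function. The thresholds come from the paragraph above; the function name `evidence_rank` is hypothetical.

```python
def evidence_rank(score):
    """Map an annotation evidence score (0-15) to its rank label."""
    if score >= 9:
        return "Highly Supported"
    if score >= 5:
        return "Moderately Supported"
    return "Weakly Supported"

for s in (4, 5, 8, 9):
    print(s, evidence_rank(s))
```

Note that a transcript supported by a cDNA alone (score 8) falls in the Moderately Supported bin; at least one additional type of evidence is needed to reach Highly Supported.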

Support means different things for different classes of evidence.

For gene prediction support, the ends of the predicted gene model must either match or lie within the annotated CDS of a transcript, and the internal predicted exon/intron junctions must match the annotated junctions along the entire length of the prediction.

The rules are the same for EST and cDNA alignments except that the assessment is based on the entire annotated transcript and not just the coding region.
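
One possible reading of these consistency rules is sketched below. It is an interpretation, not FlyBase's actual implementation: exons are assumed to be `(start, end)` tuples with inclusive coordinates, sorted in ascending order, on the same strand as the annotation, and the names `junctions` and `is_consistent` are hypothetical. For a gene prediction, `annotated_exons` would hold the CDS exons; for an EST or cDNA alignment, it would hold the exons of the entire transcript.

```python
def junctions(exons):
    """Return the intron boundaries as (last base of exon, first base of next exon)."""
    return [(exons[i][1], exons[i + 1][0]) for i in range(len(exons) - 1)]

def is_consistent(evidence_exons, annotated_exons):
    """Check one interpretation of 'fully consistent' evidence.

    The evidence ends must match or lie within the annotated region, and
    every junction along the length of the evidence must match the
    annotated junctions over that same stretch.
    """
    ev_start, ev_end = evidence_exons[0][0], evidence_exons[-1][1]
    ann_start, ann_end = annotated_exons[0][0], annotated_exons[-1][1]
    if ev_start < ann_start or ev_end > ann_end:
        return False
    # Annotated introns that overlap the span covered by the evidence.
    ann_in_span = [
        (donor, acceptor)
        for donor, acceptor in junctions(annotated_exons)
        if donor < ev_end and acceptor > ev_start
    ]
    return junctions(evidence_exons) == ann_in_span

annotated = [(100, 200), (300, 400), (500, 600)]
print(is_consistent([(150, 200), (300, 350)], annotated))
print(is_consistent([(150, 200), (310, 350)], annotated))
```

Under this reading, evidence that runs into an annotated intron is rejected, because the overlapped intron has no matching junction in the evidence.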

For protein similarity, a positive score simply requires a region of aligned protein sequence that overlaps any annotated CDS exon of the transcript on the same strand. This simplistic assessment likely produces a fair number of false positives, and we hope to refine this aspect of the assessment to provide more meaningful confidence values.
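
The overlap test described here might look like the following sketch. It assumes inclusive `(start, end)` coordinates, strands given as `"+"` or `"-"`, and a list of protein alignment hits from which self-similarity has already been excluded; the function name and data layout are hypothetical.

```python
def has_protein_support(cds_exons, strand, hits):
    """Return True if any hit overlaps any CDS exon on the same strand.

    cds_exons: list of (start, end) tuples, inclusive coordinates.
    hits: list of (start, end, strand) tuples for aligned protein regions,
          assumed to exclude similarity to the gene's own protein.
    """
    return any(
        hit_strand == strand and hit_start <= exon_end and hit_end >= exon_start
        for hit_start, hit_end, hit_strand in hits
        for exon_start, exon_end in cds_exons
    )

cds = [(100, 200), (300, 400)]
print(has_protein_support(cds, "+", [(190, 250, "+")]))
print(has_protein_support(cds, "+", [(190, 250, "-")]))
```

A single overlapping base is enough to pass this test, which illustrates why the assessment is prone to false positives.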