Genome Annotation

Overview

Teaching: 10 min
Exercises: 0 min

Questions

How do we find the genes present in a genome assembly?

Objectives

Genome annotation

Whole genome annotation is the process of identifying features of interest in a set of genomic DNA sequences, and labelling them with useful information.

Prokka is a software tool to annotate bacterial, archaeal and viral genomes quickly and produce standards-compliant output files.

Prokka: rapid prokaryotic genome annotation. https://github.com/tseemann/prokka

Check the Prokka help text before we run Prokka on our assembly.

$ prokka -h

Usage:
  prokka [options] <contigs.fasta>
General:
  --help            This help
  --version         Print version and exit
  --citation        Print citation for referencing Prokka
  --quiet           No screen output (default OFF)
  --debug           Debug mode: keep all temporary files (default OFF)
Setup:
  --listdb          List all configured databases
  --setupdb         Index all installed databases
  --cleandb         Remove all database indices
  --depends         List all software dependencies
Outputs:
  --outdir [X]      Output folder [auto] (default '')
  --force           Force overwriting existing output folder (default OFF)
  --prefix [X]      Filename output prefix [auto] (default '')
  --addgenes        Add 'gene' features for each 'CDS' feature (default OFF)
  --locustag [X]    Locus tag prefix (default 'PROKKA')
  --increment [N]   Locus tag counter increment (default '1')
  --gffver [N]      GFF version (default '3')
  --compliant       Force Genbank/ENA/DDJB compliance: --genes --mincontiglen 200 --centre XXX (default OFF)
  --centre [X]      Sequencing centre ID. (default '')
Organism details:
  --genus [X]       Genus name (default 'Genus')
  --species [X]     Species name (default 'species')
  --strain [X]      Strain name (default 'strain')
  --plasmid [X]     Plasmid name or identifier (default '')
Annotations:
  --kingdom [X]     Annotation mode: Archaea|Bacteria|Mitochondria|Viruses (default 'Bacteria')
  --gcode [N]       Genetic code / Translation table (set if --kingdom is set) (default '0')
  --prodigaltf [X]  Prodigal training file (default '')
  --gram [X]        Gram: -/neg +/pos (default '')
  --usegenus        Use genus-specific BLAST databases (needs --genus) (default OFF)
  --proteins [X]    Fasta file of trusted proteins to first annotate from (default '')
  --hmms [X]        Trusted HMM to first annotate from (default '')
  --metagenome      Improve gene predictions for highly fragmented genomes (default OFF)
  --rawproduct      Do not clean up /product annotation (default OFF)
Computation:
  --fast            Fast mode - skip CDS /product searching (default OFF)
  --cpus [N]        Number of CPUs to use [0=all] (default '8')
  --mincontiglen [N] Minimum contig size [NCBI needs 200] (default '1')
  --evalue [n.n]    Similarity e-value cut-off (default '1e-06')
  --rfam            Enable searching for ncRNAs with Infernal+Rfam (SLOW!) (default '0')
  --norrna          Don't run rRNA search (default OFF)
  --notrna          Don't run tRNA search (default OFF)
  --rnammer         Prefer RNAmmer over Barrnap for rRNA prediction (default OFF)

Now we can run Prokka with the polished assembly ont_pilon_polished.fasta as input and --outdir ~/asm_workshop/results/prokka as the output directory. We set as many of the organism options as we know, to get as complete an annotation as possible.

$ prokka --outdir ~/asm_workshop/results/prokka \
         --prefix Ecoli_K12 \
         --addgenes \
         --genus Escherichia  \
         --species coli \
         --strain K12 \
         --kingdom Bacteria \
         --usegenus \
         ~/asm_workshop/results/miniasm_ont/ont_pilon_polished.fasta

Prokka writes many files to ~/asm_workshop/results/prokka. Each file type is described below.

Output Files

Extension	Description
.gff	This is the master annotation in GFF3 format, containing both sequences and annotations. It can be viewed directly in Artemis or IGV.
.gbk	This is a standard Genbank file derived from the master .gff. If the input to prokka was a multi-FASTA, then this will be a multi-Genbank, with one record for each sequence.
.fna	Nucleotide FASTA file of the input contig sequences.
.faa	Protein FASTA file of the translated CDS sequences.
.ffn	Nucleotide FASTA file of all the predicted transcripts (CDS, rRNA, tRNA, tmRNA, misc_RNA).
.sqn	An ASN1 format “Sequin” file for submission to Genbank. It needs to be edited to set the correct taxonomy, authors, related publication etc.
.fsa	Nucleotide FASTA file of the input contig sequences, used by “tbl2asn” to create the .sqn file. It is mostly the same as the .fna file, but with extra Sequin tags in the sequence description lines.
.tbl	Feature Table file, used by “tbl2asn” to create the .sqn file.
.err	Unacceptable annotations - the NCBI discrepancy report.
.log	Contains all the output that Prokka produced during its run. This is a record of what settings you used, even if the --quiet option was enabled.
.txt	Statistics relating to the annotated features found.
.tsv	Tab-separated file of all features: locus_tag,ftype,len_bp,gene,EC_number,COG,product

First, check the .txt file, which summarizes the features that were found.

$ less ~/asm_workshop/results/prokka/Ecoli_K12.txt
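The .tsv file lends itself to a similar summary from the command line. The following is a minimal sketch, assuming the column layout listed in the table above (feature type in column 2) and using a small made-up table in place of real Prokka output. The function name count_ftypes and the DEMO_ locus tags are hypothetical.

```shell
# Count how often each feature type (column 2, "ftype") occurs in a Prokka-style .tsv.
count_ftypes() {
    # NR > 1 skips the header line; the counts are printed as "type<TAB>count".
    awk -F'\t' 'NR > 1 && $2 != "" { n[$2]++ } END { for (t in n) print t "\t" n[t] }' "$1" | LC_ALL=C sort
}

# Made-up miniature table (NOT real Prokka output), seven tab-separated columns per row.
tsv=$(mktemp)
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
    locus_tag  ftype len_bp gene EC_number COG product \
    DEMO_00001 CDS   930    ""   ""        ""  "hypothetical protein" \
    DEMO_00002 CDS   1200   ""   ""        ""  "hypothetical protein" \
    DEMO_00003 tRNA  76     ""   ""        ""  "tRNA-Ala" \
    DEMO_00004 rRNA  1542   ""   ""        ""  "16S ribosomal RNA" > "$tsv"

count_ftypes "$tsv"
```

On the workshop data, the function would be called as count_ftypes ~/asm_workshop/results/prokka/Ecoli_K12.tsv.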

Now we can compare the number of genes found with our annotated reference, ~/asm_workshop/reference/Ecoli_K12_reference.gff.

We will use awk to select a feature type and wc -l to count the matching lines.

First we will count the number of annotated genes in the reference. In a GFF file the third column stores the feature type. We are searching for the feature type 'gene'.

$ cat ~/asm_workshop/reference/Ecoli_K12_reference.gff | awk '$3=="gene"' | wc -l

Compare the number of genes in the reference with the number found in our annotated assembly. Can you explain the difference?
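The count for our own annotation can be obtained the same way from the Prokka .gff file. Below is a minimal sketch, using a made-up GFF in place of the real output. Because the Prokka .gff holds both annotations and sequences, the sketch stops reading at the ##FASTA line that separates the two. The function name count_feature, the source label "demo" and the DEMO_ identifiers are hypothetical.

```shell
# Count features of one type (GFF column 3) in a GFF3 file.
# $1 = feature type, $2 = GFF file
count_feature() {
    # Stop at the ##FASTA directive so sequence lines are never inspected;
    # "n + 0" prints 0 when no feature of that type was seen.
    awk -F'\t' -v type="$1" '/^##FASTA/ { exit } $3 == type { n++ } END { print n + 0 }' "$2"
}

# Made-up miniature GFF3 (NOT real Prokka output) with an embedded sequence section.
gff=$(mktemp)
{
    printf '##gff-version 3\n'
    printf 'contig1\tdemo\tgene\t1\t900\t.\t+\t.\tID=DEMO_00001_gene\n'
    printf 'contig1\tdemo\tCDS\t1\t900\t.\t+\t0\tID=DEMO_00001\n'
    printf 'contig1\tdemo\tgene\t1000\t1075\t.\t+\t.\tID=DEMO_00002_gene\n'
    printf 'contig1\tdemo\ttRNA\t1000\t1075\t.\t+\t.\tID=DEMO_00002\n'
    printf '##FASTA\n>contig1\nACGT\n'
} > "$gff"

count_feature gene "$gff"
```

On the workshop data, this would be count_feature gene ~/asm_workshop/results/prokka/Ecoli_K12.gff. The 'gene' features are present in that file because we ran Prokka with --addgenes.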

Now we can do the same for the rRNAs and tRNAs.

$ cat ~/asm_workshop/reference/Ecoli_K12_reference.gff | awk '$3=="rRNA"' | wc -l

$ cat ~/asm_workshop/reference/Ecoli_K12_reference.gff | awk '$3=="tRNA"' | wc -l
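Rather than running one command per feature type, all types can be tallied in a single pass and the two annotations placed side by side. The following is a sketch under stated assumptions: the two GFF files are made-up stand-ins for the reference and the Prokka output, and the names tally_features and row are hypothetical helpers.

```shell
# Tally every feature type (GFF column 3) in one pass over a GFF3 file.
tally_features() {
    awk -F'\t' '
        /^##FASTA/ { exit }      # stop before any embedded sequence section
        /^#/       { next }      # skip comment and directive lines
        NF >= 9    { n[$3]++ }   # count complete feature lines by type
        END        { for (t in n) print t "\t" n[t] }
    ' "$1" | LC_ALL=C sort
}

# Helper that writes one made-up feature line: row TYPE START END ID
row() { printf 'contig1\tdemo\t%s\t%s\t%s\t.\t+\t.\tID=%s\n' "$1" "$2" "$3" "$4"; }

workdir=$(mktemp -d)
{   # stand-in for the reference annotation
    printf '##gff-version 3\n'
    row gene 1 900 g1;     row CDS 1 900 c1
    row gene 1000 1900 g2; row CDS 1000 1900 c2
    row gene 2000 3541 g3; row rRNA 2000 3541 r1
} > "$workdir/ref.gff"
{   # stand-in for the Prokka annotation, including a sequence section
    printf '##gff-version 3\n'
    row gene 1 900 g1;     row CDS 1 900 c1
    row gene 1000 1900 g2; row CDS 1000 1900 c2
    row tRNA 2000 2075 t1
    printf '##FASTA\n>contig1\nACGT\n'
} > "$workdir/asm.gff"

tally_features "$workdir/ref.gff" > "$workdir/ref.tally"
tally_features "$workdir/asm.gff" > "$workdir/asm.tally"

# Join the two tallies on feature type; types missing from one file get a 0.
tab=$(printf '\t')
comparison=$(LC_ALL=C join -t "$tab" -a 1 -a 2 -e 0 -o 0,1.2,2.2 \
    "$workdir/ref.tally" "$workdir/asm.tally")
printf 'type\treference\tassembly\n%s\n' "$comparison"
```

With the workshop files, tally_features would be pointed at ~/asm_workshop/reference/Ecoli_K12_reference.gff and ~/asm_workshop/results/prokka/Ecoli_K12.gff instead of the stand-ins.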

Key Points
