de novo Short Read Paired-End Assembly

Overview

Teaching: 10 min
Exercises: 0 min

Questions

How to do a de novo short read paired-end genome assembly?

Objectives

Explain what is a contig

Be able to calculate genome coverage

Explain genome statistics

De Novo assembly (Paired End libary)

*De Novo* assembly is the process of merging short sequencing reads into contiguous sequences (contigs).

Now that we checked and trimmed the Paired End library we are ready to assemble it.

Go to the asm_workshop folder

$ cd ~/asm_workshop

SPAdes Genome Assembler

We will use the ‘SPAdes Genome Assembler’. SPAdes

SPAdes takes as input paired-end reads, mate-pairs and single (unpaired) reads in FASTA and FASTQ format.

In a first step SPAdes will do a read error corrction and use these in the iterative short-read genome assembly.

To run SPAdes from the command line, type:

$ spades.py -h

Assemble the trimmed 600bp Paired End library with SPAdes and use as output folder ecoli_pe

Use -1 for the forward reads and -2 for the reverse reads.

$ spades.py -1 ~/asm_workshop/data/trimmed_fastq/PE_600bp_1.trim.fastq.gz \
            -2 ~/asm_workshop/data/trimmed_fastq/PE_600bp_2.trim.fastq.gz \
            -o ~/asm_workshop/results/spades_pe

SPAdes created a new directory called spades_pe. Give a listing of this directory.

$ ls -l results/spades_pe

A log file (spades.log) is created by spades, outputing all the steps and results. Inspect the result:

$ less results/spades_pe/spades.log

At the end of the log file (press shift-G) it shows the files that has been created. The contigs.fasta file contains the assembly. Inspect this file by using less. (Use Q to exit less)

$ less results/spades_pe/contigs.fasta

The first contig is called NODE_1 and has a certain length and coverage. We can count the number of contigs in this file by doing a search on a string that is in common in all contigs (NODE) and use the pipe command to pass the resuls to wc -l to count every line in the output

$ grep "NODE" results/spades_pe/contigs.fasta | wc -l

Evaluate with QUAST

For evaluating the SPAdes assembly we can use the Genome assembly evaluation tool, QUAST https://github.com/ablab/quast.

We are going to compare the SPAdes generated contigs and scaffolds files in relation to a publicly available complete genome sequence of Escherichia coli K-12 which we find in the reference directory.

$ quast.py results/spades_pe/contigs.fasta \
            results/spades_pe/scaffolds.fasta \
            -R reference/Ecoli_K12_reference.fasta \
            -o results/quast_pe

A summary table has been generated: `results/quast_pe/report.txt’.

$ less results/quast_pe/report.txt

Compare the generated statistics for the two input files:

Number of large contigs/scaffolds (i.e., longer than 500 bp) and total length of them.
Length of the largest contig.
N50 and L50
Genome fraction %
Number of N's per 100 kbp

Assembly alignment and visualization

We will use nucmer from the MUMmer package to align the contigs to the reference. http://mummer.sourceforge.net/

Create a new folder called mummer in ~/asm_workshop/results/

$ mkdir ~/asm_workshop/results/mummer

Move to this folder

$ cd results/mummer

USAGE: nucmer [options] < reference > < Query >

Align the assembly (contigs.fasta) to the reference: (~/asm_workshop/reference/Ecoli_K12_reference.fasta)

$ nucmer --prefix spades_pe \
        ~/asm_workshop/reference/Ecoli_K12_reference.fasta \
        ~/asm_workshop/results/spades_pe/contigs.fasta

nucmer has aligned all contigs to the reference.

Run show-coords on the spades_pe.delta file:

$ show-coords spades_pe.delta

This program parses the delta alignment output of nucmer This gives us the coordinates and displays the coordinates, and other useful information about the alignments.

Running ‘show-coords -h’ will provide the usage and explaines the output format:

$ show-tiling -h

USAGE: show-coords  [options]  <deltafile>

-b          Merges overlapping alignments regardless of match dir
            or frame and does not display any idenitity information.
-B          Switch output to btab format
-c          Include percent coverage information in the output
-d          Display the alignment direction in the additional
            FRM columns (default for promer)
-g          Deprecated option. Please use 'delta-filter' instead
-h          Display help information
-H          Do not print the output header
-I float    Set minimum percent identity to display
-k          Knockout (do not display) alignments that overlap
            another alignment in a different frame by more than 50%
            of their length, AND have a smaller percent similarity
            or are less than 75% of the size of the other alignment
            (promer only)
-l          Include the sequence length information in the output
-L long     Set minimum alignment length to display
-o          Annotate maximal alignments between two sequences, i.e.
            overlaps between reference and query sequences
-q          Sort output lines by query IDs and coordinates
-r          Sort output lines by reference IDs and coordinates
-T          Switch output to tab-delimited format

  Input is the .delta output of either the "nucmer" or the
"promer" program passed on the command line.
  Output is to stdout, and consists of a list of coordinates,
percent identity, and other useful information regarding the
alignment data contained in the .delta file used as input.
  NOTE: No sorting is done by default, therefore the alignments
will be ordered as found in the <deltafile> input.

We will use mummerplot to plot the alignments:

$ mummerplot --png --layout --filter -p spades_pe \
        spades_pe.delta \
        -R ~/asm_workshop/reference/Ecoli_K12_reference.fasta \
        -Q ~/asm_workshop/results/spades_pe/contigs.fasta

A plot file ‘spades_pe.png’ has been created. Download the file to your local computer and inspect the file.

In a new tab (local computer) in your terminal do:

$ mkdir ~/Desktop/mummer/
$ scp YOUR-NETID@vm0X-bt-edu.tnw.tudelft.nl:~/asm_workshop/results/mummer/spades_pe.png ~/Desktop/mummer/

What does this tell us about the assembly?

Visualise the assembly graphs with Bandage (Optional)

Bandage is a program for visualising de novo assembly graphs. By displaying connections which are not present in the contigs file.

Download Bandage from here: http://rrwick.github.io/Bandage/ No installtion is necessary - just unzip and run.

When first opening Bandage on a Mac, you may receive a warning stating that Bandage ‘can’t be opened because it is from an unidentified developer.’ Right click on the file and select ‘Open’ to override this warning.

download the assembly graph files to you local computer. In a new tab (local computer) in your terminal do:

$ mkdir ~/Desktop/bandage/
$ scp YOUR-NETID@vm0X-bt-edu.tnw.tudelft.nl:~/asm_workshop/results/spades_pe/assembly_graph.fastg \
        ~/Desktop/bandage/spades_pe_graph.fastg

start Bandage and load the file spades_pe_graph.fastg.

Click on “Draw graph” and save as image (current view)

Key Points

previous episode

Genome Assembly

next episode