This lesson is still being designed and assembled (Pre-Alpha version)

de novo Short Read Paired-End Assembly

Overview

Teaching: 10 min
Exercises: 0 min
Questions
  • How to do a de novo short read paired-end genome assembly?

Objectives
  • Explain what is a contig

  • Be able to calculate genome coverage

  • Explain genome statistics

De Novo assembly (Paired End libary)

*De Novo* assembly is the process of merging short sequencing reads into contiguous sequences (contigs).

Now that we checked and trimmed the Paired End library we are ready to assemble it.

Go to the asm_workshop folder

$ cd ~/asm_workshop

SPAdes Genome Assembler

We will use the ‘SPAdes Genome Assembler’. SPAdes

SPAdes takes as input paired-end reads, mate-pairs and single (unpaired) reads in FASTA and FASTQ format.

In a first step SPAdes will do a read error corrction and use these in the iterative short-read genome assembly.

To run SPAdes from the command line, type:

$ spades.py -h

Assemble the trimmed 600bp Paired End library with SPAdes and use as output folder ecoli_pe

Use -1 for the forward reads and -2 for the reverse reads.

$ spades.py -1 ~/asm_workshop/data/trimmed_fastq/PE_600bp_50x_1.trim.fastq.gz \
            -2 ~/asm_workshop/data/trimmed_fastq/PE_600bp_50x_2.trim.fastq.gz \
            -o ~/asm_workshop/results/ecoli_pe

SPAdes created a new directory called ecoli_pe. Give a listing of this directory.

$ ls -l results/ecoli_pe

A log file (spades.log) is created by spades, outputing all the steps and results. Inspect the result:

$ less results/ecoli_pe/spades.log

At the end of the log file (press shift-G) it shows the files that has been created. The contigs.fasta file contains the assembly. Inspect this file by using less. (Use Q to exit less)

$ less results/ecoli_pe/contigs.fasta

The first contig is called NODE_1 and has a certain length and coverage. We can count the number of contigs in this file by doing a search on a string that is in common in all contigs (NODE) and use the pipe command to pass the resuls to wc -l to count every line in the output

$ grep "NODE" results/ecoli_pe/contigs.fasta | wc -l

Now run the python script assemblyStats.py on the contigs file to get some statistics.

$ assemblyStats.py results/ecoli_pe/contigs.fasta

The statistics generated by assemblyStats.py gives information on the total number of assembled sequences, in basepair, across all contigs the total sequence.

The N50 defines the assembly quality in terms of sequence continuity and represents in basepair as the sequence length of the shortest contig at 50% of the total assembly length. While n is the number of contigs. Also refered to as L50.

Filter assembly

From the assembly statistics we find that there are 200 contigs. A lot of them smaller (56 bp) than the read length (100 bp). We therefor arbitrarily going to filter the assembly, keeping contigs > 500 bp

$ filterFasta_500bp.py -i results/ecoli_pe/contigs.fasta -o results/ecoli_pe/contigs_500bp.fasta

Inspect the filtered assembly and compare it with the unfiltered.

$ assemblyStats.py results/ecoli_pe/contigs_500bp.fasta

Have we assembled the complete genome of E. coli K12 substr. MG1655?

Compare the sum of all contigs with the genome size on NCBI: https://www.ncbi.nlm.nih.gov/genome/167?project_id=57779

How many contigs represent 50% of the assembly? What is the N50?

Open the filtered contigs file and select randomly sequence from the contigs and BLAST it on NCBI. https://blast.ncbi.nlm.nih.gov/Blast.cgi

Assembly with different kmers

Exercise

Now make different assemblies with different kmer lengths to see the effect of the kmers. One with a small kmer (like 21) and the other with a higher kmer (like 87). change the output folder name to create new output folders.

Solution

Use the -k option to set the kmer length. We will use here 21 as kmer length. And changed the output folder name to ecoli_pe_k21

$ spades.py -1 ~/asm_workshop/data/trimmed_fastq/PE_600bp_50x_1.trim.fastq.gz \
            -2 ~/asm_workshop/data/trimmed_fastq/PE_600bp_50x_2.trim.fastq.gz \
            -o ~/asm_workshop/results/ecoli_pe_k21 \
            -k 21

And for the second assembly we will use a kmer of 87.

$ spades.py -1 ~/asm_workshop/data/trimmed_fastq/PE_600bp_50x_1.trim.fastq.gz \
            -2 ~/asm_workshop/data/trimmed_fastq/PE_600bp_50x_2.trim.fastq.gz \
            -o ~/asm_workshop/results/ecoli_pe_k87 \
            -k 87

Evaluate with QUAST

To inspect the results we can use QUAST http://quast.sourceforge.net/quast to evaluate the assemblies

$ quast.py results/ecoli_pe_k21/contigs.fasta \
            results/ecoli_pe/contigs.fasta \
            results/ecoli_pe_k87/contigs.fasta \
            -o results/quast_pe

Now download the file results/quast_pe/report.html to your local computer and open it with a browser and compare the assemblies.

In a new tab (local computer) in your terminal do:

$ mkdir ~/Desktop/quast/
$ scp YOUR-NETID@student-linux.tudelft.nl:~/asm_workshop/results/quast_pe/report.html ~/Desktop/quast/

Visualise the assembly graphs with Bandage (Optional)

Bandage is a program for visualising de novo assembly graphs. By displaying connections which are not present in the contigs file.

Download Bandage from here: http://rrwick.github.io/Bandage/ No installtion is necessary - just unzip and run.

When first opening Bandage on a Mac, you may receive a warning stating that Bandage ‘can’t be opened because it is from an unidentified developer.’ Right click on the file and select ‘Open’ to override this warning.

download the assembly graph files to you local computer. In a new tab (local computer) in your terminal do:

$ mkdir ~/Desktop/bandage/
$ scp YOUR-NETID@student-linux.tudelft.nl:~/asm_workshop/results/ecoli_pe/assembly_graph.fastg \
        ~/Desktop/bandage/assembly_graph.fastg
$ scp YOUR-NETID@student-linux.tudelft.nl:~/asm_workshop/results/ecoli_pe_k21/assembly_graph.fastg \
        ~/Desktop/bandage/assembly_graph_k21.fastg
$ scp YOUR-NETID@student-linux.tudelft.nl:~/asm_workshop/results/ecoli_pe_k87/assembly_graph.fastg \
        ~/Desktop/bandage/assembly_graph_k87.fastg

start Bandage and load the file assembly_graph.fastg for all three the assemblies one by one

Click on “Draw graph” and save as image (current view)

Do the same for the k21 and k87 assemblies and compare the three assembly graphs

Assembly alignment

The assembly statistics gives us an idea on the assembly size and so on, but not on the correctness. Can we trust the contigs?

We will use nucmer from the MUMmer package to align the contigs to the reference. http://mummer.sourceforge.net/

We are lucky that there is a reference available.

Create a new folder called mummer in ~/asm_workshop/results/

$ mkdir ~/asm_workshop/results/mummer

Move to this folder

$ cd results/mummer

USAGE: nucmer [options] < reference > < Query >

Align the filtered assembly to the reference: (~/asm_workshop/reference/Ecoli_K12_reference.fasta)

$ nucmer --prefix ecoli_pe \
        ~/asm_workshop/reference/Ecoli_K12_reference.fasta \
        ~/asm_workshop/results/ecoli_pe/contigs_500bp.fasta

nucmer has aligned all contigs to the reference.

Run show-tiling on the ecoli_pe.delta file:

$ show-tiling ecoli_pe.delta

This gives us the coordinates of the “best” aligned location of the contigs

We will use mummerplot to plot the alignments:

$ mummerplot --png --layout --filter -p ecoli_pe \
        ecoli_pe.delta \
        -R ~/asm_workshop/reference/Ecoli_K12_reference.fasta \
        -Q ~/asm_workshop/results/ecoli_pe/contigs_500bp.fasta

A plot file ‘ecoli_pe.png’ has been created. Download the file to your local computer and inspect the file.

In a new tab (local computer) in your terminal do:

$ mkdir ~/Desktop/mummer/
$ scp YOUR-NETID@student-linux.tudelft.nl:~/asm_workshop/results/mummer/ecoli_pe.png ~/Desktop/mummer/

What does this tell us about the assembly?

Key Points