de novo Short Read Paired-End Assembly
Overview
Teaching: 10 min
Exercises: 0 minQuestions
How to do a de novo short read paired-end genome assembly?
Objectives
Explain what is a contig
Be able to calculate genome coverage
Explain genome statistics
De Novo assembly (Paired End libary)
*De Novo*
assembly is the process of merging short sequencing reads into contiguous sequences (contigs).
Now that we checked and trimmed the Paired End library we are ready to assemble it.
Go to the asm_workshop folder
$ cd ~/asm_workshop
SPAdes Genome Assembler
We will use the ‘SPAdes Genome Assembler’. SPAdes
SPAdes takes as input paired-end reads, mate-pairs and single (unpaired) reads in FASTA and FASTQ format.
In a first step SPAdes will do a read error corrction and use these in the iterative short-read genome assembly.
To run SPAdes from the command line, type:
$ spades.py -h
Assemble the trimmed 600bp
Paired End library with SPAdes and use as output folder ecoli_pe
Use -1
for the forward reads and -2
for the reverse reads.
$ spades.py -1 ~/asm_workshop/data/trimmed_fastq/PE_600bp_50x_1.trim.fastq.gz \
-2 ~/asm_workshop/data/trimmed_fastq/PE_600bp_50x_2.trim.fastq.gz \
-o ~/asm_workshop/results/ecoli_pe
SPAdes created a new directory called ecoli_pe
. Give a listing of this directory.
$ ls -l results/ecoli_pe
A log file (spades.log) is created by spades, outputing all the steps and results. Inspect the result:
$ less results/ecoli_pe/spades.log
At the end of the log file (press shift-G) it shows the files that has been created. The contigs.fasta file contains the assembly. Inspect this file by using less. (Use Q to exit less)
$ less results/ecoli_pe/contigs.fasta
The first contig is called NODE_1
and has a certain length and coverage. We can count the number of contigs in this file by doing a search on a string that is in common in all contigs (NODE) and use the pipe command to pass the resuls to wc -l
to count every line in the output
$ grep "NODE" results/ecoli_pe/contigs.fasta | wc -l
Now run the python script assemblyStats.py
on the contigs file to get some statistics.
$ assemblyStats.py results/ecoli_pe/contigs.fasta
The statistics generated by assemblyStats.py
gives information on the total number of assembled sequences, in basepair, across all contigs the total sequence
.
The N50
defines the assembly quality in terms of sequence continuity and represents in basepair as the sequence length of the shortest contig at 50%
of the total assembly length. While n
is the number of contigs. Also refered to as L50.
Filter assembly
From the assembly statistics we find that there are 200 contigs. A lot of them smaller (56 bp) than the read length (100 bp). We therefor arbitrarily going to filter the assembly, keeping contigs > 500 bp
$ filterFasta_500bp.py -i results/ecoli_pe/contigs.fasta -o results/ecoli_pe/contigs_500bp.fasta
Inspect the filtered assembly and compare it with the unfiltered.
$ assemblyStats.py results/ecoli_pe/contigs_500bp.fasta
Have we assembled the complete genome of E. coli K12 substr. MG1655?
Compare the sum of all contigs with the genome size on NCBI: https://www.ncbi.nlm.nih.gov/genome/167?project_id=57779
How many contigs represent 50% of the assembly? What is the N50?
Open the filtered contigs file and select randomly sequence from the contigs and BLAST it on NCBI. https://blast.ncbi.nlm.nih.gov/Blast.cgi
Assembly with different kmers
Exercise
Now make different assemblies with different kmer lengths to see the effect of the kmers. One with a small kmer (like 21) and the other with a higher kmer (like 87). change the output folder name to create new output folders.
Solution
Use the
-k
option to set the kmer length. We will use here 21 as kmer length. And changed the output folder name to ecoli_pe_k21$ spades.py -1 ~/asm_workshop/data/trimmed_fastq/PE_600bp_50x_1.trim.fastq.gz \ -2 ~/asm_workshop/data/trimmed_fastq/PE_600bp_50x_2.trim.fastq.gz \ -o ~/asm_workshop/results/ecoli_pe_k21 \ -k 21
And for the second assembly we will use a kmer of 87.
$ spades.py -1 ~/asm_workshop/data/trimmed_fastq/PE_600bp_50x_1.trim.fastq.gz \ -2 ~/asm_workshop/data/trimmed_fastq/PE_600bp_50x_2.trim.fastq.gz \ -o ~/asm_workshop/results/ecoli_pe_k87 \ -k 87
Evaluate with QUAST
To inspect the results we can use QUAST http://quast.sourceforge.net/quast to evaluate the assemblies
$ quast.py results/ecoli_pe_k21/contigs.fasta \
results/ecoli_pe/contigs.fasta \
results/ecoli_pe_k87/contigs.fasta \
-o results/quast_pe
Now download the file results/quast_pe/report.html
to your local computer and open it with a browser and compare the assemblies.
In a new tab (local computer) in your terminal do:
$ mkdir ~/Desktop/quast/
$ scp YOUR-NETID@student-linux.tudelft.nl:~/asm_workshop/results/quast_pe/report.html ~/Desktop/quast/
Visualise the assembly graphs with Bandage (Optional)
Bandage is a program for visualising de novo assembly graphs. By displaying connections which are not present in the contigs file.
Download Bandage from here: http://rrwick.github.io/Bandage/ No installtion is necessary - just unzip and run.
When first opening Bandage on a Mac, you may receive a warning stating that Bandage ‘can’t be opened because it is from an unidentified developer.’ Right click on the file and select ‘Open’ to override this warning.
download the assembly graph files to you local computer. In a new tab (local computer) in your terminal do:
$ mkdir ~/Desktop/bandage/
$ scp YOUR-NETID@student-linux.tudelft.nl:~/asm_workshop/results/ecoli_pe/assembly_graph.fastg \
~/Desktop/bandage/assembly_graph.fastg
$ scp YOUR-NETID@student-linux.tudelft.nl:~/asm_workshop/results/ecoli_pe_k21/assembly_graph.fastg \
~/Desktop/bandage/assembly_graph_k21.fastg
$ scp YOUR-NETID@student-linux.tudelft.nl:~/asm_workshop/results/ecoli_pe_k87/assembly_graph.fastg \
~/Desktop/bandage/assembly_graph_k87.fastg
start Bandage and load the file assembly_graph.fastg for all three the assemblies one by one
Click on “Draw graph” and save as image (current view)
Do the same for the k21 and k87 assemblies and compare the three assembly graphs
Assembly alignment
The assembly statistics gives us an idea on the assembly size and so on, but not on the correctness. Can we trust the contigs?
We will use nucmer from the MUMmer package to align the contigs to the reference. http://mummer.sourceforge.net/
We are lucky that there is a reference available.
Create a new folder called mummer in ~/asm_workshop/results/
$ mkdir ~/asm_workshop/results/mummer
Move to this folder
$ cd results/mummer
USAGE: nucmer [options] < reference > < Query >
Align the filtered assembly to the reference: (~/asm_workshop/reference/Ecoli_K12_reference.fasta)
$ nucmer --prefix ecoli_pe \
~/asm_workshop/reference/Ecoli_K12_reference.fasta \
~/asm_workshop/results/ecoli_pe/contigs_500bp.fasta
nucmer has aligned all contigs to the reference.
Run show-tiling on the ecoli_pe.delta file:
$ show-tiling ecoli_pe.delta
This gives us the coordinates of the “best” aligned location of the contigs
We will use mummerplot to plot the alignments:
$ mummerplot --png --layout --filter -p ecoli_pe \
ecoli_pe.delta \
-R ~/asm_workshop/reference/Ecoli_K12_reference.fasta \
-Q ~/asm_workshop/results/ecoli_pe/contigs_500bp.fasta
A plot file ‘ecoli_pe.png’ has been created. Download the file to your local computer and inspect the file.
In a new tab (local computer) in your terminal do:
$ mkdir ~/Desktop/mummer/
$ scp YOUR-NETID@student-linux.tudelft.nl:~/asm_workshop/results/mummer/ecoli_pe.png ~/Desktop/mummer/
What does this tell us about the assembly?
Key Points