de novo Long Read Assembly
Overview
Teaching: 10 min
Exercises: 0 minQuestions
How to do a de novo long-read genome assembly?
Objectives
Long read vs short read characteristics
Long read error rate vs short read error rate
De Novo assembly (Nanopore)
Recent new single molecule technologies like PacBio and Oxford Nanopore are promising. Here we will assemble data from NCBI Bioproject https://www.ncbi.nlm.nih.gov/bioproject/PRJDB9013, which was used for Benchmarks of de novo assemblers for bacterial genomes. The data we will use is E. coli str. K12 substr. MG1655 with identifier DRR198814 (https://www.ncbi.nlm.nih.gov/sra/DRX189231[accn]) which was sequenced on a MinION from Oxford Nanopore. https://nanoporetech.com/products/minion
Quality Control
Like we have done for the short reads we are going to do quality control on the raw long read data.
FastQC
Exercise
Assess the quality of the long reads from Oxford Nanopre technology (ONT) located in:
~/asm_workshop/data/ont/DRR198814.fastq.gz. These are single reads with different read lengths. We will use a subset of the data.Use as output folder
~/asm_workshop/results/fastqc_ontCompare the error rate with those from the Illumina error rates.
Compare the read length distribution in respect to the Illumina libraries.(Hint: Use
fastqcandscpto download the createdhtmlfiles.)Solution
Create an output folder for the result files.
$ mkdir -p ~/asm_workshop/results/fastqc_ontRun fastqc on the ONT library. Use fastqc with option -t 2 (–threads 2; To speed up the process and to prevend OutOfMemoryError).
$ fastqc -t 2 ~/asm_workshop/data/ont/DRR198814.fastq.gz -o ~/asm_workshop/results/fastqc_ontIn a new tab (local computer) in your terminal do:
$ mkdir ~/Desktop/fastqc_html/ $ scp YOUR-NETID@vm0X-bt-edu.tnw.tudelft.nl:~/asm_workshop/results/fastqc_ont/DRR198814_fastqc.html \ ~/Desktop/fastqc_html/Then take a look at the html files in your browser.
Miniasm assembler
Miniasm is a very fast OLC-based de novo assembler for noisy long reads like PacBio and Oxford Nanopore data. https://github.com/lh3/miniasm
Miniasm needs an all-vs-all read self-mapping (typically by minimap2) as input.
Before we can start we first we will move to the asm_workshop folder
$ cd ~/asm_workshop
And create an output folder for the Nanopore assembly results.
$ mkdir ~/asm_workshop/results/miniasm_ont
Step 1: minimap2
The first step is to run minimap for the all-vs-all mapping.
Run minimap2 without options for command usage.
$ minimap2
For all-vs-all mapping we use the raw ont reads as target and as query. As preset setting we have to use -x ava-ont for ONT read overlap and redirect the output to a PAF file (Pairwise mApping Format), which is a text format describing the approximate mapping positions between two sets of sequences.
$ minimap2 -x ava-ont \
~/asm_workshop/data/ont/DRR198814.fastq.gz \
~/asm_workshop/data/ont/DRR198814.fastq.gz \
> ~/asm_workshop/results/miniasm_ont/ont_overlaps.paf
Step 2: miniasm
The all-vs-all alignments (ont_overlaps.paf) will be used as input for miniasm that will build a assembly graph in GFA format.
Besides the all-vs-all PAF file we also have to provide the raw ONT reads.
$ miniasm -f ~/asm_workshop/data/ont/DRR198814.fastq.gz \
~/asm_workshop/results/miniasm_ont/ont_overlaps.paf \
> ~/asm_workshop/results/miniasm_ont/ont_assembly.gfa
Step 3: Consensus sequence
Miniasm stores the assembly graph in GFA format. This is a tab-delimited text format for describing a set of sequences and their overlap. The first field of each line identifies the type of the line:
- Header lines start with H
- Seqment lines start with S
- Link lines with L
- Jump lines with J
- Path line with P
- Walk lines with W
The assembly graph stores the assembled sequences that we are interested in on the Segment line:
S Seqgment line format:
| Column 1 | Column 2 | Column 3 |
|---|---|---|
| S | identifier | Sequence |
Next step is extracting the identifier and the consensus sequence from the assembly graph and convert it to multi-fasta format:
| >identifier1 |
| sequence1 |
| >identifier2 |
| sequence2 |
We will apply the linux tool awk, which does pattern scanning and processing in text files, for extracting the identifier and sequence from the graph and print it to fasta format.
The identifier of the sequence on a segment line is stored in the second column and can be captured by AWK with the variable $2. The sequence beloning to that identifier can be found on the third column and can be captured by AWK with the variable $3.
$ awk '/^S/{print ">"$2"\n"$3}' \
~/asm_workshop/results/miniasm_ont/ont_assembly.gfa \
> ~/asm_workshop/results/miniasm_ont/ont_assembly.fasta
| /^S/ | capture all lines that start with ‘S’ |
| ”>” | print the char ‘>’ |
| $2 | print the content of variable $2, which is the identifier |
| “\n” | start a new line |
| $3 | print the content of variable $3, which is the sequence |
Assembly Error rate
Since we are using noisy reads to do a De Novo assembly we would like to know how this will workout for the assembly.
We could align the assembly to the reference sequence we have to inspect the sequence similarity.
For this we will use the tool dnadiff from the MUMmer package. https://github.com/garviz/MUMmer
on how to use dnadiff:
$ dnadiff -h
USAGE: dnadiff [options]
<reference> Set the input reference multi-FASTA filename
<query> Set the input query multi-FASTA filename
We will use as a reference: ~/asm_workshop/reference/Ecoli_K12_reference.fasta
And the ONT assembly as query: ~/asm_workshop/results/miniasm_ont/ont/assmbly.fasta
Move to the earlier created Mummer folder:
$ cd ~/asm_workshop/results/mummer
Run dnadiff with the -p option to control the output file name.
$ dnadiff -p miniasm_ont \
~/asm_workshop/reference/Ecoli_K12_reference.fasta \
~/asm_workshop/results/miniasm_ont/ont_assembly.fasta
Now open the report file that was generated by dnadiff and inspect the average sequence identity (AvgIdentity) under the 1-to-1 alignments.
$ less ~/asm_workshop/results/mummer/miniasm_ont.report
Exercise
Noisy long read assembly results in assemblies with a high error rate. into the sequence similarity of Noisy long read assembly but not for Illumina based assemblies.
Apply
dnadiffon thepaired-endSPAdes assembly we did:~/asm_workshop/results/spades_pe/contigs.fasta.Use the same
mummeroutput folder and call the output file ecoli_pe:~/asm_workshop/results/mummer/spades_peCompare the average sequence identity from the
spades_peassembly with theminiasm_ontsequence identity.Solution
run dnadiff with:
dnadiff -p spades_pe \ ~/asm_workshop/reference/Ecoli_K12_reference.fasta \ ~/asm_workshop/results/spades_pe/contigs.fastaOpen the report
spades_pe.reportand compare the average sequence identity(AvgIdentity)under the1-to-1 alignmentswith the from the ont assembly report:spades_ont.report$ less ~/asm_workshop/results/mummer/spades_pe.report
Error correcting (polishing) long read based assemblies
Contigs genereated from erroneous long read based De Novo assemblies needs to be error corrected to improve the consensus sequence. Here we will use Minipolish, which will use Racon for polishing, to error correct in three steps the miniasm assembly.
Step1: initial Racon polish with constituent reads
Miniasm’s assembled contigs are made up of pieces of long reads and therefore has a high error rate – probably around 90% or so, depending on the input reads.
The miniasm GFA file indicates specifically which reads contributed to each contig.
Therefore, the first thing Minipolish does is to run Racon on each contig independently, only using the reads which were used to create that contig. This step typically quite fast because it does not involve high read depths, and it can bring the percent identity up to the high 90s.
Step 2: full Racon polish rounds
Now that the assembly is in better shape, Minipolish does full Racon-polishing rounds – aligning the full read set to the whole assembly and getting a Racon consensus.
Step 3: contig read depth
Minipolish finishes by doing one more read-to-assembly alignment, this time not to polish but to calculate read depths. These depths are added to the GFA line for each contig (e.g. dp:f:77.179) and they will be recognised if the graph is loaded in Bandage.
Run Minipolish
To be able to run Minipolish we need to provide it with the raw reads and the assembly graph from the miniasm output and redirect this to a new now pollished assembly graph:
$ minipolish ~/asm_workshop/data/ont/DRR198814.fastq.gz \
~/asm_workshop/results/miniasm_ont/ont_assembly.gfa \
> ~/asm_workshop/results/miniasm_ont/ont_polished.gfa
Polished consensus sequence
Now we have to extract the consence sequence from the polished assembly graph and for that we have to use awk, which is a linux tool that does pattern scanning and processing in text files.
The assembly graph contains the sequence that we need but we have to convert it to fasta format. AWK is searching for lines that start with S (/^S/) and starts printing the sequence identifiers $2 with the fasta header symbol > in front of it. On the next line \n the actual sequence will be printed $3.
$ awk '/^S/{print ">"$2"\n"$3}' \
~/asm_workshop/results/miniasm_ont/ont_polished.gfa \
> ~/asm_workshop/results/miniasm_ont/ont_polished.fasta
Exercise
Apply
dnadiffon theminipolishedassembly:~/asm_workshop/results/miniasm_ont/ont_polished.fasta.Use the same
mummeroutput folder and call the output file ecoli_pe:~/asm_workshop/results/mummer/miniasm_ont_minipolishCompare the average sequence identity from the
miniasm_ont_minipolishassembly with theminiasm_ontsequence identity.Solution
run dnadiff with:
$ cd ~/asm_workshop/results/mummerdnadiff -p miniasm_ont_minipolish \ ~/asm_workshop/reference/Ecoli_K12_reference.fasta \ ~/asm_workshop/results/miniasm_ont/ont_polished.fastaOpen the report
miniasm_ont_minipolish.reportand compare the average sequence identity(AvgIdentity)under the1-to-1 alignmentswith the from the ont assembly report:spades_ont.report$ less ~/asm_workshop/results/mummer/miniasm_ont_minipolish.report
Polishing with Pilon
Now that we have seen that we need to error correct (polish) assemblies that are based on noisy long reads. Minipolish used the noisy long reads to align these to the raw assembly and based on the sequence depth it was able to correct the sequence of the assembly.
Still the error rate is to high to do for instance an annotation. The errors would result in highly fragmented genes.
An approach would be to use the Illumina reads from which we know it has high read accuracy. A polishing tool that can be used for this is Pilon.
Pilon requires as input a fasta file of the draft assembly and the aligned Illumina reads to this draft assembly in BAM format as we did during the Variant Calling pipeline.
Exercise
Align the Illumina reads from
~/asm_workshop/data/trimmed_fastq/PE_600bp_1.trim.fastq.gzand~/asm_workshop/data/trimmed_fastq/PE_600bp_2.trim.fastq.gzto the minipolished assembly:~/asm_workshop/results/miniasm_ont/ont_polished.fastaas we have done during the Variant Calling sessions.Use
bwa memto do the mapping and create anSAMfileUse
samtoolstosortand convert theSAMfile to a sortedBAMfile.Solution
First index the polished assembly:
$ bwa index ~/asm_workshop/results/miniasm_ont/ont_polished.fastaMap the paired-end Illumina reads to the indexed polished assembly:
$ bwa mem ~/asm_workshop/results/miniasm_ont/ont_polished.fasta \ ~/asm_workshop/data/trimmed_fastq/PE_600bp_1.trim.fastq.gz \ ~/asm_workshop/data/trimmed_fastq/PE_600bp_2.trim.fastq.gz \ > ~/asm_workshop/results/miniasm_ont/pe_polished.samSort and convert the SAM file
samtools sort -O bam \ -o ~/asm_workshop/results/miniasm_ont/pe_polished.sorted.bam \ ~/asm_workshop/results/miniasm_ont/pe_polished.sam
Pilon
Before we can apply pilon we first have to index the alignment file.
$ samtools index ~/asm_workshop/results/miniasm_ont/pe_polished.sorted.bam
Now we are ready to run pilon:
$ pilon -Xmx2G \
--genome ~/asm_workshop/results/miniasm_ont/ont_polished.fasta \
--frags ~/asm_workshop/results/miniasm_ont/pe_polished.sorted.bam \
--output ~/asm_workshop/results/miniasm_ont/ont_pilon_polished \
--changes --fix all
Exercise
Apply
dnadiffon thepilon polishedassembly:~/asm_workshop/results/miniasm_ont/ont_pilon_polished.fasta.Use the same
mummeroutput folder and call the output file:~/asm_workshop/results/mummer/miniasm_ont_pilonCompare the average sequence identity from the
miniasm_ont_pilonassembly with theminiasm_ontandminiasm_ont_minipolishsequence identity.Solution
run dnadiff with:
$ cd ~/asm_workshop/results/mummerdnadiff -p miniasm_ont_pilon \ ~/asm_workshop/reference/Ecoli_K12_reference.fasta \ ~/asm_workshop/results/miniasm_ont/ont_pilon_polished.fastaOpen the report
ecoli_ont_pilon.reportand compare the average sequence identity(AvgIdentity)under the1-to-1 alignmentswith the from the ont assembly report:ecoli_ont_minipolish.report$ less ~/asm_workshop/results/mummer/miniasm_ont_pilon.report
QUAST: compare ONT, PE and PE-MP assemblies
Exercise
Compare the
pilon polished ONTassembly with the illumina assembliesPEandPE-MPby usingQUAST.What are the major differences between the two assemblies?
Use
quast_pe_mp_ontas output folder.Solution
$ quast.py \ ~/asm_workshop/results/spades_pe/contigs.fasta \ ~/asm_workshop/results/spades_pe_mp/scaffolds.fasta \ ~/asm_workshop/results/miniasm_ont/ont_pilon_polished.fasta \ -R ~/asm_workshop/reference/Ecoli_K12_reference.fasta \ -o ~/asm_workshop/results/quast_pe_mp_ontOpen the file ~/asm_workshop/results/quast_pe_mp_ont/report.txt with less.
$ less ~/asm_workshop/results/quast_pe_mp_ont/report.txt
Assembly alignment
Exercise
Align the polished ONT assembly to the reference: (
~/asm_workshop/reference/Ecoli_K12_reference.fasta). Use as a prefix:miniasm_ont.Make sure you are working in the mummer folder:
~/asm_workshop/results/mummerInspect the resulting
miniasm_ont.pngplot. Has the long reads improved the assembly?Solution
Use as working directory:
~/asm_workshop/results/mummer$ cd ~/asm_workshop/results/mummerRun nucmer
$ nucmer --prefix miniasm_ont \ ~/asm_workshop/reference/Ecoli_K12_reference.fasta \ ~/asm_workshop/results/miniasm_ont/ont_polished.fastanucmer has aligned the assembly to the reference.
Use mummerplot to plot the alignments:
$ mummerplot --png --layout --filter --prefix miniasm_ont \ ~/asm_workshop/results/mummer/miniasm_ont.delta \ -R ~/asm_workshop/reference/Ecoli_K12_reference.fasta \ -Q ~/asm_workshop/results/miniasm_ont/ont_polished.fastaA plot file ‘miniasm_ont.png’ has been created. Download the file to your local computer and inspect the file.
In a new tab (local computer) in your terminal do:
$ scp YOUR-NETID@vm0X-bt-edu.tnw.tudelft.nl:~/asm_workshop/results/mummer/miniasm_ont.png ~/Desktop/mummer/
Visualise the assembly graphs with Bandage (Optional)
Exercise
Download the
polished assembly graphof theONTassembly and compare it with theassembly graphof thePEandPE-MPassemblies.Solution
In a new tab (local computer) in your terminal do:
scp YOUR-NETID@vm0X-bt-edu.tnw.tudelft.nl:~/asm_workshop/results/miniasm_ont/ont_polished.gfa \ ~/Desktop/bandage/start Bandage and load the file ont_polished.gfa.
Click on “Draw graph” and save as image (current view)
compare with the
PEPE-MPassembly graphs.
Key Points