de novo Long Read Assembly

Overview

Teaching: 10 min
Exercises: 0 min

Questions

How to do a de novo long-read genome assembly?

Objectives

Long read vs short read characteristics

Long read error rate vs short read error rate

De Novo assembly (Nanopore)

Recent new single molecule technologies like PacBio and Oxford Nanopore are promising. Here we will assemble data from NCBI Bioproject https://www.ncbi.nlm.nih.gov/bioproject/PRJDB9013, which was used for Benchmarks of de novo assemblers for bacterial genomes. The data we will use is E. coli str. K12 substr. MG1655 with identifier DRR198814 (https://www.ncbi.nlm.nih.gov/sra/DRX189231[accn]) which was sequenced on a MinION from Oxford Nanopore. https://nanoporetech.com/products/minion

Quality Control

Like we have done for the short reads we are going to do quality control on the raw long read data.

FastQC

Exercise

Assess the quality of the long reads from Oxford Nanopre technology (ONT) located in: ~/asm_workshop/data/ont/DRR198814.fastq.gz. These are single reads with different read lengths. We will use a subset of the data.

Use as output folder ~/asm_workshop/results/fastqc_ont

Compare the error rate with those from the Illumina error rates.
Compare the read length distribution in respect to the Illumina libraries.

(Hint: Use fastqc and scp to download the created html files.)
Solution

Create an output folder for the result files.
$ mkdir -p ~/asm_workshop/results/fastqc_ont
Run fastqc on the ONT library. Use fastqc with option -t 2 (–threads 2; To speed up the process and to prevend OutOfMemoryError).
$ fastqc -t 2 ~/asm_workshop/data/ont/DRR198814.fastq.gz -o ~/asm_workshop/results/fastqc_ont
In a new tab (local computer) in your terminal do:
$ mkdir ~/Desktop/fastqc_html/
$ scp YOUR-NETID@vm0X-bt-edu.tnw.tudelft.nl:~/asm_workshop/results/fastqc_ont/DRR198814_fastqc.html \
       ~/Desktop/fastqc_html/
Then take a look at the html files in your browser.

Miniasm assembler

Miniasm is a very fast OLC-based de novo assembler for noisy long reads like PacBio and Oxford Nanopore data. https://github.com/lh3/miniasm

Miniasm needs an all-vs-all read self-mapping (typically by minimap2) as input.

Before we can start we first we will move to the asm_workshop folder

$ cd ~/asm_workshop

And create an output folder for the Nanopore assembly results.

$ mkdir ~/asm_workshop/results/miniasm_ont

Step 1: minimap2

The first step is to run minimap for the all-vs-all mapping.

Run minimap2 without options for command usage.

$  minimap2

For all-vs-all mapping we use the raw ont reads as target and as query. As preset setting we have to use -x ava-ont for ONT read overlap and redirect the output to a PAF file (Pairwise mApping Format), which is a text format describing the approximate mapping positions between two sets of sequences.

$  minimap2 -x ava-ont \
            ~/asm_workshop/data/ont/DRR198814.fastq.gz \
            ~/asm_workshop/data/ont/DRR198814.fastq.gz \
            > ~/asm_workshop/results/miniasm_ont/ont_overlaps.paf

Step 2: miniasm

The all-vs-all alignments (ont_overlaps.paf) will be used as input for miniasm that will build a assembly graph in GFA format.

Besides the all-vs-all PAF file we also have to provide the raw ONT reads.

$ miniasm -f ~/asm_workshop/data/ont/DRR198814.fastq.gz \
            ~/asm_workshop/results/miniasm_ont/ont_overlaps.paf \
            > ~/asm_workshop/results/miniasm_ont/ont_assembly.gfa

Step 3: Consensus sequence

Miniasm stores the assembly graph in GFA format. This is a tab-delimited text format for describing a set of sequences and their overlap. The first field of each line identifies the type of the line:

Header lines start with H
Seqment lines start with S
Link lines with L
Jump lines with J
Path line with P
Walk lines with W

The assembly graph stores the assembled sequences that we are interested in on the Segment line:

S Seqgment line format:

Column 1	Column 2	Column 3
S	identifier	Sequence

Next step is extracting the identifier and the consensus sequence from the assembly graph and convert it to multi-fasta format:

>identifier1

sequence1

>identifier2

sequence2

We will apply the linux tool awk, which does pattern scanning and processing in text files, for extracting the identifier and sequence from the graph and print it to fasta format.

The identifier of the sequence on a segment line is stored in the second column and can be captured by AWK with the variable $2. The sequence beloning to that identifier can be found on the third column and can be captured by AWK with the variable $3.

$ awk '/^S/{print ">"$2"\n"$3}' \
        ~/asm_workshop/results/miniasm_ont/ont_assembly.gfa \
        > ~/asm_workshop/results/miniasm_ont/ont_assembly.fasta

/^S/	capture all lines that start with ‘S’
”>”	print the char ‘>’
$2	print the content of variable $2, which is the identifier
“\n”	start a new line
$3	print the content of variable $3, which is the sequence

Assembly Error rate

Since we are using noisy reads to do a De Novo assembly we would like to know how this will workout for the assembly.

We could align the assembly to the reference sequence we have to inspect the sequence similarity.

For this we will use the tool dnadiff from the MUMmer package. https://github.com/garviz/MUMmer

on how to use dnadiff:

$ dnadiff -h

USAGE: dnadiff [options]

    <reference>       Set the input reference multi-FASTA filename
    <query>           Set the input query multi-FASTA filename

We will use as a reference: ~/asm_workshop/reference/Ecoli_K12_reference.fasta

And the ONT assembly as query: ~/asm_workshop/results/miniasm_ont/ont/assmbly.fasta

Move to the earlier created Mummer folder:

$ cd ~/asm_workshop/results/mummer

Run dnadiff with the -p option to control the output file name.

$ dnadiff -p miniasm_ont \
            ~/asm_workshop/reference/Ecoli_K12_reference.fasta \
            ~/asm_workshop/results/miniasm_ont/ont_assembly.fasta

Now open the report file that was generated by dnadiff and inspect the average sequence identity (AvgIdentity) under the 1-to-1 alignments.

$ less ~/asm_workshop/results/mummer/miniasm_ont.report

Exercise

Noisy long read assembly results in assemblies with a high error rate. into the sequence similarity of Noisy long read assembly but not for Illumina based assemblies.

Apply dnadiff on the paired-end SPAdes assembly we did: ~/asm_workshop/results/spades_pe/contigs.fasta .

Use the same mummer output folder and call the output file ecoli_pe: ~/asm_workshop/results/mummer/spades_pe

Compare the average sequence identity from the spades_pe assembly with the miniasm_ont sequence identity.
Solution

run dnadiff with:
dnadiff -p spades_pe \
           ~/asm_workshop/reference/Ecoli_K12_reference.fasta \
           ~/asm_workshop/results/spades_pe/contigs.fasta
Open the report spades_pe.reportand compare the average sequence identity (AvgIdentity) under the 1-to-1 alignments with the from the ont assembly report: spades_ont.report
$ less ~/asm_workshop/results/mummer/spades_pe.report

Error correcting (polishing) long read based assemblies

Contigs genereated from erroneous long read based De Novo assemblies needs to be error corrected to improve the consensus sequence. Here we will use Minipolish, which will use Racon for polishing, to error correct in three steps the miniasm assembly.

Step1: initial Racon polish with constituent reads

Miniasm’s assembled contigs are made up of pieces of long reads and therefore has a high error rate – probably around 90% or so, depending on the input reads.

The miniasm GFA file indicates specifically which reads contributed to each contig.

Therefore, the first thing Minipolish does is to run Racon on each contig independently, only using the reads which were used to create that contig. This step typically quite fast because it does not involve high read depths, and it can bring the percent identity up to the high 90s.

Step 2: full Racon polish rounds

Now that the assembly is in better shape, Minipolish does full Racon-polishing rounds – aligning the full read set to the whole assembly and getting a Racon consensus.

Step 3: contig read depth

Minipolish finishes by doing one more read-to-assembly alignment, this time not to polish but to calculate read depths. These depths are added to the GFA line for each contig (e.g. dp:f:77.179) and they will be recognised if the graph is loaded in Bandage.

Run Minipolish

To be able to run Minipolish we need to provide it with the raw reads and the assembly graph from the miniasm output and redirect this to a new now pollished assembly graph:

$ minipolish ~/asm_workshop/data/ont/DRR198814.fastq.gz \
            ~/asm_workshop/results/miniasm_ont/ont_assembly.gfa \
            > ~/asm_workshop/results/miniasm_ont/ont_polished.gfa

Polished consensus sequence

Now we have to extract the consence sequence from the polished assembly graph and for that we have to use awk, which is a linux tool that does pattern scanning and processing in text files.

The assembly graph contains the sequence that we need but we have to convert it to fasta format. AWK is searching for lines that start with S (/^S/) and starts printing the sequence identifiers $2 with the fasta header symbol > in front of it. On the next line \n the actual sequence will be printed $3.

$ awk '/^S/{print ">"$2"\n"$3}' \
        ~/asm_workshop/results/miniasm_ont/ont_polished.gfa \
        > ~/asm_workshop/results/miniasm_ont/ont_polished.fasta

Exercise

Apply dnadiff on the minipolished assembly: ~/asm_workshop/results/miniasm_ont/ont_polished.fasta .

Use the same mummer output folder and call the output file ecoli_pe: ~/asm_workshop/results/mummer/miniasm_ont_minipolish

Compare the average sequence identity from the miniasm_ont_minipolish assembly with the miniasm_ont sequence identity.
Solution

run dnadiff with:
$ cd ~/asm_workshop/results/mummer
dnadiff -p miniasm_ont_minipolish \
           ~/asm_workshop/reference/Ecoli_K12_reference.fasta \
           ~/asm_workshop/results/miniasm_ont/ont_polished.fasta
Open the report miniasm_ont_minipolish.reportand compare the average sequence identity (AvgIdentity) under the 1-to-1 alignments with the from the ont assembly report: spades_ont.report
$ less ~/asm_workshop/results/mummer/miniasm_ont_minipolish.report

Polishing with Pilon

Now that we have seen that we need to error correct (polish) assemblies that are based on noisy long reads. Minipolish used the noisy long reads to align these to the raw assembly and based on the sequence depth it was able to correct the sequence of the assembly.

Still the error rate is to high to do for instance an annotation. The errors would result in highly fragmented genes.

An approach would be to use the Illumina reads from which we know it has high read accuracy. A polishing tool that can be used for this is Pilon.

Pilon requires as input a fasta file of the draft assembly and the aligned Illumina reads to this draft assembly in BAM format as we did during the Variant Calling pipeline.

Exercise

Align the Illumina reads from ~/asm_workshop/data/trimmed_fastq/PE_600bp_1.trim.fastq.gz and ~/asm_workshop/data/trimmed_fastq/PE_600bp_2.trim.fastq.gz to the minipolished assembly: ~/asm_workshop/results/miniasm_ont/ont_polished.fasta as we have done during the Variant Calling sessions.

Use bwa mem to do the mapping and create an SAM file

Use samtools to sort and convert the SAM file to a sorted BAM file.
Solution

First index the polished assembly:
$ bwa index ~/asm_workshop/results/miniasm_ont/ont_polished.fasta
Map the paired-end Illumina reads to the indexed polished assembly:
$ bwa mem ~/asm_workshop/results/miniasm_ont/ont_polished.fasta \
         ~/asm_workshop/data/trimmed_fastq/PE_600bp_1.trim.fastq.gz \
         ~/asm_workshop/data/trimmed_fastq/PE_600bp_2.trim.fastq.gz \
         > ~/asm_workshop/results/miniasm_ont/pe_polished.sam
Sort and convert the SAM file
 samtools sort -O bam \
           -o ~/asm_workshop/results/miniasm_ont/pe_polished.sorted.bam \
           ~/asm_workshop/results/miniasm_ont/pe_polished.sam

Pilon

Before we can apply pilon we first have to index the alignment file.

$ samtools index ~/asm_workshop/results/miniasm_ont/pe_polished.sorted.bam

Now we are ready to run pilon:

$ pilon -Xmx2G  \
        --genome ~/asm_workshop/results/miniasm_ont/ont_polished.fasta \
        --frags ~/asm_workshop/results/miniasm_ont/pe_polished.sorted.bam \
        --output ~/asm_workshop/results/miniasm_ont/ont_pilon_polished \
        --changes --fix all

Exercise

Apply dnadiff on the pilon polished assembly: ~/asm_workshop/results/miniasm_ont/ont_pilon_polished.fasta .

Use the same mummer output folder and call the output file: ~/asm_workshop/results/mummer/miniasm_ont_pilon

Compare the average sequence identity from the miniasm_ont_pilon assembly with the miniasm_ont and miniasm_ont_minipolish sequence identity.
Solution

run dnadiff with:
$ cd ~/asm_workshop/results/mummer
dnadiff -p miniasm_ont_pilon \
           ~/asm_workshop/reference/Ecoli_K12_reference.fasta \
           ~/asm_workshop/results/miniasm_ont/ont_pilon_polished.fasta
Open the report ecoli_ont_pilon.reportand compare the average sequence identity (AvgIdentity) under the 1-to-1 alignments with the from the ont assembly report: ecoli_ont_minipolish.report
$ less ~/asm_workshop/results/mummer/miniasm_ont_pilon.report

QUAST: compare ONT, PE and PE-MP assemblies

Exercise

Compare the pilon polished ONT assembly with the illumina assemblies PE and PE-MP by using QUAST.

What are the major differences between the two assemblies?

Use quast_pe_mp_ont as output folder.
Solution
$ quast.py \
     ~/asm_workshop/results/spades_pe/contigs.fasta \
     ~/asm_workshop/results/spades_pe_mp/scaffolds.fasta \
     ~/asm_workshop/results/miniasm_ont/ont_pilon_polished.fasta \
     -R ~/asm_workshop/reference/Ecoli_K12_reference.fasta \
     -o ~/asm_workshop/results/quast_pe_mp_ont
Open the file ~/asm_workshop/results/quast_pe_mp_ont/report.txt with less.
$ less ~/asm_workshop/results/quast_pe_mp_ont/report.txt 

Assembly alignment

Exercise

Align the polished ONT assembly to the reference: (~/asm_workshop/reference/Ecoli_K12_reference.fasta). Use as a prefix: miniasm_ont.

Make sure you are working in the mummer folder: ~/asm_workshop/results/mummer

Inspect the resulting miniasm_ont.png plot. Has the long reads improved the assembly?
Solution

Use as working directory: ~/asm_workshop/results/mummer
$ cd ~/asm_workshop/results/mummer
Run nucmer
$ nucmer --prefix miniasm_ont \
         ~/asm_workshop/reference/Ecoli_K12_reference.fasta \
         ~/asm_workshop/results/miniasm_ont/ont_polished.fasta
nucmer has aligned the assembly to the reference.

Use mummerplot to plot the alignments:
$ mummerplot --png --layout --filter --prefix miniasm_ont \
         ~/asm_workshop/results/mummer/miniasm_ont.delta \
         -R ~/asm_workshop/reference/Ecoli_K12_reference.fasta \
         -Q ~/asm_workshop/results/miniasm_ont/ont_polished.fasta
A plot file ‘miniasm_ont.png’ has been created. Download the file to your local computer and inspect the file.

In a new tab (local computer) in your terminal do:
$ scp YOUR-NETID@vm0X-bt-edu.tnw.tudelft.nl:~/asm_workshop/results/mummer/miniasm_ont.png ~/Desktop/mummer/

Visualise the assembly graphs with Bandage (Optional)

Exercise

Download the polished assembly graph of the ONT assembly and compare it with the assembly graph of the PE and PE-MP assemblies.
Solution

In a new tab (local computer) in your terminal do:
scp YOUR-NETID@vm0X-bt-edu.tnw.tudelft.nl:~/asm_workshop/results/miniasm_ont/ont_polished.gfa \
       ~/Desktop/bandage/
start Bandage and load the file ont_polished.gfa.

Click on “Draw graph” and save as image (current view)

compare with the PE PE-MP assembly graphs.

Key Points

previous episode

Genome Assembly

next episode

de novo Long Read Assembly

Overview

De Novo assembly (Nanopore)

Quality Control

FastQC

Exercise

Solution

Miniasm assembler

Step 1: minimap2

Step 2: miniasm

Step 3: Consensus sequence

Assembly Error rate

Exercise

Solution

Error correcting (polishing) long read based assemblies

Step1: initial Racon polish with constituent reads

Step 2: full Racon polish rounds

Step 3: contig read depth

Run Minipolish

Polished consensus sequence

Exercise

Solution

Polishing with Pilon

Exercise

Solution

Pilon

Exercise

Solution

QUAST: compare ONT, PE and PE-MP assemblies

Exercise

Solution

Assembly alignment

Exercise

Solution

Visualise the assembly graphs with Bandage (Optional)

Exercise

Solution

Key Points

previous episode

next episode