Quality Control and Trimming Recap

Overview

Teaching: 10 min
Exercises: 0 min

Questions

How do I start a genome assembly?

How can I describe the quality of my data?

How can I get rid of sequence data that doesn’t meet my quality standards?

Objectives

Explain how a FASTQ file encodes per-base quality scores.

Interpret a FastQC plot summarizing per-base quality across all reads.

Clean FASTQ reads using Trimmomatic.

De novo assembly workflows

When working with high-throughput sequencing data, the raw reads you get off of the sequencer will need to pass through a number of different tools in order to generate your final desired output. The execution of this set of tools in a specified order is commonly referred to as a workflow or a pipeline.

The workflow we will use for our de novo assembly is listed below, with a brief description of each step.

Quality control - Assessing quality using FastQC
Quality control - Trimming and/or filtering reads (if necessary)
Assemble the reads
Perform quality control of the assembly
Annotate the assembled genome

As in the variant calling workflow, these workflows adopt a plug-and-play approach: the output of one tool can be used as input to another tool without extensive configuration. Standards for data formats are what make this feasible. They ensure that data is stored in a way that is generally accepted and agreed upon within the community, so the tools used at different stages of the workflow can be built on the assumption that the data will be provided in a specific format.

Starting with Data

During this session we will work with sequencing reads from a study by Trivedi et al. (doi), in which benchmark datasets generated from control samples across a range of genome sizes were used to illustrate that QC inferences made using draft assemblies are broadly equivalent to those made using a well-established reference. Several organisms were sequenced, but here we will use only a 600 bp paired-end library and a 2.5 kb mate-pair library from Escherichia coli, from the Illumina platform.

As a first step we will inspect the paired-end library: look at the data structure and do some quality control and filtering. Before we can work with the data, we first create a working directory and set up the environment.

Load the environment

$ source /mnt/linapps/conda3loader
$ conda activate ASM

Use pwd (print working directory) to see which directory you are in:

$ pwd

You will get something like:

/home/nfs/YOUR-NETID

If you are not in your home folder, type “cd” to go back to it:

$ cd

Get the data for the genomics course

Copy the data that we are going to use for this session:

$ cp -r /mnt/linapps/share/asm_workshop/ .

Move into the newly created directory “asm_workshop”:

$ cd asm_workshop

Check with “pwd” that you have something like:

/home/nfs/YOUR-NETID/asm_workshop

Quality Control

We learned about FASTQ files and how to do quality control in the variant calling sessions. For a recap, see the lesson: fastq and quality control
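
As a quick reminder of the encoding: each character on the fourth line of a FASTQ record stands for one Phred quality score, stored as the character's ASCII code minus an offset. The sketch below assumes the Phred+33 offset used by current Illumina output (older Illumina pipelines used +64). The name phred_score is a hypothetical helper for illustration, not part of FastQC or any other tool.

```shell
# Hypothetical helper: convert one FASTQ quality character to its
# Phred score, assuming the Phred+33 encoding.
phred_score() {
    # A leading single quote makes printf print the character's ASCII code
    ascii=$(printf '%d' "'$1")
    echo $(( ascii - 33 ))
}

phred_score 'I'   # prints 40
phred_score '!'   # prints 0
```

A score of 40 corresponds to an estimated error probability of 1 in 10,000, a score of 0 to a base call that carries no confidence at all.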

Exercise

Assess the quality of the paired-end library called PE_600bp_50x. PE stands for paired-end, 600bp is the insert size of the sequenced fragments, and we use a subset of the data at 50x coverage. (Hint: Use fastqc, then scp to download the created html files.)
Solution

Create an output folder for the result files.
$ mkdir -p results/fastqc_untrimmed_reads
Run fastqc on the paired-end library
$ fastqc data/untrimmed_fastq/PE_600bp_* -o results/fastqc_untrimmed_reads
In a new tab (local computer) in your terminal do:
$ mkdir -p ~/Desktop/fastqc_html/
$ scp YOUR-NETID@vm0X-bt-edu.tnw.tudelft.nl:~/asm_workshop/results/fastqc_untrimmed_reads/*.html ~/Desktop/fastqc_html/
Then take a look at the html files in your browser.

Trimming

Trimmomatic performs a variety of useful trimming tasks, such as removal of low-quality bases and adapters, for Illumina paired-end and single-end data. For a recap, see the trimming lesson: trimmomatic

Exercise

Apply Trimmomatic to the paired-end 600bp fragment library with the following settings:

No adapter trimming.

Remove leading low quality or N bases (below quality 3) (LEADING:3)

Remove trailing low quality or N bases (below quality 3) (TRAILING:3)

Scan the read with a 4-base wide sliding window, cutting when the average quality per base drops below 15 (SLIDINGWINDOW:4:15)

Drop reads shorter than 100 bases (MINLEN:100)
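
To make the sliding-window step more concrete, here is a simplified sketch of the idea for a single read. It is not Trimmomatic's actual implementation, which differs in where exactly it places the cut. It only shows how a window of 4 bases with a required average of 15 locates the point where quality drops. The function name sliding_window_keep is hypothetical; it takes the window size, the threshold and then the per-base Phred scores, and prints how many leading bases would be kept.

```shell
# Simplified illustration of SLIDINGWINDOW:<window>:<threshold>.
# Usage: sliding_window_keep WINDOW THRESHOLD SCORE [SCORE ...]
sliding_window_keep() {
    window=$1
    threshold=$2
    shift 2
    # Hand the remaining arguments (the scores) to awk as one line
    printf '%s\n' "$*" | awk -v w="$window" -v t="$threshold" '{
        for (s = 1; s + w - 1 <= NF; s++) {
            sum = 0
            for (i = s; i < s + w; i++) sum += $i
            # Average below threshold is the same as sum < w * t
            if (sum < w * t) { print s - 1; exit }
        }
        # No window failed: keep the whole read
        print NF
    }'
}

sliding_window_keep 4 15 30 30 30 30 10 10 10 10   # prints 4
```

In this example the first window whose average falls below 15 starts at the fifth base, so the sketch keeps the first four bases. With MINLEN:100, a read cut down this far would then be dropped entirely.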

How many reads were kept and how many removed?
Solution

Create an output folder trimmed_fastq in folder data.
$ mkdir -p data/trimmed_fastq
Move into trimmed_fastq
$ cd data/trimmed_fastq
Run trimmomatic on the paired end library:
$ trimmomatic PE \
        ~/asm_workshop/data/untrimmed_fastq/PE_600bp_1.fastq.gz \
        ~/asm_workshop/data/untrimmed_fastq/PE_600bp_2.fastq.gz \
        PE_600bp_1.trim.fastq.gz PE_600bp_1un.trim.fastq.gz \
        PE_600bp_2.trim.fastq.gz PE_600bp_2un.trim.fastq.gz \
        LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:100 
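
To answer how many reads were kept: when Trimmomatic finishes, it prints a summary line reporting the number of input read pairs, how many pairs survived on both sides, how many survived only as forward or only as reverse reads, and how many were dropped. As an independent check, the reads in any gzipped FASTQ file can be counted, since each read occupies exactly four lines. The helper name count_reads below is hypothetical.

```shell
# Hypothetical helper: count the reads in a gzipped FASTQ file.
# Each FASTQ record is exactly four lines long.
count_reads() {
    # tr strips the padding some wc implementations add
    lines=$(gzip -cd "$1" | wc -l | tr -d ' ')
    echo $(( lines / 4 ))
}

# Example use, run from data/trimmed_fastq:
#   count_reads PE_600bp_1.trim.fastq.gz
#   count_reads ~/asm_workshop/data/untrimmed_fastq/PE_600bp_1.fastq.gz
```

Comparing the count for an untrimmed input file with the count for the corresponding paired output file should agree with the "Both Surviving" figure in the Trimmomatic summary.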

Exercise

Assess the quality of the trimmed paired-end library.
Solution

Create an output folder for the result files.
$ mkdir -p ~/asm_workshop/results/fastqc_trimmed_reads
Run fastqc on the paired-end library
$ fastqc ~/asm_workshop/data/trimmed_fastq/PE_600bp_[12].trim.fastq.gz -o ~/asm_workshop/results/fastqc_trimmed_reads
In a new tab (local computer) in your terminal do:
$ mkdir -p ~/Desktop/fastqc_html/
$ scp YOUR-NETID@vm0X-bt-edu.tnw.tudelft.nl:~/asm_workshop/results/fastqc_trimmed_reads/*.html ~/Desktop/fastqc_html/
Then take a look at the html files in your browser.

Key Points

The options you set for the command-line tools you use are important!

Data cleaning is an essential step in a genomics workflow.

Quality encodings vary across sequencing platforms.
