Introduction

The PL-Grid instance of Galaxy provides all the tools necessary for quality control of your raw reads (FastQC) and for preparing them for alignment to a reference through trimming and filtering (Flexbar). This tutorial introduces these tools and guides you through the process.

Input Dataset

In this tutorial we will use the file ‘test_dataset.fastq’, which is available for download from the Bismark homepage. It contains 10,000 reads in FastQ format with Phred33 qualities, each 50 bp long, from a human directional BS-Seq library.
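Before uploading a file to Galaxy, it can be useful to confirm that it matches the description above (number of reads, read length, Phred33 quality range). The following is a minimal sketch, not part of the Galaxy workflow; the function name `summarize_fastq` and its parameters are hypothetical, and it assumes a plain four-line-per-record FastQ file without line wrapping.

```python
import io
from collections import Counter


def summarize_fastq(handle, phred_offset=33):
    """Summarize a four-line-per-record FastQ stream.

    Returns the read count, a histogram of read lengths and the
    lowest/highest quality score seen (assuming the given Phred offset).
    """
    n_reads = 0
    lengths = Counter()
    min_q = None
    max_q = None
    while True:
        header = handle.readline()
        if not header:
            break  # end of file
        if not header.strip():
            continue  # tolerate blank lines between records
        seq = handle.readline().rstrip("\r\n")
        plus = handle.readline()
        qual = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FastQ record near read %d" % (n_reads + 1))
        if len(seq) != len(qual):
            raise ValueError("sequence/quality length mismatch in read %d" % (n_reads + 1))
        n_reads += 1
        lengths[len(seq)] += 1
        for char in qual:
            score = ord(char) - phred_offset
            min_q = score if min_q is None else min(min_q, score)
            max_q = score if max_q is None else max(max_q, score)
    return {
        "reads": n_reads,
        "lengths": dict(lengths),
        "min_quality": min_q,
        "max_quality": max_q,
    }


# Small in-memory example; for a real file, pass open("your_file.fastq") instead.
example = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACG\n+\n!5I\n")
summary = summarize_fastq(example)
print(summary)
```

For the tutorial dataset one would expect 10,000 reads, all of length 50. A negative minimum quality would suggest that the file does not use the Phred33 offset.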

...

In order to learn how to interpret these charts, please refer to the official help on FastQC and interpretation of QC metrics.

Quality trimming and filtering sequences

Flexbar software demultiplexes barcoded runs and removes adapter sequences. Moreover, trimming and filtering features are provided. Flexbar increases read mapping rates and improves genome and transcriptome assemblies. It supports next-generation sequencing data in fasta/q and csfasta/q format from Illumina, Roche 454, and the SOLiD platform.

...

Adapter removal statistics
==========================
Adapter:                  Overlap removal:    Full length:
TruSeq Universal Adapter  2711062             0
TruSeq Adapter, Index 1   2754956             0
TruSeq Adapter, Index 2   3342                0
TruSeq Adapter, Index 3   567                 0
TruSeq Adapter, Index 4   151                 0
TruSeq Adapter, Index 5   34                  0
TruSeq Adapter, Index 6   45                  0
TruSeq Adapter, Index 7   39                  0
TruSeq Adapter, Index 8   7                   0
TruSeq Adapter, Index 9   14                  0
TruSeq Adapter, Index 10  4496                0 -> specified for this sample
TruSeq Adapter, Index 11  7                   0
TruSeq Adapter, Index 12  12                  0
TruSeq Adapter, Index 13  14                  0
TruSeq Adapter, Index 14  8                   0
TruSeq Adapter, Index 15  3                   0
TruSeq Adapter, Index 16  2                   0
TruSeq Adapter, Index 18  2                   0
TruSeq Adapter, Index 19  8                   0
TruSeq Adapter, Index 20  0                   0
TruSeq Adapter, Index 21  0                   0
TruSeq Adapter, Index 22  1                   0
TruSeq Adapter, Index 23  0                   0
TruSeq Adapter, Index 25  12                  0
TruSeq Adapter, Index 27  4                   0

Min, max, mean and median adapter overlap: 1 / 58 / 2 / 2

Output file statistics
======================
Read file:               FlexbarTargetFile.fastq
  written reads          10124610
  skipped short reads    7265

Filtering statistics
====================
Processed reads                   10148633
  skipped due to uncalled bases      16758
  trimmed due to low quality        748711
  short prior adapter removal         2046
  finally skipped short reads         7265
Discarded reads overall              24023
Remaining reads 10124610 (99% of input reads)
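
The counts in the filtering statistics can be cross-checked with simple arithmetic. In this log, the reads skipped due to uncalled bases plus the finally skipped short reads add up to the discarded total, and the processed reads minus the discarded reads give the remaining reads. The sketch below repeats that check in Python; the function name and dictionary keys are hypothetical labels chosen for illustration, and the numbers are copied from the log above.

```python
def check_filtering_stats(stats):
    """Cross-check the read counts reported in a Flexbar-style filtering summary."""
    discarded = stats["skipped_uncalled"] + stats["skipped_short"]
    remaining = stats["processed"] - discarded
    return {
        "discarded_matches": discarded == stats["discarded"],
        "remaining_matches": remaining == stats["remaining"],
        "percent_remaining": 100.0 * stats["remaining"] / stats["processed"],
    }


# Values copied from the "Filtering statistics" section of the log above.
flexbar_log = {
    "processed": 10148633,
    "skipped_uncalled": 16758,
    "skipped_short": 7265,
    "discarded": 24023,
    "remaining": 10124610,
}

result = check_filtering_stats(flexbar_log)
print(result)
```

Both checks hold for this log, and the exact share of remaining reads is about 99.76%, which the log shows as 99%.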

Trimmed data quality control (optional)

To check the results of quality trimming and filtering, and to compare the raw data with the trimmed data, repeat the "Raw data quality control" step on the trimmed data.

...

The "Per base sequence quality" figure (left) looks better than it did before trimming. Low-quality sequence (quality below 20) was removed, especially at the ends of reads (positions 46-50). Similar changes are visible in the "Sequence length (bp)" figure (right): most reads kept their full length (50 bp), but some were shortened by adapter removal (probably those shorter than 45 bp) and others by trimming of poor-quality ends (46-49 bp). The middle figure, "Per base sequence content", shows the mean percentage of each nucleotide (A, C, G, T). After trimming, mainly of adapters with repetitive sequences, the mean nucleotide percentages are more even than before trimming.
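
To illustrate why trimming shortens some reads to 46-49 bp, here is a minimal sketch of quality trimming from the 3' end with a threshold of 20. It is a simplified illustration, not Flexbar's actual algorithm; the function name `trim_3prime` and its parameters are hypothetical, and Phred33 encoding is assumed.

```python
def trim_3prime(seq, qual, threshold=20, phred_offset=33):
    """Remove bases from the 3' end while their quality is below the threshold.

    Trimming stops at the first base (counting from the end) whose
    quality is at least `threshold`.
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - phred_offset < threshold:
        end -= 1
    return seq[:end], qual[:end]


# 'I' encodes quality 40, '5' encodes 20, '+' encodes 10 and '!' encodes 0.
trimmed_seq, trimmed_qual = trim_3prime("ACGTACGT", "IIIII5+!")
print(trimmed_seq, trimmed_qual)
```

In this example the last two bases (qualities 10 and 0) are removed, while the base with quality exactly 20 is kept. A read whose bases are all below the threshold is trimmed to an empty string, which is why trimming tools typically also apply a minimum read length filter afterwards.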

Closing remarks

This tutorial covered trimming adapters and low-quality reads using Flexbar. Additionally, we used FastQC at each step of the analysis to check the quality of our reads. This should be the first step before you proceed with aligning your sequences to a reference, as described in "Short reads alignment to human reference genome using BWA" or "Short reads alignment to mouse transcriptome using Tophat2".

...