
Overview

This tutorial shows a simple workflow for aligning short reads from an Illumina HiSeq to the human genome using the Burrows-Wheeler Aligner (BWA). Aligning short reads is a mandatory step before more complex analyses, e.g. variant calling.

RAW short reads

Before we perform any analysis we need to load our data. Galaxy offers a variety of methods for importing input data. All data import tools are grouped in the Get Data category in the Tools side panel.

Upload file

If you want to process your own data, you can upload a file from local disk or download data from the Internet. See the Upload File tool in the Get Data section of the leftmost menu. In this example we will use data from the 1000 Genomes Project. To download the data, first set the file format to fastqsanger and the genome to hg19.

Most combo boxes support 'search as you type'.

Copy and paste the data file URLs into the URL/Text text box ("Paste/Fetch data" button):

ftp://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/phase3/data/HG04047/sequence_read/SRR794807_1.filt.fastq.gz
ftp://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/phase3/data/HG04047/sequence_read/SRR794807_2.filt.fastq.gz

 

Set the file format (Type) to fastqsanger. [HINT: this is important, because Galaxy will otherwise recognize all kinds of FASTQ files as the generic fastq format, whereas BWA requires the more specific fastqsanger format.]

You can also do this later using "Edit Attributes" for the dataset in your history (Data/History window).
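The fastqsanger type refers to FASTQ files whose quality line uses the Sanger (Phred+33) encoding: each character's ASCII code minus 33 gives the base quality score. The following is a minimal sketch of that decoding, not part of Galaxy; the function name and the example quality string are made up for illustration.

```python
from typing import List


def decode_sanger_qualities(quality_line: str) -> List[int]:
    """Convert a Sanger (Phred+33) FASTQ quality string to integer scores."""
    return [ord(ch) - 33 for ch in quality_line]


# Hypothetical quality string, not taken from the tutorial data.
example_quality = "II5+!"
print(decode_sanger_qualities(example_quality))  # [40, 40, 20, 10, 0]
```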

 

 

Pressing Start schedules the download of the compressed FASTQ files from the 1000 Genomes Project FTP server. After downloading, Galaxy automatically decompresses the FASTQ files.
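If you ever want to look inside one of these compressed files outside Galaxy, the sketch below reads records directly from a .fastq.gz file. It assumes the common four-lines-per-record layout and a file already saved locally; the function name is hypothetical and this is not how Galaxy itself handles the data.

```python
import gzip
from typing import Iterator, Tuple


def read_fastq_gz(path: str) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, quality) from a gzip-compressed FASTQ file.

    Assumes four lines per record (no wrapped sequence lines).
    """
    with gzip.open(path, "rt") as handle:
        while True:
            header = handle.readline().rstrip("\n")
            if not header:  # end of file
                break
            sequence = handle.readline().rstrip("\n")
            handle.readline()  # skip the '+' separator line
            quality = handle.readline().rstrip("\n")
            yield header[1:], sequence, quality  # drop the leading '@'


# Example usage, assuming the file was downloaded to the current directory:
# for name, seq, qual in read_fastq_gz("SRR794807_1.filt.fastq.gz"):
#     print(name, len(seq))
#     break
```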

If the command was executed successfully, you will be presented with a confirmation, and two new items will appear in the History side panel on the right edge of the screen. Until the scheduled jobs are complete, the history item background will be yellow.

Upon completion, the history item background turns green (if everything worked properly) or red (if there was an error and the job failed).

Shared data

Downloading the data for this tutorial from the 1000 Genomes Project can take a few minutes. Galaxy was designed to enable easy collaboration between its users, and the most important feature for that is the sharing of data and analysis workflows. These features can be accessed from the Shared Data drop-down menu on the top navigation bar. Instead of downloading the data from the Internet, we can load it quickly using Published Histories.

Click the history name to display more information.

A published history can be imported by clicking the Import history button above the history title.

On success you will see a confirmation message. To display the imported history, click the Galaxy logo in the top left corner.

Quality Control

Before using the data we must evaluate the quality of the reads. Quality control (QC) allows us to detect poor-quality reads or other characteristics of our data and act accordingly.

For each FASTQ file we must run FastQC:Read Quality Reports from the NGS: QC and manipulation category.

 

The quick search at the top of the tools pane lets you access any tool by typing its name.

 

To use FastQC, pick an input file and press Execute. It is good practice to set the title of the resulting history item, e.g. 'FastQC Read 1', which makes the history easier to read.

To view the FastQC report, press the eye icon located next to the history item name.

FastQC report for read 1:

The FastQC report can be used to assess the quality of our data and troubleshoot common problems. The per base sequence quality graph shows that most bases of our reads have a quality higher than Q30, which means we have high-quality reads.
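To put Q30 in context: a Phred quality score Q corresponds to an estimated base-calling error probability of 10^(-Q/10), so Q30 means roughly one wrong base call in 1000. The sketch below shows that conversion; the function name is chosen for illustration only.

```python
def phred_to_error_probability(q: float) -> float:
    """Return the base-calling error probability for a Phred score q."""
    return 10 ** (-q / 10)


# Print the error probability for a few common thresholds.
for q in (10, 20, 30, 40):
    print(f"Q{q}: {phred_to_error_probability(q):.4%} chance of a wrong base call")
```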

Large differences in sequence content between bases at the beginning of the read suggest untrimmed barcodes.
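The per base sequence content check boils down to counting, for every read position, how often each base occurs. The following is a simplified sketch of that idea, not FastQC's actual implementation; the function name and the example reads are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def per_position_base_fractions(reads: List[str]) -> List[Dict[str, float]]:
    """For each read position, return the fraction of reads showing each base."""
    longest = max((len(read) for read in reads), default=0)
    fractions = []
    for pos in range(longest):
        # Only reads long enough to cover this position contribute.
        counts = Counter(read[pos] for read in reads if len(read) > pos)
        total = sum(counts.values())
        fractions.append({base: n / total for base, n in counts.items()})
    return fractions


# Hypothetical reads: position 0 is always 'A', as a leftover barcode base might be.
example_reads = ["ACGT", "ACGA", "ATGT", "ACCT"]
for pos, composition in enumerate(per_position_base_fractions(example_reads)):
    print(pos, composition)
```

A position where one base dominates across all reads, while later positions are mixed, is the kind of pattern the report highlights.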

More information about FastQC and interpretation of QC metrics.

Align reads using BWA

One of the most popular short-read mappers is the Burrows-Wheeler Aligner (BWA). BWA provides three alignment algorithms. BWA-backtrack is designed for Illumina reads up to 100 bp, while BWA-SW (based on the Smith–Waterman algorithm) and BWA-MEM (maximal exact matches) are designed for longer sequences, from 70 bp to 1 Mbp. The tool available in Galaxy uses BWA-MEM.

Using BWA is very simple and requires two parameters. The first is the name of the reference genome we are aligning our reads to; it is used to determine the name and location of the appropriate index file. The second is the name of the input file. If we are aligning paired-end reads, as in this example, we need to provide two files: one with forward reads (usually with _1, r1 or equivalent in the file name) and one with reverse reads (e.g. _2, r2, etc.).
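For comparison, the same two inputs appear when BWA-MEM is run from the command line as `bwa mem reference reads_1 reads_2`. The sketch below only assembles such a command. The reference path hg19.fa is an assumption (it must already be indexed with `bwa index`), the FASTQ names assume the decompressed tutorial files, and the exact command Galaxy runs may differ.

```python
import shlex
from typing import List


def build_bwa_mem_command(reference: str, forward: str, reverse: str,
                          threads: int = 4) -> List[str]:
    """Assemble a paired-end `bwa mem` command line as an argument list."""
    return ["bwa", "mem", "-t", str(threads), reference, forward, reverse]


# Hypothetical local paths; adjust to wherever your files actually live.
command = build_bwa_mem_command(
    "hg19.fa",
    "SRR794807_1.filt.fastq",
    "SRR794807_2.filt.fastq",
)
print(shlex.join(command))
```

On the command line `bwa mem` writes SAM to standard output, so its output is normally redirected to a file or piped into a conversion tool.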

Mapping the reads to the reference yields a single BAM file containing both forward and reverse reads with mapping information.
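Inside the BAM file, each alignment record carries a numeric FLAG field whose bits say, among other things, whether the read is the first or second of its pair and whether it was mapped to the reverse strand. The sketch below decodes a subset of these bits as defined in the SAM specification; the function and label names are chosen for illustration.

```python
from typing import List

# Subset of the FLAG bits defined in the SAM specification.
SAM_FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
}


def describe_flag(flag: int) -> List[str]:
    """List the names of the known bits that are set in a SAM FLAG value."""
    return [name for bit, name in sorted(SAM_FLAG_BITS.items()) if flag & bit]


# 99 and 147 are a typical pair of FLAG values for a properly paired read pair.
for flag in (99, 147):
    print(flag, describe_flag(flag))
```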

 

 
