PL-Grid instance of Galaxy provides all the tools necessary for creating and executing a complete RNA-seq analysis pipeline for human (hg19) and mouse (mm9, mm10). This tutorial introduces these tools and guides you through them. Tools which are required for every step are marked by info frames. All the form fields that has to be filled are described below the info frames.
In this tutorial we will use sample reads from total RNA sequencing of mouse astroglial primary cultures treated with dexamethasone (on the courtesy of Institute of Pharmacology PAS). This is ribo-depleted totalRNA sample without size selection. Therefore various types of RNA might be found in the sample including tRNA, lincRNA, pre-mRNA or mRNA.
Fill in the form:
We will supply Tophat with a set of gene model annotations and/or known transcripts, as a GTF 2.2 formatted file. If the annotation is provided, Tophat will first extract the transcript sequences and use Bowtie to align reads to this virtual transcriptome first. Only the reads that do not fully map to the transcriptome will then be mapped on the genome. The reads that did map on the transcriptome will be converted to genomic mappings (spliced as needed) and merged with the novel mappings and junctions in the final tophat output.
We will use UCSC table browser for this, which is included in galaxy as a tool. We have to select clad, species and assembly version. The tool enables to query various tables form UCSC genome browser database. We are interested in gene annotations therefore we will use 'genes and gene prediction tracks' group of tables. 'ensGene' table from 'Ensembl genes' track is a set of genes and their isoforms that were curated by EBI Ensembl. Finally, we will select format in which we would like to export our data. GTF is widely used by tools that use genome annotations. Also make sure to check the 'Galaxy' checkbox to send data to galaxy.
Fill in the form:
UCSC is a rich resource for genomic analyses (Karolchik et al., 2014). There are various tracks that contain information about genes, gene predictions, mRNA, proteins, conservation, genetic variations, transcription regulation chromatin structure or repeatable elements for almost all sequenced genomes available. The most suitable way of viewing the data is the Genome Browser. However, the data can be accessed also by the Table Browser (presented in the tutorial), a ftp download or a MySQL client.
TopHat is a specialized alignment software for RNA-Seq reads that is aware of splice junctions when aligning to a reference assembly. Tophat aligns RNA-Seq reads to genom using short read aligner Bowtie, and then analyzes the mapping results to identify splice junctions between exons. We will use annotations downloaded in the previous step. Tophat will first extract the transcript sequences and use Bowtie to align reads to this virtual transcriptome first. Only the reads that do not fully map to the transcriptome will then be mapped on the genome. We have to select our fastq file as an input. Next, we can select a reference genome, which should be mm9 in this case. We have to expand full parameter list to access fields that allow to supply annotations. Scroll down to 'Use Own Junctions' field. Regrettably, after every selection focus will be moved to the top of the page and we will have to scroll down again. So, we would like to use our own junctions in the form of gene annotation model (gtf or gff), and finally we have to choose our annotation file.
Fill in the form:
Using splice-junction mappers is the most popular way of RNA-seq analysis. Especially if the analyzed genome is well sequenced with few gaps in it (like rat, mouse, human or arabidopsis). There are few splice junction mappers available including TopHat (Trapnell et al., 2010), STAR (Dobin et al., 2013), GSNAP, MapSplice, SpliceMap and SoapSplice (Hunag et al., 2011). There are two criteria for selection of a tool: accuracy and speed. TopHat and STAR are the fastest tools. STAR is more accurate, however it does not use annotation guidance. Therefore for well annotated genomes (mouse, human, caenorhabditis) it is recommended to use TopHat. Using annotations is generally accepted practice for rna-seq analysis - it increases speed and accuracy of alignment, especially for short reads (coming from illumina instrument for example).
For visualization we will use Trackster. Trackster is Galaxy's visualization and visual analysis environment. It lets you visualize your SAM/BAM, BED, GFF/GTF, WIG, bigWig, bigBed, bedGraph, and VCF datasets from within Galaxy. First, click on Visualization --> New Track Browser in the top menu bar, NOT in the tools bar . Choose a build and name for the visualization. You will now see a visualization with no datasets in it. Add datasets by clicking the "Add Datasets to Visualization" button. Select a history or data library, followed by one or more datasets. If your desired datasets do not show up, make sure they are of the same build as the visualization and that they are of a supported dataset type. After adding tracks, Trackster will index the datasets for fast access. Once this is done, you can navigate to a chrom/contig in the dropdown to view your data.
Fill in the form:
Fill in the form:
All tracks will be visualized by Squish plot type. Bold lines depict reads for bam file, exons for annotation gtf file or intron endings for splice junction bed file. Thin lines depict gaps in reads for bam files, introns for annotation gtf files and introns for splice junction bed files. The color has only meaning for visualization of bam files, where they depict strand orientation of read. The orientation is only preserved for RNA-seq and not for DNA-seq standard protocols of library preparation. The reads from bam files occur mostly at sites annotated by exons. However there are some reads occurring on intronic (pre mRNA) or intergenic sites (enhancerRNA, rRNA or DNA contamination).
More about trackster plots can be found here https://wiki.galaxyproject.org/Learn/Visualization.
This tutorial does not cover differential expression analysis. However, we will give brief introduction to do that. There is no consensus on the best methodology for analyzing RNA-seq data. First, the user has to decide whether to use feature-count methods versus transcript deconvolution methods for transcript quantification. This issue was best described by the cuffdiff2 paper (Trapnell et al., 2013, see figure below). The first choice influences further statistical approach. There are two popular options for transcript deconvolution: (1) cufflinks (Roberts et al., 2011) and (2) MISO (Katz et al. 2010). Cufflinks is more widely adopted than MISO. Moreover, MISO does not accept reads of different length - therefore it is unusable for reads coming from SOLiD or Ion Proton instruments. The list of tools that allow feature based read counts include HTSeq and custom made scripts.
Figure reproduced from the (Trapnell et al., 2013) under fair use. (a) A simple schemes for expression summarization using gene exons. In the exon-union model read counts falls on one of the exons of the gene, whereas in the exon-intersection model read counts fall only on the constitutive exons. (b) both of the exon-union and exon-intersection counting schemes may incorrectly estimate the changes in gene expression of the different isoforms. The real expression is estimated by the sum of the length isoform normalized read counts. The discrepancy between the change in gene expression is driven by a change in the abundance of isoforms in relation to each other. In the top row, a gene produces the same number of reads in both conditions A and B, but in the state B, there is more of the shorter isoform. Intersection count program underestimates the true changes in gene expression. In the middle row intersection-union algorithm does not detect a change driven by a shift in the predominant isoform gene. Union program detects changes in the wrong direction. In the bottom row, the expression of the gene is fixed, but the isoforms undergo a complete switch between the A and B. Both the simplified counting schemes outputs change magnitude that do not reflect true changes in gene expression.
The isoform quantification method influenced further choice of tool for differential expression. Moreover, approaches to statistics for RNA-seq was strongly influenced by the price of this technology. As RNA-seq experiments are expensive, they suffer from low number of samples. That is why approaches that are powerful (in the statistical sense; giving many false positives) are popular for RNA-seq analysis. When transcript isoforms are quantified using MISO package, differential expression analysis can also be performed with this package. If the quantification is done using cufflinks, it can be followed up by cuffdiff. These two approaches give many false positives, especially for low and high abundance transcripts. If there is enough samples to get statistical power it is recommended to use traditional parametric statistics followed up by correction for multiple testing. If the isoform quantification is done using feature-based methods they can be followed up by edgeR, DESeq or DEXSeq methods.
Roberts A, Trapnell C, Donaghey J, Rinn JL, Pachter L. (2011), Improving RNA-Seq expression estimates by correcting for fragment bias. Genome Biol. 2011;12(3):R22. doi: 10.1186/gb-2011-12-3-r22.