This section introduces the file and data formats you will encounter most often when performing NGS analyses. Knowing what each format is for, and how it feeds into further processing, is something you will rely on constantly, which is why the NGS workflows tutorial starts here.
The FASTA format is a text-based standard for representing nucleotide or protein sequences. Each entry starts with a description line that begins with a ">" symbol, followed by the sequence itself, which may be written on a single line or wrapped across several lines.
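The record layout described above can be read with a few lines of Python. This is a minimal sketch rather than a full parser: the function name `parse_fasta` and the example record are hypothetical, and it assumes the whole file fits in memory.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {description: sequence}, joining wrapped sequence lines."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]  # drop the leading ">"
            records[header] = ""
        elif header is not None:
            records[header] += line  # sequence may span several lines
    return records


# Hypothetical record, used only for illustration.
example = [
    ">seq1 hypothetical example record",
    "ACGTACGT",
    "TTGA",
]
records = parse_fasta(example)
```

For real data, pass an open file handle instead of the `example` list.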
>gi|67328264|gb|AAFC02129962.1| Bos taurus breed Hereford Con136352, whole genome shotgun sequence
The FASTQ format stores sequences and their Phred quality scores in a single file. A FASTQ file uses four lines per sequence:
The sequence identifier contains the following fields: WGG97JN1, the unique instrument name; 192, the run id; C200YACXX, the flowcell id; 7, the flowcell lane; 1101, the tile number within the flowcell lane; 1307, the x-coordinate of the cluster within the tile; 1960, the y-coordinate of the cluster within the tile; 1, the member of a pair, 1 or 2 (paired-end or mate-pair reads only); Y if the read fails the filter (the read is bad), N otherwise; 0 when none of the control bits are on, otherwise an even number; TTAGGC, the index sequence.
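The identifier fields listed above can be split apart programmatically. The sketch below assumes the common Illumina layout in which the first seven fields are colon-separated, followed by a space and four more colon-separated fields. The header string is assembled from the values listed above, with the filter flag set to N as an assumption, and the names `IlluminaId` and `parse_illumina_id` are hypothetical.

```python
from typing import NamedTuple


class IlluminaId(NamedTuple):
    instrument: str
    run_id: int
    flowcell: str
    lane: int
    tile: int
    x: int
    y: int
    pair_member: int
    fails_filter: bool
    control_bits: int
    index: str


def parse_illumina_id(header: str) -> IlluminaId:
    """Split an identifier line into its named components."""
    left, right = header.lstrip("@").split(" ", 1)
    instrument, run_id, flowcell, lane, tile, x, y = left.split(":")
    member, filtered, control, index = right.split(":")
    return IlluminaId(
        instrument, int(run_id), flowcell, int(lane), int(tile),
        int(x), int(y), int(member), filtered == "Y", int(control), index,
    )


# Assumed header built from the field values described in the text.
header = "@WGG97JN1:192:C200YACXX:7:1101:1307:1960 1:N:0:TTAGGC"
read_id = parse_illumina_id(header)
```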
The quality score of a base, also known as a Phred or Q score, is an integer value representing the estimated probability of an error, i.e. that the base is incorrect. If P is the error probability, then:
$P = 10^{-Q/10}$
$Q = -10 \log_{10}(P)$
Q scores are often represented as ASCII characters. The quality scores are logarithmically linked to error probabilities, as shown in the following table:
Phred quality score (Q) | Probability that the base is called wrong (P) | Accuracy of the base call
10 | 1 in 10 | 90%
20 | 1 in 100 | 99%
30 | 1 in 1,000 | 99.9%
40 | 1 in 10,000 | 99.99%
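The relationship between Q and P can be checked with a short Python sketch. The function names are hypothetical, and the ASCII offset of 33 is an assumption that matches the "Phred+33" encoding; files that use a different offset need a different value.

```python
import math
from typing import List


def phred_to_error_prob(q: float) -> float:
    """P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Q = -10 * log10(P)."""
    return -10 * math.log10(p)


def decode_quality(qual: str, offset: int = 33) -> List[int]:
    """Convert an ASCII quality string to integer Q scores (offset assumed to be 33)."""
    return [ord(ch) - offset for ch in qual]


q_scores = decode_quality("I!5")
error_probs = [phred_to_error_prob(q) for q in q_scores]
```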
Aligning sequence reads to a reference genome produces a file in the Sequence Alignment/Map (SAM) format. The SAM file is then typically converted into its binary equivalent, a Binary Alignment/Map (BAM) file, which can be indexed to allow efficient random access to the data it contains. An example SAM file is shown below:
@SQ SN:1 LN:185838109
@RG ID:reSeq LB:library SM:nn PL:illumina
HWI-H1:155:c04j0abxx:7:1101:6942:1982 113 1 146794624 1 33M = 146794624 0 TATGGGATCCCGCCTCAACATGACCTGATGAGC @?;?@?55'GGGDCGFB?GEEAAHHIHFHHHHH NM:i:1 AS:i:38 XS:i:34 RG:Z:reSeq
HWI-H1:155:c04j0abxx:7:1101:6942:1982 177 1 146794624 1 33M = 146794624 0 TATGGGATCCCGCCTCAACATGACCTGATGAGC @?;?@?55'GGGDCGFB?GEEAAHHIHFHHHHH NM:i:1 AS:i:38 XS:i:34 RG:Z:reSeq
HWI-H1:155:c04j0abxx:7:1101:6674:1967 177 1 101188968 60 27M = 100338662 0 CCATCTACTGGGGATTGGATCAATAAA 3<A<,DFF:<<2<AFBDDB4DDBD??: NM:i:1 AS:i:52 XS:i:21 RG:Z:reSeq
The SAM file contains:
The following table gives an overview of the mandatory fields in the SAM format with alignment example:
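No. | Field | Description | Alignment 1 | Alignment 2 | Alignment 3
1 | QNAME | Query template NAME | HWI-H1:155:c04j0abxx:7:1101:6942:1982 | HWI-H1:155:c04j0abxx:7:1101:6942:1982 | HWI-H1:155:c04j0abxx:7:1101:6674:1967
2 | FLAG | bitwise FLAG | 113 | 177 | 177
3 | RNAME | Reference sequence NAME | 1 | 1 | 1
4 | POS | 1-based leftmost mapping POSition | 146794624 | 146794624 | 101188968
5 | MAPQ | MAPping Quality | 1 | 1 | 60
6 | CIGAR | CIGAR string | 33M | 33M | 27M
7 | RNEXT | Ref. name of the mate/next read | = | = | =
8 | PNEXT | Position of the mate/next read | 146794624 | 146794624 | 100338662
9 | TLEN | observed Template LENgth | 0 | 0 | 0
10 | SEQ | segment SEQuence | TATGGGATCCCGCCTCAACATGACCTGATGAGC | TATGGGATCCCGCCTCAACATGACCTGATGAGC | CCATCTACTGGGGATTGGATCAATAAA
11 | QUAL | ASCII of Phred-scaled base QUALity+33 | @?;?@?55'GGGDCGFB?GEEAAHHIHFHHHHH | @?;?@?55'GGGDCGFB?GEEAAHHIHFHHHHH | 3<A<,DFF:<<2<AFBDDB4DDBD??:
12 | (optional) | additional optional information | NM:i:1 AS:i:38 XS:i:34 RG:Z:reSeq | NM:i:1 AS:i:38 XS:i:34 RG:Z:reSeq | NM:i:1 AS:i:52 XS:i:21 RG:Z:reSeq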
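To make the field layout concrete, the sketch below splits one alignment line into the eleven mandatory SAM fields plus its optional tags. It assumes the fields are tab-separated, as in a real SAM file (the tabs are rebuilt here with `"\t".join`), and the helper name `parse_sam_line` is hypothetical. For production work a dedicated library is usually a better choice.

```python
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}


def parse_sam_line(line: str) -> dict:
    """Split one alignment line into mandatory fields and optional tags."""
    cols = line.rstrip("\n").split("\t")
    record = {}
    for name, value in zip(SAM_FIELDS, cols[:11]):
        record[name] = int(value) if name in INT_FIELDS else value
    tags = {}
    for item in cols[11:]:
        tag, typ, value = item.split(":", 2)  # e.g. "NM:i:1"
        tags[tag] = int(value) if typ == "i" else value
    record["tags"] = tags
    return record


# Third alignment from the example SAM file, rebuilt with tab separators.
line = "\t".join([
    "HWI-H1:155:c04j0abxx:7:1101:6674:1967", "177", "1", "101188968", "60",
    "27M", "=", "100338662", "0", "CCATCTACTGGGGATTGGATCAATAAA",
    "3<A<,DFF:<<2<AFBDDB4DDBD??:", "NM:i:1", "AS:i:52", "XS:i:21", "RG:Z:reSeq",
])
rec = parse_sam_line(line)
# Bit 0x10 of FLAG is set when the read maps to the reverse strand.
is_reverse = bool(rec["FLAG"] & 0x10)
```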
The CIGAR string is a sequence of operation lengths, each followed by the operation it applies to. It indicates which bases align with the reference (M, which covers both matches and mismatches), which reference bases are deleted from the read (D), which read bases are insertions that are absent from the reference (I), and which reference regions are skipped (N). In our example all reads align along their full length without insertions or deletions (Alignment 1: 33M; Alignment 2: 33M; Alignment 3: 27M). Because M does not distinguish matches from mismatches, the NM:i:1 tag still reports one mismatch per read. The following additional example illustrates a more complex CIGAR string:
RefPos:     1  2  3  4  5  6  7     8  9 10 11 12 13 14 15 16 17 18 19
Reference:  C  C  A  T  A  C  T  -  G  A  A  C  T  G  A  C  T  A  A  C
Read:                   A  C  T  A  G  A  A  -  T  G  G  C  T
With the alignment above, the read starts at reference position 5 (POS: 5) and consists of 3 matches, 1 insertion, 3 matches, 1 deletion and 5 matches, so the CIGAR string is 3M1I3M1D5M.
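As a quick check, the CIGAR string 3M1I3M1D5M can be parsed to see how many read bases and reference bases it covers. This is a minimal sketch: the function names are hypothetical, and the sets of operations that consume read or reference bases follow the SAM specification.

```python
import re
from typing import List, Tuple

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_READ = set("MIS=X")  # operations that advance along the read
CONSUMES_REF = set("MDN=X")   # operations that advance along the reference


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Return [(length, operation), ...] for a CIGAR string."""
    ops = [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return ops


def cigar_lengths(cigar: str) -> Tuple[int, int]:
    """Return (read bases consumed, reference bases consumed)."""
    ops = parse_cigar(cigar)
    read_len = sum(n for n, op in ops if op in CONSUMES_READ)
    ref_len = sum(n for n, op in ops if op in CONSUMES_REF)
    return read_len, ref_len


read_len, ref_len = cigar_lengths("3M1I3M1D5M")
# The alignment starts at POS 5, so it ends at reference position 5 + ref_len - 1.
end_pos = 5 + ref_len - 1
```

Here the read contributes 12 bases and the alignment spans 12 reference positions, ending at position 16.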
The full Sequence Alignment/Map format specification is available at http://samtools.github.io/hts-specs/SAMv1.pdf.
The Variant Call Format (VCF) is a standardized format for storing and reporting genomic sequence variations. VCF files are used to report sequence variations (e.g., SNPs, insertions and deletions (INDELs) and larger structural variants) together with rich annotations. VCF files are modular: the annotations and genotype information for a variant are kept separate from the call itself. VCF version 4.1 was the active format specification at the time of writing; newer versions have since been published.
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
20 1234567 microsat1 GTC G,GTCT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3
A VCF file contains meta-information lines (starting with ## and written as key=value pairs), a header line (starting with # and listing the 8 fixed mandatory columns CHROM, POS, ID, REF, ALT, QUAL, FILTER and INFO, followed, when genotype data is present, by a FORMAT column and one column per sample), and data lines, each containing information about one position in the genome.
Other variants of the Variant Call Format (VCF) specification add supplemental, project-specific information, such as the TCGA Variant Call Format (VCF) version 1.2. For example, the TCGA VCF specification allows additional fields that represent data associated with complex rearrangements, RNA-Seq variants, and sample-specific metadata. The full summary of additions and modifications in the TCGA Variant Call Format is available at https://wiki.nci.nih.gov/display/TCGA/TCGA+Variant+Call+Format+(VCF)+1.2+Specification.
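The structure of a data line can be illustrated with a small Python sketch that splits the fixed columns, the INFO key=value pairs and the per-sample genotype fields. It assumes tab-separated columns, as in a real VCF file, and the helper name `parse_vcf_line` is hypothetical. It is not a substitute for a dedicated VCF library.

```python
from typing import List

VCF_FIXED = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_vcf_line(line: str, sample_names: List[str]) -> dict:
    """Parse one VCF data line into fixed columns, INFO and sample fields."""
    cols = line.rstrip("\n").split("\t")
    record = dict(zip(VCF_FIXED, cols[:8]))
    record["POS"] = int(record["POS"])
    record["ALT"] = record["ALT"].split(",")  # several ALT alleles are comma-separated
    info = {}
    for item in record["INFO"].split(";"):
        key, sep, value = item.partition("=")
        info[key] = value if sep else True  # flags such as DB carry no value
    record["INFO"] = info
    format_keys = cols[8].split(":")
    record["samples"] = {
        name: dict(zip(format_keys, col.split(":")))
        for name, col in zip(sample_names, cols[9:])
    }
    return record


# First data line of the example above, rebuilt with tab separators.
line = "\t".join([
    "20", "14370", "rs6054257", "G", "A", "29", "PASS",
    "NS=3;DP=14;AF=0.5;DB;H2", "GT:GQ:DP:HQ",
    "0|0:48:1:51,51", "1|0:48:8:51,51", "1/1:43:5:.,.",
])
rec = parse_vcf_line(line, ["NA00001", "NA00002", "NA00003"])
```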
The Gene transfer format (GTF) is a file format used to hold information about gene structure. It is a tab-delimited text format based on the general feature format (GFF), but contains some additional conventions specific to gene information.
1 protein_coding CDS 15588 15702 . + 1 gene_id "ENSGALG00000009771"; transcript_id "ENSGALT00000015891"; exon_number "6"; gene_biotype "protein_coding"; protein_id "ENSGALP00000015874";
1 protein_coding stop_codon 15703 15705 . + 0 gene_id "ENSGALG00000009771"; transcript_id "ENSGALT00000015891"; exon_number "6"; gene_biotype "protein_coding";
1 protein_coding exon 35753 35950 . - . gene_id "ENSGALG00000027884"; transcript_id "ENSGALT00000042749"; exon_number "1"; gene_biotype "protein_coding"; exon_id "ENSGALE00000312791";
1 protein_coding CDS 35753 35950 . - 0 gene_id "ENSGALG00000027884"; transcript_id "ENSGALT00000042749"; exon_number "1"; gene_biotype "protein_coding"; protein_id "ENSGALP00000042558";
The GTF file contains: tab-separated fields, track lines and additional information. The most important part - the tab-separated fields - consists of:
The full GTF/GFF Format specification is available at http://www.ensembl.org/info/website/upload/gff.html and at http://mblab.wustl.edu/GTF22.html#fields.
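A GTF data line such as those shown above can be split into its eight fixed columns and its attribute list. The sketch below assumes tab-separated columns and attributes written as `key "value";` pairs, as in the example lines. The column names follow the GTF/GFF documentation linked above, and the helper name `parse_gtf_line` is hypothetical.

```python
import re

GTF_COLUMNS = ["seqname", "source", "feature", "start", "end",
               "score", "strand", "frame"]
ATTR_RE = re.compile(r'(\S+)\s+"([^"]*)"')  # matches: key "value"


def parse_gtf_line(line: str) -> dict:
    """Parse one GTF data line into fixed columns and an attribute dict."""
    cols = line.rstrip("\n").split("\t")
    record = dict(zip(GTF_COLUMNS, cols[:8]))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    record["attributes"] = dict(ATTR_RE.findall(cols[8]))
    return record


# The stop_codon line from the example above, rebuilt with tab separators.
line = "\t".join([
    "1", "protein_coding", "stop_codon", "15703", "15705", ".", "+", "0",
    'gene_id "ENSGALG00000009771"; transcript_id "ENSGALT00000015891"; '
    'exon_number "6"; gene_biotype "protein_coding";',
])
rec = parse_gtf_line(line)
# Coordinates are 1-based and inclusive, so a stop codon spans 3 bases.
feature_length = rec["end"] - rec["start"] + 1
```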