
This section introduces the file and data formats that you are most likely to encounter when performing NGS analyses. Knowing the purpose of each format, and how files in that format feed into further processing, is something you will rely on often. Hence, we decided to start the NGS workflows tutorial with this part.

Figure: The general scheme of how the various NGS-related formats are used at different stages of NGS data analysis workflows. Normally one starts with the left-most set of files and proceeds rightwards through the subsequent steps of analysis.


FASTA Format

The FASTA format is a standard for representing nucleotide or protein sequences in a text file. Each entry starts with a header line that begins with a ">" symbol, followed by the sequence description. The lines that follow contain the sequence itself, which may be wrapped over several lines (as in the example below) and continues until the next ">" line or the end of the file.

>gi|67328264|gb|AAFC02129962.1| Bos taurus breed Hereford Con136352, whole genome shotgun sequence
CCCCCCCCCCCCCCCCCCGGGCACGTACCTGCTGGATCAGCCCCACCTGGAGCTGGGTGAGGAACAGCTG
GGGAAGGAAGCAAGCGGACAGTGAGCTGAGCCCCGGTGCCGGGTGCCGGCAGGCCCGCCCACCCTGGCCC
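
The layout described above can be read with a few lines of Python. This is only a minimal sketch: the function name `read_fasta` is made up for illustration, and it assumes that the whole file fits in memory and that every description line is unique.

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Collect FASTA records into a {description: sequence} dictionary."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]  # drop the leading ">"
            records[header] = ""
        elif header is not None:
            records[header] += line  # a sequence may span several lines
    return records


example = [
    ">gi|67328264|gb|AAFC02129962.1| Bos taurus breed Hereford Con136352, whole genome shotgun sequence",
    "CCCCCCCCCCCCCCCCCCGGGCACGTACCTGCTGGATCAGCCCCACCTGGAGCTGGGTGAGGAACAGCTG",
    "GGGAAGGAAGCAAGCGGACAGTGAGCTGAGCCCCGGTGCCGGGTGCCGGCAGGCCCGCCCACCCTGGCCC",
]
records = read_fasta(example)
for description, sequence in records.items():
    # Print the first word of the description and the total sequence length.
    print(description.split()[0], len(sequence))
```

For real files, passing an open file handle instead of a list works the same way, because the function only iterates over lines.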

FASTQ Format

The FASTQ format stores sequences and their Phred quality scores in a single file. A FASTQ file uses four lines per sequence:

  • Line 1 - begins with a '@' character and is followed by a sequence identifier and an optional description (like a sequence description),
  • Line 2 - is the raw sequence letters,
  • Line 3 - begins with a '+' character,
  • Line 4 - encodes the quality values for the sequence in Line 2.
@WGG97JN1:192:C200YACXX:7:1101:1307:1960 1:N:0:TTAGGC
CGAGGAGCTGAGTCACAGAGCAGAAGGGGTTTCAGAGATTCGGGCTGTCCA
+
@@@FFFFFHHCFHHDHHIEGIIGIJGIHHGGHJIJIIIJIJJJGGGHFI7@
@WGG97JN1:192:C200YACXX:7:1101:1602:1991 1:N:0:TTAGGC
CTGCGGTTCCTCTCGTACTGAGCAGGATTACTAGCGCAACAACACATCATC
+
=??DD@=<AF?DFFF;EBDHCCFFG:E<D<?DFC>GGHD@BG.=@C;FGEE

The sequence identifier of the first record contains:

  • WGG97JN1 - the unique instrument name,
  • 192 - the run id,
  • C200YACXX - the flowcell id,
  • 7 - the flowcell lane,
  • 1101 - the tile number within the flowcell lane,
  • 1307 - the 'x'-coordinate of the cluster within the tile,
  • 1960 - the 'y'-coordinate of the cluster within the tile,
  • 1 - the member of a pair, 1 or 2 (paired-end or mate-pair reads only),
  • N - Y if the read fails filter (read is bad), N otherwise,
  • 0 - 0 when none of the control bits are on, otherwise it is an even number,
  • TTAGGC - the index sequence.
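
The identifier layout can be taken apart programmatically. The sketch below is illustrative only: the function name `parse_illumina_header` and the dictionary keys are our own choice, and it assumes the identifier follows exactly the two-part, colon-separated layout of the example above.

```python
from typing import Dict


def parse_illumina_header(header: str) -> Dict[str, str]:
    """Split an Illumina-style FASTQ identifier line into named parts."""
    if header.startswith("@"):
        header = header[1:]
    # The identifier has two space-separated parts, each colon-delimited.
    location, description = header.split(" ", 1)
    instrument, run_id, flowcell, lane, tile, x, y = location.split(":")
    pair_member, filtered, control_bits, index = description.split(":")
    return {
        "instrument": instrument,
        "run_id": run_id,
        "flowcell": flowcell,
        "lane": lane,
        "tile": tile,
        "x": x,
        "y": y,
        "pair_member": pair_member,
        "filtered": filtered,  # "Y" if the read failed the filter
        "control_bits": control_bits,
        "index": index,
    }


fields = parse_illumina_header("@WGG97JN1:192:C200YACXX:7:1101:1307:1960 1:N:0:TTAGGC")
print(fields["flowcell"], fields["lane"], fields["index"])
```

Identifiers written by other instruments or older pipeline versions may have a different number of parts, in which case the unpacking above raises a `ValueError` instead of guessing.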

Cock et al (2009) The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Research, doi:10.1093/nar/gkp1137

Phred quality scores

The quality score of a base, also known as a Phred or Q score, is an integer value representing the estimated probability of an error, i.e. that the base is incorrect. If P is the error probability, then:

P = 10^(-Q/10)

Q = -10 · log10(P)

Q scores are often represented as ASCII characters. The quality scores are logarithmically linked to error probabilities, as shown in the following table:

| Phred quality score (Q) | Probability that the base is called wrong (P) | Accuracy of the base call |
|---|---|---|
| 10 | 1 in 10 | 90% |
| 20 | 1 in 100 | 99% |
| 30 | 1 in 1,000 | 99.9% |
| 40 | 1 in 10,000 | 99.99% |
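
The two formulas, together with the ASCII encoding, can be tried out on the quality line of the first FASTQ record. This is a sketch under one important assumption: the characters are encoded with an offset of 33 (the Sanger convention, also used by the SAM QUAL field). Some older Illumina pipelines used an offset of 64 instead. The function names are ours.

```python
from typing import List


def error_probability(q: float) -> float:
    """Convert a Phred score Q to an error probability, P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def decode_qualities(quality: str, offset: int = 33) -> List[int]:
    """Convert ASCII quality characters to Phred scores (assumed offset: 33)."""
    return [ord(char) - offset for char in quality]


quality_line = "@@@FFFFFHHCFHHDHHIEGIIGIJGIHHGGHJIJIIIJIJJJGGGHFI7@"
scores = decode_qualities(quality_line)

# Average the error probabilities rather than the Q scores themselves,
# because Q is on a logarithmic scale.
mean_error = sum(error_probability(q) for q in scores) / len(scores)
print(min(scores), max(scores), mean_error)
```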


SAM/BAM Format

Aligning the sequence reads to a reference genome produces a file in the Sequence Alignment/Map (SAM) format. The SAM file is usually then converted into its binary equivalent, a Binary Alignment/Map (BAM) file, which can be sorted and indexed to allow efficient random access to the data it contains. An example SAM file is presented below:

@SQ SN:1 LN:185838109
@RG ID:reSeq LB:library SM:nn PL:illumina
HWI-H1:155:c04j0abxx:7:1101:6942:1982 113 1 146794624 1 33M = 146794624 0 TATGGGATCCCGCCTCAACATGACCTGATGAGC @?;?@?55'GGGDCGFB?GEEAAHHIHFHHHHH NM:i:1 AS:i:38 XS:i:34 RG:Z:reSeq
HWI-H1:155:c04j0abxx:7:1101:6942:1982 177 1 146794624 1 33M = 146794624 0 TATGGGATCCCGCCTCAACATGACCTGATGAGC @?;?@?55'GGGDCGFB?GEEAAHHIHFHHHHH NM:i:1 AS:i:38 XS:i:34 RG:Z:reSeq
HWI-H1:155:c04j0abxx:7:1101:6674:1967 177 1 101188968 60 27M = 100338662 0 CCATCTACTGGGGATTGGATCAATAAA 3<A<,DFF:<<2<AFBDDB4DDBD??: NM:i:1 AS:i:52 XS:i:21 RG:Z:reSeq

The SAM file contains:

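  • a header section, in which every line begins with the '@' character followed by a two-letter record type code,
  • an alignment section, in which every line has 11 mandatory tab-delimited fields, optionally followed by TAG fields (shown as column 12 below).

The following table gives an overview of the mandatory fields (plus the optional TAGs column) in the SAM format, using the three alignments from the example above: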

| Col | Field | Brief description | Alignment 1 | Alignment 2 | Alignment 3 |
|---|---|---|---|---|---|
| 1 | QNAME | Query template NAME | HWI-H1:155:c04j0abxx:7:1101:6942:1982 | HWI-H1:155:c04j0abxx:7:1101:6942:1982 | HWI-H1:155:c04j0abxx:7:1101:6674:1967 |
| 2 | FLAG | bitwise FLAG | 113 | 177 | 177 |
| 3 | RNAME | Reference sequence NAME | 1 | 1 | 1 |
| 4 | POS | 1-based leftmost mapping POSition | 146794624 | 146794624 | 101188968 |
| 5 | MAPQ | MAPping Quality | 1 | 1 | 60 |
| 6 | CIGAR | CIGAR string | 33M | 33M | 27M |
| 7 | RNEXT | Ref. name of the mate/next read | = | = | = |
| 8 | PNEXT | Position of the mate/next read | 146794624 | 146794624 | 100338662 |
| 9 | TLEN | observed Template LENgth | 0 | 0 | 0 |
| 10 | SEQ | segment SEQuence | TATGGGATCCCGCCTCAACATGACCTGATGAGC | TATGGGATCCCGCCTCAACATGACCTGATGAGC | CCATCTACTGGGGATTGGATCAATAAA |
| 11 | QUAL | ASCII of Phred-scaled base QUALity+33 | @?;?@?55'GGGDCGFB?GEEAAHHIHFHHHHH | @?;?@?55'GGGDCGFB?GEEAAHHIHFHHHHH | 3<A<,DFF:<<2<AFBDDB4DDBD??: |
| 12 | TAGs | additional optional information | NM:i:1 AS:i:38 XS:i:34 RG:Z:reSeq | NM:i:1 AS:i:38 XS:i:34 RG:Z:reSeq | NM:i:1 AS:i:52 XS:i:21 RG:Z:reSeq |
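
The column layout and the bitwise FLAG can be explored with a short script. This is a sketch, not a replacement for a real SAM library: the function names are ours, the flag descriptions are paraphrased from the SAM specification, and the example line is rebuilt with tab separators because SAM fields are tab-delimited.

```python
from typing import Any, Dict, List

SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]

# Meaning of the individual FLAG bits (paraphrased from the SAM specification).
FLAG_BITS = {
    0x1: "read paired",
    0x2: "read mapped in proper pair",
    0x4: "read unmapped",
    0x8: "mate unmapped",
    0x10: "read on reverse strand",
    0x20: "mate on reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "secondary alignment",
    0x200: "fails quality checks",
    0x400: "PCR or optical duplicate",
    0x800: "supplementary alignment",
}


def parse_sam_line(line: str) -> Dict[str, Any]:
    """Split one alignment line into the 11 mandatory fields plus tags."""
    columns = line.rstrip("\n").split("\t")
    record: Dict[str, Any] = dict(zip(SAM_FIELDS, columns[:11]))
    record["TAGS"] = columns[11:]  # optional fields, may be empty
    return record


def decode_flag(flag: int) -> List[str]:
    """List the descriptions of all bits that are set in FLAG."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


line = "\t".join([
    "HWI-H1:155:c04j0abxx:7:1101:6942:1982", "113", "1", "146794624", "1",
    "33M", "=", "146794624", "0", "TATGGGATCCCGCCTCAACATGACCTGATGAGC",
    "@?;?@?55'GGGDCGFB?GEEAAHHIHFHHHHH",
    "NM:i:1", "AS:i:38", "XS:i:34", "RG:Z:reSeq",
])
record = parse_sam_line(line)
print(record["QNAME"], decode_flag(int(record["FLAG"])))
```

Decoded this way, FLAG 113 (1 + 16 + 32 + 64) describes a paired read on the reverse strand whose mate is also on the reverse strand and which is the first read of the pair. FLAG 177 (1 + 16 + 32 + 128) differs only in marking the second read of the pair.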

 

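The so-called CIGAR string is a sequence of lengths, each followed by the operation it applies to. The operations indicate which bases align with the reference (M, covering both matches and mismatches), which reference bases are deleted from the read (D), and which read bases are insertions that are not present in the reference (I). A further operation, N, marks a region of the reference that is skipped by the read (for example an intron in RNA-Seq data). In our example all reads aligned over their full length without insertions or deletions (Alignment 1 - 33M; Alignment 2 - 33M and Alignment 3 - 27M). This does not mean that every base is identical to the reference: the NM:i:1 tag reports an edit distance of 1 for each of these alignments. To better understand the CIGAR string, an additional example is presented: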

RefPos:     1  2  3  4  5  6  7     8  9 10 11 12 13 14 15 16 17 18 19
Reference:  C  C  A  T  A  C  T     G  A  A  C  T  G  A  C  T  A  A  C
Read:                   A  C  T  A  G  A  A     T  G  G  C  T

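With the alignment above, you get POS: 5 and, reading from left to right, 3 aligned bases, 1 insertion, 3 aligned bases, 1 deletion and 5 aligned bases, so the CIGAR string looks like: 3M1I3M1D5M. Note that M counts mismatches as well as matches: within the last 5M block the read has a G where the reference has an A (reference position 14).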
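
A CIGAR string can be split into its operations with a regular expression. The sketch below uses hypothetical helper names and computes how many read bases and how many reference bases an alignment covers. It assumes the operation letters defined in the SAM specification (M, I, D, N, S, H, P, = and X).

```python
import re
from typing import List, Tuple

CIGAR_OPERATION = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(length), op) for length, op in CIGAR_OPERATION.findall(cigar)]


def cigar_lengths(cigar: str) -> Tuple[int, int]:
    """Return (read bases consumed, reference bases consumed)."""
    operations = parse_cigar(cigar)
    # M, I, S, = and X consume read bases; M, D, N, = and X consume reference bases.
    read_length = sum(n for n, op in operations if op in "MIS=X")
    reference_length = sum(n for n, op in operations if op in "MDN=X")
    return read_length, reference_length


pos = 5
read_length, reference_length = cigar_lengths("3M1I3M1D5M")
end = pos + reference_length - 1  # last reference position covered
print(read_length, reference_length, end)
```

For the example above this gives 12 read bases and 12 reference bases, so an alignment starting at POS 5 ends at reference position 16, in agreement with the picture.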

The full Sequence Alignment/Map format specification is available at http://samtools.github.io/hts-specs/SAMv1.pdf.


VCF Format

The Variant Call Format (VCF) is a standardized format for storing and reporting genomic sequence variations. VCF files are used to report sequence variations (e.g., SNPs, insertions and deletions (INDELs) and larger structural variants) together with rich annotations. VCF files are modular: the annotations and genotype information for a variant are kept separate from the call itself. The example below uses VCF version 4.1; newer versions of the specification have since been published.

##fileformat=VCFv4.1
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
20 1234567 microsat1 GTC G,GTCT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3

A VCF file contains meta-information lines (starting with the ## string and given as key=value pairs), a header line (starting with # and naming the 8 fixed mandatory columns CHROM, POS, ID, REF, ALT, QUAL, FILTER and INFO, followed, when genotype data are present, by a FORMAT column and one column per sample), and then data lines, each containing information about a position in the genome.
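
The structure of a data line can be made explicit with a small parser. This is a simplified sketch with a made-up function name: it assumes tab-separated columns, keeps all values as strings (except POS) and ignores the type information declared in the ##INFO and ##FORMAT lines.

```python
from typing import Any, Dict, List


def parse_vcf_line(line: str, samples: List[str]) -> Dict[str, Any]:
    """Parse one VCF data line into its fixed, INFO and genotype parts."""
    columns = line.rstrip("\n").split("\t")
    chrom, pos, var_id, ref, alt, qual, filters, info = columns[:8]

    # INFO is a semicolon-separated list of key=value pairs or bare flags.
    info_fields: Dict[str, Any] = {}
    for item in info.split(";"):
        key, separator, value = item.partition("=")
        info_fields[key] = value if separator else True

    # FORMAT names the colon-separated sub-fields of every sample column.
    genotypes: Dict[str, Dict[str, str]] = {}
    if len(columns) > 9:
        format_keys = columns[8].split(":")
        for name, sample in zip(samples, columns[9:]):
            genotypes[name] = dict(zip(format_keys, sample.split(":")))

    return {
        "CHROM": chrom, "POS": int(pos), "ID": var_id, "REF": ref,
        "ALT": alt.split(","), "QUAL": qual, "FILTER": filters,
        "INFO": info_fields, "GENOTYPES": genotypes,
    }


sample_names = ["NA00001", "NA00002", "NA00003"]
data_line = "\t".join([
    "20", "14370", "rs6054257", "G", "A", "29", "PASS",
    "NS=3;DP=14;AF=0.5;DB;H2", "GT:GQ:DP:HQ",
    "0|0:48:1:51,51", "1|0:48:8:51,51", "1/1:43:5:.,.",
])
variant = parse_vcf_line(data_line, sample_names)
print(variant["ID"], variant["INFO"]["AF"], variant["GENOTYPES"]["NA00003"]["GT"])
```

In the genotype (GT) values, 0 refers to the REF allele and 1, 2, ... to the ALT alleles in order; "|" separates phased and "/" unphased alleles.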

The full Variant Call Format (VCF) specification is available at http://samtools.github.io/hts-specs/VCFv4.1.pdf and at the 1000 Genomes Project page.

There are also other specifications of Variant Call Format (VCF) with certain modifications to support supplemental information specific to the project like TCGA Variant Call Format (VCF) version 1.2.  For example, TCGA VCF specification allows for additional fields to represent data associated with complex rearrangements, RNA-Seq variants, and sample-specific metadata. The full summary of additions/modifications within TCGA Variant Call Format is available at https://wiki.nci.nih.gov/display/TCGA/TCGA+Variant+Call+Format+(VCF)+1.2+Specification.


GTF/GFF Format

The Gene transfer format (GTF) is a file format used to hold information about gene structure. It is a tab-delimited text format based on the general feature format (GFF), but contains some additional conventions specific to gene information.

1 protein_coding CDS 15588 15702 . + 1 gene_id "ENSGALG00000009771"; transcript_id "ENSGALT00000015891"; exon_number "6"; gene_biotype "protein_coding"; protein_id "ENSGALP00000015874";
1 protein_coding stop_codon 15703 15705 . + 0 gene_id "ENSGALG00000009771"; transcript_id "ENSGALT00000015891"; exon_number "6"; gene_biotype "protein_coding";
1 protein_coding exon 35753 35950 . - . gene_id "ENSGALG00000027884"; transcript_id "ENSGALT00000042749"; exon_number "1"; gene_biotype "protein_coding"; exon_id "ENSGALE00000312791";
1 protein_coding CDS 35753 35950 . - 0 gene_id "ENSGALG00000027884"; transcript_id "ENSGALT00000042749"; exon_number "1"; gene_biotype "protein_coding"; protein_id "ENSGALP00000042558";

The GTF file contains: tab-separated fields, track lines and additional information. The most important part - the tab-separated fields - consists of:

  1. seqname - name of the chromosome or scaffold; chromosome names can be given with or without the 'chr' prefix.
  2. source - name of the program that generated this feature, or the data source (database or project name)
  3. feature - feature type name, e.g. Gene, Variation, Similarity
  4. start - Start position of the feature, with sequence numbering starting at 1.
  5. end - End position of the feature, with sequence numbering starting at 1.
  6. score - A floating point value.
  7. strand - defined as + (forward) or - (reverse).
  8. frame - One of '0', '1' or '2'. '0' indicates that the first base of the feature is the first base of a codon, '1' that the second base is the first base of a codon, and so on.
  9. attribute - A semicolon-separated list of tag-value pairs, providing additional information about each feature.
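
The nine fields listed above map directly onto a small parser. The sketch below is illustrative only: the function and constant names are ours, and it assumes that attribute values are double-quoted, as in the example lines above (unquoted values would be skipped by this regular expression).

```python
import re
from typing import Any, Dict

GTF_FIELDS = ["seqname", "source", "feature", "start", "end",
              "score", "strand", "frame"]

# Matches one attribute of the form: tag "value"
ATTRIBUTE = re.compile(r'(\S+) "([^"]*)"')


def parse_gtf_line(line: str) -> Dict[str, Any]:
    """Parse one GTF feature line into its fields and attribute dictionary."""
    columns = line.rstrip("\n").split("\t")
    record: Dict[str, Any] = dict(zip(GTF_FIELDS, columns[:8]))
    record["start"] = int(record["start"])  # 1-based, inclusive
    record["end"] = int(record["end"])      # 1-based, inclusive
    record["attributes"] = dict(ATTRIBUTE.findall(columns[8]))
    return record


gtf_line = "\t".join([
    "1", "protein_coding", "stop_codon", "15703", "15705", ".", "+", "0",
    'gene_id "ENSGALG00000009771"; transcript_id "ENSGALT00000015891"; '
    'exon_number "6"; gene_biotype "protein_coding";',
])
feature = parse_gtf_line(gtf_line)
length = feature["end"] - feature["start"] + 1  # both ends are inclusive
print(feature["feature"], length, feature["attributes"]["gene_id"])
```

Because both start and end are 1-based and inclusive, the stop codon from 15703 to 15705 has a length of 3 bases.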

The full GTF/GFF Format specification is available at http://www.ensembl.org/info/website/upload/gff.html and at http://mblab.wustl.edu/GTF22.html#fields.

 
