The most common formats for DNA sequencing data are FASTA and FASTQ. Both are text-based. FASTQ is an extension of FASTA that adds quality scores.

We will discuss only one topic here: FASTA/FASTQ compression. Currently there is no standardized compressor for FASTA/FASTQ files, and general-purpose archivers are often used as a substitute: typically gzip/pigz (the power of inertia), more rarely bzip2/pbzip2 (the best combination of speed and compression ratio for this type of data), and even more rarely xz (which compresses much more slowly than bzip2).
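The trade-off between these archivers is easy to measure on your own data. The following is a minimal sketch using Python's standard gzip, bz2, and lzma modules (lzma produces the xz format by default). The function name compare_compressors and the toy input are hypothetical, and the numbers it reports depend entirely on the data and the default compression levels, so it should be run on a real FASTQ sample before drawing conclusions.

```python
import bz2
import gzip
import lzma
import time


def compare_compressors(data):
    """Compress the same bytes with gzip, bzip2, and xz; report size and time."""
    codecs = {
        "gzip": gzip.compress,
        "bzip2": bz2.compress,
        "xz": lzma.compress,  # default container format is .xz
    }
    results = {}
    for name, compress in codecs.items():
        start = time.perf_counter()
        packed = compress(data)
        elapsed = time.perf_counter() - start
        results[name] = {
            "bytes": len(packed),
            "ratio": len(data) / len(packed),
            "seconds": elapsed,
        }
    return results


# Hypothetical toy input: one FASTQ-like record repeated many times.
# Real reads are far less repetitive, so real ratios will be much lower.
record = b"@read1\nGATTACAGATTACA\n+\nIIIIHHHHGGGGFF\n"
sample = record * 1000

for name, stats in compare_compressors(sample).items():
    print("%-6s %8d bytes  ratio %.1f  %.4f s"
          % (name, stats["bytes"], stats["ratio"], stats["seconds"]))
```

To test a real file, read it in binary mode (for example, open(path, "rb").read()) and pass the bytes to compare_compressors.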

The FASTQ format extends FASTA by including a numeric quality score for each base in the sequence. The FASTQ format is widely used to store high-throughput sequencing data, which is reported with a per-base quality score indicating the confidence of each base call. Unfortunately, like FASTA, FASTQ has variants and pitfalls that can make the seemingly simple format frustrating to work with.

As with FASTA, it’s a common convention in FASTQ files to split description lines by the first space into two parts: the record identifier and comment.
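As a small illustration of this convention, the sketch below splits a description line at the first space. The helper name split_description and the second example header are hypothetical; the first header is the one from the FASTQ entry shown in this section.

```python
def split_description(line):
    """Split a FASTA/FASTQ description line into (identifier, comment).

    The leading '@' or '>' marker is dropped, and everything after the
    first space is treated as the comment (empty if there is no space).
    """
    text = line.rstrip("\r\n")
    if text[:1] in ("@", ">"):
        text = text[1:]
    identifier, _, comment = text.partition(" ")
    return identifier, comment


print(split_description("@AZ1:233:B390NACCC:2:1203:7689:2153"))
# ('AZ1:233:B390NACCC:2:1203:7689:2153', '')
print(split_description("@read1 sample=A lane=2"))
# ('read1', 'sample=A lane=2')
```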

FASTQ is deceptively tricky to parse correctly. A common pitfall is to treat every line that begins with @ as a description line. However, @ is also a valid quality character. FASTQ sequence and quality lines can wrap around to the next line, so it’s possible that a line beginning with @ could be a quality line and not a header line. Consequently, writing a parser that always treats lines beginning with @ as header lines can lead to fragile and incorrect parsing. However, we can use the fact that the number of quality score characters must be equal to the number of sequence characters to reliably parse this format, which is how the readfq parser introduced later on works.
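The length-matching idea can be sketched in a few lines of Python. This is not the readfq implementation, only a minimal illustration of the same principle under simple assumptions: the input is an iterable of text lines, and the function name read_fastq and the demo records are hypothetical. Quality lines are consumed until their combined length equals the sequence length, so a wrapped quality line that starts with @ is never mistaken for a header.

```python
def read_fastq(lines):
    """Yield (header, sequence, quality) tuples from an iterable of lines."""
    lines = (line.rstrip("\r\n") for line in lines)
    for line in lines:
        if not line:
            continue  # tolerate blank lines between records
        if not line.startswith("@"):
            raise ValueError("expected '@' header, got %r" % line)
        header = line[1:]

        # Sequence lines run until the '+' separator line.
        seq_parts = []
        for line in lines:
            if line.startswith("+"):
                break
            seq_parts.append(line)
        else:
            raise ValueError("record %r has no '+' separator" % header)
        seq = "".join(seq_parts)

        # Read quality lines until they cover the whole sequence.
        qual_parts, qual_len = [], 0
        while qual_len < len(seq):
            line = next(lines, None)
            if line is None:
                raise ValueError("record %r is truncated" % header)
            qual_parts.append(line)
            qual_len += len(line)
        if qual_len != len(seq):
            raise ValueError("record %r: quality and sequence lengths differ"
                             % header)
        yield header, seq, "".join(qual_parts)


# Hypothetical wrapped record whose second quality line begins with '@'.
demo = ["@r1 demo", "ACGT", "ACGT", "+", "IIII", "@III",
        "@r2", "GG", "+r2", "@@"]
for rec in read_fastq(demo):
    print(rec)
# ('r1 demo', 'ACGTACGT', 'IIII@III')
# ('r2', 'GG', '@@')
```

A parser that split records on every line starting with @ would have cut the first record in half at "@III".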

Base Qualities

Each sequence base of a FASTQ entry has a corresponding numeric quality score in the quality line(s). Each base quality score is encoded as a single ASCII character. Quality lines look like a string of random characters, like the fourth line here:

@AZ1:233:B390NACCC:2:1203:7689:2153
GTTGTTCTTGATGAGCCATGAGGAAGGCATGCCAAATTAAAATACTGGTGCGAATTTAAT
+
CCFFFFHHHHHJJJJJEIFJIJIJJJIJIJJJJCDGHIIIGIGIJIJIIIIJIJJIJIIH

(This FASTQ entry is in this chapter’s README file if you want to follow along.)

Remember, ASCII characters are just represented as integers between 0 and 127 under the hood (see man ascii for more details). Because not all ASCII characters are printable to screen (e.g., echoing the character "\07" makes a “ding” noise), qualities are restricted to the printable ASCII characters, ranging from 33 to 126 (the space character, 32, is omitted).

All programming languages have functions to convert from a character to its decimal ASCII representation, and from ASCII decimal to character. In Python, these are the functions ord() and chr(), respectively. Let’s use ord() in Python’s interactive interpreter to convert the quality characters to a list of ASCII decimal representations:

>>> qual = "JJJJJJJJJJJJGJJJJJIIJJJJJIGJJJJJIJJJJJJJIJIJJJJHHHHHFFFDFCCC"
>>> [ord(b) for b in qual]
[74, 74, 74, 74, 74, 74, 74, 74, 74, 74, 74, 74, 71, 74, 74, 74, 74, 74, 73,
 73, 74, 74, 74, 74, 74, 73, 71, 74, 74, 74, 74, 74, 73, 74, 74, 74, 74, 74,
 74, 74, 73, 74, 73, 74, 74, 74, 74, 72, 72, 72, 72, 72, 70, 70, 70, 68, 70,
 67, 67, 67]

Unfortunately, converting these ASCII values to meaningful quality scores can be tricky because there are three different quality schemes: Sanger, Solexa, and Illumina (see Table 10-2). The Open Bioinformatics Foundation (OBF), which is responsible for projects like Biopython, BioPerl, and BioRuby, gives these the names fastq-sanger, fastq-solexa, and fastq-illumina. Fortunately, the bioinformatics field seems to have finally settled on the Sanger encoding (which is the format that the quality line shown here is in), so we’ll step through the conversion process using this scheme.

Table 10-2. FASTQ quality schemes (adapted from Cock et al., 2010 with permission)
Name	ASCII character range	Offset	Quality score type	Quality score range
Sanger, Illumina (versions 1.8 onward)	33–126	33	PHRED	0–93
Solexa, early Illumina (before 1.3)	59–126	64	Solexa	-5 to 62
Illumina (versions 1.3–1.7)	64–126	64	PHRED	0–62
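The ASCII ranges in the table also show why the encoding of a file often cannot be determined from the data alone. The sketch below lists which schemes are compatible with the characters seen in a quality string. The function name compatible_schemes is hypothetical, and the check is only a range test based on the table: a character below 59 rules out the offset-64 schemes, but characters of 64 and above are valid under all three.

```python
# (first ASCII code, last ASCII code, offset), taken from the table above.
SCHEMES = {
    "fastq-sanger": (33, 126, 33),
    "fastq-solexa": (59, 126, 64),
    "fastq-illumina": (64, 126, 64),
}


def compatible_schemes(quality):
    """Return the sorted names of schemes whose range covers every character."""
    codes = [ord(c) for c in quality]
    lowest, highest = min(codes), max(codes)
    return sorted(
        name
        for name, (first, last, _offset) in SCHEMES.items()
        if first <= lowest and highest <= last
    )


print(compatible_schemes("#5?A"))    # '#' is 35, below 59
# ['fastq-sanger']
print(compatible_schemes(";<="))     # 59 to 61
# ['fastq-sanger', 'fastq-solexa']
print(compatible_schemes("CCFFJJ"))  # 67 to 74: ambiguous
# ['fastq-illumina', 'fastq-sanger', 'fastq-solexa']
```

In practice, scanning many records makes a low character more likely to appear, but a file of uniformly high-quality Sanger reads remains ambiguous under this test.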

First, we need to subtract an offset to convert this Sanger quality score to a PHRED quality score. PHRED was an early base caller for fluorescence trace data written by Phil Green. Looking at Table 10-2, notice that the Sanger format’s offset is 33, so we subtract 33 from each ASCII value:

>>> phred = [ord(b)-33 for b in qual]
>>> phred
[41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 38, 41, 41, 41, 41, 41, 40,
 40, 41, 41, 41, 41, 41, 40, 38, 41, 41, 41, 41, 41, 40, 41, 41, 41, 41, 41,
 41, 41, 40, 41, 40, 41, 41, 41, 41, 39, 39, 39, 39, 39, 37, 37, 37, 35, 37,
 34, 34, 34]

Now, with our Sanger quality scores converted to PHRED quality scores, we can apply the following formula to convert quality scores to the estimated probability that the base call is incorrect:

P = 10^(-Q/10)

To go from probabilities to qualities, we use the inverse of this function:

Q = -10 log₁₀P

In our case, we want the former equation. Applying this to our PHRED quality scores:

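>>> # 10.0 forces float division; under Python 2, -q/10 would floor to an integer
>>> [round(10**(-q/10.0), 7) for q in phred]
[7.94e-05, 7.94e-05, 7.94e-05, 7.94e-05, 7.94e-05, 7.94e-05,
 7.94e-05, 7.94e-05, 7.94e-05, 7.94e-05, 7.94e-05, 7.94e-05,
 0.0001585, 7.94e-05, 7.94e-05, 7.94e-05, 7.94e-05, 7.94e-05,
 0.0001, 0.0001, 7.94e-05, 7.94e-05, 7.94e-05, 7.94e-05,
 7.94e-05, 0.0001, 0.0001585, 7.94e-05, 7.94e-05, 7.94e-05,
 7.94e-05, 7.94e-05, 0.0001, 7.94e-05, 7.94e-05, 7.94e-05,
 7.94e-05, 7.94e-05, 7.94e-05, 7.94e-05, 0.0001, 7.94e-05,
 0.0001, 7.94e-05, 7.94e-05, 7.94e-05, 7.94e-05, 0.0001259,
 0.0001259, 0.0001259, 0.0001259, 0.0001259, 0.0001995, 0.0001995,
 0.0001995, 0.0003162, 0.0001995, 0.0003981, 0.0003981, 0.0003981]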
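Going in the other direction, from probabilities back to quality characters, uses the inverse formula and chr(). The following is a minimal sketch, assuming the Sanger offset of 33 and rounding to the nearest integer score; the function names prob_to_phred and encode_sanger are hypothetical, and scores are clamped to the 0 to 93 range that the Sanger scheme can represent.

```python
import math


def prob_to_phred(p):
    """Q = -10 * log10(P), rounded to the nearest integer.

    p is the estimated error probability and must satisfy 0 < p <= 1.
    """
    return int(round(-10 * math.log10(p)))


def encode_sanger(probs):
    """Encode error probabilities as a Sanger (offset 33) quality string."""
    chars = []
    for p in probs:
        q = max(0, min(93, prob_to_phred(p)))  # clamp to the Sanger range
        chars.append(chr(q + 33))
    return "".join(chars)


print(encode_sanger([0.0001, 0.001, 0.01]))
# I?5
print([ord(c) - 33 for c in "I?5"])
# [40, 30, 20]
```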

Converting Illumina (versions 1.3 to 1.7) quality data is an identical process, except we use the offset 64 (see Table 10-2). The Solexa conversion is a bit trickier because this scheme doesn’t use the PHRED function that maps quality scores to probabilities. Instead, it uses P = (10^(Q/10) + 1)^-1. See Cock et al., 2010 for more details about this format.
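The difference between the two mappings is easiest to see side by side. The sketch below is illustrative only: the function names decode, phred_error_prob, and solexa_error_prob are hypothetical, and the Solexa function implements P = 1 / (10^(Q/10) + 1). The two mappings diverge at low scores and agree closely at high ones.

```python
def decode(quality, offset):
    """Convert quality characters to integer scores using the given offset."""
    return [ord(c) - offset for c in quality]


def phred_error_prob(q):
    """PHRED mapping: P = 10^(-Q/10)."""
    return 10 ** (-q / 10.0)


def solexa_error_prob(q):
    """Solexa mapping: P = 1 / (10^(Q/10) + 1)."""
    return 1.0 / (10 ** (q / 10.0) + 1.0)


# 'h' is ASCII 104 and ';' is ASCII 59, so with offset 64:
print(decode("h;", 64))
# [40, -5]

for q in (-5, 0, 10, 40):
    if q >= 0:
        phred = "%.6f" % phred_error_prob(q)
    else:
        phred = "n/a"  # PHRED scores are not negative
    print("Q=%3d  solexa=%.6f  phred=%s" % (q, solexa_error_prob(q), phred))
```

At Q = 0 the Solexa mapping gives an error probability of 0.5 where PHRED gives 1.0, which is why Solexa scores can go below zero.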
