Base Qualities
Each sequence base of a FASTQ entry has a corresponding numeric
quality score in the quality line(s). Each base quality score is encoded as a single ASCII character.
Quality lines look like a string of random characters, like the fourth line here:
@AZ1:233:B390NACCC:2:1203:7689:2153
GTTGTTCTTGATGAGCCATGAGGAAGGCATGCCAAATTAAAATACTGGTGCGAATTTAAT
+
CCFFFFHHHHHJJJJJEIFJIJIJJJIJIJJJJCDGHIIIGIGIJIJIIIIJIJJIJIIH
(This FASTQ entry is in this chapter’s README file if you want to follow along.)
Remember, ASCII characters are just represented as integers between 0 and 127 under the hood (see man ascii
for more details). Because not all ASCII characters are printable to screen (e.g., echoing the character "\07"
makes a “ding” noise), qualities are restricted to the printable ASCII characters, ranging from 33 to
126 (the space character, 32, is omitted).
Most programming languages have functions to convert from a character to its decimal ASCII representation,
and from ASCII decimal to character. In Python, these are the functions ord() and chr(), respectively.
Let’s use ord() in Python’s interactive interpreter to convert the quality characters to
a list of ASCII decimal representations:
>>> qual = "JJJJJJJJJJJJGJJJJJIIJJJJJIGJJJJJIJJJJJJJIJIJJJJHHHHHFFFDFCCC"
>>> [ord(b) for b in qual]
[74, 74, 74, 74, 74, 74, 74, 74, 74, 74, 74, 74, 71, 74, 74, 74, 74, 74, 73,
73, 74, 74, 74, 74, 74, 73, 71, 74, 74, 74, 74, 74, 73, 74, 74, 74, 74, 74,
74, 74, 73, 74, 73, 74, 74, 74, 74, 72, 72, 72, 72, 72, 70, 70, 70, 68, 70,
67, 67, 67]
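The reverse direction uses chr(). As a minimal sketch (the variable names ascii_values and round_trip are illustrative and not from the chapter's README), converting to ASCII values and back should reproduce the original quality string:

```python
# Quality string from the interpreter session above
qual = "JJJJJJJJJJJJGJJJJJIIJJJJJIGJJJJJIJJJJJJJIJIJJJJHHHHHFFFDFCCC"

# Character -> ASCII decimal value
ascii_values = [ord(b) for b in qual]

# ASCII decimal value -> character, joined back into one string
round_trip = "".join(chr(v) for v in ascii_values)

print(round_trip == qual)  # True
```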
Unfortunately, converting these ASCII values to meaningful quality scores can be tricky because there
are three different quality schemes: Sanger, Solexa, and Illumina (see Table 10-2).
The Open Bioinformatics Foundation (OBF), which is responsible for projects like Biopython, BioPerl, and
BioRuby, gives these the names fastq-sanger, fastq-solexa, and fastq-illumina.
Fortunately, the bioinformatics field seems to have finally settled on the Sanger encoding (which is the
format of the quality line shown here), so we’ll step through the conversion process using this
scheme.
Table 10-2. FASTQ quality schemes (adapted from Cock et al., 2010 with permission)
Name | ASCII character range | Offset | Quality score type | Quality score range
Sanger, Illumina (versions 1.8 onward) | 33–126 | 33 | PHRED | 0–93
Solexa, early Illumina (before 1.3) | 59–126 | 64 | Solexa | −5–62
Illumina (versions 1.3–1.7) | 64–126 | 64 | PHRED | 0–62
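Because the ASCII ranges in Table 10-2 overlap, the characters in a quality string can rule schemes out but cannot always identify a single one. The following is a rough sketch of that check; the function name compatible_encodings and the ENCODING_RANGES dictionary are hypothetical helpers written for illustration, with ranges copied from the table and keys taken from the OBF names:

```python
# ASCII character ranges from Table 10-2, keyed by the OBF format names.
# fastq-illumina here means Illumina versions 1.3-1.7.
ENCODING_RANGES = {
    "fastq-sanger": (33, 126),
    "fastq-solexa": (59, 126),
    "fastq-illumina": (64, 126),
}

def compatible_encodings(qual):
    """Return the schemes whose ASCII range covers every character in qual."""
    lo = min(ord(c) for c in qual)
    hi = max(ord(c) for c in qual)
    return sorted(name for name, (start, end) in ENCODING_RANGES.items()
                  if start <= lo and hi <= end)

# Characters 67-74 fall inside all three ranges, so nothing is ruled out.
print(compatible_encodings("CCFFFFHHHHHJJJJJ"))
# '#' is ASCII 35, which is below 59, so only the Sanger range fits.
print(compatible_encodings("##@@CCFF"))
```

In practice this kind of check is only a heuristic: a string of uniformly high-quality bases is compatible with every scheme, so many reads usually need to be inspected before low-valued characters show up.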
First, we need to subtract an offset to convert this Sanger quality score to a PHRED quality
score. PHRED was an early base caller written by Phil Green, used for fluorescence trace data. Looking
at Table 10-2, notice that the Sanger format’s offset is 33, so we subtract 33 from each ASCII value:
>>> phred = [ord(b)-33 for b in qual]
>>> phred
[41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 38, 41, 41, 41, 41, 41, 40,
40, 41, 41, 41, 41, 41, 40, 38, 41, 41, 41, 41, 41, 40, 41, 41, 41, 41, 41,
41, 41, 40, 41, 40, 41, 41, 41, 41, 39, 39, 39, 39, 39, 37, 37, 37, 35, 37,
34, 34, 34]
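The subtraction step can be wrapped in a small function so the offset becomes a parameter. This is a sketch only; decode_qualities is a hypothetical name, and the check against the printable range 33 to 126 is an added assumption about how invalid input should be handled:

```python
def decode_qualities(qual, offset=33):
    """Subtract offset from each character's ASCII value.

    offset=33 matches the Sanger row of Table 10-2;
    offset=64 matches the Illumina 1.3-1.7 row.
    """
    for c in qual:
        # Quality characters must be printable ASCII (33-126).
        if not 33 <= ord(c) <= 126:
            raise ValueError("not a valid quality character: %r" % c)
    return [ord(c) - offset for c in qual]

print(decode_qualities("JIHGFDC"))           # [41, 40, 39, 38, 37, 35, 34]
print(decode_qualities("hgfed", offset=64))  # [40, 39, 38, 37, 36]
```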
Now, with our Sanger quality scores converted to PHRED quality scores, we can apply the following formula
to convert quality scores to the estimated probability that the base is incorrect:
$P = 10^{-Q/10}$
To go from probabilities to qualities, we use the inverse of this function:
$Q = -10 \log_{10} P$
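In our case, we want the former equation. Applying this to our PHRED quality scores (we divide by 10.0 so
the division is floating-point in both Python 2 and Python 3, and round to seven decimal places for display):
>>> [round(10**(-q/10.0), 7) for q in phred]
[7.94e-05, 7.94e-05, 7.94e-05, 7.94e-05, 7.94e-05, 7.94e-05,
7.94e-05, 7.94e-05, 7.94e-05, 7.94e-05, 7.94e-05, 7.94e-05,
0.0001585, 7.94e-05, 7.94e-05, 7.94e-05, 7.94e-05, 7.94e-05,
0.0001, 0.0001, 7.94e-05, 7.94e-05, 7.94e-05, 7.94e-05,
7.94e-05, 0.0001, 0.0001585, 7.94e-05, 7.94e-05, 7.94e-05,
7.94e-05, 7.94e-05, 0.0001, 7.94e-05, 7.94e-05, 7.94e-05,
7.94e-05, 7.94e-05, 7.94e-05, 7.94e-05, 0.0001, 7.94e-05,
0.0001, 7.94e-05, 7.94e-05, 7.94e-05, 7.94e-05, 0.0001259,
0.0001259, 0.0001259, 0.0001259, 0.0001259, 0.0001995, 0.0001995,
0.0001995, 0.0003162, 0.0001995, 0.0003981, 0.0003981, 0.0003981]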
Converting Illumina (versions 1.3 to 1.7) quality data is an identical process, except we use
the offset 64 (see Table 10-2). The Solexa conversion is a bit trickier because this scheme doesn’t use the
PHRED function that maps quality scores to probabilities. Instead, it uses $P = (10^{Q/10} + 1)^{-1}$.
See Cock et al., 2010 for more details about this format.
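To illustrate the Solexa mapping, here is a brief sketch. The function names are hypothetical; the inverse, Q = -10 log10(P / (1 - P)), is the Solexa score definition given in Cock et al., 2010, and the offset of 64 comes from Table 10-2:

```python
import math

def solexa_to_prob(q):
    """Error probability for a Solexa score: P = 1 / (10^(Q/10) + 1)."""
    return 1.0 / (10 ** (q / 10.0) + 1)

def prob_to_solexa(p):
    """Solexa score for an error probability: Q = -10 log10(P / (1 - P))."""
    return -10 * math.log10(p / (1 - p))

def decode_solexa(qual):
    """Convert a Solexa-encoded quality string (offset 64) to error probabilities."""
    return [solexa_to_prob(ord(c) - 64) for c in qual]

# ';' is ASCII 59 (score -5), '@' is 64 (score 0), 'h' is 104 (score 40)
print(decode_solexa(";@h"))
```

A Solexa score of 0 corresponds to an error probability of 0.5, and negative scores to probabilities above 0.5. For high scores the Solexa and PHRED mappings give nearly the same probability; the two diverge mainly for low-quality bases.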