FASTQ format (skbio.io.format.fastq)#
The FASTQ file format (fastq) stores biological (e.g., nucleotide)
sequences and their quality scores in a simple plain text format that is both
human-readable and easy to parse. The file format was invented by Jim Mullikin
at the Wellcome Trust Sanger Institute but wasn’t given a formal definition,
though it has informally become a standard file format for storing
high-throughput sequence data. More information about the format and its
variants can be found in [1] and [2].
Conceptually, a FASTQ file is similar to paired FASTA and QUAL files in that it stores both biological sequences and their quality scores. FASTQ differs from FASTA/QUAL because the quality scores are stored in the same file as the biological sequence data.
An example FASTQ-formatted file containing two DNA sequences and their quality scores:
@seq1 description 1
AACACCAAACTTCTCCACCACGTGAGCTACAAAAG
+
````Y^T]`]c^cabcacc`^Lb^ccYT\T\Y\WF
@seq2 description 2
TATGTATATATAACATATACATATATACATACATA
+
]KZ[PY]_[YY^```ac^\\`bT``c`\aT``bbb
Format Support#
Has Sniffer: Yes
Reader | Writer | Object Class |
---|---|---|
Yes | Yes | generator of |
Yes | Yes | |
Yes | Yes | |
Yes | Yes | |
Yes | Yes | |
Yes | Yes | |
Format Specification#
A FASTQ file contains one or more biological sequences and their corresponding quality scores stored sequentially as records. Each record consists of four sections:
Sequence header line consisting of a sequence identifier (ID) and description (both optional)
Biological sequence data (typically stored using the standard IUPAC lexicon), optionally split over multiple lines
Quality header line separating sequence data from quality scores (optionally repeating the ID and description from the sequence header line)
Quality scores as printable ASCII characters, optionally split over multiple lines. Decoding of quality scores will depend on the specified FASTQ variant (see below for more details)
For the complete FASTQ format specification, see [1]. scikit-bio’s FASTQ implementation follows the format specification described in this excellent publication, including validating the implementation against the FASTQ example files provided in the publication’s supplementary data.
Note
IDs and descriptions will be parsed from sequence header lines in
exactly the same way as FASTA headers (skbio.io.format.fasta). IDs,
descriptions, and quality scores are also stored on, and written from,
sequence objects in the same way as with FASTA.
Note
Blank or whitespace-only lines are only allowed at the beginning of the file, between FASTQ records, or at the end of the file. A blank or whitespace-only line after the header line, within the sequence, or within quality scores will raise an error.
scikit-bio will ignore leading and trailing whitespace characters on each line while reading.
Note
Validation may be performed depending on the type of object the data is being read into. This behavior matches that of FASTA files.
Note
scikit-bio will write FASTQ files in a normalized format, with each record section on a single line. Thus, each record will be composed of exactly four lines. The quality header line won’t have the sequence ID and description repeated.
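As a rough illustration of the normalized layout described above, the following sketch builds one four-line record by hand. The function name `format_fastq_record` and its parameters are hypothetical and are not part of scikit-bio; the sketch assumes Phred scores that already fit the chosen offset.

```python
def format_fastq_record(seq_id, description, sequence, scores, phred_offset=33):
    """Return one normalized FASTQ record (exactly four lines) as a string.

    Hypothetical helper for illustration only; not scikit-bio's writer.
    """
    if len(sequence) != len(scores):
        raise ValueError("sequence and quality scores must have equal length")
    # Sequence header line: '@' followed by the ID and description, if any.
    header = "@" + " ".join(part for part in (seq_id, description) if part)
    # Each Phred score becomes a single printable ASCII character.
    quality = "".join(chr(score + phred_offset) for score in scores)
    # The quality header line is a bare '+': ID and description are not repeated.
    return "\n".join([header, sequence, "+", quality]) + "\n"


print(format_fastq_record("seq1", "description 1", "ACGT", [0, 10, 20, 30]), end="")
```

Keeping every section on a single line makes the output trivially parseable in groups of four lines, at the cost of long lines for long reads.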
Note
The lowercase parameter is supported in the same way as with FASTA.
Quality Score Variants#
FASTQ associates quality scores with sequence data, with each quality score encoded as a single printable ASCII character. In scikit-bio, all quality scores are decoded as Phred quality scores. This is the most common quality score metric, though there are others (e.g., Solexa quality scores). Unfortunately, different sequencers have different ways of encoding quality scores as ASCII characters, notably Sanger and Illumina. Below is a table highlighting the different encoding variants supported by scikit-bio, as well as listing the equivalent variant names used in the Open Bioinformatics Foundation (OBF) [3] projects (e.g., Biopython, BioPerl, etc.).
Variant | ASCII Range | Offset | Quality Range | Notes |
---|---|---|---|---|
sanger | 33 to 126 | 33 | 0 to 93 | Equivalent to OBF’s fastq-sanger. |
illumina1.3 | 64 to 126 | 64 | 0 to 62 | Equivalent to OBF’s fastq-illumina. Use this if your data was generated using Illumina 1.3-1.7 software. |
illumina1.8 | 33 to 95 | 33 | 0 to 62 | Equivalent to sanger but with 0 to 62 quality score range check. Use this if your data was generated using Illumina 1.8 software or later. |
solexa | 59 to 126 | 64 | -5 to 62 | Not currently implemented. |
Note
When writing, Phred quality scores will be truncated to the maximum value in the variant’s range and a warning will be issued. This is consistent with the OBF projects.
When reading, an error will be raised if a decoded quality score is outside the variant’s range.
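The decoding and encoding rules can be sketched in plain Python. This is a minimal illustration that uses the offsets and quality ranges from the variant table; the names `VARIANTS`, `decode_quality`, and `encode_quality` are hypothetical and do not reflect scikit-bio's internal implementation.

```python
import warnings

# (ASCII offset, minimum quality, maximum quality), copied from the variant table.
VARIANTS = {
    "sanger": (33, 0, 93),
    "illumina1.3": (64, 0, 62),
    "illumina1.8": (33, 0, 62),
}


def decode_quality(chars, variant):
    """Decode ASCII quality characters into Phred scores, checking the range."""
    offset, low, high = VARIANTS[variant]
    scores = [ord(char) - offset for char in chars]
    for score in scores:
        if not low <= score <= high:
            raise ValueError(
                f"decoded score {score} is outside the {variant} range "
                f"[{low}, {high}]"
            )
    return scores


def encode_quality(scores, variant):
    """Encode Phred scores as ASCII, truncating to the variant's maximum."""
    offset, _, high = VARIANTS[variant]
    if any(score > high for score in scores):
        warnings.warn(f"scores above {high} were truncated for {variant}")
    return "".join(chr(min(score, high) + offset) for score in scores)


print(decode_quality("!I~", "sanger"))            # [0, 40, 93]
print(encode_quality([0, 40, 93], "illumina1.8"))  # !I_
```

Note the asymmetry: an out-of-range character is an error when reading, whereas an oversized score is clamped with a warning when writing.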
Format Parameters#
The following parameters are available to all FASTQ format readers and writers:
variant: A string indicating the quality score variant used to decode/encode Phred quality scores. Must be one of sanger, illumina1.3, illumina1.8, or solexa. This parameter is preferred over phred_offset because additional quality score range checks and conversions can be performed. It is also more explicit.
phred_offset: An integer indicating the ASCII code offset used to decode/encode Phred quality scores. Must be in the range [33, 126]. All decoded scores will be assumed to be Phred scores (i.e., no additional conversions are performed). Prefer using variant over this parameter whenever possible.
Note
You must provide variant or phred_offset when reading or writing a FASTQ file. variant and phred_offset cannot both be provided at the same time.
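The "exactly one of variant or phred_offset" rule could be checked along these lines. This is a sketch under stated assumptions: `resolve_phred_offset` is a hypothetical helper, not a scikit-bio function, and the offsets are those listed in the variant table.

```python
# Offsets for the implemented variants, as listed in the variant table.
VARIANT_OFFSETS = {"sanger": 33, "illumina1.3": 64, "illumina1.8": 33}


def resolve_phred_offset(variant=None, phred_offset=None):
    """Return the ASCII offset implied by exactly one of the two parameters."""
    # Reject both "neither provided" and "both provided".
    if (variant is None) == (phred_offset is None):
        raise ValueError("provide exactly one of variant or phred_offset")
    if variant is not None:
        if variant == "solexa":
            # The table marks solexa as not currently implemented.
            raise NotImplementedError("solexa is not currently implemented")
        if variant not in VARIANT_OFFSETS:
            raise ValueError(f"unrecognized variant: {variant!r}")
        return VARIANT_OFFSETS[variant]
    if not 33 <= phred_offset <= 126:
        raise ValueError("phred_offset must be in the range [33, 126]")
    return phred_offset
```

A bare offset carries no quality range, which is why variant is the preferred parameter: only a named variant allows the range checks described earlier.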
The following additional parameters are the same as in FASTA format
(skbio.io.format.fasta):
constructor: see constructor parameter in FASTA format
seq_num: see seq_num parameter in FASTA format
id_whitespace_replacement: see id_whitespace_replacement parameter in FASTA format
description_newline_replacement: see description_newline_replacement parameter in FASTA format
lowercase: see lowercase parameter in FASTA format
Examples#
Suppose we have the following FASTQ file with two DNA sequences:
@seq1 description 1
AACACCAAACTTCTCCACC
ACGTGAGCTACAAAAG
+seq1 description 1
''''Y^T]']C^CABCACC
'^LB^CCYT\T\Y\WF
@seq2 description 2
TATGTATATATAACATATACATATATACATACATA
+
]KZ[PY]_[YY^'''AC^\\'BT''C'\AT''BBB
Note that the first sequence and its quality scores are split across multiple lines, while the second sequence and its quality scores are each on a single line. Also note that the first sequence has a duplicate ID and description on the quality header line, while the second sequence does not.
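Because quality lines may themselves begin with `@` or `+`, a reader of multi-line records cannot rely on those characters alone to find record boundaries. One common approach, sketched below, is to keep reading quality lines until their combined length matches the sequence length. The function `read_fastq_records` is a hypothetical illustration and is not scikit-bio's reader.

```python
def read_fastq_records(lines):
    """Yield (header, sequence, quality) tuples from an iterable of lines.

    Hypothetical sketch: handles multi-line sequence and quality sections.
    """
    stripped = (line.strip() for line in lines)
    for line in stripped:
        if not line:
            continue  # blank lines are allowed between records
        if not line.startswith("@"):
            raise ValueError(f"expected a sequence header, got {line!r}")
        header = line[1:]

        # Collect sequence lines until the quality header line ('+...').
        chunks = []
        for line in stripped:
            if line.startswith("+"):
                break
            if not line:
                raise ValueError("blank line after header or within sequence")
            chunks.append(line)
        else:
            raise ValueError("missing quality header line")
        sequence = "".join(chunks)

        # Collect quality lines until as many characters as bases were read.
        quality = ""
        while len(quality) < len(sequence):
            line = next(stripped, None)
            if not line:
                raise ValueError("blank line or end of file within quality scores")
            quality += line
        if len(quality) != len(sequence):
            raise ValueError("quality and sequence lengths differ")
        yield header, sequence, quality


example = ["@r1 first", "ACG", "TA", "+r1 first", "@+I", "II", "", "@r2", "GG", "+", "!!"]
for record in read_fastq_records(example):
    print(record)
```

In this small example the first quality line starts with `@`, yet it is read correctly as quality data because the reader is still counting characters for the first record.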
Let’s define this file in-memory as a StringIO, though this could be a real
file path, file handle, or anything that’s supported by scikit-bio’s I/O
registry in practice:
>>> from io import StringIO
>>> fs = '\n'.join([
... r"@seq1 description 1",
... r"AACACCAAACTTCTCCACC",
... r"ACGTGAGCTACAAAAG",
... r"+seq1 description 1",
... r"''''Y^T]']C^CABCACC",
... r"'^LB^CCYT\T\Y\WF",
... r"@seq2 description 2",
... r"TATGTATATATAACATATACATATATACATACATA",
... r"+",
... r"]KZ[PY]_[YY^'''AC^\\'BT''C'\AT''BBB"])
>>> fh = StringIO(fs)
To load the sequences into a TabularMSA, we run:
>>> from skbio import TabularMSA, DNA
>>> msa = TabularMSA.read(fh, constructor=DNA, variant='sanger')
>>> msa
TabularMSA[DNA]
-----------------------------------
Stats:
sequence count: 2
position count: 35
-----------------------------------
AACACCAAACTTCTCCACCACGTGAGCTACAAAAG
TATGTATATATAACATATACATATATACATACATA
Note that quality scores are decoded from Sanger. To load the second sequence
as DNA:
>>> fh = StringIO(fs) # reload the StringIO to read from the beginning again
>>> seq = DNA.read(fh, variant='sanger', seq_num=2)
>>> seq
DNA
----------------------------------------
Metadata:
'description': 'description 2'
'id': 'seq2'
Positional metadata:
'quality': <dtype: uint8>
Stats:
length: 35
has gaps: False
has degenerates: False
has definites: True
GC-content: 14.29%
----------------------------------------
0 TATGTATATA TAACATATAC ATATATACAT ACATA
To write our TabularMSA to a FASTQ file with quality scores encoded using
the illumina1.3 variant:
>>> new_fh = StringIO()
>>> print(msa.write(new_fh, format='fastq', variant='illumina1.3').getvalue())
@seq1 description 1
AACACCAAACTTCTCCACCACGTGAGCTACAAAAG
+
FFFFx}s|F|b}b`ab`bbF}ka}bbxs{s{x{ve
@seq2 description 2
TATGTATATATAACATATACATATATACATACATA
+
|jyzox|~zxx}FFF`b}{{FasFFbF{`sFFaaa
>>> new_fh.close()
Note that the file has been written in normalized format: sequence and quality scores each only occur on a single line and the sequence header line is not repeated in the quality header line. Note also that the quality scores are different because they have been encoded using a different variant.
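The change in quality characters can be reproduced by hand: each character is decoded with the Sanger offset (33) and re-encoded with the Illumina 1.3 offset (64), which shifts every ASCII code by 31. The sketch below uses a hypothetical helper, `reencode_quality`, and performs no range checking.

```python
def reencode_quality(chars, from_offset=33, to_offset=64):
    """Re-encode quality characters from one ASCII offset to another."""
    return "".join(chr(ord(char) - from_offset + to_offset) for char in chars)


# Quality string of seq1 from the example file, joined onto a single line.
sanger_quality = r"''''Y^T]']C^CABCACC'^LB^CCYT\T\Y\WF"
print(reencode_quality(sanger_quality))
# FFFFx}s|F|b}b`ab`bbF}ka}bbxs{s{x{ve
```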
References#
[1] Peter J. A. Cock, Christopher J. Fields, Naohisa Goto, Michael L. Heuer, and Peter M. Rice. The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucl. Acids Res. (2010) 38 (6): 1767-1771. First published online December 16, 2009. doi:10.1093/nar/gkp1137 http://nar.oxfordjournals.org/content/38/6/1767