FASTA/QUAL format (skbio.io.format.fasta
)#
The FASTA file format (fasta
) stores biological (i.e., nucleotide or
protein) sequences in a simple plain text format that is both human-readable
and easy to parse. The file format was first introduced and used in the FASTA
software package [1]. Additional descriptions of the file format can be found
in [2] and [3].
An example of a FASTA-formatted file containing two DNA sequences:
>seq1 db-accession-149855
CGATGTCGATCGATCGATCGATCAG
>seq2 db-accession-34989
CATCGATCGATCGATGCATGCATGCATG
The QUAL file format is an additional format related to FASTA. A FASTA file is sometimes accompanied by a QUAL file, particuarly when the FASTA file contains sequences generated on a high-throughput sequencing instrument. QUAL files store a Phred quality score (nonnegative integer) for each base in a sequence stored in FASTA format (see [4] for more details). scikit-bio supports reading and writing FASTA (and optionally QUAL) file formats.
Format Support#
Has Sniffer: Yes
Reader |
Writer |
Object Class |
---|---|---|
Yes |
Yes |
generator of |
Yes |
Yes |
|
Yes |
Yes |
|
Yes |
Yes |
|
Yes |
Yes |
|
Yes |
Yes |
Note
All readers and writers support an optional QUAL file via the
qual
parameter. If one is provided, quality scores will be read/written
in addition to FASTA sequence data.
Format Specification#
The following sections define the FASTA and QUAL file formats in detail.
FASTA Format#
A FASTA file contains one or more biological sequences. The sequences are stored sequentially, with a record for each sequence (also referred to as a FASTA record). Each record consists of a single-line header (sometimes referred to as a defline, label, description, or comment) followed by the sequence data, optionally split over multiple lines.
Note
Blank or whitespace-only lines are only allowed at the beginning of the file, between FASTA records, or at the end of the file. A blank or whitespace-only line after the header line, within the sequence (for FASTA files), or within quality scores (for QUAL files) will raise an error.
scikit-bio will ignore leading and trailing whitespace characters on each line while reading.
Note
scikit-bio does not currently support legacy FASTA format (i.e., headers/comments denoted with a semicolon). The format supported by scikit-bio (described below in detail) most closely resembles the description given in NCBI’s BLAST documentation [3]. See [2] for more details on legacy FASTA format. If you would like legacy FASTA format support added to scikit-bio, please consider submitting a feature request on the scikit-bio issue tracker (pull requests are also welcome!).
Sequence Header#
Each sequence header consists of a single line beginning with a greater-than
(>
) symbol. Immediately following this is a sequence identifier (ID) and
description separated by one or more whitespace characters.
Note
When reading a FASTA-formatted file, the sequence ID and description
are stored in the sequence metadata attribute, under the ‘id’ and
‘description’ keys, repectively. Both are optional. Each will be
represented as the empty string (''
) in metadata if it is not present
in the header.
When writing a FASTA-formatted file, sequence metadata identified by keys ‘id’ and ‘description’ will be converted to strings and written as the sequence identifier and description, respectively. Each will be written as the empty string if not present in sequence metadata.
A sequence ID consists of a single word: all characters after the greater- than symbol and before the first whitespace character (if any) are taken as the sequence ID. Unique sequence IDs are not strictly enforced by the FASTA format itself. A single standardized ID format is similarly not enforced by the FASTA format, though it is often common to use a unique library accession number for a sequence ID (e.g., NCBI’s FASTA defline format [5]).
If a description is present, it is taken as the remaining characters that follow the sequence ID and initial whitespace(s). The description is considered additional information about the sequence (e.g., comments about the source of the sequence or the molecule that it encodes).
For example, consider the following header:
>seq1 db-accession-149855
seq1
is the sequence ID and db-accession-149855
is the sequence
description.
Note
scikit-bio’s readers will remove all leading and trailing whitespace
from the description. If a header line begins with whitespace following the
>
, the ID is assumed to be missing and the remainder of the line is
taken as the description.
Sequence Data#
Biological sequence data follows the header, and can be split over multiple lines. The sequence data (i.e., nucleotides or amino acids) are stored using the standard IUPAC lexicon (single-letter codes).
Note
scikit-bio supports both upper and lower case characters.
This functionality depends on the type of object the data is
being read into. For Sequence
objects, sciki-bio doesn’t care about the
case. Other sequence objects do, but all provide the lowercase parameter
to control case functionality. Refer to each class’s respective constructor
documentation for details.
Both -
and .
are supported as gap characters when reading into
DNA
, RNA
, and Protein
sequence objects.
Validation is performed when reading into scikit-bio sequence objects that
enforce an alphabet (e.g., DNA
, RNA
, Protein
). If any invalid
characters are found while reading from the FASTA file, an exception is
raised.
QUAL Format#
A QUAL file contains quality scores for one or more biological sequences stored in a corresponding FASTA file. QUAL format is very similar to FASTA format: it stores records sequentially, with each record beginning with a header line containing a sequence ID and description. The same rules apply to QUAL headers as FASTA headers (see the above sections for details). scikit-bio processes FASTA and QUAL headers in exactly the same way.
Instead of storing biological sequence data in each record, a QUAL file stores a Phred quality score for each base in the corresponding sequence. Quality scores are represented as nonnegative integers separated by whitespace (typically a single space or newline), and can span multiple lines.
Note
When reading a QUAL-formatted file, quality scores are stored in the sequence’s positional_metadata attribute under the ‘quality’ column.
When writing a QUAL-formatted file, a sequence’s positional_metadata ‘quality’ column will be written as the quality scores.
Note
When reading FASTA and QUAL files, scikit-bio requires records to be in the same order in both files (i.e., each FASTA and QUAL record must have the same ID and description after being parsed). In addition to having the same order, the number of FASTA records must match the number of QUAL records (i.e., missing or additonal records are not allowed). scikit-bio also requires that the number of quality scores match the number of bases in the corresponding sequence.
When writing FASTA and QUAL files, scikit-bio will maintain the same ordering of records in both files (i.e., using the same ID and description in both records) to support future reading.
Format Parameters#
The following parameters are available to change how FASTA/QUAL files are read or written in scikit-bio.
QUAL File Parameter (Readers and Writers)#
The qual
parameter is available to all FASTA format readers and writers. It
can be any file-like type supported by scikit-bio’s I/O registry (e.g., file
handle, file path, etc.). If qual
is provided when reading, quality scores
will be included in each in-memory Sequence
object, in addition
to sequence data stored in the FASTA file. When writing, quality scores will be
written in QUAL format in addition to the sequence data being written in FASTA
format.
Reader-specific Parameters#
The available reader parameters differ depending on which reader is used.
Generator and TabularMSA Reader Parameters#
The constructor
parameter can be used with the Sequence
generator and
TabularMSA
FASTA readers. constructor
specifies the type of in-memory
sequence object to read each sequence into. For example, if you know that the
FASTA file you’re reading contains protein sequences, you would pass
constructor=Protein
to the reader call.
When reading into a Sequence
generator, constructor
defaults to
Sequence
and must be a subclass of Sequence
if supplied.
When reading into a TabularMSA
, constructor
is a required format
parameter and must be a subclass of GrammaredSequence
(e.g., DNA
,
RNA
, Protein
).
Note
The FASTA sniffer will not attempt to guess the constructor
parameter.
Sequence Reader Parameters#
The seq_num
parameter can be used with the Sequence
,
DNA
, RNA
, and Protein
FASTA readers. seq_num
specifies which
sequence to read from the FASTA file (and optional QUAL file), and defaults to
1 (i.e., such that the first sequence is read). For example, to read the 50th
sequence from a FASTA file, you would pass seq_num=50
to the reader call.
Writer-specific Parameters#
The following parameters are available to all FASTA format writers:
id_whitespace_replacement
: string to replace each whitespace character in a sequence ID. This parameter is useful for cases where an in-memory sequence ID contains whitespace, which would result in an on-disk representation that would not be read back into memory as the same ID (since IDs in FASTA format cannot contain whitespace). Defaults to_
. IfNone
, no whitespace replacement is performed and IDs are written as they are stored in memory (this has the potential to create an invalid FASTA-formatted file; see note below). This parameter also applies to a QUAL file if one is provided.description_newline_replacement
: string to replace each newline character in a sequence description. Since a FASTA header must be a single line, newlines are not allowed in sequence descriptions and must be replaced in order to write a valid FASTA file. Defaults to a single space. IfNone
, no newline replacement is performed and descriptions are written as they are stored in memory (this has the potential to create an invalid FASTA-formatted file; see note below). This parameter also applies to a QUAL file if one is provided.max_width
: integer specifying the maximum line width (i.e., number of characters) for sequence data and/or quality scores. If a sequence or its quality scores are longer thanmax_width
, it will be split across multiple lines, each with a maximum width ofmax_width
. Note that there are some caveats when splitting quality scores. A single quality score will never be split across multiple lines, otherwise it would become two different quality scores when read again. Thus, splitting only occurs between quality scores. This makes it possible to have a single long quality score written on its own line that exceedsmax_width
. For example, the quality score12345
would not be split across multiple lines even ifmax_width=3
. Thus, a 5-character line would be written. Default behavior is to not split sequence data or quality scores across multiple lines.lowercase
: String or boolean array. If a string, it is treated as a key into the positional metadata of the object. If a boolean array, it indicates characters to write in lowercase. Characters in the sequence corresponding to True values will be written in lowercase. The boolean array must be the same length as the sequence.
Note
The FASTA format writers will have noticeably better runtime
performance if id_whitespace_replacement
and/or
description_newline_replacement
are set to None
so that whitespace
replacement is not performed during writing. However, this can potentially
create invalid FASTA files, especially if there are newline characters in
the IDs or descriptions. For IDs with whitespace, this can also affect how
the IDs are read into memory in a subsequent read operation. For example, if
an in-memory sequence ID is 'seq 1'
and
id_whitespace_replacement=None
, reading the FASTA file back into memory
would result in an ID of 'seq'
, and '1'
would be part of the
sequence description.
Examples#
Reading and Writing FASTA Files#
Suppose we have the following FASTA file with five equal-length sequences (example modified from [6]):
>seq1 Turkey
AAGCTNGGGCATTTCAGGGTGAGCCCGGGCAATACAGGGTAT
>seq2 Salmo gair
AAGCCTTGGCAGTGCAGGGTGAGCCGTGG
CCGGGCACGGTAT
>seq3 H. Sapiens
ACCGGTTGGCCGTTCAGGGTACAGGTTGGCCGTTCAGGGTAA
>seq4 Chimp
AAACCCTTGCCG
TTACGCTTAAAC
CGAGGCCGGGAC
ACTCAT
>seq5 Gorilla
AAACCCTTGCCGGTACGCTTAAACCATTGCCGGTACGCTTAA
Note
Original copyright notice for the above example file:
(c) Copyright 1986-2008 by The University of Washington. Written by Joseph Felsenstein. Permission is granted to copy this document provided that no fee is charged for it and that this copyright notice is not removed.
Note that the sequences are not required to be of equal length in order for the file to be a valid FASTA file (this depends on the object that you’re reading the file into). Also note that some of the sequences occur on a single line, while others are split across multiple lines.
Let’s define this file in-memory as a StringIO
, though this could be a real
file path, file handle, or anything that’s supported by scikit-bio’s I/O
registry in practice:
>>> fl = [">seq1 Turkey\n",
... "AAGCTNGGGCATTTCAGGGTGAGCCCGGGCAATACAGGGTAT\n",
... ">seq2 Salmo gair\n",
... "AAGCCTTGGCAGTGCAGGGTGAGCCGTGG\n",
... "CCGGGCACGGTAT\n",
... ">seq3 H. Sapiens\n",
... "ACCGGTTGGCCGTTCAGGGTACAGGTTGGCCGTTCAGGGTAA\n",
... ">seq4 Chimp\n",
... "AAACCCTTGCCG\n",
... "TTACGCTTAAAC\n",
... "CGAGGCCGGGAC\n",
... "ACTCAT\n",
... ">seq5 Gorilla\n",
... "AAACCCTTGCCGGTACGCTTAAACCATTGCCGGTACGCTTAA\n"]
Since these sequences are of equal length (presumably because they’ve been
aligned), let’s read the FASTA file into a TabularMSA
object:
>>> from skbio import TabularMSA, DNA
>>> msa = TabularMSA.read(fl, constructor=DNA)
>>> msa
TabularMSA[DNA]
------------------------------------------
Stats:
sequence count: 5
position count: 42
------------------------------------------
AAGCTNGGGCATTTCAGGGTGAGCCCGGGCAATACAGGGTAT
AAGCCTTGGCAGTGCAGGGTGAGCCGTGGCCGGGCACGGTAT
ACCGGTTGGCCGTTCAGGGTACAGGTTGGCCGTTCAGGGTAA
AAACCCTTGCCGTTACGCTTAAACCGAGGCCGGGACACTCAT
AAACCCTTGCCGGTACGCTTAAACCATTGCCGGTACGCTTAA
Note that we didn’t specify a file format in the read
call. The FASTA
sniffer detected the correct file format for us!
To write the TabularMSA
in FASTA format:
>>> from io import StringIO
>>> with StringIO() as fh:
... print(msa.write(fh).getvalue())
>seq1 Turkey
AAGCTNGGGCATTTCAGGGTGAGCCCGGGCAATACAGGGTAT
>seq2 Salmo gair
AAGCCTTGGCAGTGCAGGGTGAGCCGTGGCCGGGCACGGTAT
>seq3 H. Sapiens
ACCGGTTGGCCGTTCAGGGTACAGGTTGGCCGTTCAGGGTAA
>seq4 Chimp
AAACCCTTGCCGTTACGCTTAAACCGAGGCCGGGACACTCAT
>seq5 Gorilla
AAACCCTTGCCGGTACGCTTAAACCATTGCCGGTACGCTTAA
TabularMSA
loads all of the sequences from the FASTA file into memory at
once. If the FASTA file is large (which is often the case), this may be
infeasible if you don’t have enough memory. To work around this issue, you can
stream the sequences using scikit-bio’s generator-based FASTA reader and
writer. The generator-based reader yields Sequence
objects (or subclasses
if constructor
is supplied) one at a time, instead of loading all sequences
into memory. For example, let’s use the generator-based reader to process a
single sequence at a time in a for
loop:
>>> import skbio.io
>>> for seq in skbio.io.read(fl, format='fasta'):
... seq
... print('')
Sequence
------------------------------------------------
Metadata:
'description': 'Turkey'
'id': 'seq1'
Stats:
length: 42
------------------------------------------------
0 AAGCTNGGGC ATTTCAGGGT GAGCCCGGGC AATACAGGGT AT
Sequence
------------------------------------------------
Metadata:
'description': 'Salmo gair'
'id': 'seq2'
Stats:
length: 42
------------------------------------------------
0 AAGCCTTGGC AGTGCAGGGT GAGCCGTGGC CGGGCACGGT AT
Sequence
------------------------------------------------
Metadata:
'description': 'H. Sapiens'
'id': 'seq3'
Stats:
length: 42
------------------------------------------------
0 ACCGGTTGGC CGTTCAGGGT ACAGGTTGGC CGTTCAGGGT AA
Sequence
------------------------------------------------
Metadata:
'description': 'Chimp'
'id': 'seq4'
Stats:
length: 42
------------------------------------------------
0 AAACCCTTGC CGTTACGCTT AAACCGAGGC CGGGACACTC AT
Sequence
------------------------------------------------
Metadata:
'description': 'Gorilla'
'id': 'seq5'
Stats:
length: 42
------------------------------------------------
0 AAACCCTTGC CGGTACGCTT AAACCATTGC CGGTACGCTT AA
A single sequence can also be read into a Sequence
(or subclass):
>>> from skbio import Sequence
>>> seq = Sequence.read(fl)
>>> seq
Sequence
------------------------------------------------
Metadata:
'description': 'Turkey'
'id': 'seq1'
Stats:
length: 42
------------------------------------------------
0 AAGCTNGGGC ATTTCAGGGT GAGCCCGGGC AATACAGGGT AT
By default, the first sequence in the FASTA file is read. This can be
controlled with seq_num
. For example, to read the fifth sequence:
>>> seq = Sequence.read(fl, seq_num=5)
>>> seq
Sequence
------------------------------------------------
Metadata:
'description': 'Gorilla'
'id': 'seq5'
Stats:
length: 42
------------------------------------------------
0 AAACCCTTGC CGGTACGCTT AAACCATTGC CGGTACGCTT AA
We can use the same API to read the fifth sequence into a DNA
sequence:
>>> dna_seq = DNA.read(fl, seq_num=5)
>>> dna_seq
DNA
------------------------------------------------
Metadata:
'description': 'Gorilla'
'id': 'seq5'
Stats:
length: 42
has gaps: False
has degenerates: False
has definites: True
GC-content: 50.00%
------------------------------------------------
0 AAACCCTTGC CGGTACGCTT AAACCATTGC CGGTACGCTT AA
Individual sequence objects can also be written in FASTA format:
>>> with StringIO() as fh:
... print(dna_seq.write(fh).getvalue())
>seq5 Gorilla
AAACCCTTGCCGGTACGCTTAAACCATTGCCGGTACGCTTAA
Reading and Writing FASTA/QUAL Files#
In addition to reading and writing standalone FASTA files, scikit-bio supports reading and writing FASTA and QUAL files together. Suppose we have the following FASTA file:
>seq1 db-accession-149855
CGATGTC
>seq2 db-accession-34989
CATCGTC
Also suppose we have the following QUAL file:
>seq1 db-accession-149855
40 39 39 4
50 1 100
>seq2 db-accession-34989
3 3 10 42 80 80 79
>>> fasta_fl = [
... ">seq1 db-accession-149855\n",
... "CGATGTC\n",
... ">seq2 db-accession-34989\n",
... "CATCGTC\n"]
>>> qual_fl = [
... ">seq1 db-accession-149855\n",
... "40 39 39 4\n",
... "50 1 100\n",
... ">seq2 db-accession-34989\n",
... "3 3 10 42 80 80 79\n"]
To read in a single Sequence
at a time, we can use the
generator-based reader as we did above, providing both FASTA and QUAL files:
>>> for seq in skbio.io.read(fasta_fl, qual=qual_fl, format='fasta'):
... seq
... print('')
Sequence
----------------------------------------
Metadata:
'description': 'db-accession-149855'
'id': 'seq1'
Positional metadata:
'quality': <dtype: uint8>
Stats:
length: 7
----------------------------------------
0 CGATGTC
Sequence
---------------------------------------
Metadata:
'description': 'db-accession-34989'
'id': 'seq2'
Positional metadata:
'quality': <dtype: uint8>
Stats:
length: 7
---------------------------------------
0 CATCGTC
Note that the sequence objects have quality scores stored as positional metadata since we provided a QUAL file. The other FASTA readers operate in a similar manner.
Now let’s load the sequences and their quality scores into a TabularMSA
:
>>> msa = TabularMSA.read(fasta_fl, qual=qual_fl, constructor=DNA)
>>> msa
TabularMSA[DNA]
---------------------
Stats:
sequence count: 2
position count: 7
---------------------
CGATGTC
CATCGTC
To write the sequence data and quality scores in the TabularMSA
to FASTA
and QUAL files, respectively:
>>> new_fasta_fh = StringIO()
>>> new_qual_fh = StringIO()
>>> _ = msa.write(new_fasta_fh, qual=new_qual_fh)
>>> print(new_fasta_fh.getvalue())
>seq1 db-accession-149855
CGATGTC
>seq2 db-accession-34989
CATCGTC
>>> print(new_qual_fh.getvalue())
>seq1 db-accession-149855
40 39 39 4 50 1 100
>seq2 db-accession-34989
3 3 10 42 80 80 79
>>> new_fasta_fh.close()
>>> new_qual_fh.close()
Reading multi-FASTA Files#
Suppose you have a multi-FASTA file and want to read each sequence into a DNA
object in a list. We’ll be using io.StringIO
to make a mock FASTA file in
memory.
>>> from io import StringIO
>>> import skbio
>>> from skbio.sequence import DNA
>>> fl = ">seq1 Turkey\n" +\
... "AAGCTNGGGCATTTCAGGGTGAGCCCGGGCAATACAGGGTAT\n" +\
... ">seq2 Salmo gair\n" +\
... "AAGCCTTGGCAGTGCAGGGTGAGCCGTGG\n" +\
... "CCGGGCACGGTAT\n" +\
... ">seq3 H. Sapiens\n" +\
... "ACCGGTTGGCCGTTCAGGGTACAGGTTGGCCGTTCAGGGTAA\n" +\
... ">seq4 Chimp\n" +\
... "AAACCCTTGCCG\n" +\
... "TTACGCTTAAAC\n" +\
... "CGAGGCCGGGAC\n" +\
... "ACTCAT\n" +\
... ">seq5 Gorilla\n" +\
... "AAACCCTTGCCGGTACGCTTAAACCATTGCCGGTACGCTTAA\n"
>>> mock_fl = StringIO(fl)
The following code will read the sequences into scikit-bio. In practice, mock_fl
may be replaced with an opened file handle, or the path to the file.
>>> res = list(skbio.io.read(mock_fl, format="fasta", constructor=DNA))
>>> res[0]
DNA
------------------------------------------------
Metadata:
'description': 'Turkey'
'id': 'seq1'
Stats:
length: 42
has gaps: False
has degenerates: True
has definites: True
GC-content: 54.76%
------------------------------------------------
0 AAGCTNGGGC ATTTCAGGGT GAGCCCGGGC AATACAGGGT AT
>>> res[1]
DNA
------------------------------------------------
Metadata:
'description': 'Salmo gair'
'id': 'seq2'
Stats:
length: 42
has gaps: False
has degenerates: False
has definites: True
GC-content: 66.67%
------------------------------------------------
0 AAGCCTTGGC AGTGCAGGGT GAGCCGTGGC CGGGCACGGT AT
References#
Lipman, DJ; Pearson, WR (1985). “Rapid and sensitive protein similarity searches”. Science 227 (4693): 1435-41.
Madden T. The BLAST Sequence Analysis Tool. 2002 Oct 9 [Updated 2003 Aug 13]. In: McEntyre J, Ostell J, editors. The NCBI Handbook [Internet]. Bethesda (MD): National Center for Biotechnology Information (US); 2002-. Chapter 16. Available from: http://www.ncbi.nlm.nih.gov/books/NBK21097/