QSeq format (skbio.io.format.qseq)
The QSeq format (qseq) is a record-based, plain text output format produced by some DNA sequencers for storing biological sequence data, quality scores, per-sequence filtering information, and run-specific metadata.
Format Support
Has Sniffer: Yes
Reader | Writer | Object Class |
---|---|---|
Yes | No | generator of |
Yes | No | |
Yes | No | |
Yes | No | |
Yes | No | |
Format Specification
A QSeq file is composed of single-line records whose fields are delimited by tabs. Each record has 11 fields:
Machine name
Run number
Lane number (positive int)
Tile number (positive int)
X coordinate (integer)
Y coordinate (integer)
Index
Read number (1-3)
Sequence data (typically IUPAC characters)
Quality scores (encoded as printable ASCII)
Filter boolean (1 if sequence has passed CASAVA’s filter, 0 otherwise)
For more details please refer to the CASAVA documentation [1].
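As a rough illustration of the record layout above, the following sketch splits one line on tabs and names the 11 fields. It uses only the Python standard library and is not scikit-bio's own parser. The field names and the parse_qseq_record helper are assumptions chosen for this example, and the run number and index are left as strings here.

```python
# Hypothetical field names, in the order listed above (not scikit-bio's parser).
QSEQ_FIELDS = (
    "machine_name", "run_number", "lane_number", "tile_number",
    "x", "y", "index", "read_number", "sequence", "quality", "filter",
)
# Fields that the specification above describes as integers.
INT_FIELDS = {"lane_number", "tile_number", "x", "y", "read_number"}


def parse_qseq_record(line):
    """Split one tab-delimited QSeq line into a dict of named fields."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(QSEQ_FIELDS):
        raise ValueError(
            f"expected {len(QSEQ_FIELDS)} fields, got {len(values)}"
        )
    record = dict(zip(QSEQ_FIELDS, values))
    for name in INT_FIELDS:
        record[name] = int(record[name])
    # The filter field is 1 for a read that passed filtering, 0 otherwise.
    record["filter"] = record["filter"] == "1"
    return record


record = parse_qseq_record(
    "illumina\t1\t3\t34\t-30\t30\t0\t1\tACG....ACGTAC\truBBBBrBCEFGH\t1"
)
print(record["tile_number"], record["x"], record["sequence"], record["filter"])
```

This sketch requires all 11 fields to be present, so a record with the filter field left out would be rejected.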
Note
When a QSeq file is read into a scikit-bio object, the object’s metadata attribute is automatically populated with data corresponding to the names above.
Note
The lowercase parameter is supported when reading QSeq files. Refer to the specific object constructor documentation for details.
Note
scikit-bio allows the filter field to be omitted, but it is not clear whether this is part of the original format specification.
Format Parameters
The following parameters are the same as in FASTQ format (skbio.io.format.fastq):
variant: see variant parameter in FASTQ format
phred_offset: see phred_offset parameter in FASTQ format
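To make the role of the offset concrete, the sketch below decodes a quality string by subtracting an offset from each character's ASCII code. It assumes an offset of 64, the value conventionally associated with the illumina1.3 variant. The decode_quality helper is hypothetical and is not a scikit-bio function.

```python
def decode_quality(quality, phred_offset=64):
    """Convert printable-ASCII quality characters to integer Phred scores."""
    scores = [ord(char) - phred_offset for char in quality]
    # A negative score means the character lies below the chosen offset.
    if min(scores, default=0) < 0:
        raise ValueError("quality character below the given offset")
    return scores


# Quality string taken from the first record in the Examples section.
scores = decode_quality("ruBBBBrBCEFGH")
print(scores)
```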
The following additional parameters are the same as in FASTA format (skbio.io.format.fasta):
constructor: see constructor parameter in FASTA format
seq_num: see seq_num parameter in FASTA format
Generators Only
filter: If True, excludes sequences that did not pass filtering (i.e., filter field is 0). Default is True.
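The effect of filter can be pictured with a plain-Python sketch that skips records whose last field is 0. This only imitates the behavior described above and is not how scikit-bio implements it. The passing_records function and its apply_filter argument are hypothetical names, and the sketch assumes every record carries all 11 fields.

```python
def passing_records(lines, apply_filter=True):
    """Yield field lists, skipping records whose filter field is '0'."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # The filter flag is the last of the 11 tab-delimited fields.
        if apply_filter and fields[-1] == "0":
            continue
        yield fields


lines = [
    "illumina\t1\t3\t34\t-30\t30\t0\t1\tACG....ACGTAC\truBBBBrBCEFGH\t1",
    "illumina\t1\t3\t34\t30\t-30\t0\t1\tCGGGCATTGCA\tCGGGCasdGCA\t0",
]
kept = list(passing_records(lines))
everything = list(passing_records(lines, apply_filter=False))
print(len(kept), len(everything))
```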
Examples
Suppose we have the following QSeq file:
illumina 1 3 34 -30 30 0 1 ACG....ACGTAC ruBBBBrBCEFGH 1
illumina 1 3 34 30 -30 0 1 CGGGCATTGCA CGGGCasdGCA 0
illumina 1 3 35 -30 30 0 2 ACGTA.AATAAAC geTaAafhwqAAf 1
illumina 1 3 35 30 -30 0 3 CATTTAGGA.TGCA tjflkAFnkKghvM 0
Let’s define this file in-memory as a StringIO, though in practice this could be a real file path, file handle, or anything else supported by scikit-bio’s I/O registry:
>>> from io import StringIO
>>> fs = '\n'.join([
... 'illumina\t1\t3\t34\t-30\t30\t0\t1\tACG....ACGTAC\truBBBBrBCEFGH\t1',
... 'illumina\t1\t3\t34\t30\t-30\t0\t1\tCGGGCATTGCA\tCGGGCasdGCA\t0',
... 'illumina\t1\t3\t35\t-30\t30\t0\t2\tACGTA.AATAAAC\tgeTaAafhwqAAf\t1',
... 'illumina\t1\t3\t35\t30\t-30\t0\t3\tCATTTAGGA.TGCA\ttjflkAFnkKghvM\t0'
... ])
>>> fh = StringIO(fs)
To iterate over the sequences using the generator reader, we run:
>>> import skbio.io
>>> for seq in skbio.io.read(fh, format='qseq', variant='illumina1.3'):
... seq
... print('')
Sequence
--------------------------------------
Metadata:
'id': 'illumina_1:3:34:-30:30#0/1'
'index': 0
'lane_number': 3
'machine_name': 'illumina'
'read_number': 1
'run_number': 1
'tile_number': 34
'x': -30
'y': 30
Positional metadata:
'quality': <dtype: uint8>
Stats:
length: 13
--------------------------------------
0 ACG....ACG TAC
Sequence
--------------------------------------
Metadata:
'id': 'illumina_1:3:35:-30:30#0/2'
'index': 0
'lane_number': 3
'machine_name': 'illumina'
'read_number': 2
'run_number': 1
'tile_number': 35
'x': -30
'y': 30
Positional metadata:
'quality': <dtype: uint8>
Stats:
length: 13
--------------------------------------
0 ACGTA.AATA AAC
Note that only two sequences were loaded because the QSeq reader filters out sequences whose filter field is 0 (unless filter=False is supplied).
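The id values in the output above appear to follow the pattern machine_run:lane:tile:x:y#index/read. The sketch below rebuilds that string from a raw record using only the standard library. The pattern is inferred from the example output rather than taken from a specification, and qseq_id is a hypothetical helper.

```python
def qseq_id(line):
    """Build an identifier from the first eight fields of a QSeq record."""
    machine, run, lane, tile, x, y, index, read = line.split("\t")[:8]
    return f"{machine}_{run}:{lane}:{tile}:{x}:{y}#{index}/{read}"


# Third record of the example file, which passed filtering.
line = "illumina\t1\t3\t35\t-30\t30\t0\t2\tACGTA.AATAAAC\tgeTaAafhwqAAf\t1"
print(qseq_id(line))
```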