QSeq format (skbio.io.format.qseq)#

The QSeq format (qseq) is a record-based, plain text output format produced by some DNA sequencers for storing biological sequence data, quality scores, per-sequence filtering information, and run-specific metadata.

Format Support#

Has Sniffer: Yes

Reader  Writer  Object Class
------  ------  --------------------------------------------
Yes     No      generator of skbio.sequence.Sequence objects
Yes     No      skbio.sequence.Sequence
Yes     No      skbio.sequence.DNA
Yes     No      skbio.sequence.RNA
Yes     No      skbio.sequence.Protein

Format Specification#

A QSeq file is composed of single-line records whose fields are delimited by tabs. Each record has 11 fields:

  • Machine name

  • Run number

  • Lane number (positive integer)

  • Tile number (positive integer)

  • X coordinate (integer)

  • Y coordinate (integer)

  • Index

  • Read number (1-3)

  • Sequence data (typically IUPAC characters)

  • Quality scores (encoded as printable ASCII)

  • Filter boolean (1 if sequence has passed CASAVA’s filter, 0 otherwise)

For more details please refer to the CASAVA documentation [1].

Note

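When a QSeq file is read into a scikit-bio object, the object’s metadata attribute is automatically populated with the run-specific fields listed above (machine name, run number, lane number, tile number, X and Y coordinates, index, and read number), together with an id built from them. Quality scores are stored in the object’s positional metadata.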

Note

The lowercase parameter is supported when reading QSeq files. Refer to the documentation of the specific object constructor for details.

Note

scikit-bio allows the filter field to be omitted, but it is not clear whether this is part of the original format specification.
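To make the record layout concrete, here is a minimal pure-Python sketch that splits one QSeq line into named fields. It is not scikit-bio's implementation: the names QSEQ_FIELDS, INT_FIELDS, and parse_qseq_record are hypothetical, and the choice to convert only the lane, tile, coordinate, and read-number fields to integers is an assumption of this sketch. A record with the filter field omitted is accepted, and its filter value is reported as None.

```python
# Illustrative sketch only; not scikit-bio's implementation.
# QSEQ_FIELDS, INT_FIELDS and parse_qseq_record are hypothetical names.
QSEQ_FIELDS = (
    "machine_name", "run_number", "lane_number", "tile_number",
    "x", "y", "index", "read_number", "sequence", "quality", "filter",
)
# Fields the specification describes as integers (plus the 1-3 read number).
INT_FIELDS = ("lane_number", "tile_number", "x", "y", "read_number")


def parse_qseq_record(line):
    """Split one tab-delimited QSeq line into a dict keyed by field name."""
    values = line.rstrip("\n").split("\t")
    if len(values) not in (10, 11):
        raise ValueError(
            f"expected 10 or 11 tab-separated fields, found {len(values)}"
        )
    # zip stops at the shorter input, so a 10-field record gets no
    # "filter" key at this point.
    record = dict(zip(QSEQ_FIELDS, values))
    for name in INT_FIELDS:
        record[name] = int(record[name])
    if len(record["sequence"]) != len(record["quality"]):
        raise ValueError("sequence and quality strings differ in length")
    # True/False when the filter field is present, None when it was omitted.
    if "filter" in record:
        record["filter"] = record["filter"] == "1"
    else:
        record["filter"] = None
    return record


record = parse_qseq_record(
    "illumina\t1\t3\t34\t-30\t30\t0\t1\tACG....ACGTAC\truBBBBrBCEFGH\t1"
)
print(record["tile_number"], record["x"], record["filter"])
```

Returning None for a missing filter field keeps "not reported" distinct from "failed", which a caller can then map to whatever policy it prefers.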

Format Parameters#

The following parameters are the same as in FASTQ format (skbio.io.format.fastq):

  • variant: see variant parameter in FASTQ format

  • phred_offset: see phred_offset parameter in FASTQ format

The following additional parameters are the same as in FASTA format (skbio.io.format.fasta):

  • constructor: see constructor parameter in FASTA format

  • seq_num: see seq_num parameter in FASTA format
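As a rough illustration of what variant and phred_offset control, the sketch below decodes a quality string into integer Phred scores. It is not scikit-bio's code: decode_quality and VARIANT_OFFSETS are hypothetical names, and requiring exactly one of the two arguments is a choice made for this sketch. The offsets used are the conventional ones (33 for Sanger and Illumina 1.8+, 64 for Illumina 1.3+).

```python
# Illustrative sketch only; not scikit-bio's implementation.
# VARIANT_OFFSETS and decode_quality are hypothetical names.
VARIANT_OFFSETS = {
    "sanger": 33,
    "illumina1.3": 64,
    "illumina1.8": 33,
}


def decode_quality(quality, variant=None, phred_offset=None):
    """Convert printable-ASCII quality characters to integer Phred scores."""
    # Assumption of this sketch: the caller names a variant or gives an
    # explicit offset, but not both.
    if (variant is None) == (phred_offset is None):
        raise ValueError("provide exactly one of variant or phred_offset")
    if phred_offset is None:
        phred_offset = VARIANT_OFFSETS[variant]
    scores = [ord(char) - phred_offset for char in quality]
    if min(scores, default=0) < 0:
        raise ValueError("quality character below the chosen offset")
    return scores


scores = decode_quality("ruBBBBrBCEFGH", variant="illumina1.3")
print(scores)
```

Decoding the same string with the wrong offset shifts every score by the difference between the offsets (31 between 33 and 64), which is why the variant has to match the instrument that produced the file.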

Generators Only#

  • filter: If True, excludes sequences that did not pass filtering (i.e., filter field is 0). Default is True.
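The effect of this parameter can be sketched in plain Python as a generator that skips records whose last field is 0. This is only an approximation for illustration: passing_records and only_passing are hypothetical names (only_passing stands in for filter), and treating a record without a filter field as having passed is an assumption of this sketch.

```python
# Illustrative sketch only; not scikit-bio's implementation.
# passing_records and only_passing are hypothetical names; only_passing
# plays the role of the filter parameter described above.
def passing_records(lines, only_passing=True):
    """Yield QSeq lines, optionally skipping those whose filter field is 0."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Assumption of this sketch: a record without the 11th field
        # is treated as having passed filtering.
        failed = len(fields) == 11 and fields[10] == "0"
        if only_passing and failed:
            continue
        yield line


lines = [
    "illumina\t1\t3\t34\t-30\t30\t0\t1\tACG....ACGTAC\truBBBBrBCEFGH\t1",
    "illumina\t1\t3\t34\t30\t-30\t0\t1\tCGGGCATTGCA\tCGGGCasdGCA\t0",
]
kept = list(passing_records(lines))
everything = list(passing_records(lines, only_passing=False))
print(len(kept), len(everything))
```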

Examples#

Suppose we have the following QSeq file:

illumina    1       3       34      -30     30      0       1       ACG....ACGTAC   ruBBBBrBCEFGH   1
illumina    1       3       34      30      -30     0       1       CGGGCATTGCA     CGGGCasdGCA     0
illumina    1       3       35      -30     30      0       2       ACGTA.AATAAAC   geTaAafhwqAAf   1
illumina    1       3       35      30      -30     0       3       CATTTAGGA.TGCA  tjflkAFnkKghvM  0

Let’s define this file in memory as a StringIO. In practice, this could be a real file path, a file handle, or anything else supported by scikit-bio’s I/O registry:

>>> from io import StringIO
>>> fs = '\n'.join([
...     'illumina\t1\t3\t34\t-30\t30\t0\t1\tACG....ACGTAC\truBBBBrBCEFGH\t1',
...     'illumina\t1\t3\t34\t30\t-30\t0\t1\tCGGGCATTGCA\tCGGGCasdGCA\t0',
...     'illumina\t1\t3\t35\t-30\t30\t0\t2\tACGTA.AATAAAC\tgeTaAafhwqAAf\t1',
...     'illumina\t1\t3\t35\t30\t-30\t0\t3\tCATTTAGGA.TGCA\ttjflkAFnkKghvM\t0'
... ])
>>> fh = StringIO(fs)

To iterate over the sequences using the generator reader, we run:

>>> import skbio.io
>>> for seq in skbio.io.read(fh, format='qseq', variant='illumina1.3'):
...     seq
...     print('')
Sequence
--------------------------------------
Metadata:
    'id': 'illumina_1:3:34:-30:30#0/1'
    'index': 0
    'lane_number': 3
    'machine_name': 'illumina'
    'read_number': 1
    'run_number': 1
    'tile_number': 34
    'x': -30
    'y': 30
Positional metadata:
    'quality': <dtype: uint8>
Stats:
    length: 13
--------------------------------------
0 ACG....ACG TAC

Sequence
--------------------------------------
Metadata:
    'id': 'illumina_1:3:35:-30:30#0/2'
    'index': 0
    'lane_number': 3
    'machine_name': 'illumina'
    'read_number': 2
    'run_number': 1
    'tile_number': 35
    'x': -30
    'y': 30
Positional metadata:
    'quality': <dtype: uint8>
Stats:
    length: 13
--------------------------------------
0 ACGTA.AATA AAC

Note that only two sequences were loaded because the QSeq reader filters out sequences whose filter field is 0 (unless filter=False is supplied).

References#