BLAST+6 format (skbio.io.format.blast6)

The BLAST+6 format (blast+6) stores the results of a BLAST [1] database search. The results are stored in a simple tabular format with no column headers. Values are separated by the tab character.

An example BLAST+6-formatted file comparing two protein sequences, taken from [2] (tab characters represented by <tab>):

moaC<tab>gi|15800534|ref|NP_286546.1|<tab>100.00<tab>161<tab>0<tab>0<tab>1<tab>161<tab>1<tab>161<tab>3e-114<tab>330
moaC<tab>gi|170768970|ref|ZP_02903423.1|<tab>99.38<tab>161<tab>1<tab>0<tab>1<tab>161<tab>1<tab>161<tab>9e-114<tab>329

Format Support

Has Sniffer: No

Reader  Writer  Object Class
------  ------  ----------------
Yes     No      pandas.DataFrame

Format Specification

BLAST+6 format is a tabular text-based format produced by both BLAST+ output format 6 (-outfmt 6) and legacy BLAST output format 8 (-m 8). It is tab-separated and has no column headers. With BLAST+, users who do not want the default columns can choose which columns appear in the output file by listing column names (e.g., -outfmt "6 qseqid sseqid bitscore qstart sstart").
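Because the file carries no header row, column names have to be supplied alongside the data. The following is a minimal sketch of that pairing using only the Python standard library. It is not how scikit-bio parses the file: `parse_blast6_line` is a hypothetical helper, and the sample line is assembled from the example record at the top of this page to match the -outfmt string in the paragraph above.

```python
def parse_blast6_line(line, columns):
    """Pair the tab-separated fields of one BLAST+6 line with column names.

    Hypothetical helper for illustration; not part of scikit-bio.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(columns):
        raise ValueError(
            f"expected {len(columns)} fields, found {len(fields)}"
        )
    # Values stay as strings here; type conversion is a separate step.
    return dict(zip(columns, fields))


# Columns as requested with: -outfmt "6 qseqid sseqid bitscore qstart sstart"
columns = ["qseqid", "sseqid", "bitscore", "qstart", "sstart"]
line = "moaC\tgi|15800534|ref|NP_286546.1|\t330\t1\t1\n"
record = parse_blast6_line(line, columns)
```

The field-count check matters because nothing in the file itself says which columns were requested; a mismatch between the supplied names and the data can only be detected by counting fields.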

BLAST Column Types

The following column types are output by BLAST and supported by scikit-bio. This information is taken from [3].

Name         Type   Description
-----------  -----  -----------
qseqid       str    Query Seq-id
qgi          int    Query GI
qacc         str    Query accession
qaccver      str    Query accession.version
qlen         int    Query sequence length
sseqid       str    Subject Seq-id
sallseqid    str    All subject Seq-id(s), separated by a ‘;’
sgi          int    Subject GI
sallgi       int    All subject GIs
sacc         str    Subject accession
saccver      str    Subject accession.version
sallacc      str    All subject accessions
slen         int    Subject sequence length
qstart       int    Start of alignment in query
qend         int    End of alignment in query
sstart       int    Start of alignment in subject
send         int    End of alignment in subject
qseq         str    Aligned part of query sequence
sseq         str    Aligned part of subject sequence
evalue       float  Expect value
bitscore     float  Bit score
score        int    Raw score
length       int    Alignment length
pident       float  Percent of identical matches
nident       int    Number of identical matches
mismatch     int    Number of mismatches
positive     int    Number of positive-scoring matches
gapopen      int    Number of gap openings
gaps         int    Total number of gaps
ppos         float  Percentage of positive-scoring matches
frames       str    Query and subject frames separated by a ‘/’
qframe       int    Query frame
sframe       int    Subject frame
btop         int    Blast traceback operations (BTOP)
staxids      str    Unique Subject Taxonomy ID(s), separated by a ‘;’ (in numerical order)
sscinames    str    Unique Subject Scientific Name(s), separated by a ‘;’
scomnames    str    Unique Subject Common Name(s), separated by a ‘;’
sblastnames  str    Unique Subject Blast Name(s), separated by a ‘;’ (in alphabetical order)
sskingdoms   str    Unique Subject Super Kingdom(s), separated by a ‘;’ (in alphabetical order)
stitle       str    Subject Title
sstrand      str    Subject Strand
salltitles   str    All Subject Title(s), separated by a ‘<>’
qcovs        int    Query Coverage Per Subject
qcovhsp      int    Query Coverage Per HSP

Note

When a BLAST+6-formatted file contains N/A values, scikit-bio will convert these values into np.nan, matching pandas’ convention for representing missing data.

Note

scikit-bio stores columns of type int as type float in the returned pd.DataFrame. This is necessary in order to allow N/A values in integer columns (this is currently a limitation of pandas).
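The reason integer columns come back as floats can be seen without pandas: NaN is a floating-point value, so a column that may contain it cannot stay purely integer. A rough sketch, assuming missing values appear in the file as the literal string N/A; `to_number` is a hypothetical helper and not scikit-bio's implementation.

```python
import math


def to_number(field):
    """Convert one BLAST+6 field to float, mapping 'N/A' to NaN.

    Hypothetical helper; assumes missing values are spelled 'N/A'.
    """
    if field == "N/A":
        return math.nan
    return float(field)


# A made-up integer-typed column (e.g. qlen) with one missing value.
raw = ["161", "N/A", "158"]
values = [to_number(f) for f in raw]

# Every entry is a float, including the ones that were whole numbers.
all_floats = all(isinstance(v, float) for v in values)
```

Since NaN never compares equal to itself, missing entries should be detected with `math.isnan` (or `pandas.isna` on a DataFrame) rather than with `==`.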

Format Parameters

The following format parameters are available in blast+6 format:

  • default_columns: False by default. If True, will use the default columns output by BLAST, which are qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, and bitscore.

    Warning

    When reading legacy BLAST files, you must pass default_columns=True because legacy BLAST does not allow users to specify which columns are present in the output file.

  • columns: None by default. If provided, must be a list of column names in the order they will appear in the file.

Note

Either default_columns or columns must be provided, as blast+6 does not contain column headers.
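The interaction between the two parameters can be sketched as a small resolution function. This is an illustration only: `resolve_columns` and `DEFAULT_COLUMNS` are hypothetical names, and rejecting the case where both arguments are supplied is an assumption that the text above does not state.

```python
# The default columns listed under default_columns above.
DEFAULT_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def resolve_columns(default_columns=False, columns=None):
    """Return the column names to apply to a headerless BLAST+6 file.

    Hypothetical helper; rejecting both arguments together is an assumption.
    """
    if default_columns and columns is not None:
        raise ValueError("pass either default_columns or columns, not both")
    if default_columns:
        return list(DEFAULT_COLUMNS)
    if columns is None:
        raise ValueError("either default_columns or columns must be provided")
    return list(columns)


# Legacy BLAST (-m 8) output always uses the default columns.
legacy_names = resolve_columns(default_columns=True)
custom_names = resolve_columns(columns=["qseqid", "sseqid", "bitscore"])
```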

Examples

Suppose we have a blast+6 file with default columns:

>>> from io import StringIO
>>> import skbio.io
>>> import pandas as pd
>>> fs = '\n'.join([
...     'moaC\tgi|15800534|ref|NP_286546.1|\t100.00\t161\t0\t0\t1\t161\t1\t161'
...     '\t3e-114\t330',
...     'moaC\tgi|170768970|ref|ZP_02903423.1|\t99.38\t161\t1\t0\t1\t161\t1'
...     '\t161\t9e-114\t329'
... ])
>>> fh = StringIO(fs)

Read the file into a pd.DataFrame and specify that default columns should be used:

>>> df = skbio.io.read(fh, format="blast+6", into=pd.DataFrame,
...                    default_columns=True)
>>> df 
  qseqid                           sseqid  pident  length  mismatch  gapopen \
0   moaC     gi|15800534|ref|NP_286546.1|  100.00   161.0       0.0      0.0
1   moaC  gi|170768970|ref|ZP_02903423.1|   99.38   161.0       1.0      0.0

   qstart   qend  sstart   send         evalue  bitscore
0     1.0  161.0     1.0  161.0  3.000000e-114     330.0
1     1.0  161.0     1.0  161.0  9.000000e-114     329.0

Suppose we have a blast+6 file with user-supplied (non-default) columns:

>>> from io import StringIO
>>> import skbio.io
>>> import pandas as pd
>>> fs = '\n'.join([
...     'moaC\t100.00\t0\t161\t0\t161\t330\t1',
...     'moaC\t99.38\t1\t161\t0\t161\t329\t1'
... ])
>>> fh = StringIO(fs)

Read the file into a pd.DataFrame and specify which columns are present in the file:

>>> df = skbio.io.read(fh, format="blast+6", into=pd.DataFrame,
...                    columns=['qseqid', 'pident', 'mismatch', 'length',
...                             'gapopen', 'qend', 'bitscore', 'sstart'])
>>> df 
  qseqid  pident  mismatch  length  gapopen   qend  bitscore  sstart
0   moaC  100.00       0.0   161.0      0.0  161.0     330.0     1.0
1   moaC   99.38       1.0   161.0      0.0  161.0     329.0     1.0

References

[1] Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. (1990) “Basic local alignment search tool.” J. Mol. Biol. 215:403-410.