BLAST+7 format (`skbio.io.format.blast7`)

The BLAST+7 format (blast+7) stores the results of a BLAST [1] database search. This format is produced by both BLAST+ output format 7 and legacy BLAST output format 9. The results are stored in a simple tabular format with headers. Values are separated by the tab character.

An example BLAST+7-formatted file comparing two nucleotide sequences, taken from [2] (tab characters represented by <tab>):

# BLASTN 2.2.18+
# Query: gi|1786181|gb|AE000111.1|AE000111
# Subject: ecoli
# Fields: query acc., subject acc., evalue, q. start, q. end, s. start, s. end
# 5 hits found
AE000111<tab>AE000111<tab>0.0<tab>1<tab>10596<tab>1<tab>10596
AE000111<tab>AE000174<tab>8e-30<tab>5565<tab>5671<tab>6928<tab>6821
AE000111<tab>AE000394<tab>1e-27<tab>5587<tab>5671<tab>135<tab>219
AE000111<tab>AE000425<tab>6e-26<tab>5587<tab>5671<tab>8552<tab>8468
AE000111<tab>AE000171<tab>3e-24<tab>5587<tab>5671<tab>2214<tab>2130
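
As a rough illustration of this layout (a minimal standard-library sketch, not scikit-bio's reader), the lines of such a file can be separated into `#` comment lines and tab-separated hit rows. The helper name `split_blast7` is hypothetical and exists only for this example.

```python
def split_blast7(text):
    """Return (comments, rows) for BLAST+7-formatted text.

    comments: the '#' lines, with the leading '#' and surrounding spaces removed.
    rows: one list of string values per hit line, split on tab characters.
    """
    comments, rows = [], []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        if line.startswith("#"):
            comments.append(line.lstrip("#").strip())
        else:
            rows.append(line.split("\t"))
    return comments, rows


# The first two hits of the example above, with real tab characters.
example = "\n".join([
    "# BLASTN 2.2.18+",
    "# Query: gi|1786181|gb|AE000111.1|AE000111",
    "# Subject: ecoli",
    "# Fields: query acc., subject acc., evalue, q. start, q. end, s. start, s. end",
    "# 5 hits found",
    "AE000111\tAE000111\t0.0\t1\t10596\t1\t10596",
    "AE000111\tAE000174\t8e-30\t5565\t5671\t6928\t6821",
])
comments, rows = split_blast7(example)
```

Values are kept as strings here; converting them to numbers is left out to keep the sketch short.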

Format Support

Has Sniffer: Yes

Reader	Writer	Object Class
Yes	No	`pandas.DataFrame`

Format Specification

There are two BLAST+7 file formats supported by scikit-bio: BLAST+ output format 7 (-outfmt 7) and legacy BLAST output format 9 (-m 9). Both file formats are structurally similar, with minor differences.

Example BLAST+ output format 7 file:

# BLASTP 2.2.31+
# Query: query1
# Subject: subject2
# Fields: q. start, q. end, s. start, s. end, identical, mismatches, sbjct frame, query acc.ver, subject acc.ver
# 2 hits found
1   8       3       10      8       0       1       query1  subject2
2   5       2       15      8       0       2       query1  subject2

Note

Database searches without hits may occur in BLAST+ output format 7 files. scikit-bio ignores these “empty” records:

# BLASTP 2.2.31+
# Query: query1
# Subject: subject1
# 0 hits found
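
The effect of ignoring empty records can be sketched in plain Python. This is an illustrative assumption about how one might drop them, not scikit-bio's implementation: the hypothetical helpers `iter_records` and `non_empty_records` treat each line beginning with `# BLAST` as the start of a new record and keep only records that contain at least one hit row.

```python
def iter_records(text):
    """Yield (comments, rows) per record; a record starts at a '# BLAST' line."""
    comments, rows = [], []
    for line in text.splitlines():
        # Assumption: a program line such as '# BLASTP 2.2.31+' opens a record.
        if line.startswith("# BLAST") and (comments or rows):
            yield comments, rows
            comments, rows = [], []
        if line.startswith("#"):
            comments.append(line)
        elif line.strip():
            rows.append(line.split("\t"))
    if comments or rows:
        yield comments, rows


def non_empty_records(text):
    """Keep only the records that contain at least one hit row."""
    return [(c, r) for c, r in iter_records(text) if r]


text = "\n".join([
    "# BLASTP 2.2.31+",
    "# Query: query1",
    "# Subject: subject1",
    "# 0 hits found",
    "# BLASTP 2.2.31+",
    "# Query: query1",
    "# Subject: subject2",
    "# Fields: q. start, q. end, s. start, s. end",
    "# 2 hits found",
    "1\t8\t3\t10",
    "2\t5\t2\t15",
])
kept = non_empty_records(text)  # only the record for subject2 remains
```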

Example legacy BLAST output format 9 file:

# BLASTN 2.2.3 [May-13-2002]
# Database: other_vertebrate
# Query: AF178033
# Fields:
Query id,Subject id,% identity,alignment length,mismatches,gap openings,q. start,q. end,s. start,s. end,e-value,bit score
AF178033    EMORG:AF178033  100.00  811 0   0   1   811 1   811 0.0 1566.6
AF178033    EMORG:AF031394  99.63   811 3   0   1   811 99  909 0.0 1542.8
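
Judging from the two examples above, one visible difference between the variants is where the field names appear: BLAST+ output format 7 lists them on the `# Fields:` line itself, separated by a comma and a space, while legacy output format 9 puts them on the following line, separated by commas only. The sketch below tells the two apart on that basis. The function name `read_field_names` and the labels `"blast+7"` and `"legacy9"` are made up for this example and are not part of scikit-bio.

```python
def read_field_names(lines):
    """Return (variant, field_names) from the first '# Fields:' header found."""
    marker = "# Fields:"
    for index, line in enumerate(lines):
        if not line.startswith(marker):
            continue
        inline = line[len(marker):].strip()
        if inline:
            # BLAST+ output format 7: names follow on the same line.
            return "blast+7", inline.split(", ")
        # Legacy output format 9: names are on the next line, comma-separated.
        if index + 1 >= len(lines):
            raise ValueError("'# Fields:' line is not followed by field names")
        return "legacy9", lines[index + 1].strip().split(",")
    raise ValueError("no '# Fields:' line found")


blast7_lines = [
    "# BLASTP 2.2.31+",
    "# Query: query1",
    "# Subject: subject2",
    "# Fields: q. start, q. end, s. start, s. end, identical, mismatches, "
    "sbjct frame, query acc.ver, subject acc.ver",
    "# 2 hits found",
]
legacy_lines = [
    "# BLASTN 2.2.3 [May-13-2002]",
    "# Database: other_vertebrate",
    "# Query: AF178033",
    "# Fields: ",
    "Query id,Subject id,% identity,alignment length,mismatches,gap openings,"
    "q. start,q. end,s. start,s. end,e-value,bit score",
]
variant7, names7 = read_field_names(blast7_lines)
variant9, names9 = read_field_names(legacy_lines)
```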

Note

scikit-bio requires the fields to be consistent within a file: every record must declare the same fields in its `# Fields:` header.
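
A file can be checked for this property before reading it. The following is a minimal sketch under stated assumptions, not scikit-bio's own validation: the hypothetical function `check_consistent_fields` handles only BLAST+ output format 7 headers (field names on the `# Fields:` line) and raises a plain `ValueError` when two records disagree.

```python
def check_consistent_fields(text):
    """Return the shared field list, or raise ValueError if records disagree."""
    prefix = "# Fields: "
    seen = None
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.startswith(prefix):
            continue
        fields = line[len(prefix):].split(", ")
        if seen is None:
            seen = fields  # the first header sets the expectation
        elif fields != seen:
            raise ValueError(
                f"line {number}: fields {fields} differ from {seen}"
            )
    return seen


consistent = "\n".join([
    "# Fields: q. start, q. end",
    "1\t8",
    "# Fields: q. start, q. end",
    "2\t5",
])
inconsistent = "\n".join([
    "# Fields: q. start, q. end",
    "1\t8",
    "# Fields: s. start, s. end",
    "3\t10",
])
fields = check_consistent_fields(consistent)
# check_consistent_fields(inconsistent) would raise ValueError
```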

BLAST Column Types

The following column types are output by BLAST and supported by scikit-bio. For more information on these column types, see skbio.io.format.blast6.

Field Name	DataFrame Column Name
query id	qseqid
query gi	qgi
query acc.	qacc
query acc.ver	qaccver
query length	qlen
subject id	sseqid
subject ids	sallseqid
subject gi	sgi
subject gis	sallgi
subject acc.	sacc
subject acc.ver	saccver
subject accs	sallacc
subject length	slen
q. start	qstart
q. end	qend
s. start	sstart
s. end	send
query seq	qseq
subject seq	sseq
evalue	evalue
bit score	bitscore
score	score
alignment length	length
% identity	pident
identical	nident
mismatches	mismatch
positives	positive
gap opens	gapopen
gaps	gaps
% positives	ppos
query/sbjct frames	frames
query frame	qframe
sbjct frame	sframe
BTOP	btop
subject tax ids	staxids
subject sci names	sscinames
subject com names	scomnames
subject blast names	sblastnames
subject super kingdoms	sskingdoms
subject title	stitle
subject strand	sstrand
subject titles	salltitles
% query coverage per subject	qcovs
% query coverage per hsp	qcovhsp
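
To show how the table is applied, the sketch below translates a `# Fields:` line into DataFrame column names using a small subset of the mapping above. The dictionary `FIELD_TO_COLUMN` and the function `columns_from_fields_line` are hypothetical names chosen for this example; scikit-bio's internal lookup may be organized differently.

```python
# Subset of the table above: BLAST field name -> DataFrame column name.
FIELD_TO_COLUMN = {
    "query acc.": "qacc",
    "subject acc.": "sacc",
    "evalue": "evalue",
    "q. start": "qstart",
    "q. end": "qend",
    "s. start": "sstart",
    "s. end": "send",
}


def columns_from_fields_line(line):
    """Translate a BLAST+ format 7 '# Fields:' line into column names."""
    prefix = "# Fields: "
    if not line.startswith(prefix):
        raise ValueError("not a BLAST+ output format 7 fields line")
    names = line[len(prefix):].rstrip("\n").split(", ")
    unknown = [name for name in names if name not in FIELD_TO_COLUMN]
    if unknown:
        raise ValueError(f"unrecognized field names: {unknown}")
    return [FIELD_TO_COLUMN[name] for name in names]


columns = columns_from_fields_line(
    "# Fields: query acc., subject acc., evalue, q. start, q. end, s. start, s. end"
)
```

The resulting names are the column labels of the DataFrame in the first example of the Examples section.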

Examples

Suppose we have a BLAST+7 file:

>>> from io import StringIO
>>> import skbio.io
>>> import pandas as pd
>>> fs = '\n'.join([
...     '# BLASTN 2.2.18+',
...     '# Query: gi|1786181|gb|AE000111.1|AE000111',
...     '# Database: ecoli',
...     '# Fields: query acc., subject acc., evalue, q. start, q. end, s. start, s. end',
...     '# 4 hits found',
...     'AE000111\tAE000111\t0.0\t1\t10596\t1\t10596',
...     'AE000111\tAE000174\t8e-30\t5565\t5671\t6928\t6821',
...     'AE000111\tAE000171\t3e-24\t5587\t5671\t2214\t2130',
...     'AE000111\tAE000425\t6e-26\t5587\t5671\t8552\t8468'
... ])
>>> fh = StringIO(fs)

Read the file into a pd.DataFrame:

>>> df = skbio.io.read(fh, into=pd.DataFrame)
>>> df
       qacc      sacc        evalue  qstart     qend  sstart     send
0  AE000111  AE000111  0.000000e+00     1.0  10596.0     1.0  10596.0
1  AE000111  AE000174  8.000000e-30  5565.0   5671.0  6928.0   6821.0
2  AE000111  AE000171  3.000000e-24  5587.0   5671.0  2214.0   2130.0
3  AE000111  AE000425  6.000000e-26  5587.0   5671.0  8552.0   8468.0

Suppose we have a legacy BLAST 9 file:

>>> from io import StringIO
>>> import skbio.io
>>> import pandas as pd
>>> fs = '\n'.join([
...     '# BLASTN 2.2.3 [May-13-2002]',
...     '# Database: other_vertebrate',
...     '# Query: AF178033',
...     '# Fields: ',
...     'Query id,Subject id,% identity,alignment length,mismatches,gap openings,q. start,q. end,s. start,s. end,e-value,bit score',
...     'AF178033\tEMORG:AF178033\t100.00\t811\t0\t0\t1\t811\t1\t811\t0.0\t1566.6',
...     'AF178033\tEMORG:AF178032\t94.57\t811\t44\t0\t1\t811\t1\t811\t0.0\t1217.7',
...     'AF178033\tEMORG:AF178031\t94.82\t811\t42\t0\t1\t811\t1\t811\t0.0\t1233.5'
... ])
>>> fh = StringIO(fs)

Read the file into a pd.DataFrame:

>>> df = skbio.io.read(fh, into=pd.DataFrame)
>>> df[['qseqid', 'sseqid', 'pident']]
     qseqid          sseqid  pident
0  AF178033  EMORG:AF178033  100.00
1  AF178033  EMORG:AF178032   94.57
2  AF178033  EMORG:AF178031   94.82

References

[1] Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. (1990) “Basic local alignment search tool.” J. Mol. Biol. 215:403-410.

[2] http://www.ncbi.nlm.nih.gov/books/NBK279682/
