BLAST+7 format (skbio.io.format.blast7)#
The BLAST+7 format (blast+7) stores the results of a BLAST [1] database
search. This format is produced by both BLAST+ output format 7 and legacy
BLAST output format 9. The results are stored in a simple tabular format
with headers. Values are separated by the tab character.
An example BLAST+7-formatted file comparing two nucleotide sequences, taken
from [2] (tab characters represented by <tab>):
# BLASTN 2.2.18+
# Query: gi|1786181|gb|AE000111.1|AE000111
# Subject: ecoli
# Fields: query acc., subject acc., evalue, q. start, q. end, s. start, s. end
# 5 hits found
AE000111<tab>AE000111<tab>0.0<tab>1<tab>10596<tab>1<tab>10596
AE000111<tab>AE000174<tab>8e-30<tab>5565<tab>5671<tab>6928<tab>6821
AE000111<tab>AE000394<tab>1e-27<tab>5587<tab>5671<tab>135<tab>219
AE000111<tab>AE000425<tab>6e-26<tab>5587<tab>5671<tab>8552<tab>8468
AE000111<tab>AE000171<tab>3e-24<tab>5587<tab>5671<tab>2214<tab>2130
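The layout described above (comment lines beginning with `#`, followed by tab-separated data rows) can be illustrated with a minimal sketch in plain Python. This is not scikit-bio's parser; the function name `split_blast7` is hypothetical and the input is a shortened copy of the example, with real tab characters and an adjusted hit count.

```python
# Illustrative only: a hand-rolled split of a BLAST+7 text into comment
# (header) lines and tab-separated data rows. Not scikit-bio's parser.
example = "\n".join([
    "# BLASTN 2.2.18+",
    "# Query: gi|1786181|gb|AE000111.1|AE000111",
    "# Subject: ecoli",
    "# Fields: query acc., subject acc., evalue, q. start, q. end, s. start, s. end",
    "# 2 hits found",
    "AE000111\tAE000111\t0.0\t1\t10596\t1\t10596",
    "AE000111\tAE000174\t8e-30\t5565\t5671\t6928\t6821",
])


def split_blast7(text):
    """Return (header_lines, rows) for a single-record BLAST+7 string."""
    headers, rows = [], []
    for line in text.splitlines():
        if line.startswith("#"):
            # Header lines: drop the leading '#' and surrounding spaces.
            headers.append(line[1:].strip())
        elif line.strip():
            # Data lines: values are separated by the tab character.
            rows.append(line.split("\t"))
    return headers, rows


headers, rows = split_blast7(example)
print(headers[3])   # the "Fields: ..." header line
print(rows[1][:3])  # first three values of the second hit
```

Every value is kept as a string here; converting columns such as evalue to numbers is left out of the sketch.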
Format Support#
Has Sniffer: Yes
Reader | Writer | Object Class |
---|---|---|
Yes | No | pandas.DataFrame |
Format Specification#
There are two BLAST+7 file formats supported by scikit-bio: BLAST+ output
format 7 (-outfmt 7) and legacy BLAST output format 9 (-m 9). Both file
formats are structurally similar, with minor differences.
Example BLAST+ output format 7 file:
# BLASTP 2.2.31+
# Query: query1
# Subject: subject2
# Fields: q. start, q. end, s. start, s. end, identical, mismatches, sbjctframe, query acc.ver, subject acc.ver
# 2 hits found
1 8 3 10 8 0 1 query1 subject2
2 5 2 15 8 0 2 query1 subject2
Note
Database searches without hits may occur in BLAST+ output format 7 files. scikit-bio ignores these “empty” records:
# BLASTP 2.2.31+
# Query: query1
# Subject: subject1
# 0 hits found
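To make the idea of an "empty" record concrete, here is a rough sketch of how records without data rows could be dropped when scanning a file by hand. It assumes, for illustration only, that each record begins with a `# BLAST...` line; the helper `iter_nonempty_records` is hypothetical and does not reflect how scikit-bio implements this internally.

```python
def iter_nonempty_records(text):
    """Yield (header_lines, data_lines) per record, skipping records with no hits.

    Assumption for this sketch: every record starts at a '# BLAST...' line.
    """
    headers, data = [], []
    for line in text.splitlines():
        if line.startswith("# BLAST"):
            # A new record begins; emit the previous one only if it had hits.
            if data:
                yield headers, data
            headers, data = [line], []
        elif line.startswith("#"):
            headers.append(line)
        elif line.strip():
            data.append(line)
    if data:
        yield headers, data


text = "\n".join([
    "# BLASTP 2.2.31+",
    "# Query: query1",
    "# Subject: subject1",
    "# 0 hits found",
    "# BLASTP 2.2.31+",
    "# Query: query1",
    "# Subject: subject2",
    "# Fields: q. start, q. end, s. start, s. end",
    "# 1 hits found",
    "1\t8\t3\t10",
])

records = list(iter_nonempty_records(text))
print(len(records))  # 1
```

Only the second record survives, because the first contains header lines but no data rows.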
Example legacy BLAST output format 9 file:
# BLASTN 2.2.3 [May-13-2002]
# Database: other_vertebrate
# Query: AF178033
# Fields:
Query id,Subject id,% identity,alignment length,mismatches,gap openings,q. start,q. end,s. start,s. end,e-value,bit score
AF178033 EMORG:AF178033 100.00 811 0 0 1 811 1 811 0.0 1566.6
AF178033 EMORG:AF031394 99.63 811 3 0 1 811 99 909 0.0 1542.8
Note
scikit-bio requires fields to be consistent within a file.
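As an illustration of what "consistent" means here, the sketch below checks that every `# Fields:` line in a BLAST+ output format 7 text is identical. The function `check_consistent_fields` and the choice of `ValueError` are assumptions made for this example; the way scikit-bio itself reports inconsistent fields is not shown here, and the legacy format (where the field names sit on the line after `# Fields:`) is not covered.

```python
def check_consistent_fields(text):
    """Return the shared '# Fields:' line, or raise ValueError if records differ."""
    seen = None
    for line in text.splitlines():
        if line.startswith("# Fields:"):
            if seen is None:
                seen = line
            elif line != seen:
                raise ValueError(
                    "inconsistent fields: %r != %r" % (line, seen)
                )
    return seen


consistent = "\n".join([
    "# Fields: q. start, q. end",
    "1\t8",
    "# Fields: q. start, q. end",
    "2\t5",
])
inconsistent = "\n".join([
    "# Fields: q. start, q. end",
    "1\t8",
    "# Fields: s. start, s. end",
    "3\t10",
])

print(check_consistent_fields(consistent))
try:
    check_consistent_fields(inconsistent)
except ValueError as exc:
    print("rejected:", exc)
```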
BLAST Column Types#
The following column types are output by BLAST and supported by scikit-bio.
For more information on these column types, see skbio.io.format.blast6.
Field Name | DataFrame Column Name |
---|---|
query id | qseqid |
query gi | qgi |
query acc. | qacc |
query acc.ver | qaccver |
query length | qlen |
subject id | sseqid |
subject ids | sallseqid |
subject gi | sgi |
subject gis | sallgi |
subject acc. | sacc |
subject acc.ver | saccver |
subject accs | sallacc |
subject length | slen |
q. start | qstart |
q. end | qend |
s. start | sstart |
s. end | send |
query seq | qseq |
subject seq | sseq |
evalue | evalue |
bit score | bitscore |
score | score |
alignment length | length |
% identity | pident |
identical | nident |
mismatches | mismatch |
positives | positive |
gap opens | gapopen |
gaps | gaps |
% positives | ppos |
query/sbjct frames | frames |
query frame | qframe |
sbjct frame | sframe |
BTOP | btop |
subject tax ids | staxids |
subject sci names | sscinames |
subject com names | scomnames |
subject blast names | sblastnames |
subject super kingdoms | sskingdoms |
subject title | stitle |
subject strand | sstrand |
subject titles | salltitles |
% query coverage per subject | qcovs |
% query coverage per hsp | qcovhsp |
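The mapping in the table above can be pictured as a simple lookup from the names on a `# Fields:` line to DataFrame column names. The following is a minimal sketch using only a handful of entries copied from the table; `FIELD_TO_COLUMN` and `columns_from_fields_line` are hypothetical names for this illustration and are not part of scikit-bio's API.

```python
# A small subset of the table above, for illustration only.
FIELD_TO_COLUMN = {
    "query acc.": "qacc",
    "subject acc.": "sacc",
    "evalue": "evalue",
    "q. start": "qstart",
    "q. end": "qend",
    "s. start": "sstart",
    "s. end": "send",
}


def columns_from_fields_line(line):
    """Translate a BLAST+ format 7 '# Fields: ...' line into column names."""
    prefix = "# Fields: "
    if not line.startswith(prefix):
        raise ValueError("not a '# Fields:' line")
    # Field names are separated by a comma followed by a space.
    names = line[len(prefix):].split(", ")
    return [FIELD_TO_COLUMN[name] for name in names]


line = "# Fields: query acc., subject acc., evalue, q. start, q. end, s. start, s. end"
columns = columns_from_fields_line(line)
print(columns)
# ['qacc', 'sacc', 'evalue', 'qstart', 'qend', 'sstart', 'send']
```

A field name missing from the lookup raises `KeyError` in this sketch, which is one simple way to surface unsupported columns.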
Examples#
Suppose we have a BLAST+7 file:
>>> from io import StringIO
>>> import skbio.io
>>> import pandas as pd
>>> fs = '\n'.join([
... '# BLASTN 2.2.18+',
... '# Query: gi|1786181|gb|AE000111.1|AE000111',
... '# Database: ecoli',
... '# Fields: query acc., subject acc., evalue, q. start, q. end, s. start, s. end',
... '# 5 hits found',
... 'AE000111\tAE000111\t0.0\t1\t10596\t1\t10596',
... 'AE000111\tAE000174\t8e-30\t5565\t5671\t6928\t6821',
... 'AE000111\tAE000171\t3e-24\t5587\t5671\t2214\t2130',
... 'AE000111\tAE000425\t6e-26\t5587\t5671\t8552\t8468'
... ])
>>> fh = StringIO(fs)
Read the file into a pd.DataFrame:
>>> df = skbio.io.read(fh, into=pd.DataFrame)
>>> df
       qacc      sacc        evalue  qstart     qend  sstart     send
0  AE000111  AE000111  0.000000e+00     1.0  10596.0     1.0  10596.0
1  AE000111  AE000174  8.000000e-30  5565.0   5671.0  6928.0   6821.0
2  AE000111  AE000171  3.000000e-24  5587.0   5671.0  2214.0   2130.0
3  AE000111  AE000425  6.000000e-26  5587.0   5671.0  8552.0   8468.0
Suppose we have a legacy BLAST 9 file:
>>> from io import StringIO
>>> import skbio.io
>>> import pandas as pd
>>> fs = '\n'.join([
... '# BLASTN 2.2.3 [May-13-2002]',
... '# Database: other_vertebrate',
... '# Query: AF178033',
... '# Fields: ',
... 'Query id,Subject id,% identity,alignment length,mismatches,gap openings,q. start,q. end,s. start,s. end,e-value,bit score',
... 'AF178033\tEMORG:AF178033\t100.00\t811\t0\t0\t1\t811\t1\t811\t0.0\t1566.6',
... 'AF178033\tEMORG:AF178032\t94.57\t811\t44\t0\t1\t811\t1\t811\t0.0\t1217.7',
... 'AF178033\tEMORG:AF178031\t94.82\t811\t42\t0\t1\t811\t1\t811\t0.0\t1233.5'
... ])
>>> fh = StringIO(fs)
Read the file into a pd.DataFrame:
>>> df = skbio.io.read(fh, into=pd.DataFrame)
>>> df[['qseqid', 'sseqid', 'pident']]
     qseqid          sseqid  pident
0  AF178033  EMORG:AF178033  100.00
1  AF178033  EMORG:AF178032   94.57
2  AF178033  EMORG:AF178031   94.82
References#
[1] Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. (1990) “Basic local alignment search tool.” J. Mol. Biol. 215:403-410.