Stockholm format (skbio.io.format.stockholm
)#
The Stockholm format is a multiple sequence alignment format (MSA) that optionally supports storing arbitrary alignment features (metadata). Features can be placed into four different categories: GF, GS, GR, and GC (described in more detail below).
An example Stockholm file, taken from [1]:
# STOCKHOLM 1.0
#=GF ID UPSK
#=GF SE Predicted; Infernal
#=GF SS Published; PMID 9223489
#=GF RN [1]
#=GF RM 9223489
#=GF RT The role of the pseudoknot at the 3' end of turnip yellow mosaic
#=GF RT virus RNA in minus-strand synthesis by the viral RNA-dependent RNA
#=GF RT polymerase.
#=GF RA Deiman BA, Kortlever RM, Pleij CW;
#=GF RL J Virol 1997;71:5990-5996.
AF035635.1/619-641 UGAGUUCUCGAUCUCUAAAAUCG
M24804.1/82-104 UGAGUUCUCUAUCUCUAAAAUCG
J04373.1/6212-6234 UAAGUUCUCGAUCUUUAAAAUCG
M24803.1/1-23 UAAGUUCUCGAUCUCUAAAAUCG
#=GC SS_cons .AAA....<<<<aaa....>>>>
//
Format Support#
Has Sniffer: Yes
Reader |
Writer |
Object Class |
---|---|---|
Yes |
Yes |
Format Specification#
The Stockholm format consists of a header, a multiple sequence alignment, associated metadata (features), and a footer.
Header#
The first line of a Stockholm file must be the following header:
# STOCKHOLM 1.0
Multiple Sequence Alignment#
Sequence lines consist of a sequence name, followed by whitespace, followed by the aligned sequence. For example:
seq1 ACG-T-GGT
seq2 ACCGTTCG-
Sequence names (seq1
, seq2
) are stored in the TabularMSA
index
.
Note
scikit-bio currently supports reading Stockholm files where each sequence is contained on a single line. Interleaved/wrap-around Stockholm files are not supported. When writing, each sequence will be placed on its own line.
Warning
Sequence names must be unique in the Stockholm file. Likewise,
when writing from a TabularMSA
, index
must be unique.
Metadata#
Stockholm files support storing arbitrary metadata (features) about the MSA.
All metadata described in the following sections are optional and may appear in
any order. Metadata “mark-up” lines begin with either #=GF
, #=GS
,
#=GR
, or #=GC
, and each line describes a single feature of the
alignment.
Note
Stockholm format supports generic features. [1] and [2] provide a
list of common features output by Pfam/Rfam. scikit-bio does not
require that these features are present. These features are processed in the
same way as any arbitrary feature would be, as a simple key-value pair of
strings. When writing, feature names, feature data, and sequence names are
converted to type str
.
Note
When writing a Stockholm file, scikit-bio will place the metadata in the format’s recommended order:
GF: Above the alignment
GS: Above the alignment (after GF)
GR: Below corresponding sequence
GC: Below the alignment
GF metadata#
Data relating to the multiple sequence alignment as a whole, such as authors or
number of sequences in the alignment. Starts with #=GF
followed by a
feature name and data relating to the feature. Typically comes first in a
Stockholm file.
For example (taken from [2]):
#=GF DE CBS domain
Where DE
is the feature name and CBS Domain
is the feature data.
GF metadata is stored in the TabularMSA
metadata
dictionary.
Note
When reading, duplicate GF feature names will have their values concatenated in the order they appear in the file. Concatenation will also add a space between lines if one isn’t already there in order to avoid joining words together. When writing, each GF feature will be placed on its own line, regardless of length.
Note
Trees labelled with NH
/TN
are handled differently than other
GF features. When reading a Stockholm file with these features, the reader
follows the rules described in [2]. Trees split over multiple lines will
have their values concatenated. Unlike other GF features, trees will never
have a space added when they are concatenated.
A single tree without an identifier will be stored as:
metadata = {
'NH': 'tree in NHX format'
}
A single tree with an identifier will be stored as:
metadata = {
'NH': {
'tree-id': 'tree in NHX format'
}
}
Multiple trees (which must have identifiers) will be stored as:
metadata = {
'NH': {
'tree-id-1': 'tree in NHX format',
'tree-id-2': 'tree in NHX format'
}
}
Note
References labelled with RN
/RM
/RT
/RA
/RL
/RC
are handled differently than other GF features. When reading a Stockholm
file with these features, the reader populates a list of dictionaries,
where each dictionary represents a single reference. The list contains
references in the order they appear in the file, regardless of the value
provided for RN
. If a reference does not include all possible reference
tags (e.g. RC
is missing), the dictionary will only contain the
reference tags present for that reference. When writing, the writer adds a
reference number (RN
) line before writing each reference, for example:
#=GF RN [1]
#=GF RA Kestrel Gorlick
...
#=GF RN [2]
...
References will be stored as:
metadata = {
'RN': [{
'RM': 'reference medline',
'RT': 'reference title',
'RA': 'reference author',
'RL': 'reference location',
'RC': 'reference comment'
}, {
'RM': 'reference medline',
...
}]
}
GS metadata#
Data relating to a specific sequence in the multiple sequence alignment.
Starts with #=GS
followed by the sequence name followed by a feature name
and data relating to the feature. Typically comes after GF metadata in a
Stockholm file.
For example (taken from [2]):
#=GS O83071/259-312 AC O83071
Where O83071/259-312
is the sequence name, AC
is the feature name, and
083071
is the feature data.
GS metadata is stored in the sequence-specific metadata
dictionary.
Note
When reading, duplicate GS feature names will have their values concatenated in the order they appear in the file. Concatenation will also add a space between lines if one isn’t already there in order to avoid joining words together. When writing, each GS feature will be placed on its own line, regardless of length.
GR metadata#
Data relating to the columns of a specific sequence in a multiple sequence
alignment. Starts with #=GR
followed by the sequence name followed by a
feature name and data relating to the feature, one character per column.
Typically comes after the sequence line it relates to.
For example (taken from [2]):
#=GR O31698/18-71 SS CCCHHHHHHHHHHHHHHH..EEEEEEEE....EEEEEEEEHHH
Where O31698/18-71
is the sequence name, SS
is the feature name, and
CCCHHHHHHHHHHHHHHH..EEEEEEEE....EEEEEEEEHHH
is the feature data.
GR metadata is stored in sequence-specific positional_metadata
.
Note
Duplicate GR feature names attributed to a single sequence are disallowed.
GC metadata#
Data relating to the columns of the multiple sequence alignment as a whole.
Starts with #=GC
followed by a feature name and data relating to the
feature, one character per column. Typically comes at the end of the multiple
sequence alignment.
For example (taken from [2]):
#=GC SS_cons CCCCCHHHHHHHHHHHHH..EEEEEEEE....EEEEEEEEEEH
Where SS_cons
is the feature name and
CCCCCHHHHHHHHHHHHH..EEEEEEEE....EEEEEEEEEEH
is the feature data.
GC metadata is stored in TabularMSA
positional_metadata
.
Note
Duplicate GC feature names are disallowed.
Format Parameters#
The only supported format parameter is constructor
, which specifies the
type of in-memory sequence object to read each aligned sequence into. This must
be a subclass of GrammaredSequence
(e.g., DNA
, RNA
, Protein
)
and is a required format parameter. For example, if you know that the Stockholm
file you’re reading contains DNA sequences, you would pass constructor=DNA
to the reader call.
Examples#
Suppose we have a Stockholm file containing an MSA of protein sequences (modified from [2]):
>>> import skbio.io
>>> from io import StringIO
>>> from skbio import Protein, TabularMSA
>>> fs = '\n'.join([
... '# STOCKHOLM 1.0',
... '#=GF CC CBS domains are small intracellular modules mostly'
... ' found',
... '#=GF CC in 2 or four copies within a protein.',
... '#=GS O83071/192-246 AC O83071',
... '#=GS O31698/88-139 OS Bacillus subtilis',
... 'O83071/192-246 MTCRAQLIAVPRASSLAE..AIACAQKM....RVSRV',
... '#=GR O83071/192-246 SA 999887756453524252..55152525....36463',
... 'O83071/259-312 MQHVSAPVFVFECTRLAY..VQHKLRAH....SRAVA',
... 'O31698/18-71 MIEADKVAHVQVGNNLEH..ALLVLTKT....GYTAI',
... 'O31698/88-139 EVMLTDIPRLHINDPIMK..GFGMVINN......GFV',
... 'O31699/88-139 EVMLTDIPRLHINDPIMK..GFGMVINN......GFV',
... '#=GR O31699/88-139 AS ________________*____________________',
... '#=GR O31699/88-139 IN ____________1______________2_________',
... '#=GC SS_cons CCCCCHHHHHHHHHHHHH..EEEEEEEE....EEEEE',
... '//'
... ])
>>> fh = StringIO(fs)
>>> msa = TabularMSA.read(fh, constructor=Protein)
>>> msa
TabularMSA[Protein]
----------------------------------------------------------------------
Metadata:
'CC': 'CBS domains are small intracellular modules mostly found in
2 or four copies within a protein.'
Positional metadata:
'SS_cons': <dtype: object>
Stats:
sequence count: 5
position count: 37
----------------------------------------------------------------------
MTCRAQLIAVPRASSLAE..AIACAQKM....RVSRV
MQHVSAPVFVFECTRLAY..VQHKLRAH....SRAVA
MIEADKVAHVQVGNNLEH..ALLVLTKT....GYTAI
EVMLTDIPRLHINDPIMK..GFGMVINN......GFV
EVMLTDIPRLHINDPIMK..GFGMVINN......GFV
The sequence names are stored in the index
:
>>> msa.index
Index(['O83071/192-246', 'O83071/259-312', 'O31698/18-71', 'O31698/88-139',
'O31699/88-139'],
dtype='object')
The TabularMSA
has GF metadata stored in its metadata
dictionary:
>>> list(msa.metadata.items())
[('CC', 'CBS domains are small intracellular modules mostly found in 2 or four copies within a protein.')]
GC metadata is stored in the TabularMSA
positional_metadata
:
>>> msa.positional_metadata
SS_cons
0 C
1 C
2 C
3 C
4 C
5 H
6 H
7 H
8 H
9 H
...
GS metadata is stored in the sequence-specific metadata
dictionary:
>>> list(msa[0].metadata.items())
[('AC', 'O83071')]
GR metadata is stored in sequence-specific positional_metadata
:
>>> msa[4].positional_metadata
AS IN
0 _ _
1 _ _
2 _ _
3 _ _
4 _ _
5 _ _
6 _ _
7 _ _
8 _ _
9 _ _
...
Let’s write this TabularMSA
in Stockholm format:
>>> fh = StringIO()
>>> _ = msa.write(fh, format='stockholm')
>>> print(fh.getvalue())
# STOCKHOLM 1.0
#=GF CC CBS domains are small intracellular modules mostly found in 2 or four copies within a protein.
#=GS O83071/192-246 AC O83071
#=GS O31698/88-139 OS Bacillus subtilis
O83071/192-246 MTCRAQLIAVPRASSLAE..AIACAQKM....RVSRV
#=GR O83071/192-246 SA 999887756453524252..55152525....36463
O83071/259-312 MQHVSAPVFVFECTRLAY..VQHKLRAH....SRAVA
O31698/18-71 MIEADKVAHVQVGNNLEH..ALLVLTKT....GYTAI
O31698/88-139 EVMLTDIPRLHINDPIMK..GFGMVINN......GFV
O31699/88-139 EVMLTDIPRLHINDPIMK..GFGMVINN......GFV
#=GR O31699/88-139 AS ________________*____________________
#=GR O31699/88-139 IN ____________1______________2_________
#=GC SS_cons CCCCCHHHHHHHHHHHHH..EEEEEEEE....EEEEE
//
>>> fh.close()