Clustal format (skbio.io.format.clustal)#
Clustal format (clustal) stores multiple sequence alignments. This format was originally introduced in the Clustal package [1].
Format Support#
Has Sniffer: Yes
| Reader | Writer | Object Class |
|---|---|---|
| Yes | Yes | skbio.alignment.TabularMSA |
Format Specification#
A clustal-formatted file is a plain text file. It can optionally begin with a header that states the Clustal version number. The header is followed by the multiple sequence alignment and, optionally, by information about the degree of conservation at each position in the alignment [2].
Alignment Section#
Each sequence in the alignment is divided into subsequences each at most 60 characters long. The sequence identifier for each sequence precedes each subsequence. Each subsequence can optionally be followed by the cumulative number of non-gap characters up to that point in the full sequence (not included in the examples below). A line containing conservation information about each position in the alignment can optionally follow all of the subsequences (not included in the examples below).
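The block layout described above can be sketched in plain Python. This is an illustrative helper only, not scikit-bio's writer: the function name `wrap_alignment`, the label padding of three spaces, and the choice to treat only `-` as a gap character are assumptions made for the example.

```python
def wrap_alignment(records, width=60, with_counts=False):
    """Format (label, aligned_sequence) pairs as Clustal-style blocks.

    Hypothetical helper for illustration; labels are assumed to be unique
    and '-' is assumed to be the only gap character.
    """
    if not records:
        return []
    length = len(records[0][1])
    if any(len(seq) != length for _, seq in records):
        raise ValueError("all aligned sequences must have the same length")

    # Pad labels to a common width so the subsequences line up.
    pad = max(len(label) for label, _ in records) + 3
    non_gap = {label: 0 for label, _ in records}
    lines = []
    for start in range(0, length, width):
        for label, seq in records:
            chunk = seq[start:start + width]  # at most `width` characters
            line = label.ljust(pad) + chunk
            if with_counts:
                # Cumulative number of non-gap characters up to this point.
                non_gap[label] += len(chunk) - chunk.count("-")
                line += " " + str(non_gap[label])
            lines.append(line)
        lines.append("")  # a blank line separates blocks
    return lines[:-1]  # drop the trailing blank line


records = [("abc", "A" * 70), ("def", "-" * 10 + "C" * 60)]
for line in wrap_alignment(records, with_counts=True):
    print(line)
```

With the default width of 60, the two 70-character sequences are written as one block of 60 characters followed by a block of 10, and each identifier is repeated at the start of every subsequence.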
Note
scikit-bio ignores conservation information when reading and does not support writing conservation information.
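To make the reading behaviour concrete, here is a minimal pure-Python sketch of a parser that follows the layout described above and skips conservation lines. It is not scikit-bio's reader: the name `parse_clustal` is hypothetical, and the sketch assumes that conservation lines start with whitespace and that identifiers contain no spaces.

```python
def parse_clustal(lines):
    """Collect the full aligned sequence for each identifier.

    Hypothetical sketch for illustration; assumes conservation lines start
    with whitespace and identifiers contain no whitespace.
    """
    sequences = {}
    first = True
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # blank line between blocks
        if first and line.startswith("CLUSTAL"):
            first = False
            continue  # optional header with the version number
        first = False
        if line[0].isspace():
            continue  # conservation line: ignored
        fields = line.split()
        if len(fields) < 2:
            raise ValueError("expected an identifier and a subsequence: %r" % line)
        label, chunk = fields[0], fields[1]
        # A third field, if present, is the cumulative non-gap count; ignored.
        sequences[label] = sequences.get(label, "") + chunk
    return sequences


example = [
    "CLUSTAL W (1.82) multiple sequence alignment\n",
    "\n",
    "abc   GCAU-G 5\n",
    "def   GCAUUG 6\n",
    "      **** *\n",
    "\n",
    "abc   AC 7\n",
    "def   A- 7\n",
    "      * \n",
]
print(parse_clustal(example))
```

The subsequences for each identifier are concatenated in file order, so the result holds one full-length aligned sequence per identifier.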
Note
When reading a clustal-formatted file into an skbio.alignment.TabularMSA object, sequence identifiers/labels are stored as TabularMSA index labels (index property).
When writing an skbio.alignment.TabularMSA object as a clustal-formatted file, TabularMSA index labels will be converted to strings and written as sequence identifiers/labels.
Format Parameters#
The only supported format parameter is constructor, which specifies the type of in-memory sequence object to read each aligned sequence into. This must be a subclass of GrammaredSequence (e.g., DNA, RNA, Protein) and is a required format parameter. For example, if you know that the clustal file you’re reading contains DNA sequences, you would pass constructor=DNA to the reader call.
Examples#
Assume we have a clustal-formatted file of RNA sequences:
CLUSTAL W (1.82) multiple sequence alignment

abc GCAUGCAUCUGCAUACGUACGUACGCAUGCAUCA
def ----------------------------------
xyz ----------------------------------

abc GUCGAUACAUACGUACGUCGUACGUACGU-CGAC
def ---------------CGCGAUGCAUGCAU-CGAU
xyz -----------CAUGCAUCGUACGUACGCAUGAC
We can use the following code to read clustal-formatted data into a TabularMSA (the lines passed to the reader here are a shortened variant of the file shown above):
>>> from skbio import TabularMSA, RNA
>>> clustal_f = ['CLUSTAL W (1.82) multiple sequence alignment\n',
...              '\n',
...              'abc GCAUGCAUCUGCAUACGUACGUACGCAUGCA\n',
...              'def -------------------------------\n',
...              'xyz -------------------------------\n',
...              '\n',
...              'abc GUCGAUACAUACGUACGUCGGUACGU-CGAC\n',
...              'def ---------------CGUGCAUGCAU-CGAU\n',
...              'xyz -----------CAUUCGUACGUACGCAUGAC\n']
>>> msa = TabularMSA.read(clustal_f, constructor=RNA)
>>> msa
TabularMSA[RNA]
--------------------------------------------------------------
Stats:
sequence count: 3
position count: 62
--------------------------------------------------------------
GCAUGCAUCUGCAUACGUACGUACGCAUGCAGUCGAUACAUACGUACGUCGGUACGU-CGAC
----------------------------------------------CGUGCAUGCAU-CGAU
------------------------------------------CAUUCGUACGUACGCAUGAC
>>> msa.index
Index(['abc', 'def', 'xyz'], dtype='object')
We can use the following code to write a TabularMSA to a clustal-formatted file:
>>> from io import StringIO
>>> from skbio import DNA
>>> seqs = [DNA('ACCGTTGTA-GTAGCT', metadata={'id': 'seq1'}),
...         DNA('A--GTCGAA-GTACCT', metadata={'id': 'sequence-2'}),
...         DNA('AGAGTTGAAGGTATCT', metadata={'id': '3'})]
>>> msa = TabularMSA(seqs, minter='id')
>>> msa
TabularMSA[DNA]
----------------------
Stats:
sequence count: 3
position count: 16
----------------------
ACCGTTGTA-GTAGCT
A--GTCGAA-GTACCT
AGAGTTGAAGGTATCT
>>> msa.index
Index(['seq1', 'sequence-2', '3'], dtype='object')
>>> fh = StringIO()
>>> _ = msa.write(fh, format='clustal')
>>> print(fh.getvalue())
CLUSTAL
seq1 ACCGTTGTA-GTAGCT
sequence-2 A--GTCGAA-GTACCT
3 AGAGTTGAAGGTATCT