Clustal format (skbio.io.format.clustal)#

Clustal format (clustal) stores multiple sequence alignments. This format was originally introduced in the Clustal package [1].

Format Support#

Has Sniffer: Yes

Reader  Writer  Object Class
Yes     Yes     skbio.alignment.TabularMSA

Format Specification#

A clustal-formatted file is a plain text file. It can optionally begin with a header, which states the clustal version number. The header is followed by the multiple sequence alignment and, optionally, information about the degree of conservation at each position in the alignment [2].

Alignment Section#

Each sequence in the alignment is divided into subsequences each at most 60 characters long. The sequence identifier for each sequence precedes each subsequence. Each subsequence can optionally be followed by the cumulative number of non-gap characters up to that point in the full sequence (not included in the examples below). A line containing conservation information about each position in the alignment can optionally follow all of the subsequences (not included in the examples below).
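The block layout described above can be sketched in plain Python. This is a minimal illustration, not scikit-bio's writer: the helper name format_clustal_blocks, its width parameter, and the two-space gap after the longest identifier are assumptions made for the example. It omits the header, the optional cumulative counts, and the conservation line.

```python
def format_clustal_blocks(records, width=60):
    """Lay out aligned sequences as clustal-style blocks.

    Hypothetical helper for illustration only. `records` is a list of
    (label, aligned_sequence) pairs; all sequences must share one length.
    """
    if not records:
        return ""
    # Labels are converted to strings before being written as identifiers.
    labels = [str(label) for label, _ in records]
    seqs = [seq for _, seq in records]
    if len({len(seq) for seq in seqs}) > 1:
        raise ValueError("aligned sequences must have equal length")

    # Assumed spacing: pad identifiers to the longest one plus two spaces.
    pad = max(len(label) for label in labels) + 2
    blocks = []
    for start in range(0, len(seqs[0]), width):
        # Every block repeats each identifier before its subsequence.
        lines = [label.ljust(pad) + seq[start:start + width]
                 for label, seq in zip(labels, seqs)]
        blocks.append("\n".join(lines))
    # Blocks are separated by a blank line.
    return "\n\n".join(blocks)


# A 70-column alignment is split into one block of 60 and one of 10.
print(format_clustal_blocks([("abc", "A" * 70), ("de", "-" * 70)]))
```

With the default width, an alignment of 70 positions yields two blocks, and each identifier appears once per block.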

Note

scikit-bio ignores conservation information when reading and does not support writing conservation information.

Note

When reading a clustal-formatted file into an skbio.alignment.TabularMSA object, sequence identifiers/labels are stored as TabularMSA index labels (index property).

When writing an skbio.alignment.TabularMSA object as a clustal-formatted file, TabularMSA index labels will be converted to strings and written as sequence identifiers/labels.
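To make the reading behavior in the notes above concrete, here is a rough pure-Python sketch of how subsequences can be stitched back together by identifier while conservation lines and trailing counts are skipped. The function parse_clustal_lines is hypothetical and is not scikit-bio's reader. It assumes well-formed input in which the header starts with "CLUSTAL", conservation lines start with whitespace, and identifiers contain no spaces.

```python
def parse_clustal_lines(lines):
    """Collect aligned sequences by identifier from clustal-style lines.

    Hypothetical helper for illustration only. Returns (labels, sequences)
    in order of first appearance.
    """
    seqs = {}  # dicts preserve insertion order
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines, the header, and conservation lines (assumed to
        # start with whitespace because they carry no identifier).
        if (not line.strip() or line.startswith("CLUSTAL")
                or line[0].isspace()):
            continue
        fields = line.split()
        label, chunk = fields[0], fields[1]
        # A third field, if present, is the cumulative count; it is ignored.
        seqs[label] = seqs.get(label, "") + chunk
    return list(seqs), list(seqs.values())


example = [
    "CLUSTAL W (1.82) multiple sequence alignment\n",
    "\n",
    "abc   GCAU-  4\n",
    "def   G-AUC  4\n",
    "      * **  \n",
    "\n",
    "abc   GG  6\n",
    "def   -G  5\n",
]
labels, sequences = parse_clustal_lines(example)
print(labels)     # ['abc', 'def']
print(sequences)  # ['GCAU-GG', 'G-AUC-G']
```

The returned labels play the role that the TabularMSA index labels play after a real read.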

Format Parameters#

The only supported format parameter is constructor, which specifies the type of in-memory sequence object to read each aligned sequence into. This must be a subclass of GrammaredSequence (e.g., DNA, RNA, Protein) and is a required format parameter. For example, if you know that the clustal file you’re reading contains DNA sequences, you would pass constructor=DNA to the reader call.

Examples#

Assume we have a clustal-formatted file of RNA sequences:

CLUSTAL W (1.82) multiple sequence alignment

abc   GCAUGCAUCUGCAUACGUACGUACGCAUGCA
def   -------------------------------
xyz   -------------------------------

abc   GUCGAUACAUACGUACGUCGGUACGU-CGAC
def   ---------------CGUGCAUGCAU-CGAU
xyz   -----------CAUUCGUACGUACGCAUGAC

We can use the following code to read the clustal file into a TabularMSA:

>>> from skbio import TabularMSA, RNA
>>> clustal_f = ['CLUSTAL W (1.82) multiple sequence alignment\n',
...              '\n',
...              'abc   GCAUGCAUCUGCAUACGUACGUACGCAUGCA\n',
...              'def   -------------------------------\n',
...              'xyz   -------------------------------\n',
...              '\n',
...              'abc   GUCGAUACAUACGUACGUCGGUACGU-CGAC\n',
...              'def   ---------------CGUGCAUGCAU-CGAU\n',
...              'xyz   -----------CAUUCGUACGUACGCAUGAC\n']
>>> msa = TabularMSA.read(clustal_f, constructor=RNA)
>>> msa
TabularMSA[RNA]
--------------------------------------------------------------
Stats:
    sequence count: 3
    position count: 62
--------------------------------------------------------------
GCAUGCAUCUGCAUACGUACGUACGCAUGCAGUCGAUACAUACGUACGUCGGUACGU-CGAC
----------------------------------------------CGUGCAUGCAU-CGAU
------------------------------------------CAUUCGUACGUACGCAUGAC
>>> msa.index
Index(['abc', 'def', 'xyz'], dtype='object')

We can use the following code to write a TabularMSA to a clustal-formatted file:

>>> from io import StringIO
>>> from skbio import DNA
>>> seqs = [DNA('ACCGTTGTA-GTAGCT', metadata={'id': 'seq1'}),
...         DNA('A--GTCGAA-GTACCT', metadata={'id': 'sequence-2'}),
...         DNA('AGAGTTGAAGGTATCT', metadata={'id': '3'})]
>>> msa = TabularMSA(seqs, minter='id')
>>> msa
TabularMSA[DNA]
----------------------
Stats:
    sequence count: 3
    position count: 16
----------------------
ACCGTTGTA-GTAGCT
A--GTCGAA-GTACCT
AGAGTTGAAGGTATCT
>>> msa.index
Index(['seq1', 'sequence-2', '3'], dtype='object')
>>> fh = StringIO()
>>> _ = msa.write(fh, format='clustal')
>>> print(fh.getvalue())
CLUSTAL


seq1        ACCGTTGTA-GTAGCT
sequence-2  A--GTCGAA-GTACCT
3           AGAGTTGAAGGTATCT

References#