skbio.alignment.TabularMSA.conservation

TabularMSA.conservation(metric='inverse_shannon_uncertainty', degenerate_mode='error', gap_mode='nan')

Apply metric to compute conservation for all alignment positions.

Parameters:
metric : {"inverse_shannon_uncertainty"}, optional

Metric that should be applied for computing conservation. Resulting values should be larger when a position is more conserved.

degenerate_mode : {"nan", "error"}, optional

Mode for handling positions with degenerate characters. If "nan", positions with degenerate characters will be assigned a conservation score of np.nan. If "error" (the default), an error will be raised if one or more degenerate characters are present.

gap_mode : {"nan", "ignore", "error", "include"}, optional

Mode for handling positions with gap characters. If "nan" (the default), positions with gaps will be assigned a conservation score of np.nan. If "ignore", gap characters will be removed from each position before metric is applied. If "error", an error will be raised if one or more gap characters are present. If "include", conservation will be computed on alignment positions with gaps included. In this case, it is up to the metric either to handle gaps appropriately or to raise an error if it does not support them.
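The four gap modes can be sketched for a single alignment column as follows. This is a minimal illustration, not scikit-bio's implementation: `score_column`, `toy_metric`, and the `GAP_CHARS` set are hypothetical names introduced here. The sketch checks for gaps one column at a time, whereas the method raises for gaps anywhere in the alignment. It also leaves open what happens to a column that consists only of gaps under "ignore".

```python
import math

GAP_CHARS = {"-", "."}  # assumed gap alphabet for this sketch


def score_column(column, gap_mode, metric):
    """Apply `metric` to one column under the given gap_mode.

    Hypothetical helper; `metric` is any callable that takes a string.
    """
    has_gaps = any(ch in GAP_CHARS for ch in column)
    if gap_mode == "nan":
        return math.nan if has_gaps else metric(column)
    if gap_mode == "ignore":
        # Drop gap characters, then score what remains.
        return metric("".join(ch for ch in column if ch not in GAP_CHARS))
    if gap_mode == "error":
        if has_gaps:
            raise ValueError("Gap characters present in alignment.")
        return metric(column)
    if gap_mode == "include":
        # Gaps stay in the column; the metric must cope with them.
        return metric(column)
    raise ValueError(f"Unknown gap_mode provided: {gap_mode!r}")


def toy_metric(column):
    """Toy metric: fraction of the column held by its most common symbol."""
    return max(column.count(ch) for ch in set(column)) / len(column)


print(score_column("AA-A", "ignore", toy_metric))
print(score_column("AA-A", "include", toy_metric))
```

With the toy metric, "ignore" scores the column as if it were "AAA", while "include" counts the gap as a fourth symbol, so the two modes can give different scores for the same column.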

Returns:
np.array of floats

Values resulting from the application of metric to each position in the alignment.

Raises:
ValueError

If an unknown metric, degenerate_mode or gap_mode is provided.

ValueError

If any degenerate characters are present in the alignment when degenerate_mode is "error".

ValueError

If any gaps are present in the alignment when gap_mode is "error".

Notes

Users should be careful interpreting results when gap_mode = "include" as the results may be misleading. For example, as pointed out in [1], a protein alignment position composed of 90% gaps and 10% tryptophans would score as more highly conserved than a position composed of alanine and glycine in equal frequencies with the "inverse_shannon_uncertainty" metric.
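The comparison in the paragraph above can be checked with a few lines of arithmetic. The sketch below is only an illustration under stated assumptions: `one_minus_uncertainty` is a hypothetical helper, and the logarithm base of 21 (20 amino acids plus one gap symbol) is assumed here rather than taken from this documentation. The ordering of the two scores does not depend on the base, as long as it is greater than 1.

```python
import math


def one_minus_uncertainty(frequencies, base):
    """Return 1 - H for relative frequencies, with logs taken in `base`.

    Hypothetical helper for illustration; not part of scikit-bio.
    """
    # H = -sum(p * log(p)), so 1 - H = 1 + sum(p * log(p)).
    return 1.0 + sum(p * math.log(p, base) for p in frequencies if p > 0)


BASE = 21  # assumed: 20 amino acids plus one gap symbol

mostly_gaps = one_minus_uncertainty([0.9, 0.1], BASE)  # 90% gaps, 10% W
ala_gly_mix = one_minus_uncertainty([0.5, 0.5], BASE)  # 50% A, 50% G
print(mostly_gaps > ala_gly_mix)
```

The gap-dominated position scores higher because its frequency distribution is more skewed, which is exactly the misleading outcome described above.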

gap_mode = "include" will result in all gap characters being recoded to TabularMSA.dtype.default_gap_char. Because no conservation metrics that we are aware of consider different gap characters differently (e.g., none of the metrics described in [1]), they are all treated the same within this method.
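The recoding step can be pictured as a simple substitution. In this sketch `recode_gaps`, `GAP_CHARS`, and `DEFAULT_GAP` are hypothetical stand-ins, and the two gap characters shown are assumed for illustration; the actual characters depend on the alignment's dtype.

```python
GAP_CHARS = {"-", "."}  # assumed set of gap characters
DEFAULT_GAP = "-"       # assumed default gap character


def recode_gaps(column):
    """Replace every gap character in `column` with DEFAULT_GAP."""
    return "".join(DEFAULT_GAP if ch in GAP_CHARS else ch for ch in column)


recoded = recode_gaps("A.-C")
print(recoded)
```

After recoding, a metric that counts symbol frequencies sees a single gap symbol rather than several distinct ones.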

The inverse_shannon_uncertainty metric is one minus Shannon’s uncertainty (the complement, not the reciprocal), so that larger values imply higher conservation. Shannon’s uncertainty is also referred to as Shannon’s entropy, but when making computations from symbols, as is done here, “uncertainty” is the preferred term ([2]).
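As a rough illustration of this definition, the following sketch computes one minus Shannon’s uncertainty for a single alignment column using only the standard library. The function name and the `alphabet_size` parameter are hypothetical. Taking logarithms in base `alphabet_size`, so that scores fall between 0 and 1, is an assumption of this sketch rather than something stated in this documentation.

```python
import math
from collections import Counter


def inverse_shannon_uncertainty(column, alphabet_size):
    """Return 1 - H(column), with H computed in log base `alphabet_size`.

    Hypothetical helper for illustration; not part of scikit-bio.
    """
    counts = Counter(column)
    total = sum(counts.values())
    # Shannon's uncertainty: H = -sum(p * log(p)) over observed symbols.
    uncertainty = -sum(
        (n / total) * math.log(n / total, alphabet_size)
        for n in counts.values()
    )
    return 1.0 - uncertainty


# Assumed alphabet of four definite DNA characters.
fully_conserved = inverse_shannon_uncertainty("AAAA", 4)
evenly_mixed = inverse_shannon_uncertainty("ACGT", 4)
print(fully_conserved, evenly_mixed)
```

Under these assumptions a column with a single repeated character scores at the top of the range, and a column in which every character of the alphabet appears equally often scores at the bottom.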

References

[1]

Valdar WS. Scoring residue conservation. Proteins. (2002)

[2]

Schneider T. Pitfalls in information theory (website, ca. 2015). https://schneider.ncifcrf.gov/glossary.html#Shannon_entropy