skbio.alignment.TabularMSA.conservation
- TabularMSA.conservation(metric='inverse_shannon_uncertainty', degenerate_mode='error', gap_mode='nan')
Apply metric to compute conservation for all alignment positions.
- Parameters:
- metric{‘inverse_shannon_uncertainty’}, optional
Metric that should be applied for computing conservation. Resulting values should be larger when a position is more conserved.
- degenerate_mode{‘nan’, ‘error’}, optional
Mode for handling positions with degenerate characters. If "nan", positions with degenerate characters will be assigned a conservation score of np.nan. If "error", an error will be raised if one or more degenerate characters are present.
- gap_mode{‘nan’, ‘ignore’, ‘error’, ‘include’}, optional
Mode for handling positions with gap characters. If "nan", positions with gaps will be assigned a conservation score of np.nan. If "ignore", positions with gaps will be filtered to remove gaps before metric is applied. If "error", an error will be raised if one or more gap characters are present. If "include", conservation will be computed on alignment positions with gaps included. In this case, it is up to the metric to ensure that gaps are handled as they should be or to raise an error if gaps are not supported by that metric.
- Returns:
- np.array of floats
Values resulting from the application of metric to each position in the alignment.
- Raises:
- ValueError
If an unknown metric, degenerate_mode or gap_mode is provided.
- ValueError
If any degenerate characters are present in the alignment when degenerate_mode is "error".
- ValueError
If any gaps are present in the alignment when gap_mode is "error".
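The interaction of degenerate_mode and gap_mode described above can be sketched for a single alignment position in plain Python. This is a minimal illustration, not scikit-bio's implementation: the helper name prepare_column, the gap set {"-", "."}, the IUPAC DNA degenerate set, and the choice of "-" as the default gap character are all assumptions made for this example, and the checks are applied per position rather than over the whole alignment.

```python
# Hypothetical helper for illustration only; not part of scikit-bio.
GAP_CHARS = {"-", "."}                  # assumed gap characters
DEGENERATE_CHARS = set("RYSWKMBDHVN")   # assumed IUPAC DNA degenerate codes
DEFAULT_GAP_CHAR = "-"                  # assumed default gap character


def prepare_column(column, degenerate_mode="error", gap_mode="nan"):
    """Return the characters a metric would score, or None for a NaN score."""
    if degenerate_mode not in {"nan", "error"}:
        raise ValueError(f"Unknown degenerate_mode: {degenerate_mode!r}")
    if gap_mode not in {"nan", "ignore", "error", "include"}:
        raise ValueError(f"Unknown gap_mode: {gap_mode!r}")

    if any(c in DEGENERATE_CHARS for c in column):
        if degenerate_mode == "error":
            raise ValueError("Degenerate characters are present.")
        return None  # "nan": this position gets a score of NaN

    if any(c in GAP_CHARS for c in column):
        if gap_mode == "error":
            raise ValueError("Gap characters are present.")
        if gap_mode == "nan":
            return None
        if gap_mode == "ignore":
            # Drop gaps before the metric is applied.
            return [c for c in column if c not in GAP_CHARS]
        # "include": keep gaps, recoded to one shared gap character.
        return [DEFAULT_GAP_CHAR if c in GAP_CHARS else c for c in column]

    return list(column)


print(prepare_column("AC-A", gap_mode="ignore"))
print(prepare_column("A.-A", gap_mode="include"))
```

With the defaults, a position containing a gap yields None (standing in for np.nan), while a position containing a degenerate character raises ValueError.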
Notes
Users should be careful interpreting results when gap_mode = "include" as the results may be misleading. For example, as pointed out in [1], a protein alignment position composed of 90% gaps and 10% tryptophans would score as more highly conserved than a position composed of alanine and glycine in equal frequencies with the "inverse_shannon_uncertainty" metric.
gap_mode = "include" will result in all gap characters being recoded to TabularMSA.dtype.default_gap_char. Because no conservation metrics that we are aware of consider different gap characters differently (e.g., none of the metrics described in [1]), they are all treated the same within this method.
The inverse_shannon_uncertainty metric is simply one minus Shannon’s uncertainty metric. This method uses the inverse of Shannon’s uncertainty so that larger values imply higher conservation. Shannon’s uncertainty is also referred to as Shannon’s entropy, but when making computations from symbols, as is done here, “uncertainty” is the preferred term ([2]).
References
[2] Schneider T. Pitfalls in information theory (website, ca. 2015). https://schneider.ncifcrf.gov/glossary.html#Shannon_entropy
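The caution in the Notes about gap_mode = "include" can be checked with a small standalone calculation of one minus Shannon's uncertainty. This is a sketch under stated assumptions, not the library's code: the function below is a hypothetical re-implementation, and the logarithm base of 21 (20 amino acids plus one gap symbol) is an assumption chosen so that scores fall between 0 and 1.

```python
import math
from collections import Counter


def inverse_shannon_uncertainty(column, base):
    """One minus the Shannon uncertainty of the characters in ``column``."""
    counts = Counter(column)
    total = len(column)
    uncertainty = -sum(
        (n / total) * math.log(n / total, base) for n in counts.values()
    )
    return 1.0 - uncertainty


BASE = 21  # assumed: 20 amino acids plus a single gap character

# Position with 90% gaps and 10% tryptophan, gaps counted as a symbol.
gappy = inverse_shannon_uncertainty("-" * 9 + "W", BASE)
# Position with alanine and glycine in equal frequencies.
mixed = inverse_shannon_uncertainty("AAAAAGGGGG", BASE)

print(gappy > mixed)  # True
```

Under these assumptions the mostly-gap position scores higher than the alanine/glycine position, which is the misleading behavior the Notes describe.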