skbio.stats.distance.permanova

skbio.stats.distance.permanova(distmat, grouping, column=None, permutations=999, seed=None)

Test for significant differences between groups using PERMANOVA.

Permutational Multivariate Analysis of Variance (PERMANOVA) is a non-parametric method that tests whether two or more groups of objects (e.g., samples) are significantly different based on a categorical factor. It is conceptually similar to ANOVA except that it operates on distances between objects via a distance matrix, which allows for multivariate analysis. Unlike classical Multivariate Analysis of Variance (MANOVA), PERMANOVA makes no assumptions about the distribution of the underlying data. As such, rather than computing a true F statistic based on known distributions of variables, it computes a pseudo-F statistic whose significance can be assessed by a permutation test.

The pseudo-F statistic is the ratio of between-group variance to within-group variance, defined in [1] analogously to the F statistic in ANOVA:

\[F = \frac{{SS}_{between}/(g - 1)}{{SS}_{within}/(n - g)}\]

It is computed from the sums of squares \({SS}_{between}\) and \({SS}_{within}\) divided by their corresponding degrees of freedom, where \(n\) is the number of distinct objects and \(g\) is the number of groups.
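The computation above can be sketched with NumPy alone. This is a minimal illustration, not this function's actual implementation: it assumes the distance-based sums of squares described in [1], where \({SS}_{total}\) is the sum of squared pairwise distances divided by \(n\), \({SS}_{within}\) is the same quantity computed per group and divided by each group's size, and \({SS}_{between} = {SS}_{total} - {SS}_{within}\). The helper name `pseudo_f` and the toy data are hypothetical.

```python
import numpy as np


def pseudo_f(dist, labels):
    """Pseudo-F from a square, symmetric distance matrix and group labels.

    Illustrative sketch only; assumes the sums of squares from [1].
    """
    sq_dist = np.asarray(dist, dtype=float) ** 2
    groups, codes = np.unique(np.asarray(labels), return_inverse=True)
    n = sq_dist.shape[0]  # number of objects
    g = len(groups)       # number of groups

    # Total sum of squares: each unordered pair counted once, divided by n.
    ss_total = np.triu(sq_dist, k=1).sum() / n

    # Within-group sum of squares: same idea, restricted to each group.
    ss_within = 0.0
    for k in range(g):
        idx = np.flatnonzero(codes == k)
        ss_within += np.triu(sq_dist[np.ix_(idx, idx)], k=1).sum() / len(idx)

    ss_between = ss_total - ss_within
    return (ss_between / (g - 1)) / (ss_within / (n - g))


# Toy data: four points on a line, two well-separated groups.
coords = np.array([0.0, 1.0, 10.0, 11.0])
dist = np.abs(coords[:, None] - coords[None, :])  # Euclidean distances in 1-D
labels = ["a", "a", "b", "b"]

f_obs = pseudo_f(dist, labels)
print(f_obs)  # 200.0, the same value as the one-way ANOVA F on the coordinates
```

With Euclidean distances on a single variable, the pseudo-F statistic coincides with the classical ANOVA F statistic, which makes this toy case easy to check by hand.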

Statistical significance is assessed via a permutation test. Objects in the distance matrix are assigned to groups (grouping) based on a categorical factor. This assignment of groups is permuted a number of times (controlled via permutations), and a pseudo-F statistic is computed for each permutation. Under the null hypothesis that the groupings of objects have no effect on the distribution of the underlying data, the pseudo-F statistics of these permutations should be identically distributed for a given distance matrix. The p-value, i.e., the probability under the null hypothesis of obtaining a pseudo-F statistic at least as extreme as the observed one, is then estimated as the proportion of permuted pseudo-F statistics (\(F^{\pi}\)) that are greater than or equal to the observed (unpermuted) one (\(F\)), with the observed grouping itself counted as one permutation:

\[p = \frac{1 + \text{no. of } F^{\pi} \geq F}{1 + \text{no. of permutations}}\]
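The permutation test can be sketched as follows. This is a hedged, self-contained illustration under the same assumptions as the formulas above, not the library's code: the names `pseudo_f` and `permutation_test` are hypothetical, the sums of squares follow [1], and only an integer (or `None`) seed is handled.

```python
import numpy as np


def pseudo_f(sq_dist, codes, n_groups):
    """Pseudo-F from squared distances and integer group codes (sketch)."""
    n = sq_dist.shape[0]
    ss_total = np.triu(sq_dist, k=1).sum() / n
    ss_within = 0.0
    for k in range(n_groups):
        idx = np.flatnonzero(codes == k)
        ss_within += np.triu(sq_dist[np.ix_(idx, idx)], k=1).sum() / len(idx)
    ss_between = ss_total - ss_within
    return (ss_between / (n_groups - 1)) / (ss_within / (n - n_groups))


def permutation_test(dist, labels, permutations=999, seed=None):
    """Return (observed pseudo-F, permutation p-value)."""
    rng = np.random.default_rng(seed)
    sq_dist = np.asarray(dist, dtype=float) ** 2
    groups, codes = np.unique(np.asarray(labels), return_inverse=True)
    f_obs = pseudo_f(sq_dist, codes, len(groups))

    if permutations == 0:
        # No permutations requested: significance is not assessed.
        return f_obs, np.nan

    # Count permuted statistics at least as large as the observed one.
    hits = 0
    for _ in range(permutations):
        shuffled = rng.permutation(codes)  # shuffle group membership only
        if pseudo_f(sq_dist, shuffled, len(groups)) >= f_obs:
            hits += 1

    # The observed grouping counts as one permutation, hence the +1 terms.
    return f_obs, (1 + hits) / (1 + permutations)


# Toy data: two clearly separated groups of ten points each.
data_rng = np.random.default_rng(0)
coords = np.concatenate([data_rng.normal(0.0, 1.0, 10),
                         data_rng.normal(10.0, 1.0, 10)])
dist = np.abs(coords[:, None] - coords[None, :])
labels = ["a"] * 10 + ["b"] * 10

f_obs, p_value = permutation_test(dist, labels, permutations=999, seed=42)
```

Only the group labels are shuffled; the distance matrix itself is never recomputed, which is why the test is inexpensive relative to computing the distances.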
Parameters:
distmat : DistanceMatrix

Distance matrix containing distances between objects (e.g., distances between samples of microbial communities).

Changed in version 0.7.0: Renamed from distance_matrix. The old name is kept as an alias.

grouping : 1-D array_like or pandas.DataFrame

Vector indicating the assignment of objects to groups. For example, these could be strings or integers denoting which group an object belongs to. If grouping is 1-D array_like, it must be the same length and in the same order as the objects in distmat. If grouping is a DataFrame, the column specified by column will be used as the grouping vector. The DataFrame must be indexed by the IDs in distmat (i.e., the row labels must be distance matrix IDs), but the order of IDs between distmat and the DataFrame need not be the same. All IDs in the distance matrix must be present in the DataFrame. Extra IDs in the DataFrame are allowed (they are ignored in the calculations).
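The DataFrame handling described above can be pictured with a small pandas sketch. It only illustrates the documented behavior (lookup by ID, order independence, extra IDs ignored, missing IDs rejected) and is not the function's internal code; the IDs, the `site` column, and the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical distance matrix IDs, in distance matrix order.
ids = ["s1", "s2", "s3", "s4"]

# Metadata indexed by ID, in a different order and with one extra ID.
metadata = pd.DataFrame(
    {"site": ["gut", "soil", "gut", "soil", "soil"]},
    index=["s3", "s1", "s2", "s4", "extra"],
)

# Every distance matrix ID must be present in the DataFrame.
missing = [i for i in ids if i not in metadata.index]
if missing:
    raise ValueError(f"IDs missing from the grouping DataFrame: {missing}")

# Select the grouping column and reorder it to match the distance matrix.
# The row labeled "extra" is simply not selected.
grouping = metadata.loc[ids, "site"].to_numpy()
print(list(grouping))  # ['soil', 'gut', 'gut', 'soil']
```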

column : str, optional

Column name to use as the grouping vector if grouping is a DataFrame. Must be provided if grouping is a DataFrame. Cannot be provided if grouping is 1-D array_like.

permutations : int, optional

Number of permutations to use when assessing statistical significance. Must be greater than or equal to zero. If zero, statistical significance calculations will be skipped and the p-value will be np.nan.

seed : int, Generator or RandomState, optional

A user-provided random seed or random generator instance. See details.

Added in version 0.6.3.

Returns:
pandas.Series

Results of the statistical test, including test statistic and p-value.

Notes

See [1] for the original method reference, as well as vegan::adonis, available in R’s vegan package [2].

The precision of the p-value depends on the number of permutations. The default precision is \(0.001=1/(1+999)\), which follows from the default value permutations=999. The unpermuted grouping always contributes the first permutation to the numerator and denominator of the p-value, so 1 is added to both. This prevents the estimated p-value from being exactly zero, which could otherwise happen by chance even though the true probability is nonzero. It is suggested in [1] that at least 1000 permutations should be performed for a significance level of 0.05, and 5000 permutations should be performed for a significance level of 0.01. The p-value will be np.nan if permutations is zero.
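As a quick arithmetic check of the relationship between the number of permutations and the resolution of the p-value, the smallest attainable p-value can be computed directly from the formula above. The helper name `smallest_p_value` is hypothetical and is not part of the library.

```python
def smallest_p_value(permutations):
    """Smallest p-value attainable with a given number of permutations."""
    if permutations < 0:
        raise ValueError("permutations must be greater than or equal to zero")
    if permutations == 0:
        # No permutations: the p-value is undefined (reported as NaN).
        return float("nan")
    # Reached when no permuted statistic is >= the observed one.
    return 1 / (1 + permutations)


print(smallest_p_value(999))   # 0.001
print(smallest_p_value(4999))  # 0.0002
```

P-values can only take values that are multiples of this step, so a threshold such as 0.01 is resolved far more finely with 4999 permutations than with 99.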

A related statistic reported by some implementations (such as vegan::adonis) is the \(R^2\) value, which describes the proportion of variance in the data explained by the grouping:

\[R^2 = \frac{{SS}_{between}}{{SS}_{total}}\]

This is not currently computed by this function, but it may be derived from the outputs using the following formula:

\[R^2 = \frac{1}{1 + \frac{n - g}{(g - 1)F}}\]

where \(F\) is the pseudo-F statistic, \(n\) is the number of objects, and \(g\) is the number of groups.
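The conversion from \(F\) to \(R^2\) is a one-line calculation. The sketch below uses a hypothetical helper `r_squared`; its inputs correspond to the pseudo-F statistic, the number of objects, and the number of groups reported by the test, and how those values are labeled in the returned Series should be checked against the actual output.

```python
def r_squared(f_stat, n, g):
    """Proportion of variance explained, derived from the pseudo-F statistic.

    f_stat: pseudo-F statistic, n: number of objects, g: number of groups.
    """
    return 1.0 / (1.0 + (n - g) / ((g - 1) * f_stat))


# Toy check: with SS_between = 100 and SS_within = 1 for n = 4, g = 2,
# F = (100 / 1) / (1 / 2) = 200 and R^2 should equal 100 / 101.
r2 = r_squared(200.0, n=4, g=2)
```

The formula follows from \({SS}_{total} = {SS}_{between} + {SS}_{within}\) together with the definition of \(F\), which gives \({SS}_{within}/{SS}_{between} = (n - g)/((g - 1)F)\).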

References

[1]

Anderson, Marti J. “A new method for non-parametric multivariate analysis of variance.” Austral Ecology 26.1 (2001): 32-46.

Examples

See skbio.stats.distance.anosim for usage examples (both functions provide similar interfaces).