skbio.stats.distance.permdisp

skbio.stats.distance.permdisp(distmat, grouping, column=None, test='median', permutations=999, method='eigh', dimensions=10, seed=None, warn_neg_eigval=0.01)

Test for Homogeneity of Multivariate Group Dispersions.

PERMDISP is a multivariate analog of Levene’s test for homogeneity of variances. Distances are handled by reducing the original distances to principal coordinates. PERMDISP calculates an F-statistic to assess whether the differences in dispersion between groups are significant.
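The F-statistic mentioned above can be illustrated with a minimal sketch. It assumes that each object's distance to its group center (centroid or spatial median) in principal coordinate space has already been computed, and it applies a standard one-way ANOVA F computation to those distances. The function name `f_statistic` and the distance values are hypothetical, chosen only for illustration; scikit-bio's internal implementation may differ.

```python
def f_statistic(groups):
    """One-way ANOVA F-statistic for a list of groups of values.

    Each inner list holds hypothetical distances from the objects of one
    group to that group's center.
    """
    k = len(groups)                       # number of groups
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total

    # Between-group and within-group sums of squares.
    ss_between = 0.0
    ss_within = 0.0
    for g in groups:
        mean_g = sum(g) / len(g)
        ss_between += len(g) * (mean_g - grand_mean) ** 2
        ss_within += sum((d - mean_g) ** 2 for d in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within


# Made-up distances to the group center for two groups of three objects.
f_value = f_statistic([[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]])
print(f_value)  # 2.4
```

A larger F value means the average distance to the group center differs more between groups than would be expected from the spread within groups.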

Parameters:

distmat (DistanceMatrix or OrdinationResults): Distance matrix containing distances between objects (e.g., distances between samples of microbial communities) or result of pcoa on such a matrix.

Changed in version 0.7.0: Renamed from distance_matrix. The old name is kept as an alias.
grouping (1-D array_like or pandas.DataFrame): Vector indicating the assignment of objects to groups. For example, these could be strings or integers denoting which group an object belongs to. If grouping is 1-D array_like, it must be the same length and in the same order as the objects in distmat. If grouping is a DataFrame, the column specified by column will be used as the grouping vector. The DataFrame must be indexed by the IDs in distmat (i.e., the row labels must be distance matrix IDs), but the order of IDs between distmat and the DataFrame need not be the same. All IDs in the distance matrix must be present in the DataFrame. Extra IDs in the DataFrame are allowed (they are ignored in the calculations).
column (str, optional): Column name to use as the grouping vector if grouping is a DataFrame. Must be provided if grouping is a DataFrame. Cannot be provided if grouping is 1-D array_like.
test ({‘centroid’, ‘median’}, optional): Determines whether the analysis is done using centroid or spatial median (default).
permutations (int, optional): Number of permutations to use when assessing statistical significance. Must be greater than or equal to zero. If zero, statistical significance calculations will be skipped and the p-value will be np.nan.
method ({‘eigh’, ‘fsvd’}, optional): Matrix decomposition method to use. Options are “eigh” (eigendecomposition, default) and “fsvd” (fast singular value decomposition). See skbio.stats.ordination.pcoa for details. Not used if distmat is an OrdinationResults object.
dimensions (int, optional): Number of dimensions to reduce the distance matrix to if using the fsvd method. Not used if the eigh method is selected.

Changed in version 0.7.0: Renamed from number_of_dimensions. The old name is kept as an alias.
seed (int, Generator or RandomState, optional): A user-provided random seed or random generator instance. See details.

Added in version 0.6.3.
warn_neg_eigval (bool or float, optional): Raise a warning if any negative eigenvalue of large magnitude is generated during PCoA. See skbio.stats.ordination.pcoa for details.

Added in version 0.6.3.

Returns:

pandas.Series: Results of the statistical test, including test statistic and p-value.

Raises:

TypeError: If the spatial median test is used and the PCoA ordination is not of type np.float32 or np.float64. The spatial median function fails in that case, and the centroid test should be used instead.
ValueError: If the test is not centroid or median, or if method is not eigh or fsvd.
TypeError: If the distance matrix is not an instance of DistanceMatrix.
ValueError: If there is only one group.
ValueError: If a list and a column name are both provided.
ValueError: If a list is provided for grouping and its length does not match the number of IDs in distmat.
ValueError: If all of the values in the grouping vector are unique.
KeyError: If there are ids in grouping that are not in distmat.

See also

permanova
anosim

Notes

This function uses parallel computation for improved performance. See the parallelization guide for information on controlling the number of threads used.

This function uses Marti Anderson’s PERMDISP2 procedure.

The significance of the results from this function will be the same as that found with vegan’s betadisper. However, due to floating point variability, the F-statistic may vary slightly.

See [1] for the original method reference, as well as vegan::betadisper, available in R’s vegan package [2].

References

[1] Anderson, M. J. (2006). Distance-based tests for homogeneity of multivariate dispersions. Biometrics, 62(1), 245-253.

[2] http://cran.r-project.org/web/packages/vegan/index.html

Examples

Load a 6x6 distance matrix and grouping vector denoting 2 groups of objects:

>>> from skbio import DistanceMatrix
>>> dm = DistanceMatrix([[0,    0.5,  0.75, 1, 0.66, 0.33],
...                       [0.5,  0,    0.25, 0.33, 0.77, 0.61],
...                       [0.75, 0.25, 0,    0.1, 0.44, 0.55],
...                       [1,    0.33, 0.1,  0, 0.75, 0.88],
...                       [0.66, 0.77, 0.44, 0.75, 0, 0.77],
...                       [0.33, 0.61, 0.55, 0.88, 0.77, 0]],
...                       ['s1', 's2', 's3', 's4', 's5', 's6'])
>>> grouping = ['G1', 'G1', 'G1', 'G2', 'G2', 'G2']

Run PERMDISP using 99 permutations to calculate the p-value. The seed makes the output deterministic; omit it if reproducibility is not needed.

>>> from skbio.stats.distance import permdisp
>>> permdisp(dm, grouping, permutations=99, seed=42)
method name               PERMDISP
test statistic name        F-value
sample size                      6
number of groups                 2
test statistic             1.03296
p-value                       ...
number of permutations          99
Name: PERMDISP results, dtype: object

The return value is a pandas.Series object containing the results of the statistical test.

To suppress calculation of the p-value and only obtain the F statistic, specify zero permutations:

>>> permdisp(dm, grouping, permutations=0)
method name               PERMDISP
test statistic name        F-value
sample size                      6
number of groups                 2
test statistic             1.03296
p-value                        NaN
number of permutations           0
Name: PERMDISP results, dtype: object
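To make the role of the permutations concrete, the following sketch shows a common way a permutation p-value is derived from an observed statistic and the statistics obtained after shuffling the group labels. The function name `permutation_p_value` and the permuted values are hypothetical and only illustrate the arithmetic; with no permutations there is nothing to compare against, so the sketch returns NaN, matching the behavior shown above.

```python
import math


def permutation_p_value(observed, permuted_stats):
    """Fraction of statistics at least as extreme as the observed one.

    The observed statistic counts as one of the arrangements, hence the
    "+ 1" in both the numerator and the denominator.
    """
    if not permuted_stats:
        # Zero permutations: significance cannot be assessed.
        return math.nan
    hits = sum(1 for s in permuted_stats if s >= observed)
    return (hits + 1) / (len(permuted_stats) + 1)


# Made-up F values from 6 label permutations; one of them exceeds 3.67.
p = permutation_p_value(3.67, [0.5, 1.2, 4.1, 0.9, 2.0, 0.3])
print(round(p, 6))  # 0.285714
```

Under this convention the smallest attainable p-value is 1 / (permutations + 1), so a small number of permutations limits how significant a result can appear.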

PERMDISP measures dispersion relative to one of two kinds of group center: the centroid or the spatial median (also commonly referred to as the geometric median). The spatial median is thought to yield a more robust test statistic, and this test is used by default. Spatial medians are computed using an iterative algorithm that finds the point minimizing the sum of distances to all points in a group, while centroids are computed using a deterministic formula. As such, the two tests yield slightly different F statistics.

>>> permdisp(dm, grouping, test='centroid', permutations=6, seed=42)
method name               PERMDISP
test statistic name        F-value
sample size                      6
number of groups                 2
test statistic            3.670816
p-value                   0.285714
number of permutations           6
Name: PERMDISP results, dtype: object
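The difference between the two kinds of center can be sketched in plain Python. The example below uses Weiszfeld's algorithm, one common iterative scheme for the geometric median; it is an illustrative assumption and not necessarily the algorithm scikit-bio uses internally. The function names and the point coordinates are hypothetical.

```python
import math


def centroid(points):
    """Coordinate-wise mean of the points (closed-form)."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def spatial_median(points, max_iter=1000, tol=1e-12, eps=1e-12):
    """Geometric median via Weiszfeld's iteratively reweighted averaging."""
    current = centroid(points)  # start from the centroid
    for _ in range(max_iter):
        # Weight each point by the inverse of its distance to the current
        # estimate; eps guards against division by zero.
        weights = [1.0 / max(math.dist(current, p), eps) for p in points]
        total = sum(weights)
        updated = [
            sum(w * p[d] for w, p in zip(weights, points)) / total
            for d in range(len(current))
        ]
        if math.dist(updated, current) < tol:
            return updated
        current = updated
    return current


# Four points on a unit square plus one distant outlier (made-up data).
pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (10.0, 10.0)]
c = centroid(pts)
m = spatial_median(pts)
print(c)  # the outlier pulls the centroid to (2.4, 2.4)
print(m)  # the spatial median stays inside the square, near (0.789, 0.789)
```

The outlier moves the centroid well outside the cluster of four points, while the spatial median remains within it, which is the sense in which the median-based test is considered more robust.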

You can also provide a pandas.DataFrame and a column denoting the grouping instead of a grouping vector. The following DataFrame’s Grouping column specifies the same grouping as the vector we used in the previous examples:

>>> import pandas as pd
>>> df = pd.DataFrame.from_dict(
...      {'Grouping': {'s1': 'G1', 's2': 'G1', 's3': 'G1', 's4': 'G2',
...                    's5': 'G2', 's6': 'G2'}})
>>> permdisp(dm, df, 'Grouping', permutations=6, test='centroid', seed=42)
method name               PERMDISP
test statistic name        F-value
sample size                      6
number of groups                 2
test statistic            3.670816
p-value                   0.285714
number of permutations           6
Name: PERMDISP results, dtype: object

Note that when providing a DataFrame, the ordering of rows and/or columns does not affect the grouping vector that is extracted. The DataFrame must be indexed by the distance matrix IDs (i.e., the row labels must be distance matrix IDs).

If IDs (rows) are present in the DataFrame but not in the distance matrix, they are ignored. For example, if the DataFrame above contained an additional s7 row, only the 6 objects in the distance matrix would be used in the test, as the “sample size” row in the results would confirm. Thus, the DataFrame can be a superset of the distance matrix IDs. Note that the reverse is not true: IDs in the distance matrix must be present in the DataFrame or an error will be raised.
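The alignment rules described above can be sketched with pandas alone. This is only an illustration of the documented behavior (row order is irrelevant, extra IDs are ignored, missing IDs are an error), not scikit-bio's internal code; the s7 row and the shuffled row order are made up for the example.

```python
import pandas as pd

# IDs in the order they appear in the distance matrix.
ids = ['s1', 's2', 's3', 's4', 's5', 's6']

# Metadata with shuffled rows and one extra ID (s7) absent from the matrix.
df = pd.DataFrame(
    {'Grouping': ['G2', 'G1', 'G1', 'G1', 'G2', 'G1', 'G2']},
    index=['s6', 's3', 's7', 's1', 's5', 's2', 's4'],
)

# Every distance matrix ID must be present in the DataFrame.
missing = [i for i in ids if i not in df.index]
if missing:
    raise ValueError(f"IDs missing from the DataFrame: {missing}")

# Select rows by label in distance matrix order; s7 is simply not selected.
grouping = df.loc[ids, 'Grouping'].tolist()
print(grouping)  # ['G1', 'G1', 'G1', 'G2', 'G2', 'G2']
```

The extracted vector has one entry per distance matrix ID, so the sample size used by the test is 6 even though the DataFrame has 7 rows.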

PERMDISP should be used to determine whether the dispersions of the groups in your distance matrix differ significantly. A non-significant test result indicates that group dispersions are similar to each other. PERMANOVA or ANOSIM should then be used in conjunction to determine whether clustering within groups is significant.