
skbio.stats.distance.bioenv

skbio.stats.distance.bioenv(distance_matrix, data_frame, columns=None)

Find subset of variables maximally correlated with distances.

Finds subsets of variables whose Euclidean distances (after scaling the variables; see Notes section below for details) are maximally rank-correlated with the distance matrix. For example, the distance matrix might contain distances between communities, and the variables might be numeric environmental variables (e.g., pH). Correlation between the community distance matrix and Euclidean environmental distance matrix is computed using Spearman’s rank correlation coefficient (\(\rho\)).

Subsets of environmental variables range in size from 1 to the total number of variables (inclusive). For example, if there are 3 variables, the “best” variable subsets will be computed for subset sizes 1, 2, and 3.

The “best” subset is chosen by computing the correlation between the community distance matrix and all possible Euclidean environmental distance matrices at the given subset size. The combination of environmental variables with maximum correlation is chosen as the “best” subset.
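The exhaustive search described above can be sketched in a few lines. This is a minimal illustration, not scikit-bio's implementation: the helper names (`condensed_euclidean`, `spearman_rho`, `best_subsets_sketch`) are hypothetical, the community distances are assumed to arrive as a condensed vector in the same pair order as the environmental distances, and scaling is assumed to use the sample standard deviation.

```python
from itertools import combinations

import numpy as np
import pandas as pd


def condensed_euclidean(values):
    """Return pairwise Euclidean distances between rows (upper triangle)."""
    diffs = values[:, None, :] - values[None, :, :]
    full = np.sqrt((diffs ** 2).sum(axis=-1))
    return full[np.triu_indices(len(values), k=1)]


def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of tie-averaged ranks."""
    rank_x = pd.Series(x).rank().to_numpy()
    rank_y = pd.Series(y).rank().to_numpy()
    return float(np.corrcoef(rank_x, rank_y)[0, 1])


def best_subsets_sketch(community_dists, env):
    """Hypothetical helper: best variable subset at each subset size.

    community_dists: condensed community distances (upper triangle).
    env: DataFrame of numeric variables, rows in the same object order.
    """
    # Assumption: center each column and divide by its sample std (ddof=1).
    scaled = (env - env.mean()) / env.std()
    rows = []
    for size in range(1, scaled.shape[1] + 1):
        best_rho, best_cols = -np.inf, None
        # Score every combination of variables at this subset size.
        for cols in combinations(scaled.columns, size):
            env_dists = condensed_euclidean(scaled[list(cols)].to_numpy())
            rho = spearman_rho(community_dists, env_dists)
            if rho > best_rho:
                best_rho, best_cols = rho, cols
        rows.append((", ".join(best_cols), size, best_rho))
    result = pd.DataFrame(rows, columns=["vars", "size", "correlation"])
    return result.set_index("vars")
```

Applied to the four-sample data in the Examples section, with the community distances flattened in the order A-B, A-C, A-D, B-C, B-D, C-D, this sketch selects pH at subset size 1.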

Parameters:
distance_matrix : DistanceMatrix

Distance matrix containing distances between objects (e.g., distances between samples of microbial communities).

data_frame : pandas.DataFrame

Contains columns of variables (e.g., numeric environmental variables such as pH) associated with the objects in distance_matrix. Must be indexed by the IDs in distance_matrix (i.e., the row labels must be distance matrix IDs), but the order of IDs between distance_matrix and data_frame need not be the same. All IDs in the distance matrix must be present in data_frame. Extra IDs in data_frame are allowed (they are ignored in the calculations).

columns : iterable of strs, optional

Column names in data_frame to include as variables in the calculations. If not provided, defaults to all columns in data_frame. The values in each column must be numeric or convertible to a numeric type.

Returns:
pandas.DataFrame

Data frame containing the “best” subset of variables at each subset size, as well as the correlation coefficient of each.

Raises:
TypeError

If invalid input types are provided, or if one or more specified columns in data_frame are not numeric.

ValueError

If column name(s) or distance_matrix IDs cannot be found in data_frame, if there is missing data (NaN) in the environmental variables, or if the environmental variables cannot be scaled (e.g., due to zero variance).

Notes

See [1] for the original description of the method, which was introduced under the name BIO-ENV. The general algorithm and interface are similar to vegan::bioenv, available in R’s vegan package [2]. The method is also available in PRIMER-E [3], where it is now called BEST.

Warning

This method can take a long time to run if a large number of variables are specified, as all possible subsets are evaluated at each subset size.
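To make the cost concrete: evaluating every subset at sizes 1 through n means scoring 2^n - 1 candidate subsets in total. The small helper below (a hypothetical name, not part of scikit-bio) counts them.

```python
from math import comb


def count_candidate_subsets(n_variables):
    """Number of non-empty variable subsets, summed over all subset sizes."""
    return sum(comb(n_variables, size) for size in range(1, n_variables + 1))


for n in (3, 10, 20):
    print(n, count_candidate_subsets(n))
# 3 7
# 10 1023
# 20 1048575
```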

The variables are scaled before computing the Euclidean distance: each column is centered and then scaled by its standard deviation.
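A minimal sketch of this scaling step, assuming the divisor is the sample standard deviation (pandas' default, `ddof=1`); `scale_columns` is a hypothetical name. A zero-variance column yields NaN after division, which is the "cannot be scaled" case listed under ValueError.

```python
import pandas as pd


def scale_columns(df):
    """Center each column, then divide by its standard deviation."""
    # Assumption: pandas' default sample standard deviation (ddof=1).
    scaled = (df - df.mean()) / df.std()
    if scaled.isnull().any().any():
        # A constant column has zero standard deviation, giving 0/0 = NaN.
        raise ValueError("Column(s) could not be scaled (e.g., zero variance).")
    return scaled


env = pd.DataFrame({"pH": [7.0, 8.0, 7.5, 8.5],
                    "Elevation": [400, 530, 450, 810]},
                   index=["A", "B", "C", "D"])
scaled = scale_columns(env)  # each column now has mean 0 and std 1
```

Using the population standard deviation instead would multiply every scaled value by the same constant, so the ranks of the Euclidean distances, and therefore the Spearman correlation, would be unchanged.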

References

[1]

Clarke, K. R. & Ainsworth, M. 1993. “A method of linking multivariate community structure to environmental variables”. Marine Ecology Progress Series, 92, 205-219.

Examples

Import the functionality we’ll use in the following examples:

>>> import pandas as pd
>>> from skbio import DistanceMatrix
>>> from skbio.stats.distance import bioenv

Load a 4x4 community distance matrix:

>>> dm = DistanceMatrix([[0.0, 0.5, 0.25, 0.75],
...                      [0.5, 0.0, 0.1, 0.42],
...                      [0.25, 0.1, 0.0, 0.33],
...                      [0.75, 0.42, 0.33, 0.0]],
...                     ['A', 'B', 'C', 'D'])

Load a pandas.DataFrame with two environmental variables, pH and elevation:

>>> df = pd.DataFrame([[7.0, 400],
...                    [8.0, 530],
...                    [7.5, 450],
...                    [8.5, 810]],
...                   index=['A','B','C','D'],
...                   columns=['pH', 'Elevation'])

Note that the data frame is indexed with the same IDs ('A', 'B', 'C', and 'D') that are in the distance matrix. This is necessary in order to link the environmental variables (metadata) to each of the objects in the distance matrix. In this example, the IDs appear in the same order in both the distance matrix and data frame, but this is not necessary.
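To illustrate that last point, the sketch below aligns a data frame with shuffled rows and one extra ID to the distance matrix IDs using plain pandas label-based selection. It only illustrates the idea and is not necessarily how bioenv aligns its inputs internally; the `ids` list stands in for the distance matrix's ID order, and the row for 'E' is made-up filler.

```python
import pandas as pd

ids = ["A", "B", "C", "D"]  # stand-in for the IDs of the distance matrix

# Same metadata as above, in a different row order, plus a hypothetical
# extra ID 'E' that does not appear in the distance matrix.
shuffled = pd.DataFrame([[8.5, 810],
                         [7.2, 120],
                         [7.0, 400],
                         [7.5, 450],
                         [8.0, 530]],
                        index=["D", "E", "A", "C", "B"],
                        columns=["pH", "Elevation"])

# Label-based selection reorders rows to match `ids` and drops the extra 'E'.
aligned = shuffled.loc[ids]
```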

Find the best subsets of environmental variables that are correlated with community distances:

>>> bioenv(dm, df) 
               size  correlation
vars
pH                1     0.771517
pH, Elevation     2     0.714286

We see that in this simple example, pH alone is maximally rank-correlated with the community distances (\(\rho=0.771517\)).