skbio.stats.ordination.cca

skbio.stats.ordination.cca(y, x, scaling=1, sample_ids=None, feature_ids=None, constraint_ids=None, output_format=None)

Compute canonical (also known as constrained) correspondence analysis.

Canonical (or constrained) correspondence analysis is a multivariate ordination technique. It originated in community ecology [1] and relates community composition to variation in the environment (or in other factors). It takes the abundances or counts of features in each sample, together with constraint variables measured on the same samples, and outputs ordination axes that maximize sample separation among species.

It is better suited than linear multivariate methods to extracting the niches of taxa because it assumes unimodal response curves (habitat preferences are often unimodal functions of habitat variables [2]).

As more environmental variables are added, the result gets more similar to unconstrained ordination, so only the variables that are deemed explanatory should be included in the analysis.
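To illustrate why a unimodal response favours this family of methods, here is a minimal NumPy sketch. The gradient, the Gaussian response curves, and the names `optima` and `tolerance` are assumptions made for illustration and are not part of scikit-bio. The abundance-weighted average of the gradient recovers each assumed niche optimum, while the linear correlation for the taxon centred on the gradient is essentially zero.

```python
import numpy as np

# Hypothetical environmental gradient (e.g. moisture) sampled at 101 sites.
gradient = np.linspace(0.0, 10.0, 101)

# Assumed Gaussian response curves: one niche optimum per taxon,
# with a shared tolerance (curve width).
optima = np.array([3.0, 5.0, 7.0])
tolerance = 1.0
abundance = 100.0 * np.exp(
    -((gradient[:, None] - optima) ** 2) / (2.0 * tolerance**2)
)

# Abundance-weighted average of the gradient, one value per taxon.
weighted_optima = (gradient[:, None] * abundance).sum(axis=0) / abundance.sum(axis=0)

# Pearson correlation between the gradient and the taxon whose optimum
# sits in the middle of the sampled range.
linear_r = np.corrcoef(gradient, abundance[:, 1])[0, 1]

print(np.round(weighted_optima, 2))
print(abs(linear_r) < 1e-8)
```

Weighted averaging is the building block of correspondence analysis. A purely linear summary fails to detect the strong dependence of the centred taxon on the gradient.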

Parameters:
y : DataFrame or ndarray

Samples by features table (n, m). Can be numpy, pandas, polars, AnnData, or BIOM (skbio.Table).

x : DataFrame or ndarray

Samples by constraints table (n, q). Can be numpy, pandas, polars, AnnData, or BIOM (skbio.Table).

scaling : int, {1, 2}, optional

Scaling type 1 maintains \(\chi^2\) distances between rows. Scaling type 2 preserves \(\chi^2\) distances between columns. For a more detailed explanation of the interpretation, check Legendre & Legendre 1998, section 9.4.3.

sample_ids : list of str, optional

List of ids of samples. If not provided implicitly by the input DataFrame or explicitly by the user, sample_ids will default to a list of integers starting at zero.

feature_ids : list of str, optional

List of ids of features. If not provided implicitly by y or explicitly by the user, it will default to a list of integers starting at zero.

constraint_ids : list of str, optional

List of ids of constraints. If not provided implicitly by x or explicitly by the user, it will default to a list of integers starting at zero.

output_format : str, optional

The desired format of the output object. Can be pandas, polars, or numpy. Note that all scikit-bio ordination functions return an OrdinationResults object. In this case the attributes of the OrdinationResults object will be in the specified format. Default is pandas.
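The \(\chi^2\) distance mentioned under scaling can be sketched with plain NumPy. The helper `chi2_row_distances` below is hypothetical and is not part of scikit-bio. It uses the usual correspondence-analysis definition: each row is converted to a profile, and squared differences between profiles are weighted by the inverse column masses.

```python
import numpy as np


def chi2_row_distances(table):
    """Pairwise chi-square distances between rows of a non-negative table.

    Illustrative helper (an assumption of this sketch, not a library function).
    """
    table = np.asarray(table, dtype=float)
    freq = table / table.sum()            # relative frequencies
    row_mass = freq.sum(axis=1)           # row marginals
    col_mass = freq.sum(axis=0)           # column marginals
    profiles = freq / row_mass[:, None]   # each row rescaled to sum to 1
    scaled = profiles / np.sqrt(col_mass)
    diff = scaled[:, None, :] - scaled[None, :, :]
    return np.sqrt((diff**2).sum(axis=-1))


# Rows 0 and 1 have the same relative composition; row 2 differs.
counts = np.array([[10, 0, 5], [20, 0, 10], [0, 8, 2]])
dist = chi2_row_distances(counts)
print(np.round(dist, 3))
```

Because the distance is computed on profiles, two samples with the same relative composition are at distance zero regardless of their totals. Applying the same function to the transposed table gives the distances between columns.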

Returns:
OrdinationResults

Object that stores the cca results.

Raises:
ValueError

If x and y have different numbers of rows, if y contains negative values, or if y contains a row of only 0’s.

NotImplementedError

If scaling is not 1 or 2.
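The documented error conditions can be checked up front before calling the function. The sketch below mirrors them in a hypothetical helper, `check_cca_inputs`. It is an illustration of the conditions listed above, not scikit-bio's own validation code, and the error messages are made up.

```python
import numpy as np


def check_cca_inputs(y, x, scaling=1):
    """Raise the documented exception types for invalid input (illustrative)."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    if x.shape[0] != y.shape[0]:
        raise ValueError("x and y must have the same number of rows")
    if y.min() < 0:
        raise ValueError("y must not contain negative values")
    # With negatives already excluded, a zero row sum means an all-zero row.
    if np.any(y.sum(axis=1) == 0):
        raise ValueError("y must not contain a row of only 0's")
    if scaling not in (1, 2):
        raise NotImplementedError("only scaling types 1 and 2 are supported")


good_y = [[1, 2], [3, 4]]
good_x = [[0.5], [1.5]]
check_cca_inputs(good_y, good_x)  # valid input passes silently

try:
    check_cca_inputs([[1, 2], [0, 0]], good_x)  # second sample is empty
except ValueError as err:
    message = str(err)
    print(message)
```

Samples with no counts are usually removed before ordination, since a row profile cannot be computed for them.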

Notes

The algorithm is based on [3], § 11.2, and is expected to give the same results as cca(y, x) in R’s package vegan, except that this implementation won’t drop constraining variables due to perfect collinearity: the user needs to choose which ones to input.
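Since collinear constraints are not dropped automatically, it can help to check the constraints table beforehand. The following is a minimal sketch; the helper name `constraint_rank` and the example variables are assumptions. It compares the rank of the column-centred table with its number of columns. Plain (unweighted) centring is a simplification, but exact linear dependencies and constant columns are detected either way.

```python
import numpy as np


def constraint_rank(x):
    """Return (rank, n_columns) of a constraints table after column centring."""
    x = np.asarray(x, dtype=float)
    centred = x - x.mean(axis=0)
    return int(np.linalg.matrix_rank(centred)), x.shape[1]


# Hypothetical constraints: the third column is the sum of the first two.
depth = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
temperature = np.array([12.0, 15.0, 11.0, 14.0, 13.0, 16.0])
x_collinear = np.column_stack([depth, temperature, depth + temperature])

rank, n_cols = constraint_rank(x_collinear)
if rank < n_cols:
    print(f"{n_cols - rank} constraint(s) are redundant; drop them before calling cca")
```

A rank lower than the number of columns means at least one constraint is a linear combination of the others and should be removed by hand.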

Canonical correspondence analysis shouldn’t be confused with canonical correlation analysis (CCorA, but sometimes called CCA), a different technique to search for multivariate relationships between two datasets. Canonical correlation analysis is a statistical tool that, given two vectors of random variables, finds linear combinations that have maximum correlation with each other. In some sense, it assumes linear responses of “species” to “environmental variables” and is not well suited to analyze ecological data.
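For contrast, canonical correlation analysis can be sketched in a few lines of NumPy. The function below is an illustrative helper, not part of scikit-bio, and it assumes both tables have full column rank. It computes the canonical correlations as the singular values of the product of orthonormal bases of the two centred tables (the QR-based approach).

```python
import numpy as np


def canonical_correlations(a, b):
    """Canonical correlations between two tables with matching rows.

    Illustrative helper; assumes both tables have full column rank.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    qa, _ = np.linalg.qr(a)  # orthonormal basis of the column space of a
    qb, _ = np.linalg.qr(b)
    return np.linalg.svd(qa.T @ qb, compute_uv=False)


rng = np.random.default_rng(0)
a = rng.normal(size=(200, 3))

# b_linear is an exact (invertible) linear mixture of a.
mixing = np.array([[1.0, 2.0, 0.0], [0.0, 1.0, 1.0], [1.0, 0.0, 3.0]])
b_linear = a @ mixing
b_noise = rng.normal(size=(200, 3))  # unrelated to a

r_linear = canonical_correlations(a, b_linear)
r_noise = canonical_correlations(a, b_noise)
print(np.round(r_linear, 6))
print(np.round(r_noise, 3))
```

The method only detects linear relationships: an exact linear mixture yields correlations of 1, while unimodal species responses of the kind assumed by canonical correspondence analysis can go largely undetected.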

References

[1]

Cajo J. F. Ter Braak, “Canonical Correspondence Analysis: A New Eigenvector Technique for Multivariate Direct Gradient Analysis”, Ecology 67.5 (1986), pp. 1167-1179.

[2]

Cajo J. F. Ter Braak and Piet F. M. Verdonschot, “Canonical correspondence analysis and related multivariate methods in aquatic ecology”, Aquatic Sciences 57.3 (1995), pp. 255-289.

[3]

Legendre P. and Legendre L. 1998. Numerical Ecology. Elsevier, Amsterdam.
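
Examples

A minimal usage sketch based on the signature documented above. The site names, species counts, and environmental variables are made up. The import is guarded so that the snippet still runs where scikit-bio is not installed, and the attributes `eigvals`, `samples`, and `features` are assumed to be available on the returned OrdinationResults object.

```python
import pandas as pd

try:
    from skbio.stats.ordination import cca
except ImportError:  # scikit-bio is optional for this sketch
    cca = None

sites = [f"site{i}" for i in range(1, 7)]

# Samples by features: hypothetical counts of four species at six sites.
y = pd.DataFrame(
    [
        [10, 2, 0, 1],
        [8, 4, 1, 0],
        [5, 6, 3, 1],
        [2, 7, 6, 2],
        [1, 3, 9, 5],
        [0, 1, 7, 9],
    ],
    index=sites,
    columns=["sp1", "sp2", "sp3", "sp4"],
)

# Samples by constraints: two hypothetical environmental variables.
x = pd.DataFrame(
    {
        "depth": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
        "pH": [6.5, 7.1, 6.8, 7.4, 6.9, 7.2],
    },
    index=sites,
)

result = None
if cca is not None:
    result = cca(y, x, scaling=1)
    print(result.eigvals)   # eigenvalues of the ordination axes
    print(result.samples)   # site scores
    print(result.features)  # species scores
```

With DataFrame input, the sample, feature, and constraint ids are taken from the index and column labels, so the id parameters can be left at their defaults.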