Ordination methods (skbio.stats.ordination)

This module provides functions for ordination, a category of methods that aim to arrange data so that similar data points lie close to each other. Ordination can preserve and represent the structure of high-dimensional data within a low-dimensional space, thereby facilitating visual exploration and statistical analysis.

Mathematically, ordination is similar to, and in several respects equivalent to, embedding and dimensionality reduction. All three aim to represent high-dimensional data in a lower-dimensional space, but the term “ordination” is mainly used in ecology, where the goal is to reveal patterns such as groups or gradients underlying community data. The ordination methods implemented in scikit-bio are nevertheless versatile, serving not only ecological studies but also broader applications in scientific computing.

Multidimensional scaling

pcoa(distance_matrix[, method, ...])

Perform Principal Coordinate Analysis.

pcoa_biplot(ordination, y)

Compute the projection of descriptors into a PCoA matrix.

Correspondence analysis

ca(X[, scaling])

Compute correspondence analysis.
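As a rough illustration of what correspondence analysis computes, the sketch below follows the textbook formulation (as in [1]): the contingency table is converted to proportions, the chi-square-standardized residuals are decomposed by SVD, and the squared singular values are the eigenvalues (inertia per axis). The toy table and all variable names are assumptions made for this example, the scaling of site and species scores is omitted, and this is not necessarily the exact sequence of steps that `ca` performs.

```python
import numpy as np

# Hypothetical abundance table: 4 sites (rows) x 3 species (columns).
table = np.array([[10.0, 2.0, 1.0],
                  [3.0, 8.0, 2.0],
                  [1.0, 3.0, 9.0],
                  [4.0, 4.0, 4.0]])

grand_total = table.sum()
p = table / grand_total            # relative frequencies
r = p.sum(axis=1)                  # row (site) masses
c = p.sum(axis=0)                  # column (species) masses
expected = np.outer(r, c)          # proportions expected under independence

# Chi-square-standardized residuals, then their singular value decomposition.
residuals = (p - expected) / np.sqrt(expected)
u, s, vt = np.linalg.svd(residuals, full_matrices=False)

eigvals = s ** 2                   # inertia carried by each axis
total_inertia = eigvals.sum()      # equals the chi-square statistic / grand total
proportion_explained = eigvals / total_inertia
```

The total inertia equals the Pearson chi-square statistic of the table divided by the grand total, which gives a quick sanity check on any implementation.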

Canonical analysis

cca(y, x[, scaling])

Compute canonical (also known as constrained) correspondence analysis.

rda(y, x[, scale_Y, scaling])

Compute redundancy analysis, a type of canonical analysis.

Ordination results

OrdinationResults(short_method_name, ...[, ...])

Store ordination results, providing serialization and plotting support.

Utility functions

mean_and_std(a[, axis, weights, with_mean, ...])

Compute the weighted average and standard deviation along the specified axis.

corr(x[, y])

Compute correlation between columns of x, or x and y.
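The column-wise correlation described here can be sketched with plain NumPy: standardize each column, then take the cross-product divided by n - 1. The function name `corr_sketch` and the sample data are hypothetical, and the sketch only assumes Pearson correlation between columns; it is not the library's implementation.

```python
import numpy as np

def corr_sketch(x, y=None):
    """Pearson correlation between columns of x, or between columns of x and y."""
    x = np.asarray(x, dtype=float)
    y = x if y is None else np.asarray(y, dtype=float)
    n = x.shape[0]
    # Standardize columns using the sample standard deviation (ddof=1).
    xs = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)
    ys = (y - y.mean(axis=0)) / y.std(axis=0, ddof=1)
    return xs.T @ ys / (n - 1)

# Hypothetical data: 5 observations of 3 variables.
x = np.array([[1.0, 2.0, 0.5],
              [2.0, 1.0, 1.5],
              [3.0, 5.0, 1.0],
              [4.0, 3.0, 3.0],
              [5.0, 4.0, 2.0]])
r_xx = corr_sketch(x)            # 3 x 3 correlation matrix
r_xy = corr_sketch(x, x[:, :2])  # 3 x 2 cross-correlations
```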

scale(a[, weights, with_mean, with_std, ...])

Scale array by columns to have weighted average 0 and standard deviation 1.
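The weighted mean, weighted standard deviation, and column scaling described above can be sketched as follows. The helper names are hypothetical, and the sketch assumes the population form of the standard deviation (no degrees-of-freedom correction); the library functions expose more options than shown here.

```python
import numpy as np

def weighted_mean_and_std(a, weights=None, axis=0):
    """Weighted average and (population) standard deviation along an axis."""
    a = np.asarray(a, dtype=float)
    avg = np.average(a, axis=axis, weights=weights)
    # Weighted average of squared deviations from the weighted mean.
    deviations = a - np.expand_dims(avg, axis)
    var = np.average(deviations ** 2, axis=axis, weights=weights)
    return avg, np.sqrt(var)

def scale_sketch(a, weights=None):
    """Scale columns to weighted mean 0 and weighted standard deviation 1."""
    avg, std = weighted_mean_and_std(a, weights=weights, axis=0)
    return (np.asarray(a, dtype=float) - avg) / std

# Hypothetical data: 3 rows with row weights 1, 2, 1.
a = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [4.0, 50.0]])
w = np.array([1.0, 2.0, 1.0])
scaled = scale_sketch(a, weights=w)
```

Applying `weighted_mean_and_std` to `scaled` with the same weights should return means close to 0 and standard deviations close to 1 for every column.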

svd_rank(M_shape, S[, tol])

Matrix rank of M given its singular values S.
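A minimal sketch of how a rank can be derived from singular values: count the singular values that exceed a tolerance. The default tolerance below (largest singular value times the larger matrix dimension times machine epsilon) is the common NumPy-style convention and is an assumption of this sketch, as is the name `svd_rank_sketch`.

```python
import numpy as np

def svd_rank_sketch(m_shape, s, tol=None):
    """Number of singular values in s that are larger than tol."""
    if tol is None:
        # Assumed default: scale machine epsilon by the matrix size and
        # the largest singular value.
        tol = s.max() * max(m_shape) * np.finfo(s.dtype).eps
    return int(np.sum(s > tol))

# The second row is twice the first, so this 3 x 3 matrix has rank 2.
m = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],
              [1.0, 0.0, 1.0]])
s = np.linalg.svd(m, compute_uv=False)
rank = svd_rank_sketch(m.shape, s)
```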

e_matrix(distance_matrix)

Compute E matrix from a distance matrix.

f_matrix(E_matrix)

Compute F matrix from E matrix.
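The E and F matrices are the two intermediate steps of classical principal coordinate analysis as described in [1]: E holds the element-wise values -d²/2, and F is E after double-centering (Gower centering). The sketch below reproduces these steps with NumPy on a toy set of points and then recovers coordinates by eigendecomposition. The function names and data are hypothetical, and the sketch is meant to illustrate the idea rather than mirror the library code.

```python
import numpy as np

def e_matrix_sketch(d):
    """E = -0.5 * D**2, computed element-wise."""
    return -0.5 * d ** 2

def f_matrix_sketch(e):
    """Double-center E: subtract row and column means, add the grand mean."""
    row_means = e.mean(axis=1, keepdims=True)
    col_means = e.mean(axis=0, keepdims=True)
    return e - row_means - col_means + e.mean()

# Toy data: four points in the plane and their Euclidean distances.
points = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0], [3.0, 4.0]])
d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

f = f_matrix_sketch(e_matrix_sketch(d))

# Eigendecomposition of the symmetric F matrix, largest eigenvalue first.
eigvals, eigvecs = np.linalg.eigh(f)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Principal coordinates: eigenvectors scaled by the square roots of the
# eigenvalues (tiny negative values from round-off are clipped to zero).
coords = eigvecs * np.sqrt(np.clip(eigvals, 0.0, None))
d_recovered = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
```

Because the input distances are Euclidean, the distances between the recovered coordinates match the original distance matrix up to floating-point error.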

Examples

This is an artificial dataset (table 11.3 in [1]) that represents fish abundance in different sites (Y, the response variables) and environmental variables (X, the explanatory variables).

>>> import numpy as np
>>> import pandas as pd

First we need to construct our explanatory variable dataset X.

>>> X = np.array([[1.0, 0.0, 1.0, 0.0],
...               [2.0, 0.0, 1.0, 0.0],
...               [3.0, 0.0, 1.0, 0.0],
...               [4.0, 0.0, 0.0, 1.0],
...               [5.0, 1.0, 0.0, 0.0],
...               [6.0, 0.0, 0.0, 1.0],
...               [7.0, 1.0, 0.0, 0.0],
...               [8.0, 0.0, 0.0, 1.0],
...               [9.0, 1.0, 0.0, 0.0],
...               [10.0, 0.0, 0.0, 1.0]])
>>> transects = ['depth', 'substrate_coral', 'substrate_sand',
...              'substrate_other']
>>> sites = ['site1', 'site2', 'site3', 'site4', 'site5', 'site6', 'site7',
...          'site8', 'site9', 'site10']
>>> X = pd.DataFrame(X, sites, transects)

Then we need to create a dataframe with the information about the species observed at different sites.

>>> species = ['specie1', 'specie2', 'specie3', 'specie4', 'specie5',
...            'specie6', 'specie7', 'specie8', 'specie9']
>>> Y = np.array([[1, 0, 0, 0, 0, 0, 2, 4, 4],
...               [0, 0, 0, 0, 0, 0, 5, 6, 1],
...               [0, 1, 0, 0, 0, 0, 0, 2, 3],
...               [11, 4, 0, 0, 8, 1, 6, 2, 0],
...               [11, 5, 17, 7, 0, 0, 6, 6, 2],
...               [9, 6, 0, 0, 6, 2, 10, 1, 4],
...               [9, 7, 13, 10, 0, 0, 4, 5, 4],
...               [7, 8, 0, 0, 4, 3, 6, 6, 4],
...               [7, 9, 10, 13, 0, 0, 6, 2, 0],
...               [5, 10, 0, 0, 2, 4, 0, 1, 3]])
>>> Y = pd.DataFrame(Y, sites, species)

We can now perform canonical correspondence analysis. Matrix X contains a continuous variable (depth) and a categorical one (substrate type) represented with one-hot encoding.

>>> from skbio.stats.ordination import cca

We need to avoid perfect collinearity among the explanatory variables, so we’ll drop one of the substrate types (the last column of X).

>>> del X['substrate_other']
>>> ordination_result = cca(Y, X, scaling=2)
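The reason for dropping a substrate column can be checked numerically. The sketch below rebuilds the explanatory matrix with NumPy and compares the rank of the centered matrix with and without the last one-hot column. It assumes, as is usual in constrained ordination, that explanatory variables are centered before the regression step; the helper `centered_rank` is hypothetical.

```python
import numpy as np

depth = np.arange(1.0, 11.0)
# One-hot substrate columns in the order coral, sand, other (same values as X).
substrate = np.array([[0, 1, 0], [0, 1, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0],
                      [0, 0, 1], [1, 0, 0], [0, 0, 1], [1, 0, 0], [0, 0, 1]],
                     dtype=float)

full = np.column_stack([depth, substrate])   # all four columns
reduced = full[:, :-1]                       # 'substrate_other' dropped

def centered_rank(m):
    """Rank of the matrix after subtracting each column's mean."""
    return np.linalg.matrix_rank(m - m.mean(axis=0))

rank_full = centered_rank(full)        # fewer than 4: columns are collinear
rank_reduced = centered_rank(reduced)  # equals the number of columns
```

The three one-hot columns sum to 1 in every row, so after centering they sum to 0 and one of them is redundant. Removing any one of the three restores full column rank.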

Exploring the results, we see that the first three axes explain about 80% of the total variance.

>>> ordination_result.proportion_explained
CCA1    0.466911
CCA2    0.238327
CCA3    0.100548
CCA4    0.104937
CCA5    0.044805
CCA6    0.029747
CCA7    0.012631
CCA8    0.001562
CCA9    0.000532
dtype: float64
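The "about 80%" figure is simply the cumulative sum of the first three proportions. A small check with the values copied from the output above (in a live session, calling `cumsum()` on the `proportion_explained` Series should give the same numbers):

```python
import numpy as np

# Proportions copied from the output above, CCA1 through CCA9.
proportion_explained = np.array([0.466911, 0.238327, 0.100548, 0.104937,
                                 0.044805, 0.029747, 0.012631, 0.001562,
                                 0.000532])

cumulative = np.cumsum(proportion_explained)
first_three = cumulative[2]   # CCA1 + CCA2 + CCA3, roughly 0.806
```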

References

[1] Legendre P. and Legendre L. 1998. Numerical Ecology. Elsevier, Amsterdam.