Ordination methods (skbio.stats.ordination)#

This module provides functions for ordination – a category of methods that arrange data so that similar data points lie close to each other. Ordination can preserve and represent the structure of high-dimensional data within a low-dimensional space, thereby facilitating visual exploration and statistical analysis.

Mathematically, ordination is closely related, and in several respects equivalent, to embedding and dimensionality reduction: all three aim to represent high-dimensional data in a lower-dimensional space. The term “ordination” is used mainly in ecology, where these methods reveal patterns such as groups or gradients underlying community data. However, the ordination methods implemented in scikit-bio are versatile, serving not only ecological studies but also broader applications in scientific computing.

Multidimensional scaling#

pcoa

Perform Principal Coordinate Analysis (PCoA).

pcoa_biplot

Compute the projection of descriptors into a PCoA matrix.

Correspondence analysis#

ca

Compute correspondence analysis.

Canonical analysis#

cca

Compute canonical (also known as constrained) correspondence analysis.

rda

Compute redundancy analysis, a type of canonical analysis.

Ordination results#

OrdinationResults

Store ordination results, providing serialization and plotting support.

Utility functions#

mean_and_std

Compute the weighted average and standard deviation along the specified axis.

corr

Compute correlation between columns of x, or x and y.

scale

Scale array by columns to have weighted average 0 and standard deviation 1.

svd_rank

Matrix rank of M given its singular values S.

e_matrix

Compute E matrix from a distance matrix.

f_matrix

Compute F matrix from E matrix.
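
The E and F matrices are not defined on this page. As a minimal sketch, assuming the usual principal coordinate analysis formulation (E is the element-wise -0.5 times the squared distances, and F is E after double-centering), the two steps can be written in plain NumPy. The helper names `e_matrix_sketch` and `f_matrix_sketch` are hypothetical and are not the library functions.

```python
import numpy as np


def e_matrix_sketch(distances):
    """Hypothetical helper: E = -0.5 * D**2, computed element-wise."""
    distances = np.asarray(distances, dtype=float)
    return -0.5 * distances ** 2


def f_matrix_sketch(e):
    """Hypothetical helper: double-center E (subtract row and column
    means, then add back the grand mean)."""
    row_means = e.mean(axis=1, keepdims=True)
    col_means = e.mean(axis=0, keepdims=True)
    return e - row_means - col_means + e.mean()


# Three points in the plane and their Euclidean distance matrix.
points = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]])
diffs = points[:, None, :] - points[None, :, :]
distances = np.sqrt((diffs ** 2).sum(axis=-1))

F = f_matrix_sketch(e_matrix_sketch(distances))

# For Euclidean distances, F equals the Gram matrix of the centered points.
centered = points - points.mean(axis=0)
gram = centered @ centered.T
print(np.allclose(F, gram))
```

Under this assumption, the rows and columns of F sum to zero, and its eigendecomposition yields the principal coordinates.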

Examples#

This is an artificial dataset (Table 11.3 in [1]) that represents fish abundance at different sites (Y, the response variables) and environmental variables (X, the explanatory variables).

>>> import numpy as np
>>> import pandas as pd

First we need to construct our explanatory variable dataset X.

>>> X = np.array([[1.0, 0.0, 1.0, 0.0],
...               [2.0, 0.0, 1.0, 0.0],
...               [3.0, 0.0, 1.0, 0.0],
...               [4.0, 0.0, 0.0, 1.0],
...               [5.0, 1.0, 0.0, 0.0],
...               [6.0, 0.0, 0.0, 1.0],
...               [7.0, 1.0, 0.0, 0.0],
...               [8.0, 0.0, 0.0, 1.0],
...               [9.0, 1.0, 0.0, 0.0],
...               [10.0, 0.0, 0.0, 1.0]])
>>> transects = ['depth', 'substrate_coral', 'substrate_sand',
...              'substrate_other']
>>> sites = ['site1', 'site2', 'site3', 'site4', 'site5', 'site6', 'site7',
...          'site8', 'site9', 'site10']
>>> X = pd.DataFrame(X, sites, transects)

Next, we create a dataframe Y holding the abundance of each species observed at each site.

>>> species = ['specie1', 'specie2', 'specie3', 'specie4', 'specie5',
...            'specie6', 'specie7', 'specie8', 'specie9']
>>> Y = np.array([[1, 0, 0, 0, 0, 0, 2, 4, 4],
...               [0, 0, 0, 0, 0, 0, 5, 6, 1],
...               [0, 1, 0, 0, 0, 0, 0, 2, 3],
...               [11, 4, 0, 0, 8, 1, 6, 2, 0],
...               [11, 5, 17, 7, 0, 0, 6, 6, 2],
...               [9, 6, 0, 0, 6, 2, 10, 1, 4],
...               [9, 7, 13, 10, 0, 0, 4, 5, 4],
...               [7, 8, 0, 0, 4, 3, 6, 6, 4],
...               [7, 9, 10, 13, 0, 0, 6, 2, 0],
...               [5, 10, 0, 0, 2, 4, 0, 1, 3]])
>>> Y = pd.DataFrame(Y, sites, species)

We can now perform canonical correspondence analysis. Matrix X contains a continuous variable (depth) and a categorical one (substrate type), which is one-hot encoded.

>>> from skbio.stats.ordination import cca

We need to avoid perfect collinearity explicitly: the three one-hot substrate columns always sum to 1, so one of them is redundant. We’ll drop one of the substrate types (the last column of X).

>>> del X['substrate_other']
>>> ordination_result = cca(Y, X, scaling=2)
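
The collinearity mentioned above can be checked numerically. The sketch below is illustrative only: it assumes the explanatory variables are centered before the regression step (plain unweighted centering is used here for simplicity, which may differ from what `cca` does internally), and it uses the name `X_full` for a fresh copy of the original four-column matrix.

```python
import numpy as np

# Same explanatory matrix as above: depth plus three one-hot substrate columns
# (coral, sand, other).
X_full = np.array([[1.0, 0.0, 1.0, 0.0],
                   [2.0, 0.0, 1.0, 0.0],
                   [3.0, 0.0, 1.0, 0.0],
                   [4.0, 0.0, 0.0, 1.0],
                   [5.0, 1.0, 0.0, 0.0],
                   [6.0, 0.0, 0.0, 1.0],
                   [7.0, 1.0, 0.0, 0.0],
                   [8.0, 0.0, 0.0, 1.0],
                   [9.0, 1.0, 0.0, 0.0],
                   [10.0, 0.0, 0.0, 1.0]])

# Center each column (assumption: unweighted centering, for illustration).
centered_full = X_full - X_full.mean(axis=0)
# Drop the last column ('substrate_other').
centered_reduced = centered_full[:, :-1]

rank_full = np.linalg.matrix_rank(centered_full)
rank_reduced = np.linalg.matrix_rank(centered_reduced)

print(rank_full, "of", centered_full.shape[1], "columns")        # 3 of 4 columns
print(rank_reduced, "of", centered_reduced.shape[1], "columns")  # 3 of 3 columns
```

With all four columns the centered matrix is rank-deficient, because the centered one-hot columns sum to zero. After dropping one substrate column, the remaining three columns are linearly independent.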

Exploring the results, we see that the first three axes together explain about 80% of the total variance (0.467 + 0.238 + 0.101 ≈ 0.806).

>>> ordination_result.proportion_explained
CCA1    0.466911
CCA2    0.238327
CCA3    0.100548
CCA4    0.104937
CCA5    0.044805
CCA6    0.029747
CCA7    0.012631
CCA8    0.001562
CCA9    0.000532
dtype: float64
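
To see how much variance the leading axes explain together, the proportions can be accumulated. The sketch below copies the values printed above into a NumPy array so that it runs on its own; the variable names are illustrative.

```python
import numpy as np

# Proportions copied from the output above, in axis order CCA1..CCA9.
axes = [f"CCA{i}" for i in range(1, 10)]
proportions = np.array([0.466911, 0.238327, 0.100548, 0.104937, 0.044805,
                        0.029747, 0.012631, 0.001562, 0.000532])

# Running total of the explained proportion.
cumulative = np.cumsum(proportions)

for axis, value in zip(axes, cumulative):
    print(f"{axis}: {value:.3f}")
```

Since `proportion_explained` is displayed as a pandas Series, calling `ordination_result.proportion_explained.cumsum()` on the actual result should give the same running totals.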

References#

[1] Legendre P. and Legendre L. 1998. Numerical Ecology. Elsevier, Amsterdam.