skbio.stats.ordination.pca#
- skbio.stats.ordination.pca(X, method='eigh', iterative=False, dimensions=None, sample_ids=None, feature_ids=None, output_format=None)[source]#
Perform Principal Component Analysis (PCA).
Principal component analysis (PCA) is a dimensionality reduction technique that finds linear combinations of features which maximize the variance among samples. The vectors of feature weights resulting from this are the principal components of the data. The original samples are projected onto the principal components to obtain a lower-dimensional representation of the data that captures as much variance as possible.
PCA operates on a set of n samples that are each associated with a set of p features. Each sample is then a row of the n \(\times\) p data matrix \(\mathbf{X}\), whose columns are the features of the data. The goal of PCA is to find vectors \(\mathbf{w}_i\) in feature space along which the variance of the samples is maximized, the principal components of the data matrix:
\[\mathbf{w}_i^\ast = \arg\max_{\mathbf{w}_i} \operatorname{Var}(\mathbf{X_c} \mathbf{w}_i)\]
where \(\mathbf{X_c}\) is the mean-centered data matrix, centered by columns (features). The principal components of \(\mathbf{X}\) are the unit eigenvectors of the covariance matrix \(\mathbf{\Sigma}\):
\[\mathbf{\Sigma} = \frac{1}{n-1} \mathbf{X_c}^T \mathbf{X_c}\]
Each entry \(\mathbf{\Sigma}_{ij}\) is the covariance between features \(i\) and \(j\):
\[\mathbf{\Sigma}_{ij} = \operatorname{Cov}(\mathbf{X}_{\cdot i}, \mathbf{X}_{\cdot j})\]
The eigenvalue associated with each principal direction is the variance of the data along that direction:
\[\sigma_i^2 = \operatorname{Var}(\mathbf{X_c} \mathbf{w}_i)\]
- Parameters:
- X : table_like
Samples by features table (n, p). See supported formats.
- method : str, optional
Matrix decomposition method to use. Default is “eigh” (eigendecomposition), which computes exact eigenvectors and eigenvalues of the covariance matrix. The alternative is “svd” (singular value decomposition), which bypasses computing the full covariance matrix and instead computes the singular values of the input matrix.
- iterative : bool, optional
Whether to use iterative algorithms via ARPACK to compute only the specified number of eigenvalues. Default is False, which uses dense algorithms via LAPACK to compute all eigenvalues. Only applied if dimensions is specified; otherwise, dense algorithms are used regardless.
- dimensions : int, optional
Number of principal components to compute. Must be a positive integer less than or equal to min(n, p). If not provided, all principal components will be computed.
- sample_ids, feature_ids, output_format : optional
Standard table parameters. See Common parameters for details.
- Returns:
- OrdinationResults
Object that stores the PCA results, including eigenvalues, the proportion of variance explained by each of them, and transformed sample coordinates.
- Raises:
- ValueError
If dimensions is not a positive integer less than or equal to min(n_samples, n_features).
- ValueError
If method is not one of “eigh” or “svd”.
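The covariance-eigendecomposition route described above can be sketched with plain NumPy (an illustrative sketch of the math, not scikit-bio's actual implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 4))          # n=50 samples, p=4 features

Xc = X - X.mean(axis=0)                   # center each feature (column)
Sigma = Xc.T @ Xc / (X.shape[0] - 1)      # covariance matrix (p, p)

eigvals, eigvecs = np.linalg.eigh(Sigma)  # eigh returns ascending order
order = np.argsort(eigvals)[::-1]         # reorder by descending variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = Xc @ eigvecs                     # project samples onto components

# The variance of the samples along each component equals its eigenvalue.
assert np.allclose(scores.var(axis=0, ddof=1), eigvals)
```

The columns of `eigvecs` are the principal components \(\mathbf{w}_i\), and `scores` holds the lower-dimensional sample coordinates.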
Notes
Principal Component Analysis (PCA) was first described in [1] and [2].
Two methods are provided for performing the matrix decomposition. The default method, eigh, constructs the covariance matrix first and then performs eigenvalue decomposition to obtain the principal components. The alternative method, svd, performs singular value decomposition (SVD) on the centered data matrix to obtain the same result. Because SVD avoids explicitly computing the covariance matrix, it may be more accurate than eigenvalue decomposition for very small eigenvalues.
The iterative parameter determines whether the methods are performed by dense algorithms, which compute all eigenvalues to full accuracy via LAPACK, or sparse algorithms, which compute only the specified number of eigenvalues iteratively via ARPACK. Dense algorithms are used by default, while iterative algorithms are preferred for very large data matrices when the desired number of principal components is much smaller than the size of the data matrix.
The number of principal components kept is determined by dimensions; if no dimension is specified, all eigenvalues are computed by default. The iterative parameter is incompatible with computing all eigenvalues; therefore, if no dimension is specified, dense algorithms are used regardless.
If iterative is False, svd is computed by SciPy’s dense “svd” function, and eigh is computed by SciPy’s dense “eigh” function. If iterative is True, the methods are computed by SciPy’s sparse “svds” and “eigsh” functions, respectively.
References
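The equivalence of the two decomposition routes described in the notes can be checked numerically: the eigenvalues of the covariance matrix equal the squared singular values of the centered data matrix divided by \(n - 1\). A minimal NumPy sketch (illustrative only, not scikit-bio's implementation):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((30, 5))
Xc = X - X.mean(axis=0)                   # center by columns (features)
n = X.shape[0]

# "eigh" route: eigendecomposition of the covariance matrix.
w_eigh = np.linalg.eigh(Xc.T @ Xc / (n - 1))[0][::-1]   # sort descending

# "svd" route: singular values of the centered matrix, no covariance
# matrix is ever formed explicitly.
s = np.linalg.svd(Xc, compute_uv=False)   # already descending
w_svd = s**2 / (n - 1)

assert np.allclose(w_eigh, w_svd)         # same explained variances
```

This is why both methods yield the same ordination up to numerical precision, while svd can be preferable when small eigenvalues matter.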