skbio.stats.ordination.pca#
- skbio.stats.ordination.pca(X, method='eigh', iterative=False, dimensions=None, sample_ids=None, feature_ids=None, output_format=None)[source]#
Perform Principal Component Analysis (PCA).
Principal component analysis (PCA) is a dimensionality reduction technique that finds linear combinations of features which maximize the variance among samples. The vectors of feature weights resulting from this are the principal components of the data. The original samples are projected onto the principal components to obtain a lower-dimensional representation of the data that captures as much variance as possible.
PCA operates on a set of n samples that are each associated with a set of p features. Each sample is then a row of the n \(\times\) p data matrix \(\mathbf{X}\), whose columns are the features of the data. The goal of PCA is to find vectors \(\mathbf{w}_i\) in feature space along which the variance of the samples is maximized, the principal components of the data matrix:
\[\mathbf{w}_i^\ast = \arg\max_{\mathbf{w}_i} \operatorname{Var}(\mathbf{X_c} \mathbf{w}_i)\]
where \(\mathbf{X_c}\) is the mean-centered data matrix, centered by columns (features). The principal components of \(\mathbf{X}\) are the unit eigenvectors of the covariance matrix \(\mathbf{\Sigma}\):
\[\mathbf{\Sigma} = \frac{1}{n-1} \mathbf{X_c}^T \mathbf{X_c}\]
Each entry \(\mathbf{\Sigma}_{ij}\) is the covariance between features \(i\) and \(j\):
\[\mathbf{\Sigma}_{ij} = \operatorname{Cov}(\mathbf{X}_{\cdot i}, \mathbf{X}_{\cdot j})\]
The eigenvalue associated with each principal direction is the variance of the data along that direction:
\[\sigma_i^2 = \operatorname{Var}(\mathbf{X_c} \mathbf{w}_i)\]
- Parameters:
- X : table_like
Samples by features table (n, p). See supported formats.
- method : str, optional
Matrix decomposition method to use. Default is “eigh” (eigendecomposition), which computes exact eigenvectors and eigenvalues of the covariance matrix. The alternative is “svd” (singular value decomposition), which bypasses computing the full covariance matrix and instead computes the singular values of the input matrix.
- iterative : bool, optional
Whether to use iterative algorithms via ARPACK to compute only the specified number of eigenvalues. Default is False, which uses dense algorithms via LAPACK to compute all eigenvalues. Only applied if dimensions is specified; otherwise, dense algorithms are used regardless.
- dimensions : int, optional
Number of principal components to compute. Must be a positive integer less than or equal to min(n, p). If not provided, all principal components will be computed.
- sample_ids, feature_ids, output_format : optional
Standard table parameters. See Common parameters for details.
- Returns:
- OrdinationResults
Object that stores the PCA results, including eigenvalues, the proportion of variance explained by each of them, and transformed sample coordinates.
- Raises:
- ValueError
If dimensions is not a positive integer less than or equal to min(n_samples, n_features).
- ValueError
If method is not one of “eigh” or “svd”.
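The covariance-eigendecomposition route described above can be sketched with plain NumPy (an illustrative sketch of the math, not scikit-bio's actual implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 4))          # n=50 samples, p=4 features

Xc = X - X.mean(axis=0)                   # center each feature (column)
Sigma = Xc.T @ Xc / (X.shape[0] - 1)      # covariance matrix (p, p)

eigvals, eigvecs = np.linalg.eigh(Sigma)  # eigh returns ascending order
order = np.argsort(eigvals)[::-1]         # reorder by descending variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = Xc @ eigvecs                     # project samples onto components

# The variance of the samples along each component equals its eigenvalue.
assert np.allclose(scores.var(axis=0, ddof=1), eigvals)
```

The columns of `eigvecs` are the principal components \(\mathbf{w}_i\), and `scores` holds the lower-dimensional sample coordinates.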
Notes
Principal Component Analysis (PCA) was first described in [1] and [2].
Two methods are provided for performing the matrix decomposition. The default method, eigh, constructs the covariance matrix first and then performs eigenvalue decomposition to obtain the principal components. The alternative method, svd, performs singular value decomposition (SVD) on the centered data matrix to obtain the same result. Because SVD avoids explicitly computing the covariance matrix, it may be more accurate than eigenvalue decomposition for very small eigenvalues.
The iterative parameter determines whether the methods are performed by dense algorithms, which compute all eigenvalues to full accuracy via LAPACK, or sparse algorithms, which compute only the specified number of eigenvalues iteratively via ARPACK. Dense algorithms are used by default, while iterative algorithms are preferred for very large data matrices when the desired number of principal components is much smaller than the size of the data matrix.
The number of principal components kept is determined by dimensions; if no dimension is specified, all eigenvalues are computed by default. The iterative parameter is incompatible with computing all eigenvalues; therefore, if no dimension is specified, dense algorithms are used regardless.
If iterative is False, svd is computed by SciPy’s dense “svd” function, and eigh is computed by SciPy’s dense “eigh” function. If iterative is True, the methods are computed by SciPy’s sparse “svds” and “eigsh” functions, respectively.
References
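The equivalence of the two decomposition routes described in the notes can be checked numerically: the eigenvalues of the covariance matrix equal the squared singular values of the centered data matrix divided by \(n - 1\). A minimal NumPy sketch (illustrative only, not scikit-bio's implementation):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((30, 5))
Xc = X - X.mean(axis=0)                   # center by columns (features)
n = X.shape[0]

# "eigh" route: eigendecomposition of the covariance matrix.
w_eigh = np.linalg.eigh(Xc.T @ Xc / (n - 1))[0][::-1]   # sort descending

# "svd" route: singular values of the centered matrix, no covariance
# matrix is ever formed explicitly.
s = np.linalg.svd(Xc, compute_uv=False)   # already descending
w_svd = s**2 / (n - 1)

assert np.allclose(w_eigh, w_svd)         # same explained variances
```

This is why both methods yield the same ordination up to numerical precision, while svd can be preferable when small eigenvalues matter.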