skbio.stats.ordination.mmvec#
- skbio.stats.ordination.mmvec(X, Y, dimensions=3, optimizer='lbfgs', max_iter=1000, x_prior_mean=0.0, x_prior_scale=1.0, y_prior_mean=0.0, y_prior_scale=1.0, learning_rate=0.001, batch_size=50, beta_1=0.9, beta_2=0.95, clipnorm=10.0, batch_norm='unbiased', seed=None, verbose=False, output_format=None)[source]#
Perform multiomics Microbe-Metabolite Vectors (MMvec) analysis.
MMvec learns joint embeddings of two feature sets from their co-occurrence patterns using a multinomial likelihood model [1]. The model learns:
\[P(Y_j | X_i) = \text{softmax}(\hat{X}_i \cdot \hat{Y}_j + b_{\hat{X}_i} + b_{\hat{Y}_j})\]where \(\hat{X}_i\) is the learned embedding vector of feature \(i\) in the X modality (e.g., microbiome), \(\hat{Y}_j\) is the learned embedding vector of feature \(j\) in the Y modality (e.g., metabolome), and \(b_{\hat{X}_i}\), \(b_{\hat{Y}_j}\) are learned bias terms.
While MMvec was originally developed to analyze microbe-metabolite co-occurrence, this method is generic and can be applied to any two modalities that can be represented as compositional (count-based) data and share the same samples.
Added in version 0.7.3.
- Parameters:
- Xtable_like of shape (n_samples, n_features_x)
Abundance counts for the first modality (e.g., microbes). This modality is treated as the “conditioning” variable. See supported formats.
- Ytable_like of shape (n_samples, n_features_y)
Abundance counts for the second modality (e.g., metabolites). This modality is treated as the “conditioned” variable. See above. Must have the same samples as
X.- dimensionsint, optional
Number of latent dimensions for embeddings. Default is 3.
- optimizer{‘lbfgs’, ‘adam’}, optional
Optimization algorithm to use. Default is ‘lbfgs’.
‘lbfgs’: L-BFGS-B quasi-Newton method. Recommended for most cases. Typically converges in 50-200 iterations. Deterministic.
‘adam’: Stochastic gradient descent with Adam. Use for very large datasets or when stochastic behavior is desired.
- max_iterint, optional
Maximum number of iterations. Default is 1000. For ‘lbfgs’, this is the max number of L-BFGS iterations. For ‘adam’, this is the number of epochs.
- x_prior_meanfloat, optional
Mean of Gaussian prior on first modality embeddings. Default is 0.0.
- x_prior_scalefloat, optional
Scale (std) of Gaussian prior on first modality embeddings. Default is 1.0. Smaller values increase regularization.
- y_prior_meanfloat, optional
Mean of Gaussian prior on second modality embeddings. Default is 0.0.
- y_prior_scalefloat, optional
Scale (std) of Gaussian prior on second modality embeddings. Default is 1.0. Smaller values increase regularization.
- learning_ratefloat, optional
Adam learning rate. Ignored for ‘lbfgs’. Default is 1e-3.
- batch_sizeint, optional
Adam mini-batch size. Ignored for ‘lbfgs’. Default is 50.
- beta_1float, optional
Adam exponential decay rate for first moment. Ignored for ‘lbfgs’. Default is 0.9.
- beta_2float, optional
Adam exponential decay rate for second moment. Ignored for ‘lbfgs’. Default is 0.95.
- clipnormfloat, optional
Adam gradient clipping threshold (global L2 norm). Ignored for ‘lbfgs’. Default is 10.0.
- batch_norm{‘unbiased’, ‘legacy’}, optional
Method for scaling mini-batch likelihood in Adam. Ignored for ‘lbfgs’.
‘unbiased’ (default): Uses norm = sum(n_features_x) / batch_size.
‘legacy’: Uses norm = n_samples / batch_size.
- seedint, Generator or RandomState, optional
A user-provided random seed or random generator instance. See
details.- verbosebool, optional
Print training progress. Default is False.
- output_formatstr, optional
Output table format. See Common parameters for details.
- Returns:
MMvecResultResult of MMvec model fitting, including learned embeddings, conditional probabilities, and convergence history. It also provides methods for predicting target distributions, and scoring predictive performance.
See also
References
[1]Morton, J. T., Aksenov, A. A., Nothias, L. F., Foulds, J. R., Quinn, R. A., Badri, M. H., … & Knight, R. (2019). Learning representations of microbe-metabolite interactions. Nature Methods, 16(12), 1306-1314.
Examples
>>> from skbio.stats.ordination import mmvec
Create a toy example of paired microbiome and metabolome datasets.
>>> import pandas as pd >>> samples = ['S1', 'S2', 'S3', 'S4', 'S5'] >>> microbes = ['O1', 'O2', 'O3', 'O4', 'O5', 'O6', 'O7', 'O8'] >>> microbiome = pd.DataFrame([ ... [1, 0, 0, 7, 0, 2, 1, 0], ... [0, 1, 2, 9, 1, 1, 0, 1], ... [2, 0, 1, 6, 0, 0, 2, 0], ... [1, 2, 1, 8, 3, 1, 0, 2], ... [0, 1, 3, 3, 1, 4, 1, 0], ... ], index=samples, columns=microbes) >>> metabolites = ['C1', 'C2', 'C3', 'C4', 'C5', 'C6'] >>> metabolome = pd.DataFrame([ ... [0, 1, 0, 1, 1, 2], ... [1, 0, 2, 1, 5, 0], ... [0, 3, 1, 0, 1, 1], ... [3, 1, 0, 2, 2, 0], ... [0, 0, 4, 1, 7, 3], ... ], index=samples, columns=metabolites)
Fit the MMvec model to the data. This will perform numerical optimization for up to 100 iterations to learn the embeddings and co-occurrence patterns.
>>> res = mmvec(microbiome, metabolome, max_iter=100, seed=42)
Features of the two datasets (i.e., microbes and metabolites) are projected to the same latent space. Their coordinates (i.e., embeddings) are stored in two matrices:
x_embeddingsandy_embeddings, respectively. The last column of each matrix is the bias term.>>> res.x_embeddings.round(3) PC1 PC2 PC3 bias O1 -0.442 -0.358 0.701 0.036 O2 0.136 0.378 -0.280 -0.304 O3 0.226 -0.206 -0.668 0.376 O4 0.036 -0.015 0.088 0.055 O5 0.137 0.490 -0.068 -0.423 O6 -0.268 -0.047 -0.899 0.622 O7 0.009 -0.924 0.025 1.099 O8 0.575 0.823 0.398 -0.593
>>> res.y_embeddings.round(3) PC1 PC2 PC3 bias C1 0.000 0.000 0.000 0.000 C2 -0.153 -0.861 0.936 -0.149 C3 0.186 -0.475 -0.742 0.102 C4 -0.079 0.429 -0.001 0.020 C5 0.266 -0.074 -0.541 1.028 C6 -0.743 -0.965 -0.525 -0.132
One can generate a biplot visualizing the co-occurrence patterns of microbes and metabolites. The distance and angle between features indicate their association strength. Note that the origin of the plot alway lies on the first metabolite in the dataset, which serves as a reference.
>>> import matplotlib.pyplot as plt >>> plt.scatter('PC1', 'PC2', data=res.x_embeddings) >>> plt.scatter('PC1', 'PC2', data=res.y_embeddings) >>> plt.grid()
The
ranksmatrix stores the log conditional probabilities of microbe-metabolite co-occurrences. Larger positive values indicate a stronger likelihood of co-occurrence. Low and negative values indicate no relationship, not necessarily a negative correlation. Values are row-centered. One can sort each row to reveal which metabolites are most strongly associated with each microbe.>>> res.ranks.round(3) C1 C2 C3 C4 C5 C6 O1 -0.228 0.691 -0.522 -0.291 0.366 -0.017 O2 0.202 -0.859 0.054 0.070 1.086 -0.553 O3 -0.602 -0.858 0.511 -0.313 1.239 0.023 O4 -0.180 -0.184 -0.074 -0.114 0.866 -0.314 O5 0.368 -0.710 -0.109 0.165 1.011 -0.725 O6 -0.833 -1.119 0.531 -0.189 1.236 0.373 O7 -1.355 0.411 0.266 -0.634 0.829 0.483 O8 0.725 -0.441 -0.347 0.459 1.036 -1.431
The actual conditional probabilities are provided by
probs.>>> res.probs.round(3) C1 C2 C3 C4 C5 C6 O1 0.121 0.304 0.090 0.114 0.220 0.150 O2 0.167 0.058 0.144 0.147 0.405 0.079 O3 0.070 0.054 0.212 0.093 0.440 0.130 O4 0.127 0.126 0.141 0.135 0.361 0.111 O5 0.200 0.068 0.124 0.163 0.379 0.067 O6 0.053 0.040 0.208 0.101 0.421 0.177 O7 0.034 0.201 0.174 0.071 0.305 0.216 O8 0.256 0.080 0.088 0.197 0.350 0.030
The
convergencevector stores the loss curve over iterations during model training. If the model has converged, the curve should decrease and stabilize.>>> plt.plot(res.convergence)
A fitted MMvec model can be used to predict metabolite abundances given new microbiome data using the
predictmethod.>>> samples_test = ['S6', 'S7', 'S8'] >>> microbiome_test = pd.DataFrame([ ... [3, 0, 2, 6, 1, 4, 4, 0], ... [2, 1, 5, 0, 0, 3, 6, 1], ... [8, 4, 1, 1, 5, 0, 5, 1], ... ], index=samples_test, columns=microbes) >>> metabolome_pred = res.predict(microbiome_test) >>> metabolome_pred.round(3) C1 C2 C3 C4 C5 C6 S6 0.091 0.140 0.160 0.109 0.349 0.151 S7 0.077 0.130 0.175 0.098 0.360 0.160 S8 0.131 0.171 0.129 0.124 0.318 0.128
To evaluate the model’s generalizability, one can compute the \(Q^2\) statistic on held-out test data using the
scoremethod. A positive \(Q^2\) indicates better-than-mean predictive performance (with 1.0 indicating perfect prediction). A negative or close-to-zero \(Q^2\) is a sign of model overfitting or poor generalization.>>> metabolome_test = pd.DataFrame([ ... [0, 5, 1, 1, 8, 0], ... [1, 0, 5, 3, 2, 3], ... [4, 1, 0, 1, 6, 3], ... ], index=samples_test, columns=metabolites) >>> q2 = res.score(microbiome_test, metabolome_test) >>> float(q2.round(5)) 0.02291