skbio.stats.ordination.mmvec
- skbio.stats.ordination.mmvec(microbes, metabolites, n_components=3, optimizer='lbfgs', max_iter=1000, learning_rate=0.001, batch_size=50, u_prior_mean=0.0, u_prior_scale=1.0, v_prior_mean=0.0, v_prior_scale=1.0, beta_1=0.9, beta_2=0.95, clipnorm=10.0, batch_normalization='unbiased', random_state=None, verbose=False)
Multiomics Microbe-Metabolite Vectors (MMvec).
Learns joint embeddings of two feature sets from their co-occurrence patterns using a multinomial likelihood model.
While the parameter names use “microbes” and “metabolites” following the original publication, this method is generic and can be applied to any two omics modalities representable as compositional (count-based) data. For example: microbes and host transcripts, proteins and metabolites, or any pair of feature tables sharing the same samples.
Added in version 0.7.2.
- Parameters:
- microbes : pd.DataFrame or array-like of shape (n_samples, n_microbes)
Abundance counts for the first modality (e.g., microbes, proteins). This modality is treated as the “conditioning” variable.
- metabolites : pd.DataFrame or array-like of shape (n_samples, n_metabolites)
Abundance counts for the second modality (e.g., metabolites, transcripts). This modality is treated as the “conditioned” variable.
- n_components : int, optional
Number of latent dimensions for embeddings. Default is 3.
- optimizer : {‘lbfgs’, ‘adam’}, optional
Optimization algorithm to use. Default is ‘lbfgs’.
‘lbfgs’: L-BFGS-B quasi-Newton method. Recommended for most cases. Typically converges in 50-200 iterations. Deterministic.
‘adam’: Stochastic gradient descent with Adam. Use for very large datasets or when stochastic behavior is desired.
- max_iter : int, optional
Maximum number of iterations. Default is 1000. For ‘lbfgs’, this is the maximum number of L-BFGS iterations. For ‘adam’, this is the number of epochs.
- learning_rate : float, optional
Adam optimizer learning rate. Ignored for ‘lbfgs’. Default is 1e-3.
- batch_size : int, optional
Mini-batch size for Adam optimizer. Ignored for ‘lbfgs’. Default is 50.
- u_prior_mean : float, optional
Mean of Gaussian prior on first modality (microbes) embeddings. Default is 0.0.
- u_prior_scale : float, optional
Scale (std) of Gaussian prior on first modality embeddings. Default is 1.0. Smaller values increase regularization.
- v_prior_mean : float, optional
Mean of Gaussian prior on second modality (metabolites) embeddings. Default is 0.0.
- v_prior_scale : float, optional
Scale (std) of Gaussian prior on second modality embeddings. Default is 1.0. Smaller values increase regularization.
- beta_1 : float, optional
Adam exponential decay rate for first moment. Ignored for ‘lbfgs’. Default is 0.9.
- beta_2 : float, optional
Adam exponential decay rate for second moment. Ignored for ‘lbfgs’. Default is 0.95.
- clipnorm : float, optional
Gradient clipping threshold for Adam (global L2 norm). Ignored for ‘lbfgs’. Default is 10.0.
- batch_normalization : {‘unbiased’, ‘legacy’}, optional
Method for scaling mini-batch likelihood in Adam. Ignored for ‘lbfgs’. Default is ‘unbiased’.
‘unbiased’: Uses norm = sum(microbe_counts) / batch_size.
‘legacy’: Uses norm = n_samples / batch_size.
- random_state : int or numpy.random.Generator, optional
Seed for random number generation or a Generator instance. If an int, creates a Generator with that seed. Default is None.
- verbose : bool, optional
Print training progress. Default is False.
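The statement that smaller prior scales increase regularization can be illustrated with a short NumPy sketch. It assumes independent Gaussian priors on each embedding entry and keeps only the quadratic term of the negative log-density. The helper name `gaussian_prior_penalty` is hypothetical and not part of scikit-bio, and the exact objective used by the library may differ.

```python
import numpy as np


def gaussian_prior_penalty(x, mean=0.0, scale=1.0):
    """Quadratic part of the negative log-density of N(mean, scale**2).

    Summed over all entries of x; terms that do not depend on x are dropped.
    Hypothetical helper for illustration only.
    """
    x = np.asarray(x, dtype=float)
    return float(np.sum((x - mean) ** 2) / (2.0 * scale ** 2))


rng = np.random.default_rng(0)
U = rng.normal(size=(10, 3))  # stand-in embedding matrix, not fitted by mmvec

penalty_loose = gaussian_prior_penalty(U, scale=1.0)
penalty_tight = gaussian_prior_penalty(U, scale=0.5)

# Halving the scale quadruples the penalty on the same embeddings.
print(penalty_tight / penalty_loose)
```

Under these assumptions the penalty scales as 1 / scale**2, so a tighter prior pulls embeddings more strongly toward the prior mean.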
- Returns:
- MMvecResults
Object containing:
microbe_embeddings: DataFrame (n_microbes, n_components + 1)
metabolite_embeddings: DataFrame (n_metabolites, n_components + 1)
ranks: DataFrame (n_microbes, n_metabolites)
convergence: DataFrame with loss per iteration
Notes
The model learns:
\[P(\text{metabolite}_j \mid \text{microbe}_i) = \text{softmax}(U_i \cdot V_j + b_{U_i} + b_{V_j})\]
To evaluate model performance on held-out data, use the score method of the returned MMvecResults object.
References
[1] Morton, J.T., et al. “Learning representations of microbe-metabolite interactions.” Nature Methods, 2019.
Examples
>>> from skbio.stats.ordination import mmvec
>>> import numpy as np
>>> import pandas as pd
>>> # Create synthetic data
>>> np.random.seed(42)
>>> microbes = pd.DataFrame(
...     np.random.randint(0, 100, size=(50, 10)),
...     columns=[f'OTU_{i}' for i in range(10)]
... )
>>> metabolites = pd.DataFrame(
...     np.random.randint(0, 100, size=(50, 15)),
...     columns=[f'metabolite_{i}' for i in range(15)]
... )
>>> result = mmvec(microbes, metabolites, n_components=2, max_iter=10)
>>> result.ranks.shape
(10, 15)
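The conditional probability given in the Notes can also be sketched directly in NumPy. This is an illustration of the formula only, not the library's implementation. It assumes `U` and `V` hold one embedding vector per row, `b_u` and `b_v` are per-feature bias vectors, and the softmax is taken across metabolites for each microbe. The function name `conditional_probs` and all values below are hypothetical.

```python
import numpy as np


def conditional_probs(U, V, b_u, b_v):
    """Row-wise softmax of U @ V.T + b_u[:, None] + b_v[None, :].

    Row i holds P(metabolite_j | microbe_i) for every metabolite j.
    Hypothetical helper for illustration only.
    """
    logits = U @ V.T + b_u[:, None] + b_v[None, :]
    # Subtract the row maximum for numerical stability; softmax is unchanged.
    logits = logits - logits.max(axis=1, keepdims=True)
    exp_logits = np.exp(logits)
    return exp_logits / exp_logits.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)
n_microbes, n_metabolites, n_components = 4, 6, 2

# Random stand-ins for fitted parameters (assumed shapes).
U = rng.normal(size=(n_microbes, n_components))
V = rng.normal(size=(n_metabolites, n_components))
b_u = rng.normal(size=n_microbes)
b_v = rng.normal(size=n_metabolites)

P = conditional_probs(U, V, b_u, b_v)
print(P.shape)        # one row per microbe, one column per metabolite
print(P.sum(axis=1))  # each row is a probability distribution
```

Each row of `P` is a distribution over metabolites conditioned on a single microbe, which matches the roles of the “conditioning” and “conditioned” modalities described for the `microbes` and `metabolites` parameters.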