skbio.stats.ordination.mmvec

skbio.stats.ordination.mmvec(microbes, metabolites, n_components=3, optimizer='lbfgs', max_iter=1000, learning_rate=0.001, batch_size=50, u_prior_mean=0.0, u_prior_scale=1.0, v_prior_mean=0.0, v_prior_scale=1.0, beta_1=0.9, beta_2=0.95, clipnorm=10.0, batch_normalization='unbiased', random_state=None, verbose=False)

Multiomics Microbe-Metabolite Vectors (MMvec).

Learns joint embeddings of two feature sets from their co-occurrence patterns using a multinomial likelihood model.

Although the parameter names use “microbes” and “metabolites”, following the original publication, the method is generic and can be applied to any two omics modalities representable as compositional (count-based) data. Examples include microbes and host transcripts, proteins and metabolites, or any other pair of feature tables that share the same samples.

Added in version 0.7.2.

Parameters:
microbes : pd.DataFrame or array-like of shape (n_samples, n_microbes)

Abundance counts for the first modality (e.g., microbes, proteins). This modality is treated as the “conditioning” variable.

metabolites : pd.DataFrame or array-like of shape (n_samples, n_metabolites)

Abundance counts for the second modality (e.g., metabolites, transcripts). This modality is treated as the “conditioned” variable.

n_components : int, optional

Number of latent dimensions for embeddings. Default is 3.

optimizer : {‘lbfgs’, ‘adam’}, optional

Optimization algorithm to use. Default is ‘lbfgs’.

  • ‘lbfgs’: L-BFGS-B quasi-Newton method. Recommended for most cases. Typically converges in 50-200 iterations. Deterministic.

  • ‘adam’: Stochastic gradient descent with Adam. Use for very large datasets or when stochastic behavior is desired.

max_iter : int, optional

Maximum number of iterations. Default is 1000. For ‘lbfgs’, this is the maximum number of L-BFGS iterations. For ‘adam’, this is the number of epochs.

learning_rate : float, optional

Adam optimizer learning rate. Ignored for ‘lbfgs’. Default is 1e-3.

batch_size : int, optional

Mini-batch size for Adam optimizer. Ignored for ‘lbfgs’. Default is 50.

u_prior_mean : float, optional

Mean of Gaussian prior on first modality (microbes) embeddings. Default is 0.0.

u_prior_scale : float, optional

Scale (std) of Gaussian prior on first modality embeddings. Default is 1.0. Smaller values increase regularization.

v_prior_mean : float, optional

Mean of Gaussian prior on second modality (metabolites) embeddings. Default is 0.0.

v_prior_scale : float, optional

Scale (std) of Gaussian prior on second modality embeddings. Default is 1.0. Smaller values increase regularization.
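The relationship between prior scale and regularization strength can be sketched with NumPy alone. This is an illustration under stated assumptions, not the library's implementation: `prior_penalty` is a hypothetical helper that computes the quadratic term an independent Gaussian prior contributes to a negative log-posterior, with additive constants dropped.

```python
import numpy as np

def prior_penalty(x, mean=0.0, scale=1.0):
    """Quadratic penalty from an independent Normal(mean, scale) prior.

    Hypothetical helper for illustration only; constants that do not
    depend on x are dropped.
    """
    x = np.asarray(x, dtype=float)
    return float(0.5 * np.sum(((x - mean) / scale) ** 2))

rng = np.random.default_rng(0)
U = rng.normal(size=(10, 3))  # stand-in for an embedding matrix

loose = prior_penalty(U, scale=1.0)
tight = prior_penalty(U, scale=0.5)  # halving the scale quadruples the penalty
```

Because the penalty is proportional to 1 / scale**2, shrinking the scale pulls the embeddings more strongly toward the prior mean.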

beta_1 : float, optional

Adam exponential decay rate for first moment. Ignored for ‘lbfgs’. Default is 0.9.

beta_2 : float, optional

Adam exponential decay rate for second moment. Ignored for ‘lbfgs’. Default is 0.95.
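To show how `learning_rate`, `beta_1`, and `beta_2` interact, here is a minimal sketch of the textbook Adam update applied to a toy quadratic. It is not taken from the scikit-bio source; `adam_step` and the `eps` stabilizer are assumptions introduced for illustration.

```python
import numpy as np

def adam_step(param, grad, m, v, t,
              learning_rate=1e-3, beta_1=0.9, beta_2=0.95, eps=1e-8):
    """One textbook Adam update (hypothetical helper; eps is an assumption)."""
    m = beta_1 * m + (1.0 - beta_1) * grad          # first-moment estimate
    v = beta_2 * v + (1.0 - beta_2) * grad ** 2     # second-moment estimate
    m_hat = m / (1.0 - beta_1 ** t)                 # bias correction
    v_hat = v / (1.0 - beta_2 ** t)
    param = param - learning_rate * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

# Toy objective f(x) = x**2 with gradient 2 * x.
x = np.array([1.0])
m = np.zeros_like(x)
v = np.zeros_like(x)

x_first, m, v = adam_step(x, 2.0 * x, m, v, t=1, learning_rate=0.01)

x_run = x_first
for t in range(2, 201):
    x_run, m, v = adam_step(x_run, 2.0 * x_run, m, v, t, learning_rate=0.01)
```

On the first step the bias-corrected ratio reduces to the sign of the gradient, so the parameter moves by roughly `learning_rate` regardless of the gradient's magnitude.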

clipnorm : float, optional

Gradient clipping threshold for Adam (global L2 norm). Ignored for ‘lbfgs’. Default is 10.0.
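Clipping by global L2 norm rescales all gradient arrays together so that their combined norm does not exceed the threshold. The following is a generic sketch of that technique, not the library's code; `clip_by_global_norm` is a hypothetical helper.

```python
import numpy as np

def clip_by_global_norm(grads, clipnorm=10.0):
    """Rescale a list of gradient arrays so their joint L2 norm <= clipnorm.

    Hypothetical helper for illustration only.
    """
    global_norm = float(np.sqrt(sum(np.sum(g ** 2) for g in grads)))
    if global_norm <= clipnorm:
        return [g.copy() for g in grads], global_norm
    scale = clipnorm / global_norm
    return [g * scale for g in grads], global_norm

# Two gradient arrays whose joint norm is sqrt(4 * 36 + 4 * 64) = 20.
grads = [np.full((2, 2), 6.0), np.full(4, 8.0)]
clipped, norm_before = clip_by_global_norm(grads, clipnorm=10.0)
norm_after = float(np.sqrt(sum(np.sum(g ** 2) for g in clipped)))
```

All arrays share one scaling factor, so the direction of the overall gradient is preserved while its length is capped.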

batch_normalization : {‘unbiased’, ‘legacy’}, optional

Method for scaling mini-batch likelihood in Adam. Ignored for ‘lbfgs’. Default is ‘unbiased’.

  • ‘unbiased’: Uses norm = sum(microbe_counts) / batch_size.

  • ‘legacy’: Uses norm = n_samples / batch_size.
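The two scaling factors above can be compared on a small synthetic table. This sketch assumes that `sum(microbe_counts)` means the total of all entries in the first-modality table; that reading is an assumption, and the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
microbes = rng.integers(0, 100, size=(50, 10))  # synthetic count table
batch_size = 50
n_samples = microbes.shape[0]

# 'unbiased': total counts divided by batch size (assumed interpretation).
norm_unbiased = microbes.sum() / batch_size

# 'legacy': number of samples divided by batch size.
norm_legacy = n_samples / batch_size
```

With count data the total count is usually far larger than the number of samples, so the two options weight the mini-batch likelihood very differently relative to the priors.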

random_state : int or numpy.random.Generator, optional

Seed for random number generation or a Generator instance. If an int, creates a Generator with that seed. Default is None.
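The seed-or-Generator convention matches what `numpy.random.default_rng` accepts. The sketch below shows that behavior in isolation; `as_generator` is a hypothetical wrapper and is not claimed to be how the function handles `random_state` internally.

```python
import numpy as np

def as_generator(random_state=None):
    """Normalize None, an int seed, or a Generator into a Generator.

    Hypothetical helper; it simply delegates to numpy.random.default_rng.
    """
    return np.random.default_rng(random_state)

# The same integer seed yields the same stream of draws.
a = as_generator(42).integers(0, 100, size=5)
b = as_generator(42).integers(0, 100, size=5)

# An existing Generator is passed through unchanged.
gen = np.random.default_rng(7)
same_object = as_generator(gen) is gen
```

Passing a shared Generator lets several calls draw from one stream, while passing an int makes a single call reproducible on its own.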

verbose : bool, optional

Print training progress. Default is False.

Returns:
MMvecResults

Object containing:

  • microbe_embeddings: DataFrame (n_microbes, n_components + 1)

  • metabolite_embeddings: DataFrame (n_metabolites, n_components + 1)

  • ranks: DataFrame (n_microbes, n_metabolites)

  • convergence: DataFrame with loss per iteration

Notes

The model learns:

\[P(\text{metabolite}_j | \text{microbe}_i) = \text{softmax}(U_i \cdot V_j + b_{U_i} + b_{V_j})\]
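The formula can be evaluated directly with NumPy. The sketch below uses random stand-in parameters rather than fitted ones, and `conditional_probs` is a hypothetical helper; it only illustrates that each microbe receives a probability distribution over metabolites.

```python
import numpy as np

def conditional_probs(U, V, b_u, b_v):
    """Row-wise softmax of U @ V.T plus biases (hypothetical helper).

    Row i holds P(metabolite_j | microbe_i) for every metabolite j.
    """
    logits = U @ V.T + b_u[:, None] + b_v[None, :]
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp_logits = np.exp(logits)
    return exp_logits / exp_logits.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n_microbes, n_metabolites, n_components = 10, 15, 3
U = rng.normal(size=(n_microbes, n_components))     # stand-in microbe embeddings
V = rng.normal(size=(n_metabolites, n_components))  # stand-in metabolite embeddings
b_u = rng.normal(size=n_microbes)                   # stand-in microbe biases
b_v = rng.normal(size=n_metabolites)                # stand-in metabolite biases

P = conditional_probs(U, V, b_u, b_v)
```

The softmax is taken over metabolites, so every row of `P` sums to one.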

To evaluate model performance on held-out data, use the score method of the returned MMvecResults object.

References

[1] Morton, J.T., et al. “Learning representations of microbe-metabolite interactions.” Nature Methods, 2019.

Examples

>>> from skbio.stats.ordination import mmvec
>>> import numpy as np
>>> import pandas as pd
>>> # Create synthetic data
>>> np.random.seed(42)
>>> microbes = pd.DataFrame(
...     np.random.randint(0, 100, size=(50, 10)),
...     columns=[f'OTU_{i}' for i in range(10)]
... )
>>> metabolites = pd.DataFrame(
...     np.random.randint(0, 100, size=(50, 15)),
...     columns=[f'metabolite_{i}' for i in range(15)]
... )
>>> result = mmvec(microbes, metabolites, n_components=2, max_iter=10)
>>> result.ranks.shape
(10, 15)