skbio.stats.composition.dirmult_lme#

skbio.stats.composition.dirmult_lme(table, metadata, formula, grouping, pseudocount=0.5, draws=128, p_adjust='holm', seed=None, re_formula=None, vc_formula=None, model_kwargs={}, fit_method=None, fit_converge=False, fit_warnings=False, fit_kwargs={})[source]#

Fit a Dirichlet-multinomial linear mixed effects model.

Added in version 0.7.0.

The Dirichlet-multinomial distribution is a compound distribution that combines a Dirichlet distribution over the probabilities of a multinomial distribution. This distribution is used to model the distribution of species abundances in a community.

To fit the linear mixed effects model we first fit a Dirichlet-multinomial distribution for each sample, and then we compute the fold change and p-value for each feature. The fold change is computed as the slopes from the resulting model. Statistical tests are then performed on the posterior samples, drawn from each Dirichlet-multinomial distribution. The log-fold changes as well as their credible intervals, the p-values and the multiple comparison corrected p-values are reported.

This function uses the MixedLM class from statsmodels.

Note

Because the analysis iteratively runs many numeric optimizations, it can take longer than usual to finish. Please allow extra time for completion.

Parameters:

tabletable_like of shape (n_samples, n_features): A matrix containing count or proportional abundance data of the samples. See supported formats.
metadatapd.DataFrame or 2-D array_like: The metadata for the model. Rows correspond to samples and columns correspond to covariates in the model. Must be a pandas DataFrame or convertible to a pandas DataFrame.
formulastr or generic Formula object: The formula defining the model. Refer to Patsy’s documentation on how to specify a formula.
groupingstr, pd.Series or 1-D array_like: A vector or a metadata column name indicating the assignment of samples to groups. Samples are independent between groups during model fitting.
pseudocountfloat, optional: A non-zero value added to the input counts to ensure that all of the estimated abundances are strictly greater than zero. Default is 0.5.
drawsint, optional: Number of draws from the Dirichlet-multinomial posterior distribution. Default is 128.
p_adjuststr, optional: Method to correct p-values for multiple comparisons. Options are Holm- Boniferroni (“holm” or “holm-bonferroni”) (default), Benjamini- Hochberg (“bh”, “fdr_bh” or “benjamini-hochberg”), or any method supported by statsmodels’ multipletests function. Case-insensitive. If None, no correction will be performed.
seedint, Generator or RandomState, optional: A user-provided random seed or random generator instance for drawing from the Dirichlet distribution. See details.
re_formulastr, optional: Random coefficient formula. See MixedLM.from_formula for details.
vc_formulastr, optional: Variance component formula. See MixedLM.from_formula for details.
model_kwargsdict, optional: Additional keyword arguments to pass to MixedLM.
fit_methodstr or list of str, optional: Optimization method for model fitting. Can be a single method name, or a list of method names to be tried sequentially. See statsmodels.optimization for available methods. If None, a default list of methods will be tried.
fit_convergebool, optional: If True, model fittings that were completed but did not converge will be excluded from the calculation of final statistics. Default is False.
fit_warningsbool, optional: Issue warnings if any during the model fitting process. Default is False. Warnings are usually issued when the optimization methods do not converge, which is common in the analysis. Default is False.
fit_kwargsdict, optional: Additional keyword arguments to pass to MixedLM.fit.

Returns:

pd.DataFrame

A table of features and covariates, their log-fold changes and other relevant statistics.

FeatureID: Feature identifier, i.e., dependent variable.
Covariate: Covariate name, i.e., independent variable.
Reps: Number of Dirichlet-multinomial posterior draws that supported the reported statistics, i.e., the number of successful model fittings on this feature. Max: draws (if none failed). Min: 0 (in which case all statistics are NaN).
Log2(FC): Expected log2-fold change of abundance from the reference category to the covariate category defined in the formula. The value is expressed in the center log ratio (see clr) transformed coordinates. The reported value is the average of all of the log2-fold changes computed from each of the posterior draws.
CI(2.5): 2.5% quantile of the log2-fold change. The reported value is the minimum of all of the 2.5% quantiles computed from each of the posterior draws.
CI(97.5): 97.5% quantile of the log2-fold change. The reported value is the maximum of all of the 97.5% quantiles computed from each of the posterior draws.
pvalue: p-value of the linear mixed effects model. The reported value is the average of all of the p-values computed from each of the posterior draws.
qvalue: Corrected p-value of the linear mixed effects model for multiple comparisons. The reported value is the average of all of the q-values computed from each of the posterior draws.
Signif: Whether the covariate category is significantly differentially abundant from the reference category. A feature-covariate pair marked as “True” suffice: 1) The q-value must be less than or equal to the significance level (0.05). 2) The confidence interval (CI(2.5)..CI(97.5)) must not overlap with zero.

See also

dirmult_ttest
statsmodels.formula.api.mixedlm
statsmodels.regression.mixed_linear_model.MixedLM

Examples

>>> import pandas as pd
>>> from skbio.stats.composition import dirmult_lme
>>> table = pd.DataFrame(
...     [[1.00000053, 6.09924644],
...      [0.99999843, 7.0000045],
...      [1.09999884, 8.08474053],
...      [1.09999758, 1.10000349],
...      [0.99999902, 2.00000027],
...      [1.09999862, 2.99998318],
...      [1.00000084, 2.10001257],
...      [0.9999991, 3.09998418],
...      [0.99999899, 3.9999742],
...      [1.10000124, 5.0001796],
...      [1.00000053, 6.09924644],
...      [1.10000173, 6.99693644]],
...     index=['u1', 'u2', 'u3', 'x1', 'x2', 'x3', 'y1', 'y2', 'y3', 'z1',
...            'z2', 'z3'],
...     columns=['Y1', 'Y2'])
>>> metadata = pd.DataFrame(
...     {'patient': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
...      'treatment': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
...      'time': [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3],},
...     index=['x1', 'x2', 'x3', 'y1', 'y2', 'y3', 'z1', 'z2', 'z3', 'u1',
...            'u2', 'u3'])
>>> result = dirmult_lme(table, metadata, formula='time + treatment',
...                      grouping='patient', seed=0, p_adjust='sidak')
>>> result
  FeatureID  Covariate  Reps  Log2(FC)   CI(2.5)  CI(97.5)    pvalue  \
0        Y1       time   128 -0.210769 -1.532255  1.122148  0.403737
1        Y1  treatment   128 -0.744061 -3.401978  1.581917  0.252057
2        Y2       time   128  0.210769 -1.122148  1.532255  0.403737
3        Y2  treatment   128  0.744061 -1.581917  3.401978  0.252057

     qvalue  Signif
0  0.644470   False
1  0.440581   False
2  0.644470   False
3  0.440581   False