skbio.stats.composition.dirmult_lme#
- skbio.stats.composition.dirmult_lme(table, metadata, formula, grouping, pseudocount=0.5, draws=128, p_adjust='holm', seed=None, re_formula=None, vc_formula=None, model_kwargs={}, fit_method=None, fit_converge=False, fit_warnings=False, fit_kwargs={})[source]#
Fit a Dirichlet-multinomial linear mixed effects model.
Added in version 0.7.0.
The Dirichlet-multinomial distribution is a compound distribution that combines a Dirichlet distribution over the probabilities of a multinomial distribution. This distribution is used to model the distribution of species abundances in a community.
To fit the linear mixed effects model we first fit a Dirichlet-multinomial distribution for each sample, and then we compute the fold change and p-value for each feature. The fold change is computed as the slopes from the resulting model. Statistical tests are then performed on the posterior samples, drawn from each Dirichlet-multinomial distribution. The log-fold changes as well as their credible intervals, the p-values and the multiple comparison corrected p-values are reported.
This function uses the
MixedLM
class from statsmodels.Note
Because the analysis iteratively runs many numeric optimizations, it can take longer than usual to finish. Please allow extra time for completion.
- Parameters:
- tabletable_like of shape (n_samples, n_features)
A matrix containing count or proportional abundance data of the samples. See supported formats.
- metadatapd.DataFrame or 2-D array_like
The metadata for the model. Rows correspond to samples and columns correspond to covariates in the model. Must be a pandas DataFrame or convertible to a pandas DataFrame.
- formulastr or generic Formula object
The formula defining the model. Refer to Patsy’s documentation on how to specify a formula.
- groupingstr, pd.Series or 1-D array_like
A vector or a metadata column name indicating the assignment of samples to groups. Samples are independent between groups during model fitting.
- pseudocountfloat, optional
A non-zero value added to the input counts to ensure that all of the estimated abundances are strictly greater than zero. Default is 0.5.
- drawsint, optional
Number of draws from the Dirichlet-multinomial posterior distribution. Default is 128.
- p_adjuststr, optional
Method to correct p-values for multiple comparisons. Options are Holm- Boniferroni (“holm” or “holm-bonferroni”) (default), Benjamini- Hochberg (“bh”, “fdr_bh” or “benjamini-hochberg”), or any method supported by statsmodels’
multipletests
function. Case-insensitive. If None, no correction will be performed.- seedint, Generator or RandomState, optional
A user-provided random seed or random generator instance for drawing from the Dirichlet distribution. See
details
.- re_formulastr, optional
Random coefficient formula. See
MixedLM.from_formula
for details.- vc_formulastr, optional
Variance component formula. See
MixedLM.from_formula
for details.- model_kwargsdict, optional
Additional keyword arguments to pass to
MixedLM
.- fit_methodstr or list of str, optional
Optimization method for model fitting. Can be a single method name, or a list of method names to be tried sequentially. See statsmodels.optimization for available methods. If None, a default list of methods will be tried.
- fit_convergebool, optional
If True, model fittings that were completed but did not converge will be excluded from the calculation of final statistics. Default is False.
- fit_warningsbool, optional
Issue warnings if any during the model fitting process. Default is False. Warnings are usually issued when the optimization methods do not converge, which is common in the analysis. Default is False.
- fit_kwargsdict, optional
Additional keyword arguments to pass to
MixedLM.fit
.
- Returns:
- pd.DataFrame
A table of features and covariates, their log-fold changes and other relevant statistics.
FeatureID
: Feature identifier, i.e., dependent variable.Covariate
: Covariate name, i.e., independent variable.Reps
: Number of Dirichlet-multinomial posterior draws that supported the reported statistics, i.e., the number of successful model fittings on this feature. Max:draws
(if none failed). Min: 0 (in which case all statistics are NaN).Log2(FC)
: Expected log2-fold change of abundance from the reference category to the covariate category defined in the formula. The value is expressed in the center log ratio (seeclr
) transformed coordinates. The reported value is the average of all of the log2-fold changes computed from each of the posterior draws.CI(2.5)
: 2.5% quantile of the log2-fold change. The reported value is the minimum of all of the 2.5% quantiles computed from each of the posterior draws.CI(97.5)
: 97.5% quantile of the log2-fold change. The reported value is the maximum of all of the 97.5% quantiles computed from each of the posterior draws.pvalue
: p-value of the linear mixed effects model. The reported value is the average of all of the p-values computed from each of the posterior draws.qvalue
: Corrected p-value of the linear mixed effects model for multiple comparisons. The reported value is the average of all of the q-values computed from each of the posterior draws.Signif
: Whether the covariate category is significantly differentially abundant from the reference category. A feature-covariate pair marked as “True” suffice: 1) The q-value must be less than or equal to the significance level (0.05). 2) The confidence interval (CI(2.5)..CI(97.5)) must not overlap with zero.
See also
Examples
>>> import pandas as pd >>> from skbio.stats.composition import dirmult_lme >>> table = pd.DataFrame( ... [[1.00000053, 6.09924644], ... [0.99999843, 7.0000045], ... [1.09999884, 8.08474053], ... [1.09999758, 1.10000349], ... [0.99999902, 2.00000027], ... [1.09999862, 2.99998318], ... [1.00000084, 2.10001257], ... [0.9999991, 3.09998418], ... [0.99999899, 3.9999742], ... [1.10000124, 5.0001796], ... [1.00000053, 6.09924644], ... [1.10000173, 6.99693644]], ... index=['u1', 'u2', 'u3', 'x1', 'x2', 'x3', 'y1', 'y2', 'y3', 'z1', ... 'z2', 'z3'], ... columns=['Y1', 'Y2']) >>> metadata = pd.DataFrame( ... {'patient': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4], ... 'treatment': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2], ... 'time': [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3],}, ... index=['x1', 'x2', 'x3', 'y1', 'y2', 'y3', 'z1', 'z2', 'z3', 'u1', ... 'u2', 'u3']) >>> result = dirmult_lme(table, metadata, formula='time + treatment', ... grouping='patient', seed=0, p_adjust='sidak') >>> result FeatureID Covariate Reps Log2(FC) CI(2.5) CI(97.5) pvalue \ 0 Y1 time 128 -0.210769 -1.532255 1.122148 0.403737 1 Y1 treatment 128 -0.744061 -3.401978 1.581917 0.252057 2 Y2 time 128 0.210769 -1.122148 1.532255 0.403737 3 Y2 treatment 128 0.744061 -1.581917 3.401978 0.252057 qvalue Signif 0 0.644470 False 1 0.440581 False 2 0.644470 False 3 0.440581 False