skbio.stats.composition.dirmult_lme#
- skbio.stats.composition.dirmult_lme(table, metadata, formula, grouping=None, pseudocount=0.5, draws=128, p_adjust='holm', seed=None, re_formula=None, vc_formula=None, model_kwargs={}, fit_method=None, fit_warnings=False, fit_kwargs={})[source]#
Fit a Dirichlet-multinomial linear mixed effects model.
The Dirichlet-multinomial distribution is a compound distribution that combines a Dirichlet distribution over the probabilities of a multinomial distribution. This distribution is used to model the distribution of species abundances in a community.
To fit the linear mixed effect model we first fit a Dirichlet-multinomial distribution for each sample, and then we compute the fold change and p-value for each feature. The fold change is computed as the slopes from the resulting model. Statistical tests are then performed on the posterior samples, drawn from each Dirichlet-multinomial distribution. The log-fold changes as well as their credible intervals, the p-values and the multiple comparison corrected p-values are reported.
This function uses the
MixedLM
class from statsmodels.- Parameters:
- tablearray-like
The data for the model. If data is a pd.DataFrame, it must contain the dependent variables in data.columns. If data is not a pd.DataFrame, it must contain the dependent variable in indices of data. data can be a a numpy structured array, or a numpy recarray, or a dictionary. It must not contain duplicate indices.
- metadata: array-like
The metadata for the model. If metadata is a pd.DataFrame, it must contain the covariates in metadata.columns. If metadata is not a pd.DataFrame, it must contain the covariates in indices of metadata. metadata can be a numpy structured array, or a numpy recarray, or a dictionary. It must not contain duplicate indices.
- formulastr or generic Formula object
The formula specifying the model.
- groupingstr
The column name in data that identifies the grouping variable.
- pseudocountfloat, optional
A non-zero value added to the input counts to ensure that all of the estimated abundances are strictly greater than zero. Default is 0.5.
- drawsint, optional
The number of draws from the Dirichilet-multinomial posterior distribution. Default is 128.
- p_adjuststr or None, optional
Method to correct p-values for multiple comparisons. Options are Holm- Boniferroni (“holm” or “holm-bonferroni”) (default), Benjamini- Hochberg (“bh”, “fdr_bh” or “benjamini-hochberg”), or any method supported by statsmodels’
multipletests
function. Case-insensitive. If None, no correction will be performed.- seedint or np.random.Generator, optional
A user-provided random seed or random generator instance for drawing from the Dirichlet distribution.
- re_formulastr, optional
Random coefficient formula. See
MixedLM.from_formula
.- vc_formulastr, optional
Variance component formula. See
MixedLM.from_formula
.- model_kwargsdict, optional
Additional keyword arguments to pass to
MixedLM.from_formula
- fit_methodstr or list of str, optional
Optimization method for model fitting. Can be a single method name, or a list of method names to be tried sequentially. See statsmodels.optimization for available methods. If None, a default list of methods will be tried.
- fit_warningsbool, optional
Issue warnings if any during the model fitting process. Default is False. Warnings are usually issued when the optimization methods do not converge, which is common in the analysis.
- fit_kwargsdict, optional
Additional keyword arguments to pass to
MixedLM.fit
.
- Returns:
- pd.DataFrame
A table of features, their log-fold changes and other relevant statistics.
FeatureID
is the feature identifier, ie: dependent variablesCovariate
is the covariate name, ie: independent variablesLog2(FC)
is the expected log2-fold change. The reportedLog2(FC)
is the average of all of the log2-fold changes computed from each of the posterior draws.CI(2.5)
is the 2.5% quantile of the log2-fold change. The reportedCI(2.5)
is the average of the 2.5% quantile of all of the log2-fold changes computed from each of the posterior draws.CI(97.5)
is the 97.5% quantile of the log2-fold change. The reportedCI(97.5)
is the average of the 97.5% quantile of all of the log2-fold changes computed from each of the posterior draws.pvalue
is the p-value of the linear mixed effects model. The reported values are the average of all of the p-values computed from the linear mixed effects models calculated across all of the posterior draws.qvalue
is the corrected p-value of the linear mixed effects model for multiple comparisons. The reported values are the average of all of the q-values computed from the linear mixed effects models calculated across all of the posterior draws.
See also
Examples
>>> import pandas as pd >>> from skbio.stats.composition import dirmult_lme >>> table = pd.DataFrame( ... [[1.00000053, 6.09924644], ... [0.99999843, 7.0000045], ... [1.09999884, 8.08474053], ... [1.09999758, 1.10000349], ... [0.99999902, 2.00000027], ... [1.09999862, 2.99998318], ... [1.00000084, 2.10001257], ... [0.9999991, 3.09998418], ... [0.99999899, 3.9999742], ... [1.10000124, 5.0001796], ... [1.00000053, 6.09924644], ... [1.10000173, 6.99693644]], ... index=['u1', 'u2', 'u3', 'x1', 'x2', 'x3', 'y1', 'y2', 'y3', 'z1', ... 'z2', 'z3'], ... columns=['Y1', 'Y2']) >>> metadata = pd.DataFrame( ... {'patient': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4], ... 'treatment': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2], ... 'time': [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3],}, ... index=['x1', 'x2', 'x3', 'y1', 'y2', 'y3', 'z1', 'z2', 'z3', 'u1', ... 'u2', 'u3']) >>> result = dirmult_lme(table, metadata, formula='time + treatment', ... grouping='patient', seed=0, p_adjust='sidak') >>> result FeatureID Covariate Log2(FC) CI(2.5) CI(97.5) pvalue qvalue 0 Y1 time -0.210769 -1.571095 1.144057 0.411140 0.879760 1 Y1 treatment -0.164704 -3.456697 3.384563 0.593769 0.972767 2 Y2 time 0.210769 -1.144057 1.571095 0.411140 0.879760 3 Y2 treatment 0.164704 -3.384563 3.456697 0.593769 0.972767