skbio.stats.composition.dirmult_lme#

skbio.stats.composition.dirmult_lme(table, metadata, formula, grouping=None, pseudocount=0.5, draws=128, p_adjust='holm', seed=None, re_formula=None, vc_formula=None, model_kwargs={}, fit_method=None, fit_warnings=False, fit_kwargs={})[source]#

Fit a Dirichlet-multinomial linear mixed effects model.

The Dirichlet-multinomial distribution is a compound distribution that combines a Dirichlet distribution over the probabilities of a multinomial distribution. This distribution is used to model the distribution of species abundances in a community.

To fit the linear mixed effect model we first fit a Dirichlet-multinomial distribution for each sample, and then we compute the fold change and p-value for each feature. The fold change is computed as the slopes from the resulting model. Statistical tests are then performed on the posterior samples, drawn from each Dirichlet-multinomial distribution. The log-fold changes as well as their credible intervals, the p-values and the multiple comparison corrected p-values are reported.

This function uses the MixedLM class from statsmodels.

Parameters:
tablearray-like

The data for the model. If data is a pd.DataFrame, it must contain the dependent variables in data.columns. If data is not a pd.DataFrame, it must contain the dependent variable in indices of data. data can be a a numpy structured array, or a numpy recarray, or a dictionary. It must not contain duplicate indices.

metadata: array-like

The metadata for the model. If metadata is a pd.DataFrame, it must contain the covariates in metadata.columns. If metadata is not a pd.DataFrame, it must contain the covariates in indices of metadata. metadata can be a numpy structured array, or a numpy recarray, or a dictionary. It must not contain duplicate indices.

formulastr or generic Formula object

The formula specifying the model.

groupingstr

The column name in data that identifies the grouping variable.

pseudocountfloat, optional

A non-zero value added to the input counts to ensure that all of the estimated abundances are strictly greater than zero. Default is 0.5.

drawsint, optional

The number of draws from the Dirichilet-multinomial posterior distribution. Default is 128.

p_adjuststr or None, optional

Method to correct p-values for multiple comparisons. Options are Holm- Boniferroni (“holm” or “holm-bonferroni”) (default), Benjamini- Hochberg (“bh”, “fdr_bh” or “benjamini-hochberg”), or any method supported by statsmodels’ multipletests function. Case-insensitive. If None, no correction will be performed.

seedint, Generator or RandomState, optional

A user-provided random seed or random generator instance for drawing from the Dirichlet distribution. See details.

re_formulastr, optional

Random coefficient formula. See MixedLM.from_formula.

vc_formulastr, optional

Variance component formula. See MixedLM.from_formula.

model_kwargsdict, optional

Additional keyword arguments to pass to MixedLM.from_formula

fit_methodstr or list of str, optional

Optimization method for model fitting. Can be a single method name, or a list of method names to be tried sequentially. See statsmodels.optimization for available methods. If None, a default list of methods will be tried.

fit_warningsbool, optional

Issue warnings if any during the model fitting process. Default is False. Warnings are usually issued when the optimization methods do not converge, which is common in the analysis.

fit_kwargsdict, optional

Additional keyword arguments to pass to MixedLM.fit.

Returns:
pd.DataFrame

A table of features, their log-fold changes and other relevant statistics.

FeatureID is the feature identifier, ie: dependent variables

Covariate is the covariate name, ie: independent variables

Log2(FC) is the expected log2-fold change. The reported Log2(FC) is the average of all of the log2-fold changes computed from each of the posterior draws.

CI(2.5) is the 2.5% quantile of the log2-fold change. The reported CI(2.5) is the average of the 2.5% quantile of all of the log2-fold changes computed from each of the posterior draws.

CI(97.5) is the 97.5% quantile of the log2-fold change. The reported CI(97.5) is the average of the 97.5% quantile of all of the log2-fold changes computed from each of the posterior draws.

pvalue is the p-value of the linear mixed effects model. The reported values are the average of all of the p-values computed from the linear mixed effects models calculated across all of the posterior draws.

qvalue is the corrected p-value of the linear mixed effects model for multiple comparisons. The reported values are the average of all of the q-values computed from the linear mixed effects models calculated across all of the posterior draws.

Examples

>>> import pandas as pd
>>> from skbio.stats.composition import dirmult_lme
>>> table = pd.DataFrame(
...     [[1.00000053, 6.09924644],
...      [0.99999843, 7.0000045],
...      [1.09999884, 8.08474053],
...      [1.09999758, 1.10000349],
...      [0.99999902, 2.00000027],
...      [1.09999862, 2.99998318],
...      [1.00000084, 2.10001257],
...      [0.9999991, 3.09998418],
...      [0.99999899, 3.9999742],
...      [1.10000124, 5.0001796],
...      [1.00000053, 6.09924644],
...      [1.10000173, 6.99693644]],
...     index=['u1', 'u2', 'u3', 'x1', 'x2', 'x3', 'y1', 'y2', 'y3', 'z1',
...            'z2', 'z3'],
...     columns=['Y1', 'Y2'])
>>> metadata = pd.DataFrame(
...     {'patient': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
...      'treatment': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
...      'time': [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3],},
...     index=['x1', 'x2', 'x3', 'y1', 'y2', 'y3', 'z1', 'z2', 'z3', 'u1',
...            'u2', 'u3'])
>>> result = dirmult_lme(table, metadata, formula='time + treatment',
...                      grouping='patient', seed=0, p_adjust='sidak')
>>> result
  FeatureID  Covariate  Log2(FC)   CI(2.5)  CI(97.5)    pvalue    qvalue
0        Y1       time -0.210769 -1.571095  1.144057  0.411140  0.879760
1        Y1  treatment -0.164704 -3.456697  3.384563  0.593769  0.972767
2        Y2       time  0.210769 -1.144057  1.571095  0.411140  0.879760
3        Y2  treatment  0.164704 -3.384563  3.456697  0.593769  0.972767