skbio.stats.composition.dirmult_ttest#

skbio.stats.composition.dirmult_ttest(table, grouping, treatment, reference, pseudocount=0.5, draws=128, p_adjust='holm', seed=None)[source]#

T-test using Dirichlet-multinomial distribution.

The Dirichlet-multinomial distribution is a compound distribution that combines a Dirichlet distribution over the probabilities of a multinomial distribution. This distribution is used to model the distribution of species abundances in a community.

To perform the t-test, we first fit a Dirichlet-multinomial distribution for each sample, and then we compute the fold change and p-value for each feature. The fold change is computed as the difference between the samples of the two groups. t-tests are then performed on the posterior samples, drawn from each Dirichlet-multinomial distribution. The log-fold changes as well as their credible intervals, the p-values and the multiple comparison corrected p-values are reported.

This process mirrors the approach performed by the R package “ALDEx2” [1].

Parameters:

tablepd.DataFrame: Contingency table of counts where rows are features and columns are samples.
groupingpd.Series: Vector indicating the assignment of samples to groups. For example, these could be strings or integers denoting which group a sample belongs to. It must be the same length as the samples in table. The index must be the same on table and grouping but need not be in the same order. The t-test is computed between the treatment group and the reference group specified in the grouping vector.
treatmentstr: Name of the treatment group.
referencestr: Name of the reference group.
pseudocountfloat, optional: A non-zero value added to the input counts to ensure that all of the estimated abundances are strictly greater than zero.
drawsint, optional: The number of draws from the Dirichilet-multinomial posterior distribution More draws provide higher uncertainty surrounding the estimated log-fold changes and p-values.
p_adjuststr or None, optional: Method to correct p-values for multiple comparisons. Options are Holm- Boniferroni (“holm” or “holm-bonferroni”) (default), Benjamini- Hochberg (“bh”, “fdr_bh” or “benjamini-hochberg”), or any method supported by statsmodels’ multipletests function. Case-insensitive. If None, no correction will be performed.
seedint, Generator or RandomState, optional: A user-provided random seed or random generator instance for drawing from the Dirichlet distribution. See details.

Returns:

pd.DataFrame

A table of features, their log-fold changes and other relevant statistics.

T statistic is the t-statistic outputted from the t-test. t-statistics are generated from each posterior draw. The reported T statistic is the average across all of the posterior draws.

df is the degrees of freedom from the t-test.

Log2(FC) is the expected log2-fold change. Within each posterior draw the log2 fold-change is computed as the difference between the mean log-abundance the treatment group and the reference group. All log2 fold changes are expressed in clr coordinates. The reported Log2(FC) is the average of all of the log2-fold changes computed from each of the posterior draws.

CI(2.5) is the 2.5% quantile of the log2-fold change. The reported CI(2.5) is the 2.5% quantile of all of the log2-fold changes computed from each of the posterior draws.

CI(97.5) is the 97.5% quantile of the log2-fold change. The reported CI(97.5) is the 97.5% quantile of all of the log2-fold changes computed from each of the posterior draws.

pvalue is the p-value of the t-test. The reported values are the average of all of the p-values computed from the t-tests calculated across all of the posterior draws.

qvalue is the p-value of the t-test after performing multiple comparison correction.

Reject null hypothesis indicates if feature is differentially abundant across groups (True) or not (False). In order for a feature to be differentially abundant, the qvalue needs to be significant (i.e. <0.05) and the confidence intervals reported by CI(2.5) and CI(97.5) must not overlap with zero.

See also

scipy.stats.ttest_ind

Notes

The confidence intervals are computed using the mininum 2.5% and maximum 97.5% bounds computed across all of the posterior draws.

The reference frame here is the geometric mean. Extracting absolute log fold changes from this test assumes that the average feature abundance between the treatment and the reference groups are the same. If this assumption is violated, then the log-fold changes will be biased, and the p-values will not be reliable. However, the bias is the same across each feature, as a result the ordering of the log-fold changes can still be useful.

One benefit of using the Dirichlet-multinomial distribution is that the statistical power increases with regards to the abundance magnitude. More counts per sample will shrink the size of the confidence intervals, and can result in lower p-values.

References

[1]

Fernandes, A. D., Reid, J. N., Macklaim, J. M., McMurrough, T. A., Edgell, D. R., & Gloor, G. B. (2014). Unifying the analysis of high-throughput sequencing datasets: characterizing RNA-seq, 16S rRNA gene sequencing and selective growth experiments by compositional data analysis. Microbiome, 2, 1-13.

Examples

>>> import pandas as pd
>>> from skbio.stats.composition import dirmult_ttest
>>> table = pd.DataFrame([[20,  110, 100, 101, 100, 103, 104],
...                       [33,  110, 120, 100, 101, 100, 102],
...                       [12,  110, 100, 110, 100, 50,  90],
...                       [202, 201, 9,  10, 10, 11, 11],
...                       [200, 202, 10, 10, 13, 10, 10],
...                       [203, 201, 14, 10, 10, 13, 12]],
...                      index=['s1', 's2', 's3', 's4', 's5', 's6'],
...                      columns=['b1', 'b2', 'b3', 'b4', 'b5', 'b6', 'b7'])
>>> grouping = pd.Series(['treatment', 'treatment', 'treatment',
...                       'placebo', 'placebo', 'placebo'],
...                      index=['s1', 's2', 's3', 's4', 's5', 's6'])
>>> lfc_result = dirmult_ttest(table, grouping, 'treatment', 'placebo',
...                            seed=0)
>>> lfc_result[["Log2(FC)", "CI(2.5)", "CI(97.5)", "qvalue"]]
    Log2(FC)   CI(2.5)  CI(97.5)    qvalue
b1 -4.991987 -7.884498 -2.293463  0.020131
b2 -2.533729 -3.594590 -1.462339  0.007446
b3  1.627677 -1.048219  4.750792  0.068310
b4  1.707221 -0.467481  4.164998  0.065613
b5  1.528243 -1.036910  3.978387  0.068310
b6  1.182343 -0.702656  3.556061  0.068310
b7  1.480232 -0.601277  4.043888  0.068310