skbio.stats.composition.dirmult_ttest#

skbio.stats.composition.dirmult_ttest(table, grouping, treatment=None, reference=None, pseudocount=0.5, draws=128, p_adjust='holm', seed=None)[source]#

T-test using Dirichlet-multinomial distribution.

The Dirichlet-multinomial distribution is a compound distribution that combines a Dirichlet distribution over the probabilities of a multinomial distribution. This distribution is used to model the distribution of species abundances in a community.

To perform the t-test, we first fit a Dirichlet-multinomial distribution for each sample, and then we compute the fold change and p-value for each feature. The fold change is computed as the difference between the samples of the two groups. t-tests are then performed on the posterior samples, drawn from each Dirichlet-multinomial distribution. The log-fold changes as well as their credible intervals, the p-values and the multiple comparison corrected p-values are reported.

This process mirrors the approach performed by the R package “ALDEx2” [1].

Additionally, this function excludes hits with a 95% confidence interval of fold-change crossing zero during any draw. This step further reduces false positive hits, especially among low-abundance features.

Changed in version 0.7.0: Computational efficiency significantly improved.

Parameters:

tabletable_like of shape (n_samples, n_features): A matrix containing count or proportional abundance data of the samples. See supported formats.
groupingpd.Series or 1-D array_like: Vector indicating the assignment of samples to groups. These could be strings or integers denoting which group a sample belongs to. If it is a pandas Series and the table contains sample IDs, its index will be filtered and reordered to match the sample IDs. Otherwise, it must be the same length as the samples in the table.
treatmentstr, optional: Name of the treatment group. The t-test is computed between the treatment group and the reference group specified in the grouping vector. If omitted, the first group in the sorted order of all group names will be the treatment group.
referencestr, optional: Name of the reference group. See above. If omitted, all groups other than the treatment group will be combined as the reference group.

Changed in version 0.7.0: treatment and reference are now optional.
pseudocountfloat, optional: A non-zero value added to the input counts to ensure that all of the estimated abundances are strictly greater than zero.
drawsint, optional: The number of draws from the Dirichilet-multinomial posterior distribution More draws provide higher uncertainty surrounding the estimated log-fold changes and p-values.
p_adjuststr, optional: Method to correct p-values for multiple comparisons. Options are Holm- Boniferroni (“holm” or “holm-bonferroni”) (default), Benjamini- Hochberg (“bh”, “fdr_bh” or “benjamini-hochberg”), or any method supported by statsmodels’ multipletests function. Case-insensitive. If None, no correction will be performed.
seedint, Generator or RandomState, optional: A user-provided random seed or random generator instance for drawing from the Dirichlet distribution. See details.

Returns:

pd.DataFrame

A table of features, their log-fold changes and other relevant statistics.

T-statistic: t-statistic of Welch’s t-test. The reported value is the average across all of the posterior draws.
Log2(FC): Expected log2-fold change of abundance from the reference group to the treatment group. The value is expressed in the center log ratio (see clr) transformed coordinates. The reported value is the average of all of the log2-fold changes computed from each of the posterior draws.
CI(2.5): 2.5% quantile of the log2-fold change. The reported value is the minimum of all of the 2.5% quantiles computed from each of the posterior draws.
CI(97.5): 97.5% quantile of the log2-fold change. The reported value is the maximum of all of the 97.5% quantiles computed from each of the posterior draws.
pvalue: p-value of Welch’s t-test. The reported value is the average of all of the p-values computed from each of the posterior draws.
qvalue: Corrected p-value of Welch’s t-test for multiple comparisons. The reported value is the average of all of the q-values computed from each of the posterior draws.
Signif: Whether feature is significantly differentially abundant between the treatment and reference groups. A feature marked as “True” suffice: 1) The q-value must be less than or equal to the significance level (0.05). 2) The confidence interval (CI(2.5)..CI(97.5)) must not overlap with zero.

Changed in version 0.7.0: df (degrees of freedom) was removed from the report, as this metric is inconsistent across draws.

Changed in version 0.7.0: Renamed T statistic as T-statistic. Renamed Reject null hypothesis as Signif.

See also

dirmult_ttest
scipy.stats.ttest_ind
statsmodels.stats.weightstats.CompareMeans

Notes

The confidence intervals are computed using the mininum 2.5% and maximum 97.5% bounds computed across all of the posterior draws.

The reference frame here is the geometric mean. Extracting absolute log fold changes from this test assumes that the average feature abundance between the treatment and the reference groups are the same. If this assumption is violated, then the log-fold changes will be biased, and the p-values will not be reliable. However, the bias is the same across each feature, as a result the ordering of the log-fold changes can still be useful.

One benefit of using the Dirichlet-multinomial distribution is that the statistical power increases with regards to the abundance magnitude. More counts per sample will shrink the size of the confidence intervals, and can result in lower p-values.

References

[1]

Fernandes, A. D., Reid, J. N., Macklaim, J. M., McMurrough, T. A., Edgell, D. R., & Gloor, G. B. (2014). Unifying the analysis of high-throughput sequencing datasets: characterizing RNA-seq, 16S rRNA gene sequencing and selective growth experiments by compositional data analysis. Microbiome, 2, 1-13.

Examples

>>> import pandas as pd
>>> from skbio.stats.composition import dirmult_ttest
>>> table = pd.DataFrame([[20,  110, 100, 101, 100, 103, 104],
...                       [33,  110, 120, 100, 101, 100, 102],
...                       [12,  110, 100, 110, 100, 50,  90],
...                       [202, 201, 9,  10, 10, 11, 11],
...                       [200, 202, 10, 10, 13, 10, 10],
...                       [203, 201, 14, 10, 10, 13, 12]],
...                      index=['s1', 's2', 's3', 's4', 's5', 's6'],
...                      columns=['b1', 'b2', 'b3', 'b4', 'b5', 'b6', 'b7'])
>>> grouping = pd.Series(['treatment', 'treatment', 'treatment',
...                       'placebo', 'placebo', 'placebo'],
...                      index=['s1', 's2', 's3', 's4', 's5', 's6'])
>>> result = dirmult_ttest(table, grouping, 'treatment', 'placebo', seed=0)
>>> result
    T-statistic  Log2(FC)   CI(2.5)  CI(97.5)    pvalue    qvalue  Signif
b1   -17.178600 -4.991987 -7.884498 -2.293463  0.003355  0.020131    True
b2   -16.873187 -2.533729 -3.594590 -1.462339  0.001064  0.007446    True
b3     6.942727  1.627677 -1.048219  4.750792  0.021130  0.068310   False
b4     6.522786  1.707221 -0.467481  4.164998  0.013123  0.065613   False
b5     6.654142  1.528243 -1.036910  3.978387  0.019360  0.068310   False
b6     3.839520  1.182343 -0.702656  3.556061  0.045376  0.068310   False
b7     7.600734  1.480232 -0.601277  4.043888  0.017077  0.068310   False