skbio.stats.composition.dirmult_ttest#
- skbio.stats.composition.dirmult_ttest(table, grouping, treatment, reference, pseudocount=0.5, draws=128, p_adjust='holm', seed=None)[source]#
T-test using Dirichlet-multinomial distribution.
The Dirichlet-multinomial distribution is a compound distribution that combines a Dirichlet distribution over the probabilities of a multinomial distribution. This distribution is used to model the distribution of species abundances in a community.
To perform the t-test, we first fit a Dirichlet-multinomial distribution for each sample, and then we compute the fold change and p-value for each feature. The fold change is computed as the difference between the samples of the two groups. t-tests are then performed on the posterior samples, drawn from each Dirichlet-multinomial distribution. The log-fold changes as well as their credible intervals, the p-values and the multiple comparison corrected p-values are reported.
This process mirrors the approach performed by the R package “ALDEx2” [1].
- Parameters:
- tablepd.DataFrame
Contingency table of counts where rows are features and columns are samples.
- groupingpd.Series
Vector indicating the assignment of samples to groups. For example, these could be strings or integers denoting which group a sample belongs to. It must be the same length as the samples in
table
. The index must be the same ontable
andgrouping
but need not be in the same order. The t-test is computed between thetreatment
group and thereference
group specified in thegrouping
vector.- treatmentstr
Name of the treatment group.
- referencestr
Name of the reference group.
- pseudocountfloat, optional
A non-zero value added to the input counts to ensure that all of the estimated abundances are strictly greater than zero.
- drawsint, optional
The number of draws from the Dirichilet-multinomial posterior distribution More draws provide higher uncertainty surrounding the estimated log-fold changes and p-values.
- p_adjuststr or None, optional
Method to correct p-values for multiple comparisons. Options are Holm- Boniferroni (“holm” or “holm-bonferroni”) (default), Benjamini- Hochberg (“bh”, “fdr_bh” or “benjamini-hochberg”), or any method supported by statsmodels’
multipletests
function. Case-insensitive. If None, no correction will be performed.- seedint or np.random.Generator, optional
A user-provided random seed or random generator instance.
- Returns:
- pd.DataFrame
A table of features, their log-fold changes and other relevant statistics.
T statistic
is the t-statistic outputted from the t-test. t-statistics are generated from each posterior draw. The reportedT statistic
is the average across all of the posterior draws.df
is the degrees of freedom from the t-test.Log2(FC)
is the expected log2-fold change. Within each posterior draw the log2 fold-change is computed as the difference between the mean log-abundance thetreatment
group and thereference
group. All log2 fold changes are expressed in clr coordinates. The reportedLog2(FC)
is the average of all of the log2-fold changes computed from each of the posterior draws.CI(2.5)
is the 2.5% quantile of the log2-fold change. The reportedCI(2.5)
is the 2.5% quantile of all of the log2-fold changes computed from each of the posterior draws.CI(97.5)
is the 97.5% quantile of the log2-fold change. The reportedCI(97.5)
is the 97.5% quantile of all of the log2-fold changes computed from each of the posterior draws.pvalue
is the p-value of the t-test. The reported values are the average of all of the p-values computed from the t-tests calculated across all of the posterior draws.qvalue
is the p-value of the t-test after performing multiple comparison correction.Reject null hypothesis
indicates if feature is differentially abundant across groups (True
) or not (False
). In order for a feature to be differentially abundant, the qvalue needs to be significant (i.e. <0.05) and the confidence intervals reported byCI(2.5)
andCI(97.5)
must not overlap with zero.
See also
Notes
The confidence intervals are computed using the mininum 2.5% and maximum 97.5% bounds computed across all of the posterior draws.
The reference frame here is the geometric mean. Extracting absolute log fold changes from this test assumes that the average feature abundance between the
treatment
and thereference
groups are the same. If this assumption is violated, then the log-fold changes will be biased, and the p-values will not be reliable. However, the bias is the same across each feature, as a result the ordering of the log-fold changes can still be useful.One benefit of using the Dirichlet-multinomial distribution is that the statistical power increases with regards to the abundance magnitude. More counts per sample will shrink the size of the confidence intervals, and can result in lower p-values.
References
[1]Fernandes et al. “Unifying the analysis of high-throughput sequencing datasets: characterizing RNA-seq, 16S rRNA gene sequencing and selective growth experiments by compositional data analysis.” Microbiome (2014).
Examples
>>> import pandas as pd >>> from skbio.stats.composition import dirmult_ttest >>> table = pd.DataFrame([[20, 110, 100, 101, 100, 103, 104], ... [33, 110, 120, 100, 101, 100, 102], ... [12, 110, 100, 110, 100, 50, 90], ... [202, 201, 9, 10, 10, 11, 11], ... [200, 202, 10, 10, 13, 10, 10], ... [203, 201, 14, 10, 10, 13, 12]], ... index=['s1', 's2', 's3', 's4', 's5', 's6'], ... columns=['b1', 'b2', 'b3', 'b4', 'b5', 'b6', ... 'b7']) >>> grouping = pd.Series(['treatment', 'treatment', 'treatment', ... 'placebo', 'placebo', 'placebo'], ... index=['s1', 's2', 's3', 's4', 's5', 's6']) >>> lfc_result = dirmult_ttest(table, grouping, 'treatment', 'placebo', ... seed=0) >>> lfc_result[["Log2(FC)", "CI(2.5)", "CI(97.5)", "qvalue"]] Log2(FC) CI(2.5) CI(97.5) qvalue b1 -4.991987 -7.884498 -2.293463 0.020131 b2 -2.533729 -3.594590 -1.462339 0.007446 b3 1.627677 -1.048219 4.750792 0.068310 b4 1.707221 -0.467481 4.164998 0.065613 b5 1.528243 -1.036910 3.978387 0.068310 b6 1.182343 -0.702656 3.556061 0.068310 b7 1.480232 -0.601277 4.043888 0.068310