skbio.stats.composition.dirmult_ttest#
- skbio.stats.composition.dirmult_ttest(table, grouping, treatment=None, reference=None, pseudocount=0.5, draws=128, p_adjust='holm', seed=None)[source]#
Perform t-test using Dirichlet-multinomial distribution.
The Dirichlet-multinomial distribution is a compound distribution that combines a Dirichlet distribution over the probabilities of a multinomial distribution. This distribution models the distribution of species abundances in a community.
Abundance data of each sample is first fit to a Dirichlet-multinomial distribution. The posteriors are then transformed to the center log ratio (CLR) coordinates (see
clr), based on which the fold changes (differences between the treatment/reference group means) are calculated and Welch’s t-tests are performed to assess the significance. This process is repeated for multiple draws to provide robust estimates of the statistics. The fold changes, their confidence intervals, the p-values and multiple-comparison corrected p-values are reported.This process mirrors the approach performed by the R package “ALDEx2” [1].
Additionally, this function excludes hits with a 95% confidence interval of fold-change crossing zero during any draw. This step further reduces false positive hits, especially among low-abundance features.
Changed in version 0.7.0: Computational efficiency significantly improved.
- Parameters:
- tabletable_like of shape (n_samples, n_features)
A matrix containing count or proportional abundance data of the samples. See supported formats. Counts are recommended over proportions for lower statistical uncertainty. See Notes for details.
- groupingpd.Series or 1-D array_like
Vector indicating the assignment of samples to groups. These could be strings or integers denoting which group a sample belongs to. If it is a pandas Series and the table contains sample IDs, its index will be filtered and reordered to match the sample IDs. Otherwise, it must be the same length as the samples in the table.
- treatmentstr, optional
Name of the treatment group. The t-test is computed between the
treatmentgroup and thereferencegroup specified in thegroupingvector. If omitted, the first group in the sorted order of all group names will be the treatment group.- referencestr, optional
Name of the reference group. See above. If omitted, all groups other than the treatment group will be combined as the reference group.
Changed in version 0.7.0:
treatmentandreferenceare now optional.- pseudocountfloat, optional
A non-zero value added to the input counts to ensure that all of the estimated abundances are strictly greater than zero. Default is 0.5.
- drawsint, optional
The number of draws from the Dirichlet-multinomial posterior distribution. More draws provide more robust estimates of uncertainty for the log-fold changes and p-values. Default is 128.
- p_adjuststr, optional
Method to correct p-values for multiple comparisons. Options are Holm-Boniferroni (“holm” or “holm-bonferroni”) (default), Benjamini-Hochberg (“bh”, “fdr_bh” or “benjamini-hochberg”), or any method supported by statsmodels’
multipletestsfunction. Case-insensitive. If None, no correction will be performed.- seedint, Generator or RandomState, optional
A user-provided random seed or random generator instance for drawing from the Dirichlet distribution. See
details.
- Returns:
- pd.DataFrame
A table of features, their log-fold changes and other relevant statistics.
T-statistic: t-statistic of Welch’s t-test. The reported value is the average across all of the posterior draws.Log2(FC): Expected log2-fold change of abundance from the reference group to the treatment group. The value is expressed in the CLR-transformed coordinates. The reported value is the average of all of the log2-fold changes computed from each of the posterior draws.CI(2.5): 2.5% quantile of the log2-fold change. The reported value is the minimum of per-draw 2.5% quantiles across all posterior draws.CI(97.5): 97.5% quantile of the log2-fold change. The reported value is the maximum of per-draw 97.5% quantiles across all posterior draws.pvalue: p-value of Welch’s t-test. The reported value is the average of the per-draw p-values across all posterior draws.qvalue: Corrected p-value of Welch’s t-test for multiple comparisons. The reported value is the average of the per-draw q-values across all posterior draws.Signif: Whether feature is significantly differentially abundant between the treatment and reference groups. A feature marked as “True” suffice: 1) The q-value must be less than or equal to the significance level (0.05). 2) The confidence interval (CI(2.5)..CI(97.5)) must not overlap with zero.
Changed in version 0.7.0:
df(degrees of freedom) was removed from the report, as this metric is inconsistent across draws.Changed in version 0.7.0: Renamed
T statisticasT-statistic. RenamedReject null hypothesisasSignif.
Notes
A benefit of using the Dirichlet-multinomial distribution is that the statistical uncertainty decreases as the magnitude of abundance increases. Larger counts per sample (e.g., resulting from higher sequencing depth) will reduce the uncertainty of the estimated statistics, resulting in lower p-values for true positives. This is in contrast to many other statistical tests.
Therefore, it is generally recommended to provide raw counts instead of pre-normalized proportions, as the latter will remove the magnitude information, which can lead to higher uncertainty in the estimates and increased false negatives.
The confidence intervals are computed by taking the minimum of all per-draw 2.5% bounds and the maximum of all per-draw 97.5% bounds across posterior draws. This produces a conservative (wider) interval that captures variability across draws.
The reference frame in this analysis is the geometric mean. Extracting absolute log-fold changes from this test assumes that the average feature abundance between the treatment and reference groups are identical. If this assumption is violated, the log-fold changes will be biased, and the p-values will not be reliable. However, the bias is the same across each feature, as a result the ordering of the log-fold changes can still be useful.
References
[1]Fernandes, A. D., Reid, J. N., Macklaim, J. M., McMurrough, T. A., Edgell, D. R., & Gloor, G. B. (2014). Unifying the analysis of high-throughput sequencing datasets: characterizing RNA-seq, 16S rRNA gene sequencing and selective growth experiments by compositional data analysis. Microbiome, 2, 1-13.
Examples
>>> import pandas as pd >>> from skbio.stats.composition import dirmult_ttest >>> table = pd.DataFrame([[20, 110, 100, 101, 100, 103, 104], ... [33, 110, 120, 100, 101, 100, 102], ... [12, 110, 100, 110, 100, 50, 90], ... [202, 201, 9, 10, 10, 11, 11], ... [200, 202, 10, 10, 13, 10, 10], ... [203, 201, 14, 10, 10, 13, 12]], ... index=['s1', 's2', 's3', 's4', 's5', 's6'], ... columns=['b1', 'b2', 'b3', 'b4', 'b5', 'b6', 'b7']) >>> grouping = pd.Series(['treatment', 'treatment', 'treatment', ... 'placebo', 'placebo', 'placebo'], ... index=['s1', 's2', 's3', 's4', 's5', 's6']) >>> result = dirmult_ttest(table, grouping, 'treatment', 'placebo', seed=0) >>> result T-statistic Log2(FC) CI(2.5) CI(97.5) pvalue qvalue Signif b1 -17.178600 -4.991987 -7.884498 -2.293463 0.003355 0.020131 True b2 -16.873187 -2.533729 -3.594590 -1.462339 0.001064 0.007446 True b3 6.942727 1.627677 -1.048219 4.750792 0.021130 0.068310 False b4 6.522786 1.707221 -0.467481 4.164998 0.013123 0.065613 False b5 6.654142 1.528243 -1.036910 3.978387 0.019360 0.068310 False b6 3.839520 1.182343 -0.702656 3.556061 0.045376 0.068310 False b7 7.600734 1.480232 -0.601277 4.043888 0.017077 0.068310 False