skbio.stats.composition.ancombc#

skbio.stats.composition.ancombc(table, metadata, formula, grouping=None, max_iter=100, tol=1e-05, alpha=0.05, p_adjust='holm')[source]#

Perform differential abundance test using ANCOM-BC.

Analysis of compositions of microbiomes with bias correction (ANCOM-BC) [1] is a differential abundance testing method featuring the estimation and correction for the bias of differential sampling fractions.

Added in version 0.7.1.

Parameters:
tabletable_like of shape (n_samples, n_features)

A matrix containing strictly positive count or proportional abundance data of the samples. See supported formats.

Note

If the table contains zero values, one should add a pseudocount or apply multi_replace to convert all values into positive numbers.

metadatapd.DataFrame or 2-D array_like

Metadata of the samples. Rows correspond to samples and columns correspond to covariates (attributes). Must be a pandas DataFrame or convertible to a pandas DataFrame.

formulastr or generic Formula object

A formula defining the model using factors included in the metadata columns. Refer to Patsy’s documentation on how to specify a formula.

groupingstr, optional

A metadata column name of interest for global test, which identifies features that are differentially abundant between at least two groups across three or more groups in that column. Must be one of the factors in formula. Default is None, which skips global test.

max_iterint, optional

Maximum number of iterations for the bias estimation process. Default is 100.

tolfloat, optional

Absolute convergence tolerance for the bias estimation process. Default is 1e-5.

alphafloat, optional

Significance level for the statistical tests. Must be in the range of (0, 1). Default is 0.05.

p_adjuststr, optional

Method to correct p-values for multiple comparisons. Options are Holm-Boniferroni (“holm” or “holm-bonferroni”) (default), Benjamini-Hochberg (“bh”, “fdr_bh” or “benjamini-hochberg”), or any method supported by statsmodels’ multipletests function. Case-insensitive. If None, no correction will be performed.

Returns:
res_mainpd.DataFrame

A table of features and covariates, their log-fold changes and other relevant statistics of the primary ANCOM-BC analysis.

  • FeatureID: Feature identifier, i.e., dependent variable.

  • Covariate: Covariate name, i.e., independent variable.

  • Log2(FC): Expected log2-fold change of abundance from the reference category to the covariate category defined in the formula. The value is expressed in the center log ratio (see clr) transformed coordinates.

  • SE: Standard error of the estimated Log2(FC).

  • W: W-statistic, or the number of features that the current feature is tested to be significantly different against.

  • pvalue: p-value of the ANCOM-BC test.

  • qvalue: Corrected p-value of the ANCOM-BC test for multiple comparisons.

  • Signif: Whether the covariate category is significantly differentially abundant from the reference category. A feature-covariate pair is marked as “True” if the q-value is less than or equal to the significance level (alpha).

res_globalpd.DataFrame, optional

A table of features and statistics from the global test (when grouping is set).

  • FeatureID: Feature identifier, i.e., dependent variable.

  • W: W-statistic of Chi-squared statistic that quantifies the overall evidence against null hypothesis (the mean abundance of the feature is the same across all groups).

  • pvalue: p-value of the W-statistic in global test.

  • qvalue: Corrected p-value in global test.

  • Signif: Whether at least one group mean is different from other in grouping variable.

Notes

This function is a Python re-implementation of the ANCOM-BC method [1], which was originally implemented in the R package ANCOMBC. This function provides an efficient and scalable algorithm, with a simple interface consistent with other scikit-bio components. The output of this function should match that of the R package.

Comparing with the R command ancombc, this function defers the flexibility of data preprocessing to the user. Most importantly, if the input table contains zero values (which is common), one needs to remove them by, e.g., adding a pseudocount of 1 (pseudo=1 in the R command’s parameters) (assuming table is a pandas DataFrame):

table += 1

See also multi_replace for additional information on zero handling.

Some other pre-processing options provided by the R command can be performed with:

To aggregate data at a given taxonomic level (tax_level="Family"):

table = table.T.groupby(feature_to_family_dict).sum().T

To discard features with prevalence < 10% among samples (prv_cut=0.1):

table = table.loc[:, (table > 0).mean() >= 0.1]

To discard samples with a total abundance < 1 million (lib_cut=1e6):

table = table.loc[table.sum(axis=1) >= 1e6]

Categorical columns in metadata are sorted alphabetically, and the reference level for each column is automatically set to be the first category. If this behavior is not intended, you will need to change the order manually, like:

metadata['bmi'] = pd.Categorical(
    metadata['bmi'], categories=['lean', 'overweight', 'obese'], ordered=True)

References

[1] (1,2)

Lin, H. and Peddada, S.D., 2020. Analysis of compositions of microbiomes with bias correction. Nature Communications, 11(1), p.3514.

Examples

>>> from skbio.stats.composition import ancombc
>>> import pandas as pd

A basic example

Let’s create a data table with six samples and seven features (e.g., these may be microbial taxa):

>>> samples = ['S1', 'S2', 'S3', 'S4', 'S5', 'S6']
>>> features = ['F1', 'F2', 'F3', 'F4', 'F5', 'F6', 'F7']
>>> table = pd.DataFrame(
...     [[ 2,  0,  4,  7,  0,  0,  1],
...      [ 1,  1,  0,  6,  5,  1, 10],
...      [ 3,  0,  2,  9,  6,  0,  1],
...      [ 0, 12,  1,  2,  0,  3,  2],
...      [ 2,  2, 37,  0,  0,  7,  3],
...      [10,  1,  0,  0,  4,  4,  3]],
...     index=samples, columns=features)

Then create a sampling grouping vector. In this example, there is a “healthy” group and a “sick” group.

>>> grouping = ['healthy', 'healthy', 'healthy', 'sick', 'sick', 'sick']
>>> metadata = pd.Series(grouping, index=samples, name='status').to_frame()

Now run ancombc to determine if there are any features that are significantly different in abundance between the healthy and the sick groups. Note that a pseudocount of 1 is manually added to the table to remove zero values.

>>> result = ancombc(table + 1, metadata, formula='status')
>>> result.round(3)
                          Log2(FC)     SE      W  pvalue  qvalue  Signif
FeatureID Covariate
F1        Intercept         -0.045  0.218 -0.207   0.836   1.000   False
          status[T.sick]     0.126  0.589  0.214   0.831   1.000   False
F2        Intercept         -0.873  0.137 -6.381   0.000   0.000    True
          status[T.sick]     1.241  0.538  2.307   0.021   0.105   False
F3        Intercept         -0.202  0.475 -0.425   0.671   1.000   False
          status[T.sick]     0.561  0.964  0.582   0.561   1.000   False
F4        Intercept          1.005  0.136  7.399   0.000   0.000    True
          status[T.sick]    -1.723  0.392 -4.399   0.000   0.000    True
F5        Intercept          0.141  0.422  0.335   0.738   1.000   False
          status[T.sick]    -0.690  0.625 -1.104   0.270   1.000   False
F6        Intercept         -0.873  0.137 -6.381   0.000   0.000    True
          status[T.sick]     1.480  0.160  9.255   0.000   0.000    True
F7        Intercept          0.157  0.401  0.391   0.695   1.000   False
          status[T.sick]     0.049  0.405  0.121   0.904   1.000   False

The covariate “status[T.sick]” stands for the effect of the “sick” group relative to the reference group, “healthy” (the first group in alphabetical order is automatically selected as the reference group). “Log2(FC)” represents the clr-transformed fold change of abundance (positive/negative: more/less abundant in “sick” than in “healthy”, respectively). A “True” in the “Signif” column indicates a significantly differentially abundant feature-covariate pair. This example shows that two features differ by healthy/sick status.

>>> result.query('Covariate != "Intercept" & Signif == True').round(3)
                          Log2(FC)     SE      W  pvalue  qvalue  Signif
FeatureID Covariate
F4        status[T.sick]    -1.723  0.392 -4.399     0.0     0.0    True
F6        status[T.sick]     1.480  0.160  9.255     0.0     0.0    True

An advanced example

Now we will create a complex dataset with 15 samples grouped into three disease status: “mild”, “moderate” and “severe”, plus age as a confounder. Our goal is to identify features that are differentially abundant between disease status.

>>> samples = [f'S{i}' for i in range(1, 16)]
>>> features = [f'F{i}' for i in range(1, 9)]
>>> data = [[ 2,  0,  7,  0,  0,  2,  3,  2],
...         [ 0,  2,  0,  1,  1,  2,  0,  0],
...         [ 3,  1,  0,  9,  0,  1,  1,  0],
...         [ 2,  0,  1,  8,  1,  0, 11, 46],
...         [ 1,  0,  1,  1,  0,  0,  2,  2],
...         [ 0,  3, 22,  1,  1,  1,  3,  0],
...         [ 1,  7, 16,  1,  0,  0,  2,  2],
...         [ 0,  5,  6,  1,  2,  1,  0,  1],
...         [ 1,  7,  0,  2,  1,  0,  3,  2],
...         [ 0,  6,  4,  2,  0,  2,  2,  1],
...         [ 3, 13,  7,  0,  0,  0,  3,  9],
...         [ 1,  8,  5,  1,  0,  0,  0,  0],
...         [ 0,  5, 14,  1,  0,  1,  0,  1],
...         [ 5, 26,  3,  2,  0,  3,  1,  3],
...         [ 0, 18,  7,  0,  0,  3,  1,  0]]
>>> table = pd.DataFrame(data, index=samples, columns=features)
>>> status = ['mild'] * 5 + ['moderate'] * 5 + ['severe'] * 5
>>> age = [39, 19, 20, 31, 15, 37, 27, 47, 26, 23, 39, 48, 46, 33, 36]
>>> metadata = pd.DataFrame({'status': status, 'age': age}, index=samples)

Run ancombc. This time, we specify two factors: “status” and “age” in the formula, such that the function will test the individual effects of each factor while controlling for the other. Additionally, we instruct the function to perform a global test on “status”, which identifies features that are differentially abundant between at least two status.

>>> res_main, res_global = ancombc(
...     table + 1, metadata, formula='status + age', grouping='status')
>>> res_main.round(3)
                              Log2(FC)     SE      W  pvalue  qvalue  Signif
FeatureID Covariate
F1        Intercept             -0.006  0.346 -0.016   0.987   1.000   False
          status[T.moderate]    -0.337  0.245 -1.376   0.169   1.000   False
          status[T.severe]       0.278  0.350  0.792   0.428   1.000   False
          age                    0.002  0.011  0.141   0.888   1.000   False
F2        Intercept              0.072  0.446  0.160   0.873   1.000   False
          status[T.moderate]     1.906  0.259  7.371   0.000   0.000    True
          status[T.severe]       2.935  0.304  9.649   0.000   0.000    True
          age                   -0.022  0.013 -1.676   0.094   0.562   False
F3        Intercept             -1.690  0.570 -2.967   0.003   0.021    True
          status[T.moderate]     1.010  0.578  1.747   0.081   0.565   False
          status[T.severe]       0.717  0.517  1.388   0.165   1.000   False
          age                    0.063  0.022  2.883   0.004   0.031    True
F4        Intercept              0.439  0.483  0.910   0.363   1.000   False
          status[T.moderate]    -0.046  0.405 -0.113   0.910   1.000   False
          status[T.severe]      -0.244  0.536 -0.455   0.649   1.000   False
          age                   -0.003  0.019 -0.185   0.854   1.000   False
F5        Intercept             -1.227  0.393 -3.125   0.002   0.014    True
          status[T.moderate]     0.274  0.293  0.934   0.350   1.000   False
          status[T.severe]      -0.323  0.320 -1.011   0.312   1.000   False
          age                    0.027  0.014  2.021   0.043   0.303   False
F6        Intercept             -0.439  0.440 -0.999   0.318   1.000   False
          status[T.moderate]     0.114  0.432  0.264   0.792   1.000   False
          status[T.severe]       0.375  0.544  0.691   0.490   1.000   False
          age                    0.008  0.016  0.480   0.631   1.000   False
F7        Intercept              0.201  0.472  0.426   0.670   1.000   False
          status[T.moderate]     0.082  0.302  0.270   0.787   1.000   False
          status[T.severe]      -0.264  0.401 -0.658   0.510   1.000   False
          age                    0.004  0.016  0.272   0.785   1.000   False
F8        Intercept             -0.168  0.526 -0.319   0.750   1.000   False
          status[T.moderate]    -0.402  0.584 -0.688   0.491   1.000   False
          status[T.severe]      -0.299  0.728 -0.411   0.681   1.000   False
          age                    0.022  0.018  1.237   0.216   1.000   False

We found that feature “F2” is significantly differentially (more) abundant in “moderate” and “severe” groups compared with “mild”, which serves as the reference group. Additionally, feature “F3” is separately correlated with age.

>>> res_main.query('Covariate != "Intercept" & Signif == True').round(3)
                              Log2(FC)     SE      W  pvalue  qvalue  Signif
FeatureID Covariate
F2        status[T.moderate]     1.906  0.259  7.371   0.000   0.000    True
          status[T.severe]       2.935  0.304  9.649   0.000   0.000    True
F3        age                    0.063  0.022  2.883   0.004   0.031    True

The global test result suggests that “F2” and “F4” are differentially abundant between two of the three groups (though it doesn’t tell which groups).

>>> res_global.round(3)
                W  pvalue  qvalue  Signif
FeatureID
F1          1.855   0.791   1.000   False
F2         80.771   0.000   0.000    True
F3          2.925   0.463   1.000   False
F4         -0.093   0.000   0.000    True
F5          1.068   0.827   1.000   False
F6          0.121   0.117   0.704   False
F7          0.220   0.208   1.000   False
F8          0.485   0.430   1.000   False

Structural zero test

The structural zero test identifies features that are systematically absent from certain sample groups. This test is an option of the R command ancombc. In scikit-bio, struc_zero is a standalone function, as it is generally useful with or without ANCOM-BC.

>>> from skbio.stats.composition import struc_zero
>>> res_zero = struc_zero(table, metadata, 'status')
>>> res_zero
     mild  moderate  severe
F1  False     False   False
F2  False     False   False
F3  False     False   False
F4  False     False   False
F5  False     False    True
F6  False     False   False
F7  False     False   False
F8  False     False   False

The result reveals that feature “F5” is a structural zero in the “severe” groups, as all or most of its values are zero. Although the ANCOM-BC test itself didn’t identify “F5”, we should consider it as differentially (less) abundant than in the other two groups.

We can use this additional information to update the global test result. The rule is that a feature should be considered as globally differentially abundant if it is a structural zero in at least one but not all groups.

>>> signif_global = res_global['Signif'] | (
...     ~res_zero.all(axis=1) & res_zero.any(axis=1))
>>> signif_global
FeatureID
F1    False
F2     True
F3    False
F4     True
F5     True
F6    False
F7    False
F8    False
dtype: bool

The main ANCOM-BC result can also be updated with structural zeros.

>>> signif_main = res_main.query(
...     'Covariate.str.startswith("status[T.")')['Signif'].unstack()
>>> signif_main.columns = signif_main.columns.str.removeprefix(
...     'status[T.').str.removesuffix(']')
>>> signif_zero = res_zero.loc[signif_main.index, signif_main.columns]
>>> signif_main |= signif_zero
>>> signif_main.columns.name = None
>>> signif_main
           moderate  severe
FeatureID
F1            False   False
F2             True    True
F3            False   False
F4            False   False
F5            False    True
F6            False   False
F7            False   False
F8            False   False