scikit-bio: Bioinformatics in Python

A community-driven Python library for bioinformatics, providing versatile data structures, algorithms and educational resources.

For Researchers

Robust, performant and scalable algorithms tailored for the vast landscape of biological data analysis spanning genomics, microbiomics, ecology, evolutionary biology and more. Built to unveil the insights hidden in complex, multi-omic data.

Example

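import pandas as pd

from skbio.tree import TreeNode
from skbio.diversity import beta_diversity
from skbio.stats.ordination import pcoa

# Load the feature table, sample metadata, and phylogenetic tree
table = pd.read_table('data.tsv', index_col=0)
metadata = pd.read_table('metadata.tsv', index_col=0)
tree = TreeNode.read('tree.nwk')

# Compute pairwise weighted UniFrac distances between samples
bdiv = beta_diversity('weighted_unifrac', table, tree=tree)

# Ordinate with PCoA and plot the samples colored by body site
ordi = pcoa(bdiv, dimensions=3)
ordi.plot(metadata, column='bodysite')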
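
To make the distance-matrix step concrete, here is a minimal, dependency-free sketch of how a pairwise dissimilarity matrix can be built from a table of counts. It swaps in Bray-Curtis dissimilarity for weighted UniFrac because Bray-Curtis needs no tree. The helper names (`bray_curtis`, `pairwise_matrix`) and the toy table are illustrative assumptions, not part of scikit-bio's API.

```python
from itertools import combinations


def bray_curtis(u, v):
    """Bray-Curtis dissimilarity: sum|u_i - v_i| / sum(u_i + v_i)."""
    num = sum(abs(a - b) for a, b in zip(u, v))
    den = sum(a + b for a, b in zip(u, v))
    # Two all-zero samples are treated as identical (an assumption of this sketch)
    return num / den if den else 0.0


def pairwise_matrix(table, metric):
    """Build a symmetric, zero-diagonal matrix as nested dicts keyed by sample ID."""
    ids = list(table)
    dm = {i: {j: 0.0 for j in ids} for i in ids}
    for i, j in combinations(ids, 2):
        dm[i][j] = dm[j][i] = metric(table[i], table[j])
    return dm


# Hypothetical toy table: sample ID -> counts per feature
toy_table = {
    "gut": [10, 0, 5],
    "skin": [0, 10, 5],
    "oral": [5, 5, 5],
}
toy_dm = pairwise_matrix(toy_table, bray_curtis)
```

In the example above, `beta_diversity` plays this role and returns a `DistanceMatrix` object rather than nested dictionaries.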

For Educators

Fundamental bioinformatics algorithms enriched by comprehensive documentation, examples and references, offering a rich resource for classroom and laboratory education (with proven success). Designed to spark curiosity and foster innovation.

Example

from skbio.alignment import pair_align_prot
from skbio.alignment import TabularMSA
from skbio.sequence.distance import hamming
from skbio.stats.distance import DistanceMatrix
from skbio.tree import nj

def align_dist(seq1, seq2):
    score, (path,), _ = pair_align_prot(seq1, seq2)
    msa = TabularMSA.from_path_seqs(path, (seq1, seq2))
    return hamming(*msa)

# `seqs` (protein sequences) and `ids` (their labels) are assumed to be defined
dm = DistanceMatrix.from_iterable(
    seqs, align_dist, keys=ids, validate=False
)

tree = nj(dm).root_at_midpoint()
print(tree.ascii_art())

          /-chicken
         |
---------|                    /-rat
         |          /--------|
         |         |          \-mouse
          \--------|
                   |          /-pig
                   |         |
                    \--------|                    /-chimp
                             |          /--------|
                              \--------|          \-human
                                       |
                                        \-monkey
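
The distance used in `align_dist` above is a Hamming distance between the two rows of a pairwise alignment. As a rough, dependency-free sketch of that idea, the function below computes the proportion of mismatched columns between two already-aligned strings. The name `hamming_proportion` is hypothetical, and counting a gap as an ordinary character is an assumption of this sketch; scikit-bio's `hamming` may treat gaps differently.

```python
def hamming_proportion(aligned1, aligned2):
    """Proportion of alignment columns at which two aligned sequences differ."""
    if len(aligned1) != len(aligned2):
        raise ValueError("Aligned sequences must have equal length.")
    if not aligned1:
        raise ValueError("Aligned sequences must not be empty.")
    # Count columns whose characters differ (gaps compared like any other symbol)
    mismatches = sum(a != b for a, b in zip(aligned1, aligned2))
    return mismatches / len(aligned1)


# Hypothetical aligned fragments: five columns, one of which differs
example_distance = hamming_proportion("ACG-T", "ACGTT")
```

Applying such a function to every pair of sequences yields the matrix that neighbor joining consumes.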

For Developers

Industry-standard, production-ready Python codebase featuring a stable, unit-tested API that streamlines development and integration. Licensed under the 3-Clause BSD License, it provides an expansive platform for both academic research and commercial ventures.

Example

def centralize(mat: "ArrayLike") -> "StdArray":
    r"""Center data around its geometric average.

    Parameters
    ----------
    mat : array_like of shape (n_compositions, n_components)
        A matrix of proportions.

    Returns
    -------
    ndarray of shape (n_compositions, n_components)
        Centered composition matrix.

    Examples
    --------
    >>> import numpy as np
    >>> from skbio.stats.composition import centralize
    >>> X = np.array([[.1, .3, .4, .2], [.2, .2, .2, .4]])
    >>> centralize(X)
    array([[ 0.17445763,  0.30216948,  0.34891526,  0.17445763],
           [ 0.32495488,  0.18761279,  0.16247744,  0.32495488]])

    """
    from scipy.stats import gmean

    mat = closure(mat)
    cen = gmean(mat, axis=0)
    return perturb_inv(mat, cen)
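
The function body above relies on `closure` and `perturb_inv`, which are not shown in the excerpt. The following standard-library sketch spells out the same three steps (close each row, take the column-wise geometric mean, divide by it and close again) under the assumption that every component is strictly positive. The names `closure_rows`, `column_geometric_means`, and `centralize_sketch` are illustrative and are not the library's implementation.

```python
import math


def closure_rows(rows):
    """Rescale each row so that its components sum to 1."""
    closed = []
    for row in rows:
        total = math.fsum(row)
        closed.append([x / total for x in row])
    return closed


def column_geometric_means(rows):
    """Geometric mean of each column, computed via the mean of logarithms."""
    n = len(rows)
    return [math.exp(math.fsum(math.log(x) for x in col) / n) for col in zip(*rows)]


def centralize_sketch(rows):
    """Center compositions around their geometric average."""
    mat = closure_rows(rows)
    cen = column_geometric_means(mat)
    # Inverse perturbation: divide by the center, then close again
    return closure_rows([[x / c for x, c in zip(row, cen)] for row in mat])


# Same input as the docstring example above
X = [[0.1, 0.3, 0.4, 0.2], [0.2, 0.2, 0.2, 0.4]]
centered = centralize_sketch(X)
```

On this input, `centered` agrees with the array shown in the docstring to the printed precision.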

Install

Conda

conda install -c conda-forge scikit-bio

PyPI

pip install scikit-bio

Dev

pip install git+https://github.com/scikit-bio/scikit-bio.git

More

See detailed instructions on installing scikit-bio on various platforms.

News

Latest release (2025-07-16): scikit-bio 0.7.0

The scikit-bio workshop at ISMB 2024 has concluded, and the materials are publicly available.

New DOE award for scikit-bio development in multi-omics and complex modeling.

New website: scikit.bio and organization: scikit-bio are online.

Feature Highlights

Biological sequences: Efficient data structure with a flexible grammar for easy manipulation, annotation, alignment, and conversion into motifs or k-mers for in-depth analysis.

Phylogenetic trees: Scalable tree structure tailored for evolutionary biology, supporting diverse operations in navigation, manipulation, comparison, and construction.

Community diversity analysis for ecological studies, with an extensive suite of metrics such as UniFrac and PD, optimized to handle large-scale community datasets.

Ordination methods, such as PCoA, CA, and RDA, to uncover patterns underlying high-dimensional data, facilitating insightful visualization.

Multivariate statistical tests, such as PERMANOVA, BIOENV, and Mantel, to decode complex relationships across data matrices and sample properties.

Compositional data processing and analysis, such as CLR transform and ANCOM, built for various omic data types from high-throughput experiments.
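
As a small illustration of the CLR transform mentioned in the last item, the sketch below applies the textbook definition: the logarithm of each component minus the mean of the logarithms. It assumes strictly positive components and uses only the standard library. The function name `clr_sketch` is hypothetical and this is not scikit-bio's implementation.

```python
import math


def clr_sketch(composition):
    """Centered log-ratio transform of one strictly positive composition."""
    logs = [math.log(x) for x in composition]
    mean_log = math.fsum(logs) / len(logs)
    # Subtracting the mean log is the same as dividing by the geometric mean
    return [value - mean_log for value in logs]


# Hypothetical composition whose logs are 0, 1, and 2
transformed = clr_sketch([1.0, math.e, math.e ** 2])

# CLR is scale-invariant: rescaling the input leaves the output unchanged
rescaled = clr_sketch([10.0, 10.0 * math.e, 10.0 * math.e ** 2])
```

The transformed values sum to zero, which is why CLR-transformed data can be analyzed with standard multivariate methods.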