Taxdump format (skbio.io.format.taxdump)

The NCBI Taxonomy database dump (taxdump) format stores information on organism names, classifications, and other properties. It is a tabular format that uses <tab><pipe><tab> as the column delimiter and <tab><pipe> as the line end after the last column. The file name usually ends with .dmp.

Format Support

Has Sniffer: No

Reader  Writer  Object Class
------  ------  ----------------
Yes     No      pandas.DataFrame

Format Specification

The NCBI taxonomy database [1] [2] hosts organism names and classifications. It has a web portal [3] and an FTP download server [4], and it is also accessible through E-utilities [5]. The database is updated daily, and an archive is generated every month. The data release, named taxdump, consists of multiple .dmp files. These files serve different purposes, but they follow a common format pattern:

  • It is a tabular format.

  • Column delimiter is <tab><pipe><tab>.

  • Line end is <tab><pipe>.

  • The first column is a numeric identifier, which usually represents a taxon (i.e., “TaxID”), but can also represent a genetic code, citation, or other entry.
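The pattern above can be sketched in plain Python. This is a minimal illustration of the delimiter rules only, not scikit-bio's reader; the helper name split_dmp_line is hypothetical.

```python
def split_dmp_line(line):
    """Split one .dmp line into a list of column values (as strings).

    Hypothetical helper for illustration; not part of scikit-bio.
    """
    line = line.rstrip("\n")
    # Drop the line end marker <tab><pipe>.
    if line.endswith("\t|"):
        line = line[:-2]
    # Columns are separated by <tab><pipe><tab>.
    return line.split("\t|\t")


row = split_dmp_line("2\t|\t131567\t|\tsuperkingdom\t|\n")
print(row)  # ['2', '131567', 'superkingdom']
```

Removing the line end before splitting avoids a stray trailing pipe in the last column. An empty column is returned as an empty string.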

The two most important files of the data release are nodes.dmp and names.dmp. They store the hierarchical structure of the classification system (i.e., taxonomy) and the names of organisms, respectively. They can be used to construct the taxonomy tree of organisms.
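As a rough sketch of how the parent links in nodes.dmp define a tree, the function below walks from a node up to the root. It assumes the root is the node that lists itself as its own parent. The function name lineage and the toy TaxIDs are invented for illustration.

```python
def lineage(tax_id, parents):
    """Return the path from tax_id up to the root as a list of TaxIDs.

    `parents` maps each tax_id to its parent tax_id. The root is assumed
    to be the node that lists itself as its own parent.
    """
    path = [tax_id]
    while parents[tax_id] != tax_id:
        tax_id = parents[tax_id]
        path.append(tax_id)
    return path


# Toy (tax_id -> parent tax_id) pairs, invented for illustration only.
parents = {1: 1, 10: 1, 20: 10, 30: 20}
print(lineage(30, parents))  # [30, 20, 10, 1]
```

Names from names.dmp can then be attached to each TaxID on the path. The sketch does not guard against cycles or missing parents, which a production reader would need to handle.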

The column definitions of each .dmp file type are taken from [6] and [7].

nodes.dmp

Name                               Description
---------------------------------  ----------------------------------------------------
tax_id                             node id in GenBank taxonomy database
parent tax_id                      parent node id in GenBank taxonomy database
rank                               rank of this node (superkingdom, kingdom, …)
embl code                          locus-name prefix; not unique
division id                        see division.dmp file
inherited div flag (1 or 0)        1 if node inherits division from parent
genetic code id                    see gencode.dmp file
inherited GC flag (1 or 0)         1 if node inherits genetic code from parent
mitochondrial genetic code id      see gencode.dmp file
inherited MGC flag (1 or 0)        1 if node inherits mitochondrial gencode from parent
GenBank hidden flag (1 or 0)       1 if name is suppressed in GenBank entry lineage
hidden subtree root flag (1 or 0)  1 if this subtree has no sequence data yet
comments                           free-text comments and citations

Since 2018, NCBI has released “new taxonomy files” [8] (new_taxdump). The new nodes.dmp format is compatible with the classical format: it keeps all of the columns listed above and appends five extra columns after them.

Name                           Description
-----------------------------  ---------------------------------------------------
plastid genetic code id        see gencode.dmp file
inherited PGC flag (1 or 0)    1 if node inherits plastid gencode from parent
specified species              1 if species in the node’s lineage has formal name
hydrogenosome genetic code id  see gencode.dmp file
inherited HGC flag (1 or 0)    1 if node inherits hydrogenosome gencode from parent
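Because the new format only appends columns, the number of columns in a line hints at which layout a file uses: 13 columns for the classical nodes.dmp and 18 (13 plus 5) for the new one. The sketch below is a heuristic under that assumption. The function guess_nodes_scheme is hypothetical and not part of scikit-bio; the strings it returns are the scheme names listed under Format Parameters.

```python
CLASSIC_NODES_COLUMNS = 13  # columns listed for the classical nodes.dmp
NEW_NODES_COLUMNS = 18      # classical columns plus the five extra ones


def guess_nodes_scheme(first_line):
    """Guess a scheme name from the number of columns in one line.

    Hypothetical heuristic for illustration; not part of scikit-bio.
    """
    line = first_line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]
    n_columns = len(line.split("\t|\t"))
    if n_columns >= NEW_NODES_COLUMNS:
        return "nodes_new"
    if n_columns >= CLASSIC_NODES_COLUMNS:
        return "nodes"
    if n_columns >= 3:
        return "nodes_slim"
    raise ValueError(f"Too few columns for a nodes.dmp line: {n_columns}")


print(guess_nodes_scheme("1\t|\t1\t|\tno rank\t|"))  # nodes_slim
```

Counting columns cannot tell a nodes.dmp file apart from some other .dmp file with the same number of columns, so the result is only a hint.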

names.dmp

Name         Description
-----------  --------------------------------------------------
tax_id       the id of node associated with this name
name_txt     name itself
unique name  the unique variant of this name if name not unique
name class   (synonym, common name, …)
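Since one taxon can have several rows in names.dmp (one per name class), a common task is to keep a single name per TaxID. The sketch below keeps rows whose name class is "scientific name", which is assumed here to be the class NCBI uses for the primary name; it is not among the examples in the table above. The function name and the toy records are invented for illustration.

```python
def scientific_names(lines):
    """Map tax_id -> scientific name from an iterable of names.dmp lines.

    Hypothetical helper for illustration; not part of scikit-bio.
    """
    names = {}
    for line in lines:
        line = line.rstrip("\n")
        if line.endswith("\t|"):
            line = line[:-2]
        tax_id, name_txt, unique_name, name_class = line.split("\t|\t")[:4]
        # Assumption: "scientific name" marks the primary name of a taxon.
        if name_class == "scientific name":
            names[int(tax_id)] = name_txt
    return names


# Toy records invented for illustration; not copied from a real release.
lines = [
    "10\t|\tExamplea\t|\t\t|\tscientific name\t|",
    "10\t|\texample bug\t|\t\t|\tcommon name\t|",
    "20\t|\tExamplea prima\t|\t\t|\tscientific name\t|",
]
print(scientific_names(lines))  # {10: 'Examplea', 20: 'Examplea prima'}
```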

division.dmp

Name           Description
-------------  ----------------------------------------
division id    taxonomy database division id
division cde   GenBank division code (three characters)
division name  e.g. BCT, PLN, VRT, MAM, PRI…
comments

gencode.dmp

Name             Description
---------------  ---------------------------------------
genetic code id  GenBank genetic code id
abbreviation     genetic code name abbreviation
name             genetic code name
cde              translation table for this genetic code
starts           start codons for this genetic code

Other types of .dmp files are currently not supported by scikit-bio. However, users may supply custom column definitions when using this utility. See below for details.

Format Parameters

The following format parameter is available in taxdump format:

  • scheme: The name of the column definition scheme of the input .dmp file. Available options are listed below. Alternatively, one can provide a custom scheme, defined as a dictionary that maps column names to data types.

    1. nodes: The classical nodes.dmp scheme. It is also compatible with the new nodes.dmp format, in which case only the columns defined by the classical format are read.

    2. nodes_new: The new nodes.dmp scheme.

    3. nodes_slim: Only the first three columns: tax_id, parent_tax_id and rank, which are the minimum information required for constructing the taxonomy tree. It can be applied to both classical and new nodes.dmp files. It can also handle custom files that contain only these three columns.

    4. names: The names.dmp scheme.

    5. division: The division.dmp scheme.

    6. gencode: The gencode.dmp scheme.

Note

scikit-bio reads columns from left to right, up to the number of columns defined in the scheme. Any extra columns are dropped.
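The behavior described in this note can be imitated in plain Python to make it concrete. This is a sketch of the idea only, not scikit-bio's implementation: the function read_dmp is hypothetical, and the scheme dictionary is an example of a name-to-data type mapping.

```python
def read_dmp(lines, scheme):
    """Parse .dmp lines into dicts, keeping only the scheme's columns.

    `scheme` maps column names to data types, in column order.
    Illustrative only; this is not scikit-bio's implementation.
    """
    names = list(scheme)
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if line.endswith("\t|"):
            line = line[:-2]
        # Keep the leftmost len(names) columns; extra columns are dropped.
        values = line.split("\t|\t")[:len(names)]
        records.append({n: scheme[n](v) for n, v in zip(names, values)})
    return records


# Hypothetical custom scheme covering the first three columns.
scheme = {"tax_id": int, "parent_tax_id": int, "rank": str}
lines = ["2\t|\t131567\t|\tsuperkingdom\t|\t\t|\t0\t|"]
print(read_dmp(lines, scheme))
# [{'tax_id': 2, 'parent_tax_id': 131567, 'rank': 'superkingdom'}]
```

The input line has five columns, but only the three named in the scheme are kept and converted to the requested types.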

Examples

>>> from io import StringIO
>>> import skbio.io
>>> import pandas as pd
>>> fs = '\n'.join([
...     '1\t|\t1\t|\tno rank\t|',
...     '2\t|\t131567\t|\tsuperkingdom\t|',
...     '6\t|\t335928\t|\tgenus\t|'
... ])
>>> fh = StringIO(fs)

Read the file into a pd.DataFrame, specifying the “nodes_slim” scheme:

>>> df = skbio.io.read(fh, format="taxdump", into=pd.DataFrame,
...                    scheme="nodes_slim")
>>> df 
        parent_tax_id          rank
tax_id
1                   1       no rank
2              131567  superkingdom
6              335928         genus

References

[1] Federhen, S. (2012). The NCBI taxonomy database. Nucleic acids research, 40(D1), D136-D143.

[2] Schoch, C. L., Ciufo, S., Domrachev, M., Hotton, C. L., Kannan, S., Khovanskaya, R., … & Karsch-Mizrachi, I. (2020). NCBI Taxonomy: a comprehensive update on curation, resources and tools. Database, 2020.

[5] Kans, J. (2022). Entrez direct: E-utilities on the UNIX command line. In Entrez Programming Utilities Help [Internet]. National Center for Biotechnology Information (US).

[8] https://ncbiinsights.ncbi.nlm.nih.gov/2018/02/22/new-taxonomy-files-available-with-lineage-type-and-host-information/