Taxdump format (`skbio.io.format.taxdump`)#

The NCBI Taxonomy database dump (taxdump) format stores information on organism names, classifications and other properties. It is a tabular format that uses <tab><pipe><tab> as the delimiter between columns and <tab><pipe> as the line end after the last column. File names usually end with .dmp.

Format Support#

Has Sniffer: No

Reader	Writer	Object Class
Yes	No	`pandas.DataFrame`

Format Specification#

The NCBI taxonomy database [1] [2] hosts organism names and classifications. It has a web portal [3] and an FTP download server [4]. It is also accessible using E-utilities [5]. The database is updated daily, and an archive is generated every month. The data release is named taxdump and consists of multiple .dmp files. These files serve different purposes, but they follow a common format pattern:

It is a tabular format.
Column delimiter is <tab><pipe><tab>.
Line end is <tab><pipe>.
The first column is a numeric identifier, which usually represents a taxon (i.e., “TaxID”), but can also identify genetic codes, citations or other entries.
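The pattern above can be illustrated with a minimal sketch that splits one line by hand. The helper name `split_dmp_line` is hypothetical and is not part of scikit-bio. It only assumes the delimiter and line end described in this section.

```python
def split_dmp_line(line):
    """Split one taxdump line into a list of column strings.

    Hypothetical helper for illustration; not a scikit-bio function.
    """
    line = line.rstrip("\r\n")
    # Drop the line end <tab><pipe>, if present.
    if line.endswith("\t|"):
        line = line[:-2]
    # Columns are separated by <tab><pipe><tab>.
    return line.split("\t|\t")


row = split_dmp_line("2\t|\t131567\t|\tsuperkingdom\t|\n")
print(row)  # ['2', '131567', 'superkingdom']
```

All values are returned as strings, so the identifier in the first column still needs to be converted with `int` if a numeric type is wanted.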

The two most important files of the data release are nodes.dmp and names.dmp. They store the hierarchical structure of the classification system (i.e., taxonomy) and the names of organisms, respectively. They can be used to construct the taxonomy tree of organisms.
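As a rough sketch of how the two files combine, the snippet below walks from a taxon up to the root using parent links (as stored in nodes.dmp) and looks up a name for each step (as stored in names.dmp). The dictionaries hold a tiny illustrative subset written by hand, and the function name `lineage` is hypothetical. It assumes the root node lists itself as its own parent.

```python
# Illustrative subset in the spirit of nodes.dmp: tax_id -> (parent tax_id, rank)
nodes = {
    1: (1, "no rank"),
    131567: (1, "no rank"),
    2: (131567, "superkingdom"),
}

# Illustrative subset in the spirit of names.dmp: tax_id -> name
names = {1: "root", 131567: "cellular organisms", 2: "Bacteria"}


def lineage(tax_id, nodes, names):
    """Return the names from the root down to `tax_id` (hypothetical helper)."""
    path = []
    while True:
        path.append(names[tax_id])
        parent = nodes[tax_id][0]
        if parent == tax_id:  # assumption: the root is its own parent
            break
        tax_id = parent
    return path[::-1]


print(lineage(2, nodes, names))  # ['root', 'cellular organisms', 'Bacteria']
```

A real names.dmp file lists several names per taxon, distinguished by the name class column, so a full implementation would first pick one name class per tax_id.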

The column definitions of each .dmp file type are taken from [6] and [7].

`nodes.dmp`#

Name	Description
tax_id	node id in GenBank taxonomy database
parent tax_id	parent node id in GenBank taxonomy database
rank	rank of this node (superkingdom, kingdom, …)
embl code	locus-name prefix; not unique
division id	see division.dmp file
inherited div flag (1 or 0)	1 if node inherits division from parent
genetic code id	see gencode.dmp file
inherited GC flag (1 or 0)	1 if node inherits genetic code from parent
mitochondrial genetic code id	see gencode.dmp file
inherited MGC flag (1 or 0)	1 if node inherits mitochondrial gencode from parent
GenBank hidden flag (1 or 0)	1 if name is suppressed in GenBank entry lineage
hidden subtree root flag (1 or 0)	1 if this subtree has no sequence data yet
comments	free-text comments and citations

Since 2018, NCBI has released “new taxonomy files” [8] (new_taxdump). The new nodes.dmp format is compatible with the classical format and appends five extra columns after those listed above.

Name	Description
plastid genetic code id	see gencode.dmp file
inherited PGC flag (1 or 0)	1 if node inherits plastid gencode from parent
specified species	1 if species in the node’s lineage has formal name
hydrogenosome genetic code id	see gencode.dmp file
inherited HGC flag (1 or 0)	1 if node inherits hydrogenosome gencode from parent

`names.dmp`#

Name	Description
tax_id	the id of node associated with this name
name_txt	name itself
unique name	the unique variant of this name if name not unique
name class	(synonym, common name, …)

`division.dmp`#

Name	Description
division id	taxonomy database division id
division cde	GenBank division code (three characters)
division name	e.g. BCT, PLN, VRT, MAM, PRI…
comments

`gencode.dmp`#

Name	Description
genetic code id	GenBank genetic code id
abbreviation	genetic code name abbreviation
name	genetic code name
cde	translation table for this genetic code
starts	start codons for this genetic code

Other types of .dmp files are currently not supported by scikit-bio. However, users may define custom column schemes when using this reader. See below for details.

Format Parameters#

The following format parameters are available for the taxdump format:

scheme: The column definition scheme name of the input .dmp file. Available options are listed below. Alternatively, a custom scheme can be provided as a dictionary that maps column names to data types.
1. nodes: The classical nodes.dmp scheme. It is also compatible with the new nodes.dmp format, in which case only the columns defined by the classical format will be read.
2. nodes_new: The new nodes.dmp scheme.
3. nodes_slim: Only the first three columns: tax_id, parent_tax_id and rank, which are the minimum information required for constructing the taxonomy tree. It can be applied to both classical and new nodes.dmp files. It can also handle custom files that contain only these three columns.
4. names: The names.dmp scheme.
5. division: The division.dmp scheme.
6. gencode: The gencode.dmp scheme.

Note

scikit-bio reads columns from left to right until it reaches the number of columns defined in the scheme. Any extra columns are dropped.
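The cropping behavior can be mimicked with a short standard-library sketch. This is not scikit-bio's implementation. The function `parse_with_scheme` is hypothetical, and the dictionary is only an example of a name-to-data type scheme.

```python
# Example name-to-data type scheme covering the first three columns only.
scheme = {"tax_id": int, "parent_tax_id": int, "rank": str}


def parse_with_scheme(line, scheme):
    """Parse one taxdump line, keeping only the columns named in `scheme`.

    Hypothetical helper for illustration; not a scikit-bio function.
    """
    line = line.rstrip("\r\n")
    if line.endswith("\t|"):
        line = line[:-2]
    fields = line.split("\t|\t")
    # zip stops at the shorter input, so columns beyond the scheme are dropped.
    return {
        name: dtype(value)
        for (name, dtype), value in zip(scheme.items(), fields)
    }


# A line with five columns; only the first three are kept.
record = parse_with_scheme("6\t|\t335928\t|\tgenus\t|\t\t|\t0\t|\n", scheme)
print(record)  # {'tax_id': 6, 'parent_tax_id': 335928, 'rank': 'genus'}
```

Because columns are matched by position, the order of keys in the scheme has to follow the order of columns in the file.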

Examples#

>>> from io import StringIO
>>> import skbio.io
>>> import pandas as pd
>>> fs = '\n'.join([
...     '1\t|\t1\t|\tno rank\t|',
...     '2\t|\t131567\t|\tsuperkingdom\t|',
...     '6\t|\t335928\t|\tgenus\t|'
... ])
>>> fh = StringIO(fs)

Read the file into a pd.DataFrame and specify that the “nodes_slim” scheme should be used:

>>> df = skbio.io.read(fh, format="taxdump", into=pd.DataFrame,
...                    scheme="nodes_slim")
>>> df
        parent_tax_id          rank
tax_id
1                   1       no rank
2              131567  superkingdom
6              335928         genus

References#

[1]

Federhen, S. (2012). The NCBI taxonomy database. Nucleic acids research, 40(D1), D136-D143.

[2]

Schoch, C. L., Ciufo, S., Domrachev, M., Hotton, C. L., Kannan, S., Khovanskaya, R., … & Karsch-Mizrachi, I. (2020). NCBI Taxonomy: a comprehensive update on curation, resources and tools. Database, 2020.

[3]

https://www.ncbi.nlm.nih.gov/taxonomy

[4]

https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/

[5]

Kans, J. (2022). Entrez direct: E-utilities on the UNIX command line. In Entrez Programming Utilities Help [Internet]. National Center for Biotechnology Information (US).

[6]

https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump_readme.txt

[7]

https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/taxdump_readme.txt

[8]

https://ncbiinsights.ncbi.nlm.nih.gov/2018/02/22/new-taxonomy-files-available-with-lineage-type-and-host-information/
