Taxdump format (skbio.io.format.taxdump)#
The NCBI Taxonomy database dump (taxdump) format stores information about
organism names, classifications and other properties. It is a tabular format
that uses <tab><pipe><tab> as the delimiter between columns and <tab><pipe> as
the line end after the last column. File names usually end with .dmp.
Format Support#
Has Sniffer: No
| Reader | Writer | Object Class |
|---|---|---|
| Yes | No | pandas.DataFrame |
Format Specification#
The NCBI taxonomy database [1] [2] hosts organism names and classifications.
It has a web portal [3] and an FTP download server [4]. It is also accessible
using E-utilities [5]. The database is updated daily, and an archive is
generated every month. The data release has the file name taxdump. It
consists of multiple .dmp files. These files serve different purposes, but they
follow a common format pattern:
- It is a tabular format.
- The column delimiter is <tab><pipe><tab>.
- The line end is <tab><pipe>.
- The first column is a numeric identifier, which usually represents a taxon (i.e., “TaxID”), but can also identify genetic codes, citations or other entries.
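The pattern above can be handled with plain string operations. The following is a minimal sketch in standard-library Python, not scikit-bio's own parser; the helper name `split_dmp_line` is hypothetical.

```python
def split_dmp_line(line):
    """Split one .dmp line into its column values (all returned as strings)."""
    line = line.rstrip("\r\n")
    # Drop the <tab><pipe> line end.
    if line.endswith("\t|"):
        line = line[:-2]
    # Columns are separated by <tab><pipe><tab>.
    return line.split("\t|\t")


fields = split_dmp_line("2\t|\t131567\t|\tsuperkingdom\t|\n")
print(fields)  # ['2', '131567', 'superkingdom']
```

Empty columns (for example an empty comments field) come back as empty strings, so the column count is preserved.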
The two most important files of the data release are nodes.dmp and names.dmp.
They store the hierarchical structure of the classification system (i.e.,
taxonomy) and the names of organisms, respectively. They can be used to
construct the taxonomy tree of organisms.
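As an illustration of how the two files combine, the sketch below walks parent pointers (as stored in nodes.dmp) up to the root and labels each node with a name (as stored in names.dmp). The dictionaries are small toy data, and the helper `lineage` is a hypothetical name rather than part of scikit-bio. It assumes that the root node lists itself as its parent.

```python
# Toy data: tax_id -> parent tax_id (as in nodes.dmp) and
# tax_id -> name (as in names.dmp).
parents = {1: 1, 131567: 1, 2: 131567}
names = {1: "root", 131567: "cellular organisms", 2: "Bacteria"}


def lineage(tax_id, parents):
    """Return the TaxIDs from the root down to ``tax_id``."""
    path = [tax_id]
    # The root node is its own parent, which ends the walk.
    while parents[path[-1]] != path[-1]:
        path.append(parents[path[-1]])
    return path[::-1]


print([names[t] for t in lineage(2, parents)])
# ['root', 'cellular organisms', 'Bacteria']
```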
The column definitions of each .dmp file type are taken from [6] and [7].
nodes.dmp#
| Name | Description |
|---|---|
| tax_id | node id in GenBank taxonomy database |
| parent tax_id | parent node id in GenBank taxonomy database |
| rank | rank of this node (superkingdom, kingdom, …) |
| embl code | locus-name prefix; not unique |
| division id | see division.dmp file |
| inherited div flag (1 or 0) | 1 if node inherits division from parent |
| genetic code id | see gencode.dmp file |
| inherited GC flag (1 or 0) | 1 if node inherits genetic code from parent |
| mitochondrial genetic code id | see gencode.dmp file |
| inherited MGC flag (1 or 0) | 1 if node inherits mitochondrial gencode from parent |
| GenBank hidden flag (1 or 0) | 1 if name is suppressed in GenBank entry lineage |
| hidden subtree root flag (1 or 0) | 1 if this subtree has no sequence data yet |
| comments | free-text comments and citations |
Since 2018, NCBI has released “new taxonomy files” [8] (new_taxdump). The new
nodes.dmp format is compatible with the classical format and adds five extra
columns after those of the classical format.
| Name | Description |
|---|---|
| plastid genetic code id | see gencode.dmp file |
| inherited PGC flag (1 or 0) | 1 if node inherits plastid gencode from parent |
| specified species | 1 if species in the node’s lineage has formal name |
| hydrogenosome genetic code id | see gencode.dmp file |
| inherited HGC flag (1 or 0) | 1 if node inherits hydrogenosome gencode from parent |
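Because the new format only appends columns, a nodes.dmp flavor can be told apart by counting columns: 13 for the classical format and 18 for the new one. The sketch below is an illustration in standard-library Python; `count_columns` and `guess_nodes_scheme` are hypothetical helpers, and the returned strings are the scheme names accepted by this format's `scheme` parameter. The row values are placeholders, not data copied from a release.

```python
def count_columns(line):
    """Count the columns in one .dmp line."""
    line = line.rstrip("\r\n")
    if line.endswith("\t|"):
        line = line[:-2]  # drop the <tab><pipe> line end
    return len(line.split("\t|\t"))


def guess_nodes_scheme(line):
    """Map a column count to the fullest matching scheme name, or None."""
    by_count = {3: "nodes_slim", 13: "nodes", 18: "nodes_new"}
    return by_count.get(count_columns(line))


# Toy root-node row with the 13 classical columns (values are illustrative).
classic = ["1", "1", "no rank", "", "8", "0", "1", "0", "0", "0", "0", "0", ""]
classic_line = "\t|\t".join(classic) + "\t|"
# The same row followed by five placeholder values for the new columns.
new_line = "\t|\t".join(classic + ["0"] * 5) + "\t|"

print(guess_nodes_scheme(classic_line))  # nodes
print(guess_nodes_scheme(new_line))      # nodes_new
```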
names.dmp#
| Name | Description |
|---|---|
| tax_id | the id of node associated with this name |
| name_txt | name itself |
| unique name | the unique variant of this name if name not unique |
| name class | (synonym, common name, …) |
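A taxon can have several rows in names.dmp, one per name. A common task is to keep only the row whose name class is "scientific name". The sketch below does this with plain Python on a few illustrative rows (toy data, not copied from a release); the variable names are hypothetical.

```python
# Illustrative names.dmp rows: tax_id, name_txt, unique name, name class.
rows = [
    "1\t|\troot\t|\t\t|\tscientific name\t|",
    "2\t|\tBacteria\t|\tBacteria <bacteria>\t|\tscientific name\t|",
    "2\t|\teubacteria\t|\t\t|\tsynonym\t|",
]

scientific = {}
for row in rows:
    # Strip the <tab><pipe> line end, then split on <tab><pipe><tab>.
    tax_id, name_txt, unique_name, name_class = row[:-2].split("\t|\t")
    if name_class == "scientific name":
        scientific[int(tax_id)] = name_txt

print(scientific)  # {1: 'root', 2: 'Bacteria'}
```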
division.dmp#
| Name | Description |
|---|---|
| division id | taxonomy database division id |
| division cde | GenBank division code (three characters), e.g. BCT, PLN, VRT, MAM, PRI… |
| division name | full name of the division, e.g. Bacteria |
| comments | |
gencode.dmp#
| Name | Description |
|---|---|
| genetic code id | GenBank genetic code id |
| abbreviation | genetic code name abbreviation |
| name | genetic code name |
| cde | translation table for this genetic code |
| starts | start codons for this genetic code |
Other types of .dmp files are currently not supported by scikit-bio. However, users may supply custom column definitions when reading them with this format. See below for details.
Format Parameters#
The following format parameters are available in taxdump format:
- scheme: The column definition scheme name of the input .dmp file. Available options are listed below. Alternatively, one can provide a custom scheme, defined as a dictionary that maps column names to data types.
  - nodes: The classical nodes.dmp scheme. It is also compatible with the new nodes.dmp format, in which case only the columns defined by the classical format will be read.
  - nodes_new: The new nodes.dmp scheme.
  - nodes_slim: Only the first three columns: tax_id, parent_tax_id and rank, which are the minimum required information for constructing the taxonomy tree. It can be applied to both classical and new nodes.dmp files. It can also handle custom files which only contain these three columns.
  - names: The names.dmp scheme.
  - division: The division.dmp scheme.
  - gencode: The gencode.dmp scheme.
Note
scikit-bio reads columns from left to right, up to the number of columns defined in the scheme. Any extra columns are dropped.
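To make the idea of a scheme concrete, the sketch below applies a name-to-data-type dictionary to one line and ignores any columns beyond those the scheme defines. It is a standard-library illustration of the behavior described here, not scikit-bio's implementation; `apply_scheme` and `custom_scheme` are hypothetical names, and the data types accepted by the actual reader may differ from the Python built-ins used here.

```python
def apply_scheme(line, scheme):
    """Parse one .dmp line into a dict using a name -> type mapping."""
    line = line.rstrip("\r\n")
    if line.endswith("\t|"):
        line = line[:-2]  # drop the <tab><pipe> line end
    values = line.split("\t|\t")
    # zip() stops at the shorter input, so columns beyond the scheme
    # are dropped.
    return {
        name: dtype(value)
        for (name, dtype), value in zip(scheme.items(), values)
    }


# Hypothetical custom scheme covering only the first three columns.
custom_scheme = {"tax_id": int, "parent_tax_id": int, "rank": str}

record = apply_scheme("6\t|\t335928\t|\tgenus\t|\textra\t|", custom_scheme)
print(record)  # {'tax_id': 6, 'parent_tax_id': 335928, 'rank': 'genus'}
```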
Examples#
>>> from io import StringIO
>>> import skbio.io
>>> import pandas as pd
>>> fs = '\n'.join([
... '1\t|\t1\t|\tno rank\t|',
... '2\t|\t131567\t|\tsuperkingdom\t|',
... '6\t|\t335928\t|\tgenus\t|'
... ])
>>> fh = StringIO(fs)
Read the file into a pd.DataFrame and specify that the “nodes_slim” scheme
should be used:
>>> df = skbio.io.read(fh, format="taxdump", into=pd.DataFrame,
... scheme="nodes_slim")
>>> df
        parent_tax_id          rank
tax_id
1                   1       no rank
2              131567  superkingdom
6              335928         genus
References#
[1] Federhen, S. (2012). The NCBI taxonomy database. Nucleic Acids Research, 40(D1), D136-D143.
[2] Schoch, C. L., Ciufo, S., Domrachev, M., Hotton, C. L., Kannan, S., Khovanskaya, R., … & Karsch-Mizrachi, I. (2020). NCBI Taxonomy: a comprehensive update on curation, resources and tools. Database, 2020.
[5] Kans, J. (2022). Entrez Direct: E-utilities on the UNIX command line. In Entrez Programming Utilities Help [Internet]. National Center for Biotechnology Information (US).
[8] https://ncbiinsights.ncbi.nlm.nih.gov/2018/02/22/new-taxonomy-files-available-with-lineage-type-and-host-information/