Newick format (skbio.io.format.newick)#
Newick format (`newick`) stores spanning trees with weighted edges and node
names in a minimal file format [1]. This is useful for representing
phylogenetic trees and taxonomies. Newick was created as an informal
specification on June 26, 1986 [2].
Format Support#
Has Sniffer: Yes
Reader | Writer | Object Class
---|---|---
Yes | Yes | skbio.tree.TreeNode
Format Specification#
A Newick file represents a tree using the following grammar. See below for an explanation of the format in plain English.
Formal Grammar#
NEWICK ==> NODE ;
NODE ==> FORMATTING SUBTREE FORMATTING NODE_INFO FORMATTING
SUBTREE ==> ( CHILDREN ) | null
NODE_INFO ==> LABEL | LENGTH | LABEL FORMATTING LENGTH | null
FORMATTING ==> [ COMMENT_CHARS ] | whitespace | null
CHILDREN ==> NODE | CHILDREN , NODE
LABEL ==> ' ALL_CHARS ' | SAFE_CHARS
LENGTH ==> : FORMATTING NUMBER
COMMENT_CHARS ==> any
ALL_CHARS ==> any
SAFE_CHARS ==> any except: ,;:()[] and whitespace
NUMBER ==> a decimal or integer
Note
The `_` character inside of SAFE_CHARS will be converted to a
blank space in `skbio.tree.TreeNode` and vice versa.
`'` is considered the escape character. To escape `'`, use a preceding `'`.
The implementation of newick in scikit-bio allows nested comments. To
escape `[` or `]` from within COMMENT_CHARS, use a preceding `'`.
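As a rough illustration of how the grammar above can be read, here is a minimal recursive-descent sketch in plain Python. It is not scikit-bio's parser: the name `parse_newick` and the nested-tuple output are assumptions made for this example, and comments, quoted labels, and underscore conversion are deliberately left out.

```python
STRUCTURE = ",;:()"


def parse_newick(text):
    """Parse a Newick string into nested (label, length, children) tuples.

    Hypothetical helper for illustration only. Comments, quoted labels and
    underscore conversion are not handled.
    """
    # Whitespace may appear between tokens, so drop it up front. This shortcut
    # also hides whitespace inside labels, which a real parser would reject.
    chars = "".join(text.split())
    if not chars.endswith(";"):
        raise ValueError("a Newick string must end with ';'")
    pos = 0

    def read_word():
        # Consume characters up to the next structural symbol.
        nonlocal pos
        start = pos
        while chars[pos] not in STRUCTURE:
            pos += 1
        return chars[start:pos]

    def node():
        # NODE ==> SUBTREE NODE_INFO
        nonlocal pos
        children = []
        if chars[pos] == "(":
            # SUBTREE ==> ( CHILDREN )
            pos += 1
            children.append(node())
            while chars[pos] == ",":
                # CHILDREN ==> CHILDREN , NODE
                pos += 1
                children.append(node())
            if chars[pos] != ")":
                raise ValueError("expected ')' at position %d" % pos)
            pos += 1
        label = read_word() or None
        length = None
        if chars[pos] == ":":
            # LENGTH ==> : NUMBER
            pos += 1
            length = float(read_word())
        return (label, length, children)

    root = node()
    if pos != len(chars) - 1:
        # Something other than the final ';' follows the root node.
        raise ValueError("expected ';' at position %d" % pos)
    return root


tree = parse_newick("((D:10, E:0.5)B, (F, G)C)A;")
```

Each tuple mirrors one NODE production: the optional SUBTREE becomes the list of children, and NODE_INFO becomes the label and length. A string such as `, , , ;` is rejected because more than one node appears at the root level.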
Explanation#
The Newick format defines a tree by creating a minimal representation of nodes and their relationships to each other.
Basic Symbols#
There are several symbols which define nodes, the first of which is the
semi-colon (`;`). The semi-colon creates a root node to its left. Recall that
there can only be one root in a tree.
The next symbol is the comma (`,`), which creates a node to its right.
However, these two alone are not enough. For example, imagine the following
string: `, , , ;`. It is evident that there is a root, but the other 3 nodes,
defined by commas, have no relationship. For this reason, it is not a valid
Newick string to have more than one node at the root level.
To provide these relationships, there is another structure:
paired parentheses (`( )`). These are inserted at the location of an existing
node and give it the ability to have children. Placing `( )` in a node’s
location will create a child inside the parentheses on the left-most
inner edge.
Application of Rules#
Adding a comma within the parentheses will create two children: `( , )`
(also known as a bifurcating node). Notice that only one comma is needed
because the parentheses have already created a child. Adding more commas will
create more children which are siblings to each other. For example, writing
`( , , , )` will create a multifurcating node with 4 child nodes which are
siblings to each other.
The notation for a root can be used to create a complete tree. The `;` will
create a root node where parentheses can be placed: `( );`. Adding commas
will create more children: `( , );`. These rules can be applied recursively
ad infinitum: `(( , ), ( , ));`.
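The rules above imply a simple count for label-free strings: the semi-colon contributes the root, each opening parenthesis contributes a first child, and each comma contributes one more sibling. A small sketch in plain Python, where `count_nodes` is a name invented for this example and the input is assumed to contain no labels, lengths, comments, or quotes:

```python
def count_nodes(newick):
    """Count nodes in a Newick string made only of ( ) , ; and whitespace."""
    root = newick.count(";")             # the semi-colon creates the root
    first_children = newick.count("(")   # each ( ) pair creates one child
    siblings = newick.count(",")         # each comma creates a node to its right
    return root + first_children + siblings


bifurcating = count_nodes("( , );")              # 3 nodes
multifurcating = count_nodes("( , , , );")       # 5 nodes
nested = count_nodes("(( , ), ( , ));")          # 7 nodes
```

The counts include the root, so the bifurcating example has one root plus two children.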
Adding Node Information#
Information about a node can be added to improve the clarity and meaning of a tree. Each node may have a label and/or a length (to the parent). Newick always places the node information at the right-most edge of a node’s position.
Starting with labels, `(( , ), ( , ));` would become
`((D, E)B, (F, G)C)A;`. There is a named root `A` and the root’s children
(from left to right) are `B` and `C`. `B` has the children `D` and `E`, and
`C` has the children `F` and `G`.
Length represents the distance (or weight of the edge) that connects a node to
its parent. This must be a decimal or integer. As an example, suppose `D` is
rather estranged from `B`, and `E` is very close. That can be written as:
`((D:10, E:0.5)B, (F, G)C)A;`. Notice that the colon (`:`) separates the
label from the length. If the length is provided but the label is omitted, a
colon must still precede the length (`(:0.25,:0.5):0.0;`). Without this, the
length would be interpreted as a label (which happens to be a number).
Note
Internally, scikit-bio casts a length to `float`, which technically means
that even exponent strings (`1e-3`) are supported.
Advanced Label and Length Rules#
More characters can be used to create more descriptive labels. When creating a
label there are some rules that must be considered due to limitations in the
Newick format. The following characters are not allowed within a standard
label: parentheses, commas, square brackets, colons, semi-colons, and whitespace.
These characters are also disallowed from occurring within a length, which has
a much stricter format: decimal or integer. Many of these characters are
symbols which define the structure of a Newick tree and are thus disallowed for
obvious reasons. The symbols not yet mentioned are square brackets (`[ ]`)
and whitespace (space, tab, and newline).
What if these characters are needed within a label? In the simple case of
spaces, an underscore (`_`) will be translated as a space on read and vice
versa on write.
What if a literal underscore or any of the others mentioned are needed?
A label can be escaped (meaning that its contents are understood as regular
text) using single-quotes (`'`). When a label is surrounded by single-quotes,
any character is permissible. If a single-quote is needed inside of an escaped
label or anywhere else, it can be escaped with another single-quote.
For example, `A_1` is written `'A_1'` and `'A'_1` would be `'''A''_1'`.
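The quoting rules can be sketched as a small helper in plain Python. `escape_label` is a hypothetical name, not part of scikit-bio, and it assumes that any label containing a structural symbol, an underscore, a single-quote, or whitespace other than a plain space should be quoted.

```python
# Characters that force a label to be quoted in this sketch.
NEEDS_QUOTES = set(",;:()[]'_")


def escape_label(label):
    """Return a Newick spelling of ``label`` following the rules above."""
    must_quote = any(
        ch in NEEDS_QUOTES or (ch.isspace() and ch != " ") for ch in label
    )
    if must_quote:
        # Inside quotes every character is allowed; a literal single-quote is
        # written as two single-quotes.
        return "'" + label.replace("'", "''") + "'"
    # Plain spaces can be written as underscores in an unquoted label.
    return label.replace(" ", "_")


examples = [escape_label(s) for s in ["A 1", "A_1", "'A'_1"]]
```

The three inputs cover the cases discussed: a space becomes an underscore, a literal underscore forces quoting, and embedded single-quotes are doubled inside the quoted form.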
Inline Comments#
Square brackets define comments, which are the least commonly used part of
the Newick format. Comments are not included in the generated objects and exist
only as human-readable text ignored by the parser. The implementation in
scikit-bio allows for nested comments (`[comment [nested]]`). Unpaired
square brackets can be escaped with a single-quote preceding the bracket when
inside an existing comment. (This is identical to escaping a single-quote.)
The single-quote has the highest operator precedence, so there is no need to
worry about starting a comment from within a properly escaped label.
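Nested comments can be handled with a depth counter. The following plain-Python sketch is not scikit-bio's tokenizer: `strip_comments` is a name made up for this example, and it does not implement the single-quote escape for unpaired brackets inside a comment. It does respect quoted labels, so brackets inside single-quotes are kept.

```python
def strip_comments(newick):
    """Remove (possibly nested) [comments] that lie outside quoted labels."""
    out = []
    depth = 0          # how many comments are currently open
    in_quote = False   # True while inside a single-quoted label
    for ch in newick:
        if depth == 0 and ch == "'":
            # Quotes only matter outside comments; '' toggles twice.
            in_quote = not in_quote
            out.append(ch)
        elif not in_quote and ch == "[":
            depth += 1
        elif not in_quote and ch == "]" and depth > 0:
            depth -= 1
        elif depth == 0:
            out.append(ch)
    return "".join(out)


cleaned = strip_comments("[example](a[comment [nested]]:0.1,'b[c]':0.2);")
```

The quoted label `'b[c]'` passes through unchanged because the single-quote takes precedence over the bracket.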
Whitespace#
Whitespace is not allowed within any un-escaped label or in any length, but it is permitted anywhere else.
Caveats#
Newick cannot always provide a unique representation of a tree; in other
words, the same tree can be written multiple ways. For example, `(A, B);` is
isomorphic to `(B, A);`. The implementation in scikit-bio maintains the given
sibling order in its object representations.
Newick has no representation of an unrooted tree. Some biological packages
assume that a trifurcated root in an otherwise bifurcated tree means the tree
is unrooted. In scikit-bio, `skbio.tree.TreeNode` will always be rooted at the
`newick` root (`;`).
Format Parameters#
The only supported format parameter is convert_underscores. This is True by default. When False, underscores found in unescaped labels will not be converted to spaces. This is useful when reading the output of an external program in which the underscores were not escaped. This parameter only affects read operations. It does not exist for write operations; they will always properly escape underscores.
Examples#
This is a simple Newick string.
>>> from io import StringIO
>>> from skbio import read
>>> from skbio.tree import TreeNode
>>> f = StringIO("((D, E)B, (F, G)C)A;")
>>> tree = read(f, format="newick", into=TreeNode)
>>> f.close()
>>> print(tree.ascii_art())
                    /-D
          /B-------|
         |          \-E
-A-------|
         |          /-F
          \C-------|
                    \-G
This is a complex Newick string.
>>> f = StringIO("[example](a:0.1, 'b_b''':0.2, (c:0.3, d_d:0.4)e:0.5)f:0.0;")
>>> tree = read(f, format="newick", into=TreeNode)
>>> f.close()
>>> print(tree.ascii_art())
          /-a
         |
-f-------|--b_b'
         |
         |          /-c
          \e-------|
                    \-d d
Notice that the node originally labeled `d_d` became `d d`. Additionally,
`'b_b'''` became `b_b'`. Note that the underscore was preserved in `b_b'`.