Newick format (skbio.io.format.newick)#

Newick format (newick) stores spanning-trees with weighted edges and node names in a minimal file format [1]. This is useful for representing phylogenetic trees and taxonomies. Newick was created as an informal specification on June 26, 1986 [2].

Format Support#

Has Sniffer: Yes

Reader    Writer    Object Class
Yes       Yes       skbio.tree.TreeNode

Format Specification#

A Newick file represents a tree using the following grammar. See below for an explanation of the format in plain English.

Formal Grammar#

       NEWICK ==> NODE ;
         NODE ==> FORMATTING SUBTREE FORMATTING NODE_INFO FORMATTING
      SUBTREE ==> ( CHILDREN ) | null
    NODE_INFO ==> LABEL | LENGTH | LABEL FORMATTING LENGTH | null
   FORMATTING ==> [ COMMENT_CHARS ] | whitespace | null
     CHILDREN ==> NODE | CHILDREN , NODE
        LABEL ==> ' ALL_CHARS ' | SAFE_CHARS
       LENGTH ==> : FORMATTING NUMBER
COMMENT_CHARS ==> any
    ALL_CHARS ==> any
   SAFE_CHARS ==> any except: ,;:()[] and whitespace
       NUMBER ==> a decimal or integer

Note

The _ character inside of SAFE_CHARS will be converted to a blank space in skbio.tree.TreeNode and vice versa.

' is considered the escape character. To escape ' use a preceding '.

The implementation of newick in scikit-bio allows nested comments. To escape [ or ] from within COMMENT_CHARS, use a preceding '.

Explanation#

The Newick format defines a tree by creating a minimal representation of nodes and their relationships to each other.

Basic Symbols#

There are several symbols which define nodes, the first of which is the semi-colon (;). The semi-colon creates a root node to its left. Recall that there can only be one root in a tree.

The next symbol is the comma (,), which creates a node to its right. However, these two alone are not enough. For example, imagine the following string: , , , ;. It is evident that there is a root, but the other 3 nodes, defined by commas, have no relationship to it or to each other. For this reason, it is not a valid Newick string to have more than one node at the root level.

To provide these relationships, there is another structure: paired parentheses (( )). These are inserted at the location of an existing node and give it the ability to have children. Placing ( ) in a node's location will create a child inside the parentheses on the left-most inner edge.

Application of Rules#

Adding a comma within the parentheses will create two children: ( , ) (also known as a bifurcating node). Notice that only one comma is needed because the parentheses have already created a child. Adding more commas will create more children that are siblings of each other. For example, writing ( , , , ) will create a multifurcating node with 4 child nodes that are siblings of each other.

The notation for a root can be used to create a complete tree. The ; will create a root node where parentheses can be placed: ( );. Adding commas will create more children: ( , );. These rules can be applied recursively ad infinitum: (( , ), ( , ));.
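The rules above can be sketched as a small recursive-descent parser. This is an illustrative sketch only, not the scikit-bio implementation: the function name `parse_topology` is hypothetical, and it assumes a bare topology with no labels, lengths, or comments. Each node is returned as the list of its children, so a tip is an empty list.

```python
def parse_topology(newick: str) -> list:
    """Parse a bare Newick topology into nested lists of children.

    Assumes the string holds only parentheses, commas, whitespace, and
    the terminating semicolon (no labels, lengths, or comments).
    """
    text = "".join(newick.split())  # whitespace is not significant here
    if not text.endswith(";"):
        raise ValueError("a Newick string must end with ';'")
    pos = 0

    def node() -> list:
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1                   # "(" creates the first child
            children.append(node())
            while text[pos] == ",":    # each "," creates one more sibling
                pos += 1
                children.append(node())
            if text[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1
        return children

    root = node()
    if text[pos:] != ";":              # anything left over is a second root-level node
        raise ValueError("more than one node at the root level")
    return root


print(parse_topology("( , );"))           # root with two tips
print(parse_topology("(( , ), ( , ));"))  # two bifurcating internal nodes
```

With this sketch, the string `, , , ;` is rejected because the commas would place several nodes at the root level.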

Adding Node Information#

Information about a node can be added to improve the clarity and meaning of a tree. Each node may have a label and/or a length (to the parent). Newick always places the node information at the right-most edge of a node’s position.

Starting with labels, (( , ), ( , )); would become ((D, E)B, (F, G)C)A;. There is a named root A and the root’s children (from left to right) are B and C. B has the children D and E, and C has the children F and G.

Length represents the distance (or weight of the edge) that connects a node to its parent. This must be a decimal or integer. As an example, suppose D is rather estranged from B, and E is very close. That can be written as: ((D:10, E:0.5)B, (F, G)C)A;. Notice that the colon (:) separates the label from the length. If the length is provided but the label is omitted, a colon must still precede the length, as in (:0.25,:0.5):0.0;. Without the colon, the length would be interpreted as a label (which happens to be a number).

Note

Internally, scikit-bio casts a length to float, which technically means that even exponent strings (such as 1e-3) are supported.
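As a rough illustration of how the colon separates a label from a length, the following sketch splits the information of a single node. The helper `split_node_info` is hypothetical and is not part of scikit-bio; it assumes an unquoted label and no comments, and it relies on Python's built-in `float` for the conversion.

```python
from typing import Optional, Tuple


def split_node_info(info: str) -> Tuple[Optional[str], Optional[float]]:
    """Split unquoted node information into (label, length).

    Either part is None when it is absent. Quoted labels and comments
    are deliberately out of scope for this sketch.
    """
    label, colon, length = info.strip().partition(":")
    label = label.strip() or None
    # Only text after a colon is treated as a length.
    return label, (float(length) if colon else None)


print(split_node_info("D:10"))    # label and length
print(split_node_info(":0.25"))   # length only
print(split_node_info("0.25"))    # no colon, so this is a label
print(split_node_info("E:1e-3"))  # exponent notation is accepted by float()
```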

Advanced Label and Length Rules#

More characters can be used to create more descriptive labels. When creating a label, there are some rules that must be considered due to limitations in the Newick format. The following characters are not allowed within a standard label: parentheses, commas, square-brackets, colons, semi-colons, and whitespace. These characters are also disallowed within a length, which has a much stricter format: decimal or integer. Many of these characters are symbols which define the structure of a Newick tree and are thus disallowed for obvious reasons. The symbols not yet mentioned are square-brackets ([ ]) and whitespace (space, tab, and newline).

What if these characters are needed within a label? In the simple case of spaces, an underscore (_) will be translated as a space on read and vice versa on write.

What if a literal underscore or any of the others mentioned are needed? A label can be escaped (meaning that its contents are understood as regular text) using single-quotes ('). When a label is surrounded by single-quotes, any character is permissible. If a single-quote is needed inside of an escaped label or anywhere else, it can be escaped with another single-quote. For example, A_1 is written 'A_1' and 'A'_1 would be '''A''_1'.
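The quoting rules above can be sketched as a pair of helper functions. Both `escape_label` and `unescape_label` are hypothetical names used for illustration and are not the scikit-bio implementation; the `convert_underscores` argument here only mimics the idea of the format parameter of the same name.

```python
import re

# Characters that force a label to be quoted: structural symbols, the
# quote itself, a literal underscore, and whitespace other than a space.
_NEEDS_QUOTES = re.compile(r"[,;:()\[\]'_\t\n\r]")


def escape_label(label: str) -> str:
    """Return a label as it could be written in a Newick string."""
    if _NEEDS_QUOTES.search(label):
        # Double any inner single-quote, then wrap the whole label.
        return "'" + label.replace("'", "''") + "'"
    # Only plain spaces remain to handle: write them as underscores.
    return label.replace(" ", "_")


def unescape_label(token: str, convert_underscores: bool = True) -> str:
    """Return the text of a label token read from a Newick string."""
    if len(token) >= 2 and token[0] == "'" and token[-1] == "'":
        return token[1:-1].replace("''", "'")
    return token.replace("_", " ") if convert_underscores else token


print(escape_label("A_1"))         # 'A_1'
print(escape_label("'A'_1"))       # '''A''_1'
print(escape_label("d d"))         # d_d
print(unescape_label("'b_b'''"))   # b_b'
```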

Inline Comments#

Square-brackets define a comment. Comments are the least commonly used part of the Newick format. They are not included in the generated objects and exist only as human-readable text ignored by the parser. The implementation in scikit-bio allows for nested comments ([comment [nested]]). Unpaired square-brackets can be escaped with a single-quote preceding the bracket when inside an existing comment. (This is identical to escaping a single-quote.) The single-quote has the highest operator precedence, so there is no need to worry about starting a comment from within a properly escaped label.
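The comment rules can be illustrated with a small depth-counting scanner. This is a sketch under stated assumptions rather than the scikit-bio tokenizer: the function name `strip_comments` is hypothetical, and it assumes that a single-quote inside a comment escapes exactly the next character.

```python
def strip_comments(newick: str) -> str:
    """Remove (possibly nested) square-bracket comments from a Newick string.

    Brackets inside a single-quoted label are kept, because the quote
    takes precedence over the comment syntax.
    """
    out = []
    depth = 0          # current comment nesting depth
    in_quote = False   # True while inside a quoted label
    i, n = 0, len(newick)
    while i < n:
        ch = newick[i]
        if depth > 0:
            if ch == "'" and i + 1 < n:
                i += 2             # skip the quote and the escaped character
                continue
            if ch == "[":
                depth += 1
            elif ch == "]":
                depth -= 1
        elif in_quote:
            out.append(ch)
            if ch == "'":
                if i + 1 < n and newick[i + 1] == "'":
                    out.append("'")  # doubled quote: an escaped single-quote
                    i += 1
                else:
                    in_quote = False
        elif ch == "'":
            in_quote = True
            out.append(ch)
        elif ch == "[":
            depth = 1
        else:
            out.append(ch)
        i += 1
    if depth:
        raise ValueError("unclosed comment")
    return "".join(out)


print(strip_comments("[example](a[comment [nested]], 'b[b]')f;"))
```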

Whitespace#

Whitespace is not allowed within any un-escaped label or in any length, but it is permitted anywhere else.

Caveats#

Newick cannot always provide a unique representation of a tree. In other words, the same tree can be written in multiple ways. For example, (A, B); is isomorphic to (B, A);. The implementation in scikit-bio maintains the given sibling order in its object representations.
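One way to see that two such strings describe the same tree is to compare an order-independent form of each. The sketch below is purely illustrative and does not use scikit-bio: it assumes each node is a hypothetical `(label, children)` pair, with an empty string for an unlabeled node, and the helper name `canonical` is an assumption.

```python
def canonical(node):
    """Return an order-independent form of a (label, children) node."""
    label, children = node
    # Sorting the canonical forms of the children removes sibling order.
    return (label, tuple(sorted(canonical(child) for child in children)))


ab = ("", [("A", []), ("B", [])])  # corresponds to (A, B);
ba = ("", [("B", []), ("A", [])])  # corresponds to (B, A);

print(ab == ba)                        # False: sibling order differs
print(canonical(ab) == canonical(ba))  # True: same tree once order is ignored
```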

Newick has no representation of an unrooted tree. Some biological packages assume that a tree must be unrooted when a trifurcated root exists in an otherwise bifurcated tree. In scikit-bio, skbio.tree.TreeNode will always be rooted at the newick root (;).

Format Parameters#

The only supported format parameter is convert_underscores. This is True by default. When False, underscores found in unescaped labels will not be converted to spaces. This is useful when reading the output of an external program in which the underscores were not escaped. This parameter only affects read operations. It does not exist for write operations; they will always properly escape underscores.

Examples#

This is a simple Newick string.

>>> from io import StringIO
>>> from skbio import read
>>> from skbio.tree import TreeNode
>>> f = StringIO("((D, E)B, (F, G)C)A;")
>>> tree = read(f, format="newick", into=TreeNode)
>>> f.close()
>>> print(tree.ascii_art())
                    /-D
          /B-------|
         |          \-E
-A-------|
         |          /-F
          \C-------|
                    \-G

This is a complex Newick string.

>>> f = StringIO("[example](a:0.1, 'b_b''':0.2, (c:0.3, d_d:0.4)e:0.5)f:0.0;")
>>> tree = read(f, format="newick", into=TreeNode)
>>> f.close()
>>> print(tree.ascii_art())
          /-a
         |
-f-------|--b_b'
         |
         |          /-c
          \e-------|
                    \-d d

Notice that the node originally labeled d_d became d d. Additionally, 'b_b''' became b_b'. Note that the underscore was preserved in b_b'.

References#