GenBank format (`skbio.io.format.genbank`)#

GenBank format (GenBank Flat File Format) stores a sequence and its annotation together. The start of the annotation section is marked by a line beginning with the word “LOCUS”. The start of the sequence section is marked by a line beginning with the word “ORIGIN”, and the end of the section is marked by a line containing only “//”.
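The section markers described above can be illustrated with a minimal sketch in plain Python. The helper `split_genbank_record` is hypothetical and is not part of scikit-bio; it assumes a single well-formed record and only shows how the “LOCUS”, “ORIGIN”, and “//” lines delimit the two sections.

```python
def split_genbank_record(text):
    """Split one GenBank record into (annotation_lines, sequence).

    Hypothetical helper for illustration only; not scikit-bio's parser.
    """
    annotation, residues = [], []
    section = None  # None until LOCUS is seen, then "annotation" or "sequence"
    for line in text.splitlines():
        if line.startswith("LOCUS"):
            section = "annotation"
        elif line.startswith("ORIGIN"):
            section = "sequence"
            continue  # the ORIGIN line itself carries no residues
        elif line.strip() == "//":
            break  # end of the record
        if section == "annotation":
            annotation.append(line)
        elif section == "sequence":
            # Drop the leading position number and the spaces between blocks.
            residues.extend(line.split()[1:])
    return annotation, "".join(residues)


record = """LOCUS       3K1V_A       34 bp    RNA     linear   SYN 10-OCT-2012
DEFINITION  Chain A, Structure Of A Mutant Class-I Preq1.
ORIGIN
        1 agaggttcta gcacatccct ctataaaaaa ctaa
//
"""
annotation, sequence = split_genbank_record(record)
print(len(annotation), sequence)
```

In practice the scikit-bio readers described below do this work and also parse the annotation into metadata; the sketch is only meant to make the file layout concrete.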

GenBank files usually have the extension .gb or, less often, .gbk. The GenBank format for protein sequences has been renamed to GenPept. GenBank (for nucleotide sequences) and GenPept are essentially the same format. An example of a GenBank file can be seen here [1].

Format Support#

Has Sniffer: Yes

Reader	Writer	Object Class
Yes	Yes	`skbio.sequence.Sequence`
Yes	Yes	`skbio.sequence.DNA`
Yes	Yes	`skbio.sequence.RNA`
Yes	Yes	`skbio.sequence.Protein`
Yes	Yes	generator of `skbio.sequence.Sequence` objects

Format Specification#

Sections before `FEATURES`#

All the sections before FEATURES are read into the metadata attribute. The header of each section and its content are stored as a key-value pair in metadata. For the REFERENCE section, the value is stored as a list, as there are often multiple REFERENCE sections in one GenBank record.

`FEATURES` section#

The International Nucleotide Sequence Database Collaboration (INSDC [2]) is a joint effort among the DDBJ, EMBL, and GenBank. These organisations all use the same “Feature Table” layout in their plain text flat file formats, which is documented in detail in [3]. The feature keys and their qualifiers are also described in [4].

The FEATURES section is stored in the interval_metadata of Sequence or its sub-class. Each sub-section is stored as an Interval object in interval_metadata, and the metadata of that Interval object holds the information about the feature described in the sub-section.

To normalize the vocabulary between multiple formats (currently only the INSDC Feature Table and GFF3) to store metadata of interval features, we rename some terms in some formats to the same common name when parsing them into memory, as described in this table:

INSDC feature table	GFF3 columns or attributes	Key stored	Value type stored	Description
inference	source (column 2)	source	str	the algorithm or experiment used to generate this feature
feature key	type (column 3)	type	str	the type of the feature
N/A	score (column 6)	score	float	the score of the feature
N/A	strand (column 7)	strand	str	the strand of the feature. + for positive strand, - for minus strand, and . for features that are not stranded. In addition, ? can be used for features whose strandedness is relevant, but unknown.
codon_start	phase (column 8)	phase	int	the offset at which the first complete codon of a coding feature can be found, relative to the first base of that feature. It is 0, 1, or 2 in GFF3 or 1, 2, or 3 in GenBank. The stored value is 0, 1, or 2, following the GFF3 convention.
db_xref	Dbxref	db_xref	list of str	A database cross reference
N/A	ID	ID	str	feature ID
note	Note	note	str	any comment or additional information
translation	N/A	translation	str	the protein sequence for CDS features
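The codon_start row of the table above implies a simple off-by-one conversion between the GenBank qualifier and the stored phase. The following is a minimal sketch of that arithmetic; the function names are hypothetical and are not taken from scikit-bio.

```python
def codon_start_to_phase(codon_start):
    """Convert a GenBank /codon_start value (1, 2, or 3) to a 0-based phase."""
    value = int(codon_start)  # qualifier values are often read as strings
    if value not in (1, 2, 3):
        raise ValueError(f"codon_start must be 1, 2, or 3, got {codon_start!r}")
    return value - 1


def phase_to_codon_start(phase):
    """Convert a stored phase (0, 1, or 2) back to a GenBank /codon_start."""
    if phase not in (0, 1, 2):
        raise ValueError(f"phase must be 0, 1, or 2, got {phase!r}")
    return phase + 1


for codon_start in ("1", "2", "3"):
    print(codon_start, "->", codon_start_to_phase(codon_start))
```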

`Location` string#

The Feature Table defines five types of location descriptor. The list below explains how each one is parsed into the bounds of an Interval object (note that the 1-based coordinates are converted to 0-based):

A single base number, e.g. 67. This is parsed to (66, 67).

A site between two neighboring bases, e.g. 67^68. This is parsed to (66, 67).

A single base from inside a range, e.g. 67.89. This is parsed to (66, 89).

A pair of base numbers defining a sequence span, e.g. 67..89. This is parsed to (66, 89).

A remote sequence identifier followed by one of the location descriptors defined above, e.g. J00123.1:67..89. This is discarded because it is not on the current sequence. When it is combined with a local descriptor, as in J00123.1:67..89,200..209, only the local part is kept, giving (199, 209).
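The five rules above can be sketched in plain Python as follows. This is an illustrative approximation rather than scikit-bio's parser: the helper `parse_location` is hypothetical, and it deliberately ignores location operators such as join(...) and complement(...) and partial-end markers such as < and >, which the list above does not cover.

```python
import re

# One local descriptor: a base number, optionally followed by "..", "." or "^"
# and a second base number.
_LOCAL = re.compile(r"^(\d+)(?:(\.\.|\.|\^)(\d+))?$")


def parse_location(location):
    """Return 0-based (start, end) bounds for each local descriptor."""
    bounds = []
    for part in location.split(","):
        part = part.strip()
        if ":" in part:
            continue  # remote descriptor: not on the current sequence
        match = _LOCAL.match(part)
        if match is None:
            raise ValueError(f"unsupported location descriptor: {part!r}")
        first, operator, second = match.groups()
        start = int(first) - 1  # convert 1-based to 0-based
        if operator in (None, "^"):
            end = int(first)  # single base, or site between two bases
        else:
            end = int(second)  # "67.89" and "67..89" both span to 89
        bounds.append((start, end))
    return bounds


examples = ["67", "67^68", "67.89", "67..89",
            "J00123.1:67..89", "J00123.1:67..89,200..209"]
for example in examples:
    print(example, "->", parse_location(example))
```

Checking for the colon before applying the regular expression keeps the dot inside an accession such as J00123.1 from being mistaken for a range separator.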

Note

The full Location string is stored in Interval.metadata under the key __location. Keys starting with __ are “private” and should be modified with care.

`ORIGIN` section#

The sequence in the ORIGIN section is always in lowercase for GenBank files downloaded from NCBI. For RNA molecules, t (thymine) is used in the sequence instead of u (uracil). All GenBank writers follow these conventions when writing GenBank files.
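The two conventions above (lowercase letters, and t in place of u) can be sketched in plain Python. The helper `format_origin` is hypothetical and is not scikit-bio's writer. The layout it produces (60 residues per line in blocks of 10, with the 1-based position right-aligned in the first 9 columns) is an assumption based on the example record shown in this document.

```python
def format_origin(sequence, line_width=60, block=10):
    """Render a sequence as a GenBank-style ORIGIN section.

    Hypothetical helper for illustration only.
    """
    # Lowercase the residues and write RNA with t instead of u.
    seq = sequence.lower().replace("u", "t")
    lines = ["ORIGIN"]
    for i in range(0, len(seq), line_width):
        chunk = seq[i:i + line_width]
        blocks = [chunk[j:j + block] for j in range(0, len(chunk), block)]
        # 1-based position of the first residue on this line, right-aligned.
        lines.append(f"{i + 1:>9} " + " ".join(blocks))
    lines.append("//")
    return "\n".join(lines)


origin = format_origin("AGAGGUUCUAGCACAUCCCUCUAUAAAAAACUAA")
print(origin)
```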

Format Parameters#

Reader-specific Parameters#

The constructor parameter can be used with the Sequence generator reader to specify the in-memory type of each GenBank record that is parsed. constructor should be Sequence or a sub-class of Sequence. If constructor is not set, the type is detected from the unit label on the LOCUS line: if the unit is bp, the record is read into DNA; if it is aa, it is read into Protein; otherwise, it is read into Sequence. Setting constructor overrides this default behavior.
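The selection rule described above can be sketched as a small decision function. Both helpers below are hypothetical, and class names are represented as plain strings so that the sketch runs without scikit-bio installed; it only mirrors the rule stated in this section and is not the library's implementation.

```python
import re


def locus_unit(locus_line):
    """Extract the unit label ("bp" or "aa") that follows the length."""
    match = re.search(r"\s\d+\s+(bp|aa)\b", locus_line)
    return match.group(1) if match else None


def choose_sequence_type(unit, constructor=None):
    """Pick the in-memory type for a record (strings stand in for classes)."""
    if constructor is not None:
        return constructor  # an explicit constructor overrides detection
    if unit == "bp":
        return "DNA"
    if unit == "aa":
        return "Protein"
    return "Sequence"


locus = "LOCUS       3K1V_A       34 bp    RNA     linear   SYN 10-OCT-2012"
unit = locus_unit(locus)
print(unit, choose_sequence_type(unit))
print(unit, choose_sequence_type(unit, constructor="RNA"))
```

Note that a bp record is detected as DNA even when the LOCUS line says RNA, which is consistent with GenBank writing RNA sequences with t instead of u.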

lowercase is another parameter available for all GenBank readers. By default, it is set to True to read in the ORIGIN sequence as lowercase letters. This parameter is passed to Sequence or its sub-class constructor.

seq_num is a parameter used with the Sequence, DNA, RNA, and Protein GenBank readers. It specifies which GenBank record to read from a GenBank file with multiple records in it.

Examples#

Reading and Writing GenBank Files#

Suppose we have the following GenBank file example modified from [5]:

>>> gb_str = '''
... LOCUS       3K1V_A       34 bp    RNA     linear   SYN 10-OCT-2012
... DEFINITION  Chain A, Structure Of A Mutant Class-I Preq1.
... ACCESSION   3K1V_A
... VERSION     3K1V_A  GI:260656459
... KEYWORDS    .
... SOURCE      synthetic construct
...   ORGANISM  synthetic construct
...             other sequences; artificial sequences.
... REFERENCE   1  (bases 1 to 34)
...   AUTHORS   Klein,D.J., Edwards,T.E. and Ferre-D'Amare,A.R.
...   TITLE     Cocrystal structure of a class I preQ1 riboswitch
...   JOURNAL   Nat. Struct. Mol. Biol. 16 (3), 343-344 (2009)
...    PUBMED   19234468
... COMMENT     SEQRES.
... FEATURES             Location/Qualifiers
...      source          1..34
...                      /organism="synthetic construct"
...                      /mol_type="other RNA"
...                      /db_xref="taxon:32630"
...      misc_binding    1..30
...                      /note="Preq1 riboswitch"
...                      /bound_moiety="preQ1"
... ORIGIN
...         1 agaggttcta gcacatccct ctataaaaaa ctaa
... //
... '''

Now we can read it as a DNA object:

>>> import io
>>> from skbio import DNA, RNA, Sequence
>>> gb = io.StringIO(gb_str)
>>> dna_seq = DNA.read(gb)
>>> dna_seq
DNA
-----------------------------------------------------------------
Metadata:
    'ACCESSION': '3K1V_A'
    'COMMENT': 'SEQRES.'
    'DEFINITION': 'Chain A, Structure Of A Mutant Class-I Preq1.'
    'KEYWORDS': '.'
    'LOCUS': <class 'dict'>
    'REFERENCE': <class 'list'>
    'SOURCE': <class 'dict'>
    'VERSION': '3K1V_A  GI:260656459'
Interval metadata:
    2 interval features
Stats:
    length: 34
    has gaps: False
    has degenerates: False
    has definites: True
    GC-content: 35.29%
-----------------------------------------------------------------
0 AGAGGTTCTA GCACATCCCT CTATAAAAAA CTAA

Since this is a riboswitch molecule, we may want to read it as RNA. As a GenBank file usually has t instead of u in the sequence, reading it as RNA converts t to u:

>>> gb = io.StringIO(gb_str)
>>> rna_seq = RNA.read(gb)
>>> rna_seq
RNA
-----------------------------------------------------------------
Metadata:
    'ACCESSION': '3K1V_A'
    'COMMENT': 'SEQRES.'
    'DEFINITION': 'Chain A, Structure Of A Mutant Class-I Preq1.'
    'KEYWORDS': '.'
    'LOCUS': <class 'dict'>
    'REFERENCE': <class 'list'>
    'SOURCE': <class 'dict'>
    'VERSION': '3K1V_A  GI:260656459'
Interval metadata:
    2 interval features
Stats:
    length: 34
    has gaps: False
    has degenerates: False
    has definites: True
    GC-content: 35.29%
-----------------------------------------------------------------
0 AGAGGUUCUA GCACAUCCCU CUAUAAAAAA CUAA

>>> rna_seq == dna_seq.transcribe()
True

We can also write the DNA object back out in GenBank format:

>>> with io.StringIO() as fh:
...     print(dna_seq.write(fh, format='genbank').getvalue())
LOCUS       3K1V_A   34 bp   RNA   linear   SYN   10-OCT-2012
DEFINITION  Chain A, Structure Of A Mutant Class-I Preq1.
ACCESSION   3K1V_A
VERSION     3K1V_A  GI:260656459
KEYWORDS    .
SOURCE      synthetic construct
  ORGANISM  synthetic construct
            other sequences; artificial sequences.
REFERENCE   1  (bases 1 to 34)
  AUTHORS   Klein,D.J., Edwards,T.E. and Ferre-D'Amare,A.R.
  TITLE     Cocrystal structure of a class I preQ1 riboswitch
  JOURNAL   Nat. Struct. Mol. Biol. 16 (3), 343-344 (2009)
  PUBMED    19234468
COMMENT     SEQRES.
FEATURES             Location/Qualifiers
     source          1..34
                     /db_xref="taxon:32630"
                     /mol_type="other RNA"
                     /organism="synthetic construct"
     misc_binding    1..30
                     /bound_moiety="preQ1"
                     /note="Preq1 riboswitch"
ORIGIN
        1 agaggttcta gcacatccct ctataaaaaa ctaa
//

References#

[1] http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html

[2] http://www.insdc.org/

[3] http://www.insdc.org/files/feature_table.html

[4] http://www.ebi.ac.uk/ena/WebFeat/

[5] http://www.ncbi.nlm.nih.gov/nuccore/3K1V_A
