GenBank format (skbio.io.format.genbank)#
GenBank format (GenBank Flat File Format) stores a sequence and its annotation together. The start of the annotation section is marked by a line beginning with the word “LOCUS”. The start of the sequence section is marked by a line beginning with the word “ORIGIN”, and the end of the section is marked by a line containing only “//”.
GenBank files usually end with .gb or sometimes .gbk. The GenBank format for protein has been renamed to GenPept. GenBank (for nucleotide) and GenPept are essentially the same format. An example of a GenBank file can be seen here [1].
Format Support#
Has Sniffer: Yes
Reader | Writer | Object Class |
---|---|---|
Yes | Yes | Sequence |
Yes | Yes | DNA |
Yes | Yes | RNA |
Yes | Yes | Protein |
Yes | Yes | generator of Sequence objects |
Format Specification#
Sections before FEATURES#
All the sections before FEATURES are read into the metadata attribute. The header of each section and its content are stored as a key-value pair in metadata. For the REFERENCE section, the value is stored as a list, as there are often multiple reference sections in one GenBank record.
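The mapping described above can be sketched in plain Python. This is a simplified illustration, not scikit-bio's parser: `parse_header_sections` is a hypothetical helper, it folds sub-keys such as TITLE into the parent section's string, and the second REFERENCE entry in the sample is invented to show the list behavior.

```python
def parse_header_sections(text):
    """Collect the top-level sections that precede FEATURES into a dict.

    Simplification: indented continuation lines and sub-keys are joined
    into the parent section's string value.
    """
    metadata = {}
    key, chunks = None, []

    def flush():
        # Store the section collected so far, if any.
        if key is None:
            return
        value = " ".join(chunks)
        if key == "REFERENCE":
            # A record may hold several REFERENCE sections, so keep a list.
            metadata.setdefault(key, []).append(value)
        else:
            metadata[key] = value

    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("FEATURES"):
            break
        if not line[0].isspace():
            # A non-indented line starts a new section.
            flush()
            key, _, rest = line.partition(" ")
            chunks = [rest.strip()]
        else:
            chunks.append(line.strip())
    flush()
    return metadata


# Sample header; the second REFERENCE is made up for illustration.
header = """\
DEFINITION  Chain A, Structure Of A Mutant Class-I Preq1.
ACCESSION   3K1V_A
REFERENCE   1 (bases 1 to 34)
  TITLE     Cocrystal structure of a class I preQ1 riboswitch
REFERENCE   2 (bases 1 to 34)
  TITLE     Direct Submission
FEATURES             Location/Qualifiers
"""
md = parse_header_sections(header)
print(md["ACCESSION"])
print(len(md["REFERENCE"]))
```

In the reader itself, some sections (LOCUS, SOURCE) are stored as dictionaries rather than flat strings, as the example output later in this page shows.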
FEATURES section#
The International Nucleotide Sequence Database Collaboration (INSDC [2]) is a joint effort among the DDBJ, EMBL, and GenBank. These organisations all use the same “Feature Table” layout in their plain text flat file formats, which are documented in detail [3]. The feature keys and their qualifiers are also described in this webpage [4].
The FEATURES section will be stored in interval_metadata of Sequence or its sub-class. Each sub-section is stored as an Interval object in interval_metadata. Each Interval object has metadata keeping the information of this feature in the sub-section.
To normalize the vocabulary used to store interval feature metadata across formats (currently only the INSDC Feature Table and GFF3), some terms are renamed to a common name when they are parsed into memory, as described in this table:
INSDC feature table | GFF3 columns or attributes | Key stored | Value type stored | Description |
---|---|---|---|---|
inference | source (column 2) | source | str | the algorithm or experiment used to generate this feature |
feature key | type (column 3) | type | str | the type of the feature |
N/A | score (column 6) | score | float | the score of the feature |
N/A | strand (column 7) | strand | str | the strand of the feature. + for positive strand, - for minus strand, and . for features that are not stranded. In addition, ? can be used for features whose strandedness is relevant, but unknown. |
codon_start | phase (column 8) | phase | int | the offset at which the first complete codon of a coding feature can be found, relative to the first base of that feature. It is 0, 1, or 2 in GFF3 or 1, 2, or 3 in GenBank. The stored value is 0, 1, or 2, following the GFF3 format. |
db_xref | Dbxref | db_xref | list of str | A database cross reference |
N/A | ID | ID | str | feature ID |
note | Note | note | str | any comment or additional information |
translation | N/A | translation | str | the protein sequence for CDS features |
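The codon_start / phase row implies a shift of one between the two conventions. A minimal sketch of that conversion follows; the function names are hypothetical and are not part of scikit-bio.

```python
def codon_start_to_phase(codon_start):
    """Convert a GenBank codon_start (1, 2, or 3) to a stored phase (0, 1, or 2)."""
    codon_start = int(codon_start)  # qualifier values arrive as text
    if codon_start not in (1, 2, 3):
        raise ValueError("codon_start must be 1, 2, or 3")
    return codon_start - 1


def phase_to_codon_start(phase):
    """Convert a stored phase (0, 1, or 2) back to a GenBank codon_start."""
    phase = int(phase)
    if phase not in (0, 1, 2):
        raise ValueError("phase must be 0, 1, or 2")
    return phase + 1


print(codon_start_to_phase("1"))
print(phase_to_codon_start(2))
```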
Location string#
There are 5 types of location descriptors defined in the Feature Table. The following explains how each is parsed into the bounds of an Interval object (note that 1-based coordinates are converted to 0-based):
1. a single base number, e.g. 67. This is parsed to (66, 67).
2. a site between two neighboring bases, e.g. 67^68. This is parsed to (66, 67).
3. a single base from inside a range, e.g. 67.89. This is parsed to (66, 89).
4. a pair of base numbers defining a sequence span, e.g. 67..89. This is parsed to (66, 89).
5. a remote sequence identifier followed by a location descriptor defined above, e.g. J00123.1:67..89. This is discarded because it is not on the current sequence. When it is combined with a local descriptor, as in J00123.1:67..89,200..209, only the local part is kept, giving (199, 209).
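The five rules above can be sketched as a small standalone parser. This is an illustration under stated assumptions rather than scikit-bio's implementation: `parse_location` is a hypothetical name, and operators such as join(...) or complement(...) are outside the scope of this sketch.

```python
import re

# One descriptor: a number, optionally followed by "..", ".", or "^" and a
# second number. ".." is listed before "." so spans are matched first.
_DESCRIPTOR = re.compile(r"(\d+)(?:(\.\.|\.|\^)(\d+))?")


def parse_location(location):
    """Return 0-based, end-exclusive (start, end) bounds for local descriptors."""
    bounds = []
    for part in location.split(","):
        part = part.strip()
        if ":" in part:
            # Remote descriptor (e.g. J00123.1:67..89): not on this sequence.
            continue
        match = _DESCRIPTOR.fullmatch(part)
        if match is None:
            raise ValueError("unsupported location descriptor: %r" % part)
        start = int(match.group(1))
        separator, end = match.group(2), match.group(3)
        if separator is None or separator == "^":
            # A single base, or a site between two neighboring bases.
            bounds.append((start - 1, start))
        else:
            # "67.89" and "67..89" both cover the whole range.
            bounds.append((start - 1, int(end)))
    return bounds


for text in ["67", "67^68", "67.89", "67..89", "J00123.1:67..89,200..209"]:
    print(text, parse_location(text))
```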
Note
The location string is stored in full in Interval.metadata with the key __location. Keys starting with __ are “private” and should be modified with care.
ORIGIN section#
The sequence in the ORIGIN section is always in lowercase for the GenBank files downloaded from NCBI. For RNA molecules, t (thymine) is used in the sequence instead of u (uracil). All GenBank writers follow these conventions when writing GenBank files.
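The two conventions above (lowercase letters, t in place of u) can be illustrated with a short formatter. This is a sketch, not the scikit-bio writer: `format_origin` is a hypothetical helper, and the layout it assumes (a position number right-aligned in 9 columns, followed by blocks of 10 bases, 60 bases per line) is the usual NCBI flat-file layout rather than something stated in this page.

```python
def format_origin(sequence, block=10, per_line=60):
    """Render a sequence as an ORIGIN section ending with the // terminator."""
    # Apply the two conventions: lowercase letters, and t instead of u.
    seq = sequence.lower().replace("u", "t")
    lines = ["ORIGIN"]
    for offset in range(0, len(seq), per_line):
        row = seq[offset:offset + per_line]
        blocks = [row[i:i + block] for i in range(0, len(row), block)]
        # The leading number is the 1-based position of the row's first base.
        lines.append("{:>9} {}".format(offset + 1, " ".join(blocks)))
    lines.append("//")
    return "\n".join(lines)


origin = format_origin("AGAGGUUCUAGCACAUCCCUCUAUAAAAAACUAA")
print(origin)
```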
Format Parameters#
Reader-specific Parameters#
The constructor parameter can be used with the Sequence generator to specify the in-memory type of each GenBank record that is parsed. constructor should be Sequence or a sub-class of Sequence. By default, the type is detected from the unit label on the LOCUS line. For example, if the unit is bp, the record will be read into DNA; if it is aa, it will be read into Protein. Otherwise, it will be read into Sequence. This default behavior is overridden by setting constructor.
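The default detection rule can be sketched as follows. This is a hypothetical helper, not scikit-bio code: `guess_sequence_type` returns class names as strings so that the example runs without scikit-bio installed, and it assumes the unit label is the fourth whitespace-separated field of the LOCUS line. The protein LOCUS line in the usage is invented for illustration.

```python
def guess_sequence_type(locus_line):
    """Pick an in-memory type name from the unit label on a LOCUS line."""
    # Assumed field order: LOCUS, locus name, size, unit, ...
    fields = locus_line.split()
    unit = fields[3] if len(fields) > 3 else None
    # bp -> DNA, aa -> Protein, anything else -> generic Sequence.
    return {"bp": "DNA", "aa": "Protein"}.get(unit, "Sequence")


print(guess_sequence_type("LOCUS 3K1V_A 34 bp RNA linear SYN 10-OCT-2012"))
print(guess_sequence_type("LOCUS EXAMPLE_P 120 aa linear SYN 10-OCT-2012"))
```

Note that under this rule the riboswitch record used in the examples below is read as DNA by default even though its molecule type is RNA, because its unit label is bp.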
lowercase is another parameter available for all GenBank readers. By default, it is set to True to read in the ORIGIN sequence as lowercase letters. This parameter is passed to the constructor of Sequence or its sub-class.
seq_num is a parameter used with the Sequence, DNA, RNA, and Protein GenBank readers. It specifies which GenBank record to read from a GenBank file with multiple records in it.
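To make the idea of record selection concrete, here is a plain-Python sketch that splits a multi-record file on the // terminator and picks one record. `nth_record` is a hypothetical helper, the two tiny records are invented, and treating seq_num as 1-based is an assumption of this sketch rather than something stated above.

```python
def nth_record(text, seq_num=1):
    """Return the seq_num-th record (assumed 1-based) of a multi-record file."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "//":
            # "//" closes the current record.
            if current:
                records.append("\n".join(current + ["//"]))
            current = []
        elif line.strip() or current:
            # Skip blank lines that appear between records.
            current.append(line)
    if not 1 <= seq_num <= len(records):
        raise ValueError("seq_num must be between 1 and %d" % len(records))
    return records[seq_num - 1]


# Two made-up minimal records, for illustration only.
two_records = """\
LOCUS       REC_A 4 bp
ORIGIN
        1 acgt
//
LOCUS       REC_B 4 bp
ORIGIN
        1 ttga
//
"""
second = nth_record(two_records, seq_num=2)
print(second)
```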
Examples#
Reading and Writing GenBank Files#
Suppose we have the following GenBank file example modified from [5]:
>>> gb_str = '''
... LOCUS       3K1V_A       34 bp    RNA     linear   SYN 10-OCT-2012
... DEFINITION  Chain A, Structure Of A Mutant Class-I Preq1.
... ACCESSION   3K1V_A
... VERSION     3K1V_A GI:260656459
... KEYWORDS    .
... SOURCE      synthetic construct
...   ORGANISM  synthetic construct
...             other sequences; artificial sequences.
... REFERENCE   1 (bases 1 to 34)
...   AUTHORS   Klein,D.J., Edwards,T.E. and Ferre-D'Amare,A.R.
...   TITLE     Cocrystal structure of a class I preQ1 riboswitch
...   JOURNAL   Nat. Struct. Mol. Biol. 16 (3), 343-344 (2009)
...    PUBMED   19234468
... COMMENT     SEQRES.
... FEATURES             Location/Qualifiers
...      source          1..34
...                      /organism="synthetic construct"
...                      /mol_type="other RNA"
...                      /db_xref="taxon:32630"
...      misc_binding    1..30
...                      /note="Preq1 riboswitch"
...                      /bound_moiety="preQ1"
... ORIGIN
...         1 agaggttcta gcacatccct ctataaaaaa ctaa
... //
... '''
Now we can read it as a DNA object:
>>> import io
>>> from skbio import DNA, RNA, Sequence
>>> gb = io.StringIO(gb_str)
>>> dna_seq = DNA.read(gb)
>>> dna_seq
DNA
-----------------------------------------------------------------
Metadata:
    'ACCESSION': '3K1V_A'
    'COMMENT': 'SEQRES.'
    'DEFINITION': 'Chain A, Structure Of A Mutant Class-I Preq1.'
    'KEYWORDS': '.'
    'LOCUS': <class 'dict'>
    'REFERENCE': <class 'list'>
    'SOURCE': <class 'dict'>
    'VERSION': '3K1V_A GI:260656459'
Interval metadata:
    2 interval features
Stats:
    length: 34
    has gaps: False
    has degenerates: False
    has definites: True
    GC-content: 35.29%
-----------------------------------------------------------------
0 AGAGGTTCTA GCACATCCCT CTATAAAAAA CTAA
Since this is a riboswitch molecule, we may want to read it as RNA. Because GenBank files usually have t instead of u in the sequence, reading it as RNA converts t to u:
>>> gb = io.StringIO(gb_str)
>>> rna_seq = RNA.read(gb)
>>> rna_seq
RNA
-----------------------------------------------------------------
Metadata:
    'ACCESSION': '3K1V_A'
    'COMMENT': 'SEQRES.'
    'DEFINITION': 'Chain A, Structure Of A Mutant Class-I Preq1.'
    'KEYWORDS': '.'
    'LOCUS': <class 'dict'>
    'REFERENCE': <class 'list'>
    'SOURCE': <class 'dict'>
    'VERSION': '3K1V_A GI:260656459'
Interval metadata:
    2 interval features
Stats:
    length: 34
    has gaps: False
    has degenerates: False
    has definites: True
    GC-content: 35.29%
-----------------------------------------------------------------
0 AGAGGUUCUA GCACAUCCCU CUAUAAAAAA CUAA
>>> rna_seq == dna_seq.transcribe()
True
>>> with io.StringIO() as fh:
...     print(dna_seq.write(fh, format='genbank').getvalue())
LOCUS       3K1V_A   34 bp   RNA   linear   SYN   10-OCT-2012
DEFINITION  Chain A, Structure Of A Mutant Class-I Preq1.
ACCESSION   3K1V_A
VERSION     3K1V_A GI:260656459
KEYWORDS    .
SOURCE      synthetic construct
  ORGANISM  synthetic construct
            other sequences; artificial sequences.
REFERENCE   1 (bases 1 to 34)
  AUTHORS   Klein,D.J., Edwards,T.E. and Ferre-D'Amare,A.R.
  TITLE     Cocrystal structure of a class I preQ1 riboswitch
  JOURNAL   Nat. Struct. Mol. Biol. 16 (3), 343-344 (2009)
  PUBMED    19234468
COMMENT     SEQRES.
FEATURES             Location/Qualifiers
     source          1..34
                     /db_xref="taxon:32630"
                     /mol_type="other RNA"
                     /organism="synthetic construct"
     misc_binding    1..30
                     /bound_moiety="preQ1"
                     /note="Preq1 riboswitch"
ORIGIN
        1 agaggttcta gcacatccct ctataaaaaa ctaa
//