EMBL format (skbio.io.format.embl)#
The EMBL format stores a sequence and its annotation together. The annotation section starts with a line beginning with “ID”, and the sequence section starts with a line beginning with “SQ”. The “//” (terminator) line contains no data or comments and marks the end of an entry. More information on the EMBL file format can be found here [1].
An EMBL file may have a .embl or .txt extension. An example EMBL file can be seen here [2].
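The entry layout described above can be illustrated with a minimal sketch that uses only the Python standard library. The helper name `split_embl_entries` and the tiny example record are made up for illustration; this is not scikit-bio's parser, it only shows how the ID, SQ and // lines delimit an entry.

```python
def split_embl_entries(text):
    """Yield (annotation_lines, sequence_lines) for each entry in ``text``.

    Hypothetical helper for illustration only; not scikit-bio's parser.
    """
    annotation, sequence = [], []
    in_sequence = False
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines between entries
        if line.startswith("//"):
            # Terminator line: the current entry is complete.
            yield annotation, sequence
            annotation, sequence = [], []
            in_sequence = False
        elif in_sequence:
            sequence.append(line)
        else:
            annotation.append(line)
            if line.startswith("SQ"):
                # Lines after the SQ header hold the sequence itself.
                in_sequence = True


# A made-up, heavily shortened record used only to exercise the helper.
example = """\
ID   X00001; SV 1; linear; mRNA; STD; PLN; 10 BP.
XX
SQ   Sequence 10 BP; 4 A; 2 C; 2 G; 2 T; 0 other;
     aaccggttaa                                                        10
//
"""
entries = list(split_embl_entries(example))
```

Each yielded pair keeps the annotation lines (from ID up to and including the SQ header) separate from the raw sequence lines, which is roughly the split a reader has to make before parsing either part.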
Feature Level Products#
As described in [3], “Feature-level products contain nucleotide sequence and related annotations derived from submitted ENA assembled and annotated sequences. Data are distributed in flatfile format, similar to that of parent ENA records, with each flatfile representing a single feature”. Although such entries include only the sequence of the feature, their features are derived from the parent entry and cannot be applied as interval metadata. For this reason, interval metadata are ignored when reading Feature-level products, just as they would be when subsetting a generic Sequence object.
Format Support#
Has Sniffer: Yes
NOTE: Protein sequences are not supported at the moment.
Development of protein support is tracked in issue-1499 [4].
Reader | Writer | Object Class |
---|---|---|
Yes | Yes | |
Yes | Yes | |
Yes | Yes | |
No | No | |
Yes | Yes | generator of |
Format Specification#
Sections before FH (Feature Header)#
All the sections before FH (Feature Header) are read into the metadata attribute. The header of each section and its content are stored as key-value pairs in metadata. For the RN (Reference Number) section, the value is stored as a list, as there are often multiple reference sections in one EMBL record.
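As a rough illustration of the key-value layout described above, the following standard-library sketch groups header lines by their two-letter line code and keeps the RN values in a list. The function name `collect_header_sections` is hypothetical, the keys are the raw line codes rather than the key names scikit-bio uses, and this is not the library's actual parser.

```python
def collect_header_sections(lines):
    """Group EMBL header lines by line code, stopping at FH.

    Hypothetical sketch: keys are raw two-letter codes, not skbio's keys.
    """
    metadata = {}
    for line in lines:
        code, _, value = line.partition(" ")
        value = value.strip()
        if code == "FH":
            break  # the feature table starts here
        if code == "XX":
            continue  # spacer line, carries no data
        if code == "RN":
            # One list item per reference section.
            metadata.setdefault("RN", []).append(value)
        elif code in metadata:
            # Continuation of a multi-line section.
            metadata[code] += " " + value
        else:
            metadata[code] = value
    return metadata


header = [
    "AC   X56734; S46826;",
    "XX",
    "KW   beta-glucosidase.",
    "XX",
    "RN   [5]",
    "XX",
    "RN   [6]",
    "XX",
    "FH   Key             Location/Qualifiers",
]
sections = collect_header_sections(header)
```

Storing RN as a list keeps every reference section, whereas a plain string value would either overwrite earlier references or run them together.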
FT section#
SQ section#
The sequence in the SQ section is always in lowercase for EMBL files downloaded from ENA. For RNA molecules, t (thymine) is used in the sequence instead of u (uracil). All EMBL writers follow these conventions when writing EMBL files.
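The two SQ conventions can be sketched with plain string operations. The helper `to_embl_sequence` is a hypothetical name used only for illustration; scikit-bio's writers handle this internally and may do so differently.

```python
def to_embl_sequence(seq):
    """Return ``seq`` in the lowercase, t-for-u form used in SQ sections.

    Hypothetical helper for illustration; not part of scikit-bio.
    """
    return seq.lower().replace("u", "t")


# An RNA fragment written the way it would appear in an SQ section.
embl_form = to_embl_sequence("AAUAUGGAUU")
```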
Examples#
Reading EMBL Files#
Suppose we have the following EMBL file example:
>>> embl_str = '''
... ID X56734; SV 1; linear; mRNA; STD; PLN; 1859 BP.
... XX
... AC X56734; S46826;
... XX
... DT 12-SEP-1991 (Rel. 29, Created)
... DT 25-NOV-2005 (Rel. 85, Last updated, Version 11)
... XX
... DE Trifolium repens mRNA for non-cyanogenic beta-glucosidase
... XX
... KW beta-glucosidase.
... XX
... OS Trifolium repens (white clover)
... OC Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
... OC Spermatophyta; Magnoliophyta; eudicotyledons; Gunneridae;
... OC Pentapetalae; rosids; fabids; Fabales; Fabaceae; Papilionoideae;
... OC Trifolieae; Trifolium.
... XX
... RN [5]
... RP 1-1859
... RX DOI; 10.1007/BF00039495.
... RX PUBMED; 1907511.
... RA Oxtoby E., Dunn M.A., Pancoro A., Hughes M.A.;
... RT "Nucleotide and derived amino acid sequence of the cyanogenic
... RT beta-glucosidase (linamarase) from white clover
... RT (Trifolium repens L.)";
... RL Plant Mol. Biol. 17(2):209-219(1991).
... XX
... RN [6]
... RP 1-1859
... RA Hughes M.A.;
... RT ;
... RL Submitted (19-NOV-1990) to the INSDC.
... RL Hughes M.A., University of Newcastle Upon Tyne, Medical School,
... RL Newcastle
... RL Upon Tyne, NE2 4HH, UK
... XX
... DR MD5; 1e51ca3a5450c43524b9185c236cc5cc.
... XX
... FH Key Location/Qualifiers
... FH
... FT source 1..1859
... FT /organism="Trifolium repens"
... FT /mol_type="mRNA"
... FT /clone_lib="lambda gt10"
... FT /clone="TRE361"
... FT /tissue_type="leaves"
... FT /db_xref="taxon:3899"
... FT mRNA 1..1859
... FT /experiment="experimental evidence, no additional
... FT details recorded"
... FT CDS 14..1495
... FT /product="beta-glucosidase"
... FT /EC_number="3.2.1.21"
... FT /note="non-cyanogenic"
... FT /db_xref="GOA:P26204"
... FT /db_xref="InterPro:IPR001360"
... FT /db_xref="InterPro:IPR013781"
... FT /db_xref="InterPro:IPR017853"
... FT /db_xref="InterPro:IPR033132"
... FT /db_xref="UniProtKB/Swiss-Prot:P26204"
... FT /protein_id="CAA40058.1"
... FT /translation="MDFIVAIFALFVISSFTITSTNAVEASTLLDIGNLSRS
... FT SFPRGFIFGAGSSAYQFEGAVNEGGRGPSIWDTFTHKYPEKIRDGSNADITV
... FT DQYHRYKEDVGIMKDQNMDSYRFSISWPRILPKGKLSGGINHEGIKYYNNLI
... FT NELLANGIQPFVTLFHWDLPQVLEDEYGGFLNSGVINDFRDYTDLCFKEFGD
... FT RVRYWSTLNEPWVFSNSGYALGTNAPGRCSASNVAKPGDSGTGPYIVTHNQI
... FT LAHAEAVHVYKTKYQAYQKGKIGITLVSNWLMPLDDNSIPDIKAAERSLDFQ
... FT FGLFMEQLTTGDYSKSMRRIVKNRLPKFSKFESSLVNGSFDFIGINYYSSSY
... FT ISNAPSHGNAKPSYSTNPMTNISFEKHGIPLGPRAASIWIYVYPYMFIQEDF
... FT EIFCYILKINITILQFSITENGMNEFNDATLPVEEALLNTYRIDYYYRHLYY
... FT IRSAIRAGSNVKGFYAWSFLDCNEWFAGFTVRFGLNFVD"
... XX
... SQ Sequence 1859 BP; 609 A; 314 C; 355 G; 581 T; 0 other;
... aaacaaacca aatatggatt ttattgtagc catatttgct ctgtttgtta ttagctcatt
... cacaattact tccacaaatg cagttgaagc ttctactctt cttgacatag gtaacctgag
... tcggagcagt tttcctcgtg gcttcatctt tggtgctgga tcttcagcat accaatttga
... aggtgcagta aacgaaggcg gtagaggacc aagtatttgg gataccttca cccataaata
... tccagaaaaa ataagggatg gaagcaatgc agacatcacg gttgaccaat atcaccgcta
... caaggaagat gttgggatta tgaaggatca aaatatggat tcgtatagat tctcaatctc
... ttggccaaga atactcccaa agggaaagtt gagcggaggc ataaatcacg aaggaatcaa
... atattacaac aaccttatca acgaactatt ggctaacggt atacaaccat ttgtaactct
... ttttcattgg gatcttcccc aagtcttaga agatgagtat ggtggtttct taaactccgg
... tgtaataaat gattttcgag actatacgga tctttgcttc aaggaatttg gagatagagt
... gaggtattgg agtactctaa atgagccatg ggtgtttagc aattctggat atgcactagg
... aacaaatgca ccaggtcgat gttcggcctc caacgtggcc aagcctggtg attctggaac
... aggaccttat atagttacac acaatcaaat tcttgctcat gcagaagctg tacatgtgta
... taagactaaa taccaggcat atcaaaaggg aaagataggc ataacgttgg tatctaactg
... gttaatgcca cttgatgata atagcatacc agatataaag gctgccgaga gatcacttga
... cttccaattt ggattgttta tggaacaatt aacaacagga gattattcta agagcatgcg
... gcgtatagtt aaaaaccgat tacctaagtt ctcaaaattc gaatcaagcc tagtgaatgg
... ttcatttgat tttattggta taaactatta ctcttctagt tatattagca atgccccttc
... acatggcaat gccaaaccca gttactcaac aaatcctatg accaatattt catttgaaaa
... acatgggata cccttaggtc caagggctgc ttcaatttgg atatatgttt atccatatat
... gtttatccaa gaggacttcg agatcttttg ttacatatta aaaataaata taacaatcct
... gcaattttca atcactgaaa atggtatgaa tgaattcaac gatgcaacac ttccagtaga
... agaagctctt ttgaatactt acagaattga ttactattac cgtcacttat actacattcg
... ttctgcaatc agggctggct caaatgtgaa gggtttttac gcatggtcat ttttggactg
... taatgaatgg tttgcaggct ttactgttcg ttttggatta aactttgtag attagaaaga
... tggattaaaa aggtacccta agctttctgc ccaatggtac aagaactttc tcaaaagaaa
... ctagctagta ttattaaaag aactttgtag tagattacag tacatcgttt gaagttgagt
... tggtgcacct aattaaataa aagaggttac tcttaacata tttttaggcc attcgttgtg
... aagttgttag gctgttattt ctattatact atgttgtagt aataagtgca ttgttgtacc
... agaagctatg atcataacta taggttgatc cttcatgtat cagtttgatg ttgagaatac
... tttgaattaa aagtcttttt ttattttttt aaaaaaaaaa aaaaaaaaaa aaaaaaaaa
... //
... '''
Now we can read it as a DNA object:
>>> import io
>>> from skbio import DNA, RNA, Sequence
>>> embl = io.StringIO(embl_str)
>>> dna_seq = DNA.read(embl)
>>> dna_seq
DNA
----------------------------------------------------------------------
Metadata:
'ACCESSION': 'X56734; S46826;'
'CROSS_REFERENCE': <class 'list'>
'DATE': <class 'list'>
'DBSOURCE': 'MD5; 1e51ca3a5450c43524b9185c236cc5cc.'
'DEFINITION': 'Trifolium repens mRNA for non-cyanogenic beta-
glucosidase'
'KEYWORDS': 'beta-glucosidase.'
'LOCUS': <class 'dict'>
'REFERENCE': <class 'list'>
'SOURCE': <class 'dict'>
'VERSION': 'X56734.1'
Interval metadata:
3 interval features
Stats:
length: 1859
has gaps: False
has degenerates: False
has definites: True
GC-content: 35.99%
----------------------------------------------------------------------
0 AAACAAACCA AATATGGATT TTATTGTAGC CATATTTGCT CTGTTTGTTA TTAGCTCATT
60 CACAATTACT TCCACAAATG CAGTTGAAGC TTCTACTCTT CTTGACATAG GTAACCTGAG
...
1740 AGAAGCTATG ATCATAACTA TAGGTTGATC CTTCATGTAT CAGTTTGATG TTGAGAATAC
1800 TTTGAATTAA AAGTCTTTTT TTATTTTTTT AAAAAAAAAA AAAAAAAAAA AAAAAAAAA
Since this is an mRNA molecule, we may want to read it as RNA. As EMBL files usually have t instead of u in the sequence, reading the file as RNA converts t to u:
>>> embl = io.StringIO(embl_str)
>>> rna_seq = RNA.read(embl)
>>> rna_seq
RNA
----------------------------------------------------------------------
Metadata:
'ACCESSION': 'X56734; S46826;'
'CROSS_REFERENCE': <class 'list'>
'DATE': <class 'list'>
'DBSOURCE': 'MD5; 1e51ca3a5450c43524b9185c236cc5cc.'
'DEFINITION': 'Trifolium repens mRNA for non-cyanogenic beta-
glucosidase'
'KEYWORDS': 'beta-glucosidase.'
'LOCUS': <class 'dict'>
'REFERENCE': <class 'list'>
'SOURCE': <class 'dict'>
'VERSION': 'X56734.1'
Interval metadata:
3 interval features
Stats:
length: 1859
has gaps: False
has degenerates: False
has definites: True
GC-content: 35.99%
----------------------------------------------------------------------
0 AAACAAACCA AAUAUGGAUU UUAUUGUAGC CAUAUUUGCU CUGUUUGUUA UUAGCUCAUU
60 CACAAUUACU UCCACAAAUG CAGUUGAAGC UUCUACUCUU CUUGACAUAG GUAACCUGAG
...
1740 AGAAGCUAUG AUCAUAACUA UAGGUUGAUC CUUCAUGUAU CAGUUUGAUG UUGAGAAUAC
1800 UUUGAAUUAA AAGUCUUUUU UUAUUUUUUU AAAAAAAAAA AAAAAAAAAA AAAAAAAAA
We can also transcribe the DNA sequence and verify that the result equals the RNA sequence:
>>> rna_seq == dna_seq.transcribe()
True
Reading EMBL Files using generators#
Suppose we have an EMBL file with multiple records. We can instantiate a generator object to iterate over the records:
>>> import skbio
>>> embl = io.StringIO(embl_str)
>>> embl_gen = skbio.io.read(embl, format="embl")
>>> dna_seq = next(embl_gen)
For more information, see skbio.io.