GFF3 format (skbio.io.format.gff3)#
GFF3 (Generic Feature Format version 3) is a standard file format for describing features for biological sequences. It contains lines of text, each consisting of 9 tab-delimited columns [1].
Format Support#
Has Sniffer: Yes
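Reader | Writer | Object Class |
---|---|---|
Yes | Yes | skbio.sequence.Sequence |
Yes | Yes | skbio.sequence.DNA |
Yes | Yes | skbio.metadata.IntervalMetadata |
Yes | Yes | generator of tuple (seq_id of str type, skbio.metadata.IntervalMetadata) |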
Format Specification#
The first line of the file is a comment that identifies the format and version. This is followed by a series of data lines. Each data line corresponds to an annotation and consists of 9 columns: SEQID, SOURCE, TYPE, START, END, SCORE, STRAND, PHASE, and ATTR.
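To make the nine-column layout concrete, here is a minimal sketch that splits one data line into named fields using only the standard library. The names `GFF3Record` and `split_data_line` are hypothetical helpers for illustration and are not part of scikit-bio.

```python
from collections import namedtuple

# Field names follow the column order given in the text above.
GFF3Record = namedtuple(
    "GFF3Record",
    ["seqid", "source", "type", "start", "end",
     "score", "strand", "phase", "attr"],
)


def split_data_line(line):
    """Split one tab-delimited GFF3 data line into its nine columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 columns, found %d" % len(fields))
    return GFF3Record(*fields)


record = split_data_line("seq_1\t.\tgene\t10\t90\t.\t+\t0\tID=gen1\n")
print(record.type, record.start, record.end)
```

All fields are kept as strings here; converting START and END to integers is left to the caller.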
Column 9 (ATTR) is a list of feature attributes in the format “tag=value”. Multiple “tag=value” pairs are delimited by semicolons. Multiple values of the same tag are separated with the comma “,”. The following tags have predefined meanings: ID, Name, Alias, Parent, Target, Gap, Derives_from, Note, Dbxref, Ontology_term, and Is_circular.
The meaning and format of these columns and attributes are explained in detail in the format specification [1]. They are read in using the vocabulary defined in the GenBank parser (skbio.io.format.genbank).
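The “tag=value” layout of column 9 described above can be sketched as a small parser. This is an illustration under stated assumptions, not the scikit-bio reader: `parse_attributes` is a hypothetical helper, every tag is mapped to a list of values, and the percent-decoding step follows the escaping rule of the GFF3 specification rather than anything stated on this page.

```python
from urllib.parse import unquote


def parse_attributes(attr):
    """Parse a GFF3 ATTR column into a dict mapping tag -> list of values."""
    parsed = {}
    for pair in attr.strip().split(";"):
        if not pair:
            continue  # tolerate a trailing semicolon
        tag, _, value = pair.partition("=")
        # Split on literal commas first, then decode escapes such as %2C,
        # so that an escaped comma stays inside a single value.
        parsed[unquote(tag)] = [unquote(v) for v in value.split(",")]
    return parsed


attrs = parse_attributes("ID=gen1;Dbxref=db:1,db:2;Note=first%2C second")
print(attrs)
```

Splitting before decoding is the important design choice: decoding first would turn `%2C` into a comma and wrongly break the Note value in two.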
Format Parameters#
Reader-specific Parameters#
The IntervalMetadata GFF3 reader requires one parameter: seq_id. It reads the annotation with the specified sequence ID from the GFF3 file into an IntervalMetadata object.
The DNA and Sequence GFF3 readers require an integer parameter, seq_num. It specifies which GFF3 record (counting from 1) to read from a GFF3 file that contains annotations for multiple sequences.
Writer-specific Parameters#
skip_subregion is a boolean parameter used by all the GFF3 writers. It specifies whether to skip writing a separate line for each non-contiguous sub-region of a feature annotation. For example, if an IntervalMetadata object contains an interval feature for a gene with two exons, the writer emits one line in the GFF3 file when skip_subregion is True and three lines (one for the gene and one for each exon) when skip_subregion is False. The default is True.
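The one-line versus three-line behaviour can be illustrated with a short standalone sketch. It is not the scikit-bio writer: `feature_lines` is a hypothetical helper, and labelling each sub-region as an exon linked to the gene through a Parent attribute is an assumption made for this example.

```python
def feature_lines(seq_id, feat_type, feat_id, bounds, skip_subregion=True):
    """Return GFF3 data lines for one feature.

    ``bounds`` is a list of 0-based, half-open (start, end) pairs.
    Hypothetical helper for illustration only.
    """
    def line(ftype, start, end, attr):
        # GFF3 coordinates are 1-based and inclusive.
        return "\t".join(
            [seq_id, ".", ftype, str(start + 1), str(end), ".", "+", ".", attr])

    # One line spanning the whole feature is always written.
    lines = [line(feat_type, bounds[0][0], bounds[-1][1], "ID=" + feat_id)]
    if not skip_subregion and len(bounds) > 1:
        # Assumption: each sub-region becomes a child line of the feature.
        for start, end in bounds:
            lines.append(line("exon", start, end, "Parent=" + feat_id))
    return lines


exons = [(9, 30), (49, 90)]
compact = feature_lines("seq_1", "gene", "gen1", exons)
expanded = feature_lines("seq_1", "gene", "gen1", exons, skip_subregion=False)
print(len(compact), len(expanded))
```

With two sub-regions, `compact` holds the single gene line and `expanded` holds the gene line followed by one line per sub-region.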
In addition, the IntervalMetadata GFF3 writer needs a seq_id parameter. It specifies the sequence ID (column 1 in the GFF3 file) that the annotation belongs to.
Examples#
Let’s create a file stream with the following data in GFF3 format:
>>> from skbio import Sequence, DNA
>>> gff_str = '''
... ##gff-version 3
... seq_1\t.\tgene\t10\t90\t.\t+\t0\tID=gen1
... seq_1\t.\texon\t10\t30\t.\t+\t.\tParent=gen1
... seq_1\t.\texon\t50\t90\t.\t+\t.\tParent=gen1
... seq_2\t.\tgene\t80\t96\t.\t-\t.\tID=gen2
... ##FASTA
... >seq_1
... ATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGC
... ATGCATGCATGCATGCATGCATGCATGCATGCATGCATGC
... >seq_2
... ATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGC
... ATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGC
... '''
>>> import io
>>> from skbio.metadata import IntervalMetadata
>>> from skbio.io import read
>>> gff = io.StringIO(gff_str)
We can read it into IntervalMetadata. Each data line will be read into an interval feature in the IntervalMetadata object:
>>> im = read(gff, format='gff3', into=IntervalMetadata,
...           seq_id='seq_1')
>>> im
3 interval features
-------------------
Interval(interval_metadata=<4604421736>, bounds=[(9, 90)], fuzzy=[(False, False)], metadata={'type': 'gene', 'phase': 0, 'strand': '+', 'source': '.', 'score': '.', 'ID': 'gen1'})
Interval(interval_metadata=<4604421736>, bounds=[(9, 30)], fuzzy=[(False, False)], metadata={'strand': '+', 'source': '.', 'type': 'exon', 'Parent': 'gen1', 'score': '.'})
Interval(interval_metadata=<4604421736>, bounds=[(49, 90)], fuzzy=[(False, False)], metadata={'strand': '+', 'source': '.', 'type': 'exon', 'Parent': 'gen1', 'score': '.'})
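Comparing the START and END columns of the input with the bounds printed above (for example, 10 and 90 become (9, 90)) shows that the 1-based, inclusive GFF3 coordinates are stored as 0-based, half-open pairs. A minimal sketch of that conversion, using a hypothetical helper named `gff3_to_bounds` that is not part of scikit-bio:

```python
def gff3_to_bounds(start, end):
    """Convert 1-based inclusive GFF3 START/END to a 0-based half-open pair."""
    start, end = int(start), int(end)
    if start < 1 or end < start:
        raise ValueError("invalid GFF3 coordinates: %d..%d" % (start, end))
    return (start - 1, end)


sequence = "ATGC" * 25           # 100 bases, like seq_1 in the example data
lo, hi = gff3_to_bounds("10", "30")
exon = sequence[lo:hi]           # a plain Python slice covers the same bases
print((lo, hi), len(exon))
```

The half-open form matches Python slicing, so the slice length equals END - START + 1 without any further adjustment.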
We can write the IntervalMetadata object back to a GFF3 file:
>>> with io.StringIO() as fh:
...     print(im.write(fh, format='gff3', seq_id='seq_1').getvalue())
##gff-version 3
seq_1 . gene 10 90 . + 0 ID=gen1
seq_1 . exon 10 30 . + . Parent=gen1
seq_1 . exon 50 90 . + . Parent=gen1
If the GFF3 file does not contain the requested sequence ID, the reader returns an empty object:
>>> gff = io.StringIO(gff_str)
>>> im = read(gff, format='gff3', into=IntervalMetadata,
...           seq_id='foo')
>>> im
0 interval features
-------------------
We can also read the GFF3 file into a generator:
>>> gff = io.StringIO(gff_str)
>>> gen = read(gff, format='gff3')
>>> for im in gen:
...     print(im[0])  # the seq id
...     print(im[1])  # the interval metadata on this seq
seq_1
3 interval features
-------------------
Interval(interval_metadata=<4603377592>, bounds=[(9, 90)], fuzzy=[(False, False)], metadata={'type': 'gene', 'ID': 'gen1', 'source': '.', 'score': '.', 'strand': '+', 'phase': 0})
Interval(interval_metadata=<4603377592>, bounds=[(9, 30)], fuzzy=[(False, False)], metadata={'strand': '+', 'type': 'exon', 'Parent': 'gen1', 'source': '.', 'score': '.'})
Interval(interval_metadata=<4603377592>, bounds=[(49, 90)], fuzzy=[(False, False)], metadata={'strand': '+', 'type': 'exon', 'Parent': 'gen1', 'source': '.', 'score': '.'})
seq_2
1 interval feature
------------------
Interval(interval_metadata=<4603378712>, bounds=[(79, 96)], fuzzy=[(False, False)], metadata={'strand': '-', 'type': 'gene', 'ID': 'gen2', 'source': '.', 'score': '.'})
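The per-sequence grouping that the generator exposes can be imitated with the standard library alone. The sketch below is only an illustration, not how scikit-bio implements its reader: `group_by_seq_id` is a hypothetical helper, and it assumes that all data lines for one sequence are contiguous in the file.

```python
from itertools import groupby


def group_by_seq_id(lines):
    """Yield (seq_id, data_lines) for each run of lines sharing a SEQID."""
    def data_lines():
        for line in lines:
            line = line.rstrip("\n")
            if line.startswith("##FASTA"):
                return  # annotations end where the sequence section begins
            if not line or line.startswith("#"):
                continue  # skip blank lines, comments and directives
            yield line

    def seq_id_of(row):
        return row.split("\t", 1)[0]

    for seq_id, group in groupby(data_lines(), key=seq_id_of):
        yield seq_id, list(group)


text = (
    "##gff-version 3\n"
    "seq_1\t.\tgene\t10\t90\t.\t+\t0\tID=gen1\n"
    "seq_1\t.\texon\t10\t30\t.\t+\t.\tParent=gen1\n"
    "seq_2\t.\tgene\t80\t96\t.\t-\t.\tID=gen2\n"
    "##FASTA\n"
    ">seq_1\n"
    "ATGC\n"
)
groups = list(group_by_seq_id(text.splitlines()))
for seq_id, rows in groups:
    print(seq_id, len(rows))
```

Because `groupby` only merges consecutive items, a file that interleaves lines from different sequences would yield the same ID more than once; collecting rows in a dictionary keyed by SEQID would avoid that at the cost of reading the whole annotation section first.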
For a GFF3 file that includes sequences, we can read it into Sequence or DNA:
>>> gff = io.StringIO(gff_str)
>>> seq = read(gff, format='gff3', into=Sequence, seq_num=1)
>>> seq
Sequence
--------------------------------------------------------------------
Metadata:
'description': ''
'id': 'seq_1'
Interval metadata:
3 interval features
Stats:
length: 100
--------------------------------------------------------------------
0 ATGCATGCAT GCATGCATGC ATGCATGCAT GCATGCATGC ATGCATGCAT GCATGCATGC
60 ATGCATGCAT GCATGCATGC ATGCATGCAT GCATGCATGC
>>> gff = io.StringIO(gff_str)
>>> seq = read(gff, format='gff3', into=DNA, seq_num=2)
>>> seq
DNA
--------------------------------------------------------------------
Metadata:
'description': ''
'id': 'seq_2'
Interval metadata:
1 interval feature
Stats:
length: 120
has gaps: False
has degenerates: False
has definites: True
GC-content: 50.00%
--------------------------------------------------------------------
0 ATGCATGCAT GCATGCATGC ATGCATGCAT GCATGCATGC ATGCATGCAT GCATGCATGC
60 ATGCATGCAT GCATGCATGC ATGCATGCAT GCATGCATGC ATGCATGCAT GCATGCATGC