GFF3 format (skbio.io.format.gff3)#

GFF3 (Generic Feature Format version 3) is a standard file format for describing features for biological sequences. It contains lines of text, each consisting of 9 tab-delimited columns [1].

Format Support#

Has Sniffer: Yes

Reader    Writer    Object Class
Yes       Yes       skbio.sequence.Sequence
Yes       Yes       skbio.sequence.DNA
Yes       Yes       skbio.metadata.IntervalMetadata
Yes       Yes       generator of tuple (seq_id of str type, skbio.metadata.IntervalMetadata)

Format Specification#

The first line of the file is a comment that identifies the format and version. This is followed by a series of data lines. Each data line corresponds to an annotation and consists of 9 columns: SEQID, SOURCE, TYPE, START, END, SCORE, STRAND, PHASE, and ATTR.
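As a rough illustration of the nine-column layout, the sketch below splits one data line on tabs using only the standard library. The names `GFF3Row` and `parse_data_line` are hypothetical helpers invented for this example; they are not part of scikit-bio and do not reflect how its reader is implemented.

```python
from collections import namedtuple

# Column names follow the order given in the text above.
GFF3Row = namedtuple(
    "GFF3Row",
    ["seqid", "source", "type", "start", "end",
     "score", "strand", "phase", "attr"],
)


def parse_data_line(line):
    """Split one tab-delimited GFF3 data line into its 9 columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 columns, got %d" % len(fields))
    return GFF3Row(*fields)


row = parse_data_line("seq_1\t.\tgene\t10\t90\t.\t+\t0\tID=gen1\n")
print(row.seqid, row.type, row.start, row.end, row.attr)
```

All fields are kept as strings here; a real parser would also convert START and END to integers and interpret "." as a missing value.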

Column 9 (ATTR) is a list of feature attributes in the format “tag=value”. Multiple “tag=value” pairs are delimited by semicolons, and multiple values of the same tag are separated by commas. The following tags have predefined meanings: ID, Name, Alias, Parent, Target, Gap, Derives_from, Note, Dbxref, Ontology_term, and Is_circular.
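The attribute syntax can be sketched with a small standard-library parser. This is only an illustration of the “tag=value” rules described above, not scikit-bio's own parsing code; `parse_attributes` is a hypothetical name, the attribute string is made up, and the percent-decoding step assumes the URL-style escaping of reserved characters used by GFF3.

```python
from urllib.parse import unquote


def parse_attributes(attr):
    """Parse a GFF3 column-9 string into a dict of tag -> list of values."""
    parsed = {}
    for pair in attr.strip().split(";"):
        if not pair:
            continue  # tolerate a trailing semicolon
        tag, _, value = pair.partition("=")
        # Split on commas before decoding, so that an escaped comma (%2C)
        # stays inside a single value.
        parsed[unquote(tag)] = [unquote(v) for v in value.split(",")]
    return parsed


# Made-up attribute string for illustration.
attrs = parse_attributes("ID=gen1;Name=abc;Dbxref=GO:0001,GO:0002;Note=a%2Cb")
print(attrs["Dbxref"])
print(attrs["Note"])
```

Every value is returned as a list so that single-valued and multi-valued tags can be handled uniformly.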

The meaning and format of these columns and attributes are explained in detail in the format specification [1]. When read, they are mapped to the vocabulary defined by the GenBank parser (skbio.io.format.genbank).

Format Parameters#

Reader-specific Parameters#

The IntervalMetadata GFF3 reader requires one parameter, seq_id. It reads the annotations for the specified sequence ID from the GFF3 file into an IntervalMetadata object.

The DNA and Sequence GFF3 readers require an integer parameter, seq_num. It specifies which record to read from a GFF3 file that contains annotations for multiple sequences; as in the examples below, records are counted from 1.
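Conceptually, selecting annotations by seq_id amounts to filtering data lines on column 1. The sketch below groups the annotation lines of a small GFF3 text by SEQID with the standard library only. `group_by_seqid` is a hypothetical helper and the input is a shortened, made-up file; the actual scikit-bio reader may work differently.

```python
from collections import defaultdict


def group_by_seqid(text):
    """Map each SEQID to its data lines, ignoring comments and the FASTA part."""
    groups = defaultdict(list)
    for line in text.splitlines():
        if line.startswith("##FASTA"):
            break  # sequence data follows; no more annotation lines
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        groups[line.split("\t", 1)[0]].append(line)
    return groups


gff_text = (
    "##gff-version 3\n"
    "seq_1\t.\tgene\t10\t90\t.\t+\t0\tID=gen1\n"
    "seq_1\t.\texon\t10\t30\t.\t+\t.\tParent=gen1\n"
    "seq_2\t.\tgene\t80\t96\t.\t-\t.\tID=gen2\n"
    "##FASTA\n"
    ">seq_2\n"
    "ATGC\n"
)
groups = group_by_seqid(gff_text)
print(len(groups.get("seq_1", [])))  # annotation lines for seq_1
print(len(groups.get("foo", [])))    # an unknown ID has no lines
```

An ID that never appears in column 1 simply yields no lines, which mirrors the empty result shown for an unknown seq_id in the examples.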

Writer-specific Parameters#

skip_subregion is a boolean parameter used by all the GFF3 writers. It specifies whether to skip writing each non-contiguous sub-region of a feature annotation as its own line. For example, if an IntervalMetadata object contains an interval feature for a gene with two exons, the writer emits one line when skip_subregion is True and three lines (one for the gene and one for each exon) when skip_subregion is False. The default is True.

In addition, the IntervalMetadata GFF3 writer requires a seq_id parameter. It specifies the sequence ID (column 1 in the GFF3 file) that the annotations belong to.
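The effect of skip_subregion on the number of output lines can be modelled with a toy function. This is a simplified stand-in for the behaviour described above, not scikit-bio's writer; `feature_rows` is a hypothetical name, the feature is hard-coded as a gene with exon sub-regions, and the bounds are assumed to be 0-based, half-open pairs.

```python
def feature_rows(seq_id, feature_id, bounds, skip_subregion=True):
    """Return (SEQID, TYPE, START, END, ATTR) tuples for one gene feature.

    `bounds` holds 0-based, half-open (start, end) pairs, one per sub-region.
    """
    # The parent line spans from the first sub-region to the last.
    rows = [(seq_id, "gene", bounds[0][0] + 1, bounds[-1][1],
             "ID=%s" % feature_id)]
    if not skip_subregion and len(bounds) > 1:
        # One extra line per sub-region, linked back to the parent.
        for start, end in bounds:
            rows.append((seq_id, "exon", start + 1, end,
                         "Parent=%s" % feature_id))
    return rows


gene_bounds = [(9, 30), (49, 90)]
print(len(feature_rows("seq_1", "gen1", gene_bounds)))
print(len(feature_rows("seq_1", "gen1", gene_bounds, skip_subregion=False)))
```

With the default, only the gene line is produced; with skip_subregion=False the model adds one exon line per sub-region, giving three lines in total.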

Examples#

Let’s create a file stream with the following data in GFF3 format:

>>> from skbio import Sequence, DNA
>>> gff_str = '''
... ##gff-version 3
... seq_1\t.\tgene\t10\t90\t.\t+\t0\tID=gen1
... seq_1\t.\texon\t10\t30\t.\t+\t.\tParent=gen1
... seq_1\t.\texon\t50\t90\t.\t+\t.\tParent=gen1
... seq_2\t.\tgene\t80\t96\t.\t-\t.\tID=gen2
... ##FASTA
... >seq_1
... ATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGC
... ATGCATGCATGCATGCATGCATGCATGCATGCATGCATGC
... >seq_2
... ATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGC
... ATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGC
... '''
>>> import io
>>> from skbio.metadata import IntervalMetadata
>>> from skbio.io import read
>>> gff = io.StringIO(gff_str)

We can read it into an IntervalMetadata object. Each data line for the requested sequence ID becomes one interval feature:

>>> im = read(gff, format='gff3', into=IntervalMetadata,
...           seq_id='seq_1')
>>> im  
3 interval features
-------------------
Interval(interval_metadata=<4604421736>, bounds=[(9, 90)], fuzzy=[(False, False)], metadata={'type': 'gene', 'phase': 0, 'strand': '+', 'source': '.', 'score': '.', 'ID': 'gen1'})
Interval(interval_metadata=<4604421736>, bounds=[(9, 30)], fuzzy=[(False, False)], metadata={'strand': '+', 'source': '.', 'type': 'exon', 'Parent': 'gen1', 'score': '.'})
Interval(interval_metadata=<4604421736>, bounds=[(49, 90)], fuzzy=[(False, False)], metadata={'strand': '+', 'source': '.', 'type': 'exon', 'Parent': 'gen1', 'score': '.'})
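Note that the bounds shown above differ from the START column by one: GFF3 coordinates are 1-based and inclusive, while the bounds in this output are consistent with 0-based, half-open intervals. A minimal sketch of that conversion follows; the helper names are hypothetical and are not scikit-bio functions.

```python
def gff3_to_bounds(start, end):
    """Convert 1-based, inclusive GFF3 START/END to a 0-based, half-open pair."""
    return (start - 1, end)


def bounds_to_gff3(bound):
    """Inverse conversion, back to GFF3 START/END."""
    lower, upper = bound
    return (lower + 1, upper)


# The three seq_1 lines from the example file.
for start, end in [(10, 90), (10, 30), (50, 90)]:
    print((start, end), "->", gff3_to_bounds(start, end))
```

The length of a feature is the same in both systems: end - start + 1 in GFF3 terms, upper - lower in half-open terms.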

We can write the IntervalMetadata object back to a GFF3 file:

>>> with io.StringIO() as fh:    
...     print(im.write(fh, format='gff3', seq_id='seq_1').getvalue())
##gff-version 3
seq_1   .       gene    10      90      .       +       0       ID=gen1
seq_1   .       exon    10      30      .       +       .       Parent=gen1
seq_1   .       exon    50      90      .       +       .       Parent=gen1

If the GFF3 file contains no annotations for the requested sequence ID, the reader returns an empty IntervalMetadata object:

>>> gff = io.StringIO(gff_str)
>>> im = read(gff, format='gff3', into=IntervalMetadata,
...           seq_id='foo')
>>> im
0 interval features
-------------------

We can also read the GFF3 file into a generator that yields one (seq_id, IntervalMetadata) tuple per sequence:

>>> gff = io.StringIO(gff_str)
>>> gen = read(gff, format='gff3')
>>> for im in gen:   
...     print(im[0])   # the seq id
...     print(im[1])   # the interval metadata on this seq
seq_1
3 interval features
-------------------
Interval(interval_metadata=<4603377592>, bounds=[(9, 90)], fuzzy=[(False, False)], metadata={'type': 'gene', 'ID': 'gen1', 'source': '.', 'score': '.', 'strand': '+', 'phase': 0})
Interval(interval_metadata=<4603377592>, bounds=[(9, 30)], fuzzy=[(False, False)], metadata={'strand': '+', 'type': 'exon', 'Parent': 'gen1', 'source': '.', 'score': '.'})
Interval(interval_metadata=<4603377592>, bounds=[(49, 90)], fuzzy=[(False, False)], metadata={'strand': '+', 'type': 'exon', 'Parent': 'gen1', 'source': '.', 'score': '.'})
seq_2
1 interval feature
------------------
Interval(interval_metadata=<4603378712>, bounds=[(79, 96)], fuzzy=[(False, False)], metadata={'strand': '-', 'type': 'gene', 'ID': 'gen2', 'source': '.', 'score': '.'})

For a GFF3 file that also contains sequences, we can read a record into a Sequence or DNA object:

>>> gff = io.StringIO(gff_str)
>>> seq = read(gff, format='gff3', into=Sequence, seq_num=1)
>>> seq
Sequence
--------------------------------------------------------------------
Metadata:
    'description': ''
    'id': 'seq_1'
Interval metadata:
    3 interval features
Stats:
    length: 100
--------------------------------------------------------------------
0  ATGCATGCAT GCATGCATGC ATGCATGCAT GCATGCATGC ATGCATGCAT GCATGCATGC
60 ATGCATGCAT GCATGCATGC ATGCATGCAT GCATGCATGC
>>> gff = io.StringIO(gff_str)
>>> seq = read(gff, format='gff3', into=DNA, seq_num=2)
>>> seq
DNA
--------------------------------------------------------------------
Metadata:
    'description': ''
    'id': 'seq_2'
Interval metadata:
    1 interval feature
Stats:
    length: 120
    has gaps: False
    has degenerates: False
    has definites: True
    GC-content: 50.00%
--------------------------------------------------------------------
0  ATGCATGCAT GCATGCATGC ATGCATGCAT GCATGCATGC ATGCATGCAT GCATGCATGC
60 ATGCATGCAT GCATGCATGC ATGCATGCAT GCATGCATGC ATGCATGCAT GCATGCATGC

References#