Input and Output (skbio.io)#
This module provides input/output (I/O) functionality for scikit-bio.
Supported file formats#
scikit-bio provides parsers for the following file formats. For details on what objects are supported by each format, see the associated documentation.
Simple binary dissimilarity matrix format (skbio.io.format.binary_dm)
BIOM-Format (skbio.io.format.biom)
BLAST+6 format (skbio.io.format.blast6)
BLAST+7 format (skbio.io.format.blast7)
Clustal format (skbio.io.format.clustal)
EMBL format (skbio.io.format.embl)
Embedding format (skbio.io.format.embed)
FASTA/QUAL format (skbio.io.format.fasta)
FASTQ format (skbio.io.format.fastq)
GenBank format (skbio.io.format.genbank)
GFF3 format (skbio.io.format.gff3)
Labeled square matrix format (skbio.io.format.lsmat)
Newick format (skbio.io.format.newick)
Ordination results format (skbio.io.format.ordination)
PHYLIP multiple sequence alignment format (skbio.io.format.phylip)
QSeq format (skbio.io.format.qseq)
Stockholm format (skbio.io.format.stockholm)
Taxdump format (skbio.io.format.taxdump)
Sample Metadata object ported over from qiime2.
Read/write files#
Generic I/O functions
Write an object in a given format to a file.
Read a file in a given format into an object.
Detect the format of a given file and suggest kwargs for reading.
Additional I/O utilities
I/O utilities (skbio.io.util)
Develop custom formats#
Developer documentation on extending I/O
I/O Registry (skbio.io.registry)
Exceptions and warnings#
General exceptions and warnings
Warn when the sniffer of a format cannot confirm the format.
Warn when a user provided kwarg differs from a guessed kwarg.
Raised when a file's format is unknown, ambiguous, or unidentifiable.
Raised when a file source cannot be resolved.
Raised when a file cannot be parsed.
Format-specific exceptions and warnings
|
Raised when a |
|
Raised when a |
|
Raised when a |
|
Raised when a |
|
Raised when a |
|
Raised when a |
|
Raised when a |
|
Raised when a |
|
Raised when a |
|
Raised when an |
|
Raised when a |
|
Raised when a |
|
Raised when a |
|
Raised when a |
Tutorial#
Reading and writing files (I/O) can be a complicated task:
A file format can sometimes be read into more than one in-memory representation (i.e., object). For example, a FASTA file can be read into an skbio.alignment.TabularMSA or skbio.sequence.DNA depending on what operations you’d like to perform on your data.
A single object might be writable to more than one file format. For example, an skbio.alignment.TabularMSA object could be written to FASTA, FASTQ, CLUSTAL, or PHYLIP formats, just to name a few.
You might not know the exact file format of your file, but you want to read it into an appropriate object.
You might want to read multiple files into a single object, or write an object to multiple files.
Instead of reading a file into an object, you might want to stream the file using a generator (e.g., if the file cannot be fully loaded into memory).
To address these issues (and others), scikit-bio provides a simple, powerful interface for dealing with I/O. We accomplish this by using a single I/O registry.
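To make the idea of a single registry more concrete, here is a minimal sketch in plain Python. It is not scikit-bio's implementation: the class ToyRegistry, its reader decorator, and the "csvline" format are all hypothetical names invented for this illustration. The sketch only shows how registering readers per (format, class) pair lets one format be read into different objects through a single read entry point.

```python
from io import StringIO


class ToyRegistry:
    """Hypothetical registry mapping (format, class) pairs to reader functions."""

    def __init__(self):
        self._readers = {}

    def reader(self, fmt, cls):
        """Return a decorator that registers a reader for (fmt, cls)."""
        def decorator(func):
            self._readers[(fmt, cls)] = func
            return func
        return decorator

    def read(self, fh, format, into):
        """Look up the reader registered for (format, into) and call it."""
        try:
            func = self._readers[(format, into)]
        except KeyError:
            raise ValueError(
                f"no reader for format {format!r} into {into.__name__}"
            ) from None
        return func(fh)


registry = ToyRegistry()


# "csvline" is a made-up format: a single line of comma-separated values.
@registry.reader("csvline", list)
def _csvline_to_list(fh):
    return fh.read().strip().split(",")


@registry.reader("csvline", set)
def _csvline_to_set(fh):
    return set(fh.read().strip().split(","))


# The same file content is read into two different in-memory objects.
as_list = registry.read(StringIO("a,b,a\n"), format="csvline", into=list)
as_set = registry.read(StringIO("a,b,a\n"), format="csvline", into=set)
print(as_list)
print(sorted(as_set))
```

Keying the lookup on both the format and the target class is what allows one format to map to several objects, and one object to several formats, without separate entry points for each combination.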
What kinds of files scikit-bio can use#
To see a complete list of file-like inputs that can be used for reading, writing, and sniffing, see the documentation for skbio.io.util.open().
Reading files into scikit-bio#
There are two ways to read files. The first way is to use the procedural interface:
my_obj = skbio.io.read(file, format='someformat', into=SomeSkbioClass)
The second is to use the object-oriented (OO) interface, which is automatically constructed from the procedural interface:
my_obj = SomeSkbioClass.read(file, format='someformat')
For example, to read a Newick file using both interfaces, you would type:
>>> from skbio import read
>>> from skbio import TreeNode
>>> from io import StringIO
>>> open_filehandle = StringIO('(a, b);')
>>> tree = read(open_filehandle, format='newick', into=TreeNode)
>>> tree
<TreeNode, name: unnamed, internal node count: 0, tips count: 2>
For the OO interface:
>>> open_filehandle = StringIO('(a, b);')
>>> tree = TreeNode.read(open_filehandle, format='newick')
>>> tree
<TreeNode, name: unnamed, internal node count: 0, tips count: 2>
With skbio.io.registry.read(), if into is not provided, a generator is returned. What the generator yields depends on the format being read.
When into is provided, format may be omitted and the registry will use its knowledge of the available formats for the requested class to infer the correct format. This format inference is also available in the OO interface, meaning that format may be omitted there as well.
As an example:
>>> open_filehandle = StringIO('(a, b);')
>>> tree = TreeNode.read(open_filehandle)
>>> tree
<TreeNode, name: unnamed, internal node count: 0, tips count: 2>
We call format inference sniffing, much like the csv.Sniffer class of Python’s standard library. The goal of a sniffer is two-fold: to identify if a file is a specific format, and if it is, to provide **kwargs which can be used to better parse the file.
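For readers unfamiliar with csv.Sniffer, the short example below shows the standard-library analogue of this two-step idea. It uses only Python's csv module, not scikit-bio, and the sample data is made up. The sniffer inspects a sample of the content, and the dialect it returns plays the role that the suggested **kwargs play in scikit-bio: extra information that helps the parser read the file correctly.

```python
import csv

# Made-up sample: a semicolon-delimited table.
sample = "name;length\nseq1;120\nseq2;98\n"

# Step 1: inspect the content and guess how it should be parsed.
sniffer = csv.Sniffer()
dialect = sniffer.sniff(sample, delimiters=";,\t")
print(dialect.delimiter)  # ;

# Step 2: pass what the sniffer learned on to the parser.
rows = list(csv.reader(sample.splitlines(), dialect=dialect))
print(rows)
```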
Note
There is a built-in sniffer which results in a useful error message if an empty file is provided as input and the format was omitted.
Writing files from scikit-bio#
Just as when reading files, there are two ways to write files.
Procedural Interface:
skbio.io.write(my_obj, format='someformat', into=file)
OO Interface:
my_obj.write(file, format='someformat')
In the procedural interface, format is required. Without it, scikit-bio does not know how you want to serialize an object. OO interfaces define a default format, so it may not be necessary to include it.
Streaming files with read and write#
If you are working with particularly large files, streaming them might be preferable.
scikit-bio’s io module offers the ability to construct a streaming interface from the read and write functions. When into is omitted, skbio.io.read returns a generator, which can then be passed to skbio.io.write to write only one chunk from the generator at a time.
seq_gen = skbio.io.read(big_file, format='someformat')
skbio.io.write(seq_gen, into=write_file, format='someformat')
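The reason this pattern keeps memory use low can be illustrated with a small sketch in plain Python. The functions read_fasta_records and write_fasta_records are hypothetical helpers written for this illustration, and the parser is a deliberately simplified FASTA reader, not scikit-bio's. The point is only that the reader is a generator, so each record is parsed when the writer asks for it and is written before the next one is read.

```python
from io import StringIO


def read_fasta_records(fh):
    """Yield (header, sequence) tuples one at a time from a FASTA-like handle."""
    header, chunks = None, []
    for line in fh:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    # Emit the final record once the input is exhausted.
    if header is not None:
        yield header, "".join(chunks)


def write_fasta_records(records, fh):
    """Consume records lazily, writing each one as soon as it arrives."""
    for header, seq in records:
        fh.write(f">{header}\n{seq}\n")


# In-memory stand-ins for a large input file and an output file.
source = StringIO(">seq1\nACGT\nACGT\n>seq2\nTTGA\n")
sink = StringIO()

records = read_fasta_records(source)  # nothing has been parsed yet
write_fasta_records(records, sink)    # records flow through one at a time
print(sink.getvalue())
```

At no point does the full set of records exist in memory: only the record currently being passed from reader to writer is held.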