Input and Output (`skbio.io`)#

This module provides input/output (I/O) functionality for scikit-bio.

In bioinformatics there are many different file formats, and in scikit-bio there are many different classes which can read and write these formats. The many-to-many nature of the relationships between scikit-bio objects and file formats inspired the creation of the scikit-bio io module, which manages these relationships transparently.

For general guidance on reading and writing files and working with scikit-bio objects, see the Tutorial section and the Reading and writing files notebook. For guidance on a specific format or scikit-bio object, see the documentation for that format or object.

See the IORegistry docs for guidance on creating custom formats and registering custom readers, writers, and sniffers.

Supported file formats#

scikit-bio provides parsers for the following file formats. For details on what objects are supported by each format, see the associated documentation.

`binary_dm`	Simple binary dissimilarity matrix format (skbio.io.format.binary_dm)
`biom`	BIOM-Format (skbio.io.format.biom)
`blast6`	BLAST+6 format (skbio.io.format.blast6)
`blast7`	BLAST+7 format (skbio.io.format.blast7)
`clustal`	Clustal format (skbio.io.format.clustal)
`embl`	EMBL format (skbio.io.format.embl)
`embed`	Embedding format (skbio.io.format.embed)
`fasta`	FASTA/QUAL format (skbio.io.format.fasta)
`fastq`	FASTQ format (skbio.io.format.fastq)
`genbank`	GenBank format (skbio.io.format.genbank)
`gff3`	GFF3 format (skbio.io.format.gff3)
`lsmat`	Labeled square matrix format (skbio.io.format.lsmat)
`newick`	Newick format (skbio.io.format.newick)
`ordination`	Ordination results format (skbio.io.format.ordination)
`phylip`	PHYLIP multiple sequence alignment format (skbio.io.format.phylip)
`phylip_dm`	PHYLIP distance matrix format (skbio.io.format.phylip_dm)
`qseq`	QSeq format (skbio.io.format.qseq)
`stockholm`	Stockholm format (skbio.io.format.stockholm)
`taxdump`	Taxdump format (skbio.io.format.taxdump)
`sample_metadata`	Sample metadata format (ported from QIIME 2)

Read/write files#

Generic I/O functions

`write`	Write an object to a file in a given format.
`read`	Read a file in a given format into an object.
`sniff`	Detect the format of a given file and suggest kwargs for reading.

Additional I/O utilities

util

I/O utilities (skbio.io.util)

Develop custom formats#

Developer documentation on extending I/O

registry

I/O Registry (skbio.io.registry)

Exceptions and warnings#

General exceptions and warnings

`FormatIdentificationWarning`	Warn when the sniffer of a format cannot confirm the format.
`ArgumentOverrideWarning`	Warn when a user provided kwarg differs from a guessed kwarg.
`UnrecognizedFormatError`	Raised when a file's format is unknown, ambiguous, or unidentifiable.
`IOSourceError`	Raised when a file source cannot be resolved.
`FileFormatError`	Raised when a file cannot be parsed.

Format-specific exceptions and warnings

`BLAST7FormatError`	Raised when a `blast7` formatted file cannot be parsed.
`ClustalFormatError`	Raised when a `clustal` formatted file cannot be parsed.
`EMBLFormatError`	Raised when an `EMBL` formatted file cannot be parsed.
`FASTAFormatError`	Raised when a `fasta` formatted file cannot be parsed.
`FASTQFormatError`	Raised when a `fastq` formatted file cannot be parsed.
`GenBankFormatError`	Raised when a `genbank` formatted file cannot be parsed.
`GFF3FormatError`	Raised when a `GFF3` formatted file cannot be parsed.
`LSMatFormatError`	Raised when a `lsmat` formatted file cannot be parsed.
`NewickFormatError`	Raised when a `newick` formatted file cannot be parsed.
`OrdinationFormatError`	Raised when an `ordination` formatted file cannot be parsed.
`PhylipFormatError`	Raised when a `phylip` formatted file cannot be parsed.
`PhylipDMFormatError`	Raised when a `phylip_dm` formatted file cannot be parsed.
`QSeqFormatError`	Raised when a `qseq` formatted file cannot be parsed.
`QUALFormatError`	Raised when a `qual` formatted file cannot be parsed.
`StockholmFormatError`	Raised when a `stockholm` formatted file cannot be parsed.

Tutorial#

Reading and writing files (I/O) can be a complicated task:

A file format can sometimes be read into more than one in-memory representation (i.e., object). For example, a FASTA file can be read into a TabularMSA or DNA depending on what operations you’d like to perform on your data.
A single object might be writeable to more than one file format. For example, a TabularMSA object could be written to FASTA, FASTQ, CLUSTAL, or PHYLIP formats, just to name a few.
You might not know the exact file format of your file, but you want to read it into an appropriate object.
You might want to read multiple files into a single object, or write an object to multiple files.
Instead of reading a file into an object, you might want to stream the file using a generator (e.g., if the file cannot be fully loaded into memory).

To address these issues (and others), scikit-bio provides a simple, powerful interface for dealing with I/O. We accomplish this by using a single I/O registry defined in IORegistry.

What kinds of files scikit-bio can use#

To see a complete list of file-like inputs that can be used for reading, writing, and sniffing, see the documentation for skbio.io.util.open.

Reading files into scikit-bio#

There are two ways to read files. The first way is to use the procedural interface:

my_obj = skbio.io.read(file, format='someformat', into=SomeSkbioClass)

Here, file can be a path to a file, a file handle, or any of the other objects with read support listed in the skbio.io.util.open documentation.

The second way to read files is to use the object-oriented interface, which is automatically constructed from the procedural interface:

my_obj = SomeSkbioClass.read(file, format='someformat')

Note

A very common use case in bioinformatics is to read multi-line FASTA and FASTQ files. For examples on how to achieve this with scikit-bio, please see the FASTA documentation or the FASTQ documentation.

As an example, let’s read a newick file into a TreeNode object using both interfaces. Here we will use Python’s built-in StringIO class to mimic an open file:

>>> from skbio import read as sk_read
>>> from skbio import TreeNode
>>> from io import StringIO
>>> open_filehandle = StringIO('(a, b);')
>>> tree = sk_read(open_filehandle, format='newick', into=TreeNode)
>>> tree
<TreeNode, name: unnamed, internal node count: 0, tips count: 2>

Or, using the object-oriented interface:

>>> open_filehandle = StringIO('(a, b);')
>>> tree = TreeNode.read(open_filehandle, format='newick')
>>> tree
<TreeNode, name: unnamed, internal node count: 0, tips count: 2>

In the case of skbio.io.registry.read, if into is not provided, a generator is returned. What the generator yields depends on the format being read.

When into is provided, format may be omitted and the registry will use its knowledge of the available formats for the requested class to infer (sniff) the correct format. This format inference is also available in the object-oriented interface, meaning that format may be omitted there as well.

As an example:

>>> open_filehandle = StringIO('(a, b);')
>>> tree = TreeNode.read(open_filehandle)
>>> tree
<TreeNode, name: unnamed, internal node count: 0, tips count: 2>

We call format inference sniffing, much like the csv.Sniffer class of Python’s standard library. The goal of a sniffer is two-fold: to identify whether a file is in a specific format and, if it is, to provide **kwargs which can be used to better parse the file.

Note

There is a built-in sniffer which results in a useful error message if an empty file is provided as input and the format was omitted. See the sniff documentation for more information.

Writing files from scikit-bio#

Just as when reading files, there are two ways to write files.

Procedural Interface:

skbio.io.write(my_obj, format='someformat', into=file)

Object-oriented Interface:

my_obj.write(file, format='someformat')

In the procedural interface, format is required. Without it, scikit-bio does not know how you want to serialize an object. Object-oriented interfaces define a default format, so it may not be necessary to include it.

For more information on writing to a specific file format, please see that format’s documentation page.

Streaming files with read and write#

If you are working with particularly large files, streaming them might be preferable. For instance, a file larger than your available memory cannot be read into memory all at once. The scikit-bio io module lets you build a streaming interface from the read and write functions.

When into is omitted, skbio.io.read returns a generator, which can then be passed to skbio.io.write so that only one chunk from the generator is written at a time.

seq_gen = skbio.io.read(big_file, format='someformat')
skbio.io.write(seq_gen, into=write_file, format='someformat')

Support for stdin#

You may stream files into scikit-bio through stdin. To do this, you must set the verify parameter of the read function to False. This bypasses scikit-bio’s sniffers, which is what allows piping to work. The cost is that scikit-bio no longer checks that the input matches the format you named, so you must be confident that you know the file format you are working with.

For example, if you wanted to pipe a FASTA file into a python script, your script could look like this.

import sys

import skbio

# verify=False skips sniffing, which is required when reading from a pipe.
for r in skbio.read(sys.stdin, format='fasta', verify=False):
    print(r.metadata['id'])

This would then enable you to do the following.

$ cat some_file.fna | python script.py
