The parser shootout 🔫

Bioinformatics data processing remains dominated by plain text formats. In this post, I contrast the performance of the popular biopython, cogent3, and scikit-bio packages for reading three sequence file formats and two genome annotation formats. Despite how simple these tasks might seem, you'll see there's a lot of variation in performance! The takeaway message is that cogent3 is nearly always faster for parsing these basic file formats, while biopython typically uses less RAM.

The goal

The goal of this post is to compare how well popular packages read standard file formats. I focus on how quickly and simply they can pull data off disk into Python primitives¹. For some of the packages I compare, the returned objects are more complex than simple primitives, which can reduce computational performance. This is not necessarily a drawback, because these richer objects can offer powerful interfaces for working with the data. However, in this post, I won’t discuss those capabilities but focus solely on raw performance.
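To make "Python primitives" concrete, here is a minimal sketch of a FASTA reader that yields plain (label, sequence) tuples of strings. It is illustrative only and is not how any of the three packages implement their parsers; the name iter_fasta is hypothetical.

```python
def iter_fasta(lines):
    """Yield (label, sequence) string tuples from an iterable of FASTA lines."""
    label, parts = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # a new header closes the previous record, if there was one
            if label is not None:
                yield label, "".join(parts)
            label, parts = line[1:], []
        else:
            parts.append(line)
    # emit the final record
    if label is not None:
        yield label, "".join(parts)


records = list(iter_fasta([">seq1 demo", "ACGT", "TT", "", ">seq2", "GG"]))
print(records)
```

Each record is just a pair of str objects, so nothing beyond the standard library is needed to work with the result.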

Before proceeding further, I note that these comparisons were conducted entirely using uncompressed files. Of the three packages compared here, only cogent3 appears to seamlessly support compressed input files. Specifically, if a file path ends with a recognised compression suffix (e.g., .gz), cogent3 automatically selects the appropriate decompressor and processes the file on the fly², without requiring any additional effort from the user.

What handling compression looks like

Without built-in support, you have to work out how to handle the compression yourself, probably based on the filename suffix, and then import the correct module. cogent3 does this for you automatically for gzip, bzip2, zip, xz, and lzma.

This:

import cogent3 as c3

seqs = list(c3.parse.fasta.iter_fasta_records("path/to/file.fa.gz"))

Instead of this:

import gzip
import cogent3 as c3

with gzip.open("path/to/file.fa.gz", "rt") as f:
    seqs = list(c3.parse.fasta.iter_fasta_records(f))

As downloaded genomic data files are often distributed in compressed form, the lack of equivalent support in the other two packages is a practical inconvenience. The comments below regarding code complexity do not take this issue into account.
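For packages without this convenience, the suffix-based dispatch described above might look like the following sketch. It uses only the standard library; the helper name open_text is hypothetical, and zip archives are left out because they need per-member handling.

```python
import bz2
import gzip
import lzma
from pathlib import Path

# map a filename suffix to the standard-library function that opens it
_OPENERS = {
    ".gz": gzip.open,
    ".bz2": bz2.open,
    ".xz": lzma.open,
    ".lzma": lzma.open,
}


def open_text(path):
    """Open a possibly compressed file in text mode, chosen by suffix."""
    opener = _OPENERS.get(Path(path).suffix.lower(), open)
    return opener(path, "rt")
```

The returned handle can then be passed to any parser that accepts an open file, for example SeqIO.parse(handle, "fasta") in biopython.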

In the figures below, smaller is better for both compute-time and RAM.

Methods and Datasets

The benchmark code lives in the c3-benchmarking repository.

I time each (task, tool) pair as a standalone process using hyperfine. The orchestrator invokes hyperfine --command-name <tool> '<shell-cmd>' once per tool, runs each command three times by default, and aggregates the mean and standard deviation of wall-clock time and peak resident memory into a per-task TSV. This means we include Python startup, imports, and all that fun stuff as part of the measurements.
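To illustrate what is being measured, here is a rough standard-library sketch of the same idea: run a command several times as a fresh process and summarise wall-clock time and peak resident memory. This is not hyperfine and not the benchmarking project's code; the function name benchmark and the example command are hypothetical, and it assumes a Unix-like system.

```python
import os
import statistics
import subprocess
import sys
import time


def benchmark(cmd, runs=3):
    """Run cmd (a list of arguments) `runs` times, each as a new process.

    Returns the mean and standard deviation of wall-clock seconds and of
    peak resident memory (ru_maxrss: KiB on Linux, bytes on macOS).
    Needs runs >= 2 so a standard deviation can be computed.
    """
    times, peaks = [], []
    for _ in range(runs):
        start = time.perf_counter()
        proc = subprocess.Popen(cmd)
        # wait4 reports resource usage for this specific child process
        _, status, usage = os.wait4(proc.pid, 0)
        times.append(time.perf_counter() - start)
        peaks.append(usage.ru_maxrss)
        # record the exit code so Popen does not try to wait again
        proc.returncode = os.waitstatus_to_exitcode(status)
        if proc.returncode != 0:
            raise RuntimeError(f"command failed with exit code {proc.returncode}")
    return {
        "mean_time": statistics.mean(times),
        "std_time": statistics.stdev(times),
        "mean_rss": statistics.mean(peaks),
        "std_rss": statistics.stdev(peaks),
    }


if __name__ == "__main__":
    # hypothetical command: measures Python startup only
    print(benchmark([sys.executable, "-c", "pass"]))
```

Because each run starts a new interpreter, startup and import costs are part of every measurement, just as in the benchmarks reported here.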

Dataset summary

These datasets can be obtained via the repo linked to above.

dataset	suffix	size	description
hsap_gbk	.dat	363.7 MB	Human chromosome 1 in GenBank format
hsap_gff3	.fa, .gff3	241.4 MB, 650.7 MB	Human genome annotations in GFF3 format plus Human chromosome 1 in FASTA format
micro_gbk	.gb	11.3 MB	The E. coli K12 genome in GenBank format
ptro_fa	.fa	3.1 GB	Chimpanzee genome

Tools summary

The abbreviations shown in the table are used in all the result tables and figures below. The code snippets used for each tool are also shown. These are extracted verbatim from the benchmarking project and are therefore wrapped in individual functions.

biopython is the oldest of these packages, dating back to 2002. cogent3 and scikit-bio both originated from PyCogent (published in 2007). cogent3 is the closest to the original PyCogent feature set while scikit-bio appears to be primarily focussed on microbes. I note here I was a founder of the PyCogent project and am the founder of cogent3.

abbreviation	package	version	docs
bp	biopython	1.87	biopython
c3, c3gffdb, c3gbdb	cogent3	2026.6.2a0	cogent3
sb	scikit-bio	0.7.2	scikit-bio
bp	bcbio-gff	0.7.1	bcbio-gff

Comparing sequence format parsers

FASTA formatted sequences

Code for bp, c3 and sb:

def bp(path):
    from Bio import SeqIO

    for seq in SeqIO.parse(path, "fasta"):
        pass

def c3(path):
    from cogent3.parse.fasta import iter_fasta_records

    for label, seq in iter_fasta_records(path):
        pass

def sb(path):
    from skbio.io import read

    for seq in read(path, format="fasta"):
        pass

Results table for parsing the Chimpanzee genome

Function	Result Type	mean(time) seconds	std(time) seconds	mean(RAM)	std(RAM)
bp	OK	6.42	0.02	2.1 GB	0.0e+00 GB
c3	OK	3.86	0.01	2.6 GB	7.0e-05 GB
sb	OK	25.17	0.16	2.5 GB	2.6e-05 GB

[Figure: compute time and RAM for ptro]

Results table for parsing a SARS-CoV-2 genomes file

Function	Result Type	mean(time) seconds	std(time) seconds	mean(RAM)	std(RAM)
bp	OK	4.88	0.01	54.1 MB	0.0e+00 MB
c3	OK	3.14	0.03	68.2 MB	1.8e-02 MB
sb	OK	13.97	0.12	159.2 MB	7.2e-02 MB

[Figure: compute time and RAM for sars]

FASTQ formatted sequences

Code for bp, c3 and sb:

def bp(path):
    from Bio import SeqIO

    for seq in SeqIO.parse(path, "fastq"):
        pass

def c3(path):
    from cogent3.core.alphabet import make_qual_converter
    from cogent3.parse.fastq import iter_fastq_records

    qual_converter = make_qual_converter("phred+33")
    for label, seq, qual in iter_fastq_records(path, qual_converter=qual_converter):
        pass

def sb(path):
    from skbio.io import read

    for seq in read(path, format="fastq", phred_offset=33):
        pass
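For readers unfamiliar with the "phred+33" setting used in the snippets above: each quality character encodes a score as its ASCII code minus 33. The sketch below shows that decoding in plain Python. It is not the cogent3 converter (which, per the footnote, produces numpy arrays); the function name decode_phred33 is hypothetical.

```python
def decode_phred33(qual):
    """Convert a FASTQ quality string to a list of integer Phred scores."""
    return [ord(char) - 33 for char in qual]


# 'I' is ASCII 73, '5' is 53, '+' is 43 and '!' is 33
print(decode_phred33("I5+!"))  # [40, 20, 10, 0]
```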

Results table for parsing the marine FASTQ reads

Function	Result Type	mean(time) seconds	std(time) seconds	mean(RAM)	std(RAM)
bp	OK	1.12	0.01	54.1 MB	0.0e+00 MB
c3	OK	0.81	0.00	137.9 MB	1.2e-01 MB
sb	OK	13.76	0.18	159.1 MB	0.0e+00 MB

[Figure: compute time and RAM for marine]

GenBank formatted sequences

Code for bp, c3 and sb:

def bp(path):
    from Bio import SeqIO

    for seq in SeqIO.parse(path, "genbank"):
        pass

def c3(path):
    from cogent3.parse.genbank import iter_genbank_records

    for label, seq, _ in iter_genbank_records(path, convert_features=False):
        pass

def sb(path):
    from skbio.io import read

    for seq in read(path, format="genbank"):
        pass

Results table for human chromosome 1

Function	Result Type	mean(time) seconds	std(time) seconds	mean(RAM)	std(RAM)
bp	OK	4.39	0.05	1.4 GB	4.3e-04 GB
c3	OK	0.95	0.02	1.7 GB	0.0e+00 GB
sb	Error	2.22	0.02	832.2 MB	0.0e+00 MB

sb failed to parse this file

[Figure: compute time and RAM for human chromosome 1]

Results table for micro (the E. coli K12 genome)

Function	Result Type	mean(time) seconds	std(time) seconds	mean(RAM)	std(RAM)
bp	OK	0.41	0.00	104.5 MB	0.0e+00 MB
c3	OK	0.31	0.00	150.6 MB	1.3e-01 MB
sb	OK	1.34	0.06	265.8 MB	1.1e+00 MB

[Figure: compute time and RAM for micro]

Summary of sequence format parsing performance

Regarding speed, cogent3 consistently ranked as the fastest parser for sequence formats, with biopython coming second. In terms of memory usage, biopython always used the least among the successful runs. cogent3 was typically second and scikit-bio the largest, the exception being the Chimpanzee genome, where cogent3 used slightly more than scikit-bio (2.6 GB versus 2.5 GB). At least part of cogent3's extra RAM relative to biopython is due to a default value employed by scinexus data streaming³. It is also worth noting that scikit-bio was unable to parse the GenBank file for Human chromosome 1, which is why it is not included in that particular plot.
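The trade-off behind that streaming default (see footnote 3) can be illustrated with a generic chunked reader: larger chunks mean fewer reads but more bytes held in memory at once. This is a sketch of the general technique only, not the scinexus implementation; the function name iter_lines_chunked is hypothetical, and the 5 MB default simply mirrors the figure quoted in the footnote.

```python
def iter_lines_chunked(path, chunk_size=5 * 1024 * 1024):
    """Yield lines (as bytes, without newlines) reading chunk_size bytes at a time."""
    leftover = b""
    with open(path, "rb") as infile:
        while chunk := infile.read(chunk_size):
            lines = (leftover + chunk).split(b"\n")
            # the last piece may be an incomplete line, so carry it forward
            leftover = lines.pop()
            yield from lines
    if leftover:
        yield leftover
```

Passing a smaller chunk_size lowers the amount of data buffered per read, at the cost of more read calls.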

Comparing annotation format parsers

GFF3 formatted annotations

Code for bp, c3, c3gffdb and sb:

def bp(path):
    from BCBio import GFF

    with open(path) as in_handle:
        for feature in GFF.parse(in_handle):
            pass

def c3(path):
    from cogent3.parse.gff import gff_parser

    for feature in gff_parser(path):
        pass

def c3gffdb(path):
    import cogent3

    cogent3.load_annotations(path=path)

def sb(path):
    from skbio.io import read

    for label, features in read(path, format="gff3"):
        pass

Results table for human chromosome 1

Function	Result Type	mean(time) seconds	std(time) seconds	mean(RAM)	std(RAM)
bp	OK	59.89	0.46	10.1 GB	0.0e+00 GB
c3	OK	5.40	0.02	48.5 MB	6.3e-02 MB
c3gffdb	OK	74.89	0.50	4.0 GB	1.1e-03 GB
sb	OK	47.38	0.23	1.8 GB	1.1e-03 GB

[Figure: compute time and RAM for human chromosome 1]

GenBank formatted annotations

Code for bp, c3, c3gbdb and sb:

def bp(path):
    from Bio import SeqIO

    for record in SeqIO.parse(path, "genbank"):
        features = list(record.features)

def c3(path):
    from cogent3.parse.genbank import iter_genbank_records

    for label, seq, features in iter_genbank_records(path):
        pass

def c3gbdb(path):
    import cogent3

    cogent3.load_annotations(path=path, format_name="genbank")

def sb(path):
    from skbio.io import read

    for seq in read(path, format="genbank"):
        features = list(seq.interval_metadata.query())

Results table for human chromosome 1

Function	Result Type	mean(time) seconds	std(time) seconds	mean(RAM)	std(RAM)
bp	OK	4.53	0.01	1.4 GB	0.0e+00 GB
c3	OK	5.24	0.06	2.3 GB	0.0e+00 GB
c3gbdb	OK	7.96	0.11	2.3 GB	8.6e-04 GB
sb	Error	2.34	0.03	831.1 MB	0.0e+00 MB

sb failed to parse this file

[Figure: compute time and RAM for human chromosome 1]

Results table for micro (the E. coli K12 genome)

Function	Result Type	mean(time) seconds	std(time) seconds	mean(RAM)	std(RAM)
bp	OK	0.47	0.00	104.3 MB	9.9e-02 MB
c3	OK	0.49	0.02	190.0 MB	0.0e+00 MB
c3gbdb	OK	0.69	0.02	192.6 MB	2.7e-01 MB
sb	OK	1.48	0.05	266.1 MB	0.0e+00 MB

[Figure: compute time and RAM for micro]

Summary of annotation format parsing performance

It is difficult to make fair comparisons here, since only cogent3 returns Python primitives. We therefore included cogent3's rich in-memory annotation databases⁴ in the comparisons.

cogent3's raw parser was by far the fastest for GFF3⁵ (5.40 s, against 47.38 s for scikit-bio and 59.89 s for biopython). The GFF3 story was otherwise interesting, with c3gffdb the slowest, followed by biopython, which also used substantially more RAM than the other parsers (10.1 GB). biopython's relative performance was better on extracting annotation data from GenBank files: it had the lowest RAM among the successful runs and was faster than cogent3's raw parser on both files (4.53 s versus 5.24 s for human chromosome 1, and marginally so, 0.47 s versus 0.49 s, for the microbial genome). cogent3's c3gbdb was faster and required less RAM than scikit-bio on the microbial GenBank file.

Conclusions

For these simple tasks, the code complexity of the different packages is comparable, with one exception⁶. Computational performance, however, was not.

In general, cogent3 showed the fastest run times but also used more RAM³ than biopython, which was usually second fastest. cogent3 was also, as far as I can tell, the only tool capable of returning Python primitives. The value of that to you, the user, will vary based on your problem. But as the GenBank parsing cases show, being able to ignore annotations if you want sequences, or sequences if you want annotations, enables much faster delivery of the content that matters. Unfortunately, scikit-bio was slower in nearly every case and typically required more RAM than the other packages.

Interpreting comparisons of annotation-data parsing requires an important qualification. Genome annotations can be viewed as the central product of genomics because they represent the current state of knowledge about the information encoded in a genome. Support for annotation data is therefore a fundamental capability of any genomics software package.

Annotation formats often encode complex relationships that span multiple rows within a file. Differences in how packages represent and reconstruct this information are likely the main source of the substantial performance differences observed here. For example, cogent3's basic parsers are designed to retrieve content as quickly as possible, without combining information across rows. The richer cogent3 objects⁴ are constructed from these parser results, and their performance reflects the added complexity of those algorithms.
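As a rough illustration of the difference between row-level parsing and cross-row reconstruction, the sketch below first turns each GFF3 row into a dict on its own, then links rows through the Parent attribute in a separate step. It relies only on the nine tab-separated columns defined by the GFF3 format and does not reflect the internals of any package compared here; the function names are hypothetical.

```python
from collections import defaultdict

GFF3_COLUMNS = ("seqid", "source", "type", "start", "end", "score", "strand", "phase")


def parse_gff3_line(line):
    """Parse one GFF3 row into a dict, with no reference to other rows."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(GFF3_COLUMNS, fields[:8]))
    record["start"], record["end"] = int(record["start"]), int(record["end"])
    # column 9 holds semicolon-separated key=value pairs
    record["attributes"] = dict(
        item.split("=", 1) for item in fields[8].split(";") if "=" in item
    )
    return record


def group_by_parent(records):
    """Map each parent ID to the IDs of its children (the cross-row step)."""
    children = defaultdict(list)
    for rec in records:
        for parent in rec["attributes"].get("Parent", "").split(","):
            if parent:
                children[parent].append(rec["attributes"].get("ID"))
    return dict(children)


rows = [
    "chr1\tdemo\tgene\t1\t100\t.\t+\t.\tID=gene1",
    "chr1\tdemo\tmRNA\t1\t100\t.\t+\t.\tID=tx1;Parent=gene1",
    "chr1\tdemo\texon\t1\t50\t.\t+\t0\tID=ex1;Parent=tx1",
]
records = [parse_gff3_line(row) for row in rows]
print(group_by_parent(records))  # {'gene1': ['tx1'], 'tx1': ['ex1']}
```

The first function can stream rows one at a time, whereas the second needs every row in memory before the hierarchy is complete, which hints at why richer objects cost more time and RAM.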

Ultimately, the most meaningful measure of a package's annotation-handling capability is how effectively its tools support downstream queries. That will be the focus of a future post.

¹ By Python primitive, I mean basic Python types such as strings, lists, dicts, or tuples.
² This feature is provided by scinexus.
³ At least part of the reason cogent3 uses more RAM is the default 5 MB chunk size employed by scinexus streaming parsers. Reducing that chunk size reduces peak RAM at the cost of a small increase in compute time.
⁴ I included the c3gbdb and c3gffdb examples as they are the richest objects cogent3 delivers from parsing annotations. We will explore the objects returned by the sequence annotation parsing in a future post.
⁵ This GFF3 file covered the entire human genome and thus included all human genes, transcripts and exons.
⁶ cogent3 required two imports and an explicit creation of a converter for transforming raw FASTQ quality scores into numpy arrays. Strictly speaking, that's not required, but it does deliver read quality as a comparable type to the other tools.
