
The parser shootout 🔫

Bioinformatics data processing remains dominated by plain text formats. In this post, I contrast the performance of the popular biopython, cogent3, and scikit-bio packages for reading three sequence file formats and two genome annotation formats. Despite how simple these tasks might seem, you'll see there's a lot of variation in performance! The takeaway message is that cogent3 is nearly always faster for parsing these basic file formats, while biopython typically uses less RAM.

The goal

The goal of this post is to compare how well popular packages read standard file formats. I focus on how quickly and simply they can pull data off disk into Python primitives[1]. For some of the packages I compare, the returned objects are more complex than simple primitives, which can reduce computational performance. This is not necessarily a drawback, because these richer objects can offer powerful interfaces for working with the data. However, in this post, I won’t discuss those capabilities but focus solely on raw performance.
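To make "Python primitives" concrete, here is a minimal sketch of a FASTA reader that yields nothing but tuples of strings. The function name iter_fasta is hypothetical and this is not how any of the three packages implement their parsers; it only illustrates the kind of return value I mean.

```python
from collections.abc import Iterable, Iterator


def iter_fasta(lines: Iterable[str]) -> Iterator[tuple[str, str]]:
    """Yield (label, sequence) pairs as plain Python strings."""
    label = None
    chunks: list[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if label is not None:
                yield label, "".join(chunks)
            label, chunks = line[1:], []
        elif label is not None:
            chunks.append(line)
    if label is not None:
        yield label, "".join(chunks)


# Illustrative in-memory input; a real file handle works the same way.
records = list(iter_fasta([">seq1 demo", "ACGT", "TTGA", ">seq2", "GGCC"]))
```

Each record here is an ordinary tuple of two str objects, with no package-specific sequence class wrapped around it.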

Before proceeding further, I note that these comparisons were conducted entirely using uncompressed files. Of the three packages compared here, only cogent3 appears to seamlessly support compressed input files. Specifically, if a file path ends with a recognised compression suffix (e.g., .gz), cogent3 automatically selects the appropriate decompressor and processes the file on the fly[2], without requiring any additional effort from the user.

What handling compression looks like

Without built-in support, you have to work out how the file is compressed, probably from the filename suffix, and then import the matching module. cogent3 does this for you automatically for gzip, bzip2, zip, xz, and lzma.

Letting cogent3 detect the compression:

import cogent3 as c3

seqs = list(c3.parse.fasta.iter_fasta_records("path/to/file.fa.gz"))

Handling the compression yourself:

import gzip
import cogent3 as c3

with gzip.open("path/to/file.fa.gz", "rt") as f:
    seqs = list(c3.parse.fasta.iter_fasta_records(f))

As downloaded genomic data files are often distributed in compressed form, the lack of equivalent support in the other two packages is a practical inconvenience. The comments below regarding code complexity do not take this issue into account.

In the figures below, smaller is better for both compute-time and RAM.

Methods and Datasets

The benchmark code lives in the c3-benchmarking repository.

I time each (task, tool) pair as a standalone process using hyperfine. The orchestrator invokes hyperfine --command-name <tool> '<shell-cmd>' once per tool, runs each command three times by default, and aggregates the mean and standard deviation of wall-clock time and peak resident memory into a per-task TSV. This means we include Python startup, imports, and all that fun stuff as part of the measurements.
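The following is a rough sketch of that workflow, not the actual orchestrator from the repository. The function names, the JSON file name, and the numbers in the sample output are all made up for illustration; the hyperfine flags used (--runs, --command-name, --export-json) are standard ones. Peak memory aggregation is left out.

```python
import json


def hyperfine_command(tool, shell_cmd, runs=3, json_path="result.json"):
    """Build the argument list for one hyperfine invocation (hypothetical helper)."""
    return [
        "hyperfine",
        "--runs", str(runs),
        "--command-name", tool,
        "--export-json", json_path,
        shell_cmd,
    ]


def summarise(json_text):
    """Return {command name: (mean seconds, stddev seconds)} from hyperfine JSON."""
    results = json.loads(json_text)["results"]
    return {r["command"]: (r["mean"], r["stddev"]) for r in results}


cmd = hyperfine_command("c3", "python bench.py c3 data.fa")

# Made-up output in the shape hyperfine exports, NOT a real measurement.
sample = '{"results": [{"command": "c3", "mean": 1.5, "stddev": 0.1}]}'
summary = summarise(sample)
```

In a real run, the list in cmd would be handed to subprocess.run(cmd, check=True) and the exported JSON file read back in before calling summarise.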

Dataset summary

These datasets can be obtained via the repo linked to above.

| dataset | suffix | size | description |
|---|---|---|---|
| hsap_gbk | .dat | 363.7 MB | Human chromosome 1 in GenBank format |
| hsap_gff3 | .gff3 | 650.7 MB | Human genome annotations in GFF3 format |
| micro_gbk | .gb | 11.3 MB | The E. coli K12 genome in GenBank format |
| ptro_fa | .fa | 3.1 GB | Chimpanzee genome |

Tools summary

The abbreviations shown in the table are used in all the result tables and figures below. We also include code snippets used for each tool. These are extracted verbatim from the benchmarking project and hence contained within individual functions.

biopython is the oldest of these packages, dating back to 2002. cogent3 and scikit-bio both originated from PyCogent (published in 2007). cogent3 is the closest to the original PyCogent feature set while scikit-bio appears to be primarily focussed on microbes. I note here I was a founder of the PyCogent project and am the founder of cogent3.

| abbreviation | package | version | docs |
|---|---|---|---|
| bp | biopython | 1.87 | biopython |
| c3, c3gffdb, c3gbdb | cogent3 | 2026.5.25a0 | cogent3 |
| sb | scikit-bio | 0.7.2 | scikit-bio |
| bp | bcbio-gff | 0.7.1 | bcbio-gff |

Comparing sequence format parsers

FASTA formatted sequences

# bp: biopython
def bp(path):
    from Bio import SeqIO

    for seq in SeqIO.parse(path, "fasta"):
        pass


# c3: cogent3
def c3(path):
    from cogent3.parse.fasta import iter_fasta_records

    for label, seq in iter_fasta_records(path):
        pass


# sb: scikit-bio
def sb(path):
    from skbio.io import read

    for seq in read(path, format="fasta"):
        pass

Results table for parsing the Chimpanzee genome

| Function | Result Type | mean(time) seconds | std(time) seconds | mean(RAM) | std(RAM) |
|---|---|---|---|---|---|
| bp | OK | 6.60 | 0.08 | 2.1 GB | 0.0e+00 GB |
| c3 | OK | 4.17 | 0.05 | 8.3 GB | 0.0e+00 GB |
| sb | OK | 28.71 | 0.13 | 2.6 GB | 8.8e-06 GB |

[Figure: ptro]

Results table for parsing a SARS-CoV-2 genomes file

| Function | Result Type | mean(time) seconds | std(time) seconds | mean(RAM) | std(RAM) |
|---|---|---|---|---|---|
| bp | OK | 5.29 | 0.02 | 55.5 MB | 9.0e-03 MB |
| c3 | OK | 3.63 | 0.05 | 5.0 GB | 0.0e+00 GB |
| sb | OK | 16.12 | 0.04 | 164.8 MB | 1.4e-01 MB |

[Figure: sars]

FASTQ formatted sequences

# bp: biopython
def bp(path):
    from Bio import SeqIO

    for seq in SeqIO.parse(path, "fastq"):
        pass


# c3: cogent3, with an explicit converter for the quality scores
def c3(path):
    from cogent3.core.alphabet import make_qual_converter
    from cogent3.parse.fastq import iter_fastq_records

    qual_converter = make_qual_converter("phred+33")
    for label, seq, qual in iter_fastq_records(path, qual_converter=qual_converter):
        pass


# sb: scikit-bio
def sb(path):
    from skbio.io import read

    for seq in read(path, format="fastq", phred_offset=33):
        pass

Results table for parsing the marine fastq reads

| Function | Result Type | mean(time) seconds | std(time) seconds | mean(RAM) | std(RAM) |
|---|---|---|---|---|---|
| bp | OK | 1.29 | 0.01 | 55.6 MB | 0.0e+00 MB |
| c3 | OK | 0.93 | 0.01 | 94.6 MB | 4.5e-02 MB |
| sb | OK | 14.50 | 0.15 | 165.0 MB | 0.0e+00 MB |

[Figure: marine]
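For readers unfamiliar with the "phred+33" and phred_offset=33 settings used in the FASTQ snippets: each character in a FASTQ quality string encodes a score as its ASCII code minus 33. The sketch below shows that arithmetic in plain Python. The function name is hypothetical and this is not the converter any of the packages use.

```python
def phred33_scores(qual: str) -> list[int]:
    """Convert a phred+33 encoded quality string to integer scores."""
    return [ord(char) - 33 for char in qual]


# "!" has ASCII code 33, "5" has 53, and "I" has 73.
scores = phred33_scores("!5I")
```

Here scores is [0, 20, 40], a plain list of integers.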

GenBank formatted sequences

# bp: biopython
def bp(path):
    from Bio import SeqIO

    for seq in SeqIO.parse(path, "genbank"):
        pass


# c3: cogent3, skipping feature conversion
def c3(path):
    from cogent3.parse.genbank import iter_genbank_records

    for label, seq, _ in iter_genbank_records(path, convert_features=False):
        pass


# sb: scikit-bio
def sb(path):
    from skbio.io import read

    for seq in read(path, format="genbank"):
        pass

Results table for human chromosome 1

| Function | Result Type | mean(time) seconds | std(time) seconds | mean(RAM) | std(RAM) |
|---|---|---|---|---|---|
| bp | OK | 4.34 | 0.00 | 1.4 GB | 0.0e+00 GB |
| c3 | OK | 1.02 | 0.00 | 1.7 GB | 0.0e+00 GB |
| sb | Error | 2.54 | 0.01 | 837.0 MB | 6.9e-01 MB |

sb failed to parse this file

[Figure: human chromosome 1]

Results table for micro

| Function | Result Type | mean(time) seconds | std(time) seconds | mean(RAM) | std(RAM) |
|---|---|---|---|---|---|
| bp | OK | 0.48 | 0.01 | 105.0 MB | 1.9e-01 MB |
| c3 | OK | 0.37 | 0.01 | 148.4 MB | 0.0e+00 MB |
| sb | OK | 1.54 | 0.02 | 269.8 MB | 1.8e-01 MB |

[Figure: micro]

Summary of sequence format parsing performance

Regarding speed, cogent3 consistently ranked as the fastest parser for sequence formats, with biopython coming second. In terms of memory usage, biopython always used the least among the parsers that completed. cogent3 used by far the most on the two FASTA files, while scikit-bio used the most on the FASTQ and microbial GenBank files. cogent3's higher memory use is at least partly due to a default chunk size employed by scinexus data streaming[3]. It is also worth noting that scikit-bio was unable to parse the GenBank file for human chromosome 1, which is why it is not included in that particular plot.

Comparing annotation format parsers

GFF3 formatted annotations

# bp: bcbio-gff
def bp(path):
    from BCBio import GFF

    with open(path) as in_handle:
        for feature in GFF.parse(in_handle):
            pass


# c3: cogent3 raw parser
def c3(path):
    from cogent3.parse.gff import gff_parser

    for feature in gff_parser(path):
        pass


# c3gffdb: cogent3 annotation database
def c3gffdb(path):
    import cogent3

    cogent3.load_annotations(path=path)


# sb: scikit-bio
def sb(path):
    from skbio.io import read

    for label, features in read(path, format="gff3"):
        pass

Results table for the human genome annotations

| Function | Result Type | mean(time) seconds | std(time) seconds | mean(RAM) | std(RAM) |
|---|---|---|---|---|---|
| bp | OK | 48.07 | 0.13 | 10.1 GB | 6.8e-04 GB |
| c3 | OK | 5.78 | 0.07 | 50.1 MB | 0.0e+00 MB |
| c3gffdb | OK | 73.62 | 0.52 | 4.0 GB | 0.0e+00 GB |
| sb | OK | 36.15 | 0.09 | 3.5 GB | 0.0e+00 GB |

[Figure: human chromosome 1]

GenBank formatted annotations

# bp: biopython
def bp(path):
    from Bio import SeqIO

    for record in SeqIO.parse(path, "genbank"):
        features = list(record.features)


# c3: cogent3 raw parser
def c3(path):
    from cogent3.parse.genbank import iter_genbank_records

    for label, seq, features in iter_genbank_records(path):
        pass


# c3gbdb: cogent3 annotation database
def c3gbdb(path):
    import cogent3

    cogent3.load_annotations(path=path, format_name="genbank")


# sb: scikit-bio
def sb(path):
    from skbio.io import read

    for seq in read(path, format="genbank"):
        features = list(seq.interval_metadata.query())

Results table for human chromosome 1

| Function | Result Type | mean(time) seconds | std(time) seconds | mean(RAM) | std(RAM) |
|---|---|---|---|---|---|
| bp | OK | 4.41 | 0.01 | 1.4 GB | 5.0e-04 GB |
| c3 | OK | 4.11 | 0.02 | 2.2 GB | 0.0e+00 GB |
| c3gbdb | Error | 0.37 | 0.01 | 95.2 MB | 0.0e+00 MB |
| sb | Error | 2.64 | 0.02 | 837.3 MB | 1.1e+00 MB |

sb failed to parse this file

[Figure: human chromosome 1]

Results table for micro

| Function | Result Type | mean(time) seconds | std(time) seconds | mean(RAM) | std(RAM) |
|---|---|---|---|---|---|
| bp | OK | 0.49 | 0.00 | 105.3 MB | 0.0e+00 MB |
| c3 | OK | 0.57 | 0.06 | 181.9 MB | 0.0e+00 MB |
| c3gbdb | Error | 0.37 | 0.01 | 94.9 MB | 0.0e+00 MB |
| sb | OK | 1.57 | 0.02 | 271.7 MB | 0.0e+00 MB |

[Figure: micro]

Summary of annotation format parsing performance

It is difficult to make fair comparisons here, since only cogent3 returns Python primitives. We therefore included cogent3's rich in-memory annotation databases[4] in the comparisons.

cogent3's raw parser was the fastest, except for the microbial genome GenBank file, where biopython was marginally faster. The GFF3[5] parsing story was interesting, with c3gffdb the slowest, followed by biopython. biopython (via bcbio-gff) also used substantially more RAM than the other GFF3 parsers. biopython's relative performance was better when extracting annotation data from GenBank files, with the lowest RAM and a marginally faster time in one case. cogent3's c3gbdb was faster and required less RAM than scikit-bio on the microbial GenBank file, although the tables record the c3gbdb runs as errors, so those figures may not reflect a completed parse.

Conclusions

For these simple tasks, the code complexity of the different packages is comparable, with one exception[6]. Computational performance, however, was not.

In general, cogent3 showed the fastest run times but also used more RAM[3] than biopython, which was consistently second fastest. cogent3 was also, as far as I can tell, the only tool capable of returning Python primitives. The value of that to you, the user, will vary based on your problem. But as the GenBank parsing cases show, being able to ignore annotations if you want sequences, or sequences if you want annotations, enables much faster delivery of the content that matters. Unfortunately, scikit-bio was the slowest package in every comparison except GFF3, and it often required more RAM than the other packages.

Interpreting comparisons of annotation-data parsing requires an important qualification. Genome annotations can be viewed as the central product of genomics because they represent the current state of knowledge about the information encoded in a genome. Support for annotation data is therefore a fundamental capability of any genomics software package.

Annotation formats often encode complex relationships that span multiple rows within a file. Differences in how packages represent and reconstruct this information are likely the main source of the substantial performance differences observed here. For example, cogent3's basic parsers are designed to retrieve content as quickly as possible, without combining information across rows. The richer cogent3 objects[4] are constructed from these parser results, and their performance reflects the added complexity of those algorithms.
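To illustrate the difference between row-by-row parsing and reconstructing relationships, here is a minimal sketch for GFF3, where child features point at their parents through a Parent attribute. The function names and the four demo rows are invented for illustration, and this is not the algorithm used by cogent3, bcbio-gff, or scikit-bio.

```python
from collections import defaultdict

GFF3_COLUMNS = ("seqid", "source", "type", "start", "end", "score", "strand", "phase")


def parse_gff3_line(line):
    """Turn one GFF3 feature row into a dict, without looking at other rows."""
    *fields, attrs = line.rstrip("\n").split("\t")
    record = dict(zip(GFF3_COLUMNS, fields))
    record["start"], record["end"] = int(record["start"]), int(record["end"])
    record["attributes"] = dict(
        pair.split("=", 1) for pair in attrs.split(";") if pair
    )
    return record


def children_by_parent(records):
    """Second pass: link rows to each other through their Parent attribute."""
    index = defaultdict(list)
    for rec in records:
        parent = rec["attributes"].get("Parent")
        if parent is not None:
            for parent_id in parent.split(","):
                index[parent_id].append(rec["attributes"].get("ID"))
    return dict(index)


# Invented rows: one gene, one transcript, two exons.
lines = [
    "chr1\tdemo\tgene\t1\t100\t.\t+\t.\tID=gene1",
    "chr1\tdemo\tmRNA\t1\t100\t.\t+\t.\tID=tx1;Parent=gene1",
    "chr1\tdemo\texon\t1\t50\t.\t+\t.\tID=ex1;Parent=tx1",
    "chr1\tdemo\texon\t60\t100\t.\t+\t.\tID=ex2;Parent=tx1",
]
rows = [parse_gff3_line(line) for line in lines]
tree = children_by_parent(rows)
```

The first function does a fixed amount of work per row and can stream. The second has to hold every row and build an index, which is the sort of extra work that richer annotation objects take on.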

Ultimately, the most meaningful measure of a package's annotation-handling capability is how effectively its tools support downstream queries. That will be the focus of a future post.


  1. By Python primitive, I mean basic Python types such as strings, lists, dicts, or tuples. 

  2. This feature is provided by scinexus.

  3. At least part of cogent3's higher RAM use is due to a default 5 MB chunk size employed by scinexus streaming parsers. Reducing that chunk size reduces peak RAM at the cost of a small increase in compute time.

  4. I included the c3gbdb and c3gffdb examples as they are the richest objects cogent3 delivers from parsing annotations. We will explore the objects returned by the sequence annotation parsing in a future post. 

  5. This GFF3 file covered the entire human genome and thus included all human genes, transcripts and exons. 

  6. cogent3 required two imports and an explicit creation of a converter for transforming raw fastq quality scores into numpy arrays. Strictly speaking, that's not required but it does deliver read quality as a comparable type to the other tools. 
