The parser shootout
Bioinformatics data processing remains dominated by plain text formats. In this post, I contrast the performance of the popular biopython, cogent3, and scikit-bio packages for reading three sequence file formats and two genome annotation formats. Despite how simple these tasks might seem, you'll see there's a lot of variation in performance! The takeaway message is that cogent3 is nearly always faster for parsing these basic file formats, while biopython typically uses less RAM.
The goal
The goal of this post is to compare how well popular packages read standard file formats. I focus on how quickly and simply they can pull data off disk into Python primitives[^1]. For some of the packages I compare, the returned objects are more complex than simple primitives, which can reduce computational performance. This is not necessarily a drawback, because these richer objects can offer powerful interfaces for working with the data. However, in this post, I won't discuss those capabilities but focus solely on raw performance.
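To make "Python primitives" concrete, here is a minimal pure-Python sketch of a FASTA reader that yields plain `(label, sequence)` tuples of strings. The function name `iter_fasta` is my own illustration; this is not how any of the three packages implement their parsers, and it makes no attempt to be fast.

```python
def iter_fasta(path):
    """Yield (label, sequence) tuples of plain strings from a FASTA file."""
    label, chunks = None, []
    with open(path) as infile:
        for line in infile:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            if line.startswith(">"):
                # a new record starts, so emit the previous one
                if label is not None:
                    yield label, "".join(chunks)
                label, chunks = line[1:], []
            elif label is not None:
                chunks.append(line)
    # emit the final record, if the file contained any
    if label is not None:
        yield label, "".join(chunks)
```

Everything returned is a `str` inside a `tuple`, so there is no object construction cost beyond the parsing itself.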
Before proceeding further, I note that these comparisons were conducted entirely using uncompressed files. Of the three packages compared here, only cogent3 appears to seamlessly support compressed input files. Specifically, if a file path ends with a recognised compression suffix (e.g., .gz), cogent3 automatically selects the appropriate decompressor and processes the file on the fly[^2], without requiring any additional effort from the user.
What handling compression looks like
With the other packages, you have to work out how to handle the compression yourself, probably based on the filename suffix, and then import the correct module. cogent3 does this for you automatically for gzip, bzip2, zip, xz, and lzma.
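A minimal sketch of that manual handling, using only the standard library. The helper name `open_text` and the suffix map are my own illustration and not part of any of the three packages. Zip archives are left out because they need `zipfile` and a member name.

```python
import bz2
import gzip
import lzma
from pathlib import Path

# Map a filename suffix to the function that opens it (illustrative only).
_OPENERS = {
    ".gz": gzip.open,
    ".bz2": bz2.open,
    ".xz": lzma.open,
    ".lzma": lzma.open,
}


def open_text(path):
    """Open a possibly compressed file in text mode, chosen by its suffix."""
    opener = _OPENERS.get(Path(path).suffix.lower(), open)
    return opener(path, mode="rt")
```

The returned handle can then be passed to whichever parser you are using, provided that parser accepts an open file object.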
As downloaded genomic data files are often distributed in compressed form, the lack of equivalent support in the other two packages is a practical inconvenience. The comments below regarding code complexity do not take this issue into account.
In the figures below, smaller is better for both compute-time and RAM.
Methods and Datasets
The benchmark code lives in the c3-benchmarking repository.
I time each (task, tool) pair as a standalone process using hyperfine. The orchestrator invokes hyperfine --command-name <tool> '<shell-cmd>' once per tool, runs each command three times by default, and aggregates the mean and standard deviation of wall-clock time and peak resident memory into a per-task TSV. This means we include Python startup, imports, and all that fun stuff as part of the measurements.
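To illustrate what "standalone process" timing captures, here is a small standard-library stand-in. It is not hyperfine and not the orchestrator from the repository. The function name `time_command` is hypothetical, the use of the sample standard deviation is my assumption, and peak memory is not measured in this sketch.

```python
import statistics
import subprocess
import sys
import time


def time_command(cmd, runs=3):
    """Run cmd as a fresh process `runs` times and summarise wall-clock time."""
    elapsed = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True, capture_output=True)
        elapsed.append(time.perf_counter() - start)
    return {
        "mean": statistics.mean(elapsed),
        # sample standard deviation needs at least two measurements
        "std": statistics.stdev(elapsed) if runs > 1 else 0.0,
    }


if __name__ == "__main__":
    # Every run pays for interpreter startup and the import, as in the benchmark.
    result = time_command([sys.executable, "-c", "import json"])
    print(f"{result['mean']:.3f} s +/- {result['std']:.3f} s")
```

Because each run launches a new interpreter, fixed costs such as imports are part of every measurement, which matters most for the smaller files.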
Dataset summary
These datasets can be obtained via the repo linked to above.
| dataset | suffix | size | description |
|---|---|---|---|
| hsap_gbk | .dat | 363.7 MB | Human chromosome 1 in GenBank format |
| hsap_gff3 | .gff3 | 650.7 MB | Human genome annotations in GFF3 format |
| micro_gbk | .gb | 11.3 MB | The E. coli K12 genome in GenBank format |
| ptro_fa | .fa | 3.1 GB | Chimpanzee genome |
Tools summary
The abbreviations shown in the table are used in all the result tables and figures below. We also include code snippets used for each tool. These are extracted verbatim from the benchmarking project and hence contained within individual functions.
biopython is the oldest of these packages, dating back to 2002. cogent3 and scikit-bio both originated from PyCogent (published in 2007). cogent3 is the closest to the original PyCogent feature set while scikit-bio appears to be primarily focussed on microbes. I note here I was a founder of the PyCogent project and am the founder of cogent3.
| abbreviation | package | version | docs |
|---|---|---|---|
| bp | biopython | 1.87 | biopython |
| c3, c3gffdb, c3gbdb | cogent3 | 2026.5.25a0 | cogent3 |
| sb | scikit-bio | 0.7.2 | scikit-bio |
| bp | bcbio-gff | 0.7.1 | bcbio-gff |
Comparing sequence format parsers
FASTA formatted sequences
Results table for parsing the Chimpanzee genome
| Function | Result Type | mean(time) seconds | std(time) seconds | mean(RAM) | std(RAM) |
|---|---|---|---|---|---|
| bp | OK | 6.60 | 0.08 | 2.1 GB | 0.0e+00 GB |
| c3 | OK | 4.17 | 0.05 | 8.3 GB | 0.0e+00 GB |
| sb | OK | 28.71 | 0.13 | 2.6 GB | 8.8e-06 GB |
Results table for parsing a SARS-CoV-2 genomes file
| Function | Result Type | mean(time) seconds | std(time) seconds | mean(RAM) | std(RAM) |
|---|---|---|---|---|---|
| bp | OK | 5.29 | 0.02 | 55.5 MB | 9.0e-03 MB |
| c3 | OK | 3.63 | 0.05 | 5.0 GB | 0.0e+00 GB |
| sb | OK | 16.12 | 0.04 | 164.8 MB | 1.4e-01 MB |
FASTQ formatted sequences
Results table for parsing the marine fastq reads
| Function | Result Type | mean(time) seconds | std(time) seconds | mean(RAM) | std(RAM) |
|---|---|---|---|---|---|
| bp | OK | 1.29 | 0.01 | 55.6 MB | 0.0e+00 MB |
| c3 | OK | 0.93 | 0.01 | 94.6 MB | 4.5e-02 MB |
| sb | OK | 14.50 | 0.15 | 165.0 MB | 0.0e+00 MB |
GenBank formatted sequences
Results table for human chromosome 1
| Function | Result Type | mean(time) seconds | std(time) seconds | mean(RAM) | std(RAM) |
|---|---|---|---|---|---|
| bp | OK | 4.34 | 0.00 | 1.4 GB | 0.0e+00 GB |
| c3 | OK | 1.02 | 0.00 | 1.7 GB | 0.0e+00 GB |
| sb | Error | 2.54 | 0.01 | 837.0 MB | 6.9e-01 MB |
sb failed to parse this file
Results table for micro (the E. coli K12 genome)
| Function | Result Type | mean(time) seconds | std(time) seconds | mean(RAM) | std(RAM) |
|---|---|---|---|---|---|
| bp | OK | 0.48 | 0.01 | 105.0 MB | 1.9e-01 MB |
| c3 | OK | 0.37 | 0.01 | 148.4 MB | 0.0e+00 MB |
| sb | OK | 1.54 | 0.02 | 269.8 MB | 1.8e-01 MB |
Summary of sequence format parsing performance
Regarding speed, cogent3 consistently ranked as the fastest parser for sequence formats, with biopython coming second. In terms of memory usage, biopython always used the least. cogent3 used the most on the two FASTA files, while scikit-bio used the most on the FASTQ and microbial GenBank files. The high memory use by cogent3 is at least partly due to a default value employed by scinexus data streaming[^3]. It is also worth noting that scikit-bio was unable to parse the GenBank file for human chromosome 1, which is why it is not included in that particular plot.
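The default value in question is a streaming chunk size. The sketch below shows the general idea of chunked reading and why the chunk size trades memory against speed. It assumes nothing about how scinexus is implemented; the function name `iter_lines_chunked` is hypothetical and the 5 MB default is used here only to mirror the figure quoted in the footnote.

```python
def iter_lines_chunked(path, chunk_size=5 * 1024 * 1024):
    """Yield lines from a text file, reading chunk_size bytes at a time."""
    leftover = b""
    with open(path, "rb") as infile:
        while chunk := infile.read(chunk_size):
            lines = (leftover + chunk).split(b"\n")
            leftover = lines.pop()  # last piece may be an incomplete line
            for line in lines:
                yield line.decode("utf8")
    # a file without a trailing newline leaves one final line behind
    if leftover:
        yield leftover.decode("utf8")
```

A smaller `chunk_size` keeps less data in memory at once but needs more read calls, which is the trade-off described for the streaming parsers.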
Comparing annotation format parsers
GFF3 formatted annotations
Results table for the human genome
| Function | Result Type | mean(time) seconds | std(time) seconds | mean(RAM) | std(RAM) |
|---|---|---|---|---|---|
| bp | OK | 48.07 | 0.13 | 10.1 GB | 6.8e-04 GB |
| c3 | OK | 5.78 | 0.07 | 50.1 MB | 0.0e+00 MB |
| c3gffdb | OK | 73.62 | 0.52 | 4.0 GB | 0.0e+00 GB |
| sb | OK | 36.15 | 0.09 | 3.5 GB | 0.0e+00 GB |
GenBank formatted annotations
Results table for human chromosome 1
| Function | Result Type | mean(time) seconds | std(time) seconds | mean(RAM) | std(RAM) |
|---|---|---|---|---|---|
| bp | OK | 4.41 | 0.01 | 1.4 GB | 5.0e-04 GB |
| c3 | OK | 4.11 | 0.02 | 2.2 GB | 0.0e+00 GB |
| c3gbdb | Error | 0.37 | 0.01 | 95.2 MB | 0.0e+00 MB |
| sb | Error | 2.64 | 0.02 | 837.3 MB | 1.1e+00 MB |
sb failed to parse this file
Results table for micro (the E. coli K12 genome)
| Function | Result Type | mean(time) seconds | std(time) seconds | mean(RAM) | std(RAM) |
|---|---|---|---|---|---|
| bp | OK | 0.49 | 0.00 | 105.3 MB | 0.0e+00 MB |
| c3 | OK | 0.57 | 0.06 | 181.9 MB | 0.0e+00 MB |
| c3gbdb | Error | 0.37 | 0.01 | 94.9 MB | 0.0e+00 MB |
| sb | OK | 1.57 | 0.02 | 271.7 MB | 0.0e+00 MB |
Summary of annotation format parsing performance
It is difficult to make fair comparisons here, since only cogent3 returns Python primitives. We therefore included cogent3's rich in-memory annotation databases[^4] in the comparisons.
cogent3's raw parser was the fastest, except for the microbial genome GenBank file, where biopython was marginally faster. The GFF3[^5] parsing story was interesting, with c3gffdb the slowest, followed by biopython. biopython also used substantially more RAM on this file than the other tools. biopython's relative performance was better on extracting annotation data from GenBank files, with the lowest RAM and a marginally faster speed for one case. cogent3's c3gbdb was faster and required less RAM than scikit-bio on the microbial GenBank file.
Conclusions
For these simple tasks, the code complexity of the different packages is comparable, with one exception[^6]. Computational performance, however, was not.
In general, cogent3 showed the fastest run times but also used more RAM[^3] than biopython, which was usually second fastest. cogent3 was also, as far as I can tell, the only tool capable of returning Python primitives. The value of that to you, the user, will vary based on your problem. But as the GenBank parsing cases show, being able to ignore annotations if you want sequences, or sequences if you want annotations, enables much faster delivery of the content that matters. Unfortunately, scikit-bio was usually slower and typically required more RAM than the other packages.
Interpreting comparisons of annotation-data parsing requires an important qualification. Genome annotations can be viewed as the central product of genomics because they represent the current state of knowledge about the information encoded in a genome. Support for annotation data is therefore a fundamental capability of any genomics software package.
Annotation formats often encode complex relationships that span multiple rows within a file. Differences in how packages represent and reconstruct this information are likely the main source of the substantial performance differences observed here. For example, cogent3's basic parsers are designed to retrieve content as quickly as possible, without combining information across rows. The richer cogent3 objects[^4] are constructed from these parser results, and their performance reflects the added complexity of those algorithms.
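To show what "without combining information across rows" can mean in practice, here is a minimal sketch of a row-by-row GFF3 reader. It is my own illustration, not cogent3's parser. The names `iter_gff3_rows` and `parse_attributes` are hypothetical. Each row becomes an independent dict, and the `Parent` and `ID` attributes are kept as plain strings rather than being resolved into a gene, transcript, exon hierarchy.

```python
GFF3_COLUMNS = ("seqid", "source", "type", "start", "end", "score", "strand", "phase")


def parse_attributes(text):
    """Split a GFF3 attributes column ('ID=x;Parent=y') into a dict."""
    attrs = {}
    for field in text.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key] = value
    return attrs


def iter_gff3_rows(lines):
    """Yield one dict per feature row, with no linking of rows to each other."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            continue  # ignore malformed rows in this sketch
        row = dict(zip(GFF3_COLUMNS, fields[:8]))
        row["start"], row["end"] = int(row["start"]), int(row["end"])
        row["attributes"] = parse_attributes(fields[8])
        yield row
```

Building a queryable structure on top of rows like these, for example grouping exons under their parent transcript, is the extra work that richer objects have to do.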
Ultimately, the most meaningful measure of a package's annotation-handling capability is how effectively its tools support downstream queries. That will be the focus of a future post.
[^1]: By Python primitive, I mean basic Python types such as strings, lists, dicts, or tuples.
[^3]: At least part of the reason cogent3 is using more RAM is a default 5 MB chunk size employed by the scinexus streaming parsers. Reducing that chunk size reduces peak RAM at the cost of a small increase in compute time.
[^4]: I included the `c3gbdb` and `c3gffdb` examples as they are the richest objects cogent3 delivers from parsing annotations. We will explore the objects returned by the sequence annotation parsing in a future post.
[^5]: This GFF3 file covered the entire human genome and thus included all human genes, transcripts and exons.
[^6]: cogent3 required two imports and an explicit creation of a converter for transforming raw fastq quality scores into `numpy` arrays. Strictly speaking, that's not required but it does deliver read `quality` as a comparable type to the other tools.