The genome annotation handling shootout π«¶
Genome annotation data are a fundamental reflection of the state of our understanding of a genome. Any software package that claims to provide generic genomic data handling must also be great for handling genome annotations (aka genome features). Right? In this post, I put this assertion to the test for biopython, cogent3, and scikit-bio. In a nutshell, only cogent3 acquits itself with some distinction. On large datasets, cogent3 can be orders of magnitude faster and use orders of magnitude less memory than the others. It also requires much less code π€―. But there is room for improvement across all packages.
The goal¶
Genome annotations can take many forms. All of them represent the genomic coordinates of some meta-data description of a genome, such as the locations of genes, genetic variation, and so on. In a previous post, we compared the performance of the popular biopython, cogent3, and scikit-bio packages in parsing standard formats, including those for genome annotations. In that post we noticed substantial differences in raw parsing speed with cogent3 often being a lot faster. We qualified the performance difference by pointing out how much more complex annotation data is and that we needed to reserve judgement regarding biopython and scikit-bio until we had evaluated their merits on handling annotation data.
In this post, I do this by contrasting how these packages support working with genome annotations, including the complexity of relating annotations to their associated genomic sequence. I evaluate the ability to:
- Transform standard flat file formats into a serialised data format suitable for reuse across Python sessions
- Search for annotations matching specified conditions
- Integrate annotation searching with selecting and transforming the corresponding genomic sequence
Methods and Datasets
The benchmark code lives in the c3-benchmarking repository.
I time each (task, tool) pair as a standalone process using hyperfine. The orchestrator invokes hyperfine --command-name <tool> '<shell-cmd>' once per tool, runs each command three times by default, and aggregates the mean and standard deviation of wall-clock time and peak resident memory into a per-task TSV. This means we include Python startup, imports, and all that fun stuff as part of the measurements.
Dataset summary
These datasets can be obtained via the repo linked to above.
| dataset | suffix | size | description |
|---|---|---|---|
| hsap_gbk | .dat | 363.7 MB | Human chromosome 1 in GenBank format |
| hsap_gff3 | .fa, .gff3 | 241.4 MB, 650.7 MB | Human genome annotations in GFF3 format plus Human chromosome 1 in FASTA format |
| micro_gbk | .gb | 11.3 MB | The E. coli K12 genome in GenBank format |
| ptro_fa | .fa | 3.1 GB | Chimpanzee genome |
Tools summary
The abbreviations shown in the table are used in all the result tables and figures below. We also include code snippets used for each tool. These are extracted verbatim from the benchmarking project and hence contained within individual functions.
biopython is the oldest of these packages, dating back to 2002. cogent3 and scikit-bio both originated from PyCogent (published in 2007). cogent3 is the closest to the original PyCogent feature set while scikit-bio appears to be primarily focussed on microbes.
I was a founder of the PyCogent project and am the founder of cogent3.
| abbreviation | package | version | docs |
|---|---|---|---|
| bp | biopython | 1.87 | biopython |
| c3gffdb, c3gbdb, c3 | cogent3 | 2026.6.2a0 | cogent3 |
| sb | scikit-bio | 0.7.2 | scikit-bio |
Serialisation of annotation data¶
Aim
Because serialized data is usually more efficient to reload and use, we evaluated serialising the results of parsing standard flat-file annotation formats. As far as we can tell, neither biopython nor scikit-bio defines a serialization format for annotation data. We therefore used Pythonβs built-in pickle module, noting that pickling is typically fragile1.
scikit-bio does not support serialisation
Attempting to pickle objects generated by parsing the annotation data failed for scikit-bio. The error suggests that some objects produced during annotation parsing are instances of a Cython class for interval data that lacks the necessary methods to support pickle.
Human genome annotations derived from a GFF3 file¶
Results table for serialising the Human genome annotations from GFF3
| Function | Result Type | mean(time) seconds | std(time) seconds | mean(RAM) | std(RAM) |
|---|---|---|---|---|---|
| bp | OK | 93.05 | 1.72 | 18.2 GB | 0.0e+00 GB |
| c3gffdb | OK | 73.33 | 0.40 | 2.0 GB | 0.0e+00 GB |
| sb | Error | 50.94 | 0.17 | 6.7 GB | 0.0e+00 GB |
Annotations in GenBank format¶
Human chromosome 1¶
Results table for serialising the Human chromosome 1 annotations
| Function | Result Type | mean(time) seconds | std(time) seconds | mean(RAM) | std(RAM) |
|---|---|---|---|---|---|
| bp | OK | 5.94 | 0.07 | 1.4 GB | 0.0e+00 GB |
| c3gbdb | OK | 7.26 | 0.02 | 2.3 GB | 0.0e+00 GB |
| sb | Error | 2.63 | 0.08 | 837.2 MB | 6.8e-01 MB |
E. coli genome¶
Results table for serialising the E. coli genome
| Function | Result Type | mean(time) seconds | std(time) seconds | mean(RAM) | std(RAM) |
|---|---|---|---|---|---|
| bp | OK | 0.52 | 0.00 | 115.0 MB | 0.0e+00 MB |
| c3gbdb | OK | 0.72 | 0.01 | 191.0 MB | 3.6e-02 MB |
| sb | Error | 1.61 | 0.03 | 270.5 MB | 1.4e-01 MB |
Summary of serialisation results¶
Neither scikit-bio or biopython have an inbuilt serialization capacity for annotation data. In the case of biopython the objects generated by parsing files containing annotations can be pickled, that is, serialised using the standard Python format. cogent3 utilizes an SQLite database for representing annotation data2 and so relies on that built-in module for serializing to disk. So the results show Pythons built-in serialisation is somewhat faster and more memory efficient than cogent3's SQLite based approach. For the whole human genome, cogent3's parser speed offsets the difference so it edges out biopython for this benchmark.
In terms of code complexity, cogent3 is simpler. A single additional argument to the load_annotations() function is required in order for annotation data to be simultaneously written to disk. This is in contrast with the extra work required to serialise biopython objects to disk in pickle format.
Annotation querying¶
Querying 100 CDS¶
Aim
Evaluate how hard it is to limit the number of returned features. This is a useful feature for prototyping analyses.
Human chromosome 1¶
Results table for querying 100 CDS from Human chromosome 1
| Function | Result Type | mean(time) seconds | std(time) seconds | mean(RAM) | std(RAM) |
|---|---|---|---|---|---|
| bp | OK | 1.86 | 0.01 | 875.4 MB | 0.0e+00 MB |
| c3gbdb | OK | 0.38 | 0.00 | 95.1 MB | 0.0e+00 MB |
| sb | Error | 2.55 | 0.01 | 833.0 MB | 0.0e+00 MB |
E. coli¶
Results table for querying 100 CDS from E. coli
| Function | Result Type | mean(time) seconds | std(time) seconds | mean(RAM) | std(RAM) |
|---|---|---|---|---|---|
| bp | OK | 0.18 | 0.00 | 80.9 MB | 0.0e+00 MB |
| c3gbdb | OK | 0.38 | 0.00 | 94.5 MB | 3.6e-02 MB |
| sb | OK | 1.63 | 0.07 | 266.7 MB | 0.0e+00 MB |
Human chromosome 16 protein coding genes¶
Aim
Evaluate how hard it is to select features using a composite condition: genes are on chromosome 16 AND are protein coding.
Results table for querying the Human genome
| Function | Result Type | mean(time) seconds | std(time) seconds | mean(RAM) | std(RAM) |
|---|---|---|---|---|---|
| bp | OK | 32.28 | 0.25 | 11.7 GB | 0.0e+00 GB |
| c3gffdb | OK | 0.40 | 0.01 | 97.5 MB | 1.5e-01 MB |
| sb | OK | 53.62 | 0.86 | 2.0 GB | 0.0e+00 GB |
One E. coli gene¶
Aim
Evaluate how hard it is to select a single gene.
Results table for querying one gene from E. coli
| Function | Result Type | mean(time) seconds | std(time) seconds | mean(RAM) | std(RAM) |
|---|---|---|---|---|---|
| bp | OK | 0.25 | 0.01 | 81.6 MB | 0.0e+00 MB |
| c3gbdb | OK | 0.39 | 0.01 | 97.3 MB | 3.6e-02 MB |
| sb | OK | 1.61 | 0.02 | 271.9 MB | 1.6e-01 MB |
Summary of querying results¶
cogent3 exhibits a substantial performance advantage (both time and RAM) for collections with tens of thousands of features. In some cases, cogent3 uses about 1% of the RAM and is 100 times faster. For the E. coli genome, biopython outperforms cogent3 in both speed and RAM, but not by much. At both scales, scikit-bio requires substantially more resources (time and RAM). The speed advantage of biopython for the E. coli case reflects, in part, the efficiency of deserialisation using pickle.
Both biopython and scikit-bio require looping and conditionals in order to identify the necessary matches to the conditions set out by these queries. In contrast, cogent3 presents a simple interface enabling expression of the conditions with a single statement.
Comparison of sequence annotation integration¶
Aim
Evaluate how hard it is to integrate both sequence and annotation data. We query for a single, multi-exon gene on the minus strand. We need to extract its exons, concatenate, reverse complement and translate using the standard genetic code.
Results table from integrated querying to sequence translation
| Function | Result Type | mean(time) seconds | std(time) seconds | mean(RAM) | std(RAM) |
|---|---|---|---|---|---|
| bp | OK | 33.41 | 0.04 | 12.2 GB | 0.0e+00 GB |
| c3 | OK | 1.52 | 0.03 | 1.3 GB | 0.0e+00 GB |
| sb | OK | 10.81 | 0.07 | 3.5 GB | 4.4e-05 GB |
In addition to being much faster and requiring less RAM, cogent3 is able to perform this task with just four Python statements. In contrast, both biopython and scikit-bio require approximately 30 lines of code.
Conclusions¶
Of the three packages, only cogent3 presents a simplified interface for querying annotations. It also exhibits the tightest integration with sequence objects. These factors manifest as the considerably reduced code complexity of cogent3 solutions to the problems we posed.
cogent3's simplified interface is also paired with the highest performance characteristics in terms of low execution time and low RAM usage. A notable exception is performance when querying the bacterial genome, where biopython was approximately 0.1β0.2 seconds faster. While we could argue that this difference is trivial, I think it indicates there is unnecessary overhead in cogent3βs loading of databases followed by querying. I further suspect cogent3 has unnecessary overhead associated with Feature instances and related operations. Both of these issues warrant more detailed investigation.
It is clear that the biopython and scikit-bio developers also have their work cut out for them. Their tasks range from the inclusion of GFF3 parsing capabilities within biopython to specifying more efficient data structures to improve scalability. For scikit-bio, implementing support for serialisation would be valuable. And the robustness of the scikit-bio GenBank parser is not great. Adding to that, there are some real gotcha's. For example, why does gene = feature.metadata.get("gene", "") return '"ytjEr"' instead of 'ytjEr'? That's unexpected and why we needed to do the removal of quote characters from the string.
The functionality that we have explored in this post is by no means comprehensive. Annotation data is employed in a multitude of ways by life sciences researchers. That said, it is also the case that we have not fully explored the cogent3 feature set related to annotations. Check out the project documentation for a more complete listing of cogent3βs annotation handling. I will just note here that these include capabilities such as: using a feature instance to select its parent feature (think of a transcript selecting its gene), the ability to produce interactive plots of features, and the ability to generate a single annotation database from a directory of annotation files in one statement.
-
Pickling is fragile between runtimes for a large number of reasons and is therefore not recommended. ↩
-
Because of its plugin architecture,
cogent3can support third-party plugins for annotation data handling. As such, these can use their own serialisation format. ↩