Skip to content

The genome annotation handling shootout πŸ”«

Genome annotation data are a fundamental reflection of the state of our understanding of a genome. Any software package that claims to provide generic genomic data handling must also be great for handling genome annotations (aka genome features). Right? In this post, I put this assertion to the test for biopython, cogent3, and scikit-bio. In a nutshell, only cogent3 acquits itself with some distinction. On large datasets, cogent3 can be orders of magnitude faster and use orders of magnitude less memory than the others. It also requires much less code 🀯. But there is room for improvement across all packages.

The goal

Genome annotations can take many forms. All of them represent the genomic coordinates of some meta-data description of a genome, such as the locations of genes, genetic variation, and so on. In a previous post, we compared the performance of the popular biopython, cogent3, and scikit-bio packages in parsing standard formats, including those for genome annotations. In that post we noticed substantial differences in raw parsing speed with cogent3 often being a lot faster. We qualified the performance difference by pointing out how much more complex annotation data is and that we needed to reserve judgement regarding biopython and scikit-bio until we had evaluated their merits on handling annotation data.

In this post, I do this by contrasting how these packages support working with genome annotations, including the complexity of relating annotations to their associated genomic sequence. I evaluate the ability to:

  1. Transform standard flat file formats into a serialised data format suitable for reuse across Python sessions
  2. Search for annotations matching specified conditions
  3. Integrate annotation searching with selecting and transforming the corresponding genomic sequence
Methods and Datasets

The benchmark code lives in the c3-benchmarking repository.

I time each (task, tool) pair as a standalone process using hyperfine. The orchestrator invokes hyperfine --command-name <tool> '<shell-cmd>' once per tool, runs each command three times by default, and aggregates the mean and standard deviation of wall-clock time and peak resident memory into a per-task TSV. This means we include Python startup, imports, and all that fun stuff as part of the measurements.

Dataset summary

These datasets can be obtained via the repo linked to above.

dataset suffix size description
hsap_gbk .dat 363.7 MB Human chromosome 1 in GenBank format
hsap_gff3 .fa, .gff3 241.4 MB, 650.7 MB Human genome annotations in GFF3 format plus Human chromosome 1 in FASTA format
micro_gbk .gb 11.3 MB The E. coli K12 genome in GenBank format
ptro_fa .fa 3.1 GB Chimpanzee genome
Tools summary

The abbreviations shown in the table are used in all the result tables and figures below. We also include code snippets used for each tool. These are extracted verbatim from the benchmarking project and hence contained within individual functions.

biopython is the oldest of these packages, dating back to 2002. cogent3 and scikit-bio both originated from PyCogent (published in 2007). cogent3 is the closest to the original PyCogent feature set while scikit-bio appears to be primarily focussed on microbes.

I was a founder of the PyCogent project and am the founder of cogent3.

abbreviation package version docs
bp biopython 1.87 biopython
c3gffdb, c3gbdb, c3 cogent3 2026.6.2a0 cogent3
sb scikit-bio 0.7.2 scikit-bio

Serialisation of annotation data

Aim

Because serialized data is usually more efficient to reload and use, we evaluated serialising the results of parsing standard flat-file annotation formats. As far as we can tell, neither biopython nor scikit-bio defines a serialization format for annotation data. We therefore used Python’s built-in pickle module, noting that pickling is typically fragile1.

scikit-bio does not support serialisation

Attempting to pickle objects generated by parsing the annotation data failed for scikit-bio. The error suggests that some objects produced during annotation parsing are instances of a Cython class for interval data that lacks the necessary methods to support pickle.

Human genome annotations derived from a GFF3 file

1
2
3
4
5
6
7
8
9
def bp(path: pathlib.Path, out: pathlib.Path) -> None:
    import pickle

    from BCBio import GFF

    with open(path) as in_handle:
        records = list(GFF.parse(in_handle))
    with out.open("wb") as f:
        pickle.dump(records, f)
1
2
3
4
def c3gffdb(path: pathlib.Path, out: pathlib.Path) -> None:
    import cogent3

    cogent3.load_annotations(path=path, write_path=out)
1
2
3
4
5
6
7
8
def sb(path: pathlib.Path, out: pathlib.Path) -> None:
    import pickle

    from skbio.io import read

    records = list(read(path, format="gff3"))
    with out.open("wb") as f:
        pickle.dump(records, f)
Results table for serialising the Human genome annotations from GFF3
Function Result Type mean(time) seconds std(time) seconds mean(RAM) std(RAM)
bp OK 93.05 1.72 18.2 GB 0.0e+00 GB
c3gffdb OK 73.33 0.40 2.0 GB 0.0e+00 GB
sb Error 50.94 0.17 6.7 GB 0.0e+00 GB

hsap-ser

Annotations in GenBank format

1
2
3
4
5
6
7
8
def bp(path: pathlib.Path, out: pathlib.Path) -> None:
    import pickle

    from Bio import SeqIO

    records = list(SeqIO.parse(path, "genbank"))
    with out.open("wb") as f:
        pickle.dump(records, f)
1
2
3
4
def c3gbdb(path: pathlib.Path, out: pathlib.Path) -> None:
    import cogent3

    cogent3.load_annotations(path=path, write_path=out, format_name="genbank")
1
2
3
4
5
6
7
8
def sb(path: pathlib.Path, out: pathlib.Path) -> None:
    import pickle

    from skbio.io import read

    records = list(read(path, format="genbank"))
    with out.open("wb") as f:
        pickle.dump(records, f)

Human chromosome 1

Results table for serialising the Human chromosome 1 annotations
Function Result Type mean(time) seconds std(time) seconds mean(RAM) std(RAM)
bp OK 5.94 0.07 1.4 GB 0.0e+00 GB
c3gbdb OK 7.26 0.02 2.3 GB 0.0e+00 GB
sb Error 2.63 0.08 837.2 MB 6.8e-01 MB

hsap-ser

E. coli genome

Results table for serialising the E. coli genome
Function Result Type mean(time) seconds std(time) seconds mean(RAM) std(RAM)
bp OK 0.52 0.00 115.0 MB 0.0e+00 MB
c3gbdb OK 0.72 0.01 191.0 MB 3.6e-02 MB
sb Error 1.61 0.03 270.5 MB 1.4e-01 MB

ecoli-ser

Summary of serialisation results

Neither scikit-bio or biopython have an inbuilt serialization capacity for annotation data. In the case of biopython the objects generated by parsing files containing annotations can be pickled, that is, serialised using the standard Python format. cogent3 utilizes an SQLite database for representing annotation data2 and so relies on that built-in module for serializing to disk. So the results show Pythons built-in serialisation is somewhat faster and more memory efficient than cogent3's SQLite based approach. For the whole human genome, cogent3's parser speed offsets the difference so it edges out biopython for this benchmark.

In terms of code complexity, cogent3 is simpler. A single additional argument to the load_annotations() function is required in order for annotation data to be simultaneously written to disk. This is in contrast with the extra work required to serialise biopython objects to disk in pickle format.

Annotation querying

Querying 100 CDS

Aim

Evaluate how hard it is to limit the number of returned features. This is a useful feature for prototyping analyses.

def bp(ann_path: pathlib.Path, limit: int = 100) -> list:
    import pickle

    with ann_path.open("rb") as f:
        records = pickle.load(f)

    hits = []
    for record in records:
        for feature in record.features:
            if feature.type == "CDS":
                hits.append(feature)
                if len(hits) >= limit:
                    return hits
    return hits
1
2
3
4
5
def c3gbdb(ann_path: pathlib.Path, limit: int = 100) -> list:
    import cogent3

    db = cogent3.load_annotations(path=ann_path)
    return list(db.get_features_matching(biotype="CDS", limit=limit))
def sb(ann_path: pathlib.Path, limit: int = 100) -> list:
    from skbio.io import read

    hits = []
    for seq in read(ann_path, format="genbank"):
        for feature in seq.interval_metadata.query(metadata={"type": "CDS"}):
            hits.append(feature)
            if len(hits) >= limit:
                return hits
    return hits

Human chromosome 1

Results table for querying 100 CDS from Human chromosome 1
Function Result Type mean(time) seconds std(time) seconds mean(RAM) std(RAM)
bp OK 1.86 0.01 875.4 MB 0.0e+00 MB
c3gbdb OK 0.38 0.00 95.1 MB 0.0e+00 MB
sb Error 2.55 0.01 833.0 MB 0.0e+00 MB

hsap-quer

E. coli

Results table for querying 100 CDS from E. coli
Function Result Type mean(time) seconds std(time) seconds mean(RAM) std(RAM)
bp OK 0.18 0.00 80.9 MB 0.0e+00 MB
c3gbdb OK 0.38 0.00 94.5 MB 3.6e-02 MB
sb OK 1.63 0.07 266.7 MB 0.0e+00 MB

hsap-quer

Human chromosome 16 protein coding genes

Aim

Evaluate how hard it is to select features using a composite condition: genes are on chromosome 16 AND are protein coding.

def bp(ann_path: pathlib.Path) -> list:
    import pickle

    with ann_path.open("rb") as f:
        records = pickle.load(f)

    hits = []
    for record in records:
        if record.id != "16":
            continue
        for feature in record.features:
            if (
                feature.type == "gene"
                and feature.qualifiers.get("biotype", [""])[0] == "protein_coding"
            ):
                hits.append(feature)
    return hits
1
2
3
4
5
6
7
8
9
def c3gffdb(ann_path: pathlib.Path) -> list:
    import cogent3

    db = cogent3.load_annotations(path=ann_path)
    return list(
        db.get_features_matching(
            seqid="16", biotype="gene", attributes="protein_coding"
        )
    )
1
2
3
4
5
6
7
8
9
def sb(ann_path: pathlib.Path) -> list:
    from skbio.io import read

    hits = []
    for label, im in read(ann_path, format="gff3"):
        if label != "16":
            continue
        hits.extend(im.query(metadata={"type": "gene", "biotype": "protein_coding"}))
    return hits
Results table for querying the Human genome
Function Result Type mean(time) seconds std(time) seconds mean(RAM) std(RAM)
bp OK 32.28 0.25 11.7 GB 0.0e+00 GB
c3gffdb OK 0.40 0.01 97.5 MB 1.5e-01 MB
sb OK 53.62 0.86 2.0 GB 0.0e+00 GB

hsap-quer

One E. coli gene

Aim

Evaluate how hard it is to select a single gene.

def bp(ann_path: pathlib.Path, name: str) -> object | None:
    import pickle

    with ann_path.open("rb") as f:
        records = pickle.load(f)

    for record in records:
        for feature in record.features:
            if feature.type == "CDS" and name in feature.qualifiers.get("gene", []):
                return feature
    return None
1
2
3
4
5
def c3gbdb(ann_path: pathlib.Path, name: str) -> object | None:
    import cogent3

    db = cogent3.load_annotations(path=ann_path)
    return next(iter(db.get_features_matching(biotype="CDS", name=name)))
def sb(ann_path: pathlib.Path, name: str) -> object | None:
    from skbio.io import read

    for seq in read(ann_path, format="genbank"):
        for feature in seq.interval_metadata.query(metadata={"type": "CDS"}):
            gene = feature.metadata.get("gene", "")
            gene = gene.replace('"', "")
            if gene == name:
                return feature
    return None
Results table for querying one gene from E. coli
Function Result Type mean(time) seconds std(time) seconds mean(RAM) std(RAM)
bp OK 0.25 0.01 81.6 MB 0.0e+00 MB
c3gbdb OK 0.39 0.01 97.3 MB 3.6e-02 MB
sb OK 1.61 0.02 271.9 MB 1.6e-01 MB

ecoli-one

Summary of querying results

cogent3 exhibits a substantial performance advantage (both time and RAM) for collections with tens of thousands of features. In some cases, cogent3 uses about 1% of the RAM and is 100 times faster. For the E. coli genome, biopython outperforms cogent3 in both speed and RAM, but not by much. At both scales, scikit-bio requires substantially more resources (time and RAM). The speed advantage of biopython for the E. coli case reflects, in part, the efficiency of deserialisation using pickle.

Both biopython and scikit-bio require looping and conditionals in order to identify the necessary matches to the conditions set out by these queries. In contrast, cogent3 presents a simple interface enabling expression of the conditions with a single statement.

Comparison of sequence annotation integration

Aim

Evaluate how hard it is to integrate both sequence and annotation data. We query for a single, multi-exon gene on the minus strand. We need to extract its exons, concatenate, reverse complement and translate using the standard genetic code.

def bp(fasta_path: pathlib.Path, ann_path: pathlib.Path, name: str) -> object | None:
    import pickle

    from Bio import SeqIO

    fa_seqs = {r.id: r.seq for r in SeqIO.parse(fasta_path, "fasta")}
    with open(ann_path, "rb") as handle:
        records = pickle.load(handle)

    for record in records:
        matches = []
        stack = list(record.features)
        while stack:
            feature = stack.pop()
            if feature.id == name or feature.qualifiers.get("ID", [""])[0] == name:
                matches.append(feature)

            stack.extend(getattr(feature, "sub_features", None) or [])

        if not matches:
            continue

        matches.sort(key=lambda f: f.location.start)
        parent = fa_seqs[record.id]
        joined = parent[matches[0].location.start : matches[0].location.end]

        for f in matches[1:]:
            joined = joined + parent[f.location.start : f.location.end]

        if matches[0].location.strand == -1:
            joined = joined.reverse_complement()

        return joined.translate()

    return None
def c3(fasta_path: pathlib.Path, ann_path: pathlib.Path, name: str) -> object | None:
    import cogent3

    seq = cogent3.load_seq(
        fasta_path,
        moltype="dna",
        annotation_path=ann_path,
        label_to_name=lambda x: x.split()[0],
    )
    feature = next(iter(seq.get_features(name=name)))
    return feature.get_slice().get_translation(include_stop=True, trim_stop=False)
def sb(fasta_path: pathlib.Path, ann_path: pathlib.Path, name: str) -> object | None:
    from skbio import DNA
    from skbio.io import read

    fa_seqs = {
        s.metadata["id"]: s for s in read(fasta_path, format="fasta", constructor=DNA)
    }
    for seqid, im in read(ann_path, format="gff3"):
        if seqid not in fa_seqs:
            continue

        matches = [
            feature
            for feature in im.query(metadata={"type": "CDS"})
            if feature.metadata.get("ID") == name
        ]
        if not matches:
            continue

        parent = fa_seqs[seqid]
        bounds = sorted(
            ((start, stop) for f in matches for start, stop in f.bounds),
            key=lambda b: b[0],
        )
        pieces = [parent[start:stop] for start, stop in bounds]
        joined = DNA.concat(pieces)
        if matches[0].metadata.get("strand") == "-":
            joined = joined.reverse_complement()

        return joined.translate()

    return None
Results table from integrated querying to sequence translation
Function Result Type mean(time) seconds std(time) seconds mean(RAM) std(RAM)
bp OK 33.41 0.04 12.2 GB 0.0e+00 GB
c3 OK 1.52 0.03 1.3 GB 0.0e+00 GB
sb OK 10.81 0.07 3.5 GB 4.4e-05 GB

human-integ-one

In addition to being much faster and requiring less RAM, cogent3 is able to perform this task with just four Python statements. In contrast, both biopython and scikit-bio require approximately 30 lines of code.

Conclusions

Of the three packages, only cogent3 presents a simplified interface for querying annotations. It also exhibits the tightest integration with sequence objects. These factors manifest as the considerably reduced code complexity of cogent3 solutions to the problems we posed.

cogent3's simplified interface is also paired with the highest performance characteristics in terms of low execution time and low RAM usage. A notable exception is performance when querying the bacterial genome, where biopython was approximately 0.1–0.2 seconds faster. While we could argue that this difference is trivial, I think it indicates there is unnecessary overhead in cogent3’s loading of databases followed by querying. I further suspect cogent3 has unnecessary overhead associated with Feature instances and related operations. Both of these issues warrant more detailed investigation.

It is clear that the biopython and scikit-bio developers also have their work cut out for them. Their tasks range from the inclusion of GFF3 parsing capabilities within biopython to specifying more efficient data structures to improve scalability. For scikit-bio, implementing support for serialisation would be valuable. And the robustness of the scikit-bio GenBank parser is not great. Adding to that, there are some real gotcha's. For example, why does gene = feature.metadata.get("gene", "") return '"ytjEr"' instead of 'ytjEr'? That's unexpected and why we needed to do the removal of quote characters from the string.

The functionality that we have explored in this post is by no means comprehensive. Annotation data is employed in a multitude of ways by life sciences researchers. That said, it is also the case that we have not fully explored the cogent3 feature set related to annotations. Check out the project documentation for a more complete listing of cogent3’s annotation handling. I will just note here that these include capabilities such as: using a feature instance to select its parent feature (think of a transcript selecting its gene), the ability to produce interactive plots of features, and the ability to generate a single annotation database from a directory of annotation files in one statement.


  1. Pickling is fragile between runtimes for a large number of reasons and is therefore not recommended. 

  2. Because of its plugin architecture, cogent3 can support third-party plugins for annotation data handling. As such, these can use their own serialisation format. 

Comments