bioinformatics

The genome annotation handling shootout 🔫

Genome annotation data are a fundamental reflection of the state of our understanding of a genome. Any software package that claims to provide generic genomic data handling must also be great for handling genome annotations (aka genome features). Right? In this post, I put this assertion to the test for biopython, cogent3, and scikit-bio. In a nutshell, only cogent3 acquits itself with some distinction. On large datasets, cogent3 can be orders of magnitude faster and use orders of magnitude less memory than the others. It also requires much less code 🤯. But there is room for improvement across all packages.

The parser shootout 🔫

Bioinformatics data processing remains dominated by plain text formats. In this post, I contrast the performance of the popular biopython, cogent3, and scikit-bio packages for reading three sequence file formats and two genome annotation formats. Despite how simple these tasks might seem, you'll see there's a lot of variation in performance! The takeaway message is that cogent3 is nearly always faster for parsing these basic file formats, while biopython typically uses less RAM.