Thinking Science with Statistics and Computers

Abstract

I’m a scientist whose career has focused on developing statistical and computational methods for genetics and genomics. Here, I share advice and lessons learned from conducting research and from teaching others how to do it. I also provide reports on experiments related to my open-source software.

The genome annotation handling shootout 🔫

Genome annotation data are a fundamental reflection of the state of our understanding of a genome. Any software package that claims to provide generic genomic data handling must also be great for handling genome annotations (aka genome features). Right? In this post, I put this assertion to the test for biopython, cogent3, and scikit-bio. In a nutshell, only cogent3 acquits itself with some distinction. On large datasets, cogent3 can be orders of magnitude faster and use orders of magnitude less memory than the others. It also requires much less code 🤯. But there is room for improvement across all packages.

The parser shootout 🔫

Bioinformatics data processing remains dominated by plain text formats. In this post, I contrast the performance of the popular biopython, cogent3, and scikit-bio packages for reading three sequence file formats and two genome annotation formats. Despite how simple these tasks might seem, you'll see there's a lot of variation in performance! The takeaway message is that cogent3 is nearly always faster for parsing these basic file formats, while biopython typically uses less RAM.