Setting up your computer for computational research

How much impact can a computer setup possibly have? Some choices can make your daily life as a computational scientist a lot happier. Conversely, other choices can make you curse yourself. If it sounds like I’m talking from experience, you’re correct – 20+ years’ worth.

Be systematic

There are many reasons why adopting a systematic approach to organising your research activities on your computer is important. If you adopt a naming scheme and always organise your project files in the same way, you increase the clarity and predictability of your work. Putting all research work under a single directory facilitates portability: the directory can easily be mirrored on different computers, and code that works on one system will work on another. Predictability and portability improve the reproducibility of your work, since you will not have to write custom code to handle execution on different systems.

Designing your work with reproducibility in mind benefits you and your colleagues and contributes to your long-term reputation. It also reduces the chance you’ll go 🤪.

Take account of the resources you will use

Computers

Consider the following scenario:

  • laptop, running macOS or Windows: most development and prototyping will be done here

  • lab server, running Ubuntu 20: data sampling against databases is done here

  • supercomputer, running “Rocky Linux release 8.5”: large scale analysis is done here

In this scenario, data sampling code and data are synchronised between the laptop and the lab server. Large scale analyses are first prototyped on the laptop, and a selected data subset, code, and computational environment are synchronised with the supercomputer. Results are brought back to the laptop.

How can you simplify keeping the different computational environments in sync?

Adopt a directed relationship (click the tabs below to see illustrations for our scenario). For example, designate your laptop as the canonical machine for any source code. That means only editing code on that machine. Communicate those edits to the other computers in one direction only (use git 1 for this). If you don’t adhere to this, your code will quickly devolve into an inconsistent state, making you 😢.

digraph code_flow {
    node [shape="rectangle"];
    GitHub [color="darkgray" style=filled];
    Laptop [color="skyblue" style=filled];
    "Lab Server" [fontcolor="white" color="blue" style=filled];
    Supercomputer [fontcolor="white" color="darkblue" style=filled];
    Laptop -> GitHub;
    Laptop -> "Lab Server";
    Laptop -> Supercomputer;
    GitHub -> "Lab Server" [style=dotted];
    GitHub -> "Supercomputer" [style=dotted];
}

Arrow direction indicates the direction of flow. If you authorise all computers (via ssh keys) with GitHub, you can move all code between computers via git 2.

digraph data_flow {
    node [shape="rectangle"];
    External [color="darkgray" style=filled];
    Laptop [color="skyblue" style=filled];
    "Lab Server" [fontcolor="white" color="blue" style=filled];
    Supercomputer [fontcolor="white" color="darkblue" style=filled];
    External -> Laptop;
    "Lab Server" -> Laptop;
    Laptop -> Supercomputer;
}

Arrow direction indicates the direction of flow. The Lab Server is where data sampling takes place (remember, we’re assuming it hosts the databases), and the Laptop specifies what is moved to the Supercomputer.

digraph results_flow {
    node [shape="rectangle"];
    Laptop [color="skyblue" style=filled];
    Supercomputer [fontcolor="white" color="darkblue" style=filled];
    Laptop -> Laptop;
    Supercomputer -> Laptop;
}

Arrow direction indicates the direction of flow. Results are generated on either the Laptop or the Supercomputer and are collected on the Laptop.
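To make the code-flow tab concrete, here is a minimal sketch of the one-directional workflow, assuming your default branch is named main and your GitHub remote is called origin:

# on the Laptop (the canonical copy): commit and push your edits
$ git commit -am "describe the change"
$ git push origin main

# on the Lab Server or Supercomputer: only ever pull, never edit
$ git pull origin main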

Note

Homogeneity in operating systems is key to the portability of your work. While macOS and Linux have substantial differences, they are both POSIX operating systems. So the effort to develop code on a macOS laptop and get it running on a Linux machine is, in my experience, quite small.

Windows, on the other hand, is more of a challenge. Because most machines used in the computational sciences are running a POSIX OS (i.e. Linux), I strongly advise you to do your development work using WSL on a Windows laptop.

Major software tools

Package manager

If your laptop is running macOS, you already have a Unix-based system 😎. However, you will almost certainly need to install a unix-style package manager. I recommend Homebrew. This can be used to install non-Python tools, such as the openmpi library 3.
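For example, once Homebrew is installed, getting openmpi is a one-liner (open-mpi is, to the best of my knowledge, the formula name; use brew search if it differs):

$ brew install open-mpi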

If your laptop is running Windows, install the latest WSL (Windows Subsystem for Linux), which installs Ubuntu by default. You will then get the apt package manager.
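As a sketch, on recent versions of Windows the whole setup can be done from an administrator terminal, after which apt is available inside the Ubuntu environment:

# in an administrator Windows terminal
> wsl --install

# then, inside the Ubuntu shell
$ sudo apt update
$ sudo apt install build-essential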

Terminal app

The Terminal application is your gateway to the command line on all the computers you will access. The terminal is just an interface to your shell environment. At present, zsh is the default shell on macOS, while bash is the default on Linux distributions. Configuration files are located in your home directory and named .zshrc and .bashrc respectively. You will be editing them.

Since this same application is used to interact with all the computers you access, it’s worth making your shell environments on different computers as similar as possible, while still making it clear which computer you’re on. The latter can be achieved by customising the terminal prompt.
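One simple way to keep track of which computer you’re on is to include the username and hostname in the prompt. A minimal sketch for each shell’s configuration file (the exact styling is entirely up to you):

# in ~/.zshrc on macOS
PROMPT='%n@%m %1~ %% '

# in ~/.bashrc on Linux
PS1='\u@\h \W \$ '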

If you use zsh, consider installing ohmyzsh. Some other command line tools I find super useful are:

  • autojump, jump to commonly used directories

  • starship, customised prompts

  • mcfly, control+r magic for getting back to past commands

  • fd, super fast find

  • ripgrep, a better, easier-to-use grep

See the modern unix tools page.

Text editor

No matter what OS your laptop is running, I encourage you to install VS Code. The reasons are simple: it provides an excellent experience for editing files on remote machines 4 along with a fully-featured terminal. Be sure to research useful extensions.

Install the code command-line tool for invoking VS Code from the terminal. Open the command palette (macOS command+shift+p) and type “install code” at the prompt. This will show a single listing with “Shell Command: Install ‘code’ in PATH”. Click on that and follow any prompts.
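Once installed, you can open individual files or whole project directories from the terminal, for example (my_project is a placeholder):

$ code ~/repos/my_project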

Configure ssh

ssh stands for “secure shell”, which provides a mechanism for accessing remote computers, either for interactive terminal sessions or for copying files to / from them. On your laptop, you should create a private / public key pair.

$ ssh-keygen -t rsa -b 4096 -C "your_email@example.com"

Follow the prompts and set a strong passphrase. The public copy of this key (which you can find under ~/.ssh/id_rsa.pub) will be copied to your other computer accounts, making it simpler to authorise your access.

But wait, you’re not done with ssh yet! Using your newly configured VS Code, enter the following command in the terminal

$ code ~/.ssh/config

This will open an empty file. You can include a shortcut in this file for every remote machine you need to access. Here’s an example

Host qik
UseKeychain yes
HostName super.annoying.domain.com
User ini777

Tip

qik is an “alias” I defined. Pro-tip, make your aliases easy to type!

Save the file. Instead of logging into super.annoying.domain.com as

$ ssh ini777@super.annoying.domain.com

you can now do

$ ssh qik

🎉
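The alias also works with the file-copying tools built on top of ssh, which is handy for the synchronisation described above. For example (the paths here are placeholders):

$ scp qik:results/summary.tsv .   # copy a single file back to the laptop
$ rsync -av results/ qik:results/ # mirror a directory to the remote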

Log in to each computer and repeat the ssh-keygen step there (this will facilitate code sharing; see below). Next, on your laptop, copy your public ssh key to the clipboard.

$ cat ~/.ssh/id_rsa.pub | pbcopy

Then add it to the authorized_keys file on each of your remote computers by logging in and doing the following

$ ssh qik
$ nano ~/.ssh/authorized_keys # or your favourite editor

paste the key on a new line 5 and exit nano 6.
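If ssh-copy-id is available on your laptop, it automates the copy-and-paste step above:

$ ssh-copy-id -i ~/.ssh/id_rsa.pub qik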

Using git and GitHub for version control

The version control tool git should already be installed on your computer. To use git you need to configure it.

$ git config --global user.name "Firstname Lastname"
$ git config --global user.email "username@myEmail.com"

These will be recorded by git as the author information on any commits you make. I recommend you do this on all the computers you will be using.
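You can check what git has recorded at any time with

$ git config --global --list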

If you don’t already have an account on GitHub, create one. At this point, you should copy the public ssh keys you created on each machine and add them to your GitHub account. Follow the instructions at GitHub.
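Once a machine’s key has been added, you can confirm that GitHub accepts it from that machine with

$ ssh -T git@github.com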

Tip

When you add a key, give it the computer’s name. Doing this means it’s easy to delete a key if you lose access to that computer (e.g. you buy a new laptop).

Reproducible computational environments

There is no single answer to this challenge that applies to all cases. Some will argue that conda provides the most general solution to this problem. My own experience is that if your computations include a supercomputer, you may find conda troublesome. Supercomputers are often administered via a granting system whereby some quantity of resources is allocated. Those resources include CPU hours and storage. If you exceed your allocation, you can no longer use the computer.

conda does not work well in the supercomputer context. Shared facilities may penalise user accounts with many files 7 due to the significant overhead they can impose on the performance of the file system. I have witnessed this effect with naive conda installs. In addition, supercomputer facilities often provide custom builds of core tools, for instance higher-performance builds of Python than those you will obtain from conda-forge.

If conda seems to be the only solution for your case, make sure you install only the minimal dependency set. You can specify that set using a conda environment yaml file, remembering to “pin” 8 your versions.
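A minimal sketch of such a file; the package names and versions below are purely illustrative, list only what your project actually needs:

# environment.yml
name: project_env
channels:
  - conda-forge
dependencies:
  - python=3.10
  - numpy=1.24.3
  - scipy=1.10.1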

If you are lucky enough to have a Python-only project 9, then use Python’s built-in venv module to create virtual environments. These can be made portable by creating a requirements.txt file, which you share between your different accounts. If this is the approach you take, be sure to pin your dependency versions.
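A minimal sketch of that workflow, with illustrative package versions:

$ python3 -m venv venv
$ source venv/bin/activate
$ pip install numpy==1.24.3 scipy==1.10.1
$ pip freeze > requirements.txt   # records the pinned versions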

Tip

You can reconstruct your computing environment from just the yaml or requirements file. As plain text files, they should be version controlled too.
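For example, on another machine the environment can be rebuilt directly from that file:

$ conda env create -f environment.yml   # conda case
$ pip install -r requirements.txt       # venv case, after activating the environment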

Structuring your projects

Tip

Put them all into a single directory, call it repos 10.

Having a single directory makes moving your research projects between computers easier. I also advise including, in this directory, the repositories of any dependencies that are being actively developed. This way, you preserve the entire compute state.

Tip

Since you will be versioning everything, the first action you take to start a new project is to create a repository on GitHub. Then clone it into your ~/repos directory.
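As a sketch (the account and repository names are placeholders):

$ mkdir -p ~/repos
$ cd ~/repos
$ git clone git@github.com:your-username/new_project.git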

Typically, I have two repositories if I’m engaged in research to develop a software tool. The first is for the tool to be distributed to the target audience. The second is for the analyses undertaken to establish that the tool is worth using. Below I give sample structures for a “software project” and a “research project”.

Directory structure for a software methods project 11

.
└── software_project/
    ├── README stating project purpose and basic instructions
    ├── LICENSE pick one that facilitates user adoption
    ├── project config files
    ├── docs/
    │   ├── data/
    │   │   └── small sample data files
    │   └── documentation covering API and examples
    ├── src/
    │   └── lib_name/
    │     └── source code files
    └── tests/
        ├── data/
        │   └── small sample data files
        └── test files

Software development projects have input data needed by the test suite and documentation, and these files should be tracked in version control. They should be minimal: sufficient for testing and / or demonstration purposes.

Directory structure for a research project

.
└── research_project/
    ├── README describing usage
    ├── data/
    │   ├── processed/
    │   └── raw/
    ├── results/
    │   ├── figures/
    │   └── tables/
    ├── src/
    │   ├── analysis scripts
    │   ├── data sampling scripts
    │   └── notebook files
    └── tests/
        ├── data/
        └── test files

Research projects have input data that may be local to your institute or external, e.g. resources such as Ensembl, GenBank, or Zenodo. Wherever your data comes from, store it under the data/ directory with a name that reflects its origin.

For a research project, these data files can be massive! As such, do not add the data files to your research project’s git repository. An alternative way to version those files is to upload them to Zenodo (for instance) and add a script that does the download. Users seeking to replicate your work then run that script to reconstitute the state of your project directory.
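A minimal sketch of such a script, assuming the data lives in a Zenodo record; the script name, record ID, and file name are placeholders you would replace with your own:

#!/usr/bin/env bash
# src/download_data.sh -- reconstitute data/raw/ from Zenodo
set -e
mkdir -p data/raw
curl -L -o data/raw/counts.tsv.gz \
    "https://zenodo.org/records/0000000/files/counts.tsv.gz?download=1"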

Note

“notebook files” refers to Jupyter notebook files. Putting these in version control can be problematic. There are multiple reasons for this, e.g. embedded images can make these files very large. This has led to tools like nbstripout. My advice is to include notebooks only if they’re small.
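If you do commit notebooks, nbstripout can be configured per repository so that outputs are removed before commits; a sketch:

$ pip install nbstripout
$ cd ~/repos/research_project   # placeholder path
$ nbstripout --install          # registers a git filter for this repository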

Footnotes

1

git is the major tool used for “version control” of text files.

2

You do not actually have to use GitHub for this. But if GitHub is how you will share your work with other 🧑‍🔬, you may as well.

3

This is necessary for prototyping code that runs in parallel using the MPI (Message Passing Interface) library. MPI is the most likely protocol for parallel computation supported on the supercomputer.

4

On Windows, install the Remote WSL extension.

5

The public key must be on a single line.

6

It is up to you to be sure you know how to use the nano editor. When in doubt, google it.

7

Measured via inodes.

8

Pinning here means to state a specific version number of the tool.

9

“Lucky” in the sense that there is less complexity in the project, simplifying your development process and reducing the storage footprint of your project.

10

repos because it is short for repositories, and every project will be version controlled … right?

11

This is for a Python-based project. Adopt a structure that is considered best practice for the language you’re working in.