The three technologies bioinformaticians need to be using right now
I’m old. Sometimes I feel really old. I’m old enough to remember when you had to install things manually, and that’s if you were allowed to; worse still was depending on an IT department to install software for you. I am old enough to remember when changing servers meant reinstalling everything from scratch. I am old enough to have been frightened by dependency hell, where you do not know all the dependencies of the software you need to install, or whether they will conflict with other software you also need to install. It’s not too long ago that my “pipelines” were simple Perl or bash scripts, with for() and while() loops in them, whose purpose was to run things in the correct order and maybe do some basic checking to ensure things ran.
My bone-chilling fear is that many of you, maybe even most of you, still do this. My fear is that if you are a pet bioinformatician, you don’t have anyone around to tell you that it needn’t be like this anymore. How do I sleep? Sometimes very badly.
Look, this isn’t a tutorial, so here it is. These are the three technologies that, as a bioinformatician, you need to be engaging with right now. Not next week, or next month. Now. They are not new technologies. Now is not the time to put this off; it’s time to realise that if you don’t use them, you are already out of date, and the longer you leave it, the worse things will become.
Here they are:
1. Software environments and/or containers
Software environments can be thought of as a folder where you install all the software you need to run a particular analysis. “But I already have one of those!” I hear you cry. Ah, but there’s more. Software environments don’t require root access, they handle all software dependencies, they are exportable, shareable and re-useable, they can be used with multiple different flavours of Linux/Unix, and in the vast majority of cases they make installing and maintaining software simple and pain-free. You can have multiple environments (e.g. one for each pipeline you run), and you can also version them. The one I am most familiar with is conda (https://docs.conda.io/en/latest/), which has an associated BioConda project (https://bioconda.github.io/) with many bioinformatics tools set up and ready to go.
Containers are similar in that they are a place where you install all of the software you need to run a pipeline. You can think of a container as something like a light-weight VM (strictly speaking it shares the host’s kernel rather than virtualising a whole machine), and the idea is that you describe once how to install software into a container, and then you use that container multiple times, i.e. each time you run your pipeline. No changes are made to your local machine; the software simply downloads the container and runs your pipeline inside of it. Again you can set up and use multiple containers, and version them. The most common software tools for containers are Docker and Singularity, and there is a BioContainers project (https://biocontainers.pro/#/).
Installing software is now so simple in the vast majority of cases, using either or both of the above. Please use them.
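To make “exportable, shareable and versioned” a bit more concrete, here is a minimal sketch in Python that writes out the kind of environment file conda reads. The environment name, the choice of tools and the version pins are purely illustrative assumptions, not recommendations, and `render_environment` is a made-up helper rather than part of conda.

```python
def render_environment(name, channels, packages):
    """Return the text of a conda environment.yml file.

    `packages` maps a package name to a pinned version string.
    Hypothetical helper for illustration; not part of conda itself.
    """
    lines = [f"name: {name}", "channels:"]
    lines += [f"  - {channel}" for channel in channels]
    lines.append("dependencies:")
    # Sort so the same set of pins always produces the same file.
    lines += [f"  - {pkg}={version}" for pkg, version in sorted(packages.items())]
    return "\n".join(lines) + "\n"


# Illustrative pins only: choose the tools and versions your analysis needs.
ENV_TEXT = render_environment(
    name="variant-calling",
    channels=["conda-forge", "bioconda"],
    packages={"samtools": "1.19", "bwa": "0.7.17"},
)

if __name__ == "__main__":
    print(ENV_TEXT, end="")
```

Saved as environment.yml and committed alongside your pipeline, a file like this can be turned back into a working environment on another machine with `conda env create -f environment.yml`.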
2. Workflow management systems
It’s a law of the universe that every bioinformatician will say they have a “pipeline”, but often I find this is actually a bash, Perl or Python script. Scripts like that were great for their time, but it is time to move on.
Workflow management systems describe true pipelines. Each has its own, often simple, language; the language describes jobs, and each job will have input and output. The input of one job can be the output of another job, and the jobs will be executed in the correct order to make sure all of the inputs are present before running. They handle wild cards and scale to tens of thousands of input files. They run on your local computer, they integrate with common HPC submission systems, they can submit jobs to the cloud, and they often also integrate with software environments and containers. Yes, each job in your workflow can be run within its own environment or container. They track which jobs succeeded and which failed, they can re-submit those that failed, they can restart where the pipeline finished last time, they clean up when things fail, and you can add more input files at a later date and they will only re-run the jobs they need to. There are hundreds of these, but those I am most familiar with are Snakemake (https://snakemake.readthedocs.io/en/stable/) and Nextflow (https://www.nextflow.io/).
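If the idea of jobs wired together by their inputs and outputs sounds abstract, the toy below sketches it in plain Python (3.9 or later, for `graphlib`). It is not how Snakemake or Nextflow are implemented, and every name in it (`Job`, `plan`, `run`, the file names) is made up for illustration. It only shows the two core tricks: work out the order from who produces what, and skip any job whose outputs already exist. Real systems use stricter checks, and add wild cards, retries, clusters and cloud on top.

```python
import tempfile
from dataclasses import dataclass
from graphlib import TopologicalSorter
from pathlib import Path
from typing import Callable, List


@dataclass
class Job:
    name: str
    inputs: List[str]
    outputs: List[str]
    action: Callable[[List[str], List[str]], None]


def plan(jobs):
    """Order jobs so that every producer runs before its consumers."""
    producer = {out: job.name for job in jobs for out in job.outputs}
    graph = {
        job.name: {producer[i] for i in job.inputs if i in producer}
        for job in jobs
    }
    by_name = {job.name: job for job in jobs}
    return [by_name[n] for n in TopologicalSorter(graph).static_order()]


def run(jobs):
    """Run only the jobs whose outputs are missing; return their names."""
    executed = []
    for job in plan(jobs):
        if all(Path(o).exists() for o in job.outputs):
            continue  # already done on a previous run: skip it
        missing = [i for i in job.inputs if not Path(i).exists()]
        if missing:
            raise FileNotFoundError(f"{job.name}: missing inputs {missing}")
        job.action(job.inputs, job.outputs)
        executed.append(job.name)
    return executed


def upper(inputs, outputs):
    Path(outputs[0]).write_text(Path(inputs[0]).read_text().upper())


def count_lines(inputs, outputs):
    n = len(Path(inputs[0]).read_text().splitlines())
    Path(outputs[0]).write_text(f"{n}\n")


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        reads, clean, counts = (
            str(Path(tmp, f)) for f in ("reads.txt", "clean.txt", "counts.txt")
        )
        Path(reads).write_text("acgt\nttga\n")
        # Declared in the "wrong" order on purpose: plan() sorts it out.
        jobs = [
            Job("count", [clean], [counts], count_lines),
            Job("clean", [reads], [clean], upper),
        ]
        first = run(jobs)       # everything runs, in dependency order
        second = run(jobs)      # nothing to do
        Path(counts).unlink()   # pretend the last step was lost
        third = run(jobs)       # only the missing step is redone
        return first, second, third, Path(counts).read_text()


if __name__ == "__main__":
    print(demo())
```

The third call is the “restart where the pipeline finished last time” behaviour described above: only the job with a missing output is run again.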
Honestly, adopting (1) and (2) will change your life. Do it.
3. Cloud computing
The third is optional, and also not new, but it is now so much easier to access. No matter how big your University cluster, nothing will match the size of the compute clouds of Amazon, Microsoft and Google. These things are vast, they run Netflix and Dropbox, and once you start using the cloud, you realise there are essentially no computational limits to your research.
What’s amazing about cloud is that the vast majority of it is sat around doing nothing most of the time, and companies sell off this spare capacity (as “spot” or “preemptible” instances, as opposed to full-price “on demand” ones) at a massive discount. The downside is that if there is a surge in demand, say from Netflix, then they may kill your jobs, but the upside is you may only pay 10-20% of the advertised cost. And if you use a workflow management system, it can pick up where you left off anyway, so…
Snakemake can submit jobs to Google Cloud (via Kubernetes) and Nextflow can submit to Amazon. The Broad Institute adopted Google Cloud for GATK workflows and got the cost down to $5 per sample. Remember that the next time you pay your University HPC bill. There are now essentially no computational limits to your research – when you realise that, it frees your mind and you begin to think much bigger than you did before.
And before you say “I don’t have a credit card”, I am here to tell you that both Amazon and Google will send monthly invoices. It’s all good.
There we have it. None of this is new and apologies if I sound patronising; but if you are a bioinformatician and you’re not using any of the above, please, I beg of you, take a few days out to learn about them all, and start using them. It will change your life, for the better. If you are a student or a post-doc looking for a job, these are the technologies you will need to talk about in your interview. The future is here – in fact it arrived a few years ago – the time to change is now.