Doing scientific computing is hard. Delivering results with fast, performant code is often the easy part: you know your tools and how to get results. Delivering your workflow to your target audience is where it gets tough. What happens if your clients want to re-run things themselves, on their own hardware? How do they configure your pipeline for a new problem set? What happens if they’ve never even used the command line before, much less understand what a server is? This post is more or less a “lessons learned” writeup on my approach to solving these kinds of workflow deployment problems. It’s by no means a perfect solution, but hopefully it will be useful to other groups struggling with the same issues.

This is a tall order - you need to provide:

  • Your analysis results.
  • The pipeline itself and all supporting code.
  • A foolproof method of deploying the software and the execution environment your pipeline requires.
  • The training and documentation required to run things start to finish. This is harder than it sounds - you can’t force your target audience to care enough to learn UNIX or the basics of programming (they’ve got other stuff to do, remember!).

You might say the last three are unnecessary (they’ve got the results, right?), but they’re actually the most important part! Once your clients can run the pipeline themselves, your job is done and you can move on to your next project!

For readers looking for the quick summary:

  • Jupyter and R notebooks work really well for displaying results (nothing new here…).
  • Snakemake works well for managing and scaling pipeline execution.
  • When deploying pipeline software, Git + Conda environments work well initially, but they do not age well. Unfortunately, there aren’t really any good solutions in this space right now (Docker containers won’t pass muster for security-conscious organizations).
  • Ideally, documentation gets done in the Git repository README.md/wiki, but hands-on training and follow ups are still a must. Your workflow needs to be as simple as possible to reproduce and run.

Delivering results

This is probably the easiest part (chances are you’ve done this before!). You need to deliver the actual result data files along with supporting plots and explanations. Personally, I find the best approach is a report that interleaves summary statistics/plots with explanations as they are generated by the pipeline. The easiest way to do this is with Jupyter notebooks or an R Markdown report. (I won’t provide a full walkthrough on how to use these tools here; check out their respective documentation pages.)

Either tool is great (I personally prefer R Markdown notebooks), but it’s important that you use them to document your workflow (where possible) and how each plot was generated. No one wants a folder full of plots with no explanation (aside from labeled axes). For each step, write down what you are doing and why, along with what each statistic means. As for your data, provide a description of what each output file contains and gzip it all up.
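As a rough sketch of what I mean, a single Jupyter cell might pair a summary table with a labeled plot, with a markdown cell above it explaining what is being measured (the file and column names below are made up for illustration):

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical per-sample QC table produced earlier in the pipeline
qc = pd.read_csv('qc_summary.csv')

# Summary statistics, shown inline in the notebook
print(qc.describe())

# A labeled plot sitting right next to the numbers it summarizes
qc.plot(x='sample', y='mean_quality', kind='bar', legend=False)
plt.ylabel('Mean base quality')
plt.title('Per-sample quality scores')
plt.show()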

All of that said, Jupyter/R notebooks aren’t all that great for heavy-duty data crunching. So what do you turn into a notebook and what can you leave as plain old scripts? Again, notebooks are there to explain your results: QC scripts, summary statistics, analysis conclusions, etc. Anything that will be read by someone else should go into a notebook if possible. Everything else can stay a script.

Creating a reproducible analysis pipeline

Your analysis needs to run itself, automatically, without human input. It’s not reproducible unless it can be run completely independently of your involvement. Your client should also be able to swap out the dataset for a new one, and your pipeline should update itself and handle the change in data appropriately. Ideally, this should all execute in parallel and take advantage of all available hardware.

There are a lot of different tools for this, but the one I’ve (relatively happily) settled on is Snakemake. Snakemake works much like GNU Make, where rules define how output files are created from input files. The main difference between the two is that Snakemake workflows are written in Python and support a lot of things GNU Make doesn’t (running on different OSes, submitting jobs to a cluster, etc.).

An example Snakemake rule to produce a FastQC report from an input FASTQ file might look like this. Notice how there are only three ingredients to a rule: an input, an output, and a shell command.

rule fastqc:
    input: '{sample}.fastq'
    output: '{sample}_fastqc.html'
    shell: 'fastqc {input}'
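
On its own, that rule only describes how to make one file. To get the “swap in a new dataset and everything updates” behavior, you also want a top-level target rule that expands over your sample names. A minimal sketch, assuming a hypothetical config.yaml with a samples: list, might look like this:

# config.yaml (hypothetical) contains e.g.:
# samples:
#   - sampleA
#   - sampleB
configfile: 'config.yaml'

rule all:
    input: expand('{sample}_fastqc.html', sample=config['samples'])

Adding a new sample is then just a one-line edit to config.yaml; Snakemake figures out which outputs are missing and runs only the rules needed to produce them.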

There are a few big advantages of Snakemake vs. other tools I tried:

  • Snakemake was probably the easiest pipelining software to get the hang of. You can more or less learn it in an afternoon.
  • It’s pure Python - anything Python can do, Snakemake can do as well.
  • Scaling up a pipeline is effortless. A serial workflow is identical to a parallel one - no changes needed. To submit jobs to a cluster, all you have to do is provide a cluster submission command and Snakemake will submit jobs for you (see the sketch after this list).
  • Workflows are really fast to write.
  • It’s really easy to produce a workflow diagram that shows exactly how a pipeline gets executed. This is really great for explaining to a professor or doctor how an analysis works.
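For example, the same Snakefile can be run locally or farmed out to a scheduler just by changing the invocation (the sbatch flags below are illustrative, and the exact options depend on your cluster and Snakemake version):

# Run locally on 24 cores
snakemake -j 24

# Run the same workflow on a SLURM cluster, up to 100 jobs at a time
snakemake -j 100 --cluster 'sbatch --mem=8G --cpus-per-task={threads}'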

And some disadvantages of Snakemake I’ve run into:

  • It’s not daemonized. There’s no easy way to keep a long-running Snakemake workflow going in the background besides just nohup-ing it (see the example after this list), which can be a little inconvenient.
  • No dynamic job execution (e.g. “if some file fails a quality check, run these extra steps afterwards”).
  • Personal experience has shown that it can’t be installed on Windows without a C++ compiler, which makes it a little harder to get onto Windows users’ computers. Still, this is better than no Windows support at all (which is the case for most other tools).
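For the first point, my usual workaround is just to background the run and log to a file, roughly like this (the log file name and core count are arbitrary):

nohup snakemake -j 24 > snakemake.log 2>&1 &
tail -f snakemake.log    # check on progress later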

All in all, after using Snakemake for several years, I think it’s a great tool for bioinformatics and data science use cases where analysis is done in a standard start-to-finish manner. Anything involving continuous job execution is probably not a good fit, such as rerunning an analysis with new data every hour or something like that. I have no serious regrets after using Snakemake and it’s a pretty great tool if you want to deliver outputs reproducibly and have other people understand the workflow (even non-technical types).

Deploying your pipeline with Conda

This is where things always get icky. You’ve got a great software environment and it runs the pipeline happily, but you want to get your client up and running too. After all, it isn’t “reproducible science” if they can’t re-run things and verify your results. Usually the hardest part of this is just installing all of the software on your clients’ computers.

Are you really responsible for installing software on clients’ computers? Honestly, yes. Even if you provide them with access to a system that has all of the software installed, at some point they will pick up a collaborator who needs to install it, or they’ll migrate systems. One way or another, you’re going to get an email asking how to install the software. So what’s the best way of making sure that goes smoothly?

There are three ways of getting a set of software packages installed and running on a new system. I’ll go through each of these in order:

  • Install every bit of your pipeline and all dependencies manually (oh god no.).
  • Use a containerization tool like Docker.
  • Use a reproducible software environment, like Anaconda.

Installing things by hand

Don’t do it. If it takes you two or three hours, it will take your non-technically inclined colleagues two or three weeks (and you’ll get a lot of “please help me” emails).

Using Docker/Singularity containers

Docker containers seem like an ideal way of deploying a new workflow. You can install all of your dependencies in a Docker container, and then have your clients run the analysis using that container. I think Docker containers are awesome, and I use them for integration testing or any automated tests that need to run against a web service or other special components.

Though Docker containers aren’t that fun to build, they make it really easy to reproduce a defined environment, which makes them perfect for workflow deployment. So what’s the catch?

To make a long story short, letting untrusted or semi-trusted users run Docker is a massive security hole. Any Docker container can root its host machine, and by the same token any user able to launch Docker containers has the equivalent of root access. If your pipeline needs additional resources like those on an HPC cluster or other shared system, chances are that your workflow will not be allowed to run. To use Docker containers in production, you need root access to the system you are running on - a requirement that is unlikely to pass muster for most research groups unless they own the infrastructure they run on.

Singularity is a nice alternative to Docker and solves most of its security issues. In fact, it has a “rootless” run mode that lets it run entirely as a normal user. The only two gotchas here are that Singularity still requires root privileges to install, and there are still some security issues being ironed out.

So to sum things up, Docker is great if you (and your clients) own the infrastructure and have been entrusted with sudo privileges. If not, Singularity is the way to go (though security issues still seem to crop up with it fairly frequently).
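For what it’s worth, the Singularity workflow can look almost identical to the Docker one - pull an existing Docker image and run your tools from it as a regular user (the image name and tag below are just an example):

# Convert a Docker image into a local Singularity image file
singularity pull fastqc.sif docker://quay.io/biocontainers/fastqc:0.11.9--0

# Run a pipeline step inside the container, no root required
singularity exec fastqc.sif fastqc sample.fastq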

Conda environments

There is, of course, a third option: instead of requiring special security privileges or installing things manually, why not just use Conda, Anaconda’s package manager? For those unfamiliar with it, Anaconda is an all-inclusive Python distribution. Though it used to ship just Python packages, Anaconda now ships more or less every piece of scientific software. Of particular interest to bioinformaticians is the Bioconda channel, which ships more or less every bioinformatics software package.

Conda works more or less like a Python virtualenv, except that instead of using pip install, you use conda install to install everything. To make a very long story short, I haven’t really found anything that isn’t conda-installable yet. Once all is said and done, you can export your conda environment to a YAML file with conda env export > environment-name.yml. To reproduce the environment, another user would run conda env create -f environment-name.yml and then source activate environment-name to load it. All in all, this reduces your entire software pipeline to a single YAML file. Just add this to a Git repository, stuff it on GitHub/Bitbucket/GitLab and you’re done. Reproducing the pipeline execution environment is just three lines:

git clone https://github.com/username/project-name.git
conda env create -f project-name.yml
source activate project-name

So what’s the catch? This seems a little too easy. I’ll say that this method of pipeline deployment worked really well initially. It did not age gracefully, however. After about a year of usage, some of my users began to report issues where certain dependency versions could not be found. As it turns out, Conda environment exports pin the exact version of every package and dependency, and Anaconda apparently stops shipping packages after a certain period of time, which means that fresh installs of an old environment file eventually break. After using Conda environments as my go-to solution for a lot of projects, the average time to first breakage (where you need to supply a new conda-env.yml file to users) is about a year.

I haven’t found a good way around this issue, aside from providing general instructions on how to re-create the environment by hand (“conda install this list of packages…”). This was really disappointing, because Conda environments seemed like a rather promising method for long-term software installs. Just add the environment.yml file to Git and call it a day, right? Unfortunately this only works for the first year or so, after which all bets are off.
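One way to express those “general instructions” in file form is to hand-maintain a minimal environment.yml that lists only your top-level packages with loose pins, rather than committing the fully pinned output of conda env export. It isn’t a real fix - you trade away exact version reproducibility - but a loose spec like this has a better chance of still resolving a year later. A sketch (the package list is purely illustrative):

name: project-name
channels:
  - bioconda
  - conda-forge
  - defaults
dependencies:
  - python=3.6
  - snakemake
  - fastqc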

All in all, my work so far leads me to believe that Conda environments are the go-to solution for short-term work. Despite the issues with environment longevity, Conda makes it so easy to install software that I think using it for your workflows is worth it. For long-term projects (years or more) you should invest in some form of containerization, along with all the security implications that go with it. The next time I do a serious data science/bioinformatics project, I’m probably going to have a long sit-down with Conda and see if I can find a solution to the environment aging problem, because I’d really like to use it for all my work, all the time.

Documenting your workflow

This has been a long blog post, so I’ll keep this short. For users to be able to re-run your workflows, they need instructions for doing so. In terms of raw documentation, this setup (Snakemake + Conda) generally boils down to only a few lines from installation to execution:

# install Miniconda from https://conda.io/miniconda.html
git clone https://github.com/your-pipeline.git
conda env create -f your-pipeline.yml
source activate your-pipeline
# example execution for 24 cpus, actual snakemake execution command will likely differ
snakemake -j 24

This is really easy to shove in a README.md on GitHub/Bitbucket/wherever. That said, I’ve found that most users will want an in-person training session where you walk them through the pipeline step by step (“drop your files here”, “now let’s run through the following commands”, etc.). There’s not really any way around this - you wouldn’t be performing the data analysis for them if they could do it themselves. snakemake --dag | dot -Tsvg > dag.svg is an incredibly useful command for producing a workflow diagram that shows your end user/data consumer how results are generated. If you are the only user and all that matters is your end results, the above installation instructions plus a list of dependencies is generally sufficient documentation for the future.

I don’t have any magic tricks here, but the above setup generally simplifies and automates workflow deployment and execution enough that the average end-user can run things themselves. All in all, the weakest point of this workflow is that Anaconda environments don’t age well - if that ever gets fixed, I’d have few regrets. Hopefully this was an informative read for those of you considering similar workflows.