highly-efficient & lightweight mutation signature matrix aggregation
- Users starred: 19
- Users forked: 3
- Users watching: 19
- Updated at: 2020-05-05 06:13:03
O be swift—
we have always known you wanted us.
(from 'The Helmsman' by Hilda Doolittle [1886 - 1961])
Helmsman is a utility for rapidly and efficiently generating mutation spectra matrices from massive next-generation sequencing datasets, for use in a wide range of mutation signature analysis tools. See Alexandrov et al., Cell Reports, 2013 for a detailed explanation of how mutation signature analysis methods work, and why they are useful in studying cancer genomes.
Currently, the majority of mutation signature analysis methods are implemented as R packages. To generate the mutation spectra matrix, these packages must read the entire dataset (containing information about every individual mutation), into memory. This is generally not a problem for datasets containing only a few thousand SNVs or a few dozen samples, but for much larger datasets, it is extremely easy to exceed the physical memory capacity of your machine. Depending on the dimensions of the data, even very small files can take a very long time to process in these R packages--sometimes over 30 minutes for a 3Mb file!
Other mutation signature analysis tools do not even provide these convenience functions, and leave it to the user to coerce their data into program-specific formats. Not only is this inconvenient, but it impedes users' ability to port their data between tools and take advantage of the unique features provided by different packages.
Helmsman aims to alleviate these performance barriers and lack of standardization. Helmsman was initially written to evaluate patterns of variation in massive whole-genome datasets, containing tens or hundreds of millions of SNVs observed in tens of thousands of individuals. Generating the mutation spectra matrix for such data is virtually impossible with R-based implementations, but Helmsman provides a very fast and scalable solution for analyzing these datasets.
If you plan to use the output of Helmsman in other mutation signature analysis packages, Helmsman can automatically generate a small R script with all the code necessary to read the mutation spectra matrix and format it for compatibility with existing tools, using functions from the musigtools package.
Helmsman includes several other convenient features, including:
- ability to pool samples together when generating the mutation spectra matrix
- aggregate data from multiple VCF files
- bare-bones (and really fast!) non-negative matrix factorization, to extract signatures without even relying on external packages
- able to run in parallel
Using Conda (recommended)
The easiest way to start using Helmsman is to create a Conda environment, based on the dependencies specified in the
git clone https://github.com/carjed/helmsman.git cd helmsman conda env create -n helmsman -f env.yml source activate helmsman
If you do not have Conda on your system, the prerequisites for Helmsman can also be installed with
pip inside of a python3 virtual environment (assuming you have
git clone https://github.com/carjed/helmsman.git cd helmsman virtualenv -p python3 helmsman_env source helmsman_env/bin/activate pip install -r pip_reqs.txt
It is also possible to forgo the virtual environment setup and use
pip to install the necessary dependencies in your global
site-packages directory, but this is not recommended as doing so may cause dependency conflicts between Helmsman and other programs/packages.
For more flexible deployment options, Helmsman is available as a Docker container. The following command will pull and run the preconfigured image from the Docker Hub:
docker run -d --name helmsman \ -v /path/to/local/data:/data \ # map directory containing input data -p 8888:8888 \ # expose jupyter notebook on port 8888 start-notebook.sh --NotebookApp.token='' \ # start with token disabled carjed/helmsman
You may also clone this repository and build the dockerfile locally, using the following commands:
git clone https://github.com/carjed/helmsman.git cd helmsman docker build -t latest --force-rm . docker run -d --name helmsman \ -p 8888:8888 \ start-notebook.sh --NotebookApp.token='' \ helmsman
Suppose we have a Variant Call format (VCF) file named
input.vcf, containing the genotypes of N individuals at each somatic mutation identified. You will also need the corresponding reference genome. With the following command, With the following command, Helmsman will parse the VCF file and write a file under
/path/to/output/ containing the Nx96 mutation spectra matrix:
python helmsman.py --input /path/to/input.vcf --fastafile /path/to/reference_genome.fasta --projectdir /path/to/output/
If you use Helmsman in your research, please cite our paper published in BMC Genomics:
Carlson J, Li JZ, Zöllner S. Helmsman: fast and efficient mutation signature analysis for massive sequencing datasets. BMC Genomics. 2018;19:845.