## Run MineSV demo at Biowulf

### Introduction

This tutorial guides users through installing and running *MineSV* under the Slurm workload manager at [Biowulf](https://hpc.nih.gov/docs/userguide.html). We are going to run three SV callers via *MineSV* (svaba, manta and gridss) on one set of normal/tumor paired next-generation sequencing data.

### Installation

#### Create the folder for this tutorial
```bash
export DEMO_DIR=/data/yourusername/MineSV_Demo_at_Biowulf
mkdir -p $DEMO_DIR; cd $_
```
#### Clone *MineSV* from GitHub

```bash
git clone https://github.com/NCI-CGR/MineSV
```

#### Add the annotation data set to MineSV

Due to GitHub's file size limit, the annotation component of MineSV has been deposited at Zenodo. This component needs to be added to the MineSV clone before running the Snakemake pipeline.

```bash
wget -O MineSV_annotation.tgz https://zenodo.org/record/5176148/files/MineSV_annotation.tgz?download=1
tar xvfz MineSV_annotation.tgz -C MineSV

tree MineSV/annotation
MineSV/annotation
├── genes
│   ├── b37
│   │   └── RefSeq.bed
│   ├── hg19
│   │   └── RefSeq.bed
│   └── hg38
│       └── RefSeq.bed
├── genomic_context
│   ├── b37
│   │   ├── RepeatMasker.bed
│   │   ├── SegDups.bed
│   │   └── Telo_Centro.bed
│   ├── hg19
│   │   ├── RepeatMasker.bed
│   │   ├── SegDups.bed
│   │   └── Telo_Centro.bed
│   └── hg38
│       ├── RepeatMasker.bed
│       ├── SegDups.bed
│       └── Telo_Centro.bed
├── human_b37.genome
├── human_hg19.genome
├── human_hg38.genome
└── public_datasets
    ├── b37
    │   ├── 1KG.bed
    │   ├── ClinGen.bed
    │   ├── ClinVar.bed
    │   └── DGV.bed
    ├── hg19
    │   ├── 1KG.bed
    │   ├── ClinGen.bed
    │   ├── ClinVar.bed
    │   └── DGV.bed
    └── hg38
        ├── 1KG.bed
        ├── ClinGen.bed
        ├── ClinVar.bed
        └── DGV.bed

12 directories, 27 files
```

#### Download the input sequencing data set from Zenodo

```bash
wget -O MineSV_Demo_SeqInput.zip https://zenodo.org/record/5217570/files/MineSV_Demo_SeqInput.zip?download=1
unzip MineSV_Demo_SeqInput.zip
```

#### Check the current status

```bash
tree -L 2
.
├── bam
│   ├── normal.bam
│   ├── normal.bam.bai
│   ├── tumor.bam
│   └── tumor.bam.bai
├── MineSV
│   ├── annotation
│   ├── config.yaml
│   ├── docs
│   ├── LICENSE
│   ├── modules
│   ├── README.md
│   ├── run_pipeline_at_biowulf.sh
│   ├── run_pipeline_at_ccad.sh
│   ├── scripts
│   ├── Snakefile_SV_scaffold
│   ├── SV_wrapper.sh
│   └── TEMPLATE_singularity_recipe
├── MineSV_annotation.tgz
├── MineSV_Demo_SeqInput.zip
└── refGenomes
    └── Homo_sapiens_chr22_assembly19.fasta

7 directories, 15 files
```

------------------------------

### Configure the *MineSV* run

#### Create the directory working_dir and prepare the input file

```bash
mkdir working_dir; cd $_

cat > input.txt << EOF
sample1 tumor.bam normal.bam
EOF
```

#### Create the file config.yaml

Users may copy the file config.yaml from the folder ../MineSV into the folder working_dir and make the necessary changes there. Briefly, we set the analysis mode to 'TN' for somatic SV calling from tumor-normal paired samples, and we use three SV callers in this particular demo: svaba, manta and gridss.
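One way to do so, shown here as a minimal sketch assuming the directory layout prepared above, is to copy the template from the clone and then edit the copy in place:

```bash
cd $DEMO_DIR/working_dir

# copy the template shipped with the MineSV clone, then adjust it in your editor of choice
cp ../MineSV/config.yaml .
```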
The revised config.yaml for this tutorial is as below:

```yaml
#### Basic analysis parameters ####
analysisMode: 'TN'   # TN, TO, germline, de_novo

callers:   # available callers: svaba, breakdancer, delly, manta
  - svaba
#  - breakdancer
#  - delly
  - manta
  - gridss

runMode:
  callAndAnnotate: yes
  callOnly: no
  annotateOnly: no

refGenome: '../refGenomes/Homo_sapiens_chr22_assembly19.fasta'
genomeBuild: 'b37'   # for MineSV annotation

#### Annotation parameters ####
annotateFile:   # only needed if runMode: annotateOnly
annotationParams:
  interchromPadding: 50
  insertionPadding: 500
  crossCallerOverlap: 0.7
  genomicContextOverlap: 0.7
  publicDataOverlap: 0.7

#### Samples to analyze ####
inFile: 'input.txt'

#### Directories ####
inDir: '../bam'
execDir: '../MineSV'
outDir: 'output'
logDir: 'output/logs'
tempDir: 'tmp'

#### Cluster parameters ####
clusterMode: " sbatch -p norm -e logs/slurm-%j.err -o logs/slurm-%j.out --cpus-per-task={threads} -t 200:00:00 --mem=40g "
latency: 600
maxNumJobs: 40
maxThreads:
  - 4
#  - 1
#  - 1
  - 4
  - 8
```

#### Copy the scripts SV_wrapper.sh and run_pipeline.sh

SV_wrapper.sh is a wrapper shell script that helps users launch the MineSV pipeline, and run_pipeline.sh is another helper script that submits SV_wrapper.sh itself as a Slurm job.

```bash
cd $DEMO_DIR/working_dir
cp ../MineSV/SV_wrapper.sh .
cp ../MineSV/run_pipeline_at_biowulf.sh run_pipeline.sh
```

------------------------------

### Run the MineSV Snakemake pipeline

#### Start an interactive session at Biowulf (~ 2 minutes)

Allocating the session may take a couple of minutes, and you will get a new shell prompt similar to "(base) [zhuw10@cn0853 working_dir]$ ", where cnXXXX indicates the compute node that has just been allocated to you (more details about sinteractive are available [here](https://hpc.nih.gov/docs/userguide.html#int)). We use the interactive session for the subsequent operations.

```bash
sinteractive --time=2160 --mem=8g --cpus-per-task=4
```

#### Generate the rulegraph plot of MineSV

```bash
cd $DEMO_DIR/working_dir
module load git/2.30.1 python/3.7 perl/5.24.3 snakemake/5.24.1 singularity/3.7.4 samtools/1.11
mkdir -p tmp/dummy output/logs logs

conf=config.yaml snakemake -s ../MineSV/Snakefile_SV_scaffold --rulegraph | dot -Tpng > dag.png
display dag.png   # Try it if you have set up X11 and ssh properly
```

![](../_static/dag.png)

#### Build indices for the new reference genome Homo_sapiens_chr22_assembly19.fasta (optional, ~15 minutes)

Building the reference genome indices is part of the MineSV pipeline and is run by default if any of the indices is missing. We carry out this step separately to demonstrate how the running details are wrapped by the scripts ***SV_wrapper.sh*** and ***run_pipeline.sh***.

```bash
conf=config.yaml snakemake --cores 10 -p --latency-wait 10 \
    -s ../MineSV/Snakefile_SV_scaffold --use-singularity \
    --singularity-args "-B tmp:/scratch,../refGenomes:/ref,../MineSV/:/exec" \
    --cluster " sbatch -p norm -e logs/slurm-%j.err -o logs/slurm-%j.out --cpus-per-task={threads} -t 200:00:00 --mem=8g " \
    -U bwa_index_ref fai_index_ref dict_index_ref
```

#### Run MineSV (>30 minutes)

The launch of the MineSV pipeline has been simplified by the use of the script run_pipeline.sh:

```bash
cd $DEMO_DIR/working_dir
. run_pipeline.sh
```

Users may get standard output similar to the following:

```console
[-] Unloading git 2.30.1 ...
[+] Loading git 2.30.1 ...
[-] Unloading python 3.7 ...
[+] Loading python 3.7 ...
[-] Unloading perl 5.24.3 on cn0853
[+] Loading perl 5.24.3 on cn0853
[-] Unloading snakemake 5.24.1
[+] Loading snakemake 5.24.1
[-] Unloading singularity 3.7.4 on cn0853
[+] Loading singularity 3.7.4 on cn0853
[-] Unloading samtools 1.11 ...
[+] Loading samtools 1.11 ...
Command run: sbatch -t 200:00:00 --export=ALL --mem=32g -p norm -o /data/zhuw10/repeat/MineSV_Demo_at_Biowulf/working_dir/SV_wrapper.sh.o%j --wrap='/data/zhuw10/repeat/MineSV_Demo_at_Biowulf/working_dir/SV_wrapper.sh /data/zhuw10/repeat/MineSV_Demo_at_Biowulf/working_dir/config.yaml'
21390543
```
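run_pipeline.sh submits SV_wrapper.sh itself as a Slurm job, and the individual Snakemake rules are then dispatched with the sbatch command configured in clusterMode. As a rough sketch using standard Slurm and shell commands (the file names follow the sbatch -o option and the logs/ directory used in clusterMode above), the run can be monitored like this:

```bash
cd $DEMO_DIR/working_dir

# list your pending and running jobs on the cluster
squeue -u $USER

# follow the standard output of the master job (written as SV_wrapper.sh.o<jobid>, per the sbatch -o option above)
tail -f SV_wrapper.sh.o*

# per-rule Slurm logs accumulate under logs/, as set in clusterMode in config.yaml
ls -lt logs/ | head
```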
------------------------------

### Inspect MineSV output files

We have already packed the MineSV output of this tutorial run into a zip file, together with the full MineSV pipeline, the sequencing input data and the corresponding reference genome. The zipped file is available at Zenodo (see the download command below).
```bash
export DEMO_FINAL=/data/yourusername/MineSV_Demo_Final
mkdir -p $DEMO_FINAL; cd $_

### It may take about 2 hours to download the dataset from Zenodo
wget https://zenodo.org/record/5213041/files/MineSV_Demo_at_Biowulf.zip?download=1 -O MineSV_Demo_at_Biowulf.zip
unzip MineSV_Demo_at_Biowulf.zip
```

#### The original output files from the SV callers

The original outputs from the SV callers are available in the caller-specific folders ***\*_TN*** under the working directory:

```bash
tree -L 2 MineSV_Demo_at_Biowulf/working_dir/*_TN/
MineSV_Demo_at_Biowulf/working_dir/gridss_TN/
├── gridss.tmp.sample1.vcf.gz.tbi
├── sample1.assembly.bam
├── sample1_full.vcf.gz
├── sample1_full.vcf.gz.tbi
├── sample1_somatic.vcf.gz
├── sample1_somatic.vcf.gz.tbi
├── sample1.vcf.gz
└── sample1_workingdir
    ├── gridss.full.20210812_093540.cn0882.47808.log
    ├── libsswjni.so
    ├── normal.bam.gridss.working
    ├── sample1.assembly.bam.gridss.working
    ├── sample1.vcf.gz.gridss.working
    └── tumor.bam.gridss.working
MineSV_Demo_at_Biowulf/working_dir/manta_TN/
├── filtered_bams
│   ├── sample1_N.bam
│   ├── sample1_N.bam.bai
│   ├── sample1_T.bam
│   └── sample1_T.bam.bai
└── sample1
    ├── results
    ├── runWorkflow.py
    ├── runWorkflow.py.config.pickle
    ├── workflow.error.log.txt
    ├── workflow.exitcode.txt
    └── workflow.warning.log.txt
MineSV_Demo_at_Biowulf/working_dir/svaba_TN/
└── calls
    ├── sample1.alignments.txt.gz
    ├── sample1.bps.txt.gz
    ├── sample1.contigs.bam
    ├── sample1.discordant.txt.gz
    ├── sample1.log
    ├── sample1.svaba.germline.indel.vcf
    ├── sample1.svaba.germline.sv.vcf
    ├── sample1.svaba.somatic.indel.vcf
    ├── sample1.svaba.somatic.sv.vcf
    ├── sample1.svaba.unfiltered.germline.indel.vcf
    ├── sample1.svaba.unfiltered.germline.sv.vcf
    ├── sample1.svaba.unfiltered.somatic.indel.vcf
    └── sample1.svaba.unfiltered.somatic.sv.vcf

9 directories, 31 files
```

#### SV comparison and annotation results from MineSV

Running in callAndAnnotate or callOnly mode creates the directory output/compare_and_annotate. This directory contains the files intrachromosomal_SVs_\<sample\>, which hold the superset of SVs from all callers. NOTE: this part is under development, and inter-chromosomal SVs are currently being integrated.

```bash
tree -L 1 MineSV_Demo_at_Biowulf/working_dir/output/compare_and_annotate/
MineSV_Demo_at_Biowulf/working_dir/output/compare_and_annotate/
├── gridss
├── interchromosomal_SVs_sample1
├── intrachromosomal_SVs_sample1
├── manta
└── svaba

3 directories, 2 files
```
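To get a quick feel for the results, here is a hedged sketch using standard command-line tools; it only touches files that appear in the directory listings above (the exact column layout of the intrachromosomal SV file is not described in this tutorial).

```bash
cd MineSV_Demo_at_Biowulf/working_dir

# peek at the cross-caller superset of intrachromosomal SVs and count its records
head output/compare_and_annotate/intrachromosomal_SVs_sample1
wc -l output/compare_and_annotate/intrachromosomal_SVs_sample1

# count the somatic SV records reported by individual callers (VCF header lines excluded)
zcat gridss_TN/sample1_somatic.vcf.gz | grep -vc '^#'
grep -vc '^#' svaba_TN/calls/sample1.svaba.somatic.sv.vcf
```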