Running the Pipeline

There are three phases to running the CGR GwasQcPipeline: configuration, pre-flight checks, and running snakemake or submitting to a cluster.

Creating Configuration

Before running the CGR GwasQcPipeline, we need to create the necessary configuration file (config.yml). We provide a command-line utility (cgr config) to help generate this file. To run this utility you need to provide a sample sheet (or CGR LIMS manifest). Please see Sample Sheet File for the sample sheet requirements. For more details on the available configuration options, see The config.yml.

cgr config

Create the CGR GwasQcPipeline’s configuration file (config.yml).

CGEMs/CCAD Users

CGR users on CGEMs/CCAD will probably want to run:

$ cgr config --cgems -s <path to lims manifest file>

This will create the default production run folder structure in GSA_Lab_QC/<project>/builds/QC_v#_######. This will also populate config.yml with CGEMs/CCAD file locations and naming patterns. If you do not want to create the production folder structure then you can use the --cgems-dev option instead of --cgems.

Other Users

Non-CGR users and CGR users on other systems will probably want to run:

$ cgr config \
      --sample-sheet <path to lims manifest file or sample sheet> \
      --project-name <my_project_name> \
      [--slurm-partition <partition_name>]

This will generate the config.yml file in the current working directory, with placeholders for reference files and user files. Slurm users can include the --slurm-partition option to specify the name of the queue to which your jobs will be submitted.
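To give a sense of what needs filling in, here is a rough sketch of the placeholder sections. Only the dotted setting names used elsewhere in this guide (workflow_params.*, user_files.*) are taken from the documentation; the surrounding layout and all values are illustrative placeholders:

```yaml
# Sketch of placeholder sections in config.yml.
# Key names under workflow_params and user_files come from this guide;
# every value is a placeholder you must replace before running pre-flight.
workflow_params:
  subject_id_column: <column name in your sample sheet>
  expected_sex_column: <column name in your sample sheet>
  case_control_column: <column name in your sample sheet>
user_files:
  idat_pattern: <pattern used to locate IDAT files>
  gtc_pattern: <pattern used to locate GTC files>
```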

Attention

Always review and update config.yml before each pipeline run. Then run cgr pre-flight to ensure proper configuration.

Warning

The sample sheet must exist and be readable. An error will be raised if it is not.

cgr config [OPTIONS]

Options

-s, --sample-sheet <sample_sheet>

Required. Path to a sample sheet or CGR LIMS manifest file.

--bpm-file <bpm_file>

Path to the Illumina BPM file used to generate the data.

--genome-build <genome_build>

The name of the human genome build to use.

Default:

GenomeBuild.hg37

Options:

hg37 | hg38

--project-name <project_name>

The project name to use for this QC run.

--slurm-partition <slurm_partition>

Name of the Slurm partition to which jobs will be submitted. This is required when running cgr submit --slurm.

-u, --include-unused-settings

Include unused settings in the config. To keep the config file tidy, we do not output non-required settings that are set to None. With this option, we will output all settings, which may be especially useful for new users.

Default:

False

--cgems

Create the folder structure for a production run on CGEMs/CCAD and use standard CGEMs/CCAD paths.

Default:

False

--cgems-dev

Use CGEMs/CCAD standard paths but create the config in the current working directory. This is particularly useful for testing.

Default:

False

-y

Don’t prompt during CGEMs/CCAD folder creation. Note: this option is only used with --cgems.

Default:

False

Pre-Flight File Checks

A fundamental design strategy of this project is to fail fast if there is a problem. A common source of problems is the input reference/user files. We provide the cgr pre-flight command, which tries to validate all input files to make sure that they are (1) present, (2) readable, and (3) complete. This command also creates cgr_sample_sheet.csv, which is required by the workflow. See below for more details about everything cgr pre-flight does.

Here is example output where checks pass:

Sample Sheet OK (sample-sheet-name.csv)
BPM OK (bpm-file-name.bpm)
VCF OK (vcf-file-name.vcf.gz)
VCF.TBI OK (tbi-file-name.vcf.gz.tbi)
Processing GTC files
  [#################################] 100%
7,231 GTC Files OK.

Here is example output with two missing GTC files:

# Missing a few GTC files
$ cgr pre-flight
Sample Sheet OK (sample-sheet-name.csv)
BPM OK (bpm-file-name.bpm)
VCF OK (vcf-file-name.vcf.gz)
VCF.TBI OK (tbi-file-name.vcf.gz.tbi)
Processing GTC files
  [#################################] 100%
There was a problem with these GTC Files:
  FileNotFound:
    - missing-gtc-file1.gtc
    - missing-gtc-file2.gtc

Attention

If the config.yml, sample sheet, or reference files have any issues you must fix them before continuing. If you are missing a few samples’ IDAT or GTC files you may decide to continue; the workflow will automatically exclude these samples.

cgr pre-flight

Check all input files to make sure they are readable and complete.

Included Checks. cgr pre-flight first checks the config.yml file and makes sure all config settings are valid. It then reads the sample sheet (or LIMS manifest file) and checks that all columns defined in the config are present. These include workflow_params.subject_id_column, workflow_params.expected_sex_column, and workflow_params.case_control_column. Next, it checks all reference files (BPM, VCF, TBI) and makes sure that they exist, are readable, and are complete. Finally, it will search for all IDAT and GTC files if user_files.idat_pattern and user_files.gtc_pattern are defined. Again, it makes sure all files exist, are readable, and are complete.

File Updates. This step also updates the config.yml file with the num_samples (from the sample sheet) and the num_snps (from the BPM file).
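For example, after a successful pre-flight, config.yml will contain fields like these (the num_samples value matches the example output above; num_snps is illustrative):

```yaml
# Fields updated in config.yml by cgr pre-flight (illustrative values)
num_samples: 7231    # counted from the sample sheet
num_snps: 700000     # counted from the BPM file
```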

Creates cgr_sample_sheet.csv. Finally, this step creates a normalized version of the sample sheet (or LIMS manifest). This includes all of the columns in the sample sheet as well as the following added columns:

  • Group_By_Subject_ID: a column with the subject ID to use for subject-level QC.

  • expected_sex: copies the expected sex column from the config.

  • case_control: copies the case control column from the config.

  • is_internal_control: a flag indicating if the sample is a CGR internal control (i.e., sVALID-001).

  • is_user_exclusion: a flag indicating if the sample was marked to be excluded in the config.

  • is_missing_idats: a flag indicating if the sample was missing an IDAT file.

  • is_missing_gtc: a flag indicating if the sample was missing its GTC file.

  • is_sample_exclusion: a flag indicating if the sample had missing IDAT or GTC files.

  • num_samples_per_subject: the number of samples per subject.

  • replicate_ids: a concatenated list of Sample_IDs for each subject.

  • cluster_group: group names used when running on a cluster, in the form cgroup#.
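As a quick sanity check after pre-flight, you can tally the flagged samples from cgr_sample_sheet.csv with standard command-line tools. The miniature sheet created below is a made-up stand-in (a real sheet carries all of your sample-sheet columns, and the True/False encoding of the flag values is an assumption):

```shell
#!/usr/bin/env bash
# Count samples flagged for exclusion in cgr_sample_sheet.csv.
# The file written here is a tiny illustrative stand-in; a real sheet
# has many more columns from your sample sheet / LIMS manifest.
cat > cgr_sample_sheet.csv <<'EOF'
Sample_ID,Group_By_Subject_ID,is_sample_exclusion
S001,SUBJ-1,False
S002,SUBJ-1,True
S003,SUBJ-2,False
EOF

# Look up the is_sample_exclusion column by name, then count flagged rows.
awk -F, '
NR == 1 { for (i = 1; i <= NF; i++) if ($i == "is_sample_exclusion") col = i; next }
$col == "True" { n++ }
END { print n " sample(s) excluded" }
' cgr_sample_sheet.csv
```

Looking the column up by name keeps the command robust even if the column order in the generated sheet differs.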

You will almost always run:

$ cgr pre-flight --threads 4

cgr pre-flight [OPTIONS]

Options

--config-file <config_file>

Path to the configuration file.

Default:

config.yml

--no-reference-files-check

Skip checks of reference files. Not suggested.

Default:

False

--no-user-files-check

Skip checks of user files. Not suggested.

Default:

False

--no-update-config

Do not update the config file. Not suggested.

Default:

False

-j, --threads <threads>

Number of threads to use when checking user files.

Default:

4

--cluster-group-size <cluster_group_size>

The number of samples to group together when running in cluster mode.

Default:

1000

Running the Workflow Locally

We use snakemake to orchestrate the CGR GwasQcPipeline. We provide the cgr snakemake command to simplify running snakemake.

cgr snakemake

A lightweight wrapper around snakemake.

The purpose of this wrapper is to allow interacting with snakemake directly, while hiding some of the implementation details. The key feature of this wrapper is to tell snakemake where the workflow’s snakefile exists. We do this by adding the -s option to your snakemake command:

snakemake -s <path to workflow install location> OPTIONS TARGETS

In addition, we add a few snakemake options for convenience if you did not specify them. First, the workflow requires conda, so we will always add --use-conda. Second, recent versions of snakemake no longer default to --cores 1 and will throw an error if you did not specify a value, so we will always add --cores 1 unless you provide a value.

So for example, to run the full workflow locally you will typically run:

cgr snakemake --cores 8 -k

This will be translated into:

snakemake -s <path to workflow install location> --cores 8 -k --use-conda

Instead of running the entire workflow, you can also run a specific sub-workflow:

cgr snakemake --cores 8 -k --subworkflow sample_qc

Args:

subworkflow (str):
    Specify which sub-workflow to run [default: None]. Must be one of [entry_points | contamination | sample_qc | subject_qc | delivery].

other-kwargs:
    All snakemake arguments will be passed to snakemake. See the snakemake documentation for possible options.

cgr snakemake [OPTIONS]

Running the Workflow on a Cluster

We provide the cgr submit command to easily submit to a cluster. We take advantage of snakemake’s cluster profile system to run on different cluster environments. For CGR users, we include cluster profiles for CGEMs/CCAD, CCAD2, and Biowulf. External users will need to create their own snakemake cluster profile.

cgr submit

Submit the CGR GwasQcPipeline to a cluster for execution.

The cgr submit command will create a submission script and submit it to the cluster. We create an optimized submission script for users of CGEMs/CCAD, CCAD2, and Biowulf. For other systems you will need to provide a snakemake cluster profile to tell snakemake how to use your system. If you are submitting to a Slurm cluster, you can use the generic Slurm cluster profile we provide by including the --slurm option in your submit command.

Users running on CGEMs/CCAD will typically run:

cgr submit --cgems

Users running on Biowulf will typically run:

cgr submit --biowulf

Users running on CCAD2 will typically run:

cgr submit --ccad2

Users running on other Slurm systems will typically run:

cgr submit --slurm

Users running with a custom cluster profile will typically run:

cgr submit \
    --profile <path to custom cluster profile> \
    --queue <name of the queue to submit main job> \
    --submission-cmd <tool used for submit such as sbatch or qsub>

Note

Sometimes it may be useful to edit the submission script before submitting it to the cluster. In that case you can add the --dry-run option and edit the generated file in .snakemake/GwasQcPipeline_submission.sh. You would then submit this script directly to your cluster (e.g., qsub .snakemake/GwasQcPipeline_submission.sh).

cgr submit [OPTIONS]

Options

--cgems, --no-cgems

Run the workflow using the CGEMs/CCAD cluster profile.

Default:

False

--biowulf, --no-biowulf

Run the workflow using the Biowulf cluster profile.

Default:

False

--ccad2, --no-ccad2

Run the workflow using the CCAD2 cluster profile.

Default:

False

--slurm, --no-slurm

Run the workflow using the generic Slurm cluster profile. This option requires that you specify the slurm_partition in the config.yml file.

Default:

False

--cluster-profile <cluster_profile>

Path to a custom cluster profile. See https://github.com/snakemake-profiles/doc.

--subworkflow <subworkflow>

Run a sub-workflow instead of the full workflow.

--time-hr <time_hr>

The walltime limit (in hours) for the main snakemake process. Ignored if using --cgems.

Default:

120

--queue <queue>

Name of the queue for the main process to use. This option is required if you are using --cluster-profile. CGEMs/CCAD, CCAD2, and Biowulf users may want to use this to override which queue to use for the main process.

--submission-cmd <submission_cmd>

Name of the command to use for submission (e.g., qsub, sbatch). This is required if you are using --cluster-profile. Ignored if using --cgems, --biowulf, or --ccad2.

--dry-run, --no-dry-run

Create the submission script but do not send it to the scheduler. The generated submission script is in .snakemake/GwasQcPipeline_submission.sh. This file can be edited and submitted directly to the cluster.

Default:

False

--notemp, --no-notemp

Do not delete temporary files. This option runs the workflow using snakemake --notemp.

Default:

False

--local-mem-mb <local_mem_mb>

The amount of memory (in MB) to use for the main snakemake process and local rules. A number of rules run alongside the main snakemake process instead of being submitted as new jobs. If you have a very large project (>50k samples), you may want to increase the amount of memory used for this job. Ignored if using --cgems.

Default:

8000

--local-tasks <local_tasks>

The number of threads for the main snakemake process and local rules. A number of rules run alongside the main snakemake process instead of being submitted as new jobs. If you have a very large project (>50k samples), you may want to increase the number of CPUs used for this job. Ignored if using --cgems.

Default:

4

--max-threads <max_threads>

The maximum number of threads a rule can request. If the pipeline is executed in cluster mode, rules requesting more threads will be scaled down to max-threads.

Default:

8

Note

External users on a Slurm or SGE cluster may just want to modify one of our profiles.

Attention

Biowulf users. You may need to adjust --time-hr, --local-mem-mb, and --local-tasks if your main job is getting killed by the cluster because of resource limits.

The submission script will create a log file ./gwas_qc_log.$JOB_ID that will have the status of your cluster submission. Logs for each submitted job can be found in ./logs/.