Working with Source Code

The CGR GwasQcPipeline source code is located in src/cgr_gwas_qc. This includes the GwasQcPipeline workflow, the cgr command line utility, and various helper functions.

.
├── src
│  └── cgr_gwas_qc
│     ├── __init__.py
│     ├── __main__.py          # Points to the ``cgr`` entry point
│     ├── cli/                 # All code for the ``cgr`` command line interface
│     ├── cluster_profiles/    # Cluster profiles for CGEMs and Biowulf
│     ├── conda.py             # Helpers to activate conda environments in a script
│     ├── config.py            # Configuration manager helper code ``cfg``
│     ├── exceptions.py        # Custom exceptions for nicer error handling
│     ├── models/              # Data models for ``config.yml``
│     ├── parsers/             # Various tools for parsing 3rd party files
│     ├── paths.py             # Helpers for working with paths
│     ├── reporting/           # Code for writing the QC report ``.docx``
│     ├── scripts/             # Legacy compare script
│     ├── testing/             # Helpers for testing
│     ├── typing.py            # Custom PathLike data type
│     ├── validators/          # Code to validate different file formats
│     ├── version.py           # Gets the workflow version
│     ├── workflow/            # The GwasQcPipeline workflow
│     └── yaml.py              # Helpers for working with YAML files

I will briefly discuss a few of the more important or complicated bits.

Configuration Management

src
└── cgr_gwas_qc
   ├── ...
   ├── config.py                   # The ``cgr_gwas_qc.config.ConfigMgr``
   ├── models/                     # Pydantic data models defining ``config.yml``
     ├── __init__.py
     └── config/
        ├── __init__.py           # Top Level ``config.yml``
        ├── reference_files.py    # Reference Files namespace
        ├── software_params.py    # Software Parameters namespace
        ├── user_files.py         # User Files namespace
        └── workflow_params.py    # Workflow Parameters
   ├── parsers/
     ├── sample_sheet.py          # LIMs Manifest and sample sheet parsers
   ├── ...

Configuration management is the most complicated aspect of the entire project. I wanted to create an easy-to-use system for all things configuration. The core of this system is cgr_gwas_qc.config.ConfigMgr. You will see this referred to as cfg throughout the workflow, tests, and some scripts. cfg provides easy access to the values stored in config.yml as a dictionary at cfg.config, to cgr_sample_sheet.csv as a data frame at cfg.ss, and to a number of helpers.

Essentially, I am trying to solve two problems with cfg.

1. Resolving file paths. snakemake makes many assumptions about the locations of inputs/outputs, snakefiles, conda environments, and scripts. Traditionally this works as expected when your workflow is in the current working directory. However, GwasQcPipeline lives in the install location, so we need to adjust the paths for snakemake. cfg has several helpers to do just that.
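The path-resolution idea can be sketched with plain pathlib. The helper below mirrors the spirit of the ConfigMgr.scripts classmethod documented later, but the code itself is an illustrative stand-in anchored on this file rather than the installed package:

```python
from pathlib import Path

# An installed package can locate its own files relative to __file__,
# rather than relative to snakemake's working directory. Here we anchor
# on this script itself as a stand-in for the package root.
SRC_DIR = Path(__file__).resolve().parent
WORKFLOW_DIR = SRC_DIR / "workflow"

def scripts(filename: str) -> str:
    """Prepend the absolute workflow/scripts directory to a file name."""
    return str(WORKFLOW_DIR / "scripts" / filename)

print(scripts("plink_merge.py"))
```

Because everything is derived from the package's own location, the resulting paths are absolute and independent of wherever snakemake happens to be invoked.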

class cgr_gwas_qc.config.ConfigMgr(root: pathlib.Path, user_config: pathlib.Path, sample_sheet_file: pathlib.Path)[source]

Manage all things config.

Example

>>> from cgr_gwas_qc import load_config
>>> cfg = load_config()  # must have ``config.yml`` and ``cgr_sample_sheet.csv`` in current directory.
CONDA_DIR: pathlib.Path = PosixPath('/home/runner/work/GwasQcPipeline/GwasQcPipeline/src/cgr_gwas_qc/workflow/conda')

The absolute path to the workflow/conda.

MODULE_DIR: pathlib.Path = PosixPath('/home/runner/work/GwasQcPipeline/GwasQcPipeline/src/cgr_gwas_qc/workflow/modules')

The absolute path to the workflow/modules.

SCRIPTS_DIR: pathlib.Path = PosixPath('/home/runner/work/GwasQcPipeline/GwasQcPipeline/src/cgr_gwas_qc/workflow/scripts')

The absolute path to the workflow/scripts.

SNAKEFILE: pathlib.Path = PosixPath('/home/runner/work/GwasQcPipeline/GwasQcPipeline/src/cgr_gwas_qc/workflow/Snakefile')

The absolute path to the workflow/Snakefile.

SRC_DIR: pathlib.Path = PosixPath('/home/runner/work/GwasQcPipeline/GwasQcPipeline/src/cgr_gwas_qc')

The absolute path to cgr_gwas_qc source code.

SUBWORKFLOW_DIR: pathlib.Path = PosixPath('/home/runner/work/GwasQcPipeline/GwasQcPipeline/src/cgr_gwas_qc/workflow/sub_workflows')

The absolute path to the workflow/sub_workflows.

TEMPLATE_DIR: pathlib.Path = PosixPath('/home/runner/work/GwasQcPipeline/GwasQcPipeline/src/cgr_gwas_qc/reporting/templates')

The absolute path to the reporting/templates.

WORKFLOW_DIR: pathlib.Path = PosixPath('/home/runner/work/GwasQcPipeline/GwasQcPipeline/src/cgr_gwas_qc/workflow')

The absolute path to the workflow source code.

classmethod conda(filename: str) → str[source]

Return path to a conda env file.

Given a conda env file_name, prepends the full path to that file.

property docx_template

Return the path to the docx template.

classmethod scripts(filename: str) → str[source]

Return the path to an internal script.

Given a script file_name, prepends the full path to that script.

classmethod subworkflow(workflow: str) → str[source]

Return the path to a sub-workflow.

Given a sub-workflow name, give the full path to the snakefile.

2. Using config/sample sheet values. A common pattern in snakemake workflows is to fill in wildcards of a file name pattern. We often want to use values stored in cgr_sample_sheet.csv to do this. The cfg object gives us access to both config.yml and cgr_sample_sheet.csv, and it provides an enhanced version of snakemake’s expand function to minimize the amount of code you have to type.
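The idea behind cfg.expand can be sketched with the stdlib alone. The sample sheet rows below are made up, and this toy expand takes a predicate instead of a pandas query string, but the wildcard-filling behavior is the same in spirit:

```python
# A toy stand-in for cfg.expand: fill a wildcard pattern once per sample
# sheet row (here a list of dicts instead of a pandas DataFrame).
sample_sheet = [
    {"Sample_ID": "SP001", "cluster_group": "cgroup1"},
    {"Sample_ID": "SP002", "cluster_group": "cgroup1"},
    {"Sample_ID": "SP003", "cluster_group": "cgroup2"},
]

def expand(file_pattern: str, query=None):
    """Format file_pattern per row, optionally filtering rows first."""
    rows = [r for r in sample_sheet if query is None or query(r)]
    return [file_pattern.format(**row) for row in rows]

print(expand("{Sample_ID}.txt"))
# Filter with a predicate instead of a pandas query string:
print(expand("{Sample_ID}.txt", query=lambda r: r["Sample_ID"] == "SP002"))
```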

class cgr_gwas_qc.config.ConfigMgr(root: pathlib.Path, user_config: pathlib.Path, sample_sheet_file: pathlib.Path)[source]

Manage all things config.

Example

>>> from cgr_gwas_qc import load_config
>>> cfg = load_config()  # must have ``config.yml`` and ``cgr_sample_sheet.csv`` in current directory.
property cluster_groups

List of the cluster group names from cgr_sample_sheet.csv.

Example

>>> cfg = load_config()
>>> cfg.cluster_groups
["cgroup1", "cgroup2"]
property config

The user’s config settings from config.yml.

Example

>>> from cgr_gwas_qc import load_config
>>> cfg = load_config()
>>> cfg.config.software_params.maf_for_ibd
0.2
expand(file_pattern: str, combination: Callable = <class 'zip'>, query: Optional[str] = None) → List[str][source]

Use sample sheet columns to fill in file pattern

Parameters:
  • file_pattern – A snakemake compatible file name pattern.

  • combination – The combinatorial method to use (see snakemake.rules.expand). Defaults to zip.

  • query – A pandas.DataFrame.query to use to filter before expansion. Defaults to None.

Returns:

A list of expanded file names.

Example

>>> cfg = load_config()  # assuming ``cgr_sample_sheet.csv`` contains 3 samples SP00{1,2,3}
>>> cfg.expand("{Sample_ID}.txt")
["SP001.txt", "SP002.txt", "SP003.txt"]
>>> cfg.expand("{Sample_ID}.txt", query="Sample_ID == 'SP002'")
["SP002.txt"]
property ss

The user’s sample sheet data.

Example

>>> cfg = load_config()
>>> cfg.ss.head(2)
Sample_ID   Subject_ID    ...
    SP001        SB001    ...
    SP002        SB002    ...

As discussed in The config.yml, the other part of the configuration system is the set of data models that describe config.yml. These models are located in cgr_gwas_qc.models.config. Here I am using pydantic to describe config.yml. Essentially, that means I create a very basic class to define each namespace. I was not sure if I was going to have other data models, which is why these live in the models/config subfolder. I chose the models terminology because it is commonly used in the web application space.

To add a new config option, you need to do the following:

  1. Decide which namespace the value should reside in.

  2. Add that value to the correct class in models/config.

  3. You can give the option a default value in the pydantic class (suggested for software_params and workflow_params), or you can add a default value to cli/config.py (suggested for reference_files and user_files).

  4. You may need to adjust tests if you require a value but do not provide a default.
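The namespace-per-class pattern from the steps above looks roughly like this. The real models use pydantic; this sketch uses stdlib dataclasses so it stays self-contained, and only maf_for_ibd (shown earlier) is a real option:

```python
from dataclasses import dataclass, field

@dataclass
class SoftwareParams:
    # Step 3: defaults in the model class are suggested for
    # software_params; maf_for_ibd mirrors the value shown earlier.
    maf_for_ibd: float = 0.2

@dataclass
class Config:
    # Each namespace in config.yml maps to its own small class.
    software_params: SoftwareParams = field(default_factory=SoftwareParams)

cfg_config = Config()
print(cfg_config.software_params.maf_for_ibd)  # 0.2
```

pydantic adds validation and YAML-friendly parsing on top of this basic shape, but the class-per-namespace layout is the same.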

GwasQcPipeline Workflow

src
└── cgr_gwas_qc
   ├── workflow
     ├── __init__.py
     ├── conda/                       # Definition of conda environments used by the workflow
       ├── eigensoft.yml
       ├── graf.yml
       ├── illuminaio.yml
       ├── king.yml
       ├── pandoc.yml
       ├── plink2.yml
       └── verifyidintensity.yml
     ├── modules/                     # Rules for different software packages
       ├── eigensoft.smk
       ├── graf.smk
       ├── plink.smk
       └── thousand_genomes.smk
     ├── scripts/                     # Custom python scripts used by the workflow
       ├── __init__.py
       └── ...
     ├── Snakefile                    # The main workflow, this is what is called when a sub-workflow is not specified
     └── sub_workflows/               # The sub-workflows
        ├── contamination.smk
        ├── delivery.smk
        ├── entry_points.smk
        ├── sample_qc.smk
        └── subject_qc.smk

The snakemake workflow lives in cgr_gwas_qc.workflow. When you run cgr snakemake or cgr submit without specifying a sub-workflow, you run workflow/Snakefile. If you use --subworkflow <name>, then you run the corresponding snakefile in workflow/sub_workflows.

Besides the sub-workflows, I also have a number of software modules located in workflow/modules. My goal here was to group all of the rules for each 3rd party software package to make maintenance easier. I also tried to follow the key programming principle: only do something once!

Note

There are a few exceptions to the rule that all 3rd party tools are in a module.

  • I had to embed a PLINK command in workflow/scripts/plink_merge.py.

  • I had to embed verifyidintensity in workflow/scripts/verifyidintensity.py.

  • I never created a King module, so all of that code is in workflow/sub_workflows/sample_qc.smk.

Almost all custom code is in cgr_gwas_qc.workflow.scripts. These scripts are designed to be called directly from snakemake using the script: directive. Some of them can also be run directly from the command line. I tried to use descriptive names, but in general it is probably easier to go to the corresponding sub-workflow and see which script is referenced.
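The dual entry-point pattern (snakemake's script: directive or the command line) typically looks like the sketch below. The script body is hypothetical, not one of the real workflow scripts:

```python
import sys
from pathlib import Path

def main(in_file: Path, out_file: Path) -> None:
    """Hypothetical script body: upper-case the input into the output."""
    out_file.write_text(in_file.read_text().upper())

if "snakemake" in globals():
    # Run via snakemake's ``script:`` directive, which injects a
    # ``snakemake`` object carrying the rule's inputs and outputs.
    main(Path(snakemake.input[0]), Path(snakemake.output[0]))
elif __name__ == "__main__" and len(sys.argv) == 3:
    # Run directly from the command line with explicit paths.
    main(Path(sys.argv[1]), Path(sys.argv[2]))
```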

QC Report Generation

src
└── cgr_gwas_qc
   ├── reporting
     ├── __init__.py
     ├── constants.py
     ├── qc_exclusions.py
     ├── sample_qc.py
     ├── subject_qc.py
     ├── templates
       ├── __init__.py
       ├── cgr_reference.docx
       ├── qc_report.md.j2
       └── summary_stats.txt.j2
     └── templating.py

One major change from the legacy workflow is that we moved the QC report to a markdown file that is converted to docx at runtime. The text of the QC report is in cgr_gwas_qc/reporting/templates/qc_report.md.j2. This file is in the jinja2 format. jinja2 is a templating framework; it is mostly used to dynamically build HTML pages, but here we use it to build our markdown report. It allows for value substitution, flow control (if/else), and loops. To build the QC report, we create a giant dictionary with all of the values and then pass this “payload” to jinja2.
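The payload idea can be sketched with stdlib string.Template; the real report uses jinja2, which adds flow control and loops on top of the same substitution model, and the payload key below is illustrative:

```python
from string import Template

# A tiny stand-in for rendering qc_report.md.j2: build one big
# "payload" dict of computed values, then substitute it into the template.
template = Template("# QC Report\n\nSamples analyzed: $n_samples\n")
payload = {"n_samples": 3}  # illustrative key, not from the real report
print(template.substitute(payload))
```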

The report is pretty complicated and has a lot of logic to generate all of the necessary counts. The workflow script workflow/scripts/qc_report.py uses helpers in the reporting folder to try to keep things clean. There are three major sections: Sample QC, Subject QC, and QC Exclusions (a.k.a. Table 4). The values in these three sections are calculated in reporting/sample_qc.py, reporting/subject_qc.py, and reporting/qc_exclusions.py, respectively.

Command Line Interface (CLI)

src
└── cgr_gwas_qc
   ├── __main__.py
   ├── cli
     ├── __init__.py
     ├── cgems_copy.py
     ├── config.py
     ├── pre_flight.py
     ├── snakemake.py
     ├── submit.py
     └── version.py

The cgr command line interface lives in cgr_gwas_qc/cli. Each of the sub-commands is in its own python file. I am using the python CLI framework typer, which is an easy way to build nice command line tools. To add a new command, create a new python script in cgr_gwas_qc/cli and register the command with cgr by adding it to cli/__init__.py.
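Registering a sub-command follows a pattern common to CLI frameworks. The sketch below uses stdlib argparse rather than typer so it stays self-contained, and the hello command is hypothetical:

```python
import argparse

# Stand-in for cli/__init__.py: each sub-command lives in its own
# module and is registered on the top-level parser.
parser = argparse.ArgumentParser(prog="cgr")
subparsers = parser.add_subparsers(dest="command")

def hello(args):  # hypothetical sub-command, not part of the real CLI
    return f"hello {args.name}"

hello_parser = subparsers.add_parser("hello")
hello_parser.add_argument("--name", default="world")
hello_parser.set_defaults(func=hello)

args = parser.parse_args(["hello", "--name", "QC"])
print(args.func(args))  # hello QC
```

typer builds the parser from type-annotated function signatures instead, but the register-a-function-per-command structure is the same.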

NIH Cluster Profiles

src
└── cgr_gwas_qc
   ├── cluster_profiles
     ├── __init__.py
     ├── biowulf
       ├── __init__.py
       ├── config.yaml
       ├── jobscript.sh
       ├── status.py
       └── submit.py
     ├── cgems
       ├── __init__.py
       ├── cgems_jobscript.sh
       ├── cgems_status.py
       ├── cgems_submit.py
       └── config.yaml
     ├── cluster.yaml
     └── snakemake.sh

snakemake provides a plugin architecture to tell it how to interact with your cluster environment (their docs). This consists of 3 major parts: a job-script template, a submission script, and a status-check script. For each rule that is submitted to the cluster, snakemake creates a job. The submission script is given this job; it can then pull out the job’s metadata and build a submission script using the job-script template. snakemake will run the status-check script repeatedly for each submitted job. The status script tells snakemake whether the job is “running”, “failed”, or “success”.
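A status-check script boils down to mapping the scheduler's job states onto the three strings snakemake understands. The SLURM state names below are real, but the mapping itself is an illustration, not the profiles' actual code:

```python
# Map scheduler job states onto the three statuses snakemake expects.
FAILED_STATES = {"FAILED", "TIMEOUT", "CANCELLED", "OUT_OF_MEMORY", "NODE_FAIL"}

def snakemake_status(scheduler_state: str) -> str:
    """Translate a SLURM-style job state into a snakemake status string."""
    state = scheduler_state.strip().upper()
    if state == "COMPLETED":
        return "success"
    if state in FAILED_STATES:
        return "failed"
    return "running"  # PENDING, RUNNING, and anything else still in flight

print(snakemake_status("COMPLETED"))  # success
```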

I created two cluster profiles, one for Biowulf and one for CGEMs/CCAD. Code that is shared by both profiles lives in cluster_profiles/__init__.py.

  • job-script template: cluster_profiles/biowulf/jobscript.sh and cluster_profiles/cgems/cgems_jobscript.sh

  • submission script: cluster_profiles/biowulf/submit.py and cluster_profiles/cgems/cgems_submit.py

  • status-check script: cluster_profiles/biowulf/status.py and cluster_profiles/cgems/cgems_status.py

When you run cgr submit --cgems or cgr submit --biowulf, we create a submission script for the main snakemake task, which is saved as .snakemake/GwasQcPipeline_submission.sh. To create this script we use the jinja2 template cluster_profiles/snakemake.sh.

Parsers and Validators

src
└── cgr_gwas_qc
   ├── parsers
     ├── __init__.py
     ├── bim.py
     ├── bpm.py
     ├── common.py
     ├── eigensoft.py
     ├── graf.py
     ├── illumina
       ├── __init__.py
       ├── adpc.py
       └── IlluminaBeadArrayFiles.py
     ├── king.py
     ├── plink.py
     ├── sample_sheet.py
     ├── vcf.py
     └── verifyidintensity.py
   ├── validators
     ├── __init__.py
     ├── bgzip.py
     ├── bpm.py
     ├── gtc.py
     ├── idat.py
     └── sample_sheet.py

This workflow uses a lot of different file types. Many of them are binary, while others are plain text in various formats. I created a series of parsers and validators for many of the different file types. I centralized them here, again to only do something once!
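A validator is often just a cheap sanity check on a file's header. The sketch below is not the real validators/bgzip.py, but bgzip files genuinely begin with the gzip magic bytes, so the check is in the same spirit:

```python
from pathlib import Path

GZIP_MAGIC = b"\x1f\x8b"  # gzip (and bgzip) files start with these bytes

def validate_gzip(path: Path) -> None:
    """Hypothetical validator: raise if the file lacks the gzip magic bytes."""
    with path.open("rb") as fh:
        if fh.read(2) != GZIP_MAGIC:
            raise ValueError(f"{path} is not gzip/bgzip compressed")
```

Failing fast with a clear exception at the start of the workflow is much cheaper than letting a malformed file crash a downstream tool.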

The file cgr_gwas_qc/parsers/illumina/IlluminaBeadArrayFiles.py was downloaded from Illumina’s github page. I wrote all of the other parsers/validators. See API Reference for more details.