Subject QC Sub-workflow

Workflow File:

https://github.com/NCI-CGR/GwasQcPipeline/blob/default/src/cgr_gwas_qc/workflow/sub_workflows/subject_qc.smk

Config Options: see The config.yml for more details

  • workflow_params.minimum_pop_subject

  • workflow_params.control_hwp_threshold

  • software_params.ibd_pi_hat_min

  • software_params.ibd_pi_hat_max

  • software_params.dup_concordance_cutoff

  • software_params.pi_hat_threshold

  • software_params.maf_for_ibd

  • software_params.maf_for_hwe

  • software_params.ld_prune_r2

  • software_params.autosomal_het_threshold

Major Outputs:

  • subject_level/subject_qc.csv

  • subject_level/samples.bed

  • subject_level/samples.bim

  • subject_level/samples.fam

  • subject_level/subjects.bed

  • subject_level/subjects.bim

  • subject_level/subjects.fam

  • subject_level/concordance.csv

  • subject_level/population_qc.csv

../_images/subject_qc.svg

Fig. 5 The subject-qc sub-workflow. This runs additional QC checks at the subject and population level.

Selecting Subject Representative

Major Outputs:

  • subject_level/subject_qc.csv

  • subject_level/samples.bed contains samples that are subject representatives

  • subject_level/samples.bim contains samples that are subject representatives

  • subject_level/samples.fam contains samples that are subject representatives

  • subject_level/subjects.bed renames samples to subject IDs

  • subject_level/subjects.bim renames samples to subject IDs

  • subject_level/subjects.fam renames samples to subject IDs

During the Sample QC Sub-workflow we build sample_level/sample_qc.csv. This table has a flag, is_subject_representative, that indicates whether a sample was selected to represent its subject. subject_level/subject_qc.csv is simply sample_level/sample_qc.csv restricted to subject representatives. We then extract these samples from sample_level/call_rate_2/samples.{bed,bim,fam} and rename them to subject IDs.
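The selection step can be pictured with a minimal sketch. It is not the pipeline's own code: the toy CSV values are invented, and it assumes that sample_qc.csv carries Sample_ID and Subject_ID columns and stores the flag as the text True/False.

```python
import csv
import io

# Toy stand-in for sample_level/sample_qc.csv (all values are invented).
SAMPLE_QC_CSV = """Sample_ID,Subject_ID,is_subject_representative
S1,SubjA,True
S2,SubjA,False
S3,SubjB,True
"""


def select_representatives(csv_text):
    """Return only the rows flagged as subject representatives."""
    rows = csv.DictReader(io.StringIO(csv_text))
    # Assumption: the flag is serialized as the strings "True"/"False".
    return [r for r in rows if r["is_subject_representative"].strip().lower() == "true"]


def sample_to_subject(rows):
    """Map each representative Sample_ID to the Subject_ID it is renamed to."""
    return {r["Sample_ID"]: r["Subject_ID"] for r in rows}


subject_qc = select_representatives(SAMPLE_QC_CSV)
rename_map = sample_to_subject(subject_qc)
```

Here subject_qc plays the role of subject_level/subject_qc.csv, and rename_map is the sample-to-subject lookup that would drive the renaming of the PLINK files.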

Population Level Analysis

Config Options: see The config.yml for more details

  • workflow_params.minimum_pop_subject

Major Outputs:

  • subject_level/<population>/subjects.bed

  • subject_level/<population>/subjects.bim

  • subject_level/<population>/subjects.fam

Here we split subjects into ancestral populations based on GRAF calls made during the Sample QC Sub-workflow.
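As a rough illustration of the split, the sketch below groups subjects by their ancestry call and drops small groups. It assumes that workflow_params.minimum_pop_subject is the smallest number of subjects a population needs in order to be analyzed, and that the comparison is inclusive; both are assumptions, and the function name and input shape are hypothetical.

```python
from collections import defaultdict


def split_by_population(ancestry_by_subject, minimum_pop_subject):
    """Group Subject_IDs by population and keep populations that are large enough.

    ancestry_by_subject: dict mapping Subject_ID -> population label.
    minimum_pop_subject: assumed minimum group size (inclusive) for a population.
    """
    groups = defaultdict(list)
    for subject_id, population in ancestry_by_subject.items():
        groups[population].append(subject_id)
    return {
        population: sorted(subject_ids)
        for population, subject_ids in groups.items()
        if len(subject_ids) >= minimum_pop_subject
    }


# Invented example: one population passes the size filter, the other does not.
example_calls = {"SubjA": "European", "SubjB": "European", "SubjC": "Asian"}
kept = split_by_population(example_calls, minimum_pop_subject=2)
```

Each key of the returned dict would correspond to one subject_level/<population>/ directory.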

Autosomal Heterozygosity

Config Options: see The config.yml for more details

  • software_params.autosomal_het_threshold

Major Outputs:

  • subject_level/<population>/subjects.het

  • subject_level/autosomal_heterozygosity_plots/<population>.png

Here we calculate Autosomal Heterozygosity separately for each population and generate the plots used in the QC report.
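The F coefficient reported in the .het output follows the standard method-of-moments form used by PLINK, F = (O_HOM - E_HOM) / (N_NM - E_HOM). The sketch below computes it from the three counts. How the pipeline applies software_params.autosomal_het_threshold is not stated here, so the flagging function is an assumption (absolute value of F compared against the threshold) and both function names are hypothetical.

```python
def f_coefficient(o_hom, e_hom, n_nm):
    """Method-of-moments inbreeding coefficient: (O_HOM - E_HOM) / (N_NM - E_HOM)."""
    denominator = n_nm - e_hom
    if denominator == 0:
        raise ValueError("N_NM equals E_HOM; F is undefined")
    return (o_hom - e_hom) / denominator


def exceeds_het_threshold(f, autosomal_het_threshold):
    """Assumed rule: flag a subject when |F| is larger than the threshold."""
    return abs(f) > autosomal_het_threshold


# Invented counts: 600 observed vs 500 expected homozygotes out of 1000 genotypes.
example_f = f_coefficient(o_hom=600, e_hom=500, n_nm=1000)
```

A positive F indicates more homozygotes than expected, and a negative F indicates excess heterozygosity.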

Subject Relatedness

Config Options: see The config.yml for more details

  • software_params.ibd_pi_hat_min

  • software_params.ibd_pi_hat_max

  • software_params.dup_concordance_cutoff

  • software_params.pi_hat_threshold

  • software_params.maf_for_ibd

  • software_params.ld_prune_r2

Major Outputs:

  • subject_level/<population>/relatives.csv

  • subject_level/<population>/related_subjects_to_remove.csv

Here we estimate relatedness (IBD) again, but separately for each population. We then prune subjects so that no two remaining subjects have a PI_HAT >= pi_hat_threshold.
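One way to reach a set in which no pair meets the threshold is a greedy removal, sketched below. This is an illustrative strategy only and is not necessarily the one the pipeline uses. The function name and the (subject, subject, PI_HAT) input shape are hypothetical.

```python
from collections import defaultdict


def subjects_to_remove(pairs, pi_hat_threshold):
    """Greedily pick subjects to drop until no related pair remains.

    pairs: iterable of (subject_1, subject_2, pi_hat) tuples.
    A pair counts as related when pi_hat >= pi_hat_threshold.
    """
    neighbors = defaultdict(set)
    for first, second, pi_hat in pairs:
        if pi_hat >= pi_hat_threshold:
            neighbors[first].add(second)
            neighbors[second].add(first)

    removed = []
    while any(neighbors.values()):
        # Drop the subject with the most remaining relatives; ties go to the
        # alphabetically first ID so the result is deterministic.
        worst = max(sorted(neighbors), key=lambda s: len(neighbors[s]))
        for other in neighbors.pop(worst):
            neighbors[other].discard(worst)
        removed.append(worst)
    return removed


# Invented pairs: B is related to both A and C; C and D are below the threshold.
example_pairs = [("A", "B", 0.5), ("B", "C", 0.3), ("C", "D", 0.05)]
to_remove = subjects_to_remove(example_pairs, pi_hat_threshold=0.2)
```

Removing the most connected subject first tends to keep more subjects overall than removing one member of each pair independently.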

Population Structure (PCA)

Config Options: see The config.yml for more details

  • software_params.maf_for_ibd

  • software_params.ld_prune_r2

Major Outputs:

  • subject_level/<population>/subjects_maf{maf_for_ibd}_ld{ld_prune_r2}.eigenvec

  • subject_level/pca_plots/<population>.png

Here we use EIGENSOFT fast PCA to identify population structure. We then plot a panel of pair-plots for the first six principal components.

Hardy Weinberg

Config Options: see The config.yml for more details

  • workflow_params.control_hwp_threshold

  • software_params.maf_for_hwe

Major Outputs:

  • subject_level/<population>/controls_unrelated_maf{maf_for_hwe}_snps_autosomes.hwe

Here we calculate Hardy Weinberg Equilibrium (HWE) and produce plots for all populations (from graf-pop) that have more than 50 controls. Populations with fewer than 50 controls but more than 50 cases + controls combined are also run.

The MAF threshold for computing HWE is minimum(software_params.maf_for_hwe, sqrt(5/n)), where n is the number of controls, or the number of cases + controls if the number of controls is < 50.
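The threshold rule can be written out as a small function. This is a sketch of the rule as described in this section, not the pipeline's code. The function name and the min_group_size parameter are hypothetical, and the handling of exactly 50 controls (treated here as falling back to cases + controls) is an assumption, since the text only covers the more-than and fewer-than cases.

```python
import math


def hwe_maf_threshold(maf_for_hwe, n_controls, n_cases, min_group_size=50):
    """Return the MAF threshold for HWE, or None if the population is skipped.

    n is the number of controls when there are more than min_group_size of
    them; otherwise cases + controls is used if that total is large enough.
    """
    if n_controls > min_group_size:
        n = n_controls
    elif n_controls + n_cases > min_group_size:
        n = n_controls + n_cases
    else:
        return None  # too few subjects to run HWE for this population
    return min(maf_for_hwe, math.sqrt(5 / n))


# Invented counts: with 8000 controls, sqrt(5/n) is smaller than maf_for_hwe.
large_population = hwe_maf_threshold(0.05, n_controls=8000, n_cases=0)
# With 20 controls and 105 cases, n = 125 and maf_for_hwe is the smaller value.
mixed_population = hwe_maf_threshold(0.05, n_controls=20, n_cases=105)
```

Because sqrt(5/n) shrinks as n grows, the configured maf_for_hwe is the value used for small groups, while sqrt(5/n) takes over once n exceeds 5 / maf_for_hwe^2.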

Subject QC Summary Tables

Major Outputs:

  • subject_level/population_qc.csv

Finally, all results are aggregated into the population level QC table.

Internal Population QC Report

Script: agg_population_qc_tables.py

Aggregate per population summary tables into a single table. Then add extra metadata from the sample qc table.

Output: population_level/population_qc.csv

name          dtype               description
population    string              The population name
Subject_ID    string              The subject identifier used by the workflow
Sample_ID     string              The sample identifier used by the workflow.
case_control  CASE_CONTROL_DTYPE  Phenotype status [Case | Control | QC | Unknown].
QC_Family_ID  string              An arbitrary ID assigned to each related set of subjects.
relatives     string              A list of related Subject_IDs concatenated together with a ‘|’.
PC1           float               Principal component 1
PC2           float               Principal component 2
PC3           float               Principal component 3
PC4           float               Principal component 4
PC5           float               Principal component 5
PC6           float               Principal component 6
PC7           float               Principal component 7
PC8           float               Principal component 8
PC9           float               Principal component 9
PC10          float               Principal component 10
O_HOM         int                 Observed number of homozygotes
E_HOM         int                 Expected number of homozygotes
N_NM          int                 Number of non-missing autosomal genotypes
F             float               Method-of-moments F coefficient estimate
