Subject QC Sub-workflow

Workflow File:

https://github.com/NCI-CGR/GwasQcPipeline/blob/default/src/cgr_gwas_qc/workflow/sub_workflows/subject_qc.smk

Config Options: see The config.yml for more details

  • workflow_params.minimum_pop_subject

  • workflow_params.control_hwp_threshold

  • software_params.ibd_pi_hat_min

  • software_params.ibd_pi_hat_max

  • software_params.dup_concordance_cutoff

  • software_params.pi_hat_threshold

  • software_params.maf_for_ibd

  • software_params.maf_for_hwe

  • software_params.ld_prune_r2

  • software_params.autosomal_het_threshold

Major Outputs:

  • subject_level/subject_qc.csv

  • subject_level/samples.bed

  • subject_level/samples.bim

  • subject_level/samples.fam

  • subject_level/subjects.bed

  • subject_level/subjects.bim

  • subject_level/subjects.fam

  • subject_level/concordance.csv

  • subject_level/population_qc.csv

../_images/subject_qc.svg

Fig. 5 The subject-qc sub-workflow. This runs additional QC checks at the subject and population level.

Selecting Subject Representative

Major Outputs:

  • subject_level/subject_qc.csv

  • subject_level/samples.bed contains samples that are subject representatives

  • subject_level/samples.bim contains samples that are subject representatives

  • subject_level/samples.fam contains samples that are subject representatives

  • subject_level/subjects.bed renames samples to subject IDs

  • subject_level/subjects.bim renames samples to subject IDs

  • subject_level/subjects.fam renames samples to subject IDs

During the Sample QC Sub-workflow we build sample_level/sample_qc.csv. This table has a flag, is_subject_representative, that indicates whether a sample was selected to represent its subject. subject_level/subject_qc.csv is simply sample_level/sample_qc.csv restricted to the subject representatives. We then extract these samples from sample_level/call_rate_2/samples.{bed,bim,fam} and rename them to subject IDs.
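The filtering and renaming step can be sketched roughly as follows. This is a minimal illustration and not the pipeline's own code: it assumes the sample QC table carries Sample_ID and Subject_ID columns next to is_subject_representative, and the function name and inline data are hypothetical.

```python
import csv
import io


def select_subject_representatives(sample_qc_csv):
    """Return only the rows flagged as subject representatives."""
    reader = csv.DictReader(io.StringIO(sample_qc_csv))
    return [
        row
        for row in reader
        if row["is_subject_representative"].strip().lower() == "true"
    ]


# Hypothetical sample-level table: two samples for subject SB1, one for SB2.
sample_qc = """Sample_ID,Subject_ID,is_subject_representative
S1,SB1,True
S2,SB1,False
S3,SB2,True
"""

subject_qc = select_subject_representatives(sample_qc)

# Mapping used to rename each representative sample to its subject ID.
sample_to_subject = {row["Sample_ID"]: row["Subject_ID"] for row in subject_qc}
```

The resulting mapping is the kind of lookup one would use to rename sample IDs to subject IDs in the .fam file.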

Population Level Analysis

Config Options: see The config.yml for more details

  • workflow_params.minimum_pop_subject

Major Outputs:

  • subject_level/<population>/subjects.bed

  • subject_level/<population>/subjects.bim

  • subject_level/<population>/subjects.fam

Here we split subjects into ancestral populations based on GRAF calls made during the Sample QC Sub-workflow.
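As a rough sketch of this split, the snippet below groups subjects by ancestry call and keeps only populations that are large enough. It assumes that workflow_params.minimum_pop_subject acts as an inclusive lower bound on population size; the function name, the population labels, and the example value are hypothetical.

```python
from collections import defaultdict


def split_by_population(ancestry, minimum_pop_subject):
    """Group subject IDs by ancestry call and drop populations that are too small.

    `ancestry` maps Subject_ID -> population label (assumed input shape).
    """
    groups = defaultdict(list)
    for subject_id, population in ancestry.items():
        groups[population].append(subject_id)
    # Assumption: a population is analyzed only if it has at least
    # `minimum_pop_subject` subjects.
    return {
        population: sorted(ids)
        for population, ids in groups.items()
        if len(ids) >= minimum_pop_subject
    }


# Hypothetical ancestry calls.
ancestry = {
    "SB1": "European",
    "SB2": "European",
    "SB3": "European",
    "SB4": "African",
}
populations = split_by_population(ancestry, minimum_pop_subject=2)
```

With these inputs only the European group is retained, because the African group has a single subject.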

Autosomal Heterozygosity

Config Options: see The config.yml for more details

  • software_params.autosomal_het_threshold

Major Outputs:

  • subject_level/<population>/subjects.het

  • subject_level/autosomal_heterozygosity_plots/<population>.png

Here we calculate Autosomal Heterozygosity separately for each population and generate the plots used in the QC report.
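The F coefficient reported in a PLINK-style .het file follows the method-of-moments formula F = (O_HOM - E_HOM) / (N_NM - E_HOM). The sketch below computes it and applies a cutoff. How the pipeline applies software_params.autosomal_het_threshold is an assumption here (flagging subjects whose absolute F exceeds the threshold), and the example numbers are invented.

```python
def inbreeding_coefficient(o_hom, e_hom, n_nm):
    """Method-of-moments F estimate: (O_HOM - E_HOM) / (N_NM - E_HOM)."""
    return (o_hom - e_hom) / (n_nm - e_hom)


def exceeds_het_threshold(f, autosomal_het_threshold):
    """Assumed rule: flag a subject when |F| is larger than the threshold."""
    return abs(f) > autosomal_het_threshold


# Invented example: 700 observed vs. 650 expected homozygous calls
# out of 1000 non-missing autosomal genotypes.
f = inbreeding_coefficient(o_hom=700, e_hom=650.0, n_nm=1000)
flagged = exceeds_het_threshold(f, autosomal_het_threshold=0.1)
```

A positive F indicates fewer heterozygous calls than expected, and a negative F indicates more than expected.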

Subject Relatedness

Config Options: see The config.yml for more details

  • software_params.ibd_pi_hat_min

  • software_params.ibd_pi_hat_max

  • software_params.dup_concordance_cutoff

  • software_params.pi_hat_threshold

  • software_params.maf_for_ibd

  • software_params.ld_prune_r2

Major Outputs:

  • subject_level/<population>/relatives.csv

  • subject_level/<population>/related_subjects_to_remove.csv

Here we estimate relatedness (IBD) again, but separately for each population. We then prune subjects so that no two remaining subjects have a PI_HAT >= pi_hat_threshold.
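One way to satisfy this constraint is a greedy pruning pass over the related pairs. The sketch below is an illustrative heuristic and is not claimed to be the pipeline's method: it repeatedly removes the subject with the most remaining relatives until no pair at or above the threshold is left. The function name and the example pairs are hypothetical.

```python
from collections import defaultdict


def prune_related(pairs, pi_hat_threshold):
    """Return the set of subjects to remove so that no related pair remains.

    `pairs` is an iterable of (subject_1, subject_2, pi_hat) tuples.
    """
    neighbors = defaultdict(set)
    for a, b, pi_hat in pairs:
        if pi_hat >= pi_hat_threshold:
            neighbors[a].add(b)
            neighbors[b].add(a)

    to_remove = set()
    while True:
        # Count remaining relatives for every subject that still has any.
        degree = {s: len(rel) for s, rel in neighbors.items() if rel}
        if not degree:
            break
        # Remove the most connected subject; ties go to the smallest ID
        # so that the result is deterministic.
        worst = max(sorted(degree), key=degree.get)
        to_remove.add(worst)
        for other in neighbors.pop(worst):
            neighbors[other].discard(worst)
    return to_remove


# Hypothetical IBD estimates: B is related to both A and C.
pairs = [("A", "B", 0.50), ("B", "C", 0.25), ("C", "D", 0.05)]
removed = prune_related(pairs, pi_hat_threshold=0.2)
```

Removing the most connected subject first tends to keep more subjects than dropping one member of each pair independently.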

Population Structure (PCA)

Config Options: see The config.yml for more details

  • software_params.maf_for_ibd

  • software_params.ld_prune_r2

Major Outputs:

  • subject_level/<population>/subjects_maf{maf_for_ibd}_ld{ld_prune_r2}.eigenvec

  • subject_level/pca_plots/<population>.png

Here we use EIGENSOFT fast PCA to identify population structure. We then plot a panel of pair-plots for the first six principal components.

Hardy Weinberg

Config Options: see The config.yml for more details

  • workflow_params.control_hwp_threshold

  • software_params.maf_for_hwe

Major Outputs:

  • subject_level/<population>/controls_unrelated_maf{maf_for_hwe}_snps_autosomes.hwe

Here we pull out only control subjects and then test each SNP for Hardy-Weinberg Equilibrium (HWE). We use only controls because cases may have SNPs that are out of HWE.
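To illustrate what an HWE test measures, the sketch below compares observed genotype counts at one SNP with the counts expected from the allele frequencies. It uses the textbook chi-square test with one degree of freedom as a stand-in, which is not necessarily the test the pipeline runs. The function name and the genotype counts are invented.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square HWE test (1 degree of freedom) from genotype counts.

    Returns (chi2, p_value).
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p
    observed = (n_aa, n_ab, n_bb)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Invented control genotype counts.
balanced = hwe_chi_square(25, 50, 25)  # matches HWE expectations exactly
no_hets = hwe_chi_square(50, 0, 50)    # strong deficit of heterozygotes
```

The first call yields a chi-square statistic of zero, while the second yields a very small p-value because no heterozygotes are observed.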

Subject QC Summary Tables

Major Outputs:

  • subject_level/population_qc.csv

Finally, all results are aggregated into the population-level QC table.

Internal Population QC Report

Script: agg_population_qc_tables.py

Aggregate per population summary tables into a single table. Then add extra metadata from the sample qc table.

Output: population_level/population_qc.csv

name          dtype               description
------------  ------------------  ---------------------------------------------------------------
population    string              The population name.
Subject_ID    string              The subject identifier used by the workflow.
Sample_ID     string              The sample identifier used by the workflow.
case_control  CASE_CONTROL_DTYPE  Phenotype status [Case | Control | QC | Unknown].
QC_Family_ID  string              An arbitrary ID assigned to each related set of subjects.
relatives     string              A list of related Subject_IDs concatenated together with a '|'.
PC1           float               Principal component 1
PC2           float               Principal component 2
PC3           float               Principal component 3
PC4           float               Principal component 4
PC5           float               Principal component 5
PC6           float               Principal component 6
PC7           float               Principal component 7
PC8           float               Principal component 8
PC9           float               Principal component 9
PC10          float               Principal component 10
O_HOM         int                 Observed number of homozygotes
E_HOM         int                 Expected number of homozygotes
N_NM          int                 Number of non-missing autosomal genotypes
F             float               Method-of-moments F coefficient estimate
