Subject QC Sub-workflow
- Workflow File:
Config Options: see The config.yml for more details
workflow_params.minimum_pop_subject
workflow_params.control_hwp_threshold
software_params.ibd_pi_hat_min
software_params.ibd_pi_hat_max
software_params.dup_concordance_cutoff
software_params.pi_hat_threshold
software_params.maf_for_ibd
software_params.maf_for_hwe
software_params.ld_prune_r2
software_params.autosomal_het_threshold
Major Outputs:
subject_level/subject_qc.csv
subject_level/samples.bed
subject_level/samples.bim
subject_level/samples.fam
subject_level/subjects.bed
subject_level/subjects.bim
subject_level/subjects.fam
subject_level/concordance.csv
subject_level/population_qc.csv
Selecting Subject Representative
Major Outputs:
subject_level/subject_qc.csv
subject_level/samples.bed: contains samples that are subject representatives
subject_level/samples.bim: contains samples that are subject representatives
subject_level/samples.fam: contains samples that are subject representatives
subject_level/subjects.bed: renames samples to subject IDs
subject_level/subjects.bim: renames samples to subject IDs
subject_level/subjects.fam: renames samples to subject IDs
During the Sample QC Sub-workflow we build sample_level/sample_qc.csv. This table has a flag, is_subject_representative, that indicates whether a sample was selected to represent its subject. subject_level/subject_qc.csv is simply sample_level/sample_qc.csv restricted to subject representatives. We then pull these samples out of sample_level/call_rate_2/samples.{bed,bim,fam} and rename them to subject IDs.
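The selection step described above can be sketched with the standard library alone. This is a minimal illustration, not the workflow's own code: the toy rows, the helper name select_subject_representatives, and the "True"/"False" text encoding of the flag are assumptions. Only the column names Sample_ID, Subject_ID, and is_subject_representative come from this documentation.

```python
import csv
import io


def select_subject_representatives(sample_qc_rows):
    """Keep only the rows flagged as subject representatives.

    Assumes the flag is stored as the text "True"/"False" (an assumption).
    """
    return [
        row
        for row in sample_qc_rows
        if row["is_subject_representative"].strip().lower() == "true"
    ]


# Toy stand-in for sample_level/sample_qc.csv (all values are made up).
sample_qc_csv = io.StringIO(
    "Sample_ID,Subject_ID,is_subject_representative\n"
    "SP0001,SB001,True\n"
    "SP0002,SB001,False\n"
    "SP0003,SB002,True\n"
)
subject_qc = select_subject_representatives(csv.DictReader(sample_qc_csv))

# Mapping that could drive the rename from sample IDs to subject IDs.
sample_to_subject = {row["Sample_ID"]: row["Subject_ID"] for row in subject_qc}
print(sample_to_subject)
```

The resulting mapping has one entry per subject, which is the information needed to relabel the sample-level files with subject IDs.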
Population Level Analysis
Config Options: see The config.yml for more details
workflow_params.minimum_pop_subject
Major Outputs:
subject_level/<population>/subjects.bed
subject_level/<population>/subjects.bim
subject_level/<population>/subjects.fam
Here we split subjects into ancestral populations based on GRAF calls made during the Sample QC Sub-workflow.
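As a rough sketch of the split, the snippet below groups subjects by their population label and drops small groups. It assumes, as the option name suggests but this page does not state, that workflow_params.minimum_pop_subject is a minimum group size and that the comparison is inclusive. The function name, population labels, and IDs are hypothetical.

```python
from collections import defaultdict


def split_by_population(subject_to_population, minimum_pop_subject):
    """Group subject IDs by population and drop groups that are too small.

    The inclusive ">=" comparison is an assumption, not documented behavior.
    """
    groups = defaultdict(list)
    for subject_id, population in subject_to_population.items():
        groups[population].append(subject_id)
    return {
        population: sorted(subject_ids)
        for population, subject_ids in groups.items()
        if len(subject_ids) >= minimum_pop_subject
    }


# Hypothetical ancestry calls keyed by subject ID.
calls = {
    "SB001": "European",
    "SB002": "European",
    "SB003": "Asian",
}
populations = split_by_population(calls, minimum_pop_subject=2)
print(populations)
```

Each surviving key would correspond to one subject_level/<population>/ directory.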
Autosomal Heterozygosity
Config Options: see The config.yml for more details
software_params.autosomal_het_threshold
Major Outputs:
subject_level/<population>/subjects.het
subject_level/autosomal_heterozygosity_plots/<population>.png
Here we calculate Autosomal Heterozygosity separately for each population and generate the plots used in the QC report.
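For orientation, the method-of-moments F coefficient reported in a .het file is derived from the observed and expected homozygote counts and the number of non-missing genotypes. The sketch below recomputes it from those three values. How the workflow applies software_params.autosomal_het_threshold is not described here, so the absolute-value comparison and the example threshold of 0.1 are assumptions for illustration only.

```python
def inbreeding_coefficient(o_hom, e_hom, n_nm):
    """Method-of-moments F: (O_HOM - E_HOM) / (N_NM - E_HOM)."""
    denominator = n_nm - e_hom
    if denominator == 0:
        raise ValueError("F is undefined when N_NM equals E_HOM")
    return (o_hom - e_hom) / denominator


def exceeds_threshold(f_coefficient, threshold):
    """Assumed rule: flag a subject when |F| is larger than the threshold."""
    return abs(f_coefficient) > threshold


# Made-up counts for a single subject.
f_value = inbreeding_coefficient(o_hom=700, e_hom=650.0, n_nm=1000)
print(f_value, exceeds_threshold(f_value, threshold=0.1))
```

A positive F indicates more homozygous calls than expected, and a negative F indicates fewer.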
Population Structure (PCA)
Config Options: see The config.yml for more details
software_params.maf_for_ibd
software_params.ld_prune_r2
Major Outputs:
subject_level/<population>/subjects_maf{maf_for_ibd}_ld{ld_prune_r2}.eigenvec
subject_level/pca_plots/<population>.png
Here we use EIGENSOFT fast pca to identify population structure. We then plot a panel of pair-plots for the first six principal components.
Hardy Weinberg
Config Options: see The config.yml for more details
workflow_params.control_hwp_threshold
software_params.maf_for_hwe
Major Outputs:
subject_level/<population>/controls_unrelated_maf{maf_for_hwe}_snps_autosomes.hwe
Here we calculate Hardy-Weinberg Equilibrium (HWE) and produce plots for every population (from graf-pop) that has more than 50 controls. Populations with fewer than 50 controls but more than 50 cases + controls are also run.
The MAF threshold for computing HWE is minimum(software_params.maf_for_hwe, sqrt(5/n)), where n is the number of controls, or the number of cases + controls when there are fewer than 50 controls.
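The rule above can be written out as a small function. This is a sketch of the stated formula, not the workflow's implementation: the function name, the min_subjects parameter (defaulting to the 50 mentioned in the text), and the example maf_for_hwe values are hypothetical. The text does not say what happens with exactly 50 controls; this sketch treats that case like "fewer than 50".

```python
import math


def hwe_maf_threshold(n_controls, n_cases, maf_for_hwe, min_subjects=50):
    """Return the MAF threshold for HWE, or None if the population is skipped.

    n is the number of controls when there are more than `min_subjects` of
    them, otherwise cases + controls when that sum is more than `min_subjects`.
    """
    if n_controls > min_subjects:
        n = n_controls
    elif n_controls + n_cases > min_subjects:
        n = n_controls + n_cases
    else:
        return None  # too few subjects to run HWE
    return min(maf_for_hwe, math.sqrt(5 / n))


# Enough controls: sqrt(5/500) = 0.1, so the configured value is the minimum.
print(hwe_maf_threshold(n_controls=500, n_cases=0, maf_for_hwe=0.05))
# Too few controls, but 80 cases + controls: sqrt(5/80) = 0.25.
print(hwe_maf_threshold(n_controls=20, n_cases=60, maf_for_hwe=0.3))
# Too few subjects overall: the population is skipped.
print(hwe_maf_threshold(n_controls=10, n_cases=10, maf_for_hwe=0.05))
```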
Subject QC Summary Tables
Major Outputs:
subject_level/population_qc.csv
Finally, all results are aggregated into the population-level QC table.
Internal Population QC Report
Script: agg_population_qc_tables.py
Aggregates the per-population summary tables into a single table, then adds extra metadata from the sample QC table.
Output: population_level/population_qc.csv
name          dtype               description
------------  ------------------  -----------
population    string              The population name
Subject_ID    string              The subject identifier used by the workflow
Sample_ID     string              The sample identifier used by the workflow
case_control  CASE_CONTROL_DTYPE  Phenotype status [Case|Control|QC|Unknown]
QC_Family_ID  string              An arbitrary ID assigned to each related set of subjects
relatives     string              A list of related Subject_IDs concatenated together with a ‘|’
PC1           float               Principal component 1
PC2           float               Principal component 2
PC3           float               Principal component 3
PC4           float               Principal component 4
PC5           float               Principal component 5
PC6           float               Principal component 6
PC7           float               Principal component 7
PC8           float               Principal component 8
PC9           float               Principal component 9
PC10          float               Principal component 10
O_HOM         int                 Observed number of homozygotes
E_HOM         int                 Expected number of homozygotes
N_NM          int                 Number of non-missing autosomal genotypes
F             float               Method-of-moments F coefficient estimate
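Because the relatives column packs several Subject_IDs into one string, a consumer of this table will usually want to split it back into a list. A minimal sketch follows. The helper name and the treatment of an empty field as "no relatives" are assumptions, and the IDs are made up.

```python
def parse_relatives(relatives):
    """Split a '|'-delimited relatives field into a list of Subject_IDs.

    An empty or missing field is treated as "no relatives" (an assumption).
    """
    if not relatives:
        return []
    return relatives.split("|")


print(parse_relatives("SB002|SB003"))
print(parse_relatives(""))
```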
References
cgr_gwas_qc.workflow.scripts.related_subjects