Sample QC Sub-workflow

Workflow File:

https://github.com/NCI-CGR/GwasQcPipeline/blob/default/src/cgr_gwas_qc/workflow/sub_workflows/sample_qc.smk

Config Options: see The config.yml for more details

  • reference_files.thousand_genome_vcf

  • reference_files.thousand_genome_tbi

  • software_params.sample_call_rate_1

  • software_params.snp_call_rate_1

  • software_params.sample_call_rate_2

  • software_params.snp_call_rate_2

  • software_params.intensity_threshold

  • software_params.contam_threshold

  • software_params.ibd_pi_hat_min

  • software_params.ibd_pi_hat_max

  • software_params.maf_for_ibd

  • software_params.ld_prune_r2

  • software_params.dup_concordance_cutoff

  • software_params.pi_hat_threshold

  • workflow_params.remove_contam

  • workflow_params.remove_rep_discordant

Major Outputs:

  • sample_level/sample_qc.csv

  • sample_level/snp_qc.csv

  • sample_level/concordance/KnownReplicates.csv

  • sample_level/concordance/InternalQcKnown.csv

  • sample_level/concordance/StudySampleKnown.csv

  • sample_level/concordance/UnknownReplicates.csv

  • sample_level/summary_stats.txt

  • sample_level/call_rate.png

  • sample_level/chrx_inbreeding.png

  • sample_level/ancestry.png

../_images/sample_qc.png

Fig. 4 The sample QC sub-workflow. This runs all major QC checks at the sample level.

Call Rate Filters

Config Options: see The config.yml for more details

  • software_params.sample_call_rate_1

  • software_params.snp_call_rate_1

  • software_params.sample_call_rate_2

  • software_params.snp_call_rate_2

Major Outputs:

  • sample_level/call_rate_1/samples.{bed,bim,fam} data set after level 1 sample/snp call rate filters.

  • sample_level/call_rate_2/samples.{bed,bim,fam} data set after level 2 sample/snp call rate filters.

First we run a series of call rate (a.k.a. completion) filters. Here we remove samples then SNPs which have high missingness in across samples.

Contamination Summary

Config Options: see The config.yml for more details

  • software_params.intensity_threshold

  • software_params.contam_threshold

Major Outputs:

  • sample_level/contamination/summary.csv aggregate summary table of contamination results.

If IDAT/GTC files are present and we ran Contamination Sub-workflow then we will create a summary table. For samples that have a low call rate and a median idat intensity < the intensity_threshold we will set %Mix to NaN. We then will flag a sample is_contaminated if the %Mix is >= contam_threshold.

Replicate Concordance

Config Options: see The config.yml for more details

  • software_params.ibd_pi_hat_min

  • software_params.ibd_pi_hat_max

  • software_params.maf_for_ibd

  • software_params.ld_prune_r2

  • software_params.dup_concordance_cutoff

  • software_params.pi_hat_threshold

  • workflow_params.remove_contam

  • workflow_params.remove_rep_discordant

Major Outputs:

  • sample_level/concordance/summary.csv

It is common to include multiple replicates for a subset of samples as a QC metric. Here we check these replicates and make sure that they are highly concordant. We report 3 methods for checking relatedness/concordance (PLINK IBD, GRAF, and KING). We mostly rely on IBD results but report the others in our summary tables. For all three methods, we do all pairwise comparisons to determine which samples are related/replicated. We flag two samples as concordant (is_ge_concordance) if concordance (proportion IBD2) > dup_concordance_cutoff. We flag two samples as related (is_ge_pi_hat) if PLINK’s PI_HAT is >= pi_hat_threshold.

Ancestry

Major Outputs:

  • sample_level/ancestry/graf_ancestry.txt

Finally, we estimate ancestry using GRAF. Here we categorize each sample into 1 of 9 possible groups (see GRAF for more details). We also report the percent mixture of the 3 major continental population (African, East Asian, European).

Sample QC Summary

Major Outputs:

  • sample_level/sample_qc.csv

  • sample_level/snp_qc.csv

  • sample_level/summary_stats.txt

  • sample_level/call_rate.png

  • sample_level/chrx_inbreeding.png

  • sample_level/ancestry.png

Finally, we aggregate all of the QC results. We generate two summary tables described below:

Internal Sample QC Report

Script: workflow/scripts/sample_qc_table.py

This script takes various outputs from the GwasQcPipeline and aggregates/summarizes results into an internal QC report. Here we use a functional approach to build up the QC report. Each file has it’s own function to parse/summarize its content. This makes it easier to add new components or change the behavior. Search for TO-ADD comments for where you would have to make modifications to add new components.

Output: sample_level/sample_qc.csv

name

dtype

description

Sample_ID

string

The sample identifier used by the workflow.

Group_By_Subject_ID

string

The subject identifier used by the workflow

num_samples_per_subject

UInt32

The total number of Sample_ID per Group_By_Subject_ID.

analytic_exclusion

boolean

True if the sample failed QC criteria.

num_analytic_exclusion

boolean

The number of QC criteria the sample failed.

analytic_exclusion_reason

string

A list of QC criteria the sample failed.

is_subject_representative

boolean

True if the Sample_ID is used to represent the subject.

subject_dropped_from_study

boolean

True if there are no samples passing QC for this subject.

case_control

CASE_CONTROL_DTYPE

Phenotype status [Case | Control | QC | Unknown].

is_internal_control

boolean

True if the sample is an internal control.

is_sample_exclusion

boolean

True if the sample is excluded in the config or missing IDAT/GTC files.

is_user_exclusion

boolean

True if the sample is excluded in the config

is_missing_idats

boolean

True if the IDAT files were missing during pre-flight.

is_missing_gtc

boolean

True if the IDAT files were missing during pre-flight.

Call_Rate_Initial

float

The Initial sample call rate before any filters.

Call_Rate_1

float

The sample call rate after the first filter.

Call_Rate_2

float

The sample call rate after the second filter.

is_cr1_filtered

boolean

True if a sample was removed by the first call rate filter.

is_cr2_filtered

boolean

True if a sample was removed by the second call rate filter.

is_call_rate_filtered

boolean

True if a sample was removed by either call rate filter.

IdatIntensity

float

The median IDAT intensity.

Contamination_Rate

float

The SNPweights contamination rate (%Mix)

is_contaminated

boolean

True if a sample exceed the contamination rate threshold.

replicate_ids

string

Concatenated Sample_IDs that are replicates.

is_discordant_replicate

boolean

True if the replicates had low concordance.

expected_sex

SEX_DTYPE

The expected sex from the provided sample sheet.

predicted_sex

SEX_DTYPE

The predicted sex based on X chromosome heterozygosity.

X_inbreeding_coefficient

float

The X chromosome F coefficient.

is_sex_discordant

boolean

True if expected and predicted sex did not match.

AFR

float

Proportion African ancestry.

EUR

float

Proportion European ancestry.

ASN

float

Proportion East Asian ancestry.

Ancestry

category

Assigned ancestral population (GRAF).

identifiler_needed

boolean

True if the lab needs to run Identifiler.

identifiler_reason

string

The QC problem that needs checked with Identifiler.

Internal SNP QC Report

Script: workflow/scripts/snp_qc_table.py

This script takes various outputs from the GwasQcPipeline and aggregates/summarizes results into an internal SNP level QC report.

Output: sample_level/snp_qc.csv

name

dtype

description

array_snp_id

string

The SNP ID as found in the BPM file.

thousand_genomes_snp_id

string

The rsID found in the 1000 Genomes VCF.

chromosome

string

The chromosome that the SNP is on.

snp_cr1

float

The SNP call rate results after the first filter.

snp_cr1_removed

boolean

True if the SNP was removed by the first call rate filter.

snp_cr2

float

The SNP call rate results after the second filter.

snp_cr2_removed

boolean

True if the SNP was removed by the second call rate filter.