.. _sample-qc:

Sample QC Sub-workflow
======================

**Workflow File**:
   https://github.com/NCI-CGR/GwasQcPipeline/blob/default/src/cgr_gwas_qc/workflow/sub_workflows/sample_qc.smk

**Config Options**: see :ref:`config-yaml` for more details

   - ``reference_files.thousand_genome_vcf``
   - ``reference_files.thousand_genome_tbi``
   - ``software_params.sample_call_rate_1``
   - ``software_params.snp_call_rate_1``
   - ``software_params.sample_call_rate_2``
   - ``software_params.snp_call_rate_2``
   - ``software_params.intensity_threshold``
   - ``software_params.contam_threshold``
   - ``software_params.ibd_pi_hat_min``
   - ``software_params.ibd_pi_hat_max``
   - ``software_params.maf_for_ibd``
   - ``software_params.ld_prune_r2``
   - ``software_params.dup_concordance_cutoff``
   - ``software_params.pi_hat_threshold``
   - ``workflow_params.remove_contam``
   - ``workflow_params.remove_rep_discordant``

**Major Outputs**:

   - ``sample_level/sample_qc.csv``
   - ``sample_level/snp_qc.csv``
   - ``sample_level/concordance/KnownReplicates.csv``
   - ``sample_level/concordance/InternalQcKnown.csv``
   - ``sample_level/concordance/StudySampleKnown.csv``
   - ``sample_level/concordance/UnknownReplicates.csv``
   - ``sample_level/summary_stats.txt``
   - ``sample_level/call_rate.png``
   - ``sample_level/chrx_inbreeding.png``
   - ``sample_level/ancestry.png``

.. figure:: ../static/sample_qc.*
   :name: fig-sample-qc-workflow

   The sample QC sub-workflow.
   This runs all major QC checks at the sample level.


Call Rate Filters
-----------------

**Config Options**: see :ref:`config-yaml` for more details

   - ``software_params.sample_call_rate_1``
   - ``software_params.snp_call_rate_1``
   - ``software_params.sample_call_rate_2``
   - ``software_params.snp_call_rate_2``

**Major Outputs**:

   - ``sample_level/call_rate_1/samples.{bed,bim,fam}`` data set after level 1 sample/snp call rate filters.
   - ``sample_level/call_rate_2/samples.{bed,bim,fam}`` data set after level 2 sample/snp call rate filters.

First we run a series of call rate (a.k.a. completion) filters.
Here we remove samples then SNPs which have high missingness in across samples.


Contamination Summary
---------------------

**Config Options**: see :ref:`config-yaml` for more details

   - ``software_params.intensity_threshold``
   - ``software_params.contam_threshold``

**Major Outputs**:

   - ``sample_level/contamination/summary.csv`` aggregate summary table of contamination results.

If IDAT/GTC files are present and we ran :ref:`contamination` then we will create a summary table.
For samples that have a low call rate and a median idat intensity < the intensity_threshold we will set %Mix to NaN.
We then will flag a sample ``is_contaminated`` if the %Mix is >= contam_threshold.

Replicate Concordance
---------------------

**Config Options**: see :ref:`config-yaml` for more details

   - ``software_params.ibd_pi_hat_min``
   - ``software_params.ibd_pi_hat_max``
   - ``software_params.maf_for_ibd``
   - ``software_params.ld_prune_r2``
   - ``software_params.dup_concordance_cutoff``
   - ``software_params.pi_hat_threshold``
   - ``workflow_params.remove_contam``
   - ``workflow_params.remove_rep_discordant``
   - ``workflow_params.concordance_tools``

**Major Outputs**:

   - ``sample_level/concordance/summary.csv``

It is common to include multiple replicates for a subset of samples as a QC metric.
Here we check these replicates and make sure that they are highly concordant.
The pipeline can check for relatedness/concordance using three tools : PLINK IBD, GRAF, and KING.
We only rely on and report IBD results in our summary tables.
To save compute resources on large datasets, the outputs for GRAF and KING relatedness checks can be disabled
by setting ``workflow_params.concordance_tools.graf`` and ``workflow_params.concordance_tools.king`` to `False`.
If PLINK IBD is not needed either, ``workflow_params.concordance_tools.plink`` can also be set to `False`.
In this case, PLINK IBD will be skipped. Since PLINK IBD is the main tool used for replicate checks,
columns related to replicate checks will be empty in the sample_qc table including `Expected Replicate Discordance` and `Unexpected Replicate`.
The workflow will successfully complete the sample_qc with all other QC metrics but not proceed to subject QC.
In the summary tables, we flag two samples as concordant (``is_ge_concordance``) if concordance (proportion IBD2) > dup_concordance_cutoff.
We flag two samples as related (``is_ge_pi_hat``) if PLINK's PI_HAT is >= pi_hat_threshold.

Ancestry
--------

**Major Outputs**:

   - ``sample_level/ancestry/graf_ancestry.txt``

Finally, we estimate ancestry using GRAF_.
Here we categorize each sample into 1 of 9 possible groups (see GRAF_ for more details).
We also report the percent mixture of the 3 major continental population (African, East Asian, European).

Sample QC Summary
-----------------

**Major Outputs**:

   - ``sample_level/sample_qc.csv``
   - ``sample_level/snp_qc.csv``
   - ``sample_level/summary_stats.txt``
   - ``sample_level/call_rate.png``
   - ``sample_level/chrx_inbreeding.png``
   - ``sample_level/ancestry.png``

Finally, we aggregate all of the QC results.
We generate two summary tables described below:

.. _sample-qc-table:

.. automodule:: cgr_gwas_qc.workflow.scripts.sample_qc_table

.. _snp-qc-table:

.. automodule:: cgr_gwas_qc.workflow.scripts.snp_qc_table