Configuration Files
Sample Sheet File
A sample sheet is a CSV formatted file where each row represents a sample and columns contain various metadata.
This file must contain a column named Sample_ID
which has a unique ID for each row.
You also need columns describing (1) the subject ID, which groups samples from the same individual, (2) the expected sex of the individual, and (3) the case control status of the individual.
See the Workflow Parameters section below for details about what values are allowed for subject ID, expected sex, and case control.
Note
Example of the basic required structure of a sample sheet.
You can have any other metadata as additional columns.
The only required column name is Sample_ID; the other columns can be named anything as long as they are referenced correctly in Workflow Parameters.
Sample_ID,Subject_ID,Expected_Sex,Case_Control
Samp0001,Sub0001,M,Case
Samp0002,Sub0001,M,Case
Samp0003,Sub0002,F,Control
Samp0004,Sub0003,F,QC
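The requirements above can be checked before running the pipeline. The following is a minimal sketch and not part of the pipeline itself: `check_sample_sheet` is a hypothetical helper that uses only the Python standard library to confirm that the Sample_ID column exists and that each ID appears once.

```python
import csv
import io
from collections import Counter


def check_sample_sheet(handle, id_column="Sample_ID"):
    """Return a list of problems found in a CSV sample sheet (empty if none)."""
    reader = csv.DictReader(handle)
    if id_column not in (reader.fieldnames or []):
        return [f"missing required column: {id_column}"]

    # Count how often each ID appears; every ID must appear exactly once.
    counts = Counter(row[id_column] for row in reader)
    return [f"duplicate {id_column}: {sid}" for sid, n in counts.items() if n > 1]


# The example sample sheet shown above, held in memory for illustration.
example = io.StringIO(
    "Sample_ID,Subject_ID,Expected_Sex,Case_Control\n"
    "Samp0001,Sub0001,M,Case\n"
    "Samp0002,Sub0001,M,Case\n"
    "Samp0003,Sub0002,F,Control\n"
    "Samp0004,Sub0003,F,QC\n"
)
print(check_sample_sheet(example))  # an empty list means no problems were found
```

Note that Subject_ID is allowed to repeat (Samp0001 and Samp0002 share Sub0001); only Sample_ID has to be unique.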
CGR LIMs Manifest File
The CGR LIMs manifest file is our internal way to distribute sample information.
If you use a Sample Sheet File then you can ignore this section.
The manifest file is an INI-like file with three sections (header, manifests, data).
During cgr pre-flight
we will pull out the data section.
This file should already have all of the required columns.
See the Workflow Parameters section below for details about what values are allowed for subject ID, expected sex, and case control.
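To illustrate what "pulling out the data section" of an INI-like file can look like, here is a rough sketch. It is not the code used by cgr pre-flight. The `[Header]`, `[Manifests]`, and `[Data]` markers, the function name `extract_section`, and the manifest contents are all assumptions made for the example.

```python
def extract_section(lines, section="Data"):
    """Collect the non-empty lines under a ``[section]`` marker."""
    wanted = f"[{section.lower()}]"
    collected = []
    inside = False
    for raw in lines:
        line = raw.strip()
        # Trailing commas are ignored on marker lines only; this is an
        # assumption for files exported from spreadsheets ("[Data],,,").
        marker = line.rstrip(",").lower()
        if marker.startswith("[") and marker.endswith("]"):
            inside = marker == wanted
            continue
        if inside and line:
            collected.append(line)
    return collected


# Made-up manifest contents, for illustration only.
manifest = [
    "[Header]",
    "Project Name,SR0001-001",
    "[Manifests]",
    "A,GSAMD-24v1-0",
    "[Data]",
    "Sample_ID,Group_By,Expected_Sex",
    "Samp0001,Sub0001,M",
]
for line in extract_section(manifest):
    print(line)
```

The lines returned for the data section form an ordinary CSV table, which is why a plain sample sheet can be used in place of the manifest.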
The config.yml
In Running the Pipeline, we created the config.yml
using Creating Configuration.
This file is central to running the CGR GwasQcPipeline.
Here we will describe the main configuration options.
The config file is broken down into different sections called “namespaces”.
I will walk through each section below and end this page with a full example.
Note
Required options without defaults are in bold.
Top Level Config
pipeline_version: v1.0.0
slurm_partition: defq
project_name: SR0001-001_1_0000000
sample_sheet: /path/to/manifest/file/SR0001-001_1_AnalysisManifest_0000000.csv
genome_build: hg37
snp_array: GSAMD-24v1-0
num_samples: 6336
num_snps: 700078
reference_files: ... # Reference file namespace
user_files: ... # User file namespace
software_params: ... # Software parameter namespace
workflow_params: ... # Workflow parameter namespace
Sample_IDs_to_remove:
|
||
type |
object |
|
properties |
||
|
Pipeline Version |
|
The version of the GwasQcPipeline to use. If you want to use a different version you may just edit this value to match. However, it is suggested that you re-run the entire pipeline in case there are differences between versions. |
||
type |
string |
|
default |
v1.5.1 |
|
|
Slurm Partition |
|
Name of the Slurm partition to which jobs will be submitted when using the --slurm-generic submit option. |
||
type |
string |
|
|
Project Name |
|
The title of the project to use during report generation. |
||
type |
string |
|
|
Sample Sheet |
|
The path to the sample sheet (or LIMs manifest). This is the file referenced during |
||
type |
string |
|
format |
path |
|
|
Genome Build |
|
The human genome build. This field is not actually used by the workflow. It is only used during |
|
type |
string; [hg37|hg38] |
|
default |
hg37 |
|
|
Snp Array |
|
Which SNP array was used. Only used for reporting. |
||
type |
string |
|
|
Num Samples |
|
Number of samples, automatically calculated from the sample sheet during |
||
type |
integer |
|
|
Num Snps |
|
Number of markers, automatically calculated from the |
||
type |
integer |
|
|
Reference Files |
|
Reference file namespace. Reference files include the Illumina provided BPM and the 1000 Genomes VCF. |
||
type |
namespace |
|
|
User Files |
|
User file namespace. User files include the user provided genotype data in IDAT/GTC, PED/MAP, or BED/BIM/FAM formats. |
||
type |
namespace |
|
|
Software Params |
|
Software parameter namespace. This includes all parameters passed to 3rd party and internal software and scripts. |
||
type |
namespace |
|
|
Workflow Params |
|
Workflow parameter namespace. This includes all parameters used to control workflow behavior. |
||
type |
namespace |
|
|
Sample Ids To Remove |
|
A list of Sample_IDs to exclude from QC. This is the easiest way for a user to exclude specific samples from the GwasQcPipeline. These samples will be flagged as |
||
type |
list of strings |
The config.yml
file contains all workflow configurations.
The top level section is automatically populated by cgr config
and cgr pre-flight
.
You should not need to edit this section unless you want to
(1) update the version of the workflow you are using, (2) change the project name, or (3) switch human references.
Reference Files
A list of reference files used by the pipeline.
reference_files:
  illumina_manifest_file: /path/to/bpm/file/GSAMD-24v1-0_20011747_A1.bpm
  thousand_genome_vcf: /path/to/thousand/genome/ALL.wgs.phase3_shapeit2_mvncall_integrated_v5.20130502.sites.vcf.gz
  thousand_genome_tbi: /path/to/thousand/genome/ALL.wgs.phase3_shapeit2_mvncall_integrated_v5.20130502.sites.vcf.gz.tbi
|
||
type |
object |
|
properties |
||
|
Illumina Manifest File |
|
Path to the Illumina provided BPM file. |
||
type |
string |
|
format |
path |
|
|
Illumina Cluster File |
|
Path to the array cluster EGT file. |
||
type |
string |
|
format |
path |
|
|
Thousand Genome Vcf |
|
Path to the 1000 Genomes VCF file. |
||
type |
string |
|
format |
path |
|
|
Thousand Genome Tbi |
|
Path to the corresponding index for the 1000 Genomes VCF file. |
||
type |
string |
|
format |
path |
If you are on CGEMS/CCAD
then the paths of the reference files are correctly populated by cgr config --cgems
or cgr config --cgems-dev
.
If you are on another system then you must provide the correct paths and versions of these files.
The illumina_manifest_file
is provided by Illumina.
The thousand_genome_vcf
and thousand_genome_tbi
files can be downloaded from the 1000 Genomes Project website.
The illumina_cluster_file
is the EGT file used to generate GTCs; this file is not required and is only referenced in the QC report if present.
Warning
The genome build needs to match illumina_manifest_file
, thousand_genome_*
, and genome_build
above.
Currently we only support hg37
and hg38
.
Though in reality as long as illumina_manifest_file
and thousand_genome_*
are from the same build then everything should work.
User Files
A list of user provided files or naming patterns.
user_files:
  output_pattern: '{prefix}/{file_type}.{ext}'
  idat_pattern:
    red: /example/pattern/wildcards/are/columns/in/sample_sheet_file/{Project}/{Sample_ID}_Red.idat
    green: /example/pattern/wildcards/are/columns/in/sample_sheet_file/{Project}/{Sample_ID}_Grn.idat
  gtc_pattern: /example/pattern/wildcards/are/columns/in/sample_sheet_file/{Project}/{Sample_ID}.gtc
or
user_files:
  output_pattern: '{prefix}/{file_type}.{ext}'
  ped: /path/to/samples.ped
  map: /path/to/samples.map
or
user_files:
  output_pattern: '{prefix}/{file_type}.{ext}'
  bed: /path/to/samples.bed
  bim: /path/to/samples.bim
  fam: /path/to/samples.fam
Note
The IDAT/GTC patterns, PED/MAP paths, and BED/BIM/FAM paths are mutually exclusive. You should only provide one set of patterns/paths.
|
||
type |
object |
|
properties |
||
|
Output Pattern |
|
File naming pattern for deliverable files. In general, you should not need to edit this. However, if you do decide to change this pattern you must keep the |
||
type |
string |
|
default |
{prefix}/{file_type}.{ext} |
|
|
Idat Pattern |
|
File naming pattern for IDAT files. There are two IDAT files for each sample (red and green). You need to provide a file naming pattern for each IDAT file. Wildcards are indicated by an |
||
type |
namespace |
|
|
Gtc Pattern |
|
File name pattern for GTC file. Wildcards are indicated by an |
||
type |
string |
|
|
Ped |
|
The full path to an aggregated PED file if the sample level GTC files are not available. |
||
type |
string |
|
format |
path |
|
|
Map |
|
The full path to an aggregated MAP file if the sample level GTC files are not available. |
||
type |
string |
|
format |
path |
|
|
Bed |
|
The full path to an aggregated BED file if the sample level GTC files are not available. |
||
type |
string |
|
format |
path |
|
|
Bim |
|
The full path to an aggregated BIM file if the sample level GTC files are not available. |
||
type |
string |
|
format |
path |
|
|
Fam |
|
The full path to an aggregated FAM file if the sample level GTC files are not available. |
||
type |
string |
|
format |
path |
The user_files
section has a number of different mutually exclusive configurations depending on the starting file types.
By default we assume you will be starting with IDAT and GTC files, though we also accept aggregated PED/MAP and aggregated BED/BIM/FAM files.
In the example, {Project}
and {Sample_ID}
will be filled by values from Project
and Sample_ID
columns in cgr_sample_sheet.csv
.
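As a rough illustration of how wildcards are filled from sample sheet columns, the sketch below uses Python string formatting. This is only an approximation of the idea: the pipeline's own expansion may differ, and `expand_pattern`, the row values, and the `/data/...` pattern are made up for the example.

```python
def expand_pattern(pattern, row):
    """Fill ``{column}`` wildcards in a file pattern from one sample sheet row."""
    # format_map raises KeyError if the pattern names a column the row lacks.
    return pattern.format_map(row)


row = {"Project": "ProjA", "Sample_ID": "Samp0001"}  # made-up sample sheet row
gtc_pattern = "/data/{Project}/{Sample_ID}.gtc"      # made-up pattern
print(expand_pattern(gtc_pattern, row))  # prints /data/ProjA/Samp0001.gtc
```

Every wildcard in a pattern therefore has to match a column name in the sample sheet exactly, including case.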
If the gtc_pattern
is given, then this will trigger the GTC entry point.
We will convert each sample’s GTC to a PED/MAP and then aggregate all samples and convert to a single BED/BIM/FAM at sample_level/samples.{bed,bim,fam}
.
If the PED/MAP files are given, then this will trigger the PED/MAP entry point, which converts these files to a single BED/BIM/FAM at sample_level/samples.{bed,bim,fam}
.
If the BED/BIM/FAM files are given, then this will trigger the BED/BIM/FAM entry point, which creates a symbolic link from your BED/BIM/FAM to sample_level/samples.{bed,bim,fam}
.
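The three entry points described above can be summarized in a short sketch. This is not the pipeline's own logic. `entry_point` is a hypothetical function that takes the `user_files` section as a plain dictionary and enforces the rule that exactly one input set is provided.

```python
def entry_point(user_files):
    """Name the entry point implied by the keys present in ``user_files``."""
    candidates = {
        "GTC": {"gtc_pattern"},
        "PED/MAP": {"ped", "map"},
        "BED/BIM/FAM": {"bed", "bim", "fam"},
    }
    # An input set counts as selected only when all of its keys have a value.
    selected = [
        name
        for name, keys in candidates.items()
        if all(user_files.get(key) for key in keys)
    ]
    if len(selected) != 1:
        raise ValueError(f"expected exactly one input set, found: {selected}")
    return selected[0]


print(entry_point({"ped": "/path/to/samples.ped", "map": "/path/to/samples.map"}))
# prints PED/MAP
```

Raising an error when zero or several sets are present mirrors the note that the IDAT/GTC patterns, PED/MAP paths, and BED/BIM/FAM paths are mutually exclusive.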
Software Parameters
Software parameters used by various tools in the workflow.
software_params:
  strand: top
  sample_call_rate_1: 0.8
  snp_call_rate_1: 0.8
  sample_call_rate_2: 0.95
  snp_call_rate_2: 0.95
  intensity_threshold: 6000
  contam_threshold: 0.1
  contam_population: AF
  ld_prune_r2: 0.1
  maf_for_ibd: 0.2
  maf_for_hwe: 0.05
  ibd_pi_hat_min: 0.12
  ibd_pi_hat_max: 1.0
  dup_concordance_cutoff: 0.95
  pi_hat_threshold: 0.2
  autosomal_het_threshold: 0.1
|
||
type |
object |
|
properties |
||
|
Strand |
|
Which Illumina strand to use for genotypes when converting GTC to plink {TOP, FWD, PLUS}. |
||
type |
string |
|
default |
top |
|
|
Sample Call Rate 1 |
|
Sample call rate filter 1 threshold. |
||
type |
number |
|
maximum |
1 |
|
exclusiveMinimum |
0 |
|
default |
0.8 |
|
|
Snp Call Rate 1 |
|
SNP call rate filter 1 threshold. |
||
type |
number |
|
maximum |
1 |
|
exclusiveMinimum |
0 |
|
default |
0.8 |
|
|
Sample Call Rate 2 |
|
Sample call rate filter 2 threshold. |
||
type |
number |
|
maximum |
1 |
|
exclusiveMinimum |
0 |
|
default |
0.95 |
|
|
Snp Call Rate 2 |
|
SNP call rate filter 2 threshold. |
||
type |
number |
|
maximum |
1 |
|
exclusiveMinimum |
0 |
|
default |
0.95 |
|
|
Intensity Threshold |
|
Median IDAT intensity threshold used to filter samples during the contamination check. |
||
type |
integer |
|
exclusiveMinimum |
0 |
|
default |
6000 |
|
|
Contam Threshold |
|
%Mix cutoff to consider a sample as contaminated. |
||
type |
number |
|
maximum |
1 |
|
exclusiveMinimum |
0 |
|
default |
0.1 |
|
|
Contam Population |
|
Which population from the 1000 Genomes project to use for B-allele frequencies during contamination testing. Can be one of {AF, EAS_AF, AMR_AF, AFR_AF, EUR_AF, SAS_AF}. |
||
type |
string |
|
default |
AF |
|
|
Ld Prune R2 |
|
The r-squared threshold for LD pruning of SNPs for use with IBD and replicate concordance. |
||
type |
number |
|
maximum |
1 |
|
exclusiveMinimum |
0 |
|
default |
0.1 |
|
|
Maf For Ibd |
|
The minor allele frequency threshold of SNPs for use with IBD and replicate concordance. |
||
type |
number |
|
maximum |
1 |
|
exclusiveMinimum |
0 |
|
default |
0.2 |
|
|
Maf For Hwe |
|
The minor allele frequency threshold of SNPs for use with population level HWE estimates. |
||
type |
number |
|
maximum |
1 |
|
exclusiveMinimum |
0 |
|
default |
0.05 |
|
|
Ibd Pi Hat Min |
|
The minimum IBD pi hat value to save in the results table. |
||
type |
number |
|
exclusiveMaximum |
1 |
|
minimum |
0 |
|
default |
0.12 |
|
|
Ibd Pi Hat Max |
|
The maximum IBD pi hat value to save in the results table. |
||
type |
number |
|
maximum |
1 |
|
exclusiveMinimum |
0 |
|
default |
1.0 |
|
|
Dup Concordance Cutoff |
|
The concordance threshold to consider two samples as replicates. |
||
type |
number |
|
maximum |
1 |
|
exclusiveMinimum |
0 |
|
default |
0.95 |
|
|
Pi Hat Threshold |
|
The pi hat threshold to consider two samples as related. The default of 0.2 reports 1st and 2nd degree relatives. |
||
type |
number |
|
maximum |
1 |
|
exclusiveMinimum |
0 |
|
default |
0.2 |
|
|
Autosomal Het Threshold |
|
The autosomal heterozygosity F coefficient threshold at which to flag subjects for removal. |
||
type |
number |
|
maximum |
1 |
|
exclusiveMinimum |
0 |
|
default |
0.1 |
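The ranges listed in the table above can be checked by hand before a run. The sketch below is one way to do that in plain Python. `check_software_params` is a hypothetical helper, not the pipeline's own validation, and it covers only the numeric bounds documented above.

```python
def check_software_params(params):
    """Return messages for values that fall outside the documented ranges."""
    # Parameters documented with exclusiveMinimum 0 and maximum 1.
    unit_interval = [
        "sample_call_rate_1", "snp_call_rate_1", "sample_call_rate_2",
        "snp_call_rate_2", "contam_threshold", "ld_prune_r2", "maf_for_ibd",
        "maf_for_hwe", "ibd_pi_hat_max", "dup_concordance_cutoff",
        "pi_hat_threshold", "autosomal_het_threshold",
    ]
    problems = []
    for name in unit_interval:
        if name in params and not 0 < params[name] <= 1:
            problems.append(f"{name} must be in (0, 1]")
    # ibd_pi_hat_min is documented with minimum 0 and exclusiveMaximum 1.
    if "ibd_pi_hat_min" in params and not 0 <= params["ibd_pi_hat_min"] < 1:
        problems.append("ibd_pi_hat_min must be in [0, 1)")
    # intensity_threshold is documented as an integer greater than 0.
    value = params.get("intensity_threshold")
    if value is not None and (not isinstance(value, int) or value <= 0):
        problems.append("intensity_threshold must be a positive integer")
    return problems


print(check_software_params({"snp_call_rate_2": 1.5, "ibd_pi_hat_min": 0.12}))
# prints ['snp_call_rate_2 must be in (0, 1]']
```

Parameters that are absent from the dictionary are skipped, on the assumption that the documented defaults apply to them.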
Workflow Parameters
This set of parameters controls which parts of the workflow are run and how.
workflow_params:
  subject_id_column: Group_By
  expected_sex_column: Expected_Sex
  sex_chr_included: true
  case_control_column: Case/Control_Status
  remove_contam: true
  remove_rep_discordant: true
  minimum_pop_subjects: 50
  control_hwp_threshold: 50
  lims_upload: true
  lims_output_dir: /DCEG/CGF/Laboratory/LIMS/drop-box-prod/gwas_primaryqc/
  case_control_gwas: false
  time_start:
|
||
type |
object |
|
properties |
||
|
Subject Id Column |
|
The column in your sample sheet that contains the subject ID. We expect that there may be multiple samples (rows) that have the same subject ID. If there are multiple columns that you want to use for subject ID, you can create a special column named |
||
type |
string |
|
default |
Group_By |
|
|
Expected Sex Column |
|
The name of the column in the sample sheet which identifies expected sex of samples. Allowed values in this column are [ |
||
type |
string |
|
default |
Expected_Sex |
|
|
Sex Chr Included |
|
True if the sex chromosome is included in the microarray and a sex concordance check can be performed. |
||
type |
boolean |
|
default |
True |
|
|
Case Control Column |
|
The name of the column in the sample sheet which identifies Case/Control status. Allowed values in this column are [ |
||
type |
string |
|
default |
Case/Control_Status |
|
|
Remove Contam |
|
True if you want to remove contaminated samples before running the subject level QC. |
||
type |
boolean |
|
default |
True |
|
|
Remove Rep Discordant |
|
True if you want to remove discordant replicates before running the subject level QC. |
||
type |
boolean |
|
default |
True |
|
|
Minimum Pop Subjects |
|
The minimum number of samples needed to use a population for population level QC (PCA, Autosomal Heterozygosity). |
||
type |
integer |
|
exclusiveMinimum |
0 |
|
default |
50 |
|
|
Control Hwp Threshold |
|
The minimum number of control samples needed to use a population for population level QC (HWE). |
||
type |
integer |
|
exclusiveMinimum |
0 |
|
default |
50 |
|
|
Lims Upload |
|
For |
||
type |
boolean |
|
default |
True |
|
|
Lims Output Dir |
|
type |
string |
|
default |
/DCEG/CGF/Laboratory/LIMS/drop-box-prod/gwas_primaryqc/ |
|
|
Case Control Gwas |
|
If true, a PLINK logistic regression GWAS will be performed using the case_control phenotype. |
||
type |
boolean |
|
default |
False |
|
|
Time Start |
|
Date and time at which the workflow starts. This creates a unique id for the run. |
||
type |
string |
|
default |
20240604154246 |
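The time_start values shown on this page (for example 20240604154246) look like a compact timestamp. Assuming the layout is year, month, day, hour, minute, second, which is an inference from the examples and not something stated above, such a value can be produced and read back as in this sketch. The names `make_time_start` and `parse_time_start` are hypothetical.

```python
from datetime import datetime

# Assumed layout: four-digit year, then month, day, hour, minute, second.
TIME_FORMAT = "%Y%m%d%H%M%S"


def make_time_start(now=None):
    """Format a datetime (default: the current time) as a run identifier."""
    return (now or datetime.now()).strftime(TIME_FORMAT)


def parse_time_start(value):
    """Convert a time_start string back into a datetime."""
    return datetime.strptime(value, TIME_FORMAT)


print(parse_time_start("20240604154246"))  # prints 2024-06-04 15:42:46
```

In YAML the value is best kept quoted, as in the full example on this page, so that it is read as a string and not as an integer.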
Sample IDs to Remove
This is an optional section where you can list any Sample_ID
that you do not want to include in the QC run.
These samples will be indicated as is_user_exclusion = True
in the sample level QC table.
Sample_IDs_to_remove:
- Sample0001
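The effect of this list can be pictured with a small sketch. It only mimics the flagging described above and is not the pipeline's implementation. `flag_user_exclusions` is a hypothetical name and the rows are reduced to two fields.

```python
def flag_user_exclusions(sample_ids, ids_to_remove):
    """Build minimal QC rows that mark samples listed for removal."""
    excluded = set(ids_to_remove or [])  # an empty or missing list excludes nothing
    return [
        {"Sample_ID": sid, "is_user_exclusion": sid in excluded}
        for sid in sample_ids
    ]


rows = flag_user_exclusions(["Sample0001", "Sample0002"], ["Sample0001"])
for row in rows:
    print(row)
```

Here Sample0001 is flagged with is_user_exclusion set to True while Sample0002 is left unflagged.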
Full Example
pipeline_version: v1.0.0
project_name: SR0001-001_1_0000000
sample_sheet: /path/to/manifest/file/SR0001-001_1_AnalysisManifest_0000000.csv
genome_build: hg37
snp_array: GSAMD-24v1-0
num_samples: 6336
num_snps: 700078
reference_files:
  illumina_manifest_file: /path/to/bpm/file/GSAMD-24v1-0_20011747_A1.bpm
  thousand_genome_vcf: /path/to/thousand/genome/ALL.wgs.phase3_shapeit2_mvncall_integrated_v5.20130502.sites.vcf.gz
  thousand_genome_tbi: /path/to/thousand/genome/ALL.wgs.phase3_shapeit2_mvncall_integrated_v5.20130502.sites.vcf.gz.tbi
user_files:
  output_pattern: '{prefix}/{file_type}.{ext}'
  idat_pattern:
    red: /example/pattern/wildcards/are/columns/in/sample_sheet_file/{Project}/{Sample_ID}_Red.idat
    green: /example/pattern/wildcards/are/columns/in/sample_sheet_file/{Project}/{Sample_ID}_Grn.idat
  gtc_pattern: /example/pattern/wildcards/are/columns/in/sample_sheet_file/{Project}/{Sample_ID}.gtc
software_params:
  sample_call_rate_1: 0.8
  snp_call_rate_1: 0.8
  sample_call_rate_2: 0.95
  snp_call_rate_2: 0.95
  ld_prune_r2: 0.1
  maf_for_ibd: 0.2
  maf_for_hwe: 0.05
  ibd_pi_hat_min: 0.12
  ibd_pi_hat_max: 1.0
  dup_concordance_cutoff: 0.95
  intensity_threshold: 6000
  contam_threshold: 0.1
  contam_population: AF
  pi_hat_threshold: 0.2
  autosomal_het_threshold: 0.1
  strand: top
workflow_params:
  subject_id_column: Group_By
  expected_sex_column: Expected_Sex
  case_control_column: Case/Control_Status
  remove_contam: true
  remove_rep_discordant: true
  minimum_pop_subjects: 50
  control_hwp_threshold: 50
  lims_upload: true
  lims_output_dir: /example/location/to/place/lims/upload/file
  time_start: '20240227130627'
Sample_IDs_to_remove:
  - Sample0001