Configuration Files¶
Sample Sheet File¶
A sample sheet is a CSV formatted file where each row represents a sample and columns contain various metadata.
This file must contain a column named Sample_ID, which has a unique ID for each row.
You also need columns describing (1) the subject ID, which groups samples from the same individual, (2) the expected sex of the individual, and (3) the case-control status of the individual.
See the Workflow Parameters section below for details about what values are allowed for subject ID, expected sex, and case control.
Note
Example of the basic required structure of a sample sheet.
You can have any other metadata as additional columns.
The only required column name is Sample_ID; the other columns can be named anything as long as they are referenced correctly in Workflow Parameters.
Sample_ID,Subject_ID,Expected_Sex,Case_Control
Samp0001,Sub0001,M,Case
Samp0002,Sub0001,M,Case
Samp0003,Sub0002,F,Control
Samp0004,Sub0003,F,QC
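The requirement that Sample_ID exists and is unique per row can be checked before running the pipeline. The following is a minimal sketch using only the Python standard library; the function name check_sample_sheet is hypothetical and is not part of the cgr command line tool.

```python
import csv
import io


def check_sample_sheet(csv_text, id_column="Sample_ID"):
    """Return a list of problems found in a sample sheet given as CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if id_column not in (reader.fieldnames or []):
        return [f"missing required column: {id_column}"]
    problems = []
    seen = set()
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        sample_id = (row[id_column] or "").strip()
        if not sample_id:
            problems.append(f"line {line_no}: empty {id_column}")
        elif sample_id in seen:
            problems.append(f"line {line_no}: duplicate {id_column} {sample_id}")
        seen.add(sample_id)
    return problems


EXAMPLE = """Sample_ID,Subject_ID,Expected_Sex,Case_Control
Samp0001,Sub0001,M,Case
Samp0002,Sub0001,M,Case
Samp0003,Sub0002,F,Control
Samp0004,Sub0003,F,QC
"""

# An empty list means no problems were found.
print(check_sample_sheet(EXAMPLE))
```

Note that Subject_ID may repeat (Samp0001 and Samp0002 share Sub0001), so only the Sample_ID column is checked for uniqueness.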
CGR LIMs Manifest File¶
The CGR LIMs manifest file is our internal way to distribute sample information.
If you use a Sample Sheet File then you can ignore this section.
The manifest file is an INI-like file with three sections (header, manifests, data).
During cgr pre-flight we will pull out the data section.
This file should already have all of the required columns.
See the Workflow Parameters section below for details about what values are allowed for subject ID, expected sex, and case control.
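To illustrate what pulling out the data section could look like, here is a rough sketch. It assumes that sections are introduced by bracketed headers such as [Data], which this page does not confirm; the function name and the example manifest are hypothetical and do not reflect the actual cgr pre-flight implementation.

```python
def extract_data_section(manifest_text, section="data"):
    """Return the non-empty lines under the [section] header, up to the next header."""
    lines = []
    in_section = False
    for raw in manifest_text.splitlines():
        line = raw.strip()
        if line.startswith("[") and "]" in line:
            # Assumption: a line such as "[Data]" (any case) starts a new section.
            name = line[1:line.index("]")].strip().lower()
            in_section = name == section
            continue
        if in_section and line:
            lines.append(line)
    return lines


# Hypothetical manifest content, for illustration only.
EXAMPLE_MANIFEST = """[Header]
Project Name,Demo
[Manifests]
A,demo.bpm
[Data]
Sample_ID,Group_By
Samp0001,Sub0001
"""

for row in extract_data_section(EXAMPLE_MANIFEST):
    print(row)
```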
The config.yml¶
In Running the Pipeline, we created the config.yml using Creating Configuration.
This file is central to running the CGR GwasQcPipeline.
Here we will describe the main configuration options.
The config file is broken down into different sections called “namespaces”.
We will walk through each section below and end this page with a full example.
Note
Required options without defaults are in bold.
Top Level Config¶
pipeline_version: v1.0.0
slurm_partition: defq
project_name: SR0001-001_1_0000000
sample_sheet: /path/to/manifest/file/SR0001-001_1_AnalysisManifest_0000000.csv
genome_build: hg37
snp_array: GSAMD-24v1-0
num_samples: 6336
num_snps: 700078
reference_files: ... # Reference file namespace
user_files: ... # User file namespace
software_params: ... # Software parameter namespace
workflow_params: ... # Workflow parameter namespace
Sample_IDs_to_remove:
| Option | Description | Type | Default |
| --- | --- | --- | --- |
| Pipeline Version | The version of the GwasQcPipeline to use. If you want to use a different version, you may edit this value to match. However, we suggest that you re-run the entire pipeline in case there are differences between versions. | string | v1.8.0-rc3 |
| Slurm Partition | Name of the Slurm partition to which jobs will be submitted when using the --slurm-generic submit option. | string | |
| Project Name | The title of the project to use during report generation. | string | |
| Sample Sheet | The path to the sample sheet (or LIMs manifest). This is the file referenced during | string (path) | |
| Genome Build | The human genome build. This field is not actually used by the workflow. It is only used during | string; one of hg37, hg38 | hg37 |
| Snp Array | Which SNP array was used. Only used for reporting. | string | |
| Num Samples | Number of samples, automatically calculated from the sample sheet during | integer | |
| Num Snps | Number of markers, automatically calculated from the | integer | |
| Reference Files | Reference file namespace. Reference files include the Illumina provided BPM and the 1000 Genomes VCF. | namespace | |
| User Files | User file namespace. User files include the user provided genotype data in BCF, IDAT/GTC, PED/MAP or BED/BIM/FAM formats. | namespace | |
| Software Params | Software parameter namespace. This includes all parameters passed to 3rd party and internal software and scripts. | namespace | |
| Workflow Params | Workflow parameter namespace. This includes all parameters used to control workflow behavior. | namespace | |
| Sample Ids To Remove | A list of Sample_IDs to exclude from QC. This is the easiest way for a user to exclude specific samples from the GwasQcPipeline. These samples will be flagged as | list of strings | |
The config.yml file contains all workflow configurations.
The top level section is automatically populated by cgr config and cgr pre-flight.
You should not need to edit this section unless you want to (1) update the version of the workflow you are using, (2) change the project name, or (3) switch human references.
Reference Files¶
A list of reference files used by the pipeline.
reference_files:
  illumina_manifest_file: /path/to/bpm/file/GSAMD-24v1-0_20011747_A1.bpm
  illumina_csv_bpm: /path/to/csv/file/GSAMD-24v1-0_20011747_A1.csv
  thousand_genome_vcf: /path/to/thousand/genome/ALL.wgs.phase3_shapeit2_mvncall_integrated_v5.20130502.sites.vcf.gz
  thousand_genome_tbi: /path/to/thousand/genome/ALL.wgs.phase3_shapeit2_mvncall_integrated_v5.20130502.sites.vcf.gz.tbi
  reference_fasta: /path/to/reference/fasta/GCA_000001405.15_GRCh38_full_analysis_set.fna.bgz
| Option | Description | Type |
| --- | --- | --- |
| Illumina Manifest File | Path to the Illumina provided BPM file. | string (path) |
| Illumina Cluster File | Path to the array cluster EGT file. | string (path) |
| Thousand Genome Vcf | Path to the 1000 Genomes VCF file. | string (path) |
| Thousand Genome Tbi | Path to the corresponding index for the 1000 Genomes VCF file. | string (path) |
| Reference Fasta | Path to Reference fasta file to be used to convert gtc to bcf. This could be compressed with bgzip but not gzip. | string (path) |
| Illumina Csv Bpm | Path to CSV bead pool manifest provided by Illumina to be used for gtc to bcf conversion. If csv_bpm is not provided, insertions/deletions will be skipped in gtc-to-bcf conversion. | string (path) |
If you are on CGEMS/CCAD, then the paths of the reference files are correctly populated by cgr config --cgems or cgr config --cgems-dev.
If you are on another system, then you must provide the correct paths and versions of these files.
The illumina_manifest_file is provided by Illumina.
The thousand_genome_vcf and thousand_genome_tbi files can be downloaded from the 1000 Genomes website.
The illumina_cluster_file is the EGT file used to generate GTCs; this file is not required and is only referenced in the QC report if present.
The reference_fasta is used to convert GTCs to VCF and is needed if GTCs are provided.
- 1000 Genomes reference files download links
- reference fasta download links
After downloading the reference fasta, it needs to be converted from gz format to bgz format. This can be done with:
zcat GCA_000001405.15_GRCh38_full_analysis_set.fna.gz | bgzip -c > GCA_000001405.15_GRCh38_full_analysis_set.fna.bgz
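Because the reference fasta may be bgzip-compressed but not plain gzip-compressed, it can help to confirm which one a file is. The sketch below inspects the first 16 bytes: a BGZF (bgzip) block is a gzip member with the extra-field flag set and a 'BC' subfield directly after the extra-field length. The helper name is_bgzf is hypothetical, and the check assumes 'BC' is the first extra subfield, which is how bgzip writes its blocks.

```python
import gzip


def is_bgzf(header):
    """Return True if the given leading bytes look like a BGZF (bgzip) block header."""
    if len(header) < 16:
        return False
    magic_ok = header[0:2] == b"\x1f\x8b" and header[2] == 8  # gzip magic + deflate
    has_extra = bool(header[3] & 0x04)  # FEXTRA flag must be set for BGZF
    # Bytes 12-13 hold the first extra subfield identifier, 'BC' for BGZF.
    return magic_ok and has_extra and header[12:14] == b"BC"


# Plain gzip bytes, produced in memory for the demonstration.
plain = gzip.compress(b"ACGT")
# The 28-byte empty BGZF block that bgzip appends as an end-of-file marker.
bgzf_eof = bytes.fromhex(
    "1f8b0804 00000000 00ff 0600 4243 0200 1b00 0300 00000000 00000000"
)

print(is_bgzf(plain[:16]), is_bgzf(bgzf_eof[:16]))  # False True
```

For a file on disk, read the first 16 bytes in binary mode and pass them to the function.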
Warning
The genome build needs to match across illumina_manifest_file, thousand_genome_*, and genome_build above.
Currently we only support hg37 and hg38.
In practice, as long as illumina_manifest_file and thousand_genome_* are from the same build, everything should work.
User Files¶
A list of user provided files or naming patterns.
user_files:
  output_pattern: '{prefix}/{file_type}.{ext}'
  idat_pattern:
    red: /example/pattern/wildcards/are/columns/in/sample_sheet_file/{Project}/{Sample_ID}_Red.idat
    green: /example/pattern/wildcards/are/columns/in/sample_sheet_file/{Project}/{Sample_ID}_Grn.idat
  gtc_pattern: /example/pattern/wildcards/are/columns/in/sample_sheet_file/{Project}/{Sample_ID}.gtc
or
user_files:
  output_pattern: '{prefix}/{file_type}.{ext}'
  ped: /path/to/samples.ped
  map: /path/to/samples.map
or
user_files:
  output_pattern: '{prefix}/{file_type}.{ext}'
  bed: /path/to/samples.bed
  bim: /path/to/samples.bim
  fam: /path/to/samples.fam
or
user_files:
  output_pattern: '{prefix}/{file_type}.{ext}'
  bcf: /path/to/samples.bcf
Note
The BCF file, IDAT/GTC patterns, PED/MAP paths, and BED/BIM/FAM paths are all mutually exclusive. You should only provide one set of patterns/paths.
| Option | Description | Type | Default |
| --- | --- | --- | --- |
| Output Pattern | File naming pattern for deliverable files. In general, you should not need to edit this. However, if you do decide to change this pattern you must keep the | string | {prefix}/{file_type}.{ext} |
| Idat Pattern | File naming pattern for IDAT files. There are two IDAT files for each sample (red and green). You need to provide a file naming pattern for each IDAT file. Wildcards are indicated by an | namespace | |
| Gtc Pattern | File name pattern for GTC file. Wildcards are indicated by an | string | |
| Ped | The full path to an aggregated PED file if the sample level GTC files are not available. | string (path) | |
| Map | The full path to an aggregated MAP file if the sample level GTC files are not available. | string (path) | |
| Bed | The full path to an aggregated BED file if the sample level GTC files are not available. | string (path) | |
| Bim | The full path to an aggregated BIM file if the sample level GTC files are not available. | string (path) | |
| Fam | The full path to an aggregated FAM file if the sample level GTC files are not available. | string (path) | |
| Bcf | The full path to an aggregated BCF/VCF file, preferably encoding the GenCall scores. | string (path) | |
The user_files section has a number of different mutually exclusive configurations depending on the starting file types.
By default we assume you will be starting with IDAT and GTC files, though we also accept aggregated PED/MAP, aggregated BED/BIM/FAM, and aggregated BCF files.
In the example, {Project} and {Sample_ID} will be filled by values from the Project and Sample_ID columns in cgr_sample_sheet.csv.
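The wildcard filling described here behaves much like Python string formatting with a sample sheet row as the source of values. The following is a small sketch of that idea, not the pipeline's own expansion code; the function name, the row, and the pattern are hypothetical.

```python
def fill_pattern(pattern, row):
    """Fill {Column} wildcards in a pattern with values from one sample sheet row."""
    try:
        return pattern.format_map(row)
    except KeyError as err:
        raise ValueError(
            f"pattern uses a column missing from the sample sheet: {err}"
        ) from None


row = {"Sample_ID": "Samp0001", "Project": "ProjA"}  # hypothetical row
pattern = "/data/{Project}/{Sample_ID}.gtc"  # hypothetical pattern

print(fill_pattern(pattern, row))  # /data/ProjA/Samp0001.gtc
```

A wildcard that names a column absent from the sample sheet raises an error here, which is a useful early check that every wildcard in a pattern matches a column name exactly.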
If the gtc_pattern is given, then this will trigger the GTC entry point. There are two methods to convert GTC files to sample_level/samples.{bed,bim,fam}.
If workflow_params.convert_gtc2bcf=false (default), we will convert each sample’s GTC to a PED/MAP and then aggregate all samples and convert to a single BED/BIM/FAM at sample_level/samples.{bed,bim,fam}.
If workflow_params.convert_gtc2bcf=true, we will convert GTCs to an aggregated BCF and convert to BED/BIM/FAM at sample_level/samples.{bed,bim,fam}.
If the PED/MAP files are given, then this will trigger the PED/MAP entry point, which will convert these files to a single BED/BIM/FAM at sample_level/samples.{bed,bim,fam}.
If the BED/BIM/FAM files are given, then this will trigger the BED/BIM/FAM entry point, which will create a symbolic link from your BED/BIM/FAM to sample_level/samples.{bed,bim,fam}.
If a BCF is given, then this will trigger the BCF entry point, which will convert the BCF to BED/BIM/FAM at sample_level/samples.{bed,bim,fam}.
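The entry point rules above can be summarized as a small decision function. This is only an illustration of the documented behavior under the assumption that the keys present in user_files determine the entry point; pick_entry_point and its return labels are hypothetical and are not taken from the pipeline source.

```python
def pick_entry_point(user_files, convert_gtc2bcf=False):
    """Name the entry point implied by the inputs set in a user_files mapping."""
    candidates = {
        "gtc": bool(user_files.get("gtc_pattern")),
        "ped/map": bool(user_files.get("ped") or user_files.get("map")),
        "bed/bim/fam": any(user_files.get(k) for k in ("bed", "bim", "fam")),
        "bcf": bool(user_files.get("bcf")),
    }
    chosen = [name for name, present in candidates.items() if present]
    if len(chosen) != 1:
        # The input types are mutually exclusive, so exactly one must be set.
        raise ValueError(f"expected exactly one input type, found: {chosen}")
    if chosen[0] == "gtc":
        return "gtc via bcf" if convert_gtc2bcf else "gtc via ped/map"
    return chosen[0]


print(pick_entry_point({"gtc_pattern": "/data/{Sample_ID}.gtc"}))
print(pick_entry_point({"bed": "s.bed", "bim": "s.bim", "fam": "s.fam"}))
```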
Software Parameters¶
Software parameters used by various tools in the workflow.
software_params:
  strand: top
  sample_call_rate_1: 0.8
  snp_call_rate_1: 0.8
  sample_call_rate_2: 0.95
  snp_call_rate_2: 0.95
  intensity_threshold: 6000
  contam_threshold: 0.1
  contam_population: AF
  ld_prune_r2: 0.1
  maf_for_ibd: 0.2
  maf_for_hwe: 0.05
  ibd_pi_hat_min: 0.12
  ibd_pi_hat_max: 1.0
  dup_concordance_cutoff: 0.95
  pi_hat_threshold: 0.2
  autosomal_het_threshold: 0.1
| Option | Description | Type | Range | Default |
| --- | --- | --- | --- | --- |
| Strand | Which Illumina strand to use for genotypes when converting GTC to plink {TOP, FWD, PLUS}. | string | | top |
| Sample Call Rate 1 | Sample call rate filter 1 threshold. | number | 0 < x ≤ 1 | 0.8 |
| Snp Call Rate 1 | SNP call rate filter 1 threshold. | number | 0 < x ≤ 1 | 0.8 |
| Sample Call Rate 2 | Sample call rate filter 2 threshold. | number | 0 < x ≤ 1 | 0.95 |
| Snp Call Rate 2 | SNP call rate filter 2 threshold. | number | 0 < x ≤ 1 | 0.95 |
| Intensity Threshold | Median IDAT intensity threshold used to filter samples when estimating contamination. | integer | x > 0 | 6000 |
| Contam Threshold | %Mix cutoff to consider a sample as contaminated. | number | 0 < x ≤ 1 | 0.1 |
| Contam Population | Which population from the 1000 Genomes project to use for B-allele frequencies during contamination testing. Can be one of {AF, EAS_AF, AMR_AF, AFR_AF, EUR_AF, SAS_AF}. | string | | AF |
| Ld Prune R2 | The r-squared threshold for LD pruning of SNPs for use with IBD and replicate concordance. | number | 0 < x ≤ 1 | 0.1 |
| Maf For Ibd | The minor allele frequency threshold of SNPs for use with IBD and replicate concordance. | number | 0 < x ≤ 1 | 0.2 |
| Maf For Hwe | The minor allele frequency threshold of SNPs for use with population level HWE estimates. | number | 0 < x ≤ 1 | 0.05 |
| Ibd Pi Hat Min | The minimum IBD pi hat value to save in the results table. | number | 0 ≤ x < 1 | 0.12 |
| Ibd Pi Hat Max | The maximum IBD pi hat value to save in the results table. | number | 0 < x ≤ 1 | 1.0 |
| Dup Concordance Cutoff | The concordance threshold to consider two samples as replicates. | number | 0 < x ≤ 1 | 0.95 |
| Pi Hat Threshold | The pi hat threshold to consider two samples as related. The default of 0.2 reports 1st and 2nd degree relatives. | number | 0 < x ≤ 1 | 0.2 |
| Autosomal Het Threshold | The autosomal heterozygosity F coefficient threshold used to flag subjects for removal. | number | 0 < x ≤ 1 | 0.1 |
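The numeric bounds listed above can be checked before a run. The sketch below encodes those bounds by hand in plain Python; it is a hypothetical helper and not the schema validation that the pipeline itself performs.

```python
# Keys documented with an exclusive minimum of 0 and a maximum of 1.
UNIT_INTERVAL_KEYS = (
    "sample_call_rate_1", "snp_call_rate_1",
    "sample_call_rate_2", "snp_call_rate_2",
    "contam_threshold", "ld_prune_r2", "maf_for_ibd", "maf_for_hwe",
    "ibd_pi_hat_max", "dup_concordance_cutoff", "pi_hat_threshold",
    "autosomal_het_threshold",
)


def check_software_params(params):
    """Return a list of out-of-range settings in a software_params mapping."""
    problems = []
    for key in UNIT_INTERVAL_KEYS:
        if key in params and not 0 < params[key] <= 1:
            problems.append(f"{key}={params[key]} is outside (0, 1]")
    # ibd_pi_hat_min is the one key with an inclusive minimum and exclusive maximum.
    if "ibd_pi_hat_min" in params and not 0 <= params["ibd_pi_hat_min"] < 1:
        problems.append(f"ibd_pi_hat_min={params['ibd_pi_hat_min']} is outside [0, 1)")
    if "intensity_threshold" in params and not params["intensity_threshold"] > 0:
        problems.append("intensity_threshold must be greater than 0")
    return problems


defaults = {"sample_call_rate_1": 0.8, "ibd_pi_hat_min": 0.12, "intensity_threshold": 6000}
print(check_software_params(defaults))  # an empty list means all values are in range
```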
Workflow Parameters¶
These parameters control which parts of the workflow are run and how.
workflow_params:
  subject_id_column: Group_By
  expected_sex_column: Expected_Sex
  sex_chr_included: true
  case_control_column: Case/Control_Status
  remove_contam: true
  remove_rep_discordant: true
  minimum_pop_subjects: 50
  control_hwp_threshold: 50
  lims_upload: true
  lims_output_dir: /DCEG/CGF/Laboratory/LIMS/drop-box-prod/gwas_primaryqc/
  case_control_gwas: false
  max_time_hr:
  max_mem_mb:
  time_start:
  convert_gtc2bcf: false
  additional_params_for_gtc2bcf: --use-gtc-sample-names
| Option | Description | Type | Range | Default |
| --- | --- | --- | --- | --- |
| Subject Id Column | The column in your sample sheet that contains the subject ID. We expect that there may be multiple samples (rows) that have the same subject ID. If there are multiple columns that you want to use for subject ID, you can create a special column named | string | | Group_By |
| Expected Sex Column | The name of the column in the sample sheet which identifies expected sex of samples. Allowed values in this column are [ | string | | Expected_Sex |
| Sex Chr Included | True if the sex chromosome is included in the microarray and a sex concordance check can be performed. | boolean | | True |
| Case Control Column | The name of the column in the sample sheet which identifies Case/Control status. Allowed values in this column are [ | string | | Case/Control_Status |
| Remove Contam | True if you want to remove contaminated samples before running the subject level QC. | boolean | | True |
| Remove Rep Discordant | True if you want to remove discordant replicates before running the subject level QC. | boolean | | True |
| Minimum Pop Subjects | The minimum number of samples needed to use a population for population level QC (PCA, Autosomal Heterozygosity). | integer | x > 0 | 50 |
| Control Hwp Threshold | The minimum number of control samples needed to use a population for population level QC (HWE). | integer | x > 0 | 50 |
| Lims Upload | For | boolean | | True |
| Lims Output Dir | | string | | /DCEG/CGF/Laboratory/LIMS/drop-box-prod/gwas_primaryqc/ |
| Case Control Gwas | A plink logistic regression GWAS will be performed with the case_control phenotype. | boolean | | False |
| Max Time Hr | The maximum amount of time that can be requested, in hours. | integer | | |
| Max Mem Mb | The maximum amount of memory that can be requested, in megabytes. | integer | | |
| Time Start | Date and time at which the workflow starts. This creates a unique id for the run. | string | | 20241121155030 |
| Convert Gtc2Bcf | If input is GTC, this switches between gtc2vcf (True) and gtc2ped (False - default) for conversion to BED. | boolean | | False |
| Additional Params For Gtc2Bcf | Additional/optional parameters not hardcoded to be used or skipped in gtc2bcf for specific analysis. | string | | --use-gtc-sample-names |
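Since subject_id_column, expected_sex_column, and case_control_column must each name a column that exists in the sample sheet, a quick cross-check can catch mismatches early. The helper below is a hypothetical sketch and is not part of the cgr tool.

```python
COLUMN_KEYS = ("subject_id_column", "expected_sex_column", "case_control_column")


def missing_columns(header, workflow_params):
    """Map each configured column option to its value when that column is absent."""
    return {
        key: workflow_params[key]
        for key in COLUMN_KEYS
        if key in workflow_params and workflow_params[key] not in header
    }


# Header from the sample sheet example near the top of this page.
header = ["Sample_ID", "Subject_ID", "Expected_Sex", "Case_Control"]
params = {
    "subject_id_column": "Subject_ID",
    "expected_sex_column": "Expected_Sex",
    "case_control_column": "Case/Control_Status",  # default name, absent from this sheet
}

print(missing_columns(header, params))
# {'case_control_column': 'Case/Control_Status'}
```

In this example the sheet names its column Case_Control, so case_control_column would need to be set to Case_Control.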
Sample IDs to Remove¶
This is an optional section where you can list Sample_IDs that you do not want to include in the QC run.
These samples will be indicated as is_user_exclusion = True in the sample level QC table.
Sample_IDs_to_remove:
- Sample0001
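As a rough illustration of the flagging, the sketch below marks each sample as excluded or not and also reports listed IDs that do not occur in the sample sheet. The function name and the second return value are hypothetical; only the is_user_exclusion idea comes from this page.

```python
def flag_user_exclusions(sample_ids, ids_to_remove):
    """Return (flags, unknown): per-sample exclusion flags and unmatched IDs."""
    to_remove = set(ids_to_remove or [])
    flags = {sample_id: sample_id in to_remove for sample_id in sample_ids}
    unknown = sorted(to_remove - set(sample_ids))
    return flags, unknown


flags, unknown = flag_user_exclusions(
    ["Sample0001", "Sample0002"],  # hypothetical Sample_ID values
    ["Sample0001", "Sample9999"],
)
print(flags)    # {'Sample0001': True, 'Sample0002': False}
print(unknown)  # ['Sample9999']
```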
Full Example¶
pipeline_version: v1.0.0
project_name: SR0001-001_1_0000000
sample_sheet: /path/to/manifest/file/SR0001-001_1_AnalysisManifest_0000000.csv
genome_build: hg37
snp_array: GSAMD-24v1-0
num_samples: 6336
num_snps: 700078
reference_files:
  illumina_manifest_file: /path/to/bpm/file/GSAMD-24v1-0_20011747_A1.bpm
  thousand_genome_vcf: /path/to/thousand/genome/ALL.wgs.phase3_shapeit2_mvncall_integrated_v5.20130502.sites.vcf.gz
  thousand_genome_tbi: /path/to/thousand/genome/ALL.wgs.phase3_shapeit2_mvncall_integrated_v5.20130502.sites.vcf.gz.tbi
  reference_fasta: /path/to/reference/GwasQcPipeline-test-data/GCA_000001405.15_GRCh38_full_analysis_set.fna.bgz
user_files:
  output_pattern: '{prefix}/{file_type}.{ext}'
  idat_pattern:
    red: /example/pattern/wildcards/are/columns/in/sample_sheet_file/{Project}/{Sample_ID}_Red.idat
    green: /example/pattern/wildcards/are/columns/in/sample_sheet_file/{Project}/{Sample_ID}_Grn.idat
  gtc_pattern: /example/pattern/wildcards/are/columns/in/sample_sheet_file/{Project}/{Sample_ID}.gtc
software_params:
  sample_call_rate_1: 0.8
  snp_call_rate_1: 0.8
  sample_call_rate_2: 0.95
  snp_call_rate_2: 0.95
  ld_prune_r2: 0.1
  maf_for_ibd: 0.2
  maf_for_hwe: 0.05
  ibd_pi_hat_min: 0.12
  ibd_pi_hat_max: 1.0
  dup_concordance_cutoff: 0.95
  intensity_threshold: 6000
  contam_threshold: 0.1
  contam_population: AF
  pi_hat_threshold: 0.2
  autosomal_het_threshold: 0.1
  strand: top
workflow_params:
  subject_id_column: Group_By
  expected_sex_column: Expected_Sex
  case_control_column: Case/Control_Status
  remove_contam: true
  remove_rep_discordant: true
  minimum_pop_subjects: 50
  control_hwp_threshold: 50
  lims_upload: true
  lims_output_dir: /example/location/to/place/lims/upload/file
  time_start: '20240227130627'
  convert_gtc2bcf: false
Sample_IDs_to_remove:
- Sample0001