Configuration Files

Sample Sheet File

A sample sheet is a CSV-formatted file in which each row represents a sample and the columns contain metadata. The file must contain a column named Sample_ID that holds a unique ID for each row. You also need columns describing (1) the subject ID, which groups samples from the same individual, (2) the expected sex of the individual, and (3) the case/control status of the individual. See the Workflow Parameters section below for the values allowed for subject ID, expected sex, and case/control status.

Note

Example of the basic required structure of a sample sheet. You can add any other metadata as additional columns. The only required column name is Sample_ID; the other columns can be named anything as long as they are referenced correctly in Workflow Parameters.

Sample_ID,Subject_ID,Expected_Sex,Case_Control
Samp0001,Sub0001,M,Case
Samp0002,Sub0001,M,Case
Samp0003,Sub0002,F,Control
Samp0004,Sub0003,F,QC
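
The requirements above can be checked before running the pipeline. The following is a minimal sketch, not part of the pipeline itself: the function name `check_sample_sheet` and its arguments are hypothetical, and the allowed values are the ones listed under Workflow Parameters.

```python
import csv
import io

# Allowed values as listed under Workflow Parameters in this document.
ALLOWED_SEX = {"F", "M", "U"}
ALLOWED_CASE_CONTROL = {"Control", "Case", "QC", "Unknown"}


def check_sample_sheet(handle, sex_col="Expected_Sex", cc_col="Case_Control"):
    """Return a list of human-readable problems found in a sample sheet."""
    reader = csv.DictReader(handle)
    if "Sample_ID" not in (reader.fieldnames or []):
        return ["missing required column: Sample_ID"]
    problems = []
    seen = set()
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        sample_id = row["Sample_ID"]
        if sample_id in seen:
            problems.append(f"line {line_no}: duplicate Sample_ID {sample_id}")
        seen.add(sample_id)
        if row.get(sex_col) not in ALLOWED_SEX:
            problems.append(f"line {line_no}: unexpected sex {row.get(sex_col)!r}")
        if row.get(cc_col) not in ALLOWED_CASE_CONTROL:
            problems.append(
                f"line {line_no}: unexpected case/control {row.get(cc_col)!r}"
            )
    return problems


example = """Sample_ID,Subject_ID,Expected_Sex,Case_Control
Samp0001,Sub0001,M,Case
Samp0002,Sub0001,M,Case
Samp0003,Sub0002,F,Control
Samp0004,Sub0003,F,QC
"""
print(check_sample_sheet(io.StringIO(example)))  # an empty list means no problems
```

The column names for expected sex and case/control are passed as arguments because only Sample_ID has a fixed name.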

CGR LIMs Manifest File

The CGR LIMs manifest file is our internal way to distribute sample information. If you use a Sample Sheet File then you can ignore this section. The manifest file is an INI-like file with three sections (header, manifests, data). During cgr pre-flight we pull out the data section. This file should already have all of the required columns. See the Workflow Parameters section below for details about what values are allowed for subject ID, expected sex, and case control.
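
To illustrate what pulling out the data section could look like, here is a rough sketch. It rests on assumptions that this page does not confirm: that each section starts with a bracketed line such as `[Data]`, and that the data section is a CSV table with a header row. The function name `read_data_section` and the manifest contents are made up for the example and are not the pipeline's own parser.

```python
import csv
import io


def read_data_section(text, section="Data"):
    """Return the rows of one section of an INI-like manifest as dicts.

    Assumption: sections start with a line such as "[Data]" and the data
    section holds a CSV table whose first line is the column header.
    """
    lines = []
    in_section = False
    for line in text.splitlines():
        stripped = line.strip().rstrip(",")  # tolerate trailing commas after "[Data]"
        if stripped.startswith("[") and stripped.endswith("]"):
            in_section = stripped[1:-1].lower() == section.lower()
            continue
        if in_section and line.strip():
            lines.append(line)
    return list(csv.DictReader(io.StringIO("\n".join(lines))))


# Hypothetical manifest contents, for illustration only.
manifest = """[Header]
Project,SR0001-001
[Manifests]
A,GSAMD-24v1-0
[Data]
Sample_ID,Expected_Sex
Samp0001,M
Samp0002,F
"""
rows = read_data_section(manifest)
print([row["Sample_ID"] for row in rows])
```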

The `config.yml`

In Running the Pipeline, we created the config.yml using Creating Configuration. This file is central to running the CGR GwasQcPipeline. Here we will describe the main configuration options. The config file is broken down into different sections called “namespaces”. I will walk through each section below and end this page with a full example.

Note

Required options without defaults are in bold.

Top Level Config

`config.yml` data model.

pipeline_version: v1.0.0
slurm_partition: defq
project_name: SR0001-001_1_0000000
sample_sheet: /path/to/manifest/file/SR0001-001_1_AnalysisManifest_0000000.csv
genome_build: hg37
snp_array: GSAMD-24v1-0
num_samples: 6336
num_snps: 700078
reference_files: ...  # Reference file namespace
user_files: ...  # User file namespace
software_params: ...  # Software parameter namespace
workflow_params: ...  # Workflow parameter namespace
Sample_IDs_to_remove:
type	object
properties
pipeline_version	Pipeline Version
	The version of the GwasQcPipeline to use. If you want to use a different version you may just edit this value to match. However, it is suggested that you re-run the entire pipeline in case there are differences between versions.
	type	string
	default	v1.9.0-rc1
slurm_partition	Slurm Partition
	Name of the Slurm partition to which jobs will be submitted when using the `--slurm-generic` submit option.
	type	string
project_name	Project Name
	The title of the project to use during report generation.
	type	string
sample_sheet	Sample Sheet
	The path to the sample sheet (or LIMs manifest). This is the file referenced during `cgr config` and used to generate `cgr_sample_sheet.csv`.
	type	string
	format	path
genome_build	Genome Build
	The human genome build. This field is not actually used by the workflow. It is only used during `cgr config` to select the VCF file when running on CGEMs/CCAD.
	type	string; [hg37 | hg38]
	default	hg37
snp_array	Snp Array
	Which SNP array was used. Only used for reporting.
	type	string
num_samples	Num Samples
	Number of samples, automatically calculated from the sample sheet during `cgr config`.
	type	integer
num_snps	Num Snps
	Number of markers, automatically calculated from the `reference_files.illumina_manifest_file`. We will attempt to calculate this during `cgr config` and `cgr pre-flight`.
	type	integer
reference_files	Reference Files
	Reference file namespace. Reference files include the Illumina provided BPM and the 1000 Genomes VCF.
	type	namespace
user_files	User Files
	User file namespace. User files include the user provided genotype data in BCF, IDAT/GTC, PED/MAP or BED/BIM/FAM formats.
	type	namespace
software_params	Software Params
	Software parameter namespace. This includes all parameters passed to 3rd party and internal software and scripts.
	type	namespace
workflow_params	Workflow Params
	Workflow parameter namespace. This includes all parameters used to control workflow behavior.
	type	namespace
Sample_IDs_to_remove	Sample Ids To Remove
	A list of Sample_IDs to exclude from QC. This is the easiest way for a user to exclude specific samples from the GwasQcPipeline. These samples will be flagged as `is_user_exclusion` and will be present in the report, but they will not have any results from this analysis.
	type	list of strings

The config.yml file contains all workflow configurations. The top level section is automatically populated by cgr config and cgr pre-flight. You should not need to edit this section unless you want to (1) update the version of the workflow you are using, (2) change the project name, or (3) switch human references.

Reference Files

A list of reference files used by the pipeline.

reference_files:
   illumina_manifest_file: /path/to/bpm/file/GSAMD-24v1-0_20011747_A1.bpm
   illumina_csv_bpm: /path/to/csv/file/GSAMD-24v1-0_20011747_A1.csv
   illumina_cluster_file: /path/to/egt-cluster/file/
   thousand_genome_vcf: /path/to/thousand/genome/ALL.wgs.phase3_shapeit2_mvncall_integrated_v5.20130502.sites.vcf.gz
   thousand_genome_tbi: /path/to/thousand/genome/ALL.wgs.phase3_shapeit2_mvncall_integrated_v5.20130502.sites.vcf.gz.tbi
   reference_fasta: /path/to/reference/fasta/GCA_000001405.15_GRCh38_full_analysis_set.fna.bgz
type	object
properties
illumina_manifest_file	Illumina Manifest File
	Path to the Illumina provided BPM file.
	type	string
	format	path
illumina_cluster_file	Illumina Cluster File
	Path to the array cluster EGT file.
	type	string
	format	path
thousand_genome_vcf	Thousand Genome Vcf
	Path to the 1000 Genomes VCF file.
	type	string
	format	path
thousand_genome_tbi	Thousand Genome Tbi
	Path to the corresponding index for the 1000 Genomes VCF file.
	type	string
	format	path
reference_fasta	Reference Fasta
	Path to the reference FASTA file used to convert GTC to BCF. The file may be compressed with bgzip but not with gzip.
	type	string
	format	path
illumina_csv_bpm	Illumina Csv Bpm
	Path to CSV bead pool manifest provided by Illumina to be used for gtc to bcf conversion. If csv_bpm is not provided, insertions/deletions will be skipped in gtc-to-bcf conversion.
	type	string
	format	path

If you are on CGEMS/CCAD then the paths of the reference files are correctly populated by cgr config --cgems or cgr config --cgems-dev. If you are on another system then you must provide the correct paths and versions of these files. The illumina_manifest_file is provided by Illumina. The thousand_genome_vcf and thousand_genome_tbi files can be downloaded from the 1000 Genomes website. The illumina_cluster_file is the EGT file used to generate GTCs; this file is required if an IDAT entry_point is used. The reference_fasta is used to convert GTCs to VCF and is needed if GTCs are provided.

1000 Genomes reference files download links

hg37: VCF, TBI
hg38: VCF, TBI

reference fasta download links

hg37: fasta
hg38: fasta

After downloading the reference fasta, it needs to be converted from gz format to bgz format. This can be done with:

zcat GCA_000001405.15_GRCh38_full_analysis_set.fna.gz | bgzip -c > GCA_000001405.15_GRCh38_full_analysis_set.fna.bgz
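
If you are unsure whether a file was compressed with bgzip or plain gzip, the file header can tell you. The sketch below relies on the BGZF layout, in which every block is a gzip member carrying an extra subfield tagged `BC`. The function name `looks_like_bgzf` is hypothetical and the check is not part of the pipeline.

```python
import gzip
import struct


def looks_like_bgzf(header: bytes) -> bool:
    """Check the first bytes of a file for the BGZF 'BC' extra subfield."""
    if len(header) < 12 or header[:3] != b"\x1f\x8b\x08":
        return False  # not a gzip/deflate stream at all
    if not header[3] & 0x04:
        return False  # FEXTRA flag unset: plain gzip output has no extra field
    (xlen,) = struct.unpack("<H", header[10:12])
    extra = header[12:12 + xlen]
    pos = 0
    while pos + 4 <= len(extra):
        si1, si2, slen = struct.unpack("<BBH", extra[pos:pos + 4])
        if (si1, si2) == (66, 67) and slen == 2:  # subfield id 'B', 'C'
            return True
        pos += 4 + slen
    return False


# The standard 28-byte BGZF end-of-file block, used here as a known BGZF sample.
BGZF_EOF = bytes.fromhex("1f8b08040000000000ff0600424302001b0003000000000000000000")
print(looks_like_bgzf(BGZF_EOF))
print(looks_like_bgzf(gzip.compress(b"ACGT")))
```

For a file on disk, read the first kilobyte in binary mode and pass it to the function, for example `looks_like_bgzf(open(path, "rb").read(1024))`.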

Warning

The genome build of illumina_manifest_file and thousand_genome_* must match genome_build above. Currently we only support hg37 and hg38. In practice, as long as illumina_manifest_file and thousand_genome_* are from the same build, everything should work.

User Files

A list of user provided files or naming patterns.

user_files:
   output_pattern: '{prefix}/{file_type}.{ext}'
   idat_pattern:
      red: /expample/pattern/wildcards/are/columns/in/sample_sheet_file/{Project}/{Sample_ID}_Red.idat
      green: /expample/pattern/wildcards/are/columns/in/sample_sheet_file/{Project}/{Sample_ID}_Grn.idat
   gtc_pattern: /expample/pattern/wildcards/are/columns/in/sample_sheet_file/{Project}/{Sample_ID}.gtc

or

user_files:
   output_pattern: '{prefix}/{file_type}.{ext}'
   ped: /path/to/samples.ped
   map: /path/to/samples.map

or

user_files:
   output_pattern: '{prefix}/{file_type}.{ext}'
   bed: /path/to/samples.bed
   bim: /path/to/samples.bim
   fam: /path/to/samples.fam

or

user_files:
   output_pattern: '{prefix}/{file_type}.{ext}'
   bcf: /path/to/samples.bcf

Note

The BCF file, IDAT/GTC patterns, PED/MAP paths, BED/BIM/FAM paths are all mutually exclusive. You should only provide one set of patterns/paths.
type	object
properties
output_pattern	Output Pattern
	File naming pattern for deliverable files. In general, you should not need to edit this. However, if you do decide to change this pattern you must keep the `{prefix}`, `{file_type}`, and `{ext}` patterns. These are filled in by the workflow, see the delivery sub-workflow for more details.
	type	string
	default	{prefix}/{file_type}.{ext}
idat_pattern	Idat Pattern
	File naming pattern for IDAT files. There are two IDAT files for each sample (red and green). You need to provide a file naming pattern for each IDAT file. Wildcards are indicated by an `{}`. Wildcards have to match column names in the sample sheet exactly.
	type	namespace
gtc_pattern	Gtc Pattern
	File name pattern for GTC file. Wildcards are indicated by an `{}`. Wildcards have to match column names in the sample sheet exactly.
	type	string
ped	Ped
	The full path to an aggregated PED file if the sample level GTC files are not available.
	type	string
	format	path
map	Map
	The full path to an aggregated MAP file if the sample level GTC files are not available.
	type	string
	format	path
bed	Bed
	The full path to an aggregated BED file if the sample level GTC files are not available.
	type	string
	format	path
bim	Bim
	The full path to an aggregated BIM file if the sample level GTC files are not available.
	type	string
	format	path
fam	Fam
	The full path to an aggregated FAM file if the sample level GTC files are not available.
	type	string
	format	path
bcf	Bcf
	The full path to an aggregated BCF/VCF file, preferably encoding the GenCall scores.
	type	string
	format	path

The user_files section has a number of different mutually exclusive configurations depending on the starting file types. By default we assume you will be starting with IDAT and GTC files, though we also accept aggregated PED/MAP files, aggregated BED/BIM/FAM files, or an aggregated BCF file. In the example, {Project} and {Sample_ID} will be filled by values from the Project and Sample_ID columns in cgr_sample_sheet.csv.
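
The wildcard substitution can be pictured with Python's `str.format`. This is only a sketch of the idea, not the workflow's actual implementation; the helper `fill_pattern`, the row values, and the `/data/...` pattern are hypothetical.

```python
def fill_pattern(pattern: str, row: dict) -> str:
    """Fill `{Column}` wildcards in a pattern with values from one sample sheet row."""
    try:
        return pattern.format(**row)
    except KeyError as err:
        # A wildcard that is not a column name cannot be resolved.
        raise ValueError(f"wildcard {err} is not a sample sheet column") from None


# One row of a sample sheet, keyed by column name (values are made up).
row = {"Sample_ID": "Samp0001", "Project": "ProjA", "Expected_Sex": "M"}
gtc_pattern = "/data/{Project}/{Sample_ID}.gtc"  # hypothetical pattern
print(fill_pattern(gtc_pattern, row))  # /data/ProjA/Samp0001.gtc
```

A wildcard such as `{Plate}` would fail here because the row has no `Plate` column, which is why wildcards have to match the sample sheet column names exactly.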

If idat_pattern is given and workflow_params.convert_idat2gtc=true, this triggers the IDAT entry point, which converts IDATs to GTCs and then resumes at the GTC entry point. To convert IDATs to GTCs we use Illumina’s DRAGEN Array software, which must be available either as a module or through the path to its binary given in workflow_params.dragena_location. If you are on CCAD2, a module is already installed and dragena_location is not needed. You might also need to provide the cluster EGT file using reference_files.illumina_cluster_file.

If the gtc_pattern is given, then this will trigger the GTC entry point. There are two methods to convert gtc files to sample_level/samples.{bed,bim,fam}. If workflow_params.convert_gtc2bcf=false (default), we will convert each sample’s GTC to a PED/MAP and then aggregate all samples and convert to a single BED/BIM/FAM at sample_level/samples.{bed,bim,fam}. If workflow_params.convert_gtc2bcf=true, we will convert GTCs to an aggregated BCF and convert to BED/BIM/FAM at sample_level/samples.{bed,bim,fam}.

If the PED/MAP files are given, then this will trigger the PED/MAP entry point, which will convert these files to a single BED/BIM/FAM at sample_level/samples.{bed,bim,fam}.

If the BED/BIM/FAM files are given, then this will trigger the BED/BIM/FAM entry point, which will create a symbolic link from your BED/BIM/FAM to sample_level/samples.{bed,bim,fam}.

If a BCF is given, then this will trigger the BCF entry point, which will convert the BCF to BED/BIM/FAM at sample_level/samples.{bed,bim,fam}.
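
The entry point rules above can be summarized as a small decision function. This is a sketch under assumptions: the function `pick_entry_point`, the returned labels, and the order in which the options are tested are illustrative and may differ from how the workflow resolves them.

```python
def pick_entry_point(user_files: dict, workflow_params: dict) -> str:
    """Name the entry point implied by a parsed config, following the rules above."""
    if user_files.get("idat_pattern") and workflow_params.get("convert_idat2gtc"):
        return "idat"  # IDATs are converted to GTCs first
    if user_files.get("gtc_pattern"):
        return "gtc"
    if user_files.get("ped") and user_files.get("map"):
        return "ped"
    if all(user_files.get(key) for key in ("bed", "bim", "fam")):
        return "bed"
    if user_files.get("bcf"):
        return "bcf"
    raise ValueError("user_files does not describe any supported input")


# Default-style config: both patterns present, IDAT conversion switched off.
default_files = {
    "idat_pattern": {"red": "{Sample_ID}_Red.idat", "green": "{Sample_ID}_Grn.idat"},
    "gtc_pattern": "{Sample_ID}.gtc",
}
print(pick_entry_point(default_files, {"convert_idat2gtc": False}))  # gtc
```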

Software Parameters

Software parameters used by various tools in the workflow.

software_params:
   strand: top
   sample_call_rate_1: 0.8
   snp_call_rate_1: 0.8
   sample_call_rate_2: 0.95
   snp_call_rate_2: 0.95
   intensity_threshold: 6000
   contam_threshold: 0.1
   contam_population: AF
   ld_prune_r2: 0.1
   maf_for_ibd: 0.2
   maf_for_hwe: 0.05
   ibd_pi_hat_min: 0.12
   ibd_pi_hat_max: 1.0
   dup_concordance_cutoff: 0.95
   pi_hat_threshold: 0.2
   autosomal_het_threshold: 0.1
type	object
properties
strand	Strand
	Which Illumina strand to use for genotypes when converting GTC to plink {TOP, FWD, PLUS}. `workflow/scripts/gtc2plink.py`
	type	string
	default	top
sample_call_rate_1	Sample Call Rate 1
	Sample call rate filter 1 threshold. `plink --mind (1 - sample_call_rate_1)`; reference
	type	number
	maximum	1
	exclusiveMinimum	0
	default	0.8
snp_call_rate_1	Snp Call Rate 1
	SNP call rate filter 1 threshold. `plink --geno (1 - snp_call_rate_1)`; reference
	type	number
	maximum	1
	exclusiveMinimum	0
	default	0.8
sample_call_rate_2	Sample Call Rate 2
	Sample call rate filter 2 threshold. `plink --mind (1 - sample_call_rate_2)`; reference
	type	number
	maximum	1
	exclusiveMinimum	0
	default	0.95
snp_call_rate_2	Snp Call Rate 2
	SNP call rate filter 2 threshold. `plink --geno (1 - snp_call_rate_2)`; reference
	type	number
	maximum	1
	exclusiveMinimum	0
	default	0.95
intensity_threshold	Intensity Threshold
	Median IDAT intensity threshold used to filter samples during the contamination check. `workflow/scripts/agg_contamination.py`
	type	integer
	exclusiveMinimum	0
	default	6000
contam_threshold	Contam Threshold
	%Mix cutoff to consider a sample as contaminated. `workflow/scripts/agg_contamination.py`
	type	number
	maximum	1
	exclusiveMinimum	0
	default	0.1
contam_population	Contam Population
	Which population from the 1000 Genomes project to use for B-allele frequencies during contamination testing. Can be one of {AF, EAS_AF, AMR_AF, AFR_AF, EUR_AF, SAS_AF}. `workflow/scripts/bpm2abf.py`
	type	string
	default	AF
ld_prune_r2	Ld Prune R2
	The r-squared threshold for LD pruning of SNPs for use with IBD and replicate concordance. `plink --indep-pairwise 50 5 ld_prune_r2`; reference
	type	number
	maximum	1
	exclusiveMinimum	0
	default	0.1
maf_for_ibd	Maf For Ibd
	The minor allele frequency threshold of SNPs for use with IBD and replicate concordance. `plink --maf maf_for_ibd`; reference
	type	number
	maximum	1
	exclusiveMinimum	0
	default	0.2
maf_for_hwe	Maf For Hwe
	The minor allele frequency threshold of SNPs for use with population level HWE estimates. `plink --maf maf_for_hwe`; reference
	type	number
	maximum	1
	exclusiveMinimum	0
	default	0.05
ibd_pi_hat_min	Ibd Pi Hat Min
	The minimum IBD pi hat value to save in the results table. `plink --genome full --min ibd_pi_hat_min`; reference
	type	number
	exclusiveMaximum	1
	minimum	0
	default	0.12
ibd_pi_hat_max	Ibd Pi Hat Max
	The maximum IBD pi hat value to save in the results table. `plink --genome full --max ibd_pi_hat_max`; reference
	type	number
	maximum	1
	exclusiveMinimum	0
	default	1.0
dup_concordance_cutoff	Dup Concordance Cutoff
	The concordance threshold to consider two samples as replicates. `workflow/scripts/concordance_table.py`
	type	number
	maximum	1
	exclusiveMinimum	0
	default	0.95
pi_hat_threshold	Pi Hat Threshold
	The pi hat threshold to consider two samples as related. The default of 0.2 reports 1st and 2nd degree relatives. `workflow/scripts/concordance_table.py`
	type	number
	maximum	1
	exclusiveMinimum	0
	default	0.2
autosomal_het_threshold	Autosomal Het Threshold
	The autosomal heterozygosity F coefficient threshold at which to flag subjects for removal. `workflow/scripts/plot_autosomal_heterozygosity.py`
	type	number
	maximum	1
	exclusiveMinimum	0
	default	0.1
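
The call rate options are expressed as call rates, while plink's `--mind` and `--geno` filters take a missing rate, hence the `1 - call_rate` expressions in the table. A minimal sketch of that conversion follows. The helper `missingness_args` is hypothetical, and the range check mirrors the `exclusiveMinimum 0` and `maximum 1` bounds listed for these options.

```python
def missingness_args(flag: str, call_rate: float) -> list:
    """Turn a call rate threshold into a plink missingness argument pair."""
    if flag not in ("--mind", "--geno"):
        raise ValueError("flag must be --mind (samples) or --geno (SNPs)")
    if not 0 < call_rate <= 1:
        raise ValueError("call rate must be greater than 0 and at most 1")
    # plink filters on the missing rate, which is 1 - call rate.
    # The format spec hides floating point noise such as 0.19999999999999996.
    return [flag, f"{1 - call_rate:.6g}"]


print(missingness_args("--mind", 0.8))   # ['--mind', '0.2']
print(missingness_args("--geno", 0.95))  # ['--geno', '0.05']
```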

Workflow Parameters

This set of parameters controls which parts of the workflow are run and how they are executed.

workflow_params:
   subject_id_column: Group_By
   expected_sex_column: Expected_Sex
   sex_chr_included: true
   case_control_column: Case/Control_Status
   remove_contam: true
   remove_rep_discordant: true
   minimum_pop_subjects: 50
   control_hwp_threshold: 50
   lims_upload: true
   lims_output_dir: /DCEG/CGF/Laboratory/LIMS/drop-box-prod/gwas_primaryqc/
   case_control_gwas: false
   max_time_hr:
   max_mem_mb:
   time_start:
   convert_gtc2bcf: false
   additional_params_for_gtc2bcf: --use-gtc-sample-names
   convert_idat2gtc: false
   dragena_location:
   concordance_tools:
      graf: true
      king: true
      plink: true
type	object
properties
subject_id_column	Subject Id Column
	The column in your sample sheet that contains the subject ID. We expect that there may be multiple samples (rows) that have the same subject ID. If there are multiple columns that you want to use for subject ID, you can create a special column named `Group_By`. This is a column that contains the name of the column to use as subject ID for that given sample (row).
	type	string
	default	Group_By
expected_sex_column	Expected Sex Column
	The name of the column in the sample sheet which identifies the expected sex of samples. Allowed values in this column are [`F` | `M` | `U`].
	type	string
	default	Expected_Sex
sex_chr_included	Sex Chr Included
	True if the sex chromosome is included in the microarray and a sex concordance check can be performed.
	type	boolean
	default	True
ancestry_snps_included	Ancestry Snps Included
	True if the ancestry informative SNPs are included in the microarray and a GRAF ancestry check can be performed.
	type	boolean
	default	True
case_control_column	Case Control Column
	The name of the column in the sample sheet which identifies Case/Control status. Allowed values in this column are [`Control` | `Case` | `QC` | `Unknown`].
	type	string
	default	Case/Control_Status
remove_contam	Remove Contam
	True if you want to remove contaminated samples before running the subject level QC.
	type	boolean
	default	True
remove_rep_discordant	Remove Rep Discordant
	True if you want to remove discordant replicates before running the subject level QC.
	type	boolean
	default	True
minimum_pop_subjects	Minimum Pop Subjects
	The minimum number of samples needed to use a population for population level QC (PCA, Autosomal Heterozygosity).
	type	integer
	exclusiveMinimum	0
	default	50
control_hwp_threshold	Control Hwp Threshold
	The minimum number of control samples needed to use a population for population level QC (HWE).
	type	integer
	exclusiveMinimum	0
	default	50
lims_upload	Lims Upload
	For `CGEMS/CCAD` use only, will place a copy of the LimsUpload file in the root directory.
	type	boolean
	default	True
lims_output_dir	Lims Output Dir
	type	string
	default	/DCEG/CGF/Laboratory/LIMS/drop-box-prod/gwas_primaryqc/
case_control_gwas	Case Control Gwas
	A plink logistic regression gwas will be performed with case_control phenotype.
	type	boolean
	default	False
max_time_hr	Max Time Hr
	The maximum amount of time that can be requested, in hours.
	type	integer
max_mem_mb	Max Mem Mb
	The maximum amount of memory that can be requested, in megabytes.
	type	integer
time_start	Time Start
	Date and time at which the workflow starts. This creates a unique id for the run.
	type	string
	default	20250416195829
convert_gtc2bcf	Convert Gtc2Bcf
	If input is GTC, this switches between gtc2vcf (True) and gtc2ped (False - default) for conversion to BED.
	type	boolean
	default	False
additional_params_for_gtc2bcf	Additional Params For Gtc2Bcf
	Additional/optional parameters not hardcoded to be used or skipped in gtc2bcf for specific analysis.
	type	string
	default	--use-gtc-sample-names
convert_idat2gtc	Convert Idat2Gtc
	If idat_pattern is provided and convert_idat2gtc is True, idat2gtc will be triggered in entry_points.
	type	boolean
	default	False
dragena_location	Dragena Location
	Path to dragena binary. If dragena is not available as a module on HPC and IDAT entry_point is used, dragena_location will be used to convert idat2gtc.
	type	string
concordance_tools	Concordance Tools
	The sample_concordance_summary only uses Plink. If the outputs of the graf and king relationship checks are needed, this option can be configured. Please note that even if graf and king relatedness checks are requested and executed, they are for reference only and are not considered in sample_concordance_summary.
	default	OrderedDict([(‘graf’, True), (‘king’, True), (‘plink’, True)])
	allOf		#/definitions/ConcordanceTools
definitions
ConcordanceTools	ConcordanceTools
	type	object
	properties
	graf	Graf
		Are graf relatedness results needed? Even if True, the results won’t be used in sample_concordance.
		type	boolean
		default	True
	king	King
		Are king relatedness results needed? Even if True, the results won’t be used in sample_concordance.
		type	boolean
		default	True
	plink	Plink
		Are plink ibd relatedness results needed? It is the primary tool used in the sample_qc report. If false, no replicate/relatedness check will be considered in sample_qc.
		type	boolean
		default	True
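
The `Group_By` behaviour described for subject_id_column amounts to a per-row indirection: the `Group_By` cell names the column that holds the subject ID for that row. A minimal sketch of that lookup, with a hypothetical helper `subject_id` and made-up column names `Subject_ID` and `Pool_ID`:

```python
def subject_id(row: dict, subject_id_column: str = "Group_By") -> str:
    """Return the subject ID for one sample sheet row."""
    if subject_id_column == "Group_By":
        # The Group_By cell holds the *name* of the column to read for this row.
        return row[row["Group_By"]]
    return row[subject_id_column]


rows = [
    {"Sample_ID": "Samp0001", "Group_By": "Subject_ID", "Subject_ID": "Sub0001", "Pool_ID": "P1"},
    {"Sample_ID": "Samp0002", "Group_By": "Pool_ID", "Subject_ID": "Sub0002", "Pool_ID": "P1"},
]
print([subject_id(row) for row in rows])  # ['Sub0001', 'P1']
```

If subject_id_column names an ordinary column instead, every row is read from that one column.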

Sample IDs to Remove

This is an optional section where you can list the Sample_IDs that you do not want to include in the QC run. These samples will be indicated as is_user_exclusion = True in the sample level QC table.

Sample_IDs_to_remove:
   - Sample0001
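
Conceptually, the list acts as a lookup that sets the is_user_exclusion flag for each sample. The sketch below shows the idea only; the helper `flag_user_exclusions` is hypothetical and is not the pipeline's own code.

```python
def flag_user_exclusions(sample_ids, ids_to_remove):
    """Map each Sample_ID to its is_user_exclusion flag."""
    remove = set(ids_to_remove or [])  # the section is optional, so it may be empty
    return {sample_id: sample_id in remove for sample_id in sample_ids}


flags = flag_user_exclusions(["Sample0001", "Sample0002"], ["Sample0001"])
print(flags)  # {'Sample0001': True, 'Sample0002': False}
```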

Full Example

pipeline_version: v1.0.0
project_name: SR0001-001_1_0000000
sample_sheet: /path/to/manifest/file/SR0001-001_1_AnalysisManifest_0000000.csv
genome_build: hg37
snp_array: GSAMD-24v1-0
num_samples: 6336
num_snps: 700078
reference_files:
   illumina_manifest_file: /path/to/bpm/file/GSAMD-24v1-0_20011747_A1.bpm
   thousand_genome_vcf: /path/to/thousand/genome/ALL.wgs.phase3_shapeit2_mvncall_integrated_v5.20130502.sites.vcf.gz
   thousand_genome_tbi: /path/to/thousand/genome/ALL.wgs.phase3_shapeit2_mvncall_integrated_v5.20130502.sites.vcf.gz.tbi
   reference_fasta: /path/to/reference/GwasQcPipeline-test-data/GCA_000001405.15_GRCh38_full_analysis_set.fna.bgz

user_files:
   output_pattern: '{prefix}/{file_type}.{ext}'
   idat_pattern:
      red: /expample/pattern/wildcards/are/columns/in/sample_sheet_file/{Project}/{Sample_ID}_Red.idat
      green: /expample/pattern/wildcards/are/columns/in/sample_sheet_file/{Project}/{Sample_ID}_Grn.idat
   gtc_pattern: /expample/pattern/wildcards/are/columns/in/sample_sheet_file/{Project}/{Sample_ID}.gtc
software_params:
   sample_call_rate_1: 0.8
   snp_call_rate_1: 0.8
   sample_call_rate_2: 0.95
   snp_call_rate_2: 0.95
   ld_prune_r2: 0.1
   maf_for_ibd: 0.2
   maf_for_hwe: 0.05
   ibd_pi_hat_min: 0.12
   ibd_pi_hat_max: 1.0
   dup_concordance_cutoff: 0.95
   intensity_threshold: 6000
   contam_threshold: 0.1
   contam_population: AF
   pi_hat_threshold: 0.2
   autosomal_het_threshold: 0.1
   strand: top
workflow_params:
   subject_id_column: Group_By
   expected_sex_column: Expected_Sex
   case_control_column: Case/Control_Status
   remove_contam: true
   remove_rep_discordant: true
   minimum_pop_subjects: 50
   control_hwp_threshold: 50
   lims_upload: true
   lims_output_dir: /example/location/to/place/lims/upload/file
   time_start: '20240227130627'
   convert_gtc2bcf: false
Sample_IDs_to_remove:
   - Sample0001