Configuration Files

Sample Sheet File

A sample sheet is a CSV formatted file where each row represents a sample and the columns contain various metadata. This file must contain a column named Sample_ID which has a unique ID for each row. You also need columns describing (1) the subject ID, which groups samples from the same individual, (2) the expected sex of the individual, and (3) the case control status of the individual. See the Workflow Parameters section below for details about what values are allowed for subject ID, expected sex, and case control.

Note

Example of the basic required structure of a sample sheet. You can have any other metadata as additional columns. The only required column name is Sample_ID; the other columns can be named anything as long as they are referenced correctly in Workflow Parameters.

Sample_ID,Subject_ID,Expected_Sex,Case_Control
Samp0001,Sub0001,M,Case
Samp0002,Sub0001,M,Case
Samp0003,Sub0002,F,Control
Samp0004,Sub0003,F,QC
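
The requirements above can be checked before running the workflow. The following is a minimal sketch, not part of the pipeline: the function name validate_sample_sheet is hypothetical, and the default column names are taken from the example sheet above rather than from the workflow defaults. The allowed values are the ones listed under Workflow Parameters.

```python
import csv
import io

ALLOWED_SEX = {"F", "M", "U"}
ALLOWED_CASE_CONTROL = {"Control", "Case", "QC", "Unknown"}


def validate_sample_sheet(handle, sex_col="Expected_Sex", cc_col="Case_Control"):
    """Return a list of human-readable problems found in the sample sheet."""
    reader = csv.DictReader(handle)
    if "Sample_ID" not in (reader.fieldnames or []):
        return ["missing required column: Sample_ID"]
    problems = []
    seen = set()
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        sample_id = row["Sample_ID"]
        if sample_id in seen:
            problems.append(f"line {line_no}: duplicate Sample_ID {sample_id!r}")
        seen.add(sample_id)
        if row.get(sex_col) not in ALLOWED_SEX:
            problems.append(f"line {line_no}: unexpected sex {row.get(sex_col)!r}")
        if row.get(cc_col) not in ALLOWED_CASE_CONTROL:
            problems.append(f"line {line_no}: unexpected case/control {row.get(cc_col)!r}")
    return problems


sheet = """Sample_ID,Subject_ID,Expected_Sex,Case_Control
Samp0001,Sub0001,M,Case
Samp0002,Sub0001,M,Case
Samp0003,Sub0002,F,Control
Samp0004,Sub0003,F,QC
"""
print(validate_sample_sheet(io.StringIO(sheet)))  # an empty list means no problems
```

Note that repeated subject IDs (Sub0001 above) are fine; only Sample_ID has to be unique.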

CGR LIMs Manifest File

The CGR LIMs manifest file is our internal way to distribute sample information. If you use a Sample Sheet File then you can ignore this section. The manifest file is an INI-like file with three sections (header, manifests, data). During cgr pre-flight we will pull out the data section. This file should already have all of the required columns. See the Workflow Parameters section below for details about what values are allowed for subject ID, expected sex, and case control.
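
To illustrate what "pulling out the data section" means, here is a rough sketch. It assumes the section headers are written as [Header], [Manifests], and [Data] and that the data section is comma separated with its own header row; neither detail is confirmed by this page, and this is not how cgr pre-flight is implemented. The manifest contents are made up.

```python
import csv
import io


def extract_data_section(text, section="[Data]"):
    """Return the rows under `section` as a list of dicts (first row is the header)."""
    lines = text.splitlines()
    start = None
    for i, line in enumerate(lines):
        # tolerate trailing commas left by spreadsheet exports, e.g. "[Data],,,"
        if line.strip().rstrip(",") == section:
            start = i + 1
            break
    if start is None:
        raise ValueError(f"no {section} section found")
    body = []
    for line in lines[start:]:
        if line.strip().startswith("["):  # the next section begins
            break
        if line.strip():
            body.append(line)
    return list(csv.DictReader(io.StringIO("\n".join(body))))


# Hypothetical manifest, for illustration only
manifest = """[Header]
Project Name,SR0001-001
[Manifests]
A,GSAMD-24v1-0
[Data]
Sample_ID,Group_By,Expected_Sex
Samp0001,Sub0001,M
"""
rows = extract_data_section(manifest)
print(rows[0]["Sample_ID"])
```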

The config.yml

In Running the Pipeline, we created the config.yml using Creating Configuration. This file is central to running the CGR GwasQcPipeline. Here we will describe the main configuration options. The config file is broken down into different sections called “namespaces”. We will walk through each section below and end this page with a full example.

Note

Required options without defaults are in bold.

Top Level Config

config.yml data model.

pipeline_version: v1.0.0
slurm_partition: defq
project_name: SR0001-001_1_0000000
sample_sheet: /path/to/manifest/file/SR0001-001_1_AnalysisManifest_0000000.csv
genome_build: hg37
snp_array: GSAMD-24v1-0
num_samples: 6336
num_snps: 700078
reference_files: ...  # Reference file namespace
user_files: ...  # User file namespace
software_params: ...  # Software parameter namespace
workflow_params: ...  # Workflow parameter namespace
Sample_IDs_to_remove:

type

object

properties

  • pipeline_version

Pipeline Version

The version of the GwasQcPipeline to use. If you want to use a different version you may just edit this value to match. However, it is suggested that you re-run the entire pipeline in case there are differences between versions.

type

string

default

v1.8.0-rc3

  • slurm_partition

Slurm Partition

Name of the Slurm partition to which jobs will be submitted when using the --slurm-generic submit option.

type

string

  • project_name

Project Name

The title of the project to use during report generation.

type

string

  • sample_sheet

Sample Sheet

The path to the sample sheet (or LIMs manifest). This is the file referenced during cgr config and used to generate cgr_sample_sheet.csv.

type

string

format

path

  • genome_build

The human genome build. This field is not actually used by the workflow. It is only used during cgr config to select the VCF file when running on CGEMS/CCAD.

type

string; [hg37|hg38]

default

hg37

  • snp_array

Snp Array

Which SNP array was used. Only used for reporting.

type

string

  • num_samples

Num Samples

Number of samples, automatically calculated from the sample sheet during cgr config.

type

integer

  • num_snps

Num Snps

Number of markers, automatically calculated from the reference_files.illumina_manifest_file. We will attempt to calculate this during cgr config and cgr pre-flight.

type

integer

  • reference_files

Reference Files

Reference file namespace. Reference files include the Illumina provided BPM and the 1000 Genomes VCF.

type

namespace

  • user_files

User Files

User file namespace. User files include the user provided genotype data in BCF, IDAT/GTC, PED/MAP or BED/BIM/FAM formats.

type

namespace

  • software_params

Software Params

Software parameter namespace. This includes all parameters passed to 3rd party and internal software and scripts.

type

namespace

  • workflow_params

Workflow Params

Workflow parameter namespace. This includes all parameters used to control workflow behavior.

type

namespace

  • Sample_IDs_to_remove

Sample Ids To Remove

A list of Sample_IDs to exclude from QC. This is the easiest way for a user to exclude specific samples from the GwasQcPipeline. These samples will be flagged as is_user_exclusion and will be present in the report, but they will not have any results from this analysis.

type

list of strings

The config.yml file contains all workflow configurations. The top level section is automatically populated by cgr config and cgr pre-flight. You should not need to edit this section unless you want to (1) update the version of the workflow you are using, (2) change the project name, or (3) switch human references.

Reference Files

A list of reference files used by the pipeline.

reference_files:
    illumina_manifest_file: /path/to/bpm/file/GSAMD-24v1-0_20011747_A1.bpm
    illumina_csv_bpm: /path/to/csv/file/GSAMD-24v1-0_20011747_A1.csv
    thousand_genome_vcf: /path/to/thousand/genome/ALL.wgs.phase3_shapeit2_mvncall_integrated_v5.20130502.sites.vcf.gz
    thousand_genome_tbi: /path/to/thousand/genome/ALL.wgs.phase3_shapeit2_mvncall_integrated_v5.20130502.sites.vcf.gz.tbi
    reference_fasta: /path/to/reference/fasta/GCA_000001405.15_GRCh38_full_analysis_set.fna.bgz

type

object

properties

  • illumina_manifest_file

Illumina Manifest File

Path to the Illumina provided BPM file.

type

string

format

path

  • illumina_cluster_file

Illumina Cluster File

Path to the array cluster EGT file.

type

string

format

path

  • thousand_genome_vcf

Thousand Genome Vcf

Path to the 1000 Genomes VCF file.

type

string

format

path

  • thousand_genome_tbi

Thousand Genome Tbi

Path to the corresponding index for the 1000 Genomes VCF file.

type

string

format

path

  • reference_fasta

Reference Fasta

Path to the reference FASTA file used to convert GTC to BCF. The file may be compressed with bgzip, but not with gzip.

type

string

format

path

  • illumina_csv_bpm

Illumina Csv Bpm

Path to the CSV bead pool manifest provided by Illumina, used for GTC to BCF conversion. If illumina_csv_bpm is not provided, insertions/deletions will be skipped in the GTC to BCF conversion.

type

string

format

path

If you are on CGEMS/CCAD then the paths of the reference files are correctly populated by cgr config --cgems or cgr config --cgems-dev. If you are on another system then you must provide the correct paths and versions of these files. The illumina_manifest_file is provided by Illumina. The thousand_genome_vcf and thousand_genome_tbi files can be downloaded from the 1000 Genomes website. The illumina_cluster_file is the EGT file used to generate GTCs; this file is not required and is only referenced in the QC report if present. The reference_fasta is used to convert GTCs to VCF and is needed if GTCs are provided.

1000 Genomes reference files download links
reference fasta download links

After downloading the reference fasta, it needs to be converted from gz format to bgz format. This can be done with:

zcat GCA_000001405.15_GRCh38_full_analysis_set.fna.gz | bgzip -c > GCA_000001405.15_GRCh38_full_analysis_set.fna.bgz
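
Because gzip and bgzip output share the same magic bytes, the file extension alone does not tell you which compressor produced a file. As a quick sanity check, the sketch below looks for the BC extra subfield that the BGZF format places in each block header. The function names are hypothetical and the check only inspects the first block as bgzip writes it.

```python
import struct


def looks_like_bgzf(header):
    """Return True if `header` (the first 18 bytes of a file) is a BGZF block header."""
    if len(header) < 18:
        return False
    # gzip magic (1f 8b), deflate method (08), FEXTRA flag set (04)
    if header[:4] != b"\x1f\x8b\x08\x04":
        return False
    xlen = struct.unpack("<H", header[10:12])[0]
    # BGZF stores a 'B','C' subfield of length 2 in the gzip extra field
    slen = struct.unpack("<H", header[14:16])[0]
    return xlen >= 6 and header[12:14] == b"BC" and slen == 2


def is_bgzf(path):
    """Check the first block header of the file at `path`."""
    with open(path, "rb") as fh:
        return looks_like_bgzf(fh.read(18))


if __name__ == "__main__":
    import sys

    for p in sys.argv[1:]:
        print(p, "BGZF" if is_bgzf(p) else "not BGZF")
```

Run it against the converted FASTA before pointing reference_fasta at it; a plain gzip file will be reported as not BGZF.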

Warning

The genome build of illumina_manifest_file and thousand_genome_* needs to match genome_build above. Currently we only support hg37 and hg38. In practice, as long as illumina_manifest_file and thousand_genome_* are from the same build, everything should work.

User Files

A list of user provided files or naming patterns.

user_files:
    output_pattern: '{prefix}/{file_type}.{ext}'
    idat_pattern:
        red: /example/pattern/wildcards/are/columns/in/sample_sheet_file/{Project}/{Sample_ID}_Red.idat
        green: /example/pattern/wildcards/are/columns/in/sample_sheet_file/{Project}/{Sample_ID}_Grn.idat
    gtc_pattern: /example/pattern/wildcards/are/columns/in/sample_sheet_file/{Project}/{Sample_ID}.gtc

or

user_files:
    output_pattern: '{prefix}/{file_type}.{ext}'
    ped: /path/to/samples.ped
    map: /path/to/samples.map

or

user_files:
    output_pattern: '{prefix}/{file_type}.{ext}'
    bed: /path/to/samples.bed
    bim: /path/to/samples.bim
    fam: /path/to/samples.fam

or

user_files:
    output_pattern: '{prefix}/{file_type}.{ext}'
    bcf: /path/to/samples.bcf

Note

The BCF file, IDAT/GTC patterns, PED/MAP paths, BED/BIM/FAM paths are all mutually exclusive. You should only provide one set of patterns/paths.

type

object

properties

  • output_pattern

Output Pattern

File naming pattern for deliverable files. In general, you should not need to edit this. However, if you do decide to change this pattern you must keep the {prefix}, {file_type}, and {ext} patterns. These are filled in by the workflow, see the delivery sub-workflow for more details.

type

string

default

{prefix}/{file_type}.{ext}

  • idat_pattern

Idat Pattern

File naming pattern for IDAT files. There are two IDAT files for each sample (red and green). You need to provide a file naming pattern for each IDAT file. Wildcards are indicated by {}. Wildcards have to match column names in the sample sheet exactly.

type

namespace

  • gtc_pattern

Gtc Pattern

File naming pattern for GTC files. Wildcards are indicated by {}. Wildcards have to match column names in the sample sheet exactly.

type

string

  • ped

Ped

The full path to an aggregated PED file if the sample level GTC files are not available.

type

string

format

path

  • map

Map

The full path to an aggregated MAP file if the sample level GTC files are not available.

type

string

format

path

  • bed

Bed

The full path to an aggregated BED file if the sample level GTC files are not available.

type

string

format

path

  • bim

Bim

The full path to an aggregated BIM file if the sample level GTC files are not available.

type

string

format

path

  • fam

Fam

The full path to an aggregated FAM file if the sample level GTC files are not available.

type

string

format

path

  • bcf

Bcf

The full path to an aggregated BCF/VCF file, preferably encoding the GenCall scores.

type

string

format

path

The user_files section has a number of mutually exclusive configurations depending on the starting file types. By default we assume you will be starting with IDAT and GTC files, but we also accept aggregated PED/MAP, BED/BIM/FAM, or BCF files. In the example, {Project} and {Sample_ID} will be filled by values from the Project and Sample_ID columns in cgr_sample_sheet.csv.
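
The wildcard substitution described above behaves much like Python string formatting. This is only an illustration of the idea, using a hypothetical helper name and made-up paths; the workflow's own substitution mechanism may differ in detail.

```python
def expand_pattern(pattern, row):
    """Fill {wildcards} in `pattern` from one sample sheet row (a dict)."""
    try:
        # format_map accepts any column name as a key, including ones with "/"
        return pattern.format_map(row)
    except KeyError as err:
        raise ValueError(f"wildcard {err} is not a column in the sample sheet") from None


# One row of cgr_sample_sheet.csv, reduced to the columns the pattern needs
row = {"Sample_ID": "Samp0001", "Project": "ProjA"}
gtc_pattern = "/data/{Project}/{Sample_ID}.gtc"  # made-up path
print(expand_pattern(gtc_pattern, row))  # /data/ProjA/Samp0001.gtc
```

A wildcard that does not match a column name exactly, including case, raises an error here, which mirrors the exact-match requirement stated for idat_pattern and gtc_pattern.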

If the gtc_pattern is given, then this will trigger the GTC entry point. There are two methods to convert gtc files to sample_level/samples.{bed,bim,fam}. If workflow_params.convert_gtc2bcf=false (default), we will convert each sample’s GTC to a PED/MAP and then aggregate all samples and convert to a single BED/BIM/FAM at sample_level/samples.{bed,bim,fam}. If workflow_params.convert_gtc2bcf=true, we will convert GTCs to an aggregated BCF and convert to BED/BIM/FAM at sample_level/samples.{bed,bim,fam}.

If the PED/MAP files are given, then this will trigger the PED/MAP entry point, which will convert these files to a single BED/BIM/FAM at sample_level/samples.{bed,bim,fam}.

If the BED/BIM/FAM files are given, then this will trigger the BED/BIM/FAM entry point, which will create a symbolic link from your BED/BIM/FAM to sample_level/samples.{bed,bim,fam}.

If a BCF is given, then this will trigger the BCF entry point, which will convert the BCF to BED/BIM/FAM at sample_level/samples.{bed,bim,fam}.
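
The entry point selection can be summarized as a lookup over which keys are present in user_files. The sketch below is an assumption about the logic, not the workflow's code: the function name and error message are hypothetical, and it treats a set as given only when all of its keys have values.

```python
def entry_point(user_files):
    """Return the entry point that a user_files mapping would trigger."""
    groups = {
        "GTC": ["gtc_pattern"],
        "PED/MAP": ["ped", "map"],
        "BED/BIM/FAM": ["bed", "bim", "fam"],
        "BCF": ["bcf"],
    }
    complete = [
        name for name, keys in groups.items()
        if all(user_files.get(key) for key in keys)
    ]
    if len(complete) != 1:
        # the input sets are mutually exclusive, so zero or several is an error
        raise ValueError(f"expected exactly one input set, found: {complete or 'none'}")
    return complete[0]


print(entry_point({"output_pattern": "{prefix}/{file_type}.{ext}", "bcf": "/path/to/samples.bcf"}))  # BCF
```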

Software Parameters

Software parameters used by various tools in the workflow.

software_params:
    strand: top

    sample_call_rate_1: 0.8
    snp_call_rate_1: 0.8
    sample_call_rate_2: 0.95
    snp_call_rate_2: 0.95

    intensity_threshold: 6000
    contam_threshold: 0.1
    contam_population: AF

    ld_prune_r2: 0.1
    maf_for_ibd: 0.2
    maf_for_hwe: 0.05

    ibd_pi_hat_min: 0.12
    ibd_pi_hat_max: 1.0
    dup_concordance_cutoff: 0.95
    pi_hat_threshold: 0.2

    autosomal_het_threshold: 0.1
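
Each numeric option in this namespace has bounds, which are listed with the option in the property list that follows. If you edit these values by hand, a small range check can catch mistakes early. The helper below is a hypothetical sketch that mirrors those documented bounds; it is not the validation the pipeline itself performs.

```python
def check_software_params(params):
    """Return a list of out-of-range settings; an empty list means all checks passed."""
    unit_interval = [  # documented as exclusiveMinimum 0, maximum 1
        "sample_call_rate_1", "snp_call_rate_1", "sample_call_rate_2", "snp_call_rate_2",
        "contam_threshold", "ld_prune_r2", "maf_for_ibd", "maf_for_hwe",
        "ibd_pi_hat_max", "dup_concordance_cutoff", "pi_hat_threshold",
        "autosomal_het_threshold",
    ]
    errors = []
    for key in unit_interval:
        if key in params and not (0 < params[key] <= 1):
            errors.append(f"{key} must be in (0, 1], got {params[key]}")
    # ibd_pi_hat_min is documented as minimum 0, exclusiveMaximum 1
    if "ibd_pi_hat_min" in params and not (0 <= params["ibd_pi_hat_min"] < 1):
        errors.append(f"ibd_pi_hat_min must be in [0, 1), got {params['ibd_pi_hat_min']}")
    # intensity_threshold is documented as an integer greater than 0
    if "intensity_threshold" in params and not params["intensity_threshold"] > 0:
        errors.append(f"intensity_threshold must be > 0, got {params['intensity_threshold']}")
    return errors


defaults = {
    "sample_call_rate_1": 0.8, "snp_call_rate_1": 0.8,
    "sample_call_rate_2": 0.95, "snp_call_rate_2": 0.95,
    "intensity_threshold": 6000, "contam_threshold": 0.1,
    "ld_prune_r2": 0.1, "maf_for_ibd": 0.2, "maf_for_hwe": 0.05,
    "ibd_pi_hat_min": 0.12, "ibd_pi_hat_max": 1.0,
    "dup_concordance_cutoff": 0.95, "pi_hat_threshold": 0.2,
    "autosomal_het_threshold": 0.1,
}
print(check_software_params(defaults))  # []
```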

type

object

properties

  • strand

Strand

Which Illumina strand to use for genotypes when converting GTC to plink; one of {TOP, FWD, PLUS}. workflow/scripts/gtc2plink.py

type

string

default

top

  • sample_call_rate_1

Sample Call Rate 1

Sample call rate filter 1 threshold. plink --mind (1 - sample_call_rate_1); reference

type

number

maximum

1

exclusiveMinimum

0

default

0.8

  • snp_call_rate_1

Snp Call Rate 1

SNP call rate filter 1 threshold. plink --geno (1 - snp_call_rate_1); reference

type

number

maximum

1

exclusiveMinimum

0

default

0.8

  • sample_call_rate_2

Sample Call Rate 2

Sample call rate filter 2 threshold. plink --mind (1 - sample_call_rate_2); reference

type

number

maximum

1

exclusiveMinimum

0

default

0.95

  • snp_call_rate_2

Snp Call Rate 2

SNP call rate filter 2 threshold. plink --geno (1 - snp_call_rate_2); reference

type

number

maximum

1

exclusiveMinimum

0

default

0.95

  • intensity_threshold

Intensity Threshold

Median IDAT intensity threshold used to filter samples during the contamination check. workflow/scripts/agg_contamination.py

type

integer

exclusiveMinimum

0

default

6000

  • contam_threshold

Contam Threshold

%Mix cutoff to consider a sample as contaminated. workflow/scripts/agg_contamination.py

type

number

maximum

1

exclusiveMinimum

0

default

0.1

  • contam_population

Contam Population

Which population from the 1000 Genomes project to use for B-allele frequencies during contamination testing. Can be one of {AF, EAS_AF, AMR_AF, AFR_AF, EUR_AF, SAS_AF}. workflow/scripts/bpm2abf.py

type

string

default

AF

  • ld_prune_r2

Ld Prune R2

The r-squared threshold for LD pruning of SNPs for use with IBD and replicate concordance. plink --indep-pairwise 50 5 ld_prune_r2; reference

type

number

maximum

1

exclusiveMinimum

0

default

0.1

  • maf_for_ibd

Maf For Ibd

The minor allele frequency threshold of SNPs for use with IBD and replicate concordance. plink --maf maf_for_ibd; reference

type

number

maximum

1

exclusiveMinimum

0

default

0.2

  • maf_for_hwe

Maf For Hwe

The minor allele frequency threshold of SNPs for use with population level HWE estimates. plink --maf maf_for_hwe; reference

type

number

maximum

1

exclusiveMinimum

0

default

0.05

  • ibd_pi_hat_min

Ibd Pi Hat Min

The minimum IBD pi hat value to save in the results table. plink --genome full --min ibd_pi_hat_min; reference

type

number

exclusiveMaximum

1

minimum

0

default

0.12

  • ibd_pi_hat_max

Ibd Pi Hat Max

The maximum IBD pi hat value to save in the results table. plink --genome full --max ibd_pi_hat_max; reference

type

number

maximum

1

exclusiveMinimum

0

default

1.0

  • dup_concordance_cutoff

Dup Concordance Cutoff

The concordance threshold to consider two samples as replicates. workflow/scripts/concordance_table.py

type

number

maximum

1

exclusiveMinimum

0

default

0.95

  • pi_hat_threshold

Pi Hat Threshold

The pi hat threshold to consider two samples as related. The default of 0.2 reports 1st and 2nd degree relatives. workflow/scripts/concordance_table.py

type

number

maximum

1

exclusiveMinimum

0

default

0.2

  • autosomal_het_threshold

Autosomal Het Threshold

The autosomal heterozygosity F coefficient threshold at which to flag subjects for removal. workflow/scripts/plot_autosomal_heterozygosity.py

type

number

maximum

1

exclusiveMinimum

0

default

0.1
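
The call rate options are expressed as the fraction of calls that must be present, while plink's --mind and --geno take the maximum allowed missing rate, hence the 1 - call_rate conversion noted for those options. A small worked sketch follows; the function name is hypothetical, and whether the workflow applies both filters in a single plink call is not stated on this page.

```python
def missingness_thresholds(sample_call_rate, snp_call_rate):
    """Convert call rate thresholds into plink missingness thresholds."""
    return {
        # rounding avoids floating point noise such as 0.19999999999999996
        "--mind": round(1 - sample_call_rate, 6),  # per-sample missing rate
        "--geno": round(1 - snp_call_rate, 6),     # per-SNP missing rate
    }


print(missingness_thresholds(0.8, 0.8))    # filter 1: {'--mind': 0.2, '--geno': 0.2}
print(missingness_thresholds(0.95, 0.95))  # filter 2: {'--mind': 0.05, '--geno': 0.05}
```

With the defaults, filter 1 drops samples and SNPs missing more than 20% of calls, and filter 2 tightens this to 5%.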

Workflow Parameters

These parameters control which parts of the workflow are run and how they are run.

workflow_params:
    subject_id_column: Group_By
    expected_sex_column: Expected_Sex
    sex_chr_included: true
    case_control_column: Case/Control_Status
    remove_contam: true
    remove_rep_discordant: true
    minimum_pop_subjects: 50
    control_hwp_threshold: 50
    lims_upload: true
    lims_output_dir: /DCEG/CGF/Laboratory/LIMS/drop-box-prod/gwas_primaryqc/
    case_control_gwas: false
    max_time_hr:
    max_mem_mb:
    time_start:
    convert_gtc2bcf: false
    additional_params_for_gtc2bcf: --use-gtc-sample-names
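
The two thresholds minimum_pop_subjects and control_hwp_threshold decide which populations receive population level QC. The sketch below shows one way to read them. It assumes the comparison is "at least this many", which matches the word "minimum" in their descriptions but is not spelled out; the function name and the population labels are made up for illustration.

```python
from collections import Counter


def populations_for_qc(subjects, minimum_pop_subjects=50, control_hwp_threshold=50):
    """subjects: iterable of (population, case_control) pairs, one per subject."""
    subjects = list(subjects)
    totals = Counter(pop for pop, _ in subjects)
    controls = Counter(pop for pop, status in subjects if status == "Control")
    return {
        # PCA and autosomal heterozygosity need enough subjects overall
        "pca_and_het": sorted(p for p, n in totals.items() if n >= minimum_pop_subjects),
        # HWE needs enough control subjects
        "hwe": sorted(p for p, n in controls.items() if n >= control_hwp_threshold),
    }


# Toy data with lowered thresholds so the effect is visible
toy = [("EUR", "Control")] * 3 + [("EUR", "Case")] * 2 + [("AFR", "Case")] * 4 + [("EAS", "Control")]
print(populations_for_qc(toy, minimum_pop_subjects=4, control_hwp_threshold=3))
```

In the toy data, AFR has enough subjects for PCA and heterozygosity but no controls, so it would be skipped for HWE.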

type

object

properties

  • subject_id_column

Subject Id Column

The column in your sample sheet that contains the subject ID. We expect that there may be multiple samples (rows) that have the same subject ID. If there are multiple columns that you want to use for subject ID, you can create a special column named Group_By. This would be a column that contains the column name to use as subject ID for that given sample (row).

type

string

default

Group_By

  • expected_sex_column

Expected Sex Column

The name of the column in the sample sheet which identifies the expected sex of samples. Allowed values in this column are [F | M | U].

type

string

default

Expected_Sex

  • sex_chr_included

Sex Chr Included

True if the sex chromosome is included in the microarray and a sex concordance check can be performed.

type

boolean

default

True

  • case_control_column

Case Control Column

The name of the column in the sample sheet which identifies Case/Control status. Allowed values in this column are [Control | Case | QC | Unknown].

type

string

default

Case/Control_Status

  • remove_contam

Remove Contam

True if you want to remove contaminated samples before running the subject level QC.

type

boolean

default

True

  • remove_rep_discordant

Remove Rep Discordant

True if you want to remove discordant replicates before running the subject level QC.

type

boolean

default

True

  • minimum_pop_subjects

Minimum Pop Subjects

The minimum number of samples needed to use a population for population level QC (PCA, Autosomal Heterozygosity).

type

integer

exclusiveMinimum

0

default

50

  • control_hwp_threshold

Control Hwp Threshold

The minimum number of control samples needed to use a population for population level QC (HWE).

type

integer

exclusiveMinimum

0

default

50

  • lims_upload

Lims Upload

For CGEMS/CCAD use only. Places a copy of the LimsUpload file in the root directory.

type

boolean

default

True

  • lims_output_dir

Lims Output Dir

type

string

default

/DCEG/CGF/Laboratory/LIMS/drop-box-prod/gwas_primaryqc/

  • case_control_gwas

Case Control Gwas

If true, a plink logistic regression GWAS will be performed using the case_control phenotype.

type

boolean

default

False

  • max_time_hr

Max Time Hr

The maximum amount of time that can be requested, in hours.

type

integer

  • max_mem_mb

Max Mem Mb

The maximum amount of memory that can be requested, in megabytes.

type

integer

  • time_start

Time Start

Date and time at which the workflow starts. This creates a unique id for the run.

type

string

default

20241121155030

  • convert_gtc2bcf

Convert Gtc2Bcf

If input is GTC, this switches between gtc2vcf (True) and gtc2ped (False - default) for conversion to BED.

type

boolean

default

False

  • additional_params_for_gtc2bcf

Additional Params For Gtc2Bcf

Additional optional parameters, not hardcoded in the workflow, to pass to gtc2bcf for a specific analysis.

type

string

default

--use-gtc-sample-names
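
The Group_By indirection described for subject_id_column can be easier to follow with a concrete case. In this sketch the column names LIMS_Individual_ID and PI_Subject_ID are hypothetical, as is the helper function; only the Group_By behavior itself comes from the description of subject_id_column.

```python
def subject_id(row, subject_id_column="Group_By"):
    """Resolve the subject ID for one sample sheet row (a dict)."""
    if subject_id_column == "Group_By":
        # Group_By holds the *name* of the column to read for this row
        return row[row["Group_By"]]
    return row[subject_id_column]


rows = [
    {"Sample_ID": "Samp0001", "Group_By": "LIMS_Individual_ID",
     "LIMS_Individual_ID": "I-001", "PI_Subject_ID": "P-9"},
    {"Sample_ID": "Samp0002", "Group_By": "PI_Subject_ID",
     "LIMS_Individual_ID": "I-002", "PI_Subject_ID": "P-9"},
]
print([subject_id(r) for r in rows])  # ['I-001', 'P-9']
```

Setting subject_id_column to an ordinary column name skips the indirection and reads that column for every row.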

Sample IDs to Remove

This is an optional section where you can list the Sample_IDs that you do not want to include in the QC run. These samples will be indicated as is_user_exclusion = True in the sample level QC table.

Sample_IDs_to_remove:
   - Sample0001
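
As a rough picture of the effect, the listed IDs become a flag on each sample rather than removing rows from the table. The helper below is a hypothetical sketch, not the workflow's code. It also treats an empty Sample_IDs_to_remove key, which YAML loads as a null value, as an empty list.

```python
def flag_user_exclusions(rows, sample_ids_to_remove):
    """Add an is_user_exclusion flag to each sample row (a dict)."""
    remove = set(sample_ids_to_remove or [])  # None (empty YAML key) means no exclusions
    return [{**row, "is_user_exclusion": row["Sample_ID"] in remove} for row in rows]


samples = [{"Sample_ID": "Sample0001"}, {"Sample_ID": "Sample0002"}]
for row in flag_user_exclusions(samples, ["Sample0001"]):
    print(row)
```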

Full Example

pipeline_version: v1.0.0
project_name: SR0001-001_1_0000000
sample_sheet: /path/to/manifest/file/SR0001-001_1_AnalysisManifest_0000000.csv
genome_build: hg37
snp_array: GSAMD-24v1-0
num_samples: 6336
num_snps: 700078
reference_files:
   illumina_manifest_file: /path/to/bpm/file/GSAMD-24v1-0_20011747_A1.bpm
   thousand_genome_vcf: /path/to/thousand/genome/ALL.wgs.phase3_shapeit2_mvncall_integrated_v5.20130502.sites.vcf.gz
   thousand_genome_tbi: /path/to/thousand/genome/ALL.wgs.phase3_shapeit2_mvncall_integrated_v5.20130502.sites.vcf.gz.tbi
   reference_fasta: /path/to/reference/GwasQcPipeline-test-data/GCA_000001405.15_GRCh38_full_analysis_set.fna.bgz

user_files:
   output_pattern: '{prefix}/{file_type}.{ext}'
   idat_pattern:
      red: /example/pattern/wildcards/are/columns/in/sample_sheet_file/{Project}/{Sample_ID}_Red.idat
      green: /example/pattern/wildcards/are/columns/in/sample_sheet_file/{Project}/{Sample_ID}_Grn.idat
   gtc_pattern: /example/pattern/wildcards/are/columns/in/sample_sheet_file/{Project}/{Sample_ID}.gtc
software_params:
   sample_call_rate_1: 0.8
   snp_call_rate_1: 0.8
   sample_call_rate_2: 0.95
   snp_call_rate_2: 0.95
   ld_prune_r2: 0.1
   maf_for_ibd: 0.2
   maf_for_hwe: 0.05
   ibd_pi_hat_min: 0.12
   ibd_pi_hat_max: 1.0
   dup_concordance_cutoff: 0.95
   intensity_threshold: 6000
   contam_threshold: 0.1
   contam_population: AF
   pi_hat_threshold: 0.2
   autosomal_het_threshold: 0.1
   strand: top
workflow_params:
   subject_id_column: Group_By
   expected_sex_column: Expected_Sex
   case_control_column: Case/Control_Status
   remove_contam: true
   remove_rep_discordant: true
   minimum_pop_subjects: 50
   control_hwp_threshold: 50
   lims_upload: true
   lims_output_dir: /example/location/to/place/lims/upload/file
   time_start: '20240227130627'
   convert_gtc2bcf: false
Sample_IDs_to_remove:
   - Sample0001