File Type Reference

Here I briefly describe the many file types used throughout the GWAS QC Pipeline.

adpc.bin

An Illumina file format that can be output by GenomeStudio. I cannot find a spec file or any details about this format beyond a small reader function used by Picard.

Fields

A allele intensity:

B allele intensity:

A normalized allele intensity:

B normalized allele intensity:

cluster confidence score:

called genotype:

0 (AA), 1 (AB), 2 (BB), and 3 (NN)
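
Because I have no spec for this format, the following is only a sketch of how the fields above might be decoded. It assumes a 16-byte header followed by 18-byte little-endian records (two unsigned 16-bit raw intensities, three 32-bit floats, and an unsigned 16-bit genotype code). That layout is an assumption and must be checked against the Picard reader before use; the names `read_adpc_records` and `AdpcRecord` are hypothetical.

```python
import struct
from typing import BinaryIO, Iterator, NamedTuple

# ASSUMPTION: this layout is not confirmed by any spec; verify before relying on it.
HEADER_SIZE = 16                   # assumed header length in bytes
RECORD = struct.Struct("<HHfffH")  # assumed record layout (18 bytes, little-endian)
GENOTYPES = {0: "AA", 1: "AB", 2: "BB", 3: "NN"}  # codes from the field list above


class AdpcRecord(NamedTuple):
    a_intensity: int
    b_intensity: int
    a_normalized: float
    b_normalized: float
    cluster_confidence: float
    genotype: str


def read_adpc_records(handle: BinaryIO) -> Iterator[AdpcRecord]:
    """Yield one record per probe from a file opened in binary mode."""
    handle.seek(HEADER_SIZE)
    while True:
        chunk = handle.read(RECORD.size)
        if len(chunk) < RECORD.size:
            break  # end of file (or a truncated trailing record)
        a, b, a_norm, b_norm, conf, code = RECORD.unpack(chunk)
        # Unknown codes are reported as "NN" (no call).
        yield AdpcRecord(a, b, a_norm, b_norm, conf, GENOTYPES.get(code, "NN"))
```

With a real file this would be called as `read_adpc_records(open(path, "rb"))`, where `path` is whatever adpc.bin file you want to inspect.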

abf.txt

This is a simple text file with the B allele frequency for each SNP.

Fields

SNP_ID:

BAF:

(a.k.a ABF)
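
A minimal sketch of loading this file into a dictionary. It assumes the file is tab-delimited, has one header row, and may contain non-numeric placeholders such as NA; none of these details are stated above, and the function name `read_baf` is hypothetical.

```python
import csv
import io
from typing import Dict, Optional, TextIO


def read_baf(handle: TextIO) -> Dict[str, Optional[float]]:
    """Map each SNP_ID to its B allele frequency (None when not numeric)."""
    reader = csv.reader(handle, delimiter="\t")  # ASSUMPTION: tab-delimited
    next(reader, None)  # ASSUMPTION: the first row is a header
    result: Dict[str, Optional[float]] = {}
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed rows
        snp_id, value = row[0], row[1]
        try:
            result[snp_id] = float(value)
        except ValueError:
            result[snp_id] = None  # placeholder such as "NA"
    return result


# Made-up example content, not taken from a real run.
example = io.StringIO("SNP_ID\tABF\nrs123\t0.42\nrs456\tNA\n")
frequencies = read_baf(example)
```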

bpm (Illumina Manifest File)

The Illumina manifest file is a binary file that describes a specific array. It has information about each probe set and its corresponding SNP. The version used here, GSAMD-24v1-0_20011747_A1.bpm, is the one linked to in the config. There is also a CSV version that can be downloaded from Illumina.

Fields

IlmnID:

Name:

IlmnStrand:

BOT|TOP|PLUS|MINUS

SNP:

[A/B] where A and B can be [ACTGID]

AddressA_ID:

The location on the array for A

AlleleA_ProbeSeq:

The probe sequence for A

AddressB_ID:

The location on the array for B

AlleleB_ProbeSeq:

The probe sequence for B

GenomeBuild:

Reference genome build number (e.g. 37)

Chr:

Chromosome

MapInfo:

Ploidy:

Ploidy of the species (e.g. diploid)

Species:

The species (e.g. Homo sapiens)

Source:

The source for the SNP (e.g. 1000genomes, PAGE, ClinVar_ACMG).

SourceVersion:

Version of the source.

SourceStrand:

BOT|TOP|PLUS|MINUS

SourceSeq:

TopGenomicSeq:

BeadSetID:

Exp_Clusters:

Intensity_Only:

RefStrand:

Reference strand (\+|\-)

https://support.illumina.com/bulletins/2017/06/how-to-interpret-dna-strand-and-allele-information-for-infinium-.html
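
The SNP field above packs both alleles into one bracketed string. A small sketch of splitting it, using a pattern built directly from the "[A/B] where A and B can be [ACTGID]" description; the helper name `parse_snp_alleles` is hypothetical, and the input would come from the CSV version of the manifest.

```python
import re
from typing import Tuple

# Pattern derived from the description "[A/B] where A and B can be [ACTGID]".
SNP_PATTERN = re.compile(r"^\[([ACTGID])/([ACTGID])\]$")


def parse_snp_alleles(snp: str) -> Tuple[str, str]:
    """Return the (A, B) alleles encoded in a manifest SNP field."""
    match = SNP_PATTERN.match(snp.strip())
    if match is None:
        raise ValueError(f"unexpected SNP field: {snp!r}")
    return match.group(1), match.group(2)
```

For example, `parse_snp_alleles("[A/G]")` gives `("A", "G")`, and an insertion/deletion probe written as `[I/D]` gives `("I", "D")`.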

contam.out (verifyIDintensity)

Warning

This output has no official documentation.

Sample-level contamination score.

Fields

ID:

sample_ID

%MIX:

Percent mixture with another sample (i.e. Alpha from Jun et al. 2012)

LLN:

The minimized log-likelihood. I am not really sure what this represents.

LLN0:

The log-likelihood at alpha = 0 (i.e. no contamination).
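
Since this output is undocumented, here is only a tolerant parsing sketch. It assumes whitespace-delimited rows with the four fields in the order listed above, and it skips any row whose last three columns are not numeric (which should cover header or separator rows, if present). The names `ContamRecord` and `parse_contam` are hypothetical.

```python
from typing import Iterable, List, NamedTuple


class ContamRecord(NamedTuple):
    sample_id: str
    pct_mix: float
    lln: float
    lln0: float


def parse_contam(lines: Iterable[str]) -> List[ContamRecord]:
    """Parse rows of ID, %MIX, LLN, LLN0; non-numeric rows are ignored."""
    records = []
    for line in lines:
        parts = line.split()  # ASSUMPTION: whitespace-delimited columns
        if len(parts) != 4:
            continue
        try:
            mix, lln, lln0 = (float(p) for p in parts[1:])
        except ValueError:
            continue  # header or separator row
        records.append(ContamRecord(parts[0], mix, lln, lln0))
    return records


# Made-up rows for illustration only.
records = parse_contam(["ID %MIX LLN LLN0", "----", "S1 0.012 -100.5 -98.2"])
```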

csv (Illumina Sample Sheet)

Warning

This file has no official documentation.

I am assuming this is the sample sheet used to run GenomeStudio.

Fields

Sample_ID:

SC\d+_PB\d+_[A-H][0-1][1-9]

SentrixBarcode_A:

\d+

SentrixPosition_A:

R\d{2}C\d{2}

Sample_Plate:

WG\d+-DNA

Sample_Well:

[A-H][0-1][1-9]

Sample_Group:

.*

Identifier_Sex:

(M|F|U)

Sample_Name:

\w{2} \d{4} \d{4}

Replicate:

.*

Parent1:

SC\d+_PB\d+_[A-H][0-1][1-9]

Parent2:

SC\d+_PB\d+_[A-H][0-1][1-9]

SR_Subject_ID:

SI\d+

LIMSSample_ID:

SC\d+

Sample_Type:

[\w\s]+

LIMS_Specimen_ID:

Project:

GP\d+-IN\d{2}

Project-Sample ID:

PS-(?P<year>\d{4})(?P<month>\d{2})(?P<day>\d{2})-\d{6}

Array:

GSAMD-24v1-0

LIMS_Individual_ID:

I-\d+

PI_Subject_ID:

\w-\d+-\d

PI_Study_ID:

Age:

Expected_Sex:

(M|F)

Ancestry_S1:

Ancestry_S2:

Ancestry_S3:

POPGROUP:

Case/Control_Status:

(Control|Case)
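
The patterns above can be used to sanity-check a sample sheet row. The sketch below copies a subset of them verbatim and also pulls the date out of the Project-Sample ID using its named groups. The function names and the dictionary-per-row representation are my assumptions (for instance, rows as produced by `csv.DictReader`).

```python
import re
from datetime import date
from typing import Dict, List

# Patterns copied verbatim from the field list above (a subset).
FIELD_PATTERNS = {
    "Sample_ID": r"SC\d+_PB\d+_[A-H][0-1][1-9]",
    "SentrixBarcode_A": r"\d+",
    "SentrixPosition_A": r"R\d{2}C\d{2}",
    "Identifier_Sex": r"(M|F|U)",
    "Expected_Sex": r"(M|F)",
    "Case/Control_Status": r"(Control|Case)",
}
PROJECT_SAMPLE_ID = re.compile(
    r"PS-(?P<year>\d{4})(?P<month>\d{2})(?P<day>\d{2})-\d{6}"
)


def invalid_fields(row: Dict[str, str]) -> List[str]:
    """Return names of fields present in the row that fail their pattern."""
    return [
        name
        for name, pattern in FIELD_PATTERNS.items()
        if name in row and re.fullmatch(pattern, row[name]) is None
    ]


def project_sample_date(value: str) -> date:
    """Extract the date embedded in a Project-Sample ID."""
    match = PROJECT_SAMPLE_ID.fullmatch(value)
    if match is None:
        raise ValueError(f"unexpected Project-Sample ID: {value!r}")
    return date(int(match["year"]), int(match["month"]), int(match["day"]))
```

One thing this makes visible: the well pattern `[A-H][0-1][1-9]` as written rejects wells whose number ends in 0 (for example A10). I do not know whether that is intended.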

eigenstratgeno (eigensoft)

The genotype data for each individual at each SNP. Each column represents an individual sample (same order as the ind file). Each row represents a SNP (same order as the snp file). Values are 0-2, or 9 for missing data.

Encoding

0:

Zero copies of the reference allele

1:

One copy of the reference allele

2:

Two copies of the reference allele

9:

Missing data
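
A short sketch of decoding one row with the encoding above. It assumes the plain-text form of the file, where each row is a run of single digits with no delimiters (one digit per sample); the function name `decode_geno_line` is hypothetical.

```python
from typing import List, Optional


def decode_geno_line(line: str) -> List[Optional[int]]:
    """Return reference-allele counts for one SNP row; None marks missing data."""
    counts: List[Optional[int]] = []
    for char in line.strip():
        if char not in "0129":
            raise ValueError(f"unexpected genotype code: {char!r}")
        counts.append(None if char == "9" else int(char))
    return counts
```

The position of each value in the returned list corresponds to the row order of the ind file.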

gtc (Illumina Genotype Calls)

The Illumina Infinium genotype call (GTC) file format. This format is output by Illumina’s genotype calling software (either Autocall or Autoconvert). It is a complicated data format containing a large amount of metadata describing the run, data describing each probe on the array, and sample-level genotype calls for each probe.

Each sample has a GTC file associated with it.

Format Spec

idat (Illumina)

A binary format of intensities. The file includes various types of metadata (e.g., array information, software versions, the type of BeadChip). Each data record is made up of 4 values: the ID of a probe on the array, its mean intensity, its intensity standard deviation, and the number of beads carrying that probe.

ind (eigensoft)

The individual sample description file. Each row represents a sample.

Fields

Sample ID:

The sample identifier

gender:

The gender of the sample {M, F, U}

phenotype:

The sample phenotype or population information.

snp (eigensoft)

A file describing each SNP (one SNP per row).

Fields

SNP ID:

Chromosome:

The chromosome number (X: 23, Y: 24, mtDNA: 90, XY: 91)

Genetic Position:

Given in morgans or 0.0 if unknown.

Physical Position:

Given in bases.

Optional Fields

Reference Allele:

Variant Allele:

For monomorphic SNPs can be encoded as X (unknown).
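
A sketch of parsing one row of this file, assuming whitespace-delimited columns in the order listed above, with either none or both of the optional allele columns present. The chromosome labels returned for the numeric codes, and the names `SnpRecord` and `parse_snp_line`, are my own choices.

```python
from typing import NamedTuple, Optional

# Numeric chromosome codes from the field description above.
CHROMOSOME_NAMES = {"23": "X", "24": "Y", "90": "mtDNA", "91": "XY"}


class SnpRecord(NamedTuple):
    snp_id: str
    chromosome: str
    genetic_position: float  # in morgans, 0.0 if unknown
    physical_position: int   # in bases
    reference_allele: Optional[str] = None
    variant_allele: Optional[str] = None


def parse_snp_line(line: str) -> SnpRecord:
    """Parse one row of a .snp file into a SnpRecord."""
    parts = line.split()
    if len(parts) not in (4, 6):
        raise ValueError(f"expected 4 or 6 columns, got {len(parts)}")
    snp_id, chrom, genetic, physical = parts[:4]
    ref, var = (parts[4], parts[5]) if len(parts) == 6 else (None, None)
    return SnpRecord(
        snp_id,
        CHROMOSOME_NAMES.get(chrom, chrom),  # autosomes keep their number
        float(genetic),
        int(physical),
        ref,
        var,
    )
```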

snpwt.* (SNPweights)

Header Rows

first:

shrinkage for the predicted PCs (1 and 2)

second:

ancestral populations

third:

number of ancestral samples for each population

fourth:

average PCs for each ancestral population

fifth:

parameter for linear transformation of PCs to % ancestry

Data Fields (remaining rows)

SNP rs number:

reference allele:

variant allele:

reference allele frequency:

SNP weight for PC1:

SNP weight for PC2:

snpweights (SNPweights)

Per-sample SNP weights based on an external reference panel.

Fields

Sample ID:

Population Label:

Number of SNPs:

The number of SNPs used for inference of SNP weight.

Predicted PC1:

Predicted PC2:

Percent YRI Ancestry:

Percent CEU Ancestry:

Percent ASI (CHB + CHD) Ancestry:

vcf/bcf (Variant call format)

Variant call format (VCF) is a standard tab-delimited text format for representing variants. The binary variant call format (BCF) is the binary, compressed version of VCF; it is typically more efficient to store and to process with BCFtools. A VCF file begins with a header in which each line starts with ## and describes the VCF version, the reference genome contigs, and each tag used in the INFO, FILTER, and FORMAT fields. The header is followed by a data section in which each row describes one variant. The data section contains the following standard fields:

Fields

CHROM:

Chromosome contig name from the reference genome assembly

POS:

The chromosomal position of the variant.

ID:

The identifier for the variant. Typically dbSNP rsid.

REF:

The reference allele.

ALT:

The alternate allele.

QUAL:

Phred-scaled quality of the variant.

FILTER:

Any soft filters to tag the variant.

INFO:

Additional annotations for the variant, given as semicolon-separated tags.

FORMAT:

The format of the sample field.

SAMPLE1 … SAMPLEn:

The genotypes and any genotype quality scores for each sample. The genotypes for each sample are represented by a separate column.

https://samtools.github.io/hts-specs/VCFv4.2.pdf
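
To make the header and data layout concrete, here is a minimal sketch of reading uncompressed VCF text in pure Python. It follows the column order in the spec linked above (the INFO column sits between FILTER and FORMAT) and only extracts the GT value for each sample. It is not a substitute for BCFtools or a real VCF library, and the names `VcfRecord` and `parse_vcf` are hypothetical.

```python
from typing import Dict, Iterable, Iterator, List, NamedTuple


class VcfRecord(NamedTuple):
    chrom: str
    pos: int
    variant_id: str
    ref: str
    alt: List[str]
    qual: str
    filters: str
    info: str
    genotypes: Dict[str, str]  # sample name -> GT string, e.g. "0/1"


def parse_vcf(lines: Iterable[str]) -> Iterator[VcfRecord]:
    """Yield one VcfRecord per data row of an uncompressed VCF."""
    samples: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the FORMAT column
            continue
        format_keys = fields[8].split(":") if len(fields) > 8 else []
        genotypes: Dict[str, str] = {}
        if format_keys[:1] == ["GT"]:  # the spec requires GT first when present
            for name, column in zip(samples, fields[9:]):
                genotypes[name] = column.split(":")[0]
        yield VcfRecord(
            fields[0], int(fields[1]), fields[2], fields[3],
            fields[4].split(","), fields[5], fields[6], fields[7], genotypes,
        )
```

QUAL is kept as a string here because a missing value is written as a single dot.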