File Type Reference

Here I briefly describe the many file types used throughout the GWAS QC Pipeline.

adpc.bin

An Illumina file format that can be output by GenomeStudio. I cannot find a spec file or any details about this format beyond a small reader function used by Picard.

Fields

A allele intensity:

B allele intensity:

A normalized allele intensity:

B normalized allele intensity:

cluster confidence score:

called genotype:

0 (AA), 1 (AB), 2 (BB), and 3 (NN)

abf.txt

This is a simple text file with the B allele frequency for each SNP.

Fields

SNP_ID:

BAF:

(a.k.a. ABF)
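As a rough illustration, the file could be read as follows. This is only a sketch: it assumes the file is tab-delimited, has one header row, stores the SNP ID in the first column and the frequency in the second, and writes missing values as `NA` or an empty string. None of these details are confirmed here, and `read_abf` and the example IDs are hypothetical.

```python
import csv
import io


def read_abf(handle):
    """Return {SNP_ID: B allele frequency or None}.

    Assumed layout (not confirmed): tab-delimited, one header row,
    SNP ID in column 1, frequency in column 2, "NA" or "" for missing.
    """
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the assumed header row
    freqs = {}
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        snp_id, value = row[0], row[1]
        freqs[snp_id] = None if value in ("", "NA") else float(value)
    return freqs


# Made-up example content, used only to exercise the function.
example = io.StringIO("SNP_ID\tABF\nrs1\t0.25\nrs2\tNA\n")
abf = read_abf(example)
```

Here `abf` maps each SNP ID to a float, or to `None` where the frequency is missing.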

bpm (Illumina Manifest File)

The Illumina manifest file is a binary file that describes a specific array. It contains information about each probe set and its corresponding SNPs. The config links to this version: GSAMD-24v1-0_20011747_A1.bpm. A CSV version can also be downloaded from Illumina.

Fields

IlmnID:

Name:

IlmnStrand:

BOT|TOP|PLUS|MINUS

SNP:

[A/B] where A and B can be [ACTGID]

AddressA_ID:

The location on the array for A

AlleleA_ProbeSeq:

The probe sequence for A

AddressB_ID:

The location on the array for B

AlleleB_ProbeSeq:

The probe sequence for B

GenomeBuild:

Reference genome build number (e.g. 37)

Chr:

Chromosome

MapInfo:

Ploidy:

Ploidy of the species (e.g. diploid)

Species:

The species (e.g. Homo sapiens)

Source:

The source for the SNP (e.g. 1000genomes, PAGE, ClinVar_ACMG).

SourceVersion:

Version of the source.

SourceStrand:

BOT|TOP|PLUS|MINUS

SourceSeq:

TopGenomicSeq:

BeadSetID:

Exp_Clusters:

Intensity_Only:

RefStrand:

Reference strand (\+|\-)

https://support.illumina.com/bulletins/2017/06/how-to-interpret-dna-strand-and-allele-information-for-infinium-.html
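The SNP field format described above ([A/B], with each allele drawn from ACTGID) can be split into its two alleles with a small regular expression. This is a sketch only; `split_snp_alleles` is a hypothetical helper, not part of the pipeline or of any Illumina library.

```python
import re

# [A/B] where each allele is one of A, C, T, G, I (insertion), D (deletion).
SNP_PATTERN = re.compile(r"^\[([ACTGID])/([ACTGID])\]$")


def split_snp_alleles(snp):
    """Split a manifest SNP value such as "[A/G]" into ("A", "G")."""
    match = SNP_PATTERN.match(snp)
    if match is None:
        raise ValueError(f"unexpected SNP value: {snp!r}")
    return match.group(1), match.group(2)


alleles = split_snp_alleles("[A/G]")
```

Raising on anything that does not match keeps unexpected manifest values from passing silently into later steps.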

contam.out (verifyIDintensity)

Warning

This output has no official documentation.

Sample-level contamination score.

Fields

ID:

sample_ID

%MIX:

Percent mixture with another sample (i.e. Alpha from Jun et al. 2012)

LLN:

The minimized log-likelihood. I am not sure exactly what this represents.

LLN0:

The log-likelihood at alpha = 0 (i.e. no contamination).
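Because the format is undocumented, any parser has to rest on guesses. The sketch below assumes each data line holds the four whitespace-separated fields listed above, in that order, and it skips any line that does not fit (for example a header or separator line). The cutoff `max_mix` is a hypothetical value expressed in whatever units the file uses for %MIX; `read_contam` and `flag_contaminated` are illustrative names.

```python
def read_contam(lines):
    """Parse contam.out-style lines into dicts.

    Assumed layout (not documented): ID, %MIX, LLN, LLN0 separated by
    whitespace. Lines that do not fit this shape are skipped.
    """
    records = []
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue
        try:
            mix, lln, lln0 = (float(p) for p in parts[1:])
        except ValueError:
            continue  # e.g. the header row
        records.append({"ID": parts[0], "%MIX": mix, "LLN": lln, "LLN0": lln0})
    return records


def flag_contaminated(records, max_mix):
    """Return the IDs whose %MIX exceeds a caller-chosen cutoff."""
    return [r["ID"] for r in records if r["%MIX"] > max_mix]


# Made-up values, used only to exercise the functions.
example = [
    "ID %MIX LLN LLN0",
    "----------------",
    "S1 0.01 100.0 101.0",
    "S2 0.25 90.0 140.0",
]
flagged = flag_contaminated(read_contam(example), max_mix=0.2)
```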

csv (Illumina Sample Sheet)

Warning

This file has no official documentation.

I am assuming this is the sample sheet used to run GenomeStudio.

Fields

Sample_ID:

SC\d+_PB\d+_[A-H][0-1][1-9]

SentrixBarcode_A:

\d+

SentrixPosition_A:

R\d{2}C\d{2}

Sample_Plate:

WG\d+-DNA

Sample_Well:

[A-H][0-1][1-9]

Sample_Group:

.*

Identifier_Sex:

(M|F|U)

Sample_Name:

\w{2} \d{4} \d{4}

Replicate:

.*

Parent1:

SC\d+_PB\d+_[A-H][0-1][1-9]

Parent2:

SC\d+_PB\d+_[A-H][0-1][1-9]

SR_Subject_ID:

SI\d+

LIMSSample_ID:

SC\d+

Sample_Type:

[\w\s]+

LIMS_Specimen_ID:

Project:

GP\d+-IN\d{2}

Project-Sample ID:

PS-(?P<year>\d{4})(?P<month>\d{2})(?P<day>\d{2})-\d{6}

Array:

GSAMD-24v1-0

LIMS_Individual_ID:

I-\d+

PI_Subject_ID:

\w-\d+-\d

PI_Study_ID:

Age:

Expected_Sex:

(M|F)

Ancestry_S1:

Ancestry_S2:

Ancestry_S3:

POPGROUP:

Case/Control_Status:

(Control|Case)
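The patterns listed above can be used to check a sample sheet row. The sketch below copies a subset of them verbatim and treats each as a whole-field match; the function names and the example values are hypothetical, and the pipeline itself may validate the sheet differently.

```python
import datetime
import re

# A subset of the patterns from the field list above, copied as written.
PATTERNS = {
    "Sample_ID": r"SC\d+_PB\d+_[A-H][0-1][1-9]",
    "SentrixBarcode_A": r"\d+",
    "SentrixPosition_A": r"R\d{2}C\d{2}",
    "Identifier_Sex": r"(M|F|U)",
    "Expected_Sex": r"(M|F)",
    "Case/Control_Status": r"(Control|Case)",
}
PROJECT_SAMPLE_ID = re.compile(
    r"PS-(?P<year>\d{4})(?P<month>\d{2})(?P<day>\d{2})-\d{6}"
)


def invalid_fields(row):
    """Return the sorted names of fields in `row` that fail their pattern."""
    return sorted(
        name
        for name, pattern in PATTERNS.items()
        if name in row and re.fullmatch(pattern, row[name]) is None
    )


def project_sample_date(value):
    """Extract the date embedded in a Project-Sample ID, or None if no match."""
    match = PROJECT_SAMPLE_ID.fullmatch(value)
    if match is None:
        return None
    return datetime.date(
        int(match["year"]), int(match["month"]), int(match["day"])
    )


# Made-up row: the position and expected sex are deliberately invalid.
row = {
    "Sample_ID": "SC000001_PB0001_A01",
    "SentrixPosition_A": "R1C1",
    "Expected_Sex": "U",
}
problems = invalid_fields(row)
sample_date = project_sample_date("PS-20190314-000123")
```

One thing this kind of check surfaces: as written, `[0-1][1-9]` does not match a well whose number ends in 0 (for example A10), so such wells would be reported as invalid.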

eigenstratgeno (eigensoft)

The genotype data for each individual at each SNP. Each column represents an individual sample (in the same order as the ind file), and each row represents a SNP (in the same order as the snp file). Values are 0, 1, 2, or 9.

Encoding

0:

Zero copies of the reference allele

1:

One copy of the reference allele

2:

Two copies of the reference allele

9:

Missing data
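The encoding above can be decoded one row (one SNP) at a time. This sketch assumes the usual text layout of one character per sample with no delimiter, and the frequency helper assumes diploid genotypes; both function names are hypothetical.

```python
GENO_CODES = {"0": 0, "1": 1, "2": 2, "9": None}


def decode_geno_line(line):
    """Decode one SNP row into reference-allele counts (None = missing)."""
    return [GENO_CODES[ch] for ch in line.strip()]


def reference_allele_frequency(genotypes):
    """Reference allele frequency over called samples, assuming diploidy."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return None  # every sample is missing at this SNP
    return sum(called) / (2 * len(called))


genotypes = decode_geno_line("0129\n")
ref_freq = reference_allele_frequency(genotypes)
```

An unexpected character raises `KeyError`, which is usually preferable to silently treating it as missing.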

gtc (Illumina Genotype Calls)

The Illumina Infinium genotype call (GTC) file format. This format is output by Illumina’s genotype calling software (either Autocall or Autoconvert). It is a complicated data format containing large amounts of metadata describing the run, data describing each probe on the array, and sample-level genotype calls for each probe.

Each sample has a GTC file associated with it.

Format Spec

idat (Illumina)

A binary format of intensities. This file includes various types of metadata (e.g., array information, software versions, the type of BeadChip). Each data record is made up of 4 values: the ID of the probe on the array, the mean intensity, the intensity standard deviation, and the number of beads for that probe.

ind (eigensoft)

The individual sample description file. Each row represents a sample.

Fields

Sample ID:

The sample identifier

gender:

The gender of the sample {M, F, U}

phenotype:

The sample phenotype or population information.
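A row of this file can be parsed into a small record. This sketch assumes the three fields are separated by whitespace and appear in the order listed above; `IndRecord` and `parse_ind_line` are hypothetical names.

```python
from typing import NamedTuple


class IndRecord(NamedTuple):
    sample_id: str
    gender: str
    phenotype: str


def parse_ind_line(line):
    """Parse one ind row: sample ID, gender (M, F, or U), phenotype."""
    sample_id, gender, phenotype = line.split()
    if gender not in {"M", "F", "U"}:
        raise ValueError(f"unexpected gender code: {gender!r}")
    return IndRecord(sample_id, gender, phenotype)


# Made-up row, used only to exercise the function.
record = parse_ind_line("  S1 M Control\n")
```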

snp (eigensoft)

A file describing each SNP (one SNP per row).

Fields

SNP ID:

Chromosome:

The chromosome number (X: 23, Y: 24, mtDNA: 90, XY: 91)

Genetic Position:

Given in morgans or 0.0 if unknown.

Physical Position:

Given in bases.

Optional Fields

Reference Allele:

Variant Allele:

For monomorphic SNPs can be encoded as X (unknown).
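The fields above, including the numeric chromosome codes, can be handled as in the following sketch. It assumes whitespace-separated columns in the listed order, with the two allele columns either both present or both absent; `parse_snp_line` is a hypothetical helper.

```python
# Numeric chromosome codes listed above, mapped back to their labels.
CHROM_LABELS = {"23": "X", "24": "Y", "90": "mtDNA", "91": "XY"}


def parse_snp_line(line):
    """Parse one snp row with 4 required and 2 optional columns."""
    parts = line.split()
    if len(parts) not in (4, 6):
        raise ValueError(f"expected 4 or 6 columns, got {len(parts)}")
    snp_id, chrom, genetic, physical = parts[:4]
    ref, var = parts[4:] if len(parts) == 6 else (None, None)
    return {
        "snp_id": snp_id,
        "chromosome": CHROM_LABELS.get(chrom, chrom),
        "genetic_position": float(genetic),  # morgans, 0.0 if unknown
        "physical_position": int(physical),  # bases
        "reference_allele": ref,
        "variant_allele": var,
    }


# Made-up row, used only to exercise the function.
snp = parse_snp_line("rs1 23 0.0 12345 A G")
```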

snpwt.* (SNPweights)

Header Rows

first:

shrinkage for the predicted PCs (1 and 2)

second:

ancestral populations

third:

number of ancestral samples for each population

fourth:

average PCs for each ancestral population

fifth:

parameter for linear transformation of PCs to % ancestry

Data Fields (remaining rows)

SNP rs number:

reference allele:

variant allele:

reference allele frequency:

SNP weight for PC1:

SNP weight for PC2:

snpweights (SNPweights)

Per sample SNP weights based on an external reference panel.

Fields

Sample ID:

Population Label:

Number of SNPs:

The number of SNPs used for inference of SNP weight.

Predicted PC1:

Predicted PC2:

Percent YRI Ancestry:

Percent CEU Ancestry:

Percent ASI (CHB + CHD) Ancestry: