Parsers
Eigensoft Parsers
-
class
cgr_gwas_qc.parsers.eigensoft.
Eigenvec
(filename: pathlib.Path)[source]¶ Eigensoft eigenvec file parser.
-
filename
¶ Path to the eigenvec file from smartpca.
- Type:
Path
-
components
¶ A (n x 11) table of principal components
| name | dtype | description |
|---|---|---|
| ID (index) | object | Sample or Subject ID |
| PC1 | object | The first principal component (PC1). |
| PC2 | object | The second principal component. |
| … | … | The third to ninth principal components. |
| PC10 | object | The tenth principal component. |
- Type:
pd.DataFrame
-
values
¶ A vector of eigenvalues for PC1 to PC10.
- Type:
np.ndarray
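The layout described above can be illustrated with a small stand-alone sketch. It assumes the usual smartpca eigenvec layout: a first line starting with `#eigvals:`, then one row per sample holding the ID, the principal components, and a trailing population label. The function name `parse_eigenvec_text` is hypothetical and is not the `Eigenvec` implementation; it uses plain Python containers rather than a DataFrame.

```python
from typing import Dict, List, Tuple


def parse_eigenvec_text(text: str) -> Tuple[List[float], Dict[str, List[float]]]:
    """Split eigenvec text into eigenvalues and per-sample components."""
    lines = [line for line in text.splitlines() if line.strip()]
    header, rows = lines[0], lines[1:]
    # Assumed header layout: "#eigvals: v1 v2 ..."
    values = [float(v) for v in header.split()[1:]]
    components = {}
    for row in rows:
        fields = row.split()
        # Assumed row layout: ID, PC1..PCn, population label.
        components[fields[0]] = [float(pc) for pc in fields[1:-1]]
    return values, components


example = """#eigvals: 4.5 2.1
S1 0.01 -0.02 Control
S2 -0.03 0.04 Case
"""
values, components = parse_eigenvec_text(example)
```

Here `values` plays the role of the `values` attribute and `components` the role of the `components` table, keyed by sample ID.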
GRAF Parsers
Genetic Relationship and Fingerprinting
GRAF is a package that allows estimation of relatedness and ancestry.
KING Parsers
KING Relationship Inferences
KING is a toolset to robustly identify different types of relatedness. Unlike with other methods, only markers that fail QC should be removed before running KING (i.e., do not apply MAF or LD filters).
Kinship Table

| name | dtype | description |
|---|---|---|
| ID1 | string | Individual ID for the first individual of the pair |
| ID2 | string | Individual ID for the second individual of the pair |
| N_SNP | UInt32 | The number of SNPs that do not have missing genotypes in either individual |
| HetHet | float | Proportion of SNPs with double heterozygotes (e.g., AG and AG) |
| IBS0 | float | Proportion of SNPs with zero IBS (identical-by-state) (e.g., AA and GG) |
| HetConc | float | Heterozygous concordance |
| HomIBS0 | float | Homozygous IBS0 |
| Kinship | float | Estimated kinship coefficient from the SNP data |
| IBD1Seg | float | Total length of IBD1 segments divided by total length of all segments |
| IBD2Seg | float | Total length of IBD2 segments divided by total length of all segments |
| PropIBD | float | Proportion of genome with shared IBD (e.g., IBD2Seg + IBD1Seg/2) |
| relationship | string | The assigned relationship based on Kinship |
KING Relationships
Categories were assigned based on Kinship ranges provided in the KING manual.

| name | description |
|---|---|
| ID | duplicate or MZ twin |
| D1 | 1st degree relative |
| D2 | 2nd degree relative |
| D3 | 3rd degree relative |
| UN | unrelated |
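As a rough illustration of how a kinship coefficient could be mapped onto these categories, the sketch below uses the inference ranges quoted in the KING manual (>0.354, 0.177 to 0.354, 0.0884 to 0.177, 0.0442 to 0.0884). The function name `assign_relationship` and the handling of values exactly on a boundary are assumptions; the parser itself may differ.

```python
def assign_relationship(kinship: float) -> str:
    """Map a kinship coefficient to one of the category codes in the table."""
    # Cutoffs follow the ranges in the KING manual; boundary handling
    # (>= on the lower edge of each range) is an assumption of this sketch.
    if kinship > 0.354:
        return "ID"  # duplicate or MZ twin
    if kinship >= 0.177:
        return "D1"  # 1st degree relative
    if kinship >= 0.0884:
        return "D2"  # 2nd degree relative
    if kinship >= 0.0442:
        return "D3"  # 3rd degree relative
    return "UN"  # unrelated
```

For instance, a pair with a kinship of 0.25 falls in the first-degree range, while a pair with a kinship near 0 is treated as unrelated.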
References
Manichaikul A, Mychaleckyj JC, Rich SS, Daly K, Sale M, Chen WM (2010) Robust relationship inference in genome-wide association studies. Bioinformatics 26(22):2867-2873
Reads the table generated by king --related and re-assigns ID1/ID2 by sorting IDs alphanumerically.
- Returns:
pd.DataFrame
ID1
ID2
N_SNP
HetHet
IBS0
HetConc
HomIBS0
Kinship
IBD1Seg
IBD2Seg
PropIBD
relationship {ID, PO, FS, D2, D3}
Plink Parsers
-
cgr_gwas_qc.parsers.plink.
read_genome
(filename: Union[str, os.PathLike[str], pathlib.Path]) → pandas.core.frame.DataFrame[source]¶ Parse PLINK’s genome file format.
Each row of the genome file is a pairwise combination of samples/subjects. I am unsure of how PLINK assigns IID order. Here I sort IDs alphanumerically, which makes searching for pairs easier because you can assume the order.
>>> ID1, ID2 = sorted([IID1, IID2])
- Returns:
A (n x 17) table with the following columns
| name | dtype | description |
|---|---|---|
| ID1 | object | First Sample or Subject ID (alphanumerically) |
| ID2 | object | Second Sample or Subject ID (alphanumerically) |
| RT | object | Relationship type inferred from .fam/.ped file {FS: Full Sib, HS: Half Sib, PO: Parent-Offspring, OT: Other} |
| EZ | object | IBD sharing expected value, based on just .fam/.ped relationship |
| Z0 | float | P(IBD=0) |
| Z1 | float | P(IBD=1) |
| Z2 | float | P(IBD=2) |
| PI_HAT | float | Proportion IBD, i.e. P(IBD=2) + 0.5*P(IBD=1) |
| PHE | int | Pairwise phenotypic code (1, 0, -1 = case-case, case-ctrl, and ctrl-ctrl pairs, respectively) |
| DST | float | IBS distance, i.e. (IBS2 + 0.5*IBS1) / (IBS0 + IBS1 + IBS2) |
| PPC | float | IBS binomial test |
| RATIO | float | HETHET : IBS0 SNP ratio (expected value 2) |
| IBS0 | int | Number of IBS 0 nonmissing variants |
| IBS1 | int | Number of IBS 1 nonmissing variants |
| IBS2 | int | Number of IBS 2 nonmissing variants |
| HOMHOM | float | Number of IBS 0 SNP pairs used in PPC test |
| HETHET | float | Number of IBS 2 het/het SNP pairs used in PPC test |
- Return type:
pd.DataFrame
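The ID ordering and the two formulas given for PI_HAT and DST can be written out as a short sketch. The helper names (`sort_pair`, `pi_hat`, `ibs_distance`) are hypothetical and only restate the definitions in the table; they are not part of `read_genome`.

```python
from typing import Tuple


def sort_pair(iid1: str, iid2: str) -> Tuple[str, str]:
    """Return the two IDs in sorted order so (A, B) and (B, A) match."""
    first, second = sorted([iid1, iid2])
    return first, second


def pi_hat(z1: float, z2: float) -> float:
    """Proportion IBD: P(IBD=2) + 0.5 * P(IBD=1)."""
    return z2 + 0.5 * z1


def ibs_distance(ibs0: int, ibs1: int, ibs2: int) -> float:
    """IBS distance: (IBS2 + 0.5 * IBS1) / (IBS0 + IBS1 + IBS2)."""
    return (ibs2 + 0.5 * ibs1) / (ibs0 + ibs1 + ibs2)
```

Note that Python's `sorted` compares strings character by character, so an ID such as `S10` sorts before `S2`. That is fine for lookups as long as the same rule is applied on both sides of a comparison.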
-
cgr_gwas_qc.parsers.plink.
read_het
(filename: Union[str, os.PathLike[str], pathlib.Path]) → pandas.core.frame.DataFrame[source]¶ Parse PLINK’s het file format.
- Returns:
A (n x 5) table with the following columns
| name | dtype | description |
|---|---|---|
| ID (index) | object | Sample or Subject ID |
| O_HOM | int | Observed number of homozygotes |
| E_HOM | int | Expected number of homozygotes |
| N_NM | int | Number of non-missing autosomal genotypes |
| F | float | Method-of-moments F coefficient estimate |
- Return type:
pd.DataFrame
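For orientation, the method-of-moments F estimate is conventionally the excess of observed over expected homozygotes, scaled by the maximum possible excess. The sketch below assumes that definition and uses a hypothetical helper name, `inbreeding_f`; it recomputes F from the other three columns rather than reading it from the file.

```python
def inbreeding_f(o_hom: float, e_hom: float, n_nm: float) -> float:
    """Method-of-moments F: (O_HOM - E_HOM) / (N_NM - E_HOM)."""
    return (o_hom - e_hom) / (n_nm - e_hom)


# A sample with 80 observed and 60 expected homozygotes out of 100 genotypes.
f_value = inbreeding_f(80, 60, 100)
```

A value near 0 indicates the observed homozygosity matches expectation; strongly positive or negative values are commonly used to flag samples for review.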
-
cgr_gwas_qc.parsers.plink.
read_hwe
(filename: Union[str, os.PathLike[str], pathlib.Path]) → pandas.core.frame.DataFrame[source]¶ Parse PLINK’s hwe file format.
- Returns:
A (# SNP x 9) table with the following columns
| name | dtype | description |
|---|---|---|
| SNP (index) | string | SNP ID |
| CHR | string | Chromosome code |
| TEST | string | Type of test; one of {ALL, AFF, UNAFF, ALL(QT), ALL(NP)} |
| A1 | string | Allele 1 (usually minor) |
| A2 | string | Allele 2 (usually major) |
| GENO | string | ’/’-separated genotype counts (A1 hom, het, A2 hom) |
| O_HET | float | Observed heterozygote frequency |
| E_HET | float | Expected heterozygote frequency |
| P | float | Hardy-Weinberg equilibrium exact test p-value |
- Return type:
pd.DataFrame
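The GENO column packs three counts into one string, so a small sketch of unpacking it may help. The helper names are hypothetical, and the expected frequency here is the textbook 2pq under Hardy-Weinberg equilibrium, which is an assumption and may not match PLINK's reported E_HET to the last digit.

```python
from typing import Tuple


def parse_geno(geno: str) -> Tuple[int, int, int]:
    """Split 'A1hom/het/A2hom' into three integer counts."""
    a1_hom, het, a2_hom = (int(count) for count in geno.split("/"))
    return a1_hom, het, a2_hom


def observed_het(geno: str) -> float:
    """Observed heterozygote frequency: het / total genotypes."""
    a1_hom, het, a2_hom = parse_geno(geno)
    return het / (a1_hom + het + a2_hom)


def expected_het(geno: str) -> float:
    """Expected heterozygote frequency 2pq from the A1 allele frequency p."""
    a1_hom, het, a2_hom = parse_geno(geno)
    total = a1_hom + het + a2_hom
    p = (2 * a1_hom + het) / (2 * total)
    return 2 * p * (1 - p)
```

With `GENO = "10/40/50"`, the observed frequency is 40 of 100 genotypes and the A1 allele frequency is 60 of 200 alleles.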
-
cgr_gwas_qc.parsers.plink.
read_imiss
(filename: Union[str, os.PathLike[str], pathlib.Path]) → pandas.core.frame.DataFrame[source]¶ Parse PLINK’s imiss file format.
This reports sample-level missing data.
- Returns:
A (n x 5) table with the following columns
| name | dtype | description |
|---|---|---|
| ID (index) | object | Sample or Subject ID |
| MISS_PHENO | str | [Y/N] if the phenotype is missing |
| N_MISS | int | The number of missing genotype calls not including obligatory missing or heterozygous haploids. |
| N_GENO | int | Number of potentially valid calls |
| F_MISS | float | Missing call rate |
- Return type:
pd.DataFrame
-
cgr_gwas_qc.parsers.plink.
read_lmiss
(filename: Union[str, os.PathLike[str], pathlib.Path]) → pandas.core.frame.DataFrame[source]¶ Parse PLINK’s lmiss file format.
This reports SNP-level missing data.
- Returns:
A (n x 5) table with the following columns
| name | dtype | description |
|---|---|---|
| CHR | str | Chromosome code |
| SNP | str | Variant identifier |
| N_MISS | int | The number of missing genotype calls not including obligatory missing or heterozygous haploids. |
| N_GENO | int | Number of potentially valid calls |
| F_MISS | float | Missing call rate |
- Return type:
pd.DataFrame
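The missing call rate in the imiss and lmiss tables is the ratio of the two count columns. The sketch below restates that relationship with a hypothetical helper, `missing_rate`; returning NaN when there are no potentially valid calls is an assumption of this sketch, not documented parser behavior.

```python
import math


def missing_rate(n_miss: int, n_geno: int) -> float:
    """Missing call rate: N_MISS / N_GENO (NaN if N_GENO is zero)."""
    if n_geno == 0:
        return math.nan
    return n_miss / n_geno


# 5 missing calls out of 100 potentially valid calls.
rate = missing_rate(5, 100)
```

The call rate used in many QC thresholds is simply `1 - rate`.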
-
cgr_gwas_qc.parsers.plink.
read_sexcheck
(filename: Union[str, os.PathLike[str], pathlib.Path]) → pandas.core.frame.DataFrame[source]¶ Parse PLINK’s sexcheck file format.
- Returns:
A (n x 5) table with the following columns
| name | dtype | description |
|---|---|---|
| ID (index) | object | Sample or Subject ID |
| PEDSEX | int | Sex code in input file |
| SNPSEX | int | Imputed sex code (1 = male; 2 = female; 0 = unknown) |
| STATUS | str | OK if PEDSEX and SNPSEX match and are nonzero, PROBLEM otherwise |
| F | int | Inbreeding coefficient considering only X chromosome. |
- Return type:
pd.DataFrame
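The STATUS rule in the table can be expressed as a one-line check. The function name `sex_status` is hypothetical; the sketch only restates the rule as described and does not reproduce how PLINK derives SNPSEX.

```python
def sex_status(pedsex: int, snpsex: int) -> str:
    """Return 'OK' if both codes match and are nonzero, else 'PROBLEM'."""
    return "OK" if pedsex == snpsex and pedsex != 0 else "PROBLEM"
```

A pair of matching zero codes (sex unknown on both sides) is therefore still reported as PROBLEM.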
Parser for the PLINK BIM format.
-
class
cgr_gwas_qc.parsers.bim.
BimFile
(filename, mode='r')[source]¶ Provides an iterable interface to BIM files.
Illumina Parsers
Parser for the Illumina BPM format.
-
class
cgr_gwas_qc.parsers.bpm.
BpmFile
(filename)[source]¶ Provides an iterable interface to BPM files.
-
class
cgr_gwas_qc.parsers.bpm.
BpmRecord
(id: str, chrom: str, pos: int, allele_1: str, allele_2: str, snp: Union[str, NoneType] = None, ref_strand: Union[str, NoneType] = None, source_strand: Union[str, NoneType] = None)[source]¶
-
cgr_gwas_qc.parsers.bpm.
open
(filename)[source]¶ Note this has to be used as a context manager.
To open and close without using a with block, you must use the BpmFile class directly.
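The distinction between a context-manager-only `open` and a directly usable file class can be sketched generically. Everything below is a stand-in: `DemoFile` and `demo_open` are hypothetical names, and the assumption that the file class exposes a `close()` method is not confirmed by this page.

```python
from contextlib import contextmanager


class DemoFile:
    """Hypothetical stand-in for a parser file class such as BpmFile."""

    def __init__(self, filename: str):
        self.filename = filename
        self.closed = False

    def close(self) -> None:
        self.closed = True


@contextmanager
def demo_open(filename: str):
    """Yield a DemoFile and always close it when the with block exits."""
    handle = DemoFile(filename)
    try:
        yield handle
    finally:
        handle.close()


# Usage inside a with block: the handle is closed automatically.
with demo_open("example.bpm") as fh:
    open_inside_block = not fh.closed
closed_after_block = fh.closed

# Manual usage: construct the class directly and close it yourself.
manual = DemoFile("example.bpm")
manual.close()
```

A generator-based wrapper like this cannot be used as a plain function call, which is one reason a module-level `open` may be restricted to `with` statements.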
-
class
cgr_gwas_qc.parsers.illumina.adpc.
AdpcBase
[source]¶ Base class for Illumina’s adpc.bin files.
Based on information found on Picard’s website:
-
class
cgr_gwas_qc.parsers.illumina.IlluminaBeadArrayFiles.
BeadPoolManifest
(filename)[source]¶ Class for parsing binary (BPM) manifest file.
-
names
¶ Names of loci from manifest
- Type:
list of strings
-
snps
¶ SNP values of loci from manifest
- Type:
list of strings
-
chroms
¶ Chromosome values for loci
- Type:
list of strings
-
map_infos
¶ Map info values for loci
- Type:
list of ints
-
addresses
¶ AddressA IDs of loci from manifest
- Type:
list of ints
-
normalization_lookups
¶ Normalization lookups from manifest. This indexes into list of normalization transforms read from GTC file
- Type:
list of ints
-
ref_strands
¶ Reference strand annotation for loci (see RefStrand class)
- Type:
list of ints
-
source_strands
¶ Source strand annotations for loci (see SourceStrand class)
- Type:
list of ints
-
num_loci
¶ Number of loci in manifest
- Type:
int
-
manifest_name
¶ Name of manifest
- Type:
string
-
control_config
¶ Control description from manifest
- Type:
string
-
class
cgr_gwas_qc.parsers.illumina.IlluminaBeadArrayFiles.
GenotypeCalls
(filename, ignore_version=False, check_write_complete=True)[source]¶ Class to parse gtc files as produced by Illumina AutoConvert and AutoCall software.
-
supported_versions
¶ Supported file versions as a list of integers
-
get_autocall_date
()[source]¶ Returns: The imaging date of scanning as a string, for example:
2/17/2015 1:47 PM
-
get_autocall_version
()[source]¶ Returns: The version of AutoCall used for genotyping as a string, for example:
1.6.2.2
-
get_base_calls
() → Generator[str, None, None][source]¶ Yields: The genotype basecalls as a string. The characters are A, C, G, T, or - for a no-call/null. The calls are relative to the top strand.
-
get_base_calls_forward_strand
(snps, forward_strand_annotations) → Generator[str, None, None][source]¶ Get base calls on the forward strand.
- Parameters:
snps (list<string>) – A list of strings representing the snp on the design strand for the loci (e.g. [A/C])
forward_strand_annotations (list<int>) – A list of strand annotations for the loci (e.g., SourceStrand.Forward)
- Yields:
The genotype basecalls on the report strand as a string. The characters are A, C, G, T, or - for a no-call/null.
-
get_base_calls_generic
(snps, strand_annotations, report_strand, unknown_annotation) → Generator[str, None, None][source]¶ Get base calls on arbitrary strand.
- Parameters:
snps (list<string>) – A list of strings representing the snp on the design strand for the loci (e.g. [A/C])
strand_annotations (list<int>) – A list of strand annotations for the loci
report_strand (int) – The strand to use for reporting (must match encoding of strand_annotations)
unknown_annotation (int) – The encoding used in strand annotations for an unknown strand
- Yields:
The genotype basecalls on the report strand as a string. The characters are A, C, G, T, or - for a no-call/null.
-
get_base_calls_plus_strand
(snps, ref_strand_annotations) → Generator[str, None, None][source]¶ Get base calls on plus strand of genomic reference. If you only see no-calls returned from this method, please verify that the reference strand annotations passed as argument are not unknown (RefStrand.Unknown)
- Parameters:
snps (list<string>) – A list of strings representing the snp on the design strand for the loci (e.g. [A/C])
ref_strand_annotations (list<int>) – A list of strand annotations for the loci (e.g., RefStrand.Plus)
- Yields:
The genotype basecalls on the report strand as a string. The characters are A, C, G, T, or - for a no-call/null.
-
get_control_x_intensities
()[source]¶ Returns: The x intensities of control bead types as a list of integers
-
get_control_y_intensities
()[source]¶ Returns: The y intensities of control bead types as a list of integers
-
get_imaging_date
()[source]¶ Returns: The imaging date of scanning as a string, for example:
Monday, December 01, 2014 4:51:47 PM
-
get_normalization_transforms
()[source]¶ Returns: The normalization transforms used during genotyping (as a list of NormalizationTransforms)
-
get_normalized_intensities
(normalization_lookups)[source]¶ Calculate and return the normalized intensities.
- Parameters:
normalization_lookups – Map from each SNP to a normalization transform. This list can be obtained from the BeadPoolManifest object.
- Returns:
The normalized intensities for the sample as a list of (x,y) float tuples
-
get_percentiles_x
()[source]¶ Returns: An array of length three representing 5th, 50th and 95th percentiles for x intensity
-
get_percentiles_y
()[source]¶ Returns: An array of length three representing 5th, 50th and 95th percentiles for y intensity
-
get_raw_x_intensities
()[source]¶ Returns: The raw x intensities of assay bead types as a list of integers
-
class
cgr_gwas_qc.parsers.illumina.IlluminaBeadArrayFiles.
LocusEntry
(handle)[source]¶ Helper class representing a locus entry within a bead pool manifest. Currently only supports versions 6, 7, and 8.
-
ilmn_id
¶ IlmnID (probe identifier) of locus
- Type:
string
-
name
¶ Name (variant identifier) of locus
- Type:
string
-
snp
¶ SNP value for locus (e.g., [A/C])
- Type:
string
-
chrom
¶ Chromosome for the locus (e.g., XY)
- Type:
string
-
map_info
¶ Mapping location of locus
- Type:
int
-
assay_type
¶ Identifies type of assay (0 - Infinium II, 1 - Infinium I (A/T), 2 - Infinium I (G/C))
- Type:
int
-
address_a
¶ AddressA ID of locus
- Type:
int
-
address_b
¶ AddressB ID of locus (0 if none)
- Type:
int
-
ref_strand
¶ See RefStrand class
- Type:
int
-
source_strand
¶ See SourceStrand class
- Type:
int
-
cgr_gwas_qc.parsers.illumina.IlluminaBeadArrayFiles.
complement
(nucleotide)[source]¶ Complement a single nucleotide. Complements of D(eletion) and I(nsertion) are D and I, respectively.
- Parameters:
nucleotide (string) – Nucleotide, must be A, C, T, G, D, or I
- Returns:
Complemented nucleotide
- Return type:
str
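The behavior described for `complement` amounts to a fixed lookup table. The sketch below is an illustrative reimplementation under that reading; the name `complement_base` and the choice to raise `ValueError` for unsupported input are assumptions, not the library's documented behavior.

```python
# Lookup table: standard base pairs, with D and I mapping to themselves.
_COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C", "D": "D", "I": "I"}


def complement_base(nucleotide: str) -> str:
    """Return the complement of A, C, G, T, D, or I."""
    try:
        return _COMPLEMENT[nucleotide]
    except KeyError:
        # Error type is an assumption made for this sketch.
        raise ValueError(f"Unsupported nucleotide: {nucleotide!r}") from None
```

Applying the function twice returns the original nucleotide, which is a quick sanity check for any such table.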
-
cgr_gwas_qc.parsers.illumina.IlluminaBeadArrayFiles.
read_byte
(handle)[source]¶ Helper function to parse byte from file handle.
- Parameters:
handle (file handle) – File handle
- Returns:
byte value read from handle
-
cgr_gwas_qc.parsers.illumina.IlluminaBeadArrayFiles.
read_char
(handle)[source]¶ Helper function to parse character from file handle.
- Parameters:
handle (file handle) – File handle
- Returns:
char value read from handle
-
cgr_gwas_qc.parsers.illumina.IlluminaBeadArrayFiles.
read_float
(handle)[source]¶ Helper function to parse float from file handle.
- Parameters:
handle (file handle) – File handle
- Returns:
numpy.float32 value read from handle
-
cgr_gwas_qc.parsers.illumina.IlluminaBeadArrayFiles.
read_int
(handle)[source]¶ Helper function to parse int from file handle.
- Parameters:
handle (file handle) – File handle
- Returns:
numpy.int32 value read from handle
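To show what reading fixed-width values from a binary handle involves, here is a minimal sketch using the standard library `struct` module. It assumes 4-byte little-endian values and returns plain Python numbers rather than `numpy.int32` or `numpy.float32`; the names `read_int32` and `read_float32` are hypothetical.

```python
import io
import struct


def read_int32(handle) -> int:
    """Read a 4-byte little-endian signed integer from a binary handle."""
    return struct.unpack("<i", handle.read(4))[0]


def read_float32(handle) -> float:
    """Read a 4-byte little-endian float from a binary handle."""
    return struct.unpack("<f", handle.read(4))[0]


# Build an in-memory buffer holding one int32 followed by one float32.
buffer = io.BytesIO(struct.pack("<if", 42, 1.5))
first = read_int32(buffer)
second = read_float32(buffer)
```

Each call advances the handle by four bytes, so the order of calls must match the order in which the values were written.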
-
cgr_gwas_qc.parsers.illumina.IlluminaBeadArrayFiles.
read_scanner_data
(handle)[source]¶ Helper function to parse ScannerData object from file handle.
- Parameters:
handle (file handle) – File handle
- Returns:
ScannerData value read from handle
-
cgr_gwas_qc.parsers.illumina.IlluminaBeadArrayFiles.
read_string
(handle)[source]¶ Helper function to parse string from file handle. See https://msdn.microsoft.com/en-us/library/yzxa6408(v=vs.100).aspx for additional details on string format.
- Parameters:
handle (file handle) – File handle
- Returns:
string value read from handle
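The linked page describes the .NET `BinaryWriter` string layout: a length prefix stored as a 7-bit encoded integer, followed by that many bytes of text. The sketch below decodes that layout; the name `read_dotnet_string` is hypothetical, and decoding the payload as UTF-8 is an assumption of this sketch.

```python
import io


def read_dotnet_string(handle) -> str:
    """Read a length-prefixed string with a 7-bit encoded length."""
    length = 0
    shift = 0
    while True:
        chunk = handle.read(1)
        if not chunk:
            raise EOFError("Unexpected end of file while reading length")
        byte = chunk[0]
        # The low 7 bits carry data; the high bit means "more bytes follow".
        length |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            break
        shift += 7
    return handle.read(length).decode("utf-8")


short = read_dotnet_string(io.BytesIO(b"\x05hello"))
# 200 needs two length bytes: 0xC8 (low 7 bits + continue flag), then 0x01.
long_text = read_dotnet_string(io.BytesIO(b"\xc8\x01" + b"a" * 200))
```

Strings shorter than 128 bytes use a single length byte, which is why many manifest fields look like a one-byte count followed by text in a hex dump.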