Parsers¶

Eigensoft Parsers¶

class cgr_gwas_qc.parsers.eigensoft.Eigenvec(filename: pathlib.Path)[source]¶

Eigensoft eigenvec file parser.

filename¶

Path to the eigenvec file from smartpca.

Type:: Path

components¶

A (n x 11) table of principal components

name	dtype	description
ID (index)	object	Sample or Subject ID
PC1	object	The first principal component (PC1).
PC2	object	The second principal component.
…	…	The third to ninth principal components.
PC10	object	The tenth principal component.

Type:: pd.DataFrame

values¶

A vector of eigenvalues for PC1 to PC10.

Type:: np.ndarray
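As a rough illustration of the file layout this parser targets, the sketch below reads eigenvec-style text using only the standard library. It assumes the usual smartpca layout (a leading `#eigvals:` line, then one row per sample holding an ID, the components, and a trailing population label). `parse_eigenvec` is a hypothetical helper written for this example, not part of `cgr_gwas_qc`.

```python
from typing import Dict, List, Tuple


def parse_eigenvec(text: str) -> Tuple[List[float], Dict[str, List[float]]]:
    """Split eigenvec-style text into eigenvalues and per-sample components."""
    eigenvalues: List[float] = []
    components: Dict[str, List[float]] = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if fields[0].startswith("#eigvals"):
            # Header row: "#eigvals: v1 v2 ..."
            eigenvalues = [float(value) for value in fields[1:]]
        else:
            # Data row: ID, PC1..PCn, then a population label (assumed layout).
            sample_id, *values, _label = fields
            components[sample_id] = [float(value) for value in values]
    return eigenvalues, components


# Two components instead of ten, to keep the example short.
example = """#eigvals: 4.5 2.25
S1 0.10 -0.20 Case
S2 -0.30 0.40 Control
"""
eigenvalues, components = parse_eigenvec(example)
```

The documented class exposes the same two pieces of information as `values` and `components`; the plain lists and dictionary here only stand in for the NumPy array and DataFrame.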


GRAF Parsers¶

Genetic Relationship and Fingerprinting¶

GRAF is a package that allows estimation of relatedness and ancestry.

Relatedness¶

name	dtype	description
ID1	string
ID2	string
HG_match	int	number of SNPs with matched genotypes when only homozygous SNPs are counted
HG_miss	int	number of SNPs with mismatched genotypes when only homozygous SNPs are counted
HGMR	float	Homozygous Genotype Mismatch Rate (%)
AG_match	int	number of SNPs with matched genotypes when all SNPs are counted
AG_miss	int	number of SNPs with mismatched genotypes when all SNPs are counted
AGMR	float	All Genotype Mismatch Rate (%)
relationship	string	relationship determined by sample genotypes.
p_value	float	probability that the genetic relationship is NOT the predicted type
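The two mismatch-rate columns can be read as simple ratios of the count columns next to them. The sketch below assumes each rate is the number of mismatches divided by all compared SNPs, expressed as a percentage, which is how the column descriptions read; GRAF's exact computation is not shown in this documentation, and `mismatch_rate` is a hypothetical helper.

```python
def mismatch_rate(n_match: int, n_miss: int) -> float:
    """Return the mismatch rate as a percentage of compared SNPs."""
    total = n_match + n_miss
    if total == 0:
        return float("nan")  # no comparable SNPs for this pair
    return 100.0 * n_miss / total


# HGMR from HG_match/HG_miss; AGMR would use AG_match/AG_miss the same way.
hgmr = mismatch_rate(n_match=9_500, n_miss=500)
```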

Relationship Values¶

Categories are assigned by GRAF.

name	description
ID	duplicate or MZ twin
PO	parent-offspring
FS	full sibling
D2	2nd degree relative
D3	3rd degree relative
UN	unrelated

References

https://github.com/ncbi/graf
Jin Y, Schäffer AA, Sherry ST, and Feolo M (2017). Quickly identifying identical and closely related subjects in large databases using genotype data. PLoS One. 12(6):e0179106.

cgr_gwas_qc.parsers.graf.read_relatedness(filename: Union[str, os.PathLike[str], pathlib.Path]) → pandas.core.frame.DataFrame[source]¶

Reads the table generated by graf --out

Returns:

pd.DataFrame

ID1
ID2
HG_match
HG_miss
HGMR
AG_match
AG_miss
AGMR
relationship {ID, PO, FS, D2, D3, UN}
p_value

References

https://github.com/ncbi/graf#output-files

KING Parsers¶

KING Relationship Inferences¶

KING is a toolset for robustly identifying different types of relatedness. Unlike with other methods, you should only remove markers that fail QC (i.e., do not apply MAF or LD filters).

Kinship Table¶

name	dtype	description
ID1	string	Individual ID for the first individual of the pair
ID2	string	Individual ID for the second individual of the pair
N_SNP	UInt32	The number of SNPs that do not have missing genotypes in either individual
HetHet	float	Proportion of SNPs with double heterozygotes (e.g., AG and AG)
IBS0	float	Proportion of SNPs with zero IBS (identical-by-state) (e.g., AA and GG)
HetConc	float	Heterozygous concordance
HomIBS0	float	Homozygous IBS0
Kinship	float	Estimated kinship coefficient from the SNP data
IBD1Seg	float	Total length of IBD1 segments divided by total length of all segments
IBD2Seg	float	Total length of IBD2 segments divided by total length of all segments
PropIBD	float	Proportion of genome with shared IBD (e.g., IBD2Seg + IBD1Seg/2)
relationship	string	The assigned relationship based on Kinship

King Relationships¶

Categories were assigned based on Kinship ranges provided in the KING manual.

name	description
ID	duplicate or MZ twin
D1	1st degree relative
D2	2nd degree relative
D3	3rd degree relative
UN	unrelated
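A minimal sketch of how a kinship coefficient could be binned into these categories. The cut-offs (0.354, 0.177, 0.0884, 0.0442) are the inference ranges given in the KING manual; how the package treats values that fall exactly on a boundary is an assumption here, and `classify_kinship` is a hypothetical name.

```python
def classify_kinship(kinship: float) -> str:
    """Map a kinship coefficient to a relationship code."""
    if kinship > 0.354:
        return "ID"  # duplicate or MZ twin
    if kinship >= 0.177:
        return "D1"  # 1st degree relative
    if kinship >= 0.0884:
        return "D2"  # 2nd degree relative
    if kinship >= 0.0442:
        return "D3"  # 3rd degree relative
    return "UN"  # unrelated


codes = [classify_kinship(k) for k in (0.49, 0.25, 0.12, 0.05, 0.01)]
```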

References

http://people.virginia.edu/~wc9c/KING/manual.html
Manichaikul A, Mychaleckyj JC, Rich SS, Daly K, Sale M, Chen WM (2010) Robust relationship inference in genome-wide association studies. Bioinformatics 26(22):2867-2873

cgr_gwas_qc.parsers.king.read_related(filename: Union[str, os.PathLike[str], pathlib.Path], **kwargs) → pandas.core.frame.DataFrame[source]¶

Reads the table generated by king --related.

Reads the king file and re-assigns ID1/ID2 by sorting IDs alphanumerically.

Returns:

pd.DataFrame

ID1
ID2
N_SNP
HetHet
IBS0
HetConc
HomIBS0
Kinship
IBD1Seg
IBD2Seg
PropIBD
relationship {ID, PO, FS, D2, D3}

References

http://people.virginia.edu/~wc9c/KING/manual.html#WITHIN
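The ID re-assignment described for `read_related` can be pictured as ordering each pair so the smaller ID always comes first. The sketch assumes plain string comparison; the package may order IDs differently, and `order_pair` is a hypothetical helper.

```python
from typing import Tuple


def order_pair(id1: str, id2: str) -> Tuple[str, str]:
    """Return the two IDs so that the smaller one comes first."""
    return (id1, id2) if id1 <= id2 else (id2, id1)


pairs = [("SB002", "SB001"), ("SB001", "SB003")]
ordered = [order_pair(a, b) for a, b in pairs]
```

With a consistent order, the pair (A, B) and the pair (B, A) end up with the same key, which should make it easier to line up rows from different pairwise tables.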

Plink Parsers¶

cgr_gwas_qc.parsers.plink.read_genome(filename: Union[str, os.PathLike[str], pathlib.Path], required_cols=None, chunksize=100000) → pandas.core.frame.DataFrame[source]¶

Parse PLINK’s genome file format.

Each row of the genome file is a pairwise combination of samples/subjects.

Returns:

A (n x 17) table with the following columns

name	dtype	description
IID1	string	First Sample or Subject ID (alphanumerically)
IID2	string	Second Sample or Subject ID (alphanumerically)
RT	category	Relationship type inferred from .fam/.ped file {FS: Full Sib, HS: Half Sib, PO: Parent-Offspring, OT: Other}
EZ	object	IBD sharing expected value, based on just .fam/.ped relationship
Z0	float	P(IBD=0)
Z1	float	P(IBD=1)
Z2	float	P(IBD=2)
PI_HAT	float	Proportion IBD, i.e., P(IBD=2) + 0.5*P(IBD=1)
PHE	category	Pairwise phenotypic code (1, 0, -1 = case-case, case-ctrl, and ctrl-ctrl pairs, respectively)
DST	float	IBS distance, i.e., (IBS2 + 0.5*IBS1) / (IBS0 + IBS1 + IBS2)
PPC	float	IBS binomial test
RATIO	float	HETHET: IBS0 SNP ratio (expected value 2)
IBS0	int	Number of IBS 0 nonmissing variants
IBS1	int	Number of IBS 1 nonmissing variants
IBS2	int	Number of IBS 2 nonmissing variants
HOMHOM	int	Number of IBS 0 SNP pairs used in PPC test
HETHET	int	Number of IBS 2 het/het SNP pairs used in PPC test

Return type:

pd.DataFrame
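The PI_HAT and DST columns are defined by formulas over other columns in the same row. The sketch below restates those two formulas as plain functions; the function names are hypothetical and the inputs are made-up values chosen for easy arithmetic.

```python
def pi_hat(z1: float, z2: float) -> float:
    """Proportion IBD: P(IBD=2) + 0.5 * P(IBD=1)."""
    return z2 + 0.5 * z1


def ibs_distance(ibs0: int, ibs1: int, ibs2: int) -> float:
    """IBS distance: (IBS2 + 0.5 * IBS1) / (IBS0 + IBS1 + IBS2)."""
    return (ibs2 + 0.5 * ibs1) / (ibs0 + ibs1 + ibs2)


# Z1 = 0.5 and Z2 = 0.25 are the expected values for full siblings.
full_sib_pi_hat = pi_hat(z1=0.5, z2=0.25)
example_dst = ibs_distance(ibs0=10, ibs1=40, ibs2=50)
```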


cgr_gwas_qc.parsers.plink.read_het(filename: Union[str, os.PathLike[str], pathlib.Path]) → pandas.core.frame.DataFrame[source]¶

Parse PLINK’s het file format.

Returns:

A (n x 5) table with the following columns

name	dtype	description
ID (index)	object	Sample or Subject ID
O_HOM	int	Observed number of homozygotes
E_HOM	int	Expected number of homozygotes
N_NM	int	Number of non-missing autosomal genotypes
F	float	Method-of-moments F coefficient estimate

Return type:

pd.DataFrame


cgr_gwas_qc.parsers.plink.read_hwe(filename: Union[str, os.PathLike[str], pathlib.Path]) → pandas.core.frame.DataFrame[source]¶

Parse PLINK’s hwe file format.

Returns:

A (# SNP x 9) table with the following columns

name	dtype	description
SNP (index)	string	SNP ID
CHR	string	Chromosome code
TEST	string	Type of test; one of {ALL, AFF, UNAFF, ALL(QT), ALL(NP)}
A1	string	Allele 1 (usually minor)
A2	string	Allele 2 (usually major)
GENO	string	'/'-separated genotype counts (A1 hom, het, A2 hom)
O_HET	float	Observed heterozygote frequency
E_HET	float	Expected heterozygote frequency
P	float	Hardy-Weinberg equilibrium exact test p-value

Return type:

pd.DataFrame
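Because GENO is stored as a string, it usually needs splitting before the counts can be used. The sketch below splits a GENO value and derives the observed heterozygote frequency from it, assuming that frequency is the het count over all three genotype counts. Both helper names are hypothetical.

```python
from typing import Tuple


def parse_geno(geno: str) -> Tuple[int, int, int]:
    """Split a GENO string into (A1 hom, het, A2 hom) counts."""
    a1_hom, het, a2_hom = (int(count) for count in geno.split("/"))
    return a1_hom, het, a2_hom


def observed_het(geno: str) -> float:
    """Observed heterozygote frequency: het count over all genotypes."""
    a1_hom, het, a2_hom = parse_geno(geno)
    return het / (a1_hom + het + a2_hom)


counts = parse_geno("10/40/50")
o_het = observed_het("10/40/50")
```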


cgr_gwas_qc.parsers.plink.read_imiss(filename: Union[str, os.PathLike[str], pathlib.Path]) → pandas.core.frame.DataFrame[source]¶

Parse PLINK’s imiss file format.

This reports sample-level missing data.

Returns:

A (n x 5) table with the following columns

name	dtype	description
ID (index)	object	Sample or Subject ID
MISS_PHENO	str	[Y/N] if the phenotype is missing
N_MISS	int	The number of missing genotype calls not including obligatory missing or heterozygous haploids.
N_GENO	int	Number of potentially valid calls
F_MISS	float	Missing call rate

Return type:

pd.DataFrame


cgr_gwas_qc.parsers.plink.read_lmiss(filename: Union[str, os.PathLike[str], pathlib.Path]) → pandas.core.frame.DataFrame[source]¶

Parse PLINK’s lmiss file format.

This reports SNP-level missing data.

Returns:

A (n x 5) table with the following columns

name	dtype	description
CHR	str	Chromosome code
SNP	str	Variant identifier
N_MISS	int	The number of missing genotype calls not including obligatory missing or heterozygous haploids.
N_GENO	int	Number of potentially valid calls
F_MISS	float	Missing call rate

Return type:

pd.DataFrame


cgr_gwas_qc.parsers.plink.read_sexcheck(filename: Union[str, os.PathLike[str], pathlib.Path]) → pandas.core.frame.DataFrame[source]¶

Parse PLINK’s sexcheck file format.

Returns:

A (n x 5) table with the following columns

name	dtype	description
ID (index)	object	Sample or Subject ID
PEDSEX	int	Sex code in input file
SNPSEX	int	Imputed sex code (1 = male; 2 = female; 0 = unknown)
STATUS	str	OK if PEDSEX and SNPSEX match and are nonzero, PROBLEM otherwise
F	int	Inbreeding coefficient considering only X chromosome.

Return type:

pd.DataFrame
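The STATUS rule in the table can be written as a single comparison. This is a sketch of the rule as described above, not PLINK's or the parser's code, and `sexcheck_status` is a hypothetical name.

```python
def sexcheck_status(pedsex: int, snpsex: int) -> str:
    """Return "OK" when both codes are nonzero and equal, else "PROBLEM"."""
    if pedsex != 0 and pedsex == snpsex:
        return "OK"
    return "PROBLEM"


# Matching males, a mismatch, and two unknown (0) codes.
statuses = [sexcheck_status(1, 1), sexcheck_status(2, 1), sexcheck_status(0, 0)]
```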


Parser for the PLINK BIM format.

class cgr_gwas_qc.parsers.bim.BimFile(filename, mode='r')[source]¶: Provides an iterable interface to BIM files.

class cgr_gwas_qc.parsers.bim.BimRecord(id: str, chrom: str, pos: int, allele_1: str, allele_2: str, encoded_chrom: Union[str, NoneType] = None, morgans: Union[int, NoneType] = None)[source]¶

get_record_problems() → List[str][source]¶

Checks the record for common problems.

A convenience method that checks the record for a set of common problems and returns them as a list. Potential problems: [“not_major_chrom”, “bad_position”, “ambiguous_allele”, “indel”].

cgr_gwas_qc.parsers.bim.open(filename, mode: str = 'r')[source]¶

Note this has to be used as a context manager.

To open and close a file without a with block, use the BimFile class directly.

Illumina Parsers¶

Parser for the Illumina BPM format.

class cgr_gwas_qc.parsers.bpm.BpmFile(filename)[source]¶

Provides an iterable interface to BPM files.

write()[source]¶

This version of write adds self.endchar when writing.

If the subclass sets self.endchar, this method adds that character when writing a record. This is useful for automatically adding a newline character. It can be overridden by setting endchar to None when calling.

class cgr_gwas_qc.parsers.bpm.BpmRecord(id: str, chrom: str, pos: int, allele_1: str, allele_2: str, snp: Union[str, NoneType] = None, ref_strand: Union[str, NoneType] = None, source_strand: Union[str, NoneType] = None)[source]¶

cgr_gwas_qc.parsers.bpm.open(filename)[source]¶

Note this has to be used as a context manager.

To open and close a file without a with block, use the BpmFile class directly.

class cgr_gwas_qc.parsers.illumina.adpc.AdpcBase[source]¶

Base class for Illumina’s adpc.bin files.

Based on information found on Picard’s website:

https://javadoc.io/static/org.broadinstitute/gatk/4.1.4.1/picard/arrays/illumina/IlluminaAdpcFileWriter.html

class cgr_gwas_qc.parsers.illumina.adpc.AdpcReader(file_name: Union[str, pathlib.Path])[source]¶

class cgr_gwas_qc.parsers.illumina.adpc.AdpcWriter(file_name: Union[str, pathlib.Path])[source]¶

class cgr_gwas_qc.parsers.illumina.IlluminaBeadArrayFiles.BeadPoolManifest(filename)[source]¶

Class for parsing binary (BPM) manifest file.

names¶

Names of loci from manifest

Type:: list of strings

snps¶

SNP values of loci from manifest

Type:: list of strings

chroms¶

Chromosome values for loci

Type:: list of strings

map_infos¶

Map info values for loci

Type:: list of ints

addresses¶

AddressA IDs of loci from manifest

Type:: list of ints

normalization_lookups¶

Normalization lookups from manifest. This indexes into list of normalization transforms read from GTC file

Type:: list of ints

ref_strands¶

Reference strand annotation for loci (see RefStrand class)

Type:: list of ints

source_strands¶

Source strand annotations for loci (see SourceStrand class)

Type:: list of ints

num_loci¶

Number of loci in manifest

Type:: int

manifest_name¶

Name of manifest

Type:: string

control_config¶

Control description from manifest

Type:: string

class cgr_gwas_qc.parsers.illumina.IlluminaBeadArrayFiles.GenotypeCalls(filename, ignore_version=False, check_write_complete=True)[source]¶

Class to parse gtc files as produced by Illumina AutoConvert and AutoCall software.

supported_versions¶: Supported file versions as a list of integers

get_autocall_date()[source]¶: Returns: The imaging date of scanning as a string. For example:

2/17/2015 1:47 PM

get_autocall_version()[source]¶: Returns: The version of AutoCall used for genotyping as a string. For example:

1.6.2.2

get_ballele_freqs()[source]¶: Returns: The B allele frequencies as a list of floats

get_base_calls() → Generator[str, None, None][source]¶: Yields: The genotype basecalls as a string. The characters are A, C, G, T, or - for a no-call/null. The calls are relative to the top strand.

get_base_calls_forward_strand(snps, forward_strand_annotations) → Generator[str, None, None][source]¶

Get base calls on the forward strand.

Parameters:

snps (list<string>) – A list of strings representing the SNP on the design strand for the loci (e.g. [A/C])
forward_strand_annotations (list<int>) – A list of strand annotations for the loci (e.g., SourceStrand.Forward)

Yields:

The genotype basecalls on the report strand as a string. The characters are A, C, G, T, or - for a no-call/null.

get_base_calls_generic(snps, strand_annotations, report_strand, unknown_annotation) → Generator[str, None, None][source]¶

Get base calls on arbitrary strand.

Parameters:

snps (list<string>) – A list of strings representing the SNP on the design strand for the loci (e.g. [A/C])
strand_annotations (list<int>) – A list of strand annotations for the loci
report_strand (int) – The strand to use for reporting (must match encoding of strand_annotations)
unknown_annotation (int) – The encoding used in strand annotations for an unknown strand

Yields:: The genotype basecalls on the report strand as a string. The characters are A, C, G, T, or - for a no-call/null.

get_base_calls_plus_strand(snps, ref_strand_annotations) → Generator[str, None, None][source]¶

Get base calls on plus strand of genomic reference. If you only see no-calls returned from this method, please verify that the reference strand annotations passed as argument are not unknown (RefStrand.Unknown)

Parameters:

snps (list<string>) – A list of strings representing the SNP on the design strand for the loci (e.g. [A/C])
ref_strand_annotations (list<int>) – A list of strand annotations for the loci (e.g., RefStrand.Plus)

Yields:

The genotype basecalls on the report strand as a string. The characters are A, C, G, T, or - for a no-call/null.

get_call_rate()[source]¶: Returns: The call rate as a float

get_cluster_file()[source]¶: Returns: The name of the cluster file used for genotyping as a string

get_control_x_intensities()[source]¶: Returns: The x intensities of control bead types as a list of integers

get_control_y_intensities()[source]¶: Returns: The y intensities of control bead types as a list of integers

get_gc10()[source]¶: Returns: The GC10 (GenCall score - 10th percentile) as a float

get_gc50()[source]¶: Returns: The GC50 (GenCall score - 50th percentile) as a float

get_gender()[source]¶: Returns: The gender as a char (M - Male, F - Female, U - Unknown)

get_genotype_scores()[source]¶: Returns: The genotype scores as a list of floats

get_genotypes()[source]¶: Returns: A byte list (string) of genotypes. See code2genotype for mapping

get_imaging_date()[source]¶: Returns: The imaging date of scanning as a string. For example:

Monday, December 01, 2014 4:51:47 PM

get_logr_dev()[source]¶: Returns: The logR deviation as a float

get_logr_ratios()[source]¶: Returns: The logR ratios as a list of floats

get_normalization_transforms()[source]¶: Returns: The normalization transforms used during genotyping (as a list of NormalizationTransforms)

get_normalized_intensities(normalization_lookups)[source]¶

Calculate and return the normalized intensities.

Parameters:

normalization_lookups – Map from each SNP to a normalization transform. This list can be obtained from the BeadPoolManifest object.

Returns:: The normalized intensities for the sample as a list of (x,y) float tuples

get_num_calls()[source]¶: Returns: The number of calls as an integer

get_num_intensity_only()[source]¶: Returns: The number of intensity only SNPs

get_num_no_calls()[source]¶: Returns: The number of no calls as an integer

get_num_snps()[source]¶: Returns: The number of SNPs in the file as an integer

get_percentiles_x()[source]¶: Returns: An array of length three representing 5th, 50th and 95th percentiles for x intensity

get_percentiles_y()[source]¶: Returns: An array of length three representing 5th, 50th and 95th percentiles for y intensity

get_ploidy()[source]¶: Returns: The ploidy of the sample

get_ploidy_type()[source]¶: Returns: The ploidy type of the sample

get_raw_x_intensities()[source]¶: Returns: The raw x intensities of assay bead types as a list of integers

get_raw_y_intensities()[source]¶: Returns: The raw y intensities of assay bead types as a list of integers

get_sample_name()[source]¶: Returns: The name of the sample as a string

get_sample_plate()[source]¶: Returns: The name of the sample plate as a string

get_sample_well()[source]¶: Returns: The name of the sample well as a string

get_scanner_data()[source]¶: Returns: Information about scanner as ScannerData object

get_slide_identifier()[source]¶: Returns: The slide identifier as a string

get_snp_manifest()[source]¶: Returns: The name of the manifest used for genotyping as a string

is_write_complete()[source]¶

Check for last item written to GTC file to verify that write has successfully completed


Returns: Whether or not write is complete (bool)

class cgr_gwas_qc.parsers.illumina.IlluminaBeadArrayFiles.LocusEntry(handle)[source]¶

Helper class representing a locus entry within a bead pool manifest. Currently only supports versions 6, 7, and 8.

ilmn_id¶

IlmnID (probe identifier) of locus

Type:: string

name¶

Name (variant identifier) of locus

Type:: string

snp¶

SNP value for locus (e.g., [A/C])

Type:: string

chrom¶

Chromosome for the locus (e.g., XY)

Type:: string

map_info¶

Mapping location of locus

Type:: int

assay_type¶

Identifies type of assay (0 - Infinium II, 1 - Infinium I (A/T), 2 - Infinium I (G/C))

Type:: int

address_a¶

AddressA ID of locus

Type:: int

address_b¶

AddressB ID of locus (0 if none)

Type:: int

ref_strand¶

See RefStrand class

Type:: int

source_strand¶

See SourceStrand class

Type:: int

cgr_gwas_qc.parsers.illumina.IlluminaBeadArrayFiles.complement(nucleotide)[source]¶

Complement a single nucleotide. Complements of D(eletion) and I(nsertion) are D and I, respectively.

Parameters:

nucleotide (string) – Nucleotide, must be A, C, T, G, D, or I

Returns:: Complemented nucleotide
Return type:: str
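The behaviour described for `complement` amounts to a small lookup table. The sketch below is a stand-in under that reading, not the module's source; `complement_nucleotide` and `COMPLEMENTS` are hypothetical names, and raising `ValueError` for other characters is an assumption.

```python
COMPLEMENTS = {"A": "T", "T": "A", "C": "G", "G": "C", "D": "D", "I": "I"}


def complement_nucleotide(nucleotide: str) -> str:
    """Return the complement of one nucleotide (D and I map to themselves)."""
    try:
        return COMPLEMENTS[nucleotide]
    except KeyError:
        raise ValueError(f"Expected one of A, C, T, G, D, I; got {nucleotide!r}") from None


complemented = [complement_nucleotide(base) for base in "ACGTDI"]
```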

cgr_gwas_qc.parsers.illumina.IlluminaBeadArrayFiles.read_byte(handle)[source]¶

Helper function to parse byte from file handle.

Parameters:

handle (file handle) – File handle

Returns:: byte value read from handle

cgr_gwas_qc.parsers.illumina.IlluminaBeadArrayFiles.read_char(handle)[source]¶

Helper function to parse character from file handle.

Parameters:

handle (file handle) – File handle

Returns:: char value read from handle

cgr_gwas_qc.parsers.illumina.IlluminaBeadArrayFiles.read_float(handle)[source]¶

Helper function to parse float from file handle.

Parameters:

handle (file handle) – File handle

Returns:: numpy.float32 value read from handle

cgr_gwas_qc.parsers.illumina.IlluminaBeadArrayFiles.read_int(handle)[source]¶

Helper function to parse int from file handle.

Parameters:

handle (file handle) – File handle

Returns:: numpy.int32 value read from handle

cgr_gwas_qc.parsers.illumina.IlluminaBeadArrayFiles.read_scanner_data(handle)[source]¶

Helper function to parse ScannerData object from file handle.

Parameters:

handle (file handle) – File handle

Returns:: ScannerData value read from handle

cgr_gwas_qc.parsers.illumina.IlluminaBeadArrayFiles.read_string(handle)[source]¶

Helper function to parse string from file handle. See https://msdn.microsoft.com/en-us/library/yzxa6408(v=vs.100).aspx for additional details on string format.

Parameters:

handle (file handle) – File handle

Returns:: string value read from handle
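The linked page describes the .NET `BinaryWriter` string layout: the byte length is written first as a 7-bit encoded integer, followed by the encoded text. The sketch below reads that layout with the standard library, assuming UTF-8 text. It is an illustration of the format, not the module's implementation, and `read_length_prefixed_string` is a hypothetical name.

```python
import io
from typing import BinaryIO


def read_length_prefixed_string(handle: BinaryIO) -> str:
    """Read a string whose byte length is stored as a 7-bit encoded integer."""
    length = 0
    shift = 0
    while True:
        (byte,) = handle.read(1)  # raises ValueError at end of file
        length |= (byte & 0x7F) << shift  # low 7 bits carry the value
        if (byte & 0x80) == 0:  # a clear high bit marks the last length byte
            break
        shift += 7
    return handle.read(length).decode("utf-8")


# A 5-byte payload: one length byte (0x05) followed by the text.
short_text = read_length_prefixed_string(io.BytesIO(b"\x05hello"))
# A 200-byte payload needs two length bytes: 0xC8 0x01.
long_text = read_length_prefixed_string(io.BytesIO(b"\xc8\x01" + b"a" * 200))
```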

cgr_gwas_qc.parsers.illumina.IlluminaBeadArrayFiles.read_ushort(handle)[source]¶

Helper function to parse ushort from file handle.

Parameters:

handle (file handle) – File handle

Returns:: numpy.int16 value read from handle
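The fixed-width helpers above each read a known number of bytes and interpret them as one value. The sketch below shows the idea with `struct` from the standard library, assuming little-endian data. The documented functions return NumPy scalars, so this only approximates them, and the `unpack_*` names are hypothetical.

```python
import io
import struct
from typing import BinaryIO


def unpack_ushort(handle: BinaryIO) -> int:
    """Read an unsigned 16-bit integer (2 bytes, little-endian)."""
    return struct.unpack("<H", handle.read(2))[0]


def unpack_int(handle: BinaryIO) -> int:
    """Read a signed 32-bit integer (4 bytes, little-endian)."""
    return struct.unpack("<i", handle.read(4))[0]


def unpack_float(handle: BinaryIO) -> float:
    """Read a 32-bit float (4 bytes, little-endian)."""
    return struct.unpack("<f", handle.read(4))[0]


# Pack known values, then read them back in the same order.
buffer = io.BytesIO(struct.pack("<Hif", 65535, -42, 0.5))
values = (unpack_ushort(buffer), unpack_int(buffer), unpack_float(buffer))
```

Each call advances the handle, so the order of calls has to match the order in which the fields were written.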