Parsers

Eigensoft Parsers

class cgr_gwas_qc.parsers.eigensoft.Eigenvec(filename: pathlib.Path)[source]

Eigensoft eigenvec file parser.

filename

Path to the eigenvec file from smartpca.

Type:

Path

components

A (n x 11) table of principal components

| name | dtype | description |
| --- | --- | --- |
| ID (index) | object | Sample or Subject ID |
| PC1 | object | The first principal component (PC1). |
| PC2 | object | The second principal component. |
| ... | ... | The third to ninth principal components. |
| PC10 | object | The tenth principal component. |

Type:

pd.DataFrame

values

A vector of eigenvalues for PC1 to PC10.

Type:

np.ndarray
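As a rough illustration of where the two attributes come from, the sketch below splits eigenvec-style text into eigenvalues and per-sample components using only the standard library. It assumes the usual smartpca layout (a `#eigvals:` header row, then one row per sample ending in a label column). The function name `parse_eigenvec_text` and the two-component sample data are hypothetical, and this is not the `Eigenvec` implementation, which returns a DataFrame and an ndarray.

```python
from typing import Dict, List, Tuple


def parse_eigenvec_text(text: str) -> Tuple[List[float], Dict[str, List[float]]]:
    """Split eigenvec-style text into eigenvalues and per-sample components."""
    values: List[float] = []
    components: Dict[str, List[float]] = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if fields[0] == "#eigvals:":
            # Header row: one eigenvalue per principal component.
            values = [float(x) for x in fields[1:]]
        else:
            # Data row: sample ID, the components, then a trailing label column.
            sample_id, *pcs, _label = fields
            components[sample_id] = [float(x) for x in pcs]
    return values, components


# Hypothetical two-component example; a real file would carry ten components.
example = """\
   #eigvals:   4.200   2.100
   S1   0.0123  -0.0456   Case
   S2  -0.0789   0.0012   Control
"""
values, components = parse_eigenvec_text(example)
```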

GRAF Parsers

Genetic Relationship and Fingerprinting

GRAF is a package that allows estimation of relatedness and ancestry.

Relatedness

| name | dtype | description |
| --- | --- | --- |
| ID1 | string | |
| ID2 | string | |
| HG_match | int | number of SNPs with matched genotypes when only homozygous SNPs are counted |
| HG_miss | int | number of SNPs with mismatched genotypes when only homozygous SNPs are counted |
| HGMR | float | Homozygous Genotype Mismatch Rate (%) |
| AG_match | int | number of SNPs with matched genotypes when all SNPs are counted |
| AG_miss | int | number of SNPs with mismatched genotypes when all SNPs are counted |
| AGMR | float | All Genotype Mismatch Rate (%) |
| relationship | string | relationship determined by sample genotypes. |
| p_value | float | probability that the genetic relationship is NOT the predicted type |
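The match and mismatch counts relate to the two rate columns in a simple way. The sketch below assumes each rate is the mismatched count divided by the total compared SNPs, expressed as a percentage. That formula is an assumption based on the column descriptions, and `mismatch_rate` is a hypothetical helper, not part of the parser.

```python
def mismatch_rate(n_match: int, n_miss: int) -> float:
    """Percentage of compared SNPs whose genotypes do not match (assumed formula)."""
    total = n_match + n_miss
    if total == 0:
        return float("nan")  # no comparable SNPs for this pair
    return 100.0 * n_miss / total


# Hypothetical counts for one pair of samples.
hgmr = mismatch_rate(n_match=9_900, n_miss=100)    # from HG_match / HG_miss
agmr = mismatch_rate(n_match=9_000, n_miss=1_000)  # from AG_match / AG_miss
```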

Relationship Values

Categories are assigned by GRAF.

| name | description |
| --- | --- |
| ID | duplicate or MZ twin |
| PO | parent-offspring |
| FS | full sibling |
| D2 | 2nd degree relative |
| D3 | 3rd degree relative |
| UN | unrelated |

References

  • https://github.com/ncbi/graf

  • Jin Y, Schäffer AA, Sherry ST, and Feolo M (2017). Quickly identifying identical and closely related subjects in large databases using genotype data. PLoS One. 12(6):e0179106.

cgr_gwas_qc.parsers.graf.read_relatedness(filename: Union[str, os.PathLike[str], pathlib.Path]) → pandas.core.frame.DataFrame[source]

Reads the table generated by graf --out

Returns:

pd.DataFrame

  • ID1

  • ID2

  • HG_match

  • HG_miss

  • HGMR

  • AG_match

  • AG_miss

  • AGMR

  • relationship {ID, PO, FS, D2, D3, UN}

  • p_value

KING Parsers

KING Relationship Inferences

KING is a toolset to robustly identify different types of relatedness. Unlike other methods, you should only remove markers that fail QC (i.e., do not apply MAF or LD filters).

Kinship Table

| name | dtype | description |
| --- | --- | --- |
| ID1 | string | Individual ID for the first individual of the pair |
| ID2 | string | Individual ID for the second individual of the pair |
| N_SNP | UInt32 | The number of SNPs that do not have missing genotypes in either individual |
| HetHet | float | Proportion of SNPs with double heterozygotes (e.g. AG and AG) |
| IBS0 | float | Proportion of SNPs with zero IBS (identical-by-state) (e.g. AA and GG) |
| HetConc | float | Heterozygous concordance |
| HomIBS0 | float | Homozygous IBS0 |
| Kinship | float | Estimated kinship coefficient from the SNP data |
| IBD1Seg | float | Total length of IBD1 segments divided by total length of all segments |
| IBD2Seg | float | Total length of IBD2 segments divided by total length of all segments |
| PropIBD | float | Proportion of genome with shared IBD (i.e. IBD2Seg + IBD1Seg/2) |
| relationship | string | The assigned relationship based on Kinship |
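The PropIBD description gives the arithmetic IBD2Seg + IBD1Seg/2. A minimal sketch of that calculation follows. The helper name `prop_ibd` and the input values are hypothetical, chosen only to illustrate the formula.

```python
def prop_ibd(ibd1_seg: float, ibd2_seg: float) -> float:
    """Proportion of genome shared IBD, following IBD2Seg + IBD1Seg/2."""
    return ibd2_seg + ibd1_seg / 2


# Illustrative inputs: one copy shared IBD across the whole genome.
whole_genome_ibd1 = prop_ibd(ibd1_seg=1.0, ibd2_seg=0.0)
# Illustrative inputs: half the genome IBD1 and a quarter IBD2.
mixed_sharing = prop_ibd(ibd1_seg=0.5, ibd2_seg=0.25)
```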

KING Relationships

Categories were assigned based on Kinship ranges provided in the KING manual.

| name | description |
| --- | --- |
| ID | duplicate or MZ twin |
| D1 | 1st degree relative |
| D2 | 2nd degree relative |
| D3 | 3rd degree relative |
| UN | unrelated |
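The category assignment can be sketched as a threshold lookup on the Kinship column. The cut-offs below are the kinship ranges commonly quoted from the KING manual (>0.354, 0.177 to 0.354, 0.0884 to 0.177, 0.0442 to 0.0884). How values exactly on a boundary are treated is an assumption here, and `assign_relationship` is a hypothetical name, not the parser's own function.

```python
def assign_relationship(kinship: float) -> str:
    """Map a kinship coefficient to a category code (boundary handling assumed)."""
    if kinship > 0.354:
        return "ID"  # duplicate or MZ twin
    if kinship >= 0.177:
        return "D1"  # 1st degree relative
    if kinship >= 0.0884:
        return "D2"  # 2nd degree relative
    if kinship >= 0.0442:
        return "D3"  # 3rd degree relative
    return "UN"      # unrelated


# Illustrative kinship values, one per category.
categories = [assign_relationship(k) for k in (0.5, 0.25, 0.125, 0.0625, 0.0)]
```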

Reads the table generated by king --related.

Reads the king file and re-assigns ID1/ID2 by sorting IDs alphanumerically.

Returns:

pd.DataFrame

  • ID1

  • ID2

  • N_SNP

  • HetHet

  • IBS0

  • HetConc

  • HomIBS0

  • Kinship

  • IBD1Seg

  • IBD2Seg

  • PropIBD

  • relationship {ID, PO, FS, D2, D3}
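The ID re-assignment mentioned above can be pictured as ordering each pair so that ID1 always sorts before ID2. This sketch assumes plain string comparison, which may differ from the parser's actual ordering. `order_pair` and the sample IDs are hypothetical.

```python
from typing import List, Tuple


def order_pair(id_a: str, id_b: str) -> Tuple[str, str]:
    """Return the two IDs in sorted order so a pair always has one spelling."""
    first, second = sorted((id_a, id_b))
    return first, second


# The same pair reported in either direction ends up identical after ordering.
pairs: List[Tuple[str, str]] = [("S2", "S1"), ("S1", "S2")]
ordered = [order_pair(a, b) for a, b in pairs]
```

Ordering the pair this way makes it straightforward to join tables from different tools on (ID1, ID2) without checking both directions.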

Illumina Parsers

Parser for the Illumina BPM format.

class cgr_gwas_qc.parsers.bpm.BpmFile(filename)[source]

Provides an iterable interface to BPM files.

write()[source]

This version of write adds self.endchar when writing.

If the subclass sets self.endchar, this method adds that character when writing a record. This is useful for automatically adding a newline character. It can be overridden by setting endchar to None when calling.
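The endchar behaviour described above might look roughly like the following. This is a generic sketch with hypothetical classes (`DemoWriter`, `LineWriter`) and a hypothetical `write(record, endchar=...)` signature. It is not the `BpmFile` code, whose real base class and arguments are not shown on this page.

```python
import io
from typing import Optional

_UNSET = object()  # sentinel so that an explicit endchar=None can be detected


class DemoWriter:
    """Hypothetical writer that appends an optional end character."""

    endchar: Optional[str] = None

    def __init__(self, handle: io.TextIOBase) -> None:
        self.handle = handle

    def write(self, record: str, endchar=_UNSET) -> None:
        # Use the class-level endchar unless the caller passes one explicitly.
        end = self.endchar if endchar is _UNSET else endchar
        self.handle.write(record + (end or ""))


class LineWriter(DemoWriter):
    endchar = "\n"  # subclass opts in to automatic newlines


buffer = io.StringIO()
writer = LineWriter(buffer)
writer.write("record-1")                # newline appended automatically
writer.write("record-2", endchar=None)  # override: nothing appended
written = buffer.getvalue()
```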

class cgr_gwas_qc.parsers.bpm.BpmRecord(id: str, chrom: str, pos: int, allele_1: str, allele_2: str, snp: Union[str, NoneType] = None, ref_strand: Union[str, NoneType] = None, source_strand: Union[str, NoneType] = None)[source]

cgr_gwas_qc.parsers.bpm.open(filename)[source]

Note this has to be used as a context manager.

To open and close without a with block, you must use the BpmFile class directly.

class cgr_gwas_qc.parsers.illumina.adpc.AdpcBase[source]

Base class for Illumina’s adpc.bin files.

Based on information found on Picard’s website:

https://javadoc.io/static/org.broadinstitute/gatk/4.1.4.1/picard/arrays/illumina/IlluminaAdpcFileWriter.html

class cgr_gwas_qc.parsers.illumina.adpc.AdpcReader(file_name: Union[str, pathlib.Path])[source]

class cgr_gwas_qc.parsers.illumina.adpc.AdpcWriter(file_name: Union[str, pathlib.Path])[source]

class cgr_gwas_qc.parsers.illumina.IlluminaBeadArrayFiles.BeadPoolManifest(filename)[source]

Class for parsing binary (BPM) manifest file.

names

Names of loci from manifest

Type:

list of strings

snps

SNP values of loci from manifest

Type:

list of strings

chroms

Chromosome values for loci

Type:

list of strings

map_infos

Map info values for loci

Type:

list of ints

addresses

AddressA IDs of loci from manifest

Type:

list of ints

normalization_lookups

Normalization lookups from manifest. This indexes into list of normalization transforms read from GTC file

Type:

list of ints

ref_strands

Reference strand annotation for loci (see RefStrand class)

Type:

list of ints

source_strands

Source strand annotations for loci (see SourceStrand class)

Type:

list of ints

num_loci

Number of loci in manifest

Type:

int

manifest_name

Name of manifest

Type:

string

control_config

Control description from manifest

Type:

string

class cgr_gwas_qc.parsers.illumina.IlluminaBeadArrayFiles.GenotypeCalls(filename, ignore_version=False, check_write_complete=True)[source]

Class to parse gtc files as produced by Illumina AutoConvert and AutoCall software.

supported_versions

Supported file versions as a list of integers

get_autocall_date()[source]

Returns: The AutoCall date as a string, for example:

2/17/2015 1:47 PM

get_autocall_version()[source]

Returns: The version of AutoCall used for genotyping as a string, for example:

1.6.2.2

get_ballele_freqs()[source]

Returns: The B allele frequencies as a list of floats

get_base_calls() → Generator[str, None, None][source]

Yields: The genotype basecalls as a string. The characters are A, C, G, T, or - for a no-call/null. The calls are relative to the top strand.

get_base_calls_forward_strand(snps, forward_strand_annotations) → Generator[str, None, None][source]

Get base calls on the forward strand.

Parameters:
  • snps (list<string>) – A list of strings representing the SNP on the design strand for the loci (e.g. [A/C])

  • forward_strand_annotations (list<int>) – A list of strand annotations for the loci (e.g., SourceStrand.Forward)

Yields:

The genotype basecalls on the report strand as a string. The characters are A, C, G, T, or - for a no-call/null.

get_base_calls_generic(snps, strand_annotations, report_strand, unknown_annotation) → Generator[str, None, None][source]

Get base calls on an arbitrary strand.

Parameters:
  • snps (list<string>) – A list of strings representing the SNP on the design strand for the loci (e.g. [A/C])

  • strand_annotations (list<int>) – A list of strand annotations for the loci

  • report_strand (int) – The strand to use for reporting (must match encoding of strand_annotations)

  • unknown_annotation (int) – The encoding used in strand annotations for an unknown strand

Yields:

The genotype basecalls on the report strand as a string. The characters are A, C, G, T, or - for a no-call/null.

get_base_calls_plus_strand(snps, ref_strand_annotations) → Generator[str, None, None][source]

Get base calls on plus strand of genomic reference. If you only see no-calls returned from this method, please verify that the reference strand annotations passed as argument are not unknown (RefStrand.Unknown)

Parameters:
  • snps (list<string>) – A list of strings representing the SNP on the design strand for the loci (e.g. [A/C])

  • ref_strand_annotations (list<int>) – A list of strand annotations for the loci (e.g., RefStrand.Plus)

Yields:

The genotype basecalls on the report strand as a string. The characters are A, C, G, T, or - for a no-call/null.

get_call_rate()[source]

Returns: The call rate as a float

get_cluster_file()[source]

Returns: The name of the cluster file used for genotyping as a string

get_control_x_intensities()[source]

Returns: The x intensities of control bead types as a list of integers

get_control_y_intensities()[source]

Returns: The y intensities of control bead types as a list of integers

get_gc10()[source]

Returns: The GC10 (GenCall score - 10th percentile) as a float

get_gc50()[source]

Returns: The GC50 (GenCall score - 50th percentile) as a float

get_gender()[source]

Returns: The gender as a char (M - Male, F - Female, U - Unknown)

get_genotype_scores()[source]

Returns: The genotype scores as a list of floats

get_genotypes()[source]

Returns: A byte list (string) of genotypes. See code2genotype for mapping

get_imaging_date()[source]

Returns: The imaging date of scanning as a string, for example:

Monday, December 01, 2014 4:51:47 PM

get_logr_dev()[source]

Returns: The logR deviation as a float

get_logr_ratios()[source]

Returns: The logR ratios as a list of floats

get_normalization_transforms()[source]

Returns: The normalization transforms used during genotyping (as a list of NormalizationTransforms)

get_normalized_intensities(normalization_lookups)[source]

Calculate and return the normalized intensities.

Parameters:
  • normalization_lookups – Map from each SNP to a normalization transform. This list can be obtained from the BeadPoolManifest object.

Returns:

The normalized intensities for the sample as a list of (x,y) float tuples
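To show how a lookup list ties each SNP to a transform, the sketch below indexes into a list of transforms and applies the selected one to each raw (x, y) pair. The transforms here are hypothetical stand-ins that only rescale. The real NormalizationTransform does more than this, and `apply_lookups` is not part of the library.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical transform type: takes raw x and y, returns normalized x and y.
Transform = Callable[[int, int], Tuple[float, float]]


def apply_lookups(
    raw_x: Sequence[int],
    raw_y: Sequence[int],
    normalization_lookups: Sequence[int],
    transforms: Sequence[Transform],
) -> List[Tuple[float, float]]:
    """Pick each SNP's transform by its lookup index and apply it."""
    return [
        transforms[lookup](x, y)
        for x, y, lookup in zip(raw_x, raw_y, normalization_lookups)
    ]


# Two stand-in transforms that simply rescale the intensities.
transforms: List[Transform] = [
    lambda x, y: (x / 100, y / 100),
    lambda x, y: (x / 200, y / 200),
]
normalized = apply_lookups(
    raw_x=[100, 200, 300],
    raw_y=[50, 100, 150],
    normalization_lookups=[0, 1, 0],  # SNP index -> transform index
    transforms=transforms,
)
```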

get_num_calls()[source]

Returns: The number of calls as an integer

get_num_intensity_only()[source]

Returns: The number of intensity only SNPs

get_num_no_calls()[source]

Returns: The number of no calls as an integer

get_num_snps()[source]

Returns: The number of SNPs in the file as an integer

get_percentiles_x()[source]

Returns: An array of length three representing 5th, 50th and 95th percentiles for x intensity

get_percentiles_y()[source]

Returns: An array of length three representing 5th, 50th and 95th percentiles for y intensity

get_ploidy()[source]

Returns: The ploidy of the sample

get_ploidy_type()[source]

Returns: The ploidy type of the sample

get_raw_x_intensities()[source]

Returns: The raw x intensities of assay bead types as a list of integers

get_raw_y_intensities()[source]

Returns: The raw y intensities of assay bead types as a list of integers

get_sample_name()[source]

Returns: The name of the sample as a string

get_sample_plate()[source]

Returns: The name of the sample plate as a string

get_sample_well()[source]

Returns: The name of the sample well as a string

get_scanner_data()[source]

Returns: Information about scanner as ScannerData object

get_slide_identifier()[source]

Returns: The slide identifier as a string

get_snp_manifest()[source]

Returns: The name of the manifest used for genotyping as a string

is_write_complete()[source]

Check for last item written to GTC file to verify that write has successfully completed

Parameters:

None

Returns:

Whether or not write is complete (bool)

class cgr_gwas_qc.parsers.illumina.IlluminaBeadArrayFiles.LocusEntry(handle)[source]

Helper class representing a locus entry within a bead pool manifest. Currently only supports versions 6, 7, and 8.

ilmn_id

IlmnID (probe identifier) of locus

Type:

string

name

Name (variant identifier) of locus

Type:

string

snp

SNP value for locus (e.g., [A/C])

Type:

string

chrom

Chromosome for the locus (e.g., XY)

Type:

string

map_info

Mapping location of locus

Type:

int

assay_type

Identifies type of assay (0 - Infinium II, 1 - Infinium I (A/T), 2 - Infinium I (G/C))

Type:

int

address_a

AddressA ID of locus

Type:

int

address_b

AddressB ID of locus (0 if none)

Type:

int

ref_strand

See RefStrand class

Type:

int

source_strand

See SourceStrand class

Type:

int

cgr_gwas_qc.parsers.illumina.IlluminaBeadArrayFiles.complement(nucleotide)[source]

Complement a single nucleotide. Complements of D(eletion) and I(nsertion) are D and I, respectively.

Parameters:
  • nucleotide (string) – Nucleotide, must be A, C, T, G, D, or I

Returns:

Complemented nucleotide

Return type:

str
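The complement rule described for this function can be sketched as a small lookup table. The name `complement_sketch` is hypothetical, and raising ValueError for other inputs is an assumption, since this page does not document the library's error behaviour.

```python
# Standard base pairing, with D(eletion) and I(nsertion) mapping to themselves.
_COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C", "D": "D", "I": "I"}


def complement_sketch(nucleotide: str) -> str:
    """Return the complement of one nucleotide code."""
    try:
        return _COMPLEMENT[nucleotide]
    except KeyError:
        # Assumed behaviour for anything outside A, C, T, G, D, I.
        raise ValueError(f"unsupported nucleotide: {nucleotide!r}") from None


complemented = [complement_sketch(n) for n in "ACGTDI"]
```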

cgr_gwas_qc.parsers.illumina.IlluminaBeadArrayFiles.read_byte(handle)[source]

Helper function to parse byte from file handle.

Parameters:
  • handle (file handle) – File handle

Returns:

byte value read from handle

cgr_gwas_qc.parsers.illumina.IlluminaBeadArrayFiles.read_char(handle)[source]

Helper function to parse character from file handle.

Parameters:
  • handle (file handle) – File handle

Returns:

char value read from handle

cgr_gwas_qc.parsers.illumina.IlluminaBeadArrayFiles.read_float(handle)[source]

Helper function to parse float from file handle.

Parameters:
  • handle (file handle) – File handle

Returns:

numpy.float32 value read from handle

cgr_gwas_qc.parsers.illumina.IlluminaBeadArrayFiles.read_int(handle)[source]

Helper function to parse int from file handle.

Parameters:
  • handle (file handle) – File handle

Returns:

numpy.int32 value read from handle

cgr_gwas_qc.parsers.illumina.IlluminaBeadArrayFiles.read_scanner_data(handle)[source]

Helper function to parse ScannerData object from file handle.

Parameters:
  • handle (file handle) – File handle

Returns:

ScannerData value read from handle

cgr_gwas_qc.parsers.illumina.IlluminaBeadArrayFiles.read_string(handle)[source]

Helper function to parse string from file handle. See https://msdn.microsoft.com/en-us/library/yzxa6408(v=vs.100).aspx for additional details on string format.

Parameters:
  • handle (file handle) – File handle

Returns:

string value read from handle
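The linked page describes the .NET BinaryWriter string format: a length prefix stored as a 7-bit encoded integer, followed by that many bytes of text. A minimal reader for that format might look like the sketch below. The function names are hypothetical and UTF-8 decoding is an assumption. This illustrates the format and is not the library's own code.

```python
import io
from typing import BinaryIO


def read_7bit_length(handle: BinaryIO) -> int:
    """Decode a 7-bit encoded integer: low 7 bits are data, high bit means 'more'."""
    result = 0
    shift = 0
    while True:
        byte = handle.read(1)
        if not byte:
            raise EOFError("unexpected end of data in length prefix")
        value = byte[0]
        result |= (value & 0x7F) << shift
        if (value & 0x80) == 0:
            return result
        shift += 7


def read_string_sketch(handle: BinaryIO) -> str:
    """Read a length-prefixed string (UTF-8 assumed)."""
    length = read_7bit_length(handle)
    return handle.read(length).decode("utf-8")


short = read_string_sketch(io.BytesIO(b"\x05hello"))
# A 200-byte payload needs a two-byte prefix: 0xC8 0x01.
long_text = read_string_sketch(io.BytesIO(b"\xc8\x01" + b"a" * 200))
```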

cgr_gwas_qc.parsers.illumina.IlluminaBeadArrayFiles.read_ushort(handle)[source]

Helper function to parse ushort from file handle.

Parameters:
  • handle (file handle) – File handle

Returns:

numpy.int16 value read from handle
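For orientation, the fixed-width helpers above can be approximated with the standard library's struct module. The sketch assumes little-endian byte order and returns plain Python numbers rather than the numpy scalar types listed above. The `*_sketch` names are hypothetical.

```python
import io
import struct
from typing import BinaryIO


def read_byte_sketch(handle: BinaryIO) -> int:
    return struct.unpack("<B", handle.read(1))[0]  # unsigned 8-bit


def read_ushort_sketch(handle: BinaryIO) -> int:
    return struct.unpack("<H", handle.read(2))[0]  # unsigned 16-bit


def read_int_sketch(handle: BinaryIO) -> int:
    return struct.unpack("<i", handle.read(4))[0]  # signed 32-bit


def read_float_sketch(handle: BinaryIO) -> float:
    return struct.unpack("<f", handle.read(4))[0]  # 32-bit float


# Build a small binary payload, then read the values back in the same order.
payload = struct.pack("<iHfB", 8, 65535, 0.5, 7)
stream = io.BytesIO(payload)
decoded = (
    read_int_sketch(stream),
    read_ushort_sketch(stream),
    read_float_sketch(stream),
    read_byte_sketch(stream),
)
```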