Example session with the HardyWeinberg package

Jan Graffelman - Dpt. of Statistics, Universitat Politecnica de Catalunya

2024-04-05

Introduction

This document gives a description of the functionality of the HardyWeinberg package (Graffelman (2015)); it essentially reproduces and extends the “example session” described in the aformentioned article, which is accessible via the vignette of the package. References for the statistical procedures that are used can be found in the vignette and in the help files of the functions of the package.

We subsequently address:

  1. Installation

  2. A single biallelic marker

    2.1. Genotype and allele counts

    2.2. Testing for Hardy-Weinberg equilibrium

    2.2.1. Autosomal markers

    2.2.2. X-chromosomal markers

    2.3. Special topics

    2.3.1. Missing genotype data

    2.3.2. Power computation

    2.3.3. Population substructure

    2.3.4. Scenario testing under stratification by gender

  3. Sets of biallelic markers

    3.1. Genotype count patterns

    3.2. Testing HWE with many variants

    3.3. Visualising the HWE test results

    3.3.1. Ternary diagrams

    3.3.2. QQ plots

    3.4. Simulating biallelic marker data

  4. Multiallelic markers

    4.1. Triallelic variants

    4.2. Microsatellites (STRs)

1. Installation

The package is installed with the instructions

#install.packages("HardyWeinberg")
library(HardyWeinberg)

and its vignette can be consulted with vignette("HardyWeinberg").

2. A single biallelic marker

2.1. Genotype and allele counts

We use a sample of 1000 individuals genotyped for the MN blood group locus as an example. We store the genotype counts (298, 489 and 213 for MM, MN and NN respectively) in a vector x with named elements:

x <- c(MM = 298, MN = 489, NN = 213)

Most functions of the HardyWeinberg package will accept the counts in any order, as long as they are labelled with two letters. If counts are not labelled, the default order (AA,AB,BB) is assumed. The allele frequency of the M allele can be calculated using the given order of the genotypes with af(x), though we usually calculate the minor allele frequency (MAF) with

maf(x)
#>      N 
#> 0.4575

The function maf can also be used, with its option argument, to extract all sorted allele counts and their relative frequencies.

maf(x,2)
#>      N      M 
#> 0.4575 0.5425
maf(x,3)
#>    N    M 
#>  915 1085

Genotype frequencies can be conveniently visualised in a ternary diagram (in genetics also known as a de Finetti diagram) with function HWTernaryPlot:

HWTernaryPlot(x,region=0,hwcurve=FALSE,grid=TRUE,markercol="blue")

2.2. Testing for Hardy-Weinberg equilibrium

We conduct several classical tests for Hardy-Weinberg equilibrium, addressing both autosomal and X chromosomal tests.

2.2.1. Autosomal markers

We use the MN blood group locus to illustrate autosomal procedures.

Chi-square test

The classical chi-square test can be carried out using HWChisq:

HW.test <- HWChisq(x, verbose = TRUE)
#> Chi-square test with continuity correction for Hardy-Weinberg equilibrium (autosomal)
#> Chi2 =  0.1789563 DF =  1 p-value =  0.6722717 D =  -3.69375 f =  0.01488253

This shows that the chi-square statistic has value 0.1790, and that the corresponding p-value for the test is 0.6723. Using a significance threshold of five percent, we do not reject HWE for the MN locus. When verbose is set to FALSE (default) the test is silent, and HW.test is a list object containing the results of the test (chi-square statistic, the p-value of the test, half the deviation from HWE (D) for the heterozygote (\(D = \frac{1}{2} (f_{AB} - e_{AB}\))), the minor allele frequency (p), the inbreeding coefficient f and the expected counts under the equilibrium assumption). By default, HWChisq applies a continuity correction. This is not recommended for low minor allele frequencies. In order to perform a chi-square test without Yates’ continuity correction, it is necessary to set the cc parameter to zero:

HW.test <- HWChisq(x, cc = 0, verbose = TRUE)
#> Chi-square test for Hardy-Weinberg equilibrium (autosomal)
#> Chi2 =  0.2214896 DF =  1 p-value =  0.6379073 D =  -3.69375 f =  0.01488253

The test with correction gives a smaller \(\chi^2\)-statistic and a larger p-value in comparison with the ordinary \(\chi^2\) test.

LRT test

The likelihood ratio test (LRT) for HWE can be performed with HWLratio:

HW.lrtest <- HWLratio(x, verbose = TRUE)
#> Likelihood ratio test for Hardy-Weinberg equilibrium
#> G2 = 0.2214663 DF = 1 p-value = 0.637925

Note that the \(G^2\)-statistic and the p-value obtained are very close to the chi-square statistic and its p-value.

Exact test

An exact test for HWE can be performed by using routine HWExact:

HW.exacttest <- HWExact(x, verbose = TRUE, pvaluetype = "midp")
#> Haldane Exact test for Hardy-Weinberg equilibrium (autosomal)
#> using MID p-value
#> sample counts: nMM =  298 nMN =  489 nNN =  213 
#> H0: HWE (D==0), H1: D <> 0 
#> D =  -3.69375 p-value =  0.6330965

The exact test leads to the same conclusion, we do not reject HWE (mid p-value = 0.6331). Both one-sided and two-sided exact tests are possible by using the argument alternative, which can be set to "two.sided", "greater", or "less". Three different ways of computing the p-value of an exact test are implemented, and can be specified by the pvaluetype argument, which can be set to standard or selome (sum equally likely or more extreme) p-value, the dost (double one-sided tail probability) p-value, or the better powered midp p-value. By default HWExact calculates a standard exact p-value, though the mid p-value is recommended for having better statistical properties. The exact test is based on a recursive algorithm. For very large samples, R may give an error message “evaluation nested too deeply: infinite recursion”. This can usually be resolved by increasing R’s limit on the number of nested expressions with options(expressions = 10000) prior to calling HWExact. See ?HWExact for more information on this issue.

Permutation test

The permutation test for HWE can be run by using HWPerm:

set.seed(123)
#HW.permutationtest <- HWPerm(x, verbose = TRUE)
#Permutation test for Hardy-Weinberg equilibrium
#Observed statistic: 0.2214896   17000 permutations. p-value: 0.6508235 

and the number of permutations can be specified via the nperm argument. By default the chi-square statistic will be used as the test statistic, but alternative statistics may be supplied by the FUN argument.

If, for some reason, the equilibrium status of a particular marker is at stake, all frequentist tests can be run to see to what extent they do agree or disagree. Function HWAlltests performs all tests with one call and returns a table of all p-values.

#HW.results <- HWAlltests(x, verbose = TRUE, include.permutation.test = TRUE)
#                                            Statistic   p-value
#Chi-square test:                            0.2214896 0.6379073
#Chi-square test with continuity correction: 0.1789563 0.6722717
#Likelihood-ratio test:                      0.2214663 0.6379250
#Exact test with selome p-value:                    NA 0.6556635
#Exact test with dost p-value:                      NA 0.6723356
#Exact test with mid p-value:                       NA 0.6330965
#Permutation test:                           0.2214896 0.6508235

The MN data concern a large sample (\(n =\) 1000) with an intermediate allele frequency (p = 0.4575), for which all test results closely agree. For smaller samples and more extreme allele frequencies, larger differences between the tests are typically observed.

Bayesian procedure

A Bayesian procedure for HWE consists of the calculation of the posterior distribution for Lindley’s disequilibrium parameter \(\alpha\). We calculate and plot this density, representing the equilibrium situation \(\alpha = 0\) by vertical red line.

post.dens <- HWLindley(seq(-1,1,by=0.01),x)
plot(seq(-1,1,by=0.01),post.dens,type="l",xlab=expression(alpha),
     ylab=expression(pi(alpha)))
segments(0,0,0,HWLindley(0,x),lty="dotted",col="red")

Alternatively, we calculate a Bayesian 95% credible interval with

HWLindley.cri(x=x)
#> Lindley's 95% credible interval
#> HWE parameter alpha:  (-0.0959,0.1549)

The results of the Bayesian analysis neither produce evidence against the equilibrium hypothesis.

2.2.2. X-chromosomal markers

We perform HWE tests for X-chromosomal markers, using a vector of five elements containing both male and female genotype counts. Hemizygous male genotype counts should be labelled with a single letter, diploid female counts with two. We consider the X-chromosomal single nucleotide polymorphism (SNP):

SNP1 <- c(A=399,B=205,AA=230,AB=314,BB=107) 

Chi-square test

HWChisq(SNP1,cc=0,x.linked=TRUE,verbose=TRUE)
#> Chi-square test for Hardy-Weinberg equilibrium (X-chromosomal)
#> Chi2 =  7.624175 DF = 2 p-value =  0.022102 D =  NA f =  -0.0003817242

When males are excluded from the test we get:

HWChisq(SNP1[3:5],cc=0)
#> Chi-square test for Hardy-Weinberg equilibrium (autosomal)
#> Chi2 =  9.485941e-05 DF =  1 p-value =  0.9922291 D =  0.05990783 f =  -0.0003817242

Note that the test including males is significant, whereas the test excluding males is not.

Likelihood ratio test

The X-chromosomal likelihood ratio test gives:

HWLratio(SNP1,x.linked=TRUE)
#> Likelihood ratio test for Hardy-Weinberg equilibrium for an X-linked marker
#> G2 = 7.693436 DF = 2 p-value = 0.02134969

and is again very similar to the chi-square test.

Exact test

The exact test for HWE for an X-chromosomal marker can be performed by adding the x.linked=TRUE option:

HWExact(SNP1,x.linked=TRUE)
#> Graffelman-Weir exact test for Hardy-Weinberg equilibrium on the X-chromosome
#> using SELOME p-value
#> Sample probability 5.682963e-05 p-value =  0.02085798

which gives a p-value similar to the \(\chi^2\) test. When the mid p-value is used we obtain

HWExact(SNP1,x.linked=TRUE,pvaluetype="midp")
#> Graffelman-Weir exact test for Hardy-Weinberg equilibrium on the X-chromosome
#> using MID p-value
#> Sample probability 5.682963e-05 p-value =  0.02082957

These exact tests show that the joint null of Hardy-Weinberg proportions and equality of allele frequencies has to be rejected. An exact test using the females only gives again a non-significant result:

HWExact(SNP1[3:5],pvaluetype="midp")
#> Haldane Exact test for Hardy-Weinberg equilibrium (autosomal)
#> using MID p-value
#> sample counts: nAA =  230 nAB =  314 nBB =  107 
#> H0: HWE (D==0), H1: D <> 0 
#> D =  0.05990783 p-value =  0.9676001

Permutation test

The permutation test for X-linked markers gives

#HWPerm(SNP1,x.linked=TRUE)
#Permutation test for Hardy-Weinberg equilibrium of an X-linked marker
#Observed statistic: 7.624175   17000 permutations. p-value: 0.02211765 

A summary of all frequentist X-chromosomal tests is obtained by

#HWAlltests(SNP1,x.linked=TRUE,include.permutation.test=TRUE)
#                                            Statistic    p-value
#Chi-square test:                             7.624175 0.02210200
#Chi-square test with continuity correction:  7.242011 0.02675576
#Likelihood-ratio test:                       7.693436 0.02134969
#Exact test with selome p-value:                    NA 0.02085798
#Exact test with dost p-value:                      NA         NA
#Exact test with mid p-value:                       NA 0.02082957
#Permutation test:                            7.624175 0.02211765

Results of all tests are similar. Finally, we test equality of allele frequencies in males and females with an exact test:

AFtest(SNP1)
#> Fisher Exact test for equality of allele frequencies for males and females.
#> 
#> Table of allele counts:
#>     A   B
#> M 399 205
#> F 774 528
#> 
#> Sample of 1255 indivduals with 1906 alleles. p-value = 0.006268363

For this SNP, there is a significant difference in allele frequency between males and females.

Bayesian procedure

A Bayesian test for HWE for variants on the X-chromosome is implemented in the function HWPosterior. A Bayesian analysis of the same SNP is obtained by:

HWPosterior(males=SNP1[1:2],females=SNP1[3:5],x.linked=TRUE)
#> Bayesian test for Hardy-Weinberg equilibrium of X-chromosomal variants.
#> 
#>                   Posterior_Prob log10(Bayes Factor)
#> M0 (HWE):                 0.3384              0.1859
#> M1 (f!=0):                0.0138             -1.3774
#> M2 (d!=1):                0.6222              0.6939
#> M3 (f!=0 & d!=1:)         0.0256             -1.1035

and shows that a model with Hardy-Weinberg proportions for females and different allele frequencies for both sexes has the largest posterior probability, and the largest Bayes factor, which is in accordance with the previous frequentist procedures.

2.3. Special topics

2.3.1. Missing genotype data

We indicate how to test for HWE when there is missing genotype data. We use the data set Markers for that purpose.

data(Markers)
Markers[1:12,]
#>    SNP1   iG   iT SNP2 SNP3
#> 1    TT  641 1037   AA   GG
#> 2    GT 1207  957   AC   AG
#> 3    TT 1058 1686   AA   GG
#> 4    GG 1348  466   CC   AA
#> 5    GT 1176  948   AC   AG
#> 6    GG 1906  912   CC   AA
#> 7    GG 1844  705   CC   AA
#> 8    GG 2007  599   CC   AA
#> 9    GT 1369 1018   AC   AG
#> 10   GG 1936  953   CC   AA
#> 11   GG 1952  632   AC   AG
#> 12 <NA>  947  920   AC   AG

Note that this data is at the level of each individual. Dataframe Markers contains one SNP with missings (SNP1), the two allele intensities of that SNP (iG and iT) and two covariate markers (SNP2 and SNP3). Here, the covariates have no missing values. We first test SNP1 for HWE using a chi-square test and ignoring the missing genotypes:

Xt <- table(Markers[,1])
Xv <- as.vector(Xt)
names(Xv) <- names(Xt)
Xv
#> GG GT TT 
#> 46 32 20
HW.test <- HWChisq(Xv,cc=0,verbose=TRUE)
#> Chi-square test for Hardy-Weinberg equilibrium (autosomal)
#> Chi2 =  8.67309 DF =  1 p-value =  0.003229431 D =  -6.77551 f =  0.297491

This gives a significant result (p-value = 0.0032). If the data can be assumed to be missing completely at random (MCAR), then we may impute missings by randomly sampling the observed data. This can be done by supplying the method = "sample" argument, and we create 50 imputed data sets (m = 50).

set.seed(123)
Results <- HWMissing(Markers[,1], m = 50, method = "sample", verbose=TRUE)
#> Test for Hardy-Weinberg equilibrium in the presence of missing values
#> Inbreeding coefficient f =  0.2936 
#> 95 % Confidence interval ( 0.1058 , 0.4813 )
#> p-value =  0.0022 
#> Relative increase in variance of f due to missings: r =  0.3351 
#> Fraction of missing information about f: lambda =  0.2529

As could be expected, the conclusion is the same: there is significant deviation from HWE (p-value = 0.0022). It will make more sense to take advantage of variables that are correlated with SNP1, and use multiple imputation of the missings of SNP1 using a multinomial logit model. The multinomial logit model will be used when we set method = "polyreg" or leave the method argument out, since "polyreg" is the default for imputation of factor variables by means of a multinomial logit model used by R package mice. We test SNP1 (with missings) for HWE, using a multinomial logit model to impute SNP1 using information from the allele intensities iG and iT and the neighbouring markers SNP2 and SNP3.

set.seed(123)
Results <- HWMissing(Markers[, 1:5], m = 50, verbose = TRUE)
#> Warning: Number of logged events: 1
#> Test for Hardy-Weinberg equilibrium in the presence of missing values
#> Inbreeding coefficient f =  0.0608 
#> 95 % Confidence interval ( -0.1061 , 0.2278 )
#> p-value =  0.4751 
#> Relative increase in variance of f due to missings: r =  0.0596 
#> Fraction of missing information about f: lambda =  0.0564

Note the sharp drop of the inbreeding coefficient, and the missing data statistics \(\lambda\) and \(r\). The test is now not significant (p-value = 0.4751). Exact inference for HWE with missings is possible by setting the argument statistic="exact". This gives the result

set.seed(123)
Results <- HWMissing(Markers[, 1:5], m = 50, statistic = "exact", verbose = TRUE)
#> Warning: Number of logged events: 1
#> Two-sided Exact test for Hardy-Weinberg equilibrium in the presence of missing values
#>  p-value =  0.4426941

and a similar p-value is obtained.

2.3.2. Power computation

Tests for HWE have low power for small samples with a low minor allele frequency, or samples that deviate only moderately from HWE. It is therefore important to be able to compute power. The function HWPower can be used to compute the power of a test for HWE. If its disequilibrium argument theta (the effect size), is set to 4 (the default value), then the function computes the Type I error rate for the test; the disequilibrium parameter \(\theta\) is defined as \(f_{AB}^2/(f_{AA} \cdot f_{BB})\).

Function mac is used to compute the minor allele count. We compute the power for the data on the MN locus:

x <- c(MM = 298, MN = 489, NN = 213)
n <- sum(x)
nM <- mac(x) 
pw4 <- HWPower(n, nM, alpha = 0.05, test = "exact", theta = 4, 
               pvaluetype = "selome")
print(pw4)
#> [1] 0.04822774
pw8 <- HWPower(n, nM, alpha = 0.05, test = "exact", theta = 8, 
               pvaluetype = "selome")
print(pw8)
#> [1] 0.9996853

These computations show that for a large sample like this one, the Type I error rate (0.0482) is very close to the nominal rate, 0.05, and that the standard exact test has good power (0.9997) for detecting deviations as large \(\theta=8\), which is a doubling of the number of heterozygotes with respect to HWE. Type I error rate and power for the chi-square test can be calculated by setting test="chisq". With the allele frequency of this sample (0.5425), \(\theta=8\) amounts to an inbreeding coefficient of -0.1698

2.3.3. Population substructure

When a sample consists of individuals stemming from different source populations characterised by different allele frequencies, tests for HWE often end up significant, indicating in fact that the assumption of a homogeneous allele frequency is not satisfied. It is recommended to test for HWE in a stratified manner for each source population, if the provenance of the individuals is known. Asymptotic procedures to test HWE across multiple samples genotyped for the same biallelic marker have been developed and are available in function HWStrata.

Testing across strata

We use the Glyoxalasa polymorphism to illustrate a test across strata.

data("Glyoxalase")
Glyoxalase <- as.matrix(Glyoxalase)
HWStrata(Glyoxalase)
#> Olson's asymptotic test for HWE across strata for a single biallelic marker.
#> T2 =  0.003521695 p-value =  0.9526782

This test does not reject HWE for the entire sample. When stratified testing is applied using HWExactStats, a significant deviation is found for one population.

pvalues <- HWExactStats(Glyoxalase)
pvalues
#>  [1] 0.7000832517 0.3832088377 0.9001547536 0.8312315411 0.5772651947
#>  [6] 0.4994510938 0.7643629563 1.0000000000 0.7137920971 0.1807312880
#> [11] 0.5700754196 0.7563473304 0.6802971188 0.1993351995 0.0006174331
#> [16] 1.0000000000 1.0000000000

Contributions in mixed samples

Whenever a sample is known to consist of individuals of two (or more) subpopulations, the relative contributions of the subpopulations to the mixed sample can be estimated by the EM algorithm, if the genotype or allele frequencies of the source populations are known. This is implemented for a biallelic SNP with individuals from two subpopulations in function HWEM.

We estimate the contributions of two known populations to a sample of genotype frequencies that is a mix of these two populations. The genotype frequencies of the two source populations are stored in vectors g1 and g2, and the mixed sample in x.

g1 <- c(AA=0.034, AB=0.330, BB=0.636)
g2 <- c(AA=0.349, AB=0.493, BB=0.158)
x  <- c(AA=0.270, AB=0.453, BB=0.277)

Their contributions are estimated using HWEM

G <- cbind(g1,g2)
contributions <- HWEM(x,G=G)
contributions
#>    Group1    Group2 
#> 0.2511847 0.7488153

About 25 percent is estimated to stem for the first population, and 75 percent from the second. If only the allele frequencies of the source populations are known, contributions can be estimated by supplying the allele frequencies and assuming Hardy-Weinberg proportions:

p <- c(af(g1),af(g2))
contributions <- HWEM(x,p=p)
contributions
#>    Group1    Group2 
#> 0.2456682 0.7543318

2.3.4 Scenario testing under stratification by gender

Autosomal tests for HWE assume equality of allele frequencies in the sexes. When sex is taken into account, several scenarios are possible, according to whether males of females genotypes satisfy the assumptions of equal allele frequencies and HWE or not. The function HWPosterior can be used to perform Bayesian model selection using the posterior probability of each scenario. We consider an example using a SNP of the JPT sample taken from the 1000G project, using their male and female genotype counts.

data(JPTsnps)
JPTsnps[1,]
#> AA AB BB AA AB BB 
#> 46 10  0 40  8  0
Results <- HWPosterior(males=JPTsnps[1,1:3],
                       females=JPTsnps[1,4:6],
                       x.linked=FALSE,precision=0.05)
#>   M_11   M_12   M_13   M_14   M_15   M_21   M_22   M_23   M_24   M_25 
#> 0.5571 0.0463 0.0349 0.2384 0.0075 0.0620 0.0211 0.0226 0.0025 0.0077 
#> Best fitting M_11 0.5570735

The results show that for this variant, equality of allele frequencies in the sexes and HWP for both sexes (model \(M_{11}\)) is the model with the largest probability. For more accurate results, higher precision of posterior probabilities can be obtained by specifying precision=0.005, at the expense of increasing the computation time.

Alternatively, the same variant can be analysed by calculating Akaike’s information criterion (AIC) for each scenario. This is achieved by

data(JPTsnps)
AICs <- HWAIC(JPTsnps[1,1:3],JPTsnps[1,4:6])
#> Best fitting M_11 99.54001
AICs
#>      M_11      M_12      M_13      M_14      M_15      M_21      M_22      M_23 
#>  99.54001 100.81297 100.55911  99.83219 101.83219 101.51680 102.78852 102.53483 
#>      M_24      M_25 
#> 101.83219 103.80656

In this case, the AIC criterion identifies the same \(M_{11}\) model as the best fitting model.

3. Sets of biallelic markers

3.1. Genotype count patterns

We use the data set CEUchr22 to illustrate biallelic procedures with multiple SNPs. The dataset CEUchr22 contains 10,000 SNPs sampled at random from chromosome 22 of the CEU population from the 1000 genomes project.

data("CEUchr22")
CEUchr22[1:5,1:5]
#>         SNP1 SNP2 SNP3 SNP4 SNP5
#> NA06984    0    0    0    0    0
#> NA06985    0    0    0    0    0
#> NA06986    0    0    0    0    0
#> NA06989    0    0    0    0    0
#> NA06994    0    0    0    0    0

This data is in (0,1,2) format and we first construct a matrix with the three genotype counts for each SNP.

Z <- MakeCounts(CEUchr22)
Z <- Z[,1:3]
head(Z)
#>      AA AB BB
#> SNP1 99  0  0
#> SNP2 99  0  0
#> SNP3 99  0  0
#> SNP4 99  0  0
#> SNP5 99  0  0
#> SNP6 99  0  0

We use a ternary diagram with a colour ramp for the frequency of the genotype count patterns.

HWTernaryPlot(Z[,1:3],patternramp=TRUE,region=0)

This shows that SNPs that are monomorphic for the reference allele (A) make up about 82% of the database. We redo this plot filtering out SNPs with a minor allele frequency (MAF) below 5%:

pminor <- maf(Z)
HWTernaryPlot(Z[pminor>0.05,],patternramp=TRUE,region=0)

This reveals that SNPs with a zero count for the BB homozygote and a varying low count for AB are relatively more common. We study the distribution of the MAF with a histogram, excluding the monomorphics:

hist(pminor[pminor > 0],freq=FALSE,xlab="MAF",main="CEU MAF CHR 22")

This shows the typical pattern of more frequent low MAF polymorphisms. We can also use function maf to extract the minor allele count for each SNP, and represent these in a barplot:

cminor <- maf(Z[pminor > 0,],option=3)
barplot(table(cminor[,1]),cex.names = 0.75)

This shows, as expected, that SNPs with just one, two or three copies of the minor allele are most common.

3.2. Testing HWE with many variants

The aforementioned functions HWChisq, HWLratio, HWExact, HWPerm all test a single biallelic marker for HWE. If the genotype counts AA, AB, BB are collected in a three-column matrix, with each row representing a marker then large sets of markers can be tested most efficiently with the functions HWChisqStats for the chi-square test, and with HWExactStats for the exact tests. These routines return the p-values or test statistics for each marker. These functions have fewer options but are computationally better optimized. Both functions allow for X-linked markers via the x.linked argument. Exact tests that rely on exhaustive enumeration are slow in R, and HWExactStats uses by default faster C++ code generously shared by Christopher Chang. The same C++ code is used in the current version (2.0) of the PLINK software (https://www.cog-genomics.org/plink/2.0/). We apply these functions to the previously obtained genotype counts of the CEUChr22 data, using only polymorphic SNPs:

Zpoly <- Z[!is.mono(Z),]
npoly <- nrow(Zpoly)
chisq.pvalues <- HWChisqStats(Zpoly,pvalues=TRUE)
exact.pvalues <- HWExactStats(Zpoly,midp=TRUE)
bonferronithreshold <- 0.05/npoly
sum(chisq.pvalues < bonferronithreshold)
#> [1] 24
sum(exact.pvalues < bonferronithreshold)
#> [1] 9

If a (conservative) Bonferroni correction is applied, some highly significant SNPs are detected. The exact test detects fewer, as it is more conservative. A set of low MAF SNPs is significant in a chi-square test, but not in the exact test.

3.3. Visualising HWE test results

3.3.1. Ternary diagrams

Genetic association studies, genome-wide association studies in particular, use many genetic markers. In this context graphics such as ternary plots, log-ratio plots and QQ plots become particularly useful, because they can reveal whether HWE is a reasonable assumption for the whole data set. We begin to explore the CEUChr22 SNPs by making a ternary plot.

Zu <- UniqueGenotypeCounts(Z)
#> 10000 rows in X
#> 549 unique rows in X
Zu <- Zu[,1:3]
HWTernaryPlot(Zu,region=1,pch=1)

The curves around the Hardy-Weinberg parabola delimit the acceptance region for a chi-square test for HWE, with a default significance threshold of five percent. A number of SNPs is seen to have significant deviation from HWP, either for having an excess or a lack of heterozygotes. One SNP is seen to consist almost entirely of heterozygotes. The acceptance region of the exact test can be shown by setting region=7.

HWTernaryPlot(Zu[,1:3],region=7,pch=1)

The exact test acceptance region is wiggly due to the discrete nature of the exact test. This region shows fewer significant markers, notably towards the homozygote vertices, and illustrates that the exact test is more conservative than the chi-square test.

For large databases of SNPs, drawing the ternary plot can be time consuming. Usually the matrix with genotype counts contains several rows with the same counts. The ternary plot can be constructed faster by plotting only the unique rows of the count matrix. Function UniqueGenotypeCounts, illustrated above, extracts the unique rows of the count matrix and also counts their frequency.

3.3.2. QQ plots

When many statistical tests are performed, the distribution of the obtained p-values is of interest. For tests based on continuous test statistics, under the null hypothesis the distribution of the p-values is expected to be uniform. We exclude monomorphic variants using the logical function is.mono. We first use qqunif to compare the chi-square p-values of the HWE test against a uniform distribution (panel A). We next simulate markers under HWE with HWData, matching the simulated markers in sample size and allele frequency distribution to the observed data. This is achieved by setting argument p of HWData equal to the allele frequencies of the observed data, where the latter are computed with function af. The corresponding QQ-plot of the simulated p-values is shown in panel B. For the empirical data, the observed p-value distribution strongly deviates from the uniform distribution, as well as from the expected pattern under HWE.

Given that genotype data is discrete, often with low counts, the null distribution of the p-values is in fact not uniform. We therefore build a new QQ plot of exact p-values against p-values that are obtained from sampling the true null a few times (five times in the code below) using HWQqplot, doing this both for the observed data (panel C) and for data sampled from the Levene-Haldane equilibrium distribution, conditional on the observed minor allele counts (panel D). The pattern for the empirical data differs strongly form the expected p-value distribution as shown in panel D. There are many more significant markers than expected under the HWE assumption.

data("CEUchr22")

Z <- MakeCounts(CEUchr22)
Z <- Z[,1:3]
Z <- Z[!is.mono(Z),]

alfreq  <- af(Z)
alcount <- maf(Z,option=3)[,1]

chisq.pvals <- HWChisqStats(Z,pvalues=TRUE)

set.seed(123)
Z.sim.chi <- HWData(nm=nrow(Z),n=99,p=alfreq)
Z.sim.chi <- Z.sim.chi[!is.mono(Z.sim.chi),]
chisq.pvals.sim <- HWChisqStats(Z.sim.chi,pvalues=TRUE)

set.seed(123)
Z.sim.exa <- HWData(nm=nrow(Z),n=99,nA=alcount,conditional = TRUE)
Z.sim.exa <- Z.sim.exa[!is.mono(Z.sim.exa),]

opar <- par(mfrow=c(2,2),mar=c(3,3,2,0)+0.5,mgp=c(2,1,0))
par(mfg=c(1,1))
 qqunif(chisq.pvals,logplot = TRUE,
       plotline = 1,main="A: observed QQ uniform")
par(mfg=c(1,2))
 qqunif(chisq.pvals.sim,logplot = TRUE,xylim=15,
        main="B: simulated QQ uniform")
par(mfg=c(2,1))
 set.seed(123)
 HWQqplot(Z,nsim=5,logplot=TRUE,main="C: observed QQ HWE null")
par(mfg=c(2,2))
 set.seed(123)
 HWQqplot(Z.sim.exa,nsim=5,logplot=TRUE,main="D: simulated QQ HWE null")

par(opar)

3.4. Simulating biallelic marker data

The function HWData allows for the simulation of genetic markers under equilibrium and disequilibrium conditions. This enables us to create simulated data sets that match the observed data set in sample size and allele frequency distribution, as shown in the previous section. The comparison of graphics and statistics for observed and simulated datasets is helpful when assessing the extent of HWE for a large set of markers. We simulate \(m=100\) markers for \(n=100\) individuals by taking random samples from a multinomial distribution with \(\theta_{AA} = p^2, \hspace{1mm} \theta_{AB} = 2pq, \hspace{1mm}\) and \(\theta_{BB} = q^2\). This is done by routine HWData, which can generate data sets that stem from a population that is either in or out of Hardy-Weinberg equilibrium. Routine HWData can generate data that are in exact equilibrium (exactequilibrium = TRUE) or that are generated from a multinomial distribution (default). The markers generated by HWData are independent (there is no linkage disequilibrium). HWData returns a matrix of genotype counts, which are converted to genotypic compositions (i.e., the relative genotype frequencies) if argument counts is set to FALSE. Routine HWData can simulate genotype counts under several conditions. A fixed allele frequency can be specified by setting p to a scalar or vector with the desired allele frequencies and specifying conditional=TRUE. Sampling is then according to Levene-Haldane’s exact distribution. If conditional is FALSE, the given vector p of allele frequencies will be used in sampling from the multinomial distribution. If p is not specified, p will be drawn from a uniform distribution, and genotypes are drawn from a multinomial distribution with probabilities \(p^2, 2pq\) and \(q^2\) for AA, AB and BB respectively. It is also possible to generate data under disequilibrium, by specifying a vector of inbreeding coefficients f. The use of HWData is illustrated below by simulating several data sets. Each simulated dataset is plotted in a ternary diagram below in order to show the effect of the different simulation options. We subsequently simulate 100 markers under HWE with allele frequency 0.5 (X1), 100 markers under HWE with a random uniform allele frequency (X2), 100 markers under inbreeding (\(f = 0.5\)) with allele frequency 0.5 (X3), 100 markers under inbreeding (\(f=0.5\)) with a random uniform allele frequency (X4), 100 markers with fixed allele frequencies of 0.2, 0.4, 0.6 and 0.8 (25 each, X5) and 100 markers in exact equilibrium with a random uniform allele frequency (X6).

set.seed(123)
n <- 100
m <- 100
X1 <- HWData(m, n, p = rep(0.5, m))
X2 <- HWData(m, n)
X3 <- HWData(m, n, p = rep(0.5, m), f = rep(0.5, m))
X4 <- HWData(m, n, f = rep(0.5, m))
X5 <- HWData(m, n, p = rep(c(0.2, 0.4, 0.6, 0.8), 25), conditional = TRUE)
X6 <- HWData(m, n, exactequilibrium = TRUE)
opar <- par(mfrow = c(3, 2),mar = c(1, 0, 3, 0) + 0.1)
par(mfg = c(1, 1))
HWTernaryPlot(X1, main = "(a)")
par(mfg = c(1, 2))
HWTernaryPlot(X2, main = "(b)")
par(mfg = c(2, 1))
HWTernaryPlot(X3, main = "(c)")
par(mfg = c(2, 2))
HWTernaryPlot(X4, main = "(d)")
par(mfg = c(3, 1))
HWTernaryPlot(X5, main = "(e)")
par(mfg = c(3, 2))
HWTernaryPlot(X6, main = "(f)")

par(opar)

In practice, SNPs mostly have a skewed distribution of the minor allele frequency (MAF), such that low MAF variants are much more common. SNPs with such an allele frequency distribution can be simulated with HWData by setting the parameters of the beta distribution shape1 and shape2 to 1 and 10 respectively:

X <- HWData(nm=100,shape1=1,shape2=10)

4. Multiallelic markers

The HardyWeinberg package has incorporated functions for dealing with multiallelic markers, geared towards the analysis of microsatellite or Short Tandem Repeat (STR) datasets. Special functions for triallelic and multiallelic variants are illustrated below.

4.1. Triallelic variants

The ABO locus

The classical three-allelic ABO locus can be tested for equilibrium with an iterative algorithm implemented in HWABO, as shown for the sample (A=182,B=60,AB=17,OO=176) below.

x <- c(fA=182,fB=60,nAB=17,nOO=176)
al.fre <- HWABO(x)
#> Iteration history:
#>          pA         pB        pO          ll
#> 0 0.3333333 0.33333333 0.3333333 -194.706389
#> 1 0.2984674 0.11149425 0.5900383  -13.488684
#> 2 0.2709650 0.09445916 0.6345758   -9.196185
#> 3 0.2655411 0.09328308 0.6411759   -9.099231
#> 4 0.2646231 0.09318236 0.6421945   -9.096756
#> 5 0.2644732 0.09317075 0.6423560   -9.096691
#> 6 0.2644490 0.09316911 0.6423819   -9.096690
#>               fA       fB     nAB     nOO
#> Observed 182.000 60.00000 17.0000 176.000
#> Expected 178.212 55.84582 21.4351 179.507
#> X2 =  1.375706 p-value =  0.2408339

Allele frequencies, initially set to being equally frequent, converge in six iterations to their final values. A Chi-square test with one degree of freedom indicates equilibrium can not be rejected.

Triallelic variants

A general triallelic locus can be tested for equilibrium with an exact test by HWTriExact as shown below, supplying the genotype counts as a six element named vector.

x <- c(AA=20,AB=31,AC=26,BB=15,BC=12,CC=0)
#results <- HWTriExact(x)
#Tri-allelic Exact test for HWE (autosomal).
#Allele counts: A = 38 B = 73 C = 97 
#sum probabilities all outcomes 1 
#probability of the sample 0.0001122091 
#p-value =  0.03370688 

The output gives the probability of the observed sample, and the exact test p-value. For this example, the null hypothesis of equilibrium proportions is rejected at a significance level of five percent. HWTriExact uses a complete enumeration algorithm programmed in R, which can be slow, depending on the genotype counts of the particular sample. A faster analysis for triallelics is to use a network algorithm. For the data at hand, the exact test based on the network algorithm is carried out by with HWNetwork

x <- c(AA=20,AB=31,AC=26,BB=15,BC=12,CC=0)
x <- toTriangular(x)
x
#>    A  B C
#> A 20  0 0
#> B 31 15 0
#> C 26 12 0
m <- c(A=0,B=0,C=0)
results <- HWNetwork(ma=m,fe=x)
#> Network algorithm for HWE Exact test with multiple alleles
#> 3 alleles detected.
#> 0 males and  104 females
#> Allele counts:
#>          A  B  C
#> Males    0  0  0
#> Females 97 73 38
#> All     97 73 38
#> Probability of the sample: 0.0001122091 
#> p-value: 0.03370688

We note that HWNetwork allows for the X chromosomal variants, and requires the specification of male and female genotype counts (arguments ma and fe). To do an autosomal test as shown above, hemizygous male counts should be set to zero, and the female genotype counts should be set to contain the summed autosomal counts of males and females. Second, note that the fe argument is required to be a lower triangular matrix, and for this reason the counts are first reorganised in this format with toTriangular. The p-value is exactly the same as before. For markers with more alleles, a permutation test will generally be faster. We run the permutation test for the triallelic autosomal locus analysed above

set.seed(123)
x <- c(AA=20,AB=31,AC=26,BB=15,BC=12,CC=0)
x <- toTriangular(x)
#results <- HWPerm.mult(x)
#Permutation test for Hardy-Weinberg equilibrium (autosomal).
#3 alleles detected.
#Observed statistic: 0.0001122091   17000 permutations. p-value: 0.03405882 

Note that this gives a similar, but not identical p-value, in comparison with HWTriExact above. As this works with two-character named vectors this currently allows the permutation test to be used for variants with well over 52 alleles.

An X chromosomal variant is tested for HWE by supplying separate vectors for males and females as shown below:

males   <- c(A=1,B=21,C=34) 
females <- c(AA=0,AB=1,AC=0,BB=8,BC=24,CC=15)
results <- HWTriExact(females,males)
#> Tri-allelic Exact test for HWE and EAF (X-chromosomal)
#> Allele counts: na =  2 nb =  62 nc = 88 
#> Sample contains:  56 males and 48 females
#> sum probabilities all outcomes 1 
#> probability of the sample 0.005343291 
#> p-value =  0.8309187

and this can also be done with the network algorithm by

males   <- c(A=1,B=21,C=34) 
females <- toTriangular(c(AA=0,AB=1,AC=0,BB=8,BC=24,CC=15))
results <- HWNetwork(ma=males,fe=females)
#> Network algorithm for HWE Exact test with multiple alleles
#> 3 alleles detected.
#> 56 males and  48 females
#> Allele counts:
#>          C  B A
#> Males   34 21 1
#> Females 54 41 1
#> All     88 62 2
#> Probability of the sample: 0.005343291 
#> p-value: 0.8309187

4.2. Microsatellites (STRs)

For genetic markers with multiple alleles such as microsatellites (STRs), the different alleles are often separated in different columns. The example below uses autosomal microsatellite data from the US National Institute of Standards and Technology (NIST), stored in dataframe NistSTRs, containing 29 STRs for 361 individuals. The two alleles of a marker are coded in successive columns. The corresponding alleles are often coded as integers, and we use function AllelesToTriangular to obtain a lower triangular matrix with the autosomal genotype counts. This matrix is used as input for HWPerm.mult that runs the permutation test.

data("NistSTRs")
NistSTRs[1:5,1:5]
#>         CSF1PO-1 CSF1PO-2 D10S1248-1 D10S1248-2 D12S391-1
#> BC11352       11       13         13         15      17.0
#> GC03394       10       12         15         16      19.0
#> GT36864       10       12         15         17      17.3
#> GT36866       11       11         13         16      18.0
#> GT36875       11       11         13         14      18.0
n <- nrow(NistSTRs)
p <- ncol(NistSTRs)/2
n
#> [1] 361
p
#> [1] 29

A1 <- NistSTRs[,1]
A2 <- NistSTRs[,2]
GenotypeCounts <- AllelesToTriangular(A1,A2)
print(GenotypeCounts)
#>     A10 A11 A12 A13 A14 A8 A9
#> A10  17   0   0   0   0  0  0
#> A11  47  35   0   0   0  0  0
#> A12  61  78  44   0   0  0  0
#> A13  12  20  26   0   0  0  0
#> A14   1   1   4   1   0  0  0
#> A8    1   2   1   0   0  0  0
#> A9    3   5   2   0   0  0  0

set.seed(123)
#out <- HWPerm.mult(GenotypeCounts)
#Permutation test for Hardy-Weinberg equilibrium (autosomal).
#7 alleles detected.
#Observed statistic: 2.290724e-11   17000 permutations. p-value: 0.8644706 

For this seven-allelic STR, there is no evidence against HWE. Function HWStr is a wrapper function allowing to test a set of STRs coded in the two-column format by a permutation or chisquare test. All STRs in the dataframe NistSTRs can be tested with

#Results <- HWStr(NistSTRs,test="permutation")
#29 STRs detected.
#> Results
#          STR   N Nt MinorAF MajorAF     Ho     He     Hp   pval
#1    CSF1PO-1 361  7  0.0055  0.3601 0.7341 0.7194 1.4016 0.8614
#2  D10S1248-1 361  9  0.0014  0.3075 0.7645 0.7586 1.5552 0.6488
#3   D12S391-1 361 16  0.0014  0.1717 0.8975 0.8909 2.3686 0.8955
#4   D13S317-1 361  8  0.0014  0.3255 0.7618 0.7837 1.7102 0.1049
#5   D16S539-1 361  7  0.0180  0.3144 0.7645 0.7600 1.5933 0.1641
#6    D18S51-1 361 15  0.0014  0.1704 0.8560 0.8758 2.2139 0.1926
#7   D19S433-1 361 15  0.0014  0.3615 0.7673 0.7694 1.7624 0.9667
#8   D1S1656-1 361 15  0.0028  0.1496 0.9252 0.8992 2.4073 0.8699
#9    D21S11-1 361 16  0.0014  0.2825 0.8227 0.8288 1.9946 0.6156
#10 D22S1045-1 361  8  0.0055  0.3823 0.7535 0.7220 1.4823 0.9785
#11  D2S1338-1 361 12  0.0014  0.1856 0.8698 0.8814 2.2464 0.5066
#12   D2S441-1 361 11  0.0014  0.3435 0.7867 0.7693 1.6734 0.7125
#13  D3S1358-1 361  9  0.0014  0.2729 0.7562 0.7900 1.6437 0.3789
#14   D5S818-1 361  9  0.0014  0.3878 0.7064 0.6977 1.3939 0.7615
#15  D6S1043-1 361 14  0.0014  0.2964 0.7978 0.8232 2.0059 0.0220
#16   D7S820-1 361  9  0.0014  0.2562 0.8310 0.8161 1.7925 0.5045
#17  D8S1179-1 361 10  0.0014  0.3296 0.7839 0.8072 1.8386 0.8606
#18   F13A01-1 361 12  0.0014  0.3490 0.7590 0.7353 1.5471 0.8793
#19     F13B-1 361  6  0.0055  0.3892 0.7008 0.7175 1.3790 0.8276
#20   FESFPS-1 361  7  0.0014  0.4114 0.6731 0.6930 1.3085 0.8506
#21      FGA-1 361 14  0.0014  0.2050 0.8670 0.8594 2.1163 0.0528
#22      LPL-1 361  8  0.0014  0.4224 0.7175 0.6947 1.3386 0.9721
#23  Penta_C-1 361 10  0.0014  0.3947 0.7701 0.7523 1.6035 0.0248
#24  Penta_D-1 361 13  0.0014  0.2327 0.8587 0.8247 1.8925 0.1464
#25  Penta_E-1 361 19  0.0014  0.1994 0.8920 0.8910 2.4391 0.4811
#26     SE33-1 361 39  0.0014  0.0942 0.9501 0.9483 3.1472 0.2506
#27     TH01-1 361  8  0.0014  0.3449 0.7424 0.7646 1.5616 0.2379
#28     TPOX-1 361  8  0.0014  0.5249 0.6537 0.6404 1.2572 0.5545
#29      vWA-1 361 10  0.0014  0.2839 0.8061 0.8076 1.7577 0.4501

The p-values of the permutation test in the last column show that at a usual five percent significance level, HWP would be rejected for Penta_C and D6S1043 and that FGA-1 is borderline.

References

Graffelman, J. 2015. “Exploring Diallelelic Genetic Markers: The HardyWeinberg Package.” Journal of Statistical Software 64 (3): 1–23. https://doi.org/10.18637/jss.v064.i03.