Prevalence mapping for DHS indicators using the surveyPrev package

Qianyu Dong, Zehang Richard Li, Yunhan Wu, Andrea Boskovic, Jon Wakefield

2024-04-07

1 Overview

In this vignette, we describe how prevalence mapping using DHS data can be carried out. Prevalence mapping is Small Area Estimation (SAE) for proportions of some indicator of interest. The vignette is organized as follows. In Section 2 we first describe the data preparation steps to obtain and process the relevant DHS survey data in R. We then describe three classes of models

  1. Direct estimates in Section 3.
  2. Area-level Fay-Herriot model in Section 4.
  3. Cluster-level model in Section 5.

Then in Section 6 we show several tools for visualization of different estimates, and in Section 7 we discuss how to aggregate the result for a given model to produce estimates at higher admin levels.

2 Data Preparation

2.1 Prerequisites

The following pieces of data are needed to produce SAE for DHS surveys:

2.2 Install surveyPrev

The latest surveyPrev package can be installed from CRAN with

install.packages(surveyPrev)

The development version of the surveyPrev package can be installed from GitHub with

library(devtools)
install_github("richardli/surveyPrev")

After installation, the package can be loaded with

library(surveyPrev)

One of the key required packages, INLA is not available on CRAN or GitHub, and needs to be installed with the following code (Rue, Martino, and Chopin 2009).

install.packages("INLA", repos=c(getOption("repos"), 
                  INLA="https://inla.r-inla-download.org/R/stable"), dep=TRUE)
library(INLA)

The following packages are needed to run the example analysis described in this vignette.

library(geodata)
library(sf)
library(SUMMER)
library(rdhs)
library(ggplot2)
library(patchwork)
library(dplyr)
library(tidyr)
library(kableExtra)

2.3 Built-in indicators

Currently, the surveyPrev package supports 22 indicators. The list of indicators and their IDs can be found in the surveyPrevIndicators dataset. The ID column correspond to the indicator IDs in the DHS API and are more standardized. The alternative ID column provides convenient and more readable indicator names for only a subset of indicators. Both can be used to retrieve the indicator in the functions described later. The full list of indicators and their IDs are summarized in the Table below.

data(surveyPrevIndicators)
head(surveyPrevIndicators)
Table 2.1: List of built-in indicators in the surveyPrev package.
ID alternative.ID Description Topic
AN_ANEM_W_ANY womananemia Percentage of women aged 15-49 classified as having any anemia Nutrition
AN_NUTS_W_THN Underweight (Prevalence of underweight (BMI < 18.5) women of reproductive age) Nutrition
CH_DIAT_C_ORT Diarrhea treatment (Children under five with diarrhea treated with either ORS or RHF) Child Health
CH_VACC_C_BAS Children age 12-23 months with all 8 basic vaccinations Child Health
CH_VACC_C_DP1 Percentage of children 12-23 months who had received DPT 1 vaccination Child Health
CH_VACC_C_DP3 DPT3 Percentage of children 12-23 months who had received DPT 3 vaccination Child Health
CH_VACC_C_MSL Percentage of children 12-23 months who had received MCV 1 (Measles containing vaccine) Child Health
CH_VACC_C_NON Children 12-23 months with no vaccinations Child Health
CM_ECMR_C_NNR nmr Probability of dying in the first month of life in the five or ten years preceding the survey Infant and Child Mortality
CN_ANMC_C_ANY Children under five with any anemia Nutrition
CN_BRFS_C_EXB Children exclusively breastfed (Prevalence of exclusive breastfeeding of children under six months of age) Nutrition
CN_NUTS_C_HA2 stunting Stunting rate (Prevalence of stunted (HAZ < -2) children under five (0-59 months)) Nutrition
CN_NUTS_C_WH2 wasting Wasting rate (Prevalence of wasted (HAZ < -2) children under five (0-59 months)) Nutrition
FP_CUSA_W_MOD Modern contraceptive prevalence rate (Married women currently using any modern method of contraception) Family Planning
FP_NADA_W_UNT unmet_family Percentage of women with an unmet need for family planning Family Planning
HA_HIVP_B_HIV HIV prevalence HIV prevalence
ML_NETP_H_IT2 Households with at least one insecticide-treated mosquito net (ITN) for every two persons who stayed in the household the previous night Malaria
RH_ANCN_W_N4P ancvisit4+ Antenatal visits for pregnancy: 4+ visits Maternal Healthcare
RH_DELA_C_SKP Assistance during delivery from a skilled provide Maternal Healthcare
WS_SRCE_P_BAS Population using a basic water source Water, Sanitation and Hygiene
WS_TLET_H_IMP sanitation Percentage of households using an improved sanitation facility Water, Sanitation and Hygiene
WS_TLET_P_BAS Population with access to a basic sanitation service Water, Sanitation and Hygiene

In this vignette, we consider estimating the percentage of women who had a live birth in the five years before the survey who had four or more antenatal care visits. This indicator can be extracted with indicator = "ancvisit4+" or indicator = "RH_ANCN_W_N4P".

2.4 Customized indicators

Details on more common DHS indicators can be found in the Guide to DHS Statistics at https://dhsprogram.com/data/Guide-to-DHS-Statistics/. User-specified function can also be used to process raw DHS data into new indicators. Please refer to the vignette on Creating Customized Indicators for surveyPrev for more details.

2.5 DHS survey data

Processing the binary indicator from the raw DHS data consist of two steps: 1. download the relevant DHS data recode using getDHSdata(), or manually from the DHS website, 2. create a data frame with the binary indicator and relevant information using getDHSindicator().

The getDHSdata() function downloads the relevant DHS data directly from the DHS website using the rdhs package. This step requires users to

  1. register with DHS to gain data access,
  2. set up the DHS account login details in R with rdhs::set_rdhs_config().
rdhs::set_rdhs_config(email = "your_email",
                project = "your_registered_DHS_project_title")

After setting up the DHS account information, getDHSdata() can be used to download the relevant survey files directly into R. If data download using the API fails, or if it is preferred to work with downloaded data files offline, you may also manually download the file from the DHS website and read into R. The getDHSdata() function returns a message specifying which file is used (e.g., Individual Record file for the ANC visit example).

indicator <- "ancvisit4+"
year <- 2018
country <- "Zambia"
dhsData <- getDHSdata(country = country, indicator = indicator, year = year)

We then use getDHSindicator() to process the raw survey data into a data frame, where the column value is the indicator of interest. The data frame also contains information specifying survey designs in the survey package, including cluster ID, household ID, survey weight and strata information.

data <- getDHSindicator(dhsData, indicator = indicator)
head(data)
##   cluster householdID    v024  weight strata value
## 1       1           1 eastern 1892890  rural     1
## 2       1           2 eastern 1892890  rural     1
## 3       1           3 eastern 1892890  rural     1
## 4       1           4 eastern 1892890  rural     0
## 5       1           7 eastern 1892890  rural     1
## 6       1           9 eastern 1892890  rural    NA

2.6 Spatial information

The getDHSgeo()function downloads the GPS data for cluster locations, also through the rdhs package. This step loads in the GPS location as a SpatialPointsDataFrame, but it can also be performed manually by downloading the GPS file from the DHS website. The GPS location of the clusters are used to match clusters to regions at different admin levels.

geo <- getDHSgeo(country = country, year = year)

In the case of Zambia, the maps are already included in the package as example datasets (data(ZambiaAdm1) and data(ZambiaAdm2)) so we load them directly within R. For other countries, the Admin 1 and Admin 2 boundary shapefiles can be downloaded from the GADM site3 and read into R manually. Another alternative is to use the geodata and sf package to download the maps from the GADM site and load them in R directly, in the following steps:

poly.adm1 <- geodata::gadm(country="ZMB", level=1, path=tempdir())
poly.adm1 <- sf::st_as_sf(poly.adm1)
poly.adm2 <- geodata::gadm(country="ZMB", level=2, path=tempdir())
poly.adm2 <- sf::st_as_sf(poly.adm2)

The clusterInfo() function extracts Admin 1 and Admin 2 area information for each cluster. To avoid confusion in countries that have duplicated region names, we create new labels for Admin 2 regions using the variable admin2.name.full. The clusters with invalid GPS locations are saved in the wrong.points object and not used in any models.

cluster.info <- clusterInfo(geo=geo, poly.adm1=poly.adm1, poly.adm2=poly.adm2, 
                            by.adm1 = "NAME_1",by.adm2 = "NAME_2")
head(cluster.info$data)
##   cluster  LONGNUM     LATNUM                   geometry admin1.name
## 1       1 32.06631 -13.895685 POINT (32.06631 -13.89568)     Eastern
## 2       2 28.73625 -13.419209 POINT (28.73625 -13.41921)  Copperbelt
## 3       3 31.13254  -8.754055 POINT (31.13254 -8.754055)    Northern
## 4       4 28.44930 -14.419112  POINT (28.4493 -14.41911)     Central
## 5       5 29.96905 -12.677275 POINT (29.96905 -12.67727)     Central
## 6       6 32.74745  -9.319054 POINT (32.74745 -9.319054)    Muchinga
##   admin2.name   admin2.name.full
## 1      Katete     Eastern_Katete
## 2     Masaiti Copperbelt_Masaiti
## 3    Mpulungu  Northern_Mpulungu
## 4       Kabwe      Central_Kabwe
## 5    Chitambo   Central_Chitambo
## 6     Nakonde   Muchinga_Nakonde
cluster.info$wrong.points
##  [1]  19  33  65 179 236 281 293 360 361 469

The adminInfo() function computes the spatial the adjacency matrix, and if supplied with additional information on population, combines needed information for each administrative area, including the name, population, and urban population fractions of each administrative area. The admin.mat object stores the adjacency matrix for spatial models. We will ignore the population information for now and focus on only information from the DHS data, and will return to the population information in Section 7.

admin.info1 <- adminInfo(poly.adm = poly.adm1, admin = 1,by.adm = "NAME_1")
admin.info2 <- adminInfo(poly.adm = poly.adm2, admin = 2,by.adm = "NAME_2",by.adm.upper = "NAME_1")
head(admin.info2$data)
##   admin1.name   admin2.name      admin2.name.full population urban
## 1     Central      Chibombo      Central_Chibombo         NA    NA
## 2     Central      Chisamba      Central_Chisamba         NA    NA
## 3     Central      Chitambo      Central_Chitambo         NA    NA
## 4     Central  Itezhi-tezhi  Central_Itezhi-tezhi         NA    NA
## 5     Central         Kabwe         Central_Kabwe         NA    NA
## 6     Central Kapiri Mposhi Central_Kapiri Mposhi         NA    NA

3 Direct estimates

3.1 National and Admin 1 direct estimates

For \(y_j\) be the binary outcome of interest for the \(j^{\text{th}}\) individual in the survey and \(w_j\) be the design weight associated with this individual. For a given area denoted as \(i\), we have the weighted estimator:

\[ \hat p^{W}_{i}=\frac {\sum_{j \in S_i} y_j w_j}{\sum_{j \in S_i} w_j}\]

where \(S_i\) is the set of individual index within the \(i\)-th region. The directEST() function calculates direct estimates at different Admin levels using the SUMMER::smoothSurvey() function in the SUMMER package internally (Li et al. 2020). The output of the function includes the direct estimates and their variance, standard error, and confidence intervals based on the specified confidence level. The confidence intervals are computed on the logit scale, i.e., if we use \(D_i\) to denote the design-based variance of \(\hat p^{W}_i\), then the asymptotic design-based variance of \(\text{logit}(\hat p^{W}_i)\) is \[ V_i = \frac{D_i}{(\hat p^{W}_i)^2(1-\hat p^{W}_i)^2} \] and we compute the confidence interval on the probability scale to be \[(\quad \text{expit}[\text{logit}(\hat p^{W}_i) - z_{\alpha/2}V_i^{1/2}], \;\; \text{expit}[\text{logit}(\hat p^{W}_i) + z_{\alpha/2}V_i^{1/2}]\quad). \]

Currently the package defaults to a two-stage stratified cluster sampling design, with the sampling clusters (enumeration areas) being stratified by Admin 1 areas and urban/rural strata, which is the most common sampling design in the DHS.

res_ad1 <- directEST(data = data,
                   cluster.info = cluster.info,
                   admin = 1)
head(res_ad1$res.admin1)
##   admin1.name direct.est   direct.var direct.logit.est direct.logit.var
## 1     Central  0.5845443 0.0007002450        0.3414564      0.011873143
## 2  Copperbelt  0.6023371 0.0006960677        0.4152124      0.012132270
## 3     Eastern  0.6505890 0.0003224625        0.6216294      0.006240117
## 4     Luapula  0.6453174 0.0007216008        0.5985190      0.013774331
## 5      Lusaka  0.5817949 0.0004319779        0.3301459      0.007296978
## 6    Muchinga  0.6884903 0.0004137610        0.7930706      0.008995197
##   direct.logit.prec  direct.se direct.lower direct.upper         cv
## 1          84.22370 0.02646214    0.5319292    0.6352999 0.04526970
## 2          82.42480 0.02638310    0.5496679    0.6527379 0.04380122
## 3         160.25341 0.01795724    0.6146268    0.6849157 0.02760151
## 4          72.59881 0.02686263    0.5910940    0.6960479 0.04162700
## 5         137.04303 0.02078408    0.5405908    0.6218883 0.03572406
## 6         111.17044 0.02034112    0.6472976    0.7269017 0.02954452

For national level (admin = 0), the function can additionally produce urban- and rural-specific direct estimates by specifying argument strata.

res_ad0 <- directEST(data=data,
                   cluster.info= cluster.info,
                   admin=0,
                   strata="all")
res_ad0_urban <- directEST(data=data,
                   cluster.info= cluster.info,
                   admin=0,
                   strata="urban")
res_ad0_rural <- directEST(data=data,
                   cluster.info= cluster.info,
                   admin=0,
                   strata="rural")
list(all = res_ad0$res.admin0, 
      urban = res_ad0_urban$res.admin0, 
      rural = res_ad0_rural$res.admin0)
## $all
##   direct.est   direct.var direct.logit.est direct.logit.var direct.logit.prec
## 1  0.6344011 7.614786e-05        0.5511447      0.001415533          706.4475
##     direct.se direct.lower direct.upper
## 1 0.008726274    0.6171347    0.6513289
## 
## $urban
##   direct.est   direct.var direct.logit.est direct.logit.var direct.logit.prec
## 1  0.6066641 0.0002399573        0.4333115      0.004214152          237.2957
##    direct.se direct.lower direct.upper
## 1 0.01549055    0.5759275    0.6365788
## 
## $rural
##   direct.est   direct.var direct.logit.est direct.logit.var direct.logit.prec
## 1  0.6519019 0.0001103517        0.6274099      0.002142946          466.6473
##    direct.se direct.lower direct.upper
## 1 0.01050484    0.6310396    0.6721974

3.2 Admin 2 direct estimates

In general, direct estimates at Admin 2 are unstable. For countries with a large number of Admin 2 areas, some of the direct estimates can be undefined if there is no cluster in the area.

res_ad2 <- directEST(data = data,
                     cluster.info = cluster.info,
                     admin = 2,
                     aggregation = FALSE)

4 Area-level Fay-Herriot model

4.1 Admin 1 Fay-Herriot estimates

Fay-Herriot models provides smoothed estimates at the areal level using direct estimate \(\hat p^{W}_{i}\) as input. The direct estimates are modeled as a noisy observation of the true prevalence, with the variance of noise determined by the design-based variance. We consider the spatial Fay-Herriot model for the logit transformed direct estimates, which is defined as follows:

\[\text{logit}(\hat p^{W}_{i})|\lambda_{i} \sim\textrm{Normal}(\lambda_{i}, V_{i}^{HT}),\] \[\lambda_{i}= \alpha + e_i+S_i.\] Here \(\text{expit}(\lambda_{i})\) is the latent true prevalence, and \(e_i\) and \(S_i\) are unstructured and structured spatial random effects. Inference is carried out using Bayesian methods and so the model specific is completed by priors on \(\alpha\), \(e\) and \(S\), and their hyperpriors. More details of the Bayesian model setup can be found in Wakefield, Okonek, and Pedersen (2020). Area level Fay-Herriot model are viewed as the most reliable model choice, since they acknowledge the design through the sue of a weighted estimate and its associated variance. See chapters 4 to 6 of Rao and Molina (2015).

As of now the package allows only an overall intercept \(\alpha\), but future versions of the package will allow area level covariates to be included. The default prior for the intercept is \(N(0, 1000)\). The structured and non-structured random effects are implemented used Besag-York-Mollié (BYM) via BYM2 parameterization, with default PC priors such that the marginal standard deviation has a prior such that \(Prob(\sigma > 1) = 0.01\) and the proportion of variation explained by the spatial effect, \(\phi\) has a prior such that \(Prob(\phi > 0.5) = 2/3\) (Riebler et al. 2016; Simpson et al. 2017).

The fhModel() calculates spatial Fay-Herriot estimates at different administrative levels using SUMMER::smoothSurvey(). Users can specify either only i.i.d. random effects (i.e., no \(S_i\) term) or the BYM2 model by setting model = "iid" or model = "bym2". The function returns model results at the specified admin level with aggregation = FALSE. Aggregated to higher admin levels is possible with aggregation = TRUE when additional population information is provided. The aggregation steps are described in more details in Section 7.

smth_res_ad1_bym2 <- fhModel(data,
                        cluster.info = cluster.info,
                        admin.info = admin.info1,
                        admin = 1,
                        model = "bym2",
                        aggregation =FALSE)

smth_res_ad1_iid <- fhModel(data, 
                        cluster.info = cluster.info, 
                        admin.info = admin.info1,
                        admin = 1, 
                        model = "iid",
                        aggregation =FALSE)
head(smth_res_ad1_bym2$res.admin1) 
##   admin1.name      mean          var    median     lower     upper logit.mean
## 1     Central 0.6028279 0.0005262850 0.6035275 0.5559112 0.6455631  0.4181968
## 2  Copperbelt 0.6130262 0.0005303788 0.6135925 0.5663871 0.6567400  0.4611124
## 3     Eastern 0.6479354 0.0002699419 0.6480306 0.6153815 0.6799110  0.6107456
## 4     Luapula 0.6468364 0.0005282296 0.6471917 0.6007430 0.6910090  0.6066520
## 5      Lusaka 0.5941311 0.0003830601 0.5944710 0.5549919 0.6312606  0.3816880
## 6    Muchinga 0.6768713 0.0003558819 0.6769527 0.6397106 0.7137660  0.7407531
##     logit.var logit.median logit.lower logit.upper         sd         cv
## 1 0.009195237    0.4201849   0.2245841   0.2245841 0.02294090 0.03805548
## 2 0.009451857    0.4624387   0.2671258   0.2671258 0.02302995 0.03756765
## 3 0.005201276    0.6103938   0.4699907   0.4699907 0.01642991 0.02535732
## 4 0.010167730    0.6067175   0.4085620   0.4085620 0.02298325 0.03553177
## 5 0.006601776    0.3824795   0.2208608   0.2208608 0.01957192 0.03294209
## 6 0.007476746    0.7398026   0.5741084   0.5741084 0.01886483 0.02787064

Columns 2-6 represent the posterior mean, posterior variance, posterior median and 2.5%, and 97.5% posterior quantiles, respectively. The posterior variance can be viewed as an analogue of the mean squared error (MSE) summary.

4.2 Admin 2 Fay-Herriot estimates

A Fay-Herriot model at Admin 2 level can be fitted by treating areas without direct estimates as missing data, though it is usually not recommended due to data sparsity. In addition, numerical issue arise when design-based variance of direct estimate is close to zero and logit precision become too large. A potential fix for this issue is to identify regions with too small design variance(< 1e-30), and delete the clusters in these regions, before fitting the model. However, this step creates additional bias in the results if the number of clusters deleted is large.

bad_admin2 <- subset(res_ad2$res.admin2, direct.var < 1e-30)$admin2.name.full
# identify clusters in these regions
bad_clusters <- subset(cluster.info$data, admin2.name.full %in% bad_admin2)$cluster
# Run FH model without these clusters
smth_res_ad2 <- fhModel(subset(data, !cluster %in% bad_clusters),
                    cluster.info = cluster.info,
                    admin.info = admin.info2,
                    admin = 2,
                    model = "bym2",
                    aggregation = FALSE)
head(smth_res_ad2$res.admin2)
##        admin2.name.full      mean          var    median     lower     upper
## 1      Central_Chibombo 0.6535171 0.0026360727 0.6550209 0.5480457 0.7496094
## 2      Central_Chisamba 0.5579803 0.0006107532 0.5581483 0.5089874 0.6060565
## 3      Central_Chitambo 0.6217739 0.0101383288 0.6276832 0.4110805 0.8000109
## 4  Central_Itezhi-tezhi 0.6436590 0.0101398292 0.6505298 0.4284150 0.8197984
## 5         Central_Kabwe 0.5213792 0.0011020175 0.5213857 0.4560794 0.5861891
## 6 Central_Kapiri Mposhi 0.5106414 0.0015870676 0.5106247 0.4323244 0.5880890
##   logit.mean  logit.var logit.median logit.lower logit.upper         sd
## 1 0.64252506 0.05257552   0.64118322  0.19277749  0.19277749 0.05134270
## 2 0.23355308 0.01008966   0.23365025  0.03595353  0.03595353 0.02471342
## 3 0.52003748 0.19864247   0.52229030 -0.35950027 -0.35950027 0.10068927
## 4 0.61982125 0.20921818   0.62136870 -0.28832074 -0.28832074 0.10069672
## 5 0.08594926 0.01785317   0.08559503 -0.17613618 -0.17613618 0.03319665
## 6 0.04284320 0.02574167   0.04250537 -0.27237407 -0.27237407 0.03983802
##           cv
## 1 0.07856367
## 2 0.04429085
## 3 0.16193872
## 4 0.15644419
## 5 0.06367084
## 6 0.07801564

5 Cluster-level model

5.1 Unstratified model

Cluster-level models assume smoothing models for counts of events in each cluster (Wakefield, Okonek, and Pedersen 2020; Li et al. 2020). In terms of traditional SAE literature, cluster-level models are a type of unit-level model. We start with describing an unstratified model without taking into account the urban/rural stratification in the sampling design.

Let \(Y_c\) be the number of events in cluster \(c\), and \(n_c\) be the number of individuals at risk, where \(c= 1,\dots,C\). The unstratified model assumes the hierarchical structure:

\[Y_c \mid p_c,d\sim \textrm{BetaBinomial}(n_c,p_c,d),\] \[p_c=\textrm{expit}(\alpha+e_{i[s_c]}+S_{i[s_c]}),\]

where \(\alpha\) is the intercept, and \(i[s_c]\) indexes the area within which the cluster \(s_c\) resides. Similar to the area-level model, \(e_i\) and \(S_i\) are unstructured and structured spatial random effects with the same prior as before. The Beta-binomial distribution arise from a hierarchical model where the probability follows a \(\text{Beta}(a, b)\) prior. The overdispersion parameter, \(d=\frac{1}{\alpha+\beta+1}\), is between 0 and 1 and represent the the intracluster correlation between Bernoulli draws within cluster. The default prior for \(d\) is \(\text{logit}(d) \sim \text{Normal}(0,0.4)\).

The clusterModel() function fits the cluster-level model above. Unstratified model can be specified by setting stratification = FALSE, and either i.i.d. or the BYM2 model by setting model = "bym2" or model = "iid". For the the overdispersion parameter \(d\), users can change the prior mean and precision through overdisp.mean and overdisp.prec.

We fit the unstratified model in both Admin 1 and Admin 2 below. These models differ only in the admin level at which the spatial random effects enter, with the cluster-level model being the same in each case.

cl_res_ad1 <- clusterModel(data=data,
                   cluster.info=cluster.info,
                   admin.info  = admin.info1,
                   stratification = FALSE,
                   model = "bym2",
                   admin = 1,
                   aggregation =FALSE,
                   CI = 0.95)

cl_res_ad2 <- clusterModel(data=data,
                   cluster.info= cluster.info,
                   admin.info = admin.info2,
                   model = "bym2",
                   stratification =FALSE,
                   admin = 2,
                   aggregation =FALSE,
                   CI = 0.95)
head(cl_res_ad2$res.admin2)
##        admin2.name.full      mean    median         sd         var     lower
## 1      Central_Chibombo 0.6448760 0.6447917 0.04581087 0.002098636 0.5518984
## 2      Central_Chisamba 0.5963255 0.5973862 0.05510070 0.003036088 0.4839398
## 3      Central_Chitambo 0.5798794 0.5841081 0.06852435 0.004695587 0.4385194
## 4  Central_Itezhi-tezhi 0.6458915 0.6466465 0.07320677 0.005359231 0.4920623
## 5         Central_Kabwe 0.5445200 0.5434148 0.04622923 0.002137142 0.4522349
## 6 Central_Kapiri Mposhi 0.5598836 0.5622164 0.04545403 0.002066069 0.4685788
##       upper         cv admin1.name   admin2.name population urban
## 1 0.7313468 0.07103827     Central      Chibombo         NA    NA
## 2 0.6996858 0.09240039     Central      Chisamba         NA    NA
## 3 0.7063556 0.11817001     Central      Chitambo         NA    NA
## 4 0.7798355 0.11334222     Central  Itezhi-tezhi         NA    NA
## 5 0.6331086 0.08489905     Central         Kabwe         NA    NA
## 6 0.6465233 0.08118479     Central Kapiri Mposhi         NA    NA

5.2 Stratified model

Cluster-level models will produce biased estimates if the specified model does not hold for all units in the population. In the DHS, urban clusters are often oversampled. If such oversampling occurs and the outcome associated with urban/rural, then bias will result when the model does not allow for differences between urban/rural. To emphasize, bias will only arise when oversampling of urban (or rural) areas occurs, and the indicator has an association with urban/rural. One may check whether one needs to include urban/rural terms in the model, and then aggregate by urban/rural.

A fully adjusted model would include interaction terms representing the crossing of Admin 1 areas and urban/rural status. However, for reasons of parsimony we include area-level random effects and a single urban/rural fixed effect. The sampling model becomes

\[Y_c|p_c,d\sim \textrm{BetaBinomial}(n_c,p_c,d),\] \[p_c=\textrm{expit}(\alpha+\gamma\times I(s_c \in \textrm{urban})+x_c\beta+e_{i[s_c]}+S_{i[s_c]}).\]

The area-level risk is defined as \[p_i=q_i\times \text{expit}(\alpha+\gamma+e_{i[s_c]}+S_{i[s_c]})+(1-q_i)\times \text{expit}(\alpha+e_{i[s_c]}+S_{i[s_c]}),\] where \(q_i\) is the proportion of urban population in area \(i\) where “urban” is defined by the survey design sampling frame.

Stratified cluster-level model can be fit by setting stratification = TRUE. The returned results then consist of urban-specific, rural-specific, and the overall area-level prevalences. In order to facilitate aggregation for cluster-level model with urban/rural effects, we need to know the fraction of urban population by each administrative region. This computation can be performed with the getUR() function (see example for details), but it requires external information (i.e., the urban/rural definition by census), so we omit this step from this document for now and assume such information is already obtained from a difference source. he urban proportions for women of age 15 to 49 in Zambia in 2018 has been pre-computed and saved in ZambiaPopWomen$admin1_urban and ZambiaPopWomen$admin2_urban.

head(ZambiaPopWomen$admin2_urban)
##     admin1.name admin2.name       urban
## 1       Eastern     Chadiza 0.049373174
## 2      Muchinga       Chama 0.054972234
## 3       Eastern     Chasefu 0.001718483
## 4 North-Western     Chavuma 0.325830536
## 5       Luapula      Chembe 0.000000000
## 6       Central    Chibombo 0.577152577

Given the urban/rural proportions, we first add the information into admin.info2 using adminInfo().T

admin.info2 <- adminInfo(poly.adm = poly.adm2, 
                         admin = 2,
                         proportion = ZambiaPopWomen$admin2_urban,
                         by.adm="NAME_2",by.adm.upper="NAME_1")
head(admin.info2$data)
##   admin1.name   admin2.name      admin2.name.full      urban population
## 1     Central      Chibombo      Central_Chibombo 0.57715258         NA
## 2     Central      Chisamba      Central_Chisamba 0.03825222         NA
## 3     Central      Chitambo      Central_Chitambo 0.00000000         NA
## 4     Central  Itezhi-tezhi  Central_Itezhi-tezhi 0.00000000         NA
## 5     Central         Kabwe         Central_Kabwe 0.68604463         NA
## 6     Central Kapiri Mposhi Central_Kapiri Mposhi 0.13246326         NA

We can then fit the stratified cluster-level model by specifying stratification = TRUE.

cl_res_ad2_T <- clusterModel(data=data,
                   cluster.info=cluster.info,
                   admin.info = admin.info2,
                   model = "bym2",
                   stratification = TRUE,
                   admin = 2, 
                   CI = 0.95)

6 Visualizing prevalence estimates

We can use mapPlot() function from the SUMMER package to visualize estimates and standard errors.

# Arrange all estimates into a long-format data frame
out1 <- res_ad1$res.admin1[, c("admin1.name", "direct.est", "cv")]
colnames(out1)[2] <- "mean"
out1$model <- "Direct Estimates"
out2 <- smth_res_ad1_bym2$res.admin1[, c("admin1.name", "mean", "cv")]
out2$model <- "Fay-Herriot Model"
out3 <- cl_res_ad1$res.admin1[, c("admin1.name", "mean", "cv")]
out3$model <- "Unstratified Cluster-level Model"

g1 <- mapPlot(data = rbind(out1, out2, out3), geo = poly.adm1,
              by.data = "admin1.name",  by.geo = "NAME_1", is.long = TRUE,
              variable = "model", value = "mean", legend.label = "Mean")

g2 <- mapPlot(data = rbind(out1, out2, out3), geo = poly.adm1,
              by.data = "admin1.name",  by.geo = "NAME_1", is.long = TRUE,
              variable = "model", value = "cv", legend.label = "CV")
g1 / g2
Comparing Admin 1 direct estimates, Fay–Herriot model, and unstratified cluster-level posterior mean estimates

Figure 6.1: Comparing Admin 1 direct estimates, Fay–Herriot model, and unstratified cluster-level posterior mean estimates

We note that for Admin 2 estimates, surveyPrev uses a recreated region name in the form of [admin1_name]_[admin2_name] to avoid duplicated Admin 2 names. This new region identifier is assigned a column name of admin2.name.full in the output. Thus when plotting, we need to manually create the admin2.name.full in the spatial polygon object.

poly.adm2$admin2.name.full=paste0(poly.adm2$NAME_1,"_",poly.adm2$NAME_2)

# Arrange all estimates into a long-format data frame
out1<- res_ad2$res.admin2[, c("admin2.name.full", "direct.est", "cv")]
colnames(out1)[2] <- "mean"
out1$model <- "Direct Estimates"
out2 <- smth_res_ad2$res.admin2[, c("admin2.name.full", "mean", "cv")]
out2$model <- "Fay-Herriot Model"
out3 <- cl_res_ad2$res.admin2[, c("admin2.name.full", "mean", "cv")]
out3$model <- "Unstratified Cluster-level Model"
out4 <- cl_res_ad2_T$res.admin2[cl_res_ad2_T$res.admin2$type=="full", c("admin2.name.full", "mean", "cv")]
out4$model <- "Stratified Cluster-level Model"

g1 <- mapPlot(data = rbind(out1, out2, out3, out4), geo = poly.adm2,
              by.data = "admin2.name.full",  by.geo = "admin2.name.full", 
              is.long = TRUE, variable = "model", value = "mean", 
              legend.label = "Mean", ncol = 4)

g2 <- mapPlot(data = rbind(out1, out2, out3, out4), geo = poly.adm2,
              by.data = "admin2.name.full",  by.geo = "admin2.name.full", 
              is.long = TRUE, variable = "model", value = "cv", 
              legend.label = "CV", ncol = 4)
g1 / g2 
Comparing Admin 2 direct estimates, Fay–Herriot model, unstratified and stratified cluster-level posterior mean estimates

Figure 6.2: Comparing Admin 2 direct estimates, Fay–Herriot model, unstratified and stratified cluster-level posterior mean estimates

The scatterPlot() function takes results from two model outputs, and produce a scatter plot of the selected variable. For example, we can compare the direct estimates with the Fay-Herriot model (top row) and Cluster-level model (bottom row) in terms of their point estimates and standard deviations. In the two plots on the left, we see the expected shrinkage (attenuation) of the spatial Fay–Herriot posterior mean estimates as compared to the direct estimates. In the plots on the right we see the reduction in uncertainty which arises from using all of the data in a single Model.

s1 <- scatterPlot(
          res1=res_ad1$res.admin1,
          res2=smth_res_ad1_bym2$res.admin1,
          value1="direct.est",
          value2="mean",
          by.res1="admin1.name",
          by.res2="admin1.name",
          title="Fay-Herriot vs Direct estimate",
          label1="Direct Est",
          label2="Spatial Fay-Herriot")
s2 <- scatterPlot(
          res1=res_ad1$res.admin1,
          res2=smth_res_ad1_bym2$res.admin1,
          value1="direct.se",
          value2="sd",
          by.res1="admin1.name",
          by.res2="admin1.name",
          title="Fay–Herriot vs Direct SE",
          label1="Direct Est",
          label2="Spatial Fay–Herriot")
s3 <- scatterPlot(
          res1=res_ad1$res.admin1,
          res2=cl_res_ad1$res.admin1,
          value1="direct.est",
          value2="mean",
          by.res1="admin1.name",
          by.res2="admin1.name",
          title="Cluster-level model vs Direct estimate",
          label1="Direct Est",
          label2="Cluster-level model")
s4 <- scatterPlot(
          res1=res_ad1$res.admin1,
          res2=cl_res_ad1$res.admin1,
          value1="direct.se",
          value2="sd",
          by.res1="admin1.name",
          by.res2="admin1.name",
          title="Cluster-level model vs Direct SE",
          label1="Direct Est",
          label2="Cluster-level model")
(s1 + s2 ) / (s3 + s4)
Comparing  direct estimates with Fay-Herriot model (top row) and Unstratified cluster-level model (bottom row).

Figure 6.3: Comparing direct estimates with Fay-Herriot model (top row) and Unstratified cluster-level model (bottom row).

Similar smoothing patterns can be observed when comparing estimates at Admin 2 level.

s1 <- scatterPlot(
          res1=res_ad2$res.admin2,
          res2=smth_res_ad2$res.admin2,
          value1="direct.est",
          value2="mean",
          by.res1="admin2.name.full",
          by.res2="admin2.name.full",
          title="Fay-Herriot vs Direct estimate",
          label1="Direct Est",
          label2="Spatial Fay-Herriot")
s2 <- scatterPlot(
          res1=res_ad2$res.admin2,
          res2=smth_res_ad2$res.admin2,
          value1="direct.se",
          value2="sd",
          by.res1="admin2.name.full",
          by.res2="admin2.name.full",
          title="Fay–Herriot vs Direct SE",
          label1="Direct Est",
          label2="Spatial Fay–Herriot")
s3 <- scatterPlot(
        res1=res_ad2$res.admin2,
        res2=cl_res_ad2$res.admin2,
        value1="direct.est",
        value2="mean",
        by.res1="admin2.name.full",
        by.res2="admin2.name.full",
        title="Cluster-level model vs Direct estimate",
        label1="Direct Est",
        label2="Cluster-level model")
s4 <- scatterPlot(
        res1=res_ad2$res.admin2,
        res2=cl_res_ad2$res.admin2,
        value1="direct.se",
        value2="sd",
        by.res1="admin2.name.full",
        by.res2="admin2.name.full",
        title="Cluster-level model vs Direct SE",
        label1="Direct Est",
        label2="Cluster-level model")
(s1 + s2 ) / (s3 + s4)
Comparing  direct estimates with Fay-Herriot model (top row) and Unstratified cluster-level model (bottom row). Red triangles corresponding to areas where direct estimates are not available.

Figure 6.4: Comparing direct estimates with Fay-Herriot model (top row) and Unstratified cluster-level model (bottom row). Red triangles corresponding to areas where direct estimates are not available.

Finally, we compare the unstratified and stratified cluster-level model at Admin 2 level. We can observe mild differences but the estimates are mostly similar between the two models.

s1 <- scatterPlot(
          res1=cl_res_ad2$res.admin2,
          res2=cl_res_ad2_T$res.admin2[cl_res_ad2_T$res.admin2$type=="full",],
          value1="mean",
          value2="mean",
          by.res1="admin2.name.full",
          by.res2="admin2.name.full",
          title="Stratified vs Unstratified model estimate",
          label1="Unstratified model estimates",
          label2="Stratified model estimates")
s2 <- scatterPlot(
          res1=cl_res_ad2$res.admin2,
          res2=cl_res_ad2_T$res.admin2[cl_res_ad2_T$res.admin2$type=="full",],
          value1="sd",
          value2="sd",
          by.res1="admin2.name.full",
          by.res2="admin2.name.full",
          title="Stratified vs Unstratified model SD",
          label1="Unstratified model estimates",
          label2="Stratified model estimates")
(s1 + s2 )  
Comparing unstratified and stratified cluster-level model at Admin 2 level.

Figure 6.5: Comparing unstratified and stratified cluster-level model at Admin 2 level.

Another visualization function intervalPlot() can provide interval plots at different admin levels for model comparison with compare=T. Input should be a list of model outputs by surveyPrev with self defined model name, (e.g.model=list("name 1"= fit1,"name 2"= fit2)).

intervalPlot(admin = 1, compare = TRUE, model = list(
      "Direct estimate model"= res_ad1,
      "Fay-Herriot model"= smth_res_ad1_bym2,
      "Unstratified Cluster-level model"= cl_res_ad1))
Comparing different Admin 1 estimates.

Figure 6.6: Comparing different Admin 1 estimates.

For stratified cluster-level models, intervalPlot() visualizes the urban, rural, and overall Admin 2 estimates within each Admin 1 region, when setting compare = FALSE.

plots <- intervalPlot(compare = FALSE, model = list("model"=cl_res_ad2_T))
patchwork::wrap_plots(plots, ncol = 2)
Comparing Admin 2 estimates arranged by the corresponding Admin 1 regions.

Figure 6.7: Comparing Admin 2 estimates arranged by the corresponding Admin 1 regions.

7 Aggregation to higher admin levels

7.1 Computing population size from WorldPop raster

Population fractions are necessary to aggregate estimates from finer administrative regions to higher levels. The script below demonstrates the creation of a pixel-level population raster at the 100m by 100m level for women of age 15 to 49 in Zambia in 2018. The corresponding age-sex-specific population estimates can be found at https://hub.worldpop.org/geodata/summary?id=16429. The path to the downloaded .tiff files are specified in the pop_dir object. The aggPopulation() aggregates population data from the population raster into admin levels based on the input shapefiles we read in previous step. Here we aggregate pixel-level population for women of age 15 to 49 in Zambia in 2018 into both Admin 1 and Admin 2 levels.

library(raster)
pop.abbrev <- "ZMB"
year <- 2018
pop.dir <- "../data/Zambia/worldpop"

## first sum up four rasters female  15-49
f_15_name <- paste0(pop.dir, "/", pop.abbrev, '_f_15_', year, '.tiff')
f_20_name <- paste0(pop.dir, "/", pop.abbrev, '_f_20_', year, '.tiff')
f_25_name <- paste0(pop.dir, "/", pop.abbrev, '_f_25_', year, '.tiff')
f_30_name <- paste0(pop.dir, "/", pop.abbrev, '_f_30_', year, '.tiff')
f_35_name <- paste0(pop.dir, "/", pop.abbrev, '_f_35_', year, '.tiff')
f_40_name <- paste0(pop.dir, "/", pop.abbrev, '_f_40_', year, '.tiff')
f_45_name <- paste0(pop.dir, "/", pop.abbrev, '_f_45_', year, '.tiff')


pop_f_15 <- raster(f_15_name)
pop_f_20 <- raster(f_20_name)
pop_f_25 <- raster(f_25_name)
pop_f_30 <- raster(f_30_name)
pop_f_35 <- raster(f_35_name)
pop_f_40 <- raster(f_40_name)
pop_f_45 <- raster(f_45_name)

raster <- pop_f_45 + pop_f_40 + pop_f_35 + pop_f_30 + pop_f_25 + pop_f_20 + pop_f_15
agg.pop1 <- aggPopulation(
   tiff = pop_raster,
   poly.adm  = ZambiaAdm1,
   by.adm  = "NAME_1")
colnames(agg.pop1)[1] <- "admin1.name"
agg.pop2 <- aggPopulation(
   tiff = pop_raster,
   poly.adm = ZambiaAdm2,
   by.adm  = "NAME_2",
   by.adm.upper= "NAME_1")

The computed population estimates in this example is built into the package and available in the ZambiaPopWomen dataset.

data(ZambiaPopWomen)
head(ZambiaPopWomen$admin1_pop)
##   admin1.name population
## 1     Central   430358.5
## 2  Copperbelt   598032.5
## 3     Eastern   452361.1
## 4     Luapula   342075.2
## 5      Lusaka   808640.2
## 6    Muchinga   214810.4
head(ZambiaPopWomen$admin2_pop)
##        admin2.name.full population admin1.name   admin2.name
## 1      Central_Chibombo  108561.22     Central      Chibombo
## 2      Central_Chisamba   28307.61     Central      Chisamba
## 3      Central_Chitambo   20727.69     Central      Chitambo
## 4  Central_Itezhi-tezhi   22814.54     Central  Itezhi-tezhi
## 5         Central_Kabwe   56875.20     Central         Kabwe
## 6 Central_Kapiri Mposhi   55995.96     Central Kapiri Mposhi

7.2 Estimating population size by survey weight

An alternative approach to obtain population fractions is to simply use survey weights to estimate the population size within each area. This approach, however, can lead to noisy estimates at fine spatial resolutions. Nevertheless, similar to aggPopulation(), the aggSurveyWeight() function produces the estimated population size at the specified admin level. The output can be used as weights for aggregation model results to higher admin levels.

agg.survey1 <- aggSurveyWeight(data = data, cluster.info = cluster.info, admin = 1)
agg.survey2 <- aggSurveyWeight(data = data, cluster.info = cluster.info, admin = 2,
    poly.adm = poly.adm2, by.adm = "NAME_2", by.adm.upper = "NAME_1")

As mentioned in Section 2, the adminInfo() combine information for each administrative area into a single data frame. This includes the name, population, and fractions of the urban population for each area. Population and urban population fractions can be added through the agg.pop and proportion objects, respectively. Additionally, the survey weight (output of aggSurveyWeight()) may serve as an alternative to population data, and users can add it via agg.pop.

admin.info1 <- adminInfo(poly.adm = poly.adm1, 
                         admin = 1,
                         by.adm="NAME_1",
                         agg.pop  =ZambiaPopWomen$admin1_pop,
                         proportion = ZambiaPopWomen$admin1_urban )

admin.info2 <- adminInfo(poly.adm = poly.adm2, 
                         admin = 2,by.adm="NAME_2",by.adm.upper = "NAME_1",
                         agg.pop =ZambiaPopWomen$admin2_pop,
                         proportion = ZambiaPopWomen$admin2_urban)

admin.info1.survey <- adminInfo(poly.adm = poly.adm1, 
                         admin = 1,
                         by.adm="NAME_1",
                         agg.pop  =agg.survey1,
                         proportion = ZambiaPopWomen$admin1_urban )

admin.info2.survey  <- adminInfo(poly.adm = poly.adm2, 
                         admin = 2,by.adm="NAME_2",by.adm.upper = "NAME_1",
                         agg.pop =agg.survey2,
                         proportion = ZambiaPopWomen$admin2_urban)

head(admin.info2$data)
##   admin1.name   admin2.name      admin2.name.full      urban population
## 1     Central      Chibombo      Central_Chibombo 0.57715258  108561.22
## 2     Central      Chisamba      Central_Chisamba 0.03825222   28307.61
## 3     Central      Chitambo      Central_Chitambo 0.00000000   20727.69
## 4     Central  Itezhi-tezhi  Central_Itezhi-tezhi 0.00000000   22814.54
## 5     Central         Kabwe         Central_Kabwe 0.68604463   56875.20
## 6     Central Kapiri Mposhi Central_Kapiri Mposhi 0.13246326   55995.96
##   population.admin1
## 1          430358.5
## 2          430358.5
## 3          430358.5
## 4          430358.5
## 5          430358.5
## 6          430358.5

The following maps display the 15-49 female population and aggregated DHS survey weights ratio within upper admin levels.

out1 <- rbind(cbind(admin.info1$data, type = "WorldPop"), 
             cbind(admin.info1.survey$data, type = "SurveyWeight"))
out1 <- out1 %>% group_by(type) %>% mutate(pop.frac = population / sum(population)) 
g1 <- mapPlot(data = out1, geo = poly.adm1, size = 0.1, 
              by.data = "admin1.name", by.geo = "NAME_1", 
              variable = "type",  value = "pop.frac", is.long = TRUE,
              legend.label = "Population Fraction")

out2 <- rbind(cbind(admin.info2$data, type = "WorldPop"), 
             cbind(admin.info2.survey$data, type = "SurveyWeight"))
out2 <- out2 %>% group_by(type) %>% mutate(pop.frac = population / sum(population)) 
poly.adm2$admin2.name.full=paste0(poly.adm2$NAME_1,"_",poly.adm2$NAME_2)
g2 <- mapPlot(data = out2, geo = poly.adm2, size = 0.1, 
              by.data = "admin2.name.full", by.geo = "admin2.name.full", 
              variable = "type",  value = "pop.frac", is.long = TRUE,
              legend.label = "Population Fraction")
g1 / g2
survey weight ratio and population ratio at Admin 1 and Admin 2

Figure 7.1: survey weight ratio and population ratio at Admin 1 and Admin 2

The scatter plot below examines the fraction of population residing in each Admin 1 areas, computed by both WorldPop raster and survey weights. Noticed that DHS data may have missing Admin 2 areas. For example, Zambia has 112 out of 115 Admin 2 areas for this survey (red points in the scatter plots). Users have to be aware of that this is an incomplete set of weights for aggregation to from Admin 2 to Admin 1.

out2$pop.frac.within <- out2$population / out2$population.admin1
out3 <- tidyr::spread(out2[, c("admin2.name.full", "pop.frac.within", "type")], 
                      type, pop.frac.within)
out3 <- left_join(out3, admin.info2$data[, c("admin2.name.full", "population")])
ggplot(out3, aes(x = SurveyWeight, y = WorldPop, size = population)) +
    geom_point(alpha = 0.5, color = "red")+
    geom_abline(slope = 1, intercept = 0, linetype = "dashed")
Comparing two sets of Admin 2 15-49 female population fractions within Admin 1 area.

Figure 7.2: Comparing two sets of Admin 2 15-49 female population fractions within Admin 1 area.

7.3 Aggregating direct estimates

For direct estimates, we caution that we almost always prefer directly computing direct estimates at the desired level, rather than aggregating direct estimates at finer levels. Nevertheless, here we describe how the aggregated direct estimates are computed.

Let \(i\) denote the index for the lower admin level and \(k[i]\) be the index for the corresponding upper admin level, then aggregated estimates for the \(k\)-th upper admin area is

\[ \hat p^{agg}_{k}=\frac {\sum_{k[i] = k} \hat p^{W}_{i}E_i}{\sum_{k[i] = k} \text{E}_i},\] where \(E_i=\widehat{\text{pop}_{i}} \times \textbf{1}(\hat p^{W}_{i} \neq \text{NA})\), i.e., the point estimates are obtained by weighted average of non-missing direct estimates in the area, where the weights are either given by external population information, or estimated by survey weights. The standard error and confidence intervals are computed by simulation using samples from the design-based asymptotic sampling distributions of \(\text{logit}(\hat p^{W}_i)\).

The directEST() function allows aggregation by survey weight directly by specifying weight = "survey", without going through the steps discussed in Section 7.2. We compute the direct estimates at admin1 level and aggregate them to national estimates. When using survey weight, the aggregated point estimate at national level is the same as the national level direct estimate, but the uncertainty calculation are different as discussed above.

res_ad1agg <- directEST(data = data,
                   cluster.info = cluster.info,
                   admin = 1, 
                   weight = "population", 
                   admin.info = admin.info1, 
                   aggregation = TRUE)
head(res_ad1agg$res.admin1)
##   admin1.name direct.est   direct.var direct.logit.est direct.logit.var
## 1     Central  0.5845443 0.0007002450        0.3414564      0.011873143
## 2  Copperbelt  0.6023371 0.0006960677        0.4152124      0.012132270
## 3     Eastern  0.6505890 0.0003224625        0.6216294      0.006240117
## 4     Luapula  0.6453174 0.0007216008        0.5985190      0.013774331
## 5      Lusaka  0.5817949 0.0004319779        0.3301459      0.007296978
## 6    Muchinga  0.6884903 0.0004137610        0.7930706      0.008995197
##   direct.logit.prec  direct.se direct.lower direct.upper         cv
## 1          84.22370 0.02646214    0.5319292    0.6352999 0.04526970
## 2          82.42480 0.02638310    0.5496679    0.6527379 0.04380122
## 3         160.25341 0.01795724    0.6146268    0.6849157 0.02760151
## 4          72.59881 0.02686263    0.5910940    0.6960479 0.04162700
## 5         137.04303 0.02078408    0.5405908    0.6218883 0.03572406
## 6         111.17044 0.02034112    0.6472976    0.7269017 0.02954452
res_ad1agg$agg.natl
##      direct.est   direct.se   direct.var direct.lower direct.upper
## 2.5%   0.628894 0.008586162 7.372218e-05    0.6117665    0.6452747
res_ad1agg_bysurvey <- directEST(data = data,
                   cluster.info = cluster.info,
                   admin = 1, 
                   weight = "survey", 
                   admin.info = admin.info1, 
                   aggregation = TRUE)

head(res_ad1agg_bysurvey$agg.natl)
##      direct.est   direct.se   direct.var direct.lower direct.upper
## 2.5%  0.6343419 0.008295346 6.881276e-05    0.6176332    0.6495225

Similarly, if we set admin = 2, the output includes aggregated results at both the Admin1 and national level with aggregation = TRUE, computed by weighting the Admin 2 level estimates by their population (as specified by weight = "population").

res_ad2agg <- directEST(data = data,
                   cluster.info = cluster.info,
                   admin = 2,
                   admin.info = admin.info2,
                   weight = "population",
                   aggregation = TRUE)


res_ad2agg_bysurvey <- directEST(data = data,
                   cluster.info = cluster.info,
                   admin = 2, 
                   weight = "survey", 
                   admin.info = admin.info2, 
                   aggregation = TRUE)
head(res_ad2agg$res.admin2)
##        admin2.name.full direct.est   direct.var direct.logit.est
## 1      Central_Chibombo  0.6657436 3.575578e-03      0.688996052
## 2      Central_Chisamba  0.5531800 6.467530e-04      0.213527778
## 3      Central_Chitambo  0.5005565 5.721326e-02      0.002225886
## 4         Central_Kabwe  0.5129186 1.198784e-03      0.051686066
## 5 Central_Kapiri Mposhi  0.4903518 1.841140e-03     -0.038597405
## 6         Central_Luano  0.6666667 4.814825e-35      0.693147181
##   direct.logit.var direct.logit.prec    direct.se direct.lower direct.upper
## 1     7.220591e-02      1.384928e+01 5.979614e-02    0.5404939    0.7712991
## 2     1.058620e-02      9.446256e+01 2.543134e-02    0.5029671    0.6023310
## 3     9.154144e-01      1.092401e+00 2.391929e-01    0.1331933    0.8673198
## 4     1.920618e-02      5.206657e+01 3.462346e-02    0.4452360    0.5801308
## 5     2.948019e-02      3.392108e+01 4.290851e-02    0.4073046    0.5739350
## 6     9.750020e-34      1.025639e+33 6.938894e-18    0.6666667    0.6666667
##             cv   admin2.name admin1.name
## 1 8.981857e-02      Chibombo     Central
## 2 4.597299e-02      Chisamba     Central
## 3 4.778540e-01      Chitambo     Central
## 4 6.750284e-02         Kabwe     Central
## 5 8.750555e-02 Kapiri Mposhi     Central
## 6 1.040834e-17         Luano     Central
head(res_ad2agg$agg.admin1)
##   admin1.name direct.est  direct.se direct.lower direct.upper
## 1     Central  0.6072971 0.02240626    0.5614618    0.6488131
## 2  Copperbelt  0.6122046 0.02495829    0.5586943    0.6560586
## 3     Eastern  0.6611554 0.01534567    0.6290711    0.6888775
## 4     Luapula  0.6531875 0.02380226    0.6032167    0.6959296
## 5      Lusaka  0.5799671 0.01995427    0.5401101    0.6180307
## 6    Muchinga  0.6785380 0.01604594    0.6443338    0.7068600
res_ad2agg$agg.natl
##      direct.est   direct.se   direct.var direct.lower direct.upper
## 2.5%    0.63164 0.007168317 5.138477e-05    0.6158581    0.6436442

We can compare these aggregated results at the national level using intervalPlot().

intervalPlot(admin = 0, group = FALSE, compare = TRUE, 
  model = list(
  "Admin 0 direct estimate" = res_ad0,
  "Admin 2 direct estimate, aggregated by survey" =res_ad2agg_bysurvey,
  "Admin 1 direct estimate, aggregated by survey" =res_ad1agg_bysurvey,
  "Admin 2 direct estimate, aggregated by population" = res_ad2agg,
  "Admin 1 direct estimate, aggregated by population" = res_ad1agg))
Comparing different aggregated direct estimates at national level

Figure 7.3: Comparing different aggregated direct estimates at national level

7.4 Aggregating area-level Fay-Herriot estimates

Aggregation of the Fay-Herriot model can be similarly carried out as model based results. We first re-fit the Admin 1 model with the two updated admin.info objects.

smth_res_ad1_spatial <- fhModel(data=data,
                                cluster.info = cluster.info,
                                admin.info = admin.info1,
                                admin = 1,
                                model = "bym2",
                                aggregation = TRUE)
smth_res_ad1_spatial_survey <- fhModel(data=data,
                                      cluster.info = cluster.info,
                                      admin.info = admin.info1.survey,
                                      admin = 1,
                                      model = "bym2",
                                      aggregation = TRUE)

We also re-fit the Admin 2 model with the two updated admin.info objects.

# Run FH model without problematic clusters
smth_res_ad2_agg_pop <- fhModel(data=subset(data, !cluster %in% bad_clusters),
                                cluster.info = cluster.info,
                                admin.info = admin.info2,
                                admin = 2,
                                model = "bym2",
                                aggregation = TRUE)
smth_res_ad2_agg_survey <- fhModel(data=subset(data, !cluster %in% bad_clusters),
                                  cluster.info = cluster.info,
                                  admin.info = admin.info2.survey,
                                  admin = 2,
                                  model = "bym2",
                                  aggregation = TRUE)

And we can compare all four models when aggregating to national level, with the national direct estimate as an additional comparison.

intervalPlot(admin = 0, group = FALSE, compare = TRUE, model = list(
            "Admin 0 direct estimates" = res_ad0,
            "Admin 2 FH model, aggregated by survey" = smth_res_ad2_agg_survey,
            "Admin 1 FH model, aggregated by survey" = smth_res_ad1_spatial_survey,
            "Admin 2 FH model, aggregated by pop" = smth_res_ad2_agg_pop,
            "Admin 1 FH model, aggregated by pop" = smth_res_ad1_spatial))
Comparing different aggregated Fay-Herriot estimates at national level

Figure 7.4: Comparing different aggregated Fay-Herriot estimates at national level

We can also compare all four set of results at the Admin 1 level, with the Admin 1 direct estimate as an additional comparison.

intervalPlot(admin = 1, group = FALSE, compare = TRUE, model = list(
            "Admin 1 direct estimates" = res_ad1,
            "Admin 1 FH model" = smth_res_ad1_spatial_survey,
            "Admin 2 FH model, aggregated by survey" = smth_res_ad2_agg_survey,
            "Admin 2 FH model, aggregated by pop" = smth_res_ad2_agg_pop))
Comparing different aggregated Fay-Herriot estimates at national level

Figure 7.5: Comparing different aggregated Fay-Herriot estimates at national level

7.5 Aggregating cluster-level model

In the same manner, we fit the unstratified and stratified cluster-level model at both Admin 1 and Admin 2 levels. For simplicity of presentation, here we use only the population fractions derived from WorldPop. Results aggregating with survey weights produce very similar results and can be obtained by replacing admin.info argument with the corresponding object computed survey weights, i.e., admin.info1.survey and admin.info2.survey.

cl_res_ad1_pop <- clusterModel(data=data,
                               cluster.info=cluster.info,
                               admin.info = admin.info1,
                               stratification = FALSE,
                               model = "bym2",
                               admin = 1, 
                               aggregation = TRUE,
                               CI = 0.95)
cl_res_ad2_pop <- clusterModel(data=data,
                               cluster.info= cluster.info,
                               admin.info = admin.info2,
                               model = "bym2",
                               stratification = FALSE,
                               admin = 2, 
                               aggregation = TRUE,
                               CI = 0.95)
cl_res_ad1_T_pop <- clusterModel(data=data,
                                 cluster.info=cluster.info,
                                 admin.info = admin.info1,
                                 model = "bym2",
                                 stratification = TRUE,
                                 admin = 1, 
                                 aggregation = TRUE,
                                 CI = 0.95)
cl_res_ad2_T_pop <- clusterModel(data=data,
                                 cluster.info = cluster.info,
                                 admin.info = admin.info2,
                                 model = "bym2",
                                 stratification = TRUE,
                                 admin = 2, 
                                 aggregation = TRUE,
                                 CI = 0.95)

Finally we compare the different aggregated national estimates with the national direct estimates

intervalPlot(admin = 0, group = FALSE, compare = TRUE, model = list(
            "Admin 0 direct estimates" = res_ad0,
             "Admin 1 unstratified model, aggregated by pop" = cl_res_ad1_pop,
             "Admin 2 unstratified model, aggregated by pop" = cl_res_ad2_pop,
             "Admin 1 stratified model, aggregated by pop" = cl_res_ad1_T_pop,
             "Admin 2 stratified model, aggregated by pop" = cl_res_ad2_T_pop))
Comparing different aggregated cluster-level estimates at national level

Figure 7.6: Comparing different aggregated cluster-level estimates at national level

And we can also compare the different model estimates at the Admin 1 level, together with the Admin 1 direct estimates

intervalPlot(admin = 1, group = FALSE, compare = TRUE, model = list(
             "Admin 1 direct estimates" = res_ad1,
             "Admin 1 unstratified model" = cl_res_ad1_pop,
             "Admin 2 unstratified model, aggregated by pop" = cl_res_ad2_pop,
             "Admin 1 stratified model" = cl_res_ad1_T_pop,
             "Admin 2 stratified model, aggregated by pop" = cl_res_ad2_T_pop))
Comparing different cluster-level estimates at Admin 1 level

Figure 7.7: Comparing different cluster-level estimates at Admin 1 level

Finally, we put together all the different Admin 1 level results based on both Fay-Herriot and cluster-level models. When comparing many estimates, it is usually useful to highlight groups of similar models. This can be done by specifying a group object in the fitted models, and set group = TRUE in intervalPlot().

res_ad1$group <- "Direct Estimate"
smth_res_ad1_spatial_survey$group <- "Fay-Herriot"
smth_res_ad2_agg_survey$group <- "Fay-Herriot"
smth_res_ad2_agg_pop$group <- "Fay-Herriot"
cl_res_ad1_pop$group <- "Cluster-Level"
cl_res_ad2_pop$group <- "Cluster-Level"
cl_res_ad1_T_pop$group <- "Cluster-Level"
cl_res_ad2_T_pop$group <- "Cluster-Level"
intervalPlot(admin = 1, group = TRUE, compare = TRUE, model = list(
                 "Admin 1 direct estimates" = res_ad1,
                 "Admin 1 FH model" = smth_res_ad1_spatial_survey,
                 "Admin 2 FH model, aggregated by survey" = smth_res_ad2_agg_survey,
                 "Admin 2 FH model, aggregated by pop" = smth_res_ad2_agg_pop,
                 "Admin 1 unstratified model" = cl_res_ad1_pop,
                 "Admin 2 unstratified model, aggregated by pop" = cl_res_ad2_pop,
                 "Admin 1 stratified model" = cl_res_ad1_T_pop,
                 "Admin 2 stratified model, aggregated by pop" = cl_res_ad2_T_pop))
Comparing different model estimates at national level

Figure 7.8: Comparing different model estimates at national level

Acknowledgement

We thank Ben Mayala and Trevor Croft from the Demographic and Health Surveys (DHS) program for useful discussions in creating this R package.

References

Li, Zehang R, Bryan D Martin, Tracy Q Dong, Geir-Arne Fuglstad, Jessica Godwin, John Paige, Andrea Riebler, Samuel Clark, and Jon Wakefield. 2020. Space-Time Smoothing of Demographic and Health Indicators Using the R Package SUMMER. arXiv Preprint.
Rao, John NK, and Isabel Molina. 2015. Small Area Estimation. John Wiley & Sons.
Riebler, Andrea, Sigrunn H Sørbye, Daniel Simpson, and Håvard Rue. 2016. “An Intuitive Bayesian Spatial Model for Disease Mapping That Accounts for Scaling.” Statistical Methods in Medical Research 25 (4): 1145–65.
Rue, Håvard, Sara Martino, and Nicolas Chopin. 2009. “Approximate Bayesian Inference for Latent Gaussian Models by Using Integrated Nested Laplace Approximations.” Journal of the Royal Statistical Society Series B: Statistical Methodology 71 (2): 319–92.
Simpson, Daniel, Håvard Rue, Andrea Riebler, Thiago G Martins, and Sigrunn H Sørbye. 2017. “Penalising Model Component Complexity: A Principled, Practical Approach to Constructing Priors.”
Wakefield, Jonathan, Taylor Okonek, and Jon Pedersen. 2020. “Small Area Estimation for Disease Prevalence Mapping.” International Statistical Review 88 (2): 398–418.

  1. https://gadm.org/download_country.html↩︎

  2. https://hub.worldpop.org/project/categories?id=8↩︎

  3. https://gadm.org/download_country.html↩︎