Simulating datasets is a powerful and varied tool when conducting
unmarked analyses. Writing our own code to simulate a
dataset based on a given model is an excellent learning tool, and can
help us test if a given model is generating the expected results. If we
simulate a series of datasets based on a fitted model, and calculate a
statistic from each of those fits, we can generate a distribution of the
statistic - this is what the parboot function does. This
can be helpful, for example, when testing goodness of fit. Finally,
simulation can be a useful component of power analysis when a
closed-form equation for power is not available.
unmarked provides two different ways of generating
simulated datasets, depending on the stage we are at in the modeling
process.
For (1), we simply call the simulate method on our
fitted model object and new dataset(s) are generated. This is the
approach used by parboot. In this vignette we will focus on
(2), a more flexible approach to simulation, also using the
simulate method, that allows us to generate a dataset
corresponding to any unmarked model from scratch.
We will need to provide several pieces of information to
simulate in order to simulate a dataset from scratch in
unmarked.
unmarkedFrame of the appropriate type defining the
desired experimental designmodel), if the
unmarkedFrame is used for multiple model typescoefs)
controlling the simulation, which correspond to the formula(s) specified
earlier.The easiest way to demonstrate how to use simulate is to
look at an example: we’ll start with a simple one for occupancy.
unmarkedFrameSuppose we want to simulate a single-season occupancy dataset in
which site occupancy is affected by elevation. The first step is to
create an unmarkedFrame object of the appropriate type,
which defines the experimental design and includes any covariates we
want to use in the simulation. Since we want to simulate an occupancy
dataset, we’ll create an unmarkedFrameOccu.
The unmarkedFrameOccu function takes three arguments:
the observation matrix y, the site covariates
siteCovs, and the observation-level covariates
obsCovs. The dimensions of y define how many
sites and replicate samples the study includes. We’ll create a blank
y matrix (i.e., filled with NAs) of dimension
300 x 8, indicating we want our study to have 300 sites and 8 sampling
occasions. The values you put in this y matrix don’t
matter, you can put anything in there you want as they’ll be overwritten
with the simulated values later. It’s only used to define the number of
sites and occasions.
Earlier we said we want to include an elevation covariate, so we’ll simulate the covariate now and add it to a data frame. We could create several covariates here, including factors, etc.
We’re not using any observation covariates, so we can now make the
complete unmarkedFrameOccu:
## Data frame representation of unmarkedFrame object.
##    y.1 y.2 y.3 y.4 y.5 y.6 y.7 y.8        elev
## 1   NA  NA  NA  NA  NA  NA  NA  NA -0.56047565
## 2   NA  NA  NA  NA  NA  NA  NA  NA -0.23017749
## 3   NA  NA  NA  NA  NA  NA  NA  NA  1.55870831
## 4   NA  NA  NA  NA  NA  NA  NA  NA  0.07050839
## 5   NA  NA  NA  NA  NA  NA  NA  NA  0.12928774
## 6   NA  NA  NA  NA  NA  NA  NA  NA  1.71506499
## 7   NA  NA  NA  NA  NA  NA  NA  NA  0.46091621
## 8   NA  NA  NA  NA  NA  NA  NA  NA -1.26506123
## 9   NA  NA  NA  NA  NA  NA  NA  NA -0.68685285
## 10  NA  NA  NA  NA  NA  NA  NA  NA -0.44566197Since unmarkedFrameOccu is used by both the
single-season occupancy model (occu) and the Royle-Nichols
occupancy model (occuRN), we need to tell
unmarked which one to use.
Most unmarkedFrame types in unmarked are
used by only one model fitting function, so this step is often
unnecessary.
Take a look at the help file for occu. When fitting a
single-season occupancy model we need to provide, in addition to the
data, the formula argument defining the model structure.
We’ll need to provide these same argument(s) to simulate.
Many fitting functions will have multiple required arguments, such as
the mixture distribution to use, key functions, etc.
Here we specify a double right-hand-side formula as required by
occu, specifying an effect of elevation on occupancy.
The model structure, as defined by the formula above, implies a
certain set of parameter/coefficient values (intercepts, slopes) we need
to supply to simulate. These need to be supplied as a named
list, where each list element corresponds to one submodel (such as
state for occupancy and det for detection).
Each list element is a numeric vector of the required parameter values.
It can be tricky to figure out the structure of this list, so
simulate allows you to not include it at first, in which
case the function will return a template for you to fill in.
## coefs should be a named list of vectors, with the following structure
## (replace 0s with your values):
##
## $state
## (Intercept)        elev
##           0           0
##
## $det
## (Intercept)
##           0## Error: Specify coefs argument as shown aboveWe need to supply a list with two elements state and
det. The state element contains two values,
the intercept and the slope corresponding to elevation. The
det element contains only the intercept since we have no
covariates on detection. Note that all values supplied in this list
must be on the inverse link scale, which will depend on the
specific submodel used. So for example, a value of 0 for
det implies a detection probability of 0.5, because we’re
using the logit link function.
## [1] 0.5Now let’s make our own coefs list:
Here we’re setting a negative effect of elevation on occupancy.
We now have all the pieces to simulate a dataset.
## Assumed parameter order for state:
## (Intercept), elev## Assumed parameter order for det:
## (Intercept)The result is always a list of unmarkedFrames. By
default, we just get one, but we can get more with the nsim
argument.
## Data frame representation of unmarkedFrame object.
##    y.1 y.2 y.3 y.4 y.5 y.6 y.7 y.8        elev
## 1    1   1   1   1   1   1   1   1 -0.56047565
## 2    0   0   0   0   0   0   0   0 -0.23017749
## 3    0   0   0   0   0   0   0   0  1.55870831
## 4    0   0   0   0   0   0   0   0  0.07050839
## 5    0   0   0   0   0   0   0   0  0.12928774
## 6    1   0   0   0   0   0   0   0  1.71506499
## 7    0   0   0   0   0   0   0   0  0.46091621
## 8    0   0   0   0   0   0   0   0 -1.26506123
## 9    1   1   1   0   1   0   1   0 -0.68685285
## 10   1   0   0   1   1   0   0   0 -0.44566197The simulated unmarkedFrame now contains y
values and is ready to use.
As a quick check, let’s fit a model to our simulated dataset.
## 
## Call:
## occu(formula = ~1 ~ elev, data = out[[1]])
## 
## Occupancy (logit-scale):
##             Estimate    SE     z  P(>|z|)
## (Intercept)   -0.146 0.120 -1.22 0.223241
## elev          -0.572 0.136 -4.20 0.000027
## 
## Detection (logit-scale):
##  Estimate     SE      z P(>|z|)
##    -0.038 0.0612 -0.621   0.535
## 
## AIC: 1929.538 
## Number of sites: 300We get out roughly the same parameters that we put in, as expected.
The gdistremoval function fits the model of Amundson et al. (2014), which estimates
abundance using a combination of distance sampling and removal sampling
data. When simulating a dataset based on this model, we have to provide
several additional pieces of information related to the structure of the
distance and removal sampling analyses.
unmarkedFrameFirst create the appropriate type of unmarkedFrame,
which is unmarkedFrameGDR. There’s two y-matrices: one for
distance sampling and one for removal sampling. We’ll create a dataset
with 4 distance bins and 5 removal periods.
set.seed(123)
M <- 100
Jdist <- 4
Jrem <- 5
y_dist <- matrix(NA, M, Jdist)
y_rem <- matrix(NA, M, Jrem)We’ll create an elevation site covariate and a wind observation
covariate. Observation-level covariates are only used by the removal
part of the model, so they should have the same number of values as
y_rem.
Finally we can create the unmarkedFrameGDR. We’ll also
need to specify the distance bins and the units for the distance part of
the model here. See ?unmarkedFrameGDR for more
information.
umf <- unmarkedFrameGDR(yRem = y_rem, yDist = y_dist, siteCovs = site_covs, obsCovs = obs_covs,
                        dist.breaks = c(0,25,50,75,100), unitsIn = 'm')## Data frame representation of unmarkedFrame object.
##    yDist.1 yDist.2 yDist.3 yDist.4 yRem.1 yRem.2 yRem.3 yRem.4 yRem.5
## 1       NA      NA      NA      NA     NA     NA     NA     NA     NA
## 2       NA      NA      NA      NA     NA     NA     NA     NA     NA
## 3       NA      NA      NA      NA     NA     NA     NA     NA     NA
## 4       NA      NA      NA      NA     NA     NA     NA     NA     NA
## 5       NA      NA      NA      NA     NA     NA     NA     NA     NA
## 6       NA      NA      NA      NA     NA     NA     NA     NA     NA
## 7       NA      NA      NA      NA     NA     NA     NA     NA     NA
## 8       NA      NA      NA      NA     NA     NA     NA     NA     NA
## 9       NA      NA      NA      NA     NA     NA     NA     NA     NA
## 10      NA      NA      NA      NA     NA     NA     NA     NA     NA
##           elev      wind.1     wind.2      wind.3      wind.4      wind.5
## 1  -0.56047565 -0.71040656  0.2568837 -0.24669188 -0.34754260 -0.95161857
## 2  -0.23017749 -0.04502772 -0.7849045 -1.66794194 -0.38022652  0.91899661
## 3   1.55870831 -0.57534696  0.6079643 -1.61788271 -0.05556197  0.51940720
## 4   0.07050839  0.30115336  0.1056762 -0.64070601 -0.84970435 -1.02412879
## 5   0.12928774  0.11764660 -0.9474746 -0.49055744 -0.25609219  1.84386201
## 6   1.71506499 -0.65194990  0.2353866  0.07796085 -0.96185663 -0.07130809
## 7   0.46091621  1.44455086  0.4515041  0.04123292 -0.42249683 -2.05324722
## 8  -1.26506123  1.13133721 -1.4606401  0.73994751  1.90910357 -1.44389316
## 9  -0.68685285  0.70178434 -0.2621975 -1.57214416 -1.51466765 -1.60153617
## 10 -0.44566197 -0.53090652 -1.4617556  0.68791677  2.10010894 -1.28703048gdistremovalLooking at ?gdistremoval, required arguments include
lambdaformula, removalformula, and
distanceformula. We need to set these formula values to
control the simulation. We’ll also use the negative binomial
distribution for abundance.
As in the previous section, we’ll leave the coefs
argument blank at first and get the correct output structure.
## coefs should be a named list of vectors, with the following structure
## (replace 0s with your values):
##
## $lambda
## (Intercept)        elev
##           0           0
##
## $alpha
## alpha
##     0
##
## $dist
## (Intercept)
##           0
##
## $rem
## (Intercept)        wind
##           0           0## Error: Specify coefs argument as shown aboveWe need to set two values for the abundance (lambda)
model on the log scale, one for dist which represents the
distance function sigma parameter (log scale), one for the negative
binomial dispersion parameter alpha (log scale), and two
for the removal detection probability model (logit scale).
We’ll pick the (relatively arbitrary) values below:
Now provide everything to simulate. Note we don’t need
to provide the model argument because
unmarkedFrameGDR is used for only one fitting function
(gdistremoval).
We’ll simulate 2 datasets.
out <- simulate(umf, lambdaformula=~elev, removalformula=~wind, distanceformula=~1,
                coefs=cf, mixture="NB", nsim=2)## Assumed parameter order for lambda:
## (Intercept), elev## Assumed parameter order for alpha:
## alpha## Assumed parameter order for dist:
## (Intercept)## Assumed parameter order for rem:
## (Intercept), wind## [[1]]
## Data frame representation of unmarkedFrame object.
##    yDist.1 yDist.2 yDist.3 yDist.4 yRem.1 yRem.2 yRem.3 yRem.4 yRem.5
## 1        0       1       0       0      1      0      0      0      0
## 2        1       0       1       2      1      1      1      0      1
## 3        4       6       7       3      7      3      5      2      3
## 4        1       0       1       0      0      0      0      1      1
## 5        0       0       1       0      1      0      0      0      0
## 6        0       0       0       0      0      0      0      0      0
## 7        3       3       0       1      1      0      1      3      2
## 8        0       0       0       0      0      0      0      0      0
## 9        0       0       0       0      0      0      0      0      0
## 10       0       0       0       0      0      0      0      0      0
##           elev      wind.1     wind.2      wind.3      wind.4      wind.5
## 1  -0.56047565 -0.71040656  0.2568837 -0.24669188 -0.34754260 -0.95161857
## 2  -0.23017749 -0.04502772 -0.7849045 -1.66794194 -0.38022652  0.91899661
## 3   1.55870831 -0.57534696  0.6079643 -1.61788271 -0.05556197  0.51940720
## 4   0.07050839  0.30115336  0.1056762 -0.64070601 -0.84970435 -1.02412879
## 5   0.12928774  0.11764660 -0.9474746 -0.49055744 -0.25609219  1.84386201
## 6   1.71506499 -0.65194990  0.2353866  0.07796085 -0.96185663 -0.07130809
## 7   0.46091621  1.44455086  0.4515041  0.04123292 -0.42249683 -2.05324722
## 8  -1.26506123  1.13133721 -1.4606401  0.73994751  1.90910357 -1.44389316
## 9  -0.68685285  0.70178434 -0.2621975 -1.57214416 -1.51466765 -1.60153617
## 10 -0.44566197 -0.53090652 -1.4617556  0.68791677  2.10010894 -1.28703048
## 
## [[2]]
## Data frame representation of unmarkedFrame object.
##    yDist.1 yDist.2 yDist.3 yDist.4 yRem.1 yRem.2 yRem.3 yRem.4 yRem.5
## 1        0       1       0       0      1      0      0      0      0
## 2        0       1       0       0      0      1      0      0      0
## 3        2       1       2       1      4      2      0      0      0
## 4        0       0       0       0      0      0      0      0      0
## 5        1       1       0       1      1      1      0      1      0
## 6        0       0       0       0      0      0      0      0      0
## 7        1       1       1       0      1      0      2      0      0
## 8        0       0       0       0      0      0      0      0      0
## 9        0       0       0       1      1      0      0      0      0
## 10       0       0       0       0      0      0      0      0      0
##           elev      wind.1     wind.2      wind.3      wind.4      wind.5
## 1  -0.56047565 -0.71040656  0.2568837 -0.24669188 -0.34754260 -0.95161857
## 2  -0.23017749 -0.04502772 -0.7849045 -1.66794194 -0.38022652  0.91899661
## 3   1.55870831 -0.57534696  0.6079643 -1.61788271 -0.05556197  0.51940720
## 4   0.07050839  0.30115336  0.1056762 -0.64070601 -0.84970435 -1.02412879
## 5   0.12928774  0.11764660 -0.9474746 -0.49055744 -0.25609219  1.84386201
## 6   1.71506499 -0.65194990  0.2353866  0.07796085 -0.96185663 -0.07130809
## 7   0.46091621  1.44455086  0.4515041  0.04123292 -0.42249683 -2.05324722
## 8  -1.26506123  1.13133721 -1.4606401  0.73994751  1.90910357 -1.44389316
## 9  -0.68685285  0.70178434 -0.2621975 -1.57214416 -1.51466765 -1.60153617
## 10 -0.44566197 -0.53090652 -1.4617556  0.68791677  2.10010894 -1.28703048As a check, we’ll fit the same model used for simulation to one of the datasets.
gdistremoval(lambdaformula=~elev, removalformula=~wind, distanceformula=~1, data=out[[1]],
             mixture="NB")## 
## Call:
## gdistremoval(lambdaformula = ~elev, removalformula = ~wind, distanceformula = ~1, 
##     data = out[[1]], mixture = "NB")
## 
## Abundance (log-scale):
##             Estimate    SE     z  P(>|z|)
## (Intercept)    1.651 0.152 10.89 1.28e-27
## elev           0.887 0.142  6.25 4.16e-10
## 
## Dispersion (log-scale):
##  Estimate    SE     z P(>|z|)
##     0.118 0.242 0.486   0.627
## 
## Distance (log-scale):
##  Estimate    SE    z P(>|z|)
##      3.89 0.053 73.4       0
## 
## Removal (logit-scale):
##             Estimate    SE     z  P(>|z|)
## (Intercept)   -0.901 0.141 -6.38 1.74e-10
## wind          -0.373 0.088 -4.24 2.20e-05
## 
## AIC: 1162.882 
## Number of sites: 100Looks good.
The simulate function provides a flexible tool for
simulating data from any model in unmarked. These datasets
can be used for a variety of purposes, such as for teaching examples,
testing models, or developing new tools that work with
unmarked. Additionally, simulating datasets is a key
component of the power analysis workflow in unmarked - see
the power analysis vignette for more examples.