We use this vignette to reproduce the real world example for local
outlier detection analysed and described in Puchhammer and Filzmoser
(2023). All functions are included in the package
ssMRCD
.
library(ssMRCD)
library(ggplot2)
library(dplyr)
library(rnaturalearth)
library(rnaturalearthdata)
#>
#> Attache Paket: 'rnaturalearthdata'
#> Das folgende Objekt ist maskiert 'package:rnaturalearth':
#>
#> countries110
The original data from GeoSphere Austria (2022) is pre-cleaned and
saved in the data frame object weatherAUT2021
. Additional
information can be found on the helping page.
? weatherAUT2021
Load the data from the package.
data("weatherAUT2021")
head(weatherAUT2021)
#> p s vv t rsum rel name lon lat alt
#> 1 941.35 154.5 2.15 6.15 49.5 77.5 AIGEN/ENNSTAL 14.13833 47.53278 641
#> 2 945.80 172.5 2.95 6.25 39.5 76.0 ALLENTSTEIG 15.36694 48.69083 599
#> 3 985.50 152.0 2.55 8.20 46.5 78.0 AMSTETTEN 14.89500 48.10889 266
#> 4 893.20 123.5 1.85 5.10 66.0 74.5 BAD GASTEIN 13.13333 47.11055 1092
#> 5 984.50 176.0 1.50 8.45 40.5 75.0 BAD GLEICHENBERG 15.90361 46.87222 269
#> 6 992.05 188.5 2.20 8.95 60.5 77.0 BAD RADKERSBURG 15.99333 46.69222 207
rownames(weatherAUT2021) = weatherAUT2021$name
For nice plotting use the package rnaturalearth
.
# Load Austria as sf object
<- ne_countries(scale = "medium", country = "Austria", returnclass = "sf")
austria
= ggplot() +
g_boundary geom_sf(data = austria, fill = "transparent", color = "black") +
theme_classic()
To apply the ssMRCD and finally the local outlier detection function
local_outliers_ssMRCD
, we need to specify groups based on
spatial proximity and the relative influence/weights between the
groups/neighborhoods. Since a prominent part of Austria is Alpine
landscape, we choose many small groups in based on a grid. We construct
the grid based on the longitude and latitude values and group the
observations into neighborhoods. We summarize neighborhoods for a very
small number of observations.
# group by spatial grid
= c(9:16, 18)
cut_lon = c(46, 47, 47.5, 48, 49)
cut_lat = groups_gridbased(x = weatherAUT2021$lon,
groups y = weatherAUT2021$lat,
cutx = cut_lon,
cuty = cut_lat)
table(groups)
#> groups
#> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
#> 6 1 1 12 1 1 9 5 9 3 1 8 6 7 8 12 6 5 9 5 7 7 17 7 9 21
# join particularly small groups together
== 2] = 1
groups[groups == 3] = 4
groups[groups == 5] = 4
groups[groups == 6] = 7
groups[groups == 11] = 15
groups[groups = as.numeric(as.factor(groups))
groups table(groups)
#> groups
#> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
#> 7 14 10 5 9 3 8 6 7 9 12 6 5 9 5 7 7 17 7 9 21
The final neighborhood structure can be seen in the following plot, where the neighborhoods are differently colorized.
= g_boundary +
g_groups geom_text(data = weatherAUT2021, aes(x = lon, y = lat, col = as.factor(groups), label = groups)) +
geom_hline(aes(yintercept = cut_lat), lty = "dashed", col = "gray") +
geom_vline(aes(xintercept = cut_lon), lty = "dashed", col = "gray") +
labs(x = "Longitude", y = "Latitude", title = "Austrian Weather Stations: Neighborhood Structure") +
theme_classic() +
theme(legend.position = "none")
g_groups
A natural choice for the weights
matrix when using a
neighborhood structure based on spatial coordinates is a geographical
inverse-distance weighting matrix. It is the default value for the
local_outliers_ssMRCD
function but can explicitly be
calculated using the function geo_weights
. This function
returns also the center of each neighborhood. E.g. for neighborhood 4 we
have the following weights, corresponding to the transparancy of the
arrows.
= geo_weights(coordinates = weatherAUT2021[, c("lon", "lat")],
GW groups = groups)
= g_groups +
g_weights labs(title = "Austrian Weather Stations: Weighting Matrix W") +
geom_segment(aes(x = GW$centersN[4, 1],
y = GW$centersN[4, 2],
xend = GW$centersN[-4, 1]-0.1,
yend = GW$centersN[-4, 2]-0.05,
alpha = GW$W[4, -4]),
arrow = arrow(length = unit(0.25, "cm")),
linewidth = 2,
col = "blue")+
geom_text(aes(x = GW$centersN[, 1],
y = GW$centersN[, 2],
label = 1:length(GW$centersN[, 2])))
g_weights
The default amount of smoothing is lambda = 0.5
suggested by the parameter sensitivity studies. The default setting in
the function ssMRCD
applies the default setting and returns
the model.
? ssMRCD
set.seed(123)
= ssMRCD(X = weatherAUT2021[, 1:6],
out groups = groups,
weights = GW$W,
lambda = 0.5,
tuning = NULL)
class(out)
#> [1] "ssMRCD" "list"
There are also some build in plotting functions and methods.
# plot the tolerance ellipses in the geographical space
= plot(x = out,
plots type = c("convergence","ellipses", "ellipses_geo"),
geo_centers = GW$centersN,
variables = c("s", "t"),
manual_rescale = 0.001)
$plot_geoellipses +
plotsgeom_sf(data = austria, fill = "transparent", color = "black")
$plot_ellipses plots
$plot_convergence plots
The parameter tuning is focused on the parameter lambda
,
other hyper-parameters like k
or the neighborhood structure
can similarly be tuned in a similar fashion.
There are two approaches for tuning lambda
along the
grid given to the lambda
argument. The first approach is
based on detecting artificially introduced outliers. Thus, the
geographic coordinates and parameters for the local outlier detection
method need to be provided in the list given to tuning
. For
using this tuning method, you need to set
tuning$method = "local contamination"
.
set.seed(123)
= ssMRCD(X = weatherAUT2021[, 1:6],
out groups = groups,
weights = GW$W,
tuning = list(method = "local contamination",
plot = TRUE,
k = 10,
coords = weatherAUT2021[, c("lon", "lat")],
cont = 0.05,
repetitions = 3),
lambda = c(0.25, 0.5, 0.75))
Some comments on using the “local contamination” tuning approach. Observations are switched and thus, this should introduce outliers that we know of. Nevertheless, you should keep in mind that the observations are switched entirely at randomly. If unlucky, we might switch similar observations and, thus, the measured performance of the detection method might suffer. However, since we want to compare the different parameter settings on the same contaminated data sets this should not affect the fine tuning itself.
Moreover, we implicitly assume that the original data set has no
outliers. This is in general not the case. Thus, the false-negative rate
(FNR) is not the true FNR regarding all outliers but possibly biased.
Nevertheless, the parameter tuning is a way to see the effects of the
parameter setting on the FNR and a proxy of the false-positive rate
(FPR), which is given by the total number of found outliers. Plots
returned by the function ssMRCD
if
tuning$plots = TRUE
show both criteria.
The second approach is based on the minimizing residuals described in
Puchhammer, Wilms and Filzmoser (2024). This is generally faster and
does not depend on spatial coordinates. It is especially useful for
multi-group data that is not spatial. A plot is provided if
tuning$plot = TRUE
.
set.seed(123)
= ssMRCD(X = weatherAUT2021[, 1:6],
out groups = groups,
weights = GW$W,
tuning = list(method = "residuals",
plot = TRUE),
lambda = seq(0, 1, 0.1))
Either by specifying a value for k
like
k = 10
which gives the number of observations to compare
with or the value for dist
(e.g. dist = 1.5
)
as the radius of a circle around the observation where all observations
inside are compared to the initial observation. A default value for
k
that is used among various local outlier detection
methods is 10. However, depending on the spatial structure of the data
it makes sense to use other values as well.
If we are only interested in the covariance estimation we can use the
function ssMRCD
. A more convenient way if we are interested
in local outlier detection is to use the function
local_outliers_ssMRCD
, which already embeds the covariance
estimation. (Fine tuning of lambda
for covariance
estimation can only be done using the function ssMRCD
.)
? locOuts
set.seed(123)
= locOuts(data = weatherAUT2021[, 1:6],
res coords = weatherAUT2021[, c("lon", "lat")],
groups = groups,
lambda = 0.5,
k = 10)
summary(res)
The found outliers can be accessed by the list element
outliers
.
cat(weatherAUT2021$name[res$outliers], sep = ",\n")
For the local outlier detection method there are several plots
available regarding diagnostics (see ?plot.locOuts
).
? plot.locOuts
The histogram shows all next distances together with the cut-off value with a specified number of bins.
plot(res, type = "hist")$p_hist
The spatial plot shows the outliers on the map, the 3D-plot as well,
but with an additional axis for the next-distance/next-distance divided
by cut-off value. The line plot shows the scaled values and the
specified area in the map. Orange-coloured lines are other outliers on
the map, the darkred colored line is the selected observation
(focus
).
plot(res, type = "spatial")$p_spatial +
geom_sf(data = austria, fill = "transparent", color = "black")
# SCHOECKL
plot(res, type = "pcp", observation = "SCHOECKL", scale = "zscore")$p_pcp
GeoSphere Austria (2022): https://data.hub.geosphere.at.
Puchhammer P. and Filzmoser P. (2023): Spatially smoothed robust covariance estimation for local outlier detection. https://doi.org/10.1080/10618600.2023.2277875