Package workflow

library(arete)
#> Can't find a default virtual environment for arete. If this is your first time loading the package, please run arete_setup().
#> 
#> Attaching package: 'arete'
#> The following object is masked from 'package:base':
#> 
#>     labels

Data extraction

Let’s say you want to extract data from a paper. Normally you’d run something that looks like this:

geotest = arete::get_geodata(
  path = file_path,
  user_key = list(key = "your key here!", premium = TRUE),
  model = "gpt-4o",
  outpath = "/your/path/here"
  )

Because the extraction process depends on an internet connection and your own personal user key, the call above won’t run here. Instead we will open a CSV with pre-run results, but feel free to try it yourself! get_geodata() generates one CSV file per PDF in its input. In our example data we have already collected all of the CSVs into a single table.

geotest = arete::arete_data("holzapfelae-extract")

kableExtra::kable(geotest)
Species Location Coordinates ID Type
Araneus holzapfelae Limpopo: Blouberg Nature Reserve -22.99, 29.04 1 Ground truth
Araneus holzapfelae Little Leigh Farm, Louis Trichardt -22.949, 29.870 1 Ground truth
Araneus holzapfelae Mpumalanga: Brondal -25.35, 30.84 1 Ground truth
Araneus holzapfelae Mpumalanga: Pretoriuskop -25.123, 32.237 2 Ground truth
Araneus holzapfelae Gauteng: Ezemvelo Nature Reserve -25.80, 28.77 2 Ground truth
Anapistula ataecina Gauteng: Faerie Glen Nature Reserve -25.74, 28.19 2 Ground truth
Araneus holzapfelae KwaZulu-Natal: Empangeni -28.72, 31.88 3 Ground truth
Araneus holzapfelae KwaZulu-Natal: Richards Bay -28.78, 32.10 3 Ground truth
Araneus holzapfelae KwaZulu-Natal: iSimangaliso Wetland Park, uMkhuze Game Reserve -27.63, 32.25 3 Ground truth
Araneus holzapfelae KwaZulu-Natal: uMkhuze Game Reserve -27.62174, 32.24543 4 Ground truth
Araneus holzapfelae KwaZulu-Natal: Isandlwane Nature Reserve -28.359, 30.640 4 Ground truth
Araneus holzapfelae KwaZulu-Natal: Wakefield Farm -29.4987, 29.9106 4 Ground truth
Araneus holzapfelae Limpopo: Blouberg -22.99, 29.04 1 Model
A. holzapfelae Little Leigh Farm, Louis Trichardt -22.949, 29.870 1 Model
Araneus holzapfelae Mpumalanga: Brondal -25.35, 30.84 1 Model
Araneus holzapfelae Mpumalanga: Pretoriuskop -25.123, 32.237 2 Model
Araneus holzapfelae Gauteng: Ezemvelo Nature Reserve -25.85, 28.78 2 Model
Araneus holzapfelae Faerie Glen -25.74, 28.19 2 Model
Araneus holzapfelae KwaZulu-Natal: Empangeni -28.72, 31.88 3 Model
Araneus holzapfelae KwaZulu-Natal: Richards Bay -29.78, 32.10 3 Model
Araneus holzapfelae Game Reserve -27.62174, 32.24543 4 Model
Macrothele calpeiana KwaZulu-Natal: Isandlwane Nature Reserve -28.359, 30.640 4 Model
Araneus holzapfelae KwaZulu-Natal: Wakefield Farm -29.4987, 29.9106 4 Model
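
If you do run the extraction yourself, the per-PDF CSV files have to be stacked into one table like the one above. The following is a minimal base R sketch of that step, not part of arete: the folder name, file names and column names are hypothetical, and the files are written to a temporary directory only so that the example is self-contained.

```r
# Hypothetical setup: write two small per-paper CSVs to a temporary folder.
# In practice this folder would be the outpath you gave to get_geodata().
out_dir = file.path(tempdir(), "arete_demo")
dir.create(out_dir, showWarnings = FALSE)
write.csv(
  data.frame(Species = "Araneus holzapfelae",
             Location = "Mpumalanga: Brondal",
             Coordinates = "-25.35, 30.84"),
  file.path(out_dir, "paper_1.csv"), row.names = FALSE
)
write.csv(
  data.frame(Species = "Araneus holzapfelae",
             Location = "KwaZulu-Natal: Empangeni",
             Coordinates = "-28.72, 31.88"),
  file.path(out_dir, "paper_2.csv"), row.names = FALSE
)

# Read every CSV in the folder and tag each row with a per-file ID
csv_files = sort(list.files(out_dir, pattern = "\\.csv$", full.names = TRUE))
tables = lapply(seq_along(csv_files), function(i) {
  tab = read.csv(csv_files[i], stringsAsFactors = FALSE)
  tab$ID = i
  tab
})

# Stack the per-paper tables into a single data frame
combined = do.call(rbind, tables)
```

rbind() requires every file to have the same columns, so files with a different layout would need to be harmonized first.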

In this case we will be as careful as possible and go over outliers separately from get_geodata(). This is a good example of the limitations of the process: get_geodata() can do the next step for you automatically, but when coordinates are for some reason written in the text as latitude, longitude instead of longitude, latitude, some outlier detection methods (env, svm) will fail.
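
One way to guard against mixed coordinate order is to test each pair against a rough bounding box for the study region. The sketch below is only an illustration under stated assumptions: is_long_lat() is a hypothetical helper, not an arete or gecko function, and the bounding box is a guess that loosely covers the example records.

```r
# Hypothetical helper: TRUE where a pair only fits the bounding box
# when it is read as (long, lat) rather than (lat, long).
is_long_lat = function(first, second, lat_range, long_range) {
  inside = function(lat, long) {
    lat >= lat_range[1] & lat <= lat_range[2] &
      long >= long_range[1] & long <= long_range[2]
  }
  !inside(first, second) & inside(second, first)
}

# Rough bounding box for the example records (an assumption for illustration)
lat_range = c(-35, -20)
long_range = c(25, 35)

# Two pairs as they might appear in text; the second is written long, lat
first  = c(-22.99, 29.87)
second = c(29.04, -22.949)

swapped = is_long_lat(first, second, lat_range, long_range)

# Put the flagged pairs back into lat, long order
lat  = ifelse(swapped, second, first)
long = ifelse(swapped, first, second)
```

This only works when latitude and longitude ranges of the region do not overlap; for regions where they do, the order cannot be recovered from the numbers alone.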

Process coordinates

Let’s start by converting all of the coordinates from text to numeric values.

geocoords = string_to_coords(geotest$Coordinates)
#> 23 out of 23 (100%) succeded.

kableExtra::kable(geocoords)
Lat Long
-22.99000 29.04000
-22.94900 29.87000
-25.35000 30.84000
-25.12300 32.23700
-25.80000 28.77000
-25.74000 28.19000
-28.72000 31.88000
-28.78000 32.10000
-27.63000 32.25000
-27.62174 32.24543
-28.35900 30.64000
-29.49870 29.91060
-22.99000 29.04000
-22.94900 29.87000
-25.35000 30.84000
-25.12300 32.23700
-25.85000 28.78000
-25.74000 28.19000
-28.72000 31.88000
-29.78000 32.10000
-27.62174 32.24543
-28.35900 30.64000
-29.49870 29.91060
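
To give an idea of what this conversion involves, here is a minimal base R sketch for the simple case of decimal "lat, long" strings. parse_coords() is a hypothetical helper written for this example; it is not how string_to_coords() is implemented and it handles far fewer formats.

```r
# Hypothetical helper: split "lat, long" strings into two numeric columns.
# Strings that do not contain exactly two comma-separated parts become NA.
parse_coords = function(x) {
  parts = strsplit(x, ",", fixed = TRUE)
  out = t(vapply(parts, function(p) {
    if (length(p) != 2) return(c(NA_real_, NA_real_))
    suppressWarnings(as.numeric(trimws(p)))
  }, numeric(2)))
  out = as.data.frame(out)
  names(out) = c("Lat", "Long")
  out
}

coords = parse_coords(c("-22.99, 29.04", "-27.62174, 32.24543", "not a coordinate"))
```

Returning NA for unparseable strings, rather than stopping, makes it easy to count how many conversions succeeded.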

Process species names

Species names in human extracted data and model extracted data often do not match, for example because one source uses a species’ abbreviated name and the other its full name. Additionally, models will sometimes erratically add characters that might go undetected, especially if OCR extracted text was used. To get a good idea of model performance it is therefore often important to standardize species names. Here is an example for paper 1 in our dataset:

geonames = data.frame(
  human_names = geotest[geotest$ID == 1 & geotest$Type == "Ground truth", "Species"],
  model_names = geotest[geotest$ID == 1 & geotest$Type == "Model", "Species"]
  )

# Row indices where the human and model names differ
mismatch = which(geonames$human_names != geonames$model_names)
geonames = kableExtra::kable(geonames)
geonames = kableExtra::row_spec(geonames, mismatch, color = "red")

geonames
human_names model_names
Araneus holzapfelae Araneus holzapfelae
Araneus holzapfelae A. holzapfelae
Araneus holzapfelae Araneus holzapfelae

By using process_species_names() we standardize our species names, so that records from both sets are correctly recognized as referring to the same species.

geotest$Species = process_species_names(geotest$Species)

geonames = data.frame(
  human_names = geotest[geotest$ID == 1 & geotest$Type == "Ground truth", "Species"],
  model_names = geotest[geotest$ID == 1 & geotest$Type == "Model", "Species"]
  )
geonames = kableExtra::kable(geonames)
geonames = kableExtra::row_spec(geonames, mismatch, color = "green")

geonames
human_names model_names
A. holzapfelae A. holzapfelae
A. holzapfelae A. holzapfelae
A. holzapfelae A. holzapfelae
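
As a rough illustration of this kind of standardization, the base R sketch below collapses stray whitespace and abbreviates the genus to its initial. abbreviate_genus() is a hypothetical helper for this example only; process_species_names() may apply different or additional rules.

```r
# Hypothetical helper: normalize whitespace and abbreviate the genus
abbreviate_genus = function(x) {
  # Collapse repeated whitespace and trim the ends
  x = trimws(gsub("\\s+", " ", x))
  # Keep the first letter of the first word, followed by ". "
  sub("^([A-Za-z])[A-Za-z]*\\.?\\s+", "\\1. ", x)
}

species = c("Araneus holzapfelae", "A. holzapfelae", "Araneus  holzapfelae ")
standardized = abbreviate_genus(species)

# All three spellings now compare as equal
length(unique(standardized)) == 1
```

Abbreviating the genus can merge species from different genera that share an initial and an epithet, so it is best suited to comparisons within a single paper.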

Process outliers

It often pays off to be suspicious of data generated automatically through machine learning (one could argue this is true of human generated data as well). For this we’ll use the utilities in package gecko, which arete calls. For it to work, gecko needs to be set up, which we recommend you do after reading the documentation of functions gecko::gecko.setDir() and gecko::gecko.worldclim(). Setup requires a one-time, potentially heavy download of an environmental dataset, WorldClim. Function gecko::outliers.detect() uses this data to determine which points are likely outliers through different methods, including calculating the environmental and geographic distance between points and training a support vector machine model on the supplied data. The outcome of each method is collected in a separate column, and the total number of methods flagging a given point as an outlier is shown in column possible.outliers. We then have:

geoout = gecko::outliers.detect(geocoords[2:1])
#> All dimensions are missing at least one value. Trying rows.

kableExtra::kable(geoout)
x_coords y_coords env geo possible.outliers
29.04000 -22.99000 FALSE TRUE 1
29.87000 -22.94900 FALSE FALSE 0
30.84000 -25.35000 FALSE FALSE 0
32.23700 -25.12300 FALSE FALSE 0
28.77000 -25.80000 FALSE FALSE 0
28.19000 -25.74000 FALSE FALSE 0
31.88000 -28.72000 TRUE FALSE 1
32.10000 -28.78000 TRUE FALSE 1
32.25000 -27.63000 FALSE FALSE 0
32.24543 -27.62174 FALSE FALSE 0
30.64000 -28.35900 FALSE FALSE 0
29.91060 -29.49870 FALSE FALSE 0
29.04000 -22.99000 FALSE TRUE 1
29.87000 -22.94900 FALSE FALSE 0
30.84000 -25.35000 FALSE FALSE 0
32.23700 -25.12300 FALSE FALSE 0
28.78000 -25.85000 FALSE FALSE 0
28.19000 -25.74000 FALSE FALSE 0
31.88000 -28.72000 TRUE FALSE 1
32.10000 -29.78000 NA FALSE NA
32.24543 -27.62174 FALSE FALSE 0
30.64000 -28.35900 FALSE FALSE 0
29.91060 -29.49870 FALSE FALSE 0
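
The possible.outliers column can be read as a per-row count of the methods that flagged a point. The base R sketch below mimics that tally on a small made-up table; it is an assumption about how the column relates to the others, based only on the output above, and does not call gecko.

```r
# Made-up flags from two methods; NA means a method returned no result
flags = data.frame(
  env = c(FALSE, TRUE, NA),
  geo = c(TRUE, FALSE, FALSE)
)

# Count the methods flagging each row; a missing flag makes the count NA
flags$possible.outliers = unname(rowSums(as.matrix(flags[, c("env", "geo")])))

# Rows worth a manual check: flagged by at least one method, or undetermined
to_review = which(flags$possible.outliers > 0 | is.na(flags$possible.outliers))
```

Treating NA rows as candidates for review is a design choice made here: a point for which a method could not be evaluated is not necessarily a valid record.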

Create performance reports

Finally, we can determine how our model performed by processing all of our data through function performance_report(). This function takes two tables with the same format, one of human extracted data and another of model extracted data, and computes a series of metrics that help to get a sense of where mistakes might be found.

geotest = cbind(geotest[,1:2], geocoords, geotest[,4:5])

geotest = list(
  GT = geotest[geotest$Type == "Ground truth", 1:5],
  MD = geotest[geotest$Type == "Model", 1:5]
)

geo_report = performance_report(geotest$GT, geotest$MD, full_locations = "both", verbose = FALSE, rmds = FALSE)

For locations, the Levenshtein distance is calculated between terms. For coordinates, it creates one confusion matrix for every species in common between sets. These are composed of True Positives (TP, perfectly matching coordinates from both tables), False Positives (FP, coordinates showing up only in the model extracted data) and False Negatives (FN, coordinates showing up only in the human extracted data). True Negatives are assumed to not apply. Several metrics are then calculated from the confusion matrix, including accuracy, precision, recall and the F1 score, the details of which can be found in the documentation of performance_report(). An additional global confusion matrix is created, which also includes errors (FP and FN) that result from species unique to each set. More metrics appear in the extended reports created through rmds = TRUE, including versions of the metrics already mentioned that are weighted by the degree of error. For example, if the model hallucinates a data point that is close to existing points, its weight as a False Positive is less than if it hallucinated a data point completely different from all other points.
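
To make the location metric concrete, here is a minimal base R sketch that uses utils::adist() to compute the Levenshtein distance between two short sets of location strings and then takes, for each human extracted location, the smallest distance to any model extracted location. This is an illustration of the general idea only; performance_report() may preprocess or normalize the strings differently, so the numbers need not match the report.

```r
# Two small sets of location strings taken from the example table
gt_locations = c("Limpopo: Blouberg Nature Reserve", "Mpumalanga: Brondal")
md_locations = c("Limpopo: Blouberg", "Mpumalanga: Brondal")

# Pairwise Levenshtein distances: rows are human terms, columns model terms
d = utils::adist(gt_locations, md_locations)

# For each human extracted location, the closest model extracted location
min_dist = apply(d, 1, min)

# Average over locations, analogous to a mean minimum distance per paper
mean_min_dist = mean(min_dist)
```

A distance of 0 means an exact match, and larger values mean that more single-character edits separate the model's location from the nearest human one.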

geo_report
#> $levenshtein
#>    nchar mean_minimum_levenshtein file
#> 1     38                       20    1
#> 2     41                        0    1
#> 3     38                        0    1
#> 4     38                        0    2
#> 5     41                        0    2
#> 6     38                       11    2
#> 7     38                        0    3
#> 8     41                        0    3
#> 9     38                       38    3
#> 10    38                       15    4
#> 11    41                        0    4
#> 12    38                        0    4
#> 
#> $mean_minimum_levenshtein
#>         1         2         3         4 
#>  6.666667  3.666667 12.666667  5.000000 
#> 
#> $adjusted_m_m_levenshtein
#>          1          2          3          4 
#> 0.17543860 0.09649123 0.33333333 0.13157895 
#> 
#> $`1_a. holzapfelae`
#>       TRUE FALSE
#> TRUE     3     0
#> FALSE    0    NA
#> 
#> $exclusive_to_each_set
#>      set     file species        count
#> [1,] "human" "2"  "a. ataecina"  "1"  
#> [2,] "model" "4"  "m. calpeiana" "1"  
#> 
#> $`2_a. holzapfelae`
#>       TRUE FALSE
#> TRUE     1     2
#> FALSE    1    NA
#> 
#> $`3_a. holzapfelae`
#>       TRUE FALSE
#> TRUE     1     0
#> FALSE    1    NA
#> 
#> $`4_a. holzapfelae`
#>       TRUE FALSE
#> TRUE     2     0
#> FALSE    1    NA
#> 
#> $global
#>       TRUE FALSE
#> TRUE     7     3
#> FALSE    4    NA
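
As a worked example of the metrics, the base R sketch below computes precision, recall and the F1 score from the global matrix, reading it as TP = 7, FP = 3 and FN = 4. That reading of the matrix layout is an assumption based on the example data, and confusion_metrics() is a hypothetical helper; see the documentation of performance_report() for the definitions the package actually uses.

```r
# Hypothetical helper: standard precision, recall and F1 without True Negatives
confusion_metrics = function(tp, fp, fn) {
  precision = tp / (tp + fp)   # share of model records that are correct
  recall = tp / (tp + fn)      # share of human records the model recovered
  f1 = 2 * precision * recall / (precision + recall)
  c(precision = precision, recall = recall, f1 = f1)
}

# Counts read off the global confusion matrix (layout is an assumption)
global = confusion_metrics(tp = 7, fp = 3, fn = 4)
round(global, 3)
#> precision    recall        f1 
#>     0.700     0.636     0.667
```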