Package workflow

Data extraction

Let’s say you want to extract data from a paper. Normally you’d run something that looks like this:

geotest = arete::get_geodata(
  path = file_path,
  user_key = list(key = "your key here!", premium = TRUE),
  model = "gpt-4o",
  outpath = "/your/path/here"
  )

Because the extraction process depends on an internet connection and your own personal user key, this example won’t run here. Instead we will open a csv with pre-run results, but feel free to try it yourself! get_geodata() generates one csv file for each pdf passed to its path argument. In our example data we have already collected all csvs into a single table.
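If you do run the extraction yourself, you will end up with several csv files that need to be stacked before you can follow the rest of this workflow. Below is a minimal base R sketch of that step. The folder, the file names and the column layout are all assumptions made for illustration; check the files get_geodata() actually writes before adapting it.

```r
# Hypothetical setup: write two small csv files to a temporary folder so the
# example is self-contained. Real files would come from get_geodata().
out_dir <- tempfile("arete_csvs")
dir.create(out_dir)
write.csv(
  data.frame(Species = "Araneus holzapfelae",
             Location = "Mpumalanga: Brondal",
             Coordinates = "-25.35, 30.84"),
  file.path(out_dir, "paper_1.csv"), row.names = FALSE
)
write.csv(
  data.frame(Species = "Araneus holzapfelae",
             Location = "KwaZulu-Natal: Empangeni",
             Coordinates = "-28.72, 31.88"),
  file.path(out_dir, "paper_2.csv"), row.names = FALSE
)

# Read every csv in the folder and tag each row with the index of its file.
csv_files <- list.files(out_dir, pattern = "\\.csv$", full.names = TRUE)
tables <- lapply(seq_along(csv_files), function(i) {
  tab <- read.csv(csv_files[i], stringsAsFactors = FALSE)
  tab$ID <- i
  tab
})

# Stack the per-paper tables into a single data frame.
combined <- do.call(rbind, tables)
combined
```

Note that rbind() requires every file to share the same columns, so a file with a different layout would have to be harmonized first.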

geotest = arete::arete_data("holzapfelae-extract")

kableExtra::kable(geotest)

Species	Location	Coordinates	ID	Type
Araneus holzapfelae	Limpopo: Blouberg Nature Reserve	-22.99, 29.04	1	Ground truth
Araneus holzapfelae	Little Leigh Farm, Louis Trichardt	-22.949, 29.870	1	Ground truth
Araneus holzapfelae	Mpumalanga: Brondal	-25.35, 30.84	1	Ground truth
Araneus holzapfelae	Mpumalanga: Pretoriuskop	-25.123, 32.237	2	Ground truth
Araneus holzapfelae	Gauteng: Ezemvelo Nature Reserve	-25.80, 28.77	2	Ground truth
Anapistula ataecina	Gauteng: Faerie Glen Nature Reserve	-25.74, 28.19	2	Ground truth
Araneus holzapfelae	KwaZulu-Natal: Empangeni	-28.72, 31.88	3	Ground truth
Araneus holzapfelae	KwaZulu-Natal: Richards Bay	-28.78, 32.10	3	Ground truth
Araneus holzapfelae	KwaZulu-Natal: iSimangaliso Wetland Park, uMkhuze Game Reserve	-27.63, 32.25	3	Ground truth
Araneus holzapfelae	KwaZulu-Natal: uMkhuze Game Reserve	-27.62174, 32.24543	4	Ground truth
Araneus holzapfelae	KwaZulu-Natal: Isandlwane Nature Reserve	-28.359, 30.640	4	Ground truth
Araneus holzapfelae	KwaZulu-Natal: Wakefield Farm	-29.4987, 29.9106	4	Ground truth
Araneus holzapfelae	Limpopo: Blouberg	-22.99, 29.04	1	Model
A. holzapfelae	Little Leigh Farm, Louis Trichardt	-22.949, 29.870	1	Model
Araneus holzapfelae	Mpumalanga: Brondal	-25.35, 30.84	1	Model
Araneus holzapfelae	Mpumalanga: Pretoriuskop	-25.123, 32.237	2	Model
Araneus holzapfelae	Gauteng: Ezemvelo Nature Reserve	-25.85, 28.78	2	Model
Araneus holzapfelae	Faerie Glen	-25.74, 28.19	2	Model
Araneus holzapfelae	KwaZulu-Natal: Empangeni	-28.72, 31.88	3	Model
Araneus holzapfelae	KwaZulu-Natal: Richards Bay	-29.78, 32.10	3	Model
Araneus holzapfelae	Game Reserve	-27.62174, 32.24543	4	Model
Macrothele calpeiana	KwaZulu-Natal: Isandlwane Nature Reserve	-28.359, 30.640	4	Model
Araneus holzapfelae	KwaZulu-Natal: Wakefield Farm	-29.4987, 29.9106	4	Model

In this case we will be as careful as possible and go over outliers separately from get_geodata(). This is a good example of the limitations of the process: get_geodata() can automatically do the next step for you, but when coordinates are written in the text as latitude, longitude instead of longitude, latitude, some outlier detection methods (env, svm) will fail.

Process coordinates

Let’s start by converting all of the coordinates from text to numeric values.

geocoords = string_to_coords(geotest$Coordinates)
#> 23 out of 23 (100%) succeded.

kableExtra::kable(geocoords)

Lat	Long
-22.99000	29.04000
-22.94900	29.87000
-25.35000	30.84000
-25.12300	32.23700
-25.80000	28.77000
-25.74000	28.19000
-28.72000	31.88000
-28.78000	32.10000
-27.63000	32.25000
-27.62174	32.24543
-28.35900	30.64000
-29.49870	29.91060
-22.99000	29.04000
-22.94900	29.87000
-25.35000	30.84000
-25.12300	32.23700
-25.85000	28.78000
-25.74000	28.19000
-28.72000	31.88000
-29.78000	32.10000
-27.62174	32.24543
-28.35900	30.64000
-29.49870	29.91060
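To make the conversion less of a black box, here is a rough base R sketch of the same idea for the simple "latitude, longitude" decimal strings in our table. The helper parse_lat_long() is hypothetical and is not how string_to_coords() is implemented; it only handles this one format and adds a basic range check.

```r
# Illustrative helper, not part of arete: split "lat, long" strings into
# numeric columns and flag values outside the valid coordinate ranges.
parse_lat_long <- function(x) {
  parts <- strsplit(x, ",[[:space:]]*")
  # vapply() returns a 2 x n matrix; transpose it to one row per record.
  coords <- t(vapply(parts, function(p) as.numeric(p[1:2]), numeric(2)))
  out <- data.frame(Lat = coords[, 1], Long = coords[, 2])
  out$in_range <- abs(out$Lat) <= 90 & abs(out$Long) <= 180
  out
}

parsed <- parse_lat_long(c("-22.99, 29.04", "-27.62174, 32.24543"))
parsed
```

A range check like this catches only gross errors. For points such as ours, where both values fit in either range, it cannot tell you whether latitude and longitude were swapped in the text.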

Process species names

Often species names will not match between human extracted data and model extracted data, for example because one source uses a species’ abbreviated name where the other uses its full name. Additionally, models will sometimes erratically add characters that might go undetected, especially if OCR extracted text was used. To get a good idea of model performance it is therefore often important to standardize species names. Here is an example for paper 1 in our dataset:

geonames = data.frame(
  human_names = geotest[geotest$ID == 1 & geotest$Type == "Ground truth", "Species"],
  model_names = geotest[geotest$ID == 1 & geotest$Type == "Model", "Species"]
  )

mismatch = which(geonames$human_names != geonames$model_names)
geonames = kableExtra::kable(geonames)
geonames = kableExtra::row_spec(geonames, mismatch, color = "red")

geonames

human_names	model_names
Araneus holzapfelae	Araneus holzapfelae
Araneus holzapfelae	A. holzapfelae
Araneus holzapfelae	Araneus holzapfelae

By using process_species_names() we standardize our species names, so that matching records are correctly recognized as referring to the same species.

geotest$Species = process_species_names(geotest$Species)

geonames = data.frame(
  human_names = geotest[geotest$ID == 1 & geotest$Type == "Ground truth", "Species"],
  model_names = geotest[geotest$ID == 1 & geotest$Type == "Model", "Species"]
  )
geonames = kableExtra::kable(geonames)
geonames = kableExtra::row_spec(geonames, mismatch, color = "green")

geonames

human_names	model_names
holzapfelae	holzapfelae
holzapfelae	holzapfelae
holzapfelae	holzapfelae
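If you are curious what a standardization step can look like, the sketch below shows one simple possibility in base R: lower-casing the name and reducing the genus to its initial. The helper abbreviate_genus() is hypothetical and is not the implementation of process_species_names(), whose rules may well differ; it is only meant to show why "Araneus holzapfelae" and "A. holzapfelae" can be made to compare as equal.

```r
# Illustrative helper, not arete's implementation: lower-case the name and
# reduce the genus to its initial followed by a period.
abbreviate_genus <- function(x) {
  x <- tolower(trimws(x))
  sub("^([a-z])[a-z]*\\.?[[:space:]]+", "\\1. ", x)
}

std_names <- abbreviate_genus(c("Araneus holzapfelae", "A. holzapfelae"))
std_names
```

Both inputs collapse to the same string, so a plain equality test no longer reports a mismatch. Keep in mind that abbreviating the genus can also merge distinct species that share an initial and an epithet.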

Process outliers

Often it pays off to be suspicious of data generated automatically through machine learning (one could argue this is true of human generated data as well). For this we’ll use the utilities in package gecko, which arete calls. For it to work, gecko needs to be set up, which we recommend you do after reading the documentation of functions gecko::gecko.setDir() and gecko::gecko.worldclim(). Setup requires a one-time, potentially heavy download of an environmental dataset, WorldClim. Function gecko::outliers.detect() uses this data to determine which points are likely outliers through different methods, including calculating the environmental and geographic distance between points and training a support vector machine model on the supplied data. The outcomes of these methods are collected in separate columns, and the total number of methods flagging a given point as an outlier is shown in column possible.outliers. We then have:

geoout = gecko::outliers.detect(geocoords[2:1])
#> All dimensions are missing at least one value. Trying rows.

kableExtra::kable(geoout)

x_coords	y_coords	env	geo	possible.outliers
29.04000	-22.99000	FALSE	TRUE	1
29.87000	-22.94900	FALSE	FALSE	0
30.84000	-25.35000	FALSE	FALSE	0
32.23700	-25.12300	FALSE	FALSE	0
28.77000	-25.80000	FALSE	FALSE	0
28.19000	-25.74000	FALSE	FALSE	0
31.88000	-28.72000	TRUE	FALSE	1
32.10000	-28.78000	TRUE	FALSE	1
32.25000	-27.63000	FALSE	FALSE	0
32.24543	-27.62174	FALSE	FALSE	0
30.64000	-28.35900	FALSE	FALSE	0
29.91060	-29.49870	FALSE	FALSE	0
29.04000	-22.99000	FALSE	TRUE	1
29.87000	-22.94900	FALSE	FALSE	0
30.84000	-25.35000	FALSE	FALSE	0
32.23700	-25.12300	FALSE	FALSE	0
28.78000	-25.85000	FALSE	FALSE	0
28.19000	-25.74000	FALSE	FALSE	0
31.88000	-28.72000	TRUE	FALSE	1
32.10000	-29.78000	NA	FALSE	NA
32.24543	-27.62174	FALSE	FALSE	0
30.64000	-28.35900	FALSE	FALSE	0
29.91060	-29.49870	FALSE	FALSE	0
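With a table like this, a natural next step is to pull out the rows worth a second look. The sketch below does that in base R on a few rows copied from the table above, so that it runs without gecko or WorldClim. The object names are ours, and the choice to treat NA rows as needing review is a suggestion rather than anything gecko prescribes.

```r
# A few rows copied from the outlier table above.
geoout_example <- data.frame(
  x_coords = c(29.04, 29.87, 31.88, 32.10),
  y_coords = c(-22.99, -22.949, -28.72, -29.78),
  env = c(FALSE, FALSE, TRUE, NA),
  geo = c(TRUE, FALSE, FALSE, FALSE),
  possible.outliers = c(1, 0, 1, NA)
)

# Rows flagged by at least one method; which() drops the NA comparisons.
flagged <- geoout_example[which(geoout_example$possible.outliers > 0), ]

# Rows where a method returned NA could not be fully scored.
unscored <- geoout_example[is.na(geoout_example$possible.outliers), ]

flagged
unscored
```

In our table the single NA row is the model’s Richards Bay record, whose latitude (-29.78) differs from the ground truth value (-28.78), so unscored rows are worth checking by hand as well.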

Create performance reports

Finally, we can determine how our model performed by processing all of our data through function performance_report(). This function takes two tables with identical formatting, one of human extracted data and another of model extracted data, and computes a series of metrics that help to get a sense of where mistakes might be found.

geotest = cbind(geotest[,1:2], geocoords, geotest[,4:5])

geotest = list(
  GT = geotest[geotest$Type == "Ground truth", 1:5],
  MD = geotest[geotest$Type == "Model", 1:5]
)

geo_report = performance_report(geotest$GT, geotest$MD, full_locations = "both", verbose = FALSE, rmds = FALSE)

For locations, the Levenshtein distance is calculated between terms. For coordinates, the function creates one confusion matrix for every species the two sets have in common. Each matrix is composed of True Positives (TP, coordinates that match perfectly between both tables), False Positives (FP, coordinates that appear only in the model extracted data) and False Negatives (FN, coordinates that appear only in the human extracted data). True Negatives are assumed not to apply. Several metrics are then calculated from the confusion matrix, including accuracy, precision, recall and the F1 score; details can be found in the documentation of performance_report(). An additional global confusion matrix is also created, which includes the errors (FP and FN) that result from species unique to each set. More metrics appear in the extended reports created through rmds = TRUE, including versions of the metrics above that are weighted by the degree of error. For example, if the model hallucinates a data point that is close to existing points, its weight as a False Positive is lower than if it hallucinated a data point far from all other points.

geo_report
#> $levenshtein
#>    nchar mean_minimum_levenshtein file
#> 1     38                       20    1
#> 2     41                        0    1
#> 3     38                        0    1
#> 4     38                        0    2
#> 5     41                        0    2
#> 6     38                       11    2
#> 7     38                        0    3
#> 8     41                        0    3
#> 9     38                       38    3
#> 10    38                       15    4
#> 11    41                        0    4
#> 12    38                        0    4
#> 
#> $mean_minimum_levenshtein
#>         1         2         3         4 
#>  6.666667  3.666667 12.666667  5.000000 
#> 
#> $adjusted_m_m_levenshtein
#>          1          2          3          4 
#> 0.17543860 0.09649123 0.33333333 0.13157895 
#> 
#> $`1_a. holzapfelae`
#>       TRUE FALSE
#> TRUE     3     0
#> FALSE    0    NA
#> 
#> $exclusive_to_each_set
#>      set     file species        count
#> [1,] "human" "2"  "a. ataecina"  "1"  
#> [2,] "model" "4"  "m. calpeiana" "1"  
#> 
#> $`2_a. holzapfelae`
#>       TRUE FALSE
#> TRUE     1     2
#> FALSE    1    NA
#> 
#> $`3_a. holzapfelae`
#>       TRUE FALSE
#> TRUE     1     0
#> FALSE    1    NA
#> 
#> $`4_a. holzapfelae`
#>       TRUE FALSE
#> TRUE     2     0
#> FALSE    1    NA
#> 
#> $global
#>       TRUE FALSE
#> TRUE     7     3
#> FALSE    4    NA
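
As a quick sanity check, the usual precision, recall and F1 formulas can be applied by hand to the global matrix. The sketch below uses the textbook definitions, which may differ in detail from what performance_report() computes, and it assumes the off-diagonal cells read as 3 False Positives and 4 False Negatives.

```r
# Counts read from the $global matrix above.
# Assumption: TP = 7, FP = 3, FN = 4 (the cell orientation is our reading).
tp <- 7
fp <- 3
fn <- 4

precision <- tp / (tp + fp)  # share of model records that are correct
recall <- tp / (tp + fn)     # share of human records the model recovered
f1 <- 2 * precision * recall / (precision + recall)

round(c(precision = precision, recall = recall, f1 = f1), 3)
```

Under that assumption precision is 0.7, recall is about 0.636 and F1 is about 0.667. If the orientation were the reverse, precision and recall would swap while F1 would stay the same.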