sfarrow is designed to help read/write spatial vector data in “simple feature” format from/to Parquet files while maintaining coordinate reference system information. Essentially, this tool is attempting to connect R objects in sf and in arrow and it relies on these packages for its internal work.
A key goal is to support interoperability of spatial data in Parquet files. R objects (including sf) can be written to files with arrow; however, these do not necessarily maintain the spatial information or can be read in by Python. sfarrow implements a metadata format also used by Python GeoPandas, described here: https://github.com/geopandas/geo-arrow-spec. Note that these metadata are not stable yet, and sfarrow will warn you that it may change.
# install from CRAN with install.packages('sfarrow')
# or install from devtools::install_github("wcjochem/sfarrow@main)
# load the library
library(sfarrow)
library(dplyr, warn.conflicts = FALSE)A Parquet file (with .parquet extension) can be read using st_read_parquet() and pointing to the file system. This will create an sf spatial data object in memory which can then be used as normal using functions from sf.
# read an example dataset created from Python using geopandas
world <- st_read_parquet(system.file("extdata", "world.parquet", package = "sfarrow"))
class(world)
#> [1] "sf"         "data.frame"
world
#> Simple feature collection with 177 features and 5 fields
#> Geometry type: GEOMETRY
#> Dimension:     XY
#> Bounding box:  xmin: -180 ymin: -90 xmax: 180 ymax: 83.64513
#> Geodetic CRS:  WGS 84
#> First 10 features:
#>      pop_est     continent                     name iso_a3 gdp_md_est
#> 1     920938       Oceania                     Fiji    FJI  8.374e+03
#> 2   53950935        Africa                 Tanzania    TZA  1.506e+05
#> 3     603253        Africa                W. Sahara    ESH  9.065e+02
#> 4   35623680 North America                   Canada    CAN  1.674e+06
#> 5  326625791 North America United States of America    USA  1.856e+07
#> 6   18556698          Asia               Kazakhstan    KAZ  4.607e+05
#> 7   29748859          Asia               Uzbekistan    UZB  2.023e+05
#> 8    6909701       Oceania         Papua New Guinea    PNG  2.802e+04
#> 9  260580739          Asia                Indonesia    IDN  3.028e+06
#> 10  44293293 South America                Argentina    ARG  8.794e+05
#>                          geometry
#> 1  MULTIPOLYGON (((180 -16.067...
#> 2  POLYGON ((33.90371 -0.95, 3...
#> 3  POLYGON ((-8.66559 27.65643...
#> 4  MULTIPOLYGON (((-122.84 49,...
#> 5  MULTIPOLYGON (((-122.84 49,...
#> 6  POLYGON ((87.35997 49.21498...
#> 7  POLYGON ((55.96819 41.30864...
#> 8  MULTIPOLYGON (((141.0002 -2...
#> 9  MULTIPOLYGON (((141.0002 -2...
#> 10 MULTIPOLYGON (((-68.63401 -...
plot(sf::st_geometry(world))Similarly, a Parquet file can be written from an sf object using st_write_parquet() and specifying a path to the new file. Non-spatial objects cannot be written with sfarrow, and users should instead use arrow.
# output the file to a new location
# note the warning about possible future changes in metadata.
st_write_parquet(world, dsn = file.path(tempdir(), "new_world.parquet"))
#> Warning: This is an initial implementation of Parquet/Feather file support and
#> geo metadata. This is tracking version 0.1.0 of the metadata
#> (https://github.com/geopandas/geo-arrow-spec). This metadata
#> specification may change and does not yet make stability promises.  We
#> do not yet recommend using this in a production setting unless you are
#> able to rewrite your Parquet/Feather files.While reading/writing a Parquet file is nice, the real power of arrow comes from splitting big datasets into multiple files, or partitions, based on criteria that make it faster to query. There is currently basic support in sfarrow for multi-file spatial datasets. For additional dataset querying options, see the arrow documentation.
sfarrow accesses arrows’s dplyr interface to explore partitioned, Arrow datasets.
For this example we will use a dataset which was created by randomly splitting the nc.shp file first into three groups and then further partitioning into two more random groups. This creates a nested set of files.
list.files(system.file("extdata", "ds", package = "sfarrow"), recursive = TRUE)
#> [1] "split1=1/split2=1/part-3.parquet" "split1=1/split2=2/part-0.parquet"
#> [3] "split1=2/split2=1/part-1.parquet" "split1=2/split2=2/part-5.parquet"
#> [5] "split1=3/split2=1/part-2.parquet" "split1=3/split2=2/part-4.parquet"The file tree is showing that the data were partitioned by the variables “split1” and “split2”. Those are the column names that were used for the random splits. This partitioning is in “Hive style” where the partitioning variables are in the paths.
The first step is to open the Dataset using arrow.
ds <- arrow::open_dataset(system.file("extdata", "ds", package="sfarrow"))For small datasets (as in the example) we can read the entire set of files into an sf object.
nc_ds <- read_sf_dataset(ds)
nc_ds
#> Simple feature collection with 100 features and 16 fields
#> Geometry type: MULTIPOLYGON
#> Dimension:     XY
#> Bounding box:  xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
#> Geodetic CRS:  NAD27
#> First 10 features:
#>     AREA PERIMETER CNTY_ CNTY_ID     NAME  FIPS FIPSNO CRESS_ID BIR74 SID74
#> 1  0.097     1.670  1833    1833 Hertford 37091  37091       46  1452     7
#> 2  0.062     1.547  1834    1834   Camden 37029  37029       15   286     0
#> 3  0.109     1.325  1841    1841   Person 37145  37145       73  1556     4
#> 4  0.081     1.288  1880    1880  Watauga 37189  37189       95  1323     1
#> 5  0.044     1.158  1887    1887   Chowan 37041  37041       21   751     1
#> 6  0.086     1.267  1893    1893   Yadkin 37197  37197       99  1269     1
#> 7  0.170     1.680  1903    1903 Guilford 37081  37081       41 16184    23
#> 8  0.118     1.601  1946    1946  Madison 37115  37115       58   765     2
#> 9  0.134     1.755  1958    1958    Burke 37023  37023       12  3573     5
#> 10 0.116     1.664  1964    1964 McDowell 37111  37111       56  1946     5
#>    NWBIR74 BIR79 SID79 NWBIR79 split1 split2                       geometry
#> 1      954  1838     5    1237      1      1 MULTIPOLYGON (((-76.74506 3...
#> 2      115   350     2     139      1      1 MULTIPOLYGON (((-76.00897 3...
#> 3      613  1790     4     650      1      1 MULTIPOLYGON (((-78.8068 36...
#> 4       17  1775     1      33      1      1 MULTIPOLYGON (((-81.80622 3...
#> 5      368   899     1     491      1      1 MULTIPOLYGON (((-76.68874 3...
#> 6       65  1568     1      76      1      1 MULTIPOLYGON (((-80.49554 3...
#> 7     5483 20543    38    7089      1      1 MULTIPOLYGON (((-79.53782 3...
#> 8        5   926     2       3      1      1 MULTIPOLYGON (((-82.89597 3...
#> 9      326  4314    15     407      1      1 MULTIPOLYGON (((-81.81628 3...
#> 10     134  2215     5     128      1      1 MULTIPOLYGON (((-81.81628 3...With large datasets, more often we will want query them and return a reduced set of the partitioned records. To create a query, the easiest way is to use dplyr::filter() on the partitioning (and/or other) variables to subset the rows and dplyr::select() to subset the columns. read_sf_dataset() will then use the arrow_dplyr_query and call dplyr::collect() to extract and then process the Arrow Table into sf.
nc_d12 <- ds %>% 
            filter(split1 == 1, split2 == 2) %>%
            read_sf_dataset()
nc_d12
#> Simple feature collection with 20 features and 16 fields
#> Geometry type: MULTIPOLYGON
#> Dimension:     XY
#> Bounding box:  xmin: -83.36472 ymin: 34.71101 xmax: -75.45698 ymax: 36.58965
#> Geodetic CRS:  NAD27
#> First 10 features:
#>     AREA PERIMETER CNTY_ CNTY_ID        NAME  FIPS FIPSNO CRESS_ID BIR74 SID74
#> 1  0.114     1.442  1825    1825        Ashe 37009  37009        5  1091     1
#> 2  0.143     1.630  1828    1828       Surry 37171  37171       86  3188     5
#> 3  0.153     2.206  1832    1832 Northampton 37131  37131       66  1421     9
#> 4  0.118     1.421  1836    1836      Warren 37185  37185       93   968     4
#> 5  0.114     1.352  1838    1838     Caswell 37033  37033       17  1035     2
#> 6  0.143     1.663  1840    1840   Granville 37077  37077       39  1671     4
#> 7  0.108     1.483  1900    1900     Forsyth 37067  37067       34 11858    10
#> 8  0.111     1.392  1904    1904    Alamance 37001  37001        1  4672    13
#> 9  0.104     1.294  1907    1907      Orange 37135  37135       68  3164     4
#> 10 0.122     1.516  1932    1932    Caldwell 37027  37027       14  3609     6
#>    NWBIR74 BIR79 SID79 NWBIR79 split1 split2                       geometry
#> 1       10  1364     0      19      1      2 MULTIPOLYGON (((-81.47276 3...
#> 2      208  3616     6     260      1      2 MULTIPOLYGON (((-80.45634 3...
#> 3     1066  1606     3    1197      1      2 MULTIPOLYGON (((-77.21767 3...
#> 4      748  1190     2     844      1      2 MULTIPOLYGON (((-78.30876 3...
#> 5      550  1253     2     597      1      2 MULTIPOLYGON (((-79.53051 3...
#> 6      930  2074     4    1058      1      2 MULTIPOLYGON (((-78.74912 3...
#> 7     3919 15704    18    5031      1      2 MULTIPOLYGON (((-80.0381 36...
#> 8     1243  5767    11    1397      1      2 MULTIPOLYGON (((-79.24619 3...
#> 9      776  4478     6    1086      1      2 MULTIPOLYGON (((-79.01814 3...
#> 10     309  4249     9     360      1      2 MULTIPOLYGON (((-81.32813 3...
plot(sf::st_geometry(nc_d12), col="grey")When using select() to read only a subset of columns, if the geometry column is not returned, the default behaviour of sfarrow is to throw an error from read_sf_dataset. If you do not need the geometry column for your analyses, then using arrow and not sfarrow should be sufficient. However, setting find_geom = TRUE in read_sf_dataset will read in any geometry columns in the metadata, in addition to the selected columns.
# this command will throw an error
# no geometry column selected for read_sf_dataset
# nc_sub <- ds %>% 
#             select('FIPS') %>% # subset of columns
#             read_sf_dataset()
# set find_geom
nc_sub <- ds %>%
            select('FIPS') %>% # subset of columns
            read_sf_dataset(find_geom = TRUE)
nc_sub
#> Simple feature collection with 100 features and 1 field
#> Geometry type: MULTIPOLYGON
#> Dimension:     XY
#> Bounding box:  xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
#> Geodetic CRS:  NAD27
#> First 10 features:
#>     FIPS                       geometry
#> 1  37091 MULTIPOLYGON (((-76.74506 3...
#> 2  37029 MULTIPOLYGON (((-76.00897 3...
#> 3  37145 MULTIPOLYGON (((-78.8068 36...
#> 4  37189 MULTIPOLYGON (((-81.80622 3...
#> 5  37041 MULTIPOLYGON (((-76.68874 3...
#> 6  37197 MULTIPOLYGON (((-80.49554 3...
#> 7  37081 MULTIPOLYGON (((-79.53782 3...
#> 8  37115 MULTIPOLYGON (((-82.89597 3...
#> 9  37023 MULTIPOLYGON (((-81.81628 3...
#> 10 37111 MULTIPOLYGON (((-81.81628 3...To write an sf object into multiple files, we can again construct a query using dplyr::group_by() to define the partitioning variables. The result is then passed to sfarrow.
world %>%
  group_by(continent) %>%
  write_sf_dataset(file.path(tempdir(), "world_ds"), 
                   format = "parquet",
                   hive_style = FALSE)
#> Warning: This is an initial implementation of Parquet/Feather file support and
#> geo metadata. This is tracking version 0.1.0 of the metadata
#> (https://github.com/geopandas/geo-arrow-spec). This metadata
#> specification may change and does not yet make stability promises.  We
#> do not yet recommend using this in a production setting unless you are
#> able to rewrite your Parquet/Feather files.In this example we are not using Hive style. This results in the partitioning variable not being in the folder paths.
list.files(file.path(tempdir(), "world_ds"))
#> [1] "Africa"                  "Antarctica"             
#> [3] "Asia"                    "Europe"                 
#> [5] "North America"           "Oceania"                
#> [7] "Seven seas (open ocean)" "South America"To read this style of Dataset, we must specify the partitioning variables when it is opened.
arrow::open_dataset(file.path(tempdir(), "world_ds"), 
                    partitioning = "continent") %>%
  filter(continent == "Africa") %>%
  read_sf_dataset()
#> Simple feature collection with 51 features and 5 fields
#> Geometry type: GEOMETRY
#> Dimension:     XY
#> Bounding box:  xmin: -17.62504 ymin: -34.81917 xmax: 51.13387 ymax: 37.34999
#> Geodetic CRS:  WGS 84
#> First 10 features:
#>     pop_est            name iso_a3 gdp_md_est continent
#> 1  53950935        Tanzania    TZA   150600.0    Africa
#> 2    603253       W. Sahara    ESH      906.5    Africa
#> 3  83301151 Dem. Rep. Congo    COD    66010.0    Africa
#> 4   7531386         Somalia    SOM     4719.0    Africa
#> 5  47615739           Kenya    KEN   152700.0    Africa
#> 6  37345935           Sudan    SDN   176300.0    Africa
#> 7  12075985            Chad    TCD    30590.0    Africa
#> 8  54841552    South Africa    ZAF   739100.0    Africa
#> 9   1958042         Lesotho    LSO     6019.0    Africa
#> 10 13805084        Zimbabwe    ZWE    28330.0    Africa
#>                          geometry
#> 1  POLYGON ((33.90371 -0.95, 3...
#> 2  POLYGON ((-8.66559 27.65643...
#> 3  POLYGON ((29.34 -4.499983, ...
#> 4  POLYGON ((41.58513 -1.68325...
#> 5  POLYGON ((39.20222 -4.67677...
#> 6  POLYGON ((24.56737 8.229188...
#> 7  POLYGON ((23.83766 19.58047...
#> 8  POLYGON ((16.34498 -28.5767...
#> 9  POLYGON ((28.97826 -28.9556...
#> 10 POLYGON ((31.19141 -22.2515...