Using rcldf to work with Cross-Linguistic Data Format datasets

Simon J. Greenhill

2025-09-16

library(rcldf)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(ggplot2)
library(patchwork)

theme_set(theme_classic())

Introduction

The rcldf package provides tools to read and interact with CLDF datasets in R. This vignette demonstrates how to install the package, load these datasets, and use them.

Cross-Linguistic Data Formats (CLDF, Forkel et al. 2018) is a standardized data format designed to handle cross-linguistic and cross-cultural datasets. CLDF provides a consistent specification and package format for common types of linguistic and cultural data, e.g. word lists, grammatical features, and cultural traits. The aim of the format is to provide a simple, reliable data format that facilitates the sharing and re-use of these data and makes new analyses possible.

There are currently more than 250 CLDF datasets available containing data from the world’s languages and cultures, including everything from catalogues of linguistic metadata (e.g. Glottolog, EndangeredLanguages.com), to word lists of lexical data (e.g. Lexibank), grammatical features (e.g. WALS and Grambank), phonetic information (e.g. Phoible), geographic information (e.g. Wurm & Hattori 1981/1983), and religious and cultural databases (e.g. D-PLACE and Pulotu).

Installation

You can install rcldf directly from GitHub using the devtools package:

library(devtools)
install_github("SimonGreenhill/rcldf", dependencies = TRUE)
library(rcldf)

Loading a CLDF Dataset

Once installed, load a CLDF dataset by creating a cldf object. You can point rcldf at the dataset with a local path or a URL:

library(rcldf)

# Load from a local directory:
ds <- cldf('/path/to/dir/wals_1a_cldf')

# or load from a specific metadata file:
ds <- cldf('/path/to/dir/wals_1a_cldf/StructureDataset-metadata.json')

# or load from zenodo:
df <- cldf("https://zenodo.org/record/7844558/files/grambank/grambank-v1.0.3.zip?download=1")

# or load from github:
df <- cldf('https://github.com/lexibank/abvd')

Exploring a CLDF Dataset

Once loaded, a cldf object exposes various pieces of information. To show this, I’ll use a small dataset that comes packaged with rcldf. It contains a wordlist from a number of Huon Peninsula languages of New Guinea, originally described in McElhanon (1967):

library(rcldf)
ds <- cldf(system.file("extdata/huon", package="rcldf"))

# this dataset has 4 tables:
ds 
#> A CLDF dataset with 4 tables (CognateTable, FormTable, LanguageTable, ParameterTable)
#> 
#>  McElhanon, K.A. 1967. Preliminary Observations on Huon Peninsula Languages. Oceanic Linguistics. 6, 1-45.

# more details:
summary(ds)
#> A Cross-Linguistic Data Format (CLDF) dataset:
#> Name: McElhanon 1967 Huon Peninsula data
#> Path: /Users/simon/src/rlib/rcldf/extdata/huon
#> Type: http://cldf.clld.org/v1.0/terms.rdf#Wordlist
#> Tables:
#>   1/4: CognateTable (10 columns, 1960 rows)
#>   2/4: FormTable (11 columns, 1960 rows)
#>   3/4: LanguageTable (9 columns, 14 rows)
#>   4/4: ParameterTable (4 columns, 140 rows)
#> Sources: 0
#> Cite:
#>   McElhanon, K.A. 1967. Preliminary Observations on Huon Peninsula Languages. Oceanic Linguistics. 6, 1-45.

So here we have a dataset with tables of languages, parameters (=words), forms (=the lexical items), and cognates (=cognacy information showing how the lexical items are related). There is also other information here, e.g. the citation for where the dataset came from, the path where the dataset is stored, and which Type of CLDF specification this dataset adheres to.

Accessing the data

Each table is attached to the ds$tables list, so to access them you call ds$tables$<tablename>. These are simply dataframes (or tibbles), so you can then do anything you want with them:

names(ds$tables)
#> [1] "FormTable"      "LanguageTable"  "ParameterTable" "CognateTable"

# let's look at the languages -- 
head(ds$tables$LanguageTable)
#> # A tibble: 6 × 9
#>   ID     Name     Glottocode Glottolog_Name ISO639P3code Macroarea Latitude
#>   <chr>  <chr>    <chr>      <chr>          <chr>        <chr>        <dbl>
#> 1 borong Kosorong boro1279   <NA>           ksr          <NA>            NA
#> 2 burum  Burum    buru1306   <NA>           bmu          <NA>            NA
#> 3 dedua  Dedua    dedu1240   <NA>           ded          <NA>            NA
#> 4 kate   Kâte     kate1253   <NA>           kmg          <NA>            NA
#> 5 komba  Komba    komb1273   <NA>           kpf          <NA>            NA
#> 6 kube   Hube     kube1244   <NA>           kgf          <NA>            NA
#> # ℹ 2 more variables: Longitude <dbl>, Family <chr>

# and the parameters - in this case the words in the wordlist
head(ds$tables$ParameterTable)
#> # A tibble: 6 × 4
#>   ID    Name  Concepticon_ID Concepticon_Gloss
#>   <chr> <chr> <chr>          <chr>            
#> 1 i     I     1209           I                
#> 2 thou  thou  1215           THOU             
#> 3 we    we    1212           WE               
#> 4 who   who   1235           WHO              
#> 5 what  what  1236           WHAT             
#> 6 that  that  78             THAT

# the lexical items themselves live in the FormTable (above); note that
# tables the dataset does not contain simply return NULL:
ds$tables$ValueTable
#> NULL
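
Because each table is an ordinary tibble, the usual dplyr verbs apply directly. For example, a quick sketch (output omitted) counting how many forms each language contributes:

# how many forms per language?
ds$tables$FormTable |>
    count(Language_ID, sort = TRUE)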

Load all the source information

CLDF datasets store their sources in BibTeX format. We don’t load them by default, as it can take a long time to parse a large BibTeX file correctly.

You can load and access them like this; the sources are then available in ds$sources in bib2df format:

ds <- cldf(system.file("extdata/huon", package="rcldf"), load_bib=TRUE)
# or if you loaded the CLDF without sources the first time you can add them now:
ds <- read_bib(ds)

ds$sources
#> # A tibble: 1 × 26
#>   CATEGORY BIBTEXKEY    ADDRESS ANNOTE AUTHOR BOOKTITLE CHAPTER CROSSREF EDITION
#>   <chr>    <chr>        <chr>   <chr>  <list> <chr>     <chr>   <chr>    <chr>  
#> 1 ARTICLE  McElhanon19… <NA>    <NA>   <chr>  <NA>      <NA>    <NA>     <NA>   
#> # ℹ 17 more variables: EDITOR <list>, HOWPUBLISHED <chr>, INSTITUTION <chr>,
#> #   JOURNAL <chr>, KEY <chr>, MONTH <chr>, NOTE <chr>, NUMBER <chr>,
#> #   ORGANIZATION <chr>, PAGES <chr>, PUBLISHER <chr>, SCHOOL <chr>,
#> #   SERIES <chr>, TITLE <chr>, TYPE <chr>, VOLUME <chr>, YEAR <dbl>
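
Since ds$sources is just another tibble, you can query it like any table; for example, to pull out the key, journal, and year of each source (output omitted):

ds$sources |> select(BIBTEXKEY, JOURNAL, YEAR)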

Construct a ‘wide’ table with all foreign key entries filled in:

Sometimes people want to have all the data from a CLDF dataset as one dataframe. Use as.cldf.wide to do this, passing it the name of a table to act as the base.

This will take the base table and resolve all foreign keys (usually *_ID) into their own columns. Name clashes between tables are resolved by appending the table name (e.g. the column Name in the LanguageTable will become Name.LanguageTable).

For example, this dataset has a FormTable which connects to the ParameterTable via Parameter_ID and the LanguageTable via Language_ID.

Using as.cldf.wide we can combine all the information from ParameterTable and LanguageTable into the FormTable:

as.cldf.wide(ds, 'FormTable')
#> Joining Language_ID -> LanguageTable -> ID
#> Joining Parameter_ID -> ParameterTable -> ID
#> # A tibble: 1,960 × 22
#>    ID      Local_ID Language_ID Parameter_ID Value Form  Segments Comment Source
#>    <chr>   <chr>    <chr>       <chr>        <chr> <chr> <chr>    <chr>   <chr> 
#>  1 borong… 90452    borong      i            ni    ni    n i      I       McElh…
#>  2 borong… 91684    borong      where        daaŋ… daaŋ… d a a ŋ… where   McElh…
#>  3 borong… 90466    borong      thou         gi    gi    g i      thou    McElh…
#>  4 borong… 91978    borong      itsflower    dzur… dzur… d z u r… (its) … McElh…
#>  5 borong… 90480    borong      we           nono  nono  n o n o  we      McElh…
#>  6 borong… 91810    borong      hethrows     mesa… mesa… m e s a… (he) t… McElh…
#>  7 borong… 90578    borong      hesays       dzo-  dzo-  d z o -  (he) s… McElh…
#>  8 borong… 92076    borong      itswing      eŋga… eŋga… e ŋ g a… (its) … McElh…
#>  9 borong… 90704    borong      hegivesme    non-  non-  n o n -  (he) g… McElh…
#> 10 borong… 92202    borong      dull         dzit… dzit… d z i t… dull    McElh…
#> # ℹ 1,950 more rows
#> # ℹ 13 more variables: Cognacy <chr>, Loan <lgl>, Name.LanguageTable <chr>,
#> #   Glottocode <chr>, Glottolog_Name <chr>, ISO639P3code <chr>,
#> #   Macroarea <chr>, Latitude <dbl>, Longitude <dbl>, Family <chr>,
#> #   Name.ParameterTable <chr>, Concepticon_ID <chr>, Concepticon_Gloss <chr>
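
The resolved columns make cross-table summaries easy. For example, a sketch (output omitted) tabulating forms per language via the resolved language names:

wide <- as.cldf.wide(ds, 'FormTable')
wide |> count(Name.LanguageTable)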

Load just one table:

Sometimes you just want one table. To do this, call get_table_from with the table type and the dataset path:

get_table_from('LanguageTable', system.file("extdata/huon", package="rcldf"))
#> # A tibble: 14 × 9
#>    ID      Name     Glottocode Glottolog_Name ISO639P3code Macroarea Latitude
#>    <chr>   <chr>    <chr>      <chr>          <chr>        <chr>        <dbl>
#>  1 borong  Kosorong boro1279   <NA>           ksr          <NA>            NA
#>  2 burum   Burum    buru1306   <NA>           bmu          <NA>            NA
#>  3 dedua   Dedua    dedu1240   <NA>           ded          <NA>            NA
#>  4 kate    Kâte     kate1253   <NA>           kmg          <NA>            NA
#>  5 komba   Komba    komb1273   <NA>           kpf          <NA>            NA
#>  6 kube    Hube     kube1244   <NA>           kgf          <NA>            NA
#>  7 mape    Mape     mape1249   <NA>           mlh          <NA>            NA
#>  8 mesem   Mese     mese1244   <NA>           mci          <NA>            NA
#>  9 mindik  Mindik   buru1306   <NA>           bmu          <NA>            NA
#> 10 nabak   Nabak    naba1256   <NA>           naf          <NA>            NA
#> 11 ono     Ono      onoo1246   <NA>           ons          <NA>            NA
#> 12 selepet Selepet  sele1250   <NA>           spl          <NA>            NA
#> 13 timbe   Timbe    timb1251   <NA>           tim          <NA>            NA
#> 14 tobo    Tobo     tobo1251   <NA>           kgf          <NA>            NA
#> # ℹ 2 more variables: Longitude <dbl>, Family <chr>

Get the citation for a dataset:

print(ds$citation)
#> [1] " McElhanon, K.A. 1967. Preliminary Observations on Huon Peninsula Languages. Oceanic Linguistics. 6, 1-45."

Quickly get information on an unloaded dataset:

Perhaps you want to know which version of a dataset you have without loading the whole thing. Use get_details:

get_details(system.file("extdata/huon", package="rcldf"))
#>                                Title                                     Path
#> 1 McElhanon 1967 Huon Peninsula data /Users/simon/src/rlib/rcldf/extdata/huon
#>     Size
#> 1 299051
#>                                                                                                     Citation
#> 1  McElhanon, K.A. 1967. Preliminary Observations on Huon Peninsula Languages. Oceanic Linguistics. 6, 1-45.
#>                                     ConformsTo
#> 1 http://cldf.clld.org/v1.0/terms.rdf#Wordlist

Easily get reference catalogs (Glottolog, Concepticon):

rcldf has a couple of utility functions to get the latest versions of the Glottolog and Concepticon CLDF reference catalogs:

glott <- load_glottolog()
conc <- load_concepticon()
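
These are ordinary cldf objects, so their tables can be queried as above. For example, a sketch looking up one language in Glottolog (output omitted; the column names assume the current Glottolog CLDF LanguageTable):

glott$tables$LanguageTable |>
    filter(ID == "kate1253") |>
    select(ID, Name, Latitude, Longitude)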

Cache Information

When you load a dataset from a URL, rcldf downloads the dataset and unpacks it to a cache directory. By default this is a temporary directory which will be deleted when you close R.

However, by specifying a directory (or using tools::R_user_dir("rcldf", which = "cache")) you can re-use downloaded datasets later.
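
For example, to make downloads persist across sessions in R’s standard per-user cache location:

set_cache_dir(tools::R_user_dir("rcldf", which = "cache"))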

To see where downloads will be saved:

get_cache_dir()
#> [1] "~/src/cldf"

You can set the cache directory for the current session with:

set_cache_dir('/path/somewhere')

If you want to set this permanently, edit your .Renviron file to add the line RCLDF_CACHE_DIR=/path/somewhere:

usethis::edit_r_environ()
# now add 
# RCLDF_CACHE_DIR=<where you want the data saved>

To see what datasets you’ve downloaded:

list_cache_files()
#>   Title
#> 1    NA
#> 2    NA
#>                                                                                                                                                                                                     Path
#> 1                                          /Users/simon/src/cldf/zenodo_api_records_10997741_files_cldf_clts_clts_v2_3_0_zip__b57347f1b05b04c6d6108a111ca3720a/cldf-clts-clts-1c0b886/cldf-metadata.json
#> 2 /Users/simon/src/cldf/zenodo_api_records_10997741_files_cldf_clts_clts_v2_3_0_zip__b57347f1b05b04c6d6108a111ca3720a/cldf-clts-clts-1c0b886/pkg/transcriptionsystems/transcription-system-metadata.json
#>    Size Citation                                  ConformsTo
#> 1 27744       NA http://cldf.clld.org/v1.0/terms.rdf#Generic
#> 2 14636       NA                                        <NA>

You can re-use datasets in your cache:

cldf(list_cache_files()[1, 'Path'])
#> A CLDF dataset with 4 tables (features.tsv, graphemes.tsv, index.tsv, sounds.tsv)
#> 
#> 

Or just save them to a particular directory:

df <- cldf("https://zenodo.org/record/7844558/files/grambank/grambank-v1.0.3.zip?download=1", cache_dir="~/data/grambank")

Using rcldf to analyse datasets:

To show you how to use rcldf to analyse datasets, we’re going to test whether languages that distinguish inclusive from exclusive pronouns (i.e. clusivity, https://en.wikipedia.org/wiki/Clusivity) also tend to have high rigidity in social structure.

To do this, we will use the Grambank feature GB028 (“Is there a distinction between inclusive and exclusive?”) and the D-PLACE trait EA113 (“Degree of rigidity in social structures”).

First, let’s get the published version of Grambank off Zenodo. It’s always a good idea to use the published version, as this is a citable, versioned product that makes it easier for other researchers to replicate your analysis. To do this, go to the Zenodo page for Grambank, find the download link under the ‘Files’ section, and copy it.

Screenshot of the Zenodo page for Grambank showing the download link

It will look like this: https://zenodo.org/records/7844558/files/grambank/grambank-v1.0.3.zip?download=1. Give that to rcldf:

grambank <- cldf("https://zenodo.org/records/7844558/files/grambank/grambank-v1.0.3.zip?download=1")
grambank
#> A CLDF dataset with 6 tables (CodeTable, contributors.csv, families.csv, LanguageTable, ParameterTable, ValueTable)
#> 
#> Skirgård, Hedvig and Haynie, Hannah J. and Blasi, Damián E. and Hammarström, Harald and Collins, Jeremy and Latarche, Jay J. and Lesage, Jakob and Weber, Tobias and Witzlack-Makarevich, Alena and Passmore, Sam and Chira, Angela and Maurits, Luke and Dinnage, Russell and Dunn, Michael and Reesink, Ger and Singer, Ruth and Bowern, Claire and Epps, Patience and Hill, Jane and Vesakoski, Outi and Robbeets, Martine and Abbas, Noor Karolin and Auer, Daniel and Bakker, Nancy A. and Barbos, Giulia and Borges, Robert D. and Danielsen, Swintha and Dorenbusch, Luise and Dorn, Ella and Elliott, John and Falcone, Giada and Fischer, Jana and Ghanggo Ate, Yustinus and Gibson, Hannah and Göbel, Hans-Philipp and Goodall, Jemima A. and Gruner, Victoria and Harvey, Andrew and Hayes, Rebekah and Heer, Leonard and Herrera Miranda, Roberto E. and Hübler, Nataliia and Huntington-Rainey, Biu and Ivani, Jessica K. and Johns, Marilen and Just, Erika and Kashima, Eri and Kipf, Carolina and Klingenberg, Janina V. and König, Nikita and Koti, Aikaterina and Kowalik, Richard G. A. and Krasnoukhova, Olga and Lindvall, Nora L.M. and Lorenzen, Mandy and Lutzenberger, Hannah and Martins, Tônia R.A. and Mata German, Celia and van der  Meer, Suzanne and Montoya Samamé, Jaime and Müller, Michael and Muradoglu, Saliha and Neely, Kelsey and Nickel, Johanna and Norvik, Miina and Oluoch, Cheryl Akinyi and Peacock, Jesse and Pearey, India O.C. and Peck, Naomi and Petit, Stephanie and Pieper, Sören and Poblete, Mariana and Prestipino, Daniel and Raabe, Linda and Raja, Amna and Reimringer, Janis and Rey, Sydney C. and Rizaew, Julia and Ruppert, Eloisa and Salmon, Kim K. and Sammet, Jill and Schembri, Rhiannon and Schlabbach, Lars and Schmidt, Frederick W.P. and Skilton, Amalia and Smith, Wikaliler Daniel and de  Sousa, Hilário and Sverredal, Kristin and Valle, Daniel and Vera, Javier and Voß, Judith and Witte, Tim and Wu, Henry and Yam, Stephanie and Ye 葉婧婷, Jingting and Yong, Maisie and Yuditha, Tessa and Zariquiey, Roberto and Forkel, Robert and Evans, Nicholas and Levinson, Stephen C. and Haspelmath, Martin and Greenhill, Simon J. and Atkinson, Quentin D. and Gray, Russell D. (in press). Grambank reveals the importance of genealogical constraints on linguistic diversity and highlights the impact of language loss. Science Advances.

Now, let’s use dplyr to get the variable we want from Grambank. We use summary to see what the dataset looks like. It is a CLDF StructureDataset with six tables:

summary(grambank)
#> A Cross-Linguistic Data Format (CLDF) dataset:
#> Name: Grambank v1.0
#> Path: /Users/simon/Library/Caches/org.R-project.R/R/rcldf/zenodo_records_7844558_files_grambank_grambank_v1_0_3_zip_5ba67f1e8557c79b5e4fffaab8f58bcc/grambank-grambank-7ae000c/cldf
#> Type: http://cldf.clld.org/v1.0/terms.rdf#StructureDataset
#> Tables:
#>   1/6: CodeTable (4 columns, 398 rows)
#>   2/6: contributors.csv (5 columns, 139 rows)
#>   3/6: families.csv (2 columns, 215 rows)
#>   4/6: LanguageTable (13 columns, 2467 rows)
#>   5/6: ParameterTable (12 columns, 195 rows)
#>   6/6: ValueTable (9 columns, 441663 rows)
#> Sources: 0
#> Cite:
#>  Skirgård, Hedvig and Haynie, Hannah J. and Blasi, Damián E. and Hammarström, Harald and Collins, Jeremy and Latarche, Jay J. and Lesage, Jakob and Weber, Tobias and Witzlack-Makarevich, Alena and Passmore, Sam and Chira, Angela and Maurits, Luke and Dinnage, Russell and Dunn, Michael and Reesink, Ger and Singer, Ruth and Bowern, Claire and Epps, Patience and Hill, Jane and Vesakoski, Outi and Robbeets, Martine and Abbas, Noor Karolin and Auer, Daniel and Bakker, Nancy A. and Barbos, Giulia and Borges, Robert D. and Danielsen, Swintha and Dorenbusch, Luise and Dorn, Ella and Elliott, John and Falcone, Giada and Fischer, Jana and Ghanggo Ate, Yustinus and Gibson, Hannah and Göbel, Hans-Philipp and Goodall, Jemima A. and Gruner, Victoria and Harvey, Andrew and Hayes, Rebekah and Heer, Leonard and Herrera Miranda, Roberto E. and Hübler, Nataliia and Huntington-Rainey, Biu and Ivani, Jessica K. and Johns, Marilen and Just, Erika and Kashima, Eri and Kipf, Carolina and Klingenberg, Janina V. and König, Nikita and Koti, Aikaterina and Kowalik, Richard G. A. and Krasnoukhova, Olga and Lindvall, Nora L.M. and Lorenzen, Mandy and Lutzenberger, Hannah and Martins, Tônia R.A. and Mata German, Celia and van der  Meer, Suzanne and Montoya Samamé, Jaime and Müller, Michael and Muradoglu, Saliha and Neely, Kelsey and Nickel, Johanna and Norvik, Miina and Oluoch, Cheryl Akinyi and Peacock, Jesse and Pearey, India O.C. and Peck, Naomi and Petit, Stephanie and Pieper, Sören and Poblete, Mariana and Prestipino, Daniel and Raabe, Linda and Raja, Amna and Reimringer, Janis and Rey, Sydney C. and Rizaew, Julia and Ruppert, Eloisa and Salmon, Kim K. and Sammet, Jill and Schembri, Rhiannon and Schlabbach, Lars and Schmidt, Frederick W.P. and Skilton, Amalia and Smith, Wikaliler Daniel and de  Sousa, Hilário and Sverredal, Kristin and Valle, Daniel and Vera, Javier and Voß, Judith and Witte, Tim and Wu, Henry and Yam, Stephanie and Ye 葉婧婷, Jingting and Yong, Maisie and Yuditha, Tessa and Zariquiey, Roberto and Forkel, Robert and Evans, Nicholas and Levinson, Stephen C. and Haspelmath, Martin and Greenhill, Simon J. and Atkinson, Quentin D. and Gray, Russell D. (in press). Grambank reveals the importance of genealogical constraints on linguistic diversity and highlights the impact of language loss. Science Advances.

Let’s start by extracting a list of languages in this dataset:

languages <- grambank$tables$LanguageTable |>
    select(ID, Name, Macroarea, Latitude, Longitude)    
    # only selecting some columns above to make it easier to see    
languages    
#> # A tibble: 2,467 × 5
#>    ID       Name         Macroarea     Latitude Longitude
#>    <chr>    <chr>        <chr>            <dbl>     <dbl>
#>  1 abad1241 Abadi        Papunesia       -9.03     147.  
#>  2 abar1238 Mungbam      Africa           6.58      10.2 
#>  3 abau1245 Abau         Papunesia       -3.97     141.  
#>  4 abee1242 Abé          Africa           5.60      -4.38
#>  5 aben1249 Abenlen Ayta Papunesia       15.4      120.  
#>  6 abip1241 Abipon       South America  -29        -61   
#>  7 abkh1244 Abkhaz       Eurasia         43.1       41.2 
#>  8 abua1245 Abu' Arapesh Papunesia       -3.46     143.  
#>  9 abui1241 Abui         Papunesia       -8.31     125.  
#> 10 abun1252 Abun         Papunesia       -0.571    132.  
#> # ℹ 2,457 more rows

We can also look at the ParameterTable to see information on our parameter of interest: “GB028: Is there a distinction between inclusive and exclusive?”. The ID of this feature is “GB028”, so let’s just see that one:

grambank$tables$ParameterTable |> filter(ID=='GB028')
#> # A tibble: 1 × 12
#>   ID    Name           Description ColumnSpec Patrons Grambank_ID_desc Boundness
#>   <chr> <chr>          <chr>       <chr>      <chr>   <chr>            <chr>    
#> 1 GB028 Is there a di… "## Is the… <NA>       HJH     GB028 Clusivity  <NA>     
#> # ℹ 5 more variables: Flexivity <chr>, Gender_or_Noun_Class <chr>,
#> #   Locus_of_Marking <chr>, Word_Order <chr>, Informativity <chr>

Ok, now let’s get all the Values for this Parameter from the ValueTable. Following the CLDF standard, each value is linked to its parameter via the Parameter_ID column, so let’s select all the rows in ValueTable where Parameter_ID is ‘GB028’:

values <- grambank$tables$ValueTable |> 
    filter(Parameter_ID=='GB028') |>
    select(ID, Language_ID, Parameter_ID, Value, Source)
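
Before moving on, a quick sanity check on how this feature is coded (output omitted; the 0/1 coding is visible in the tables below):

values |> count(Value)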

Now we need to get the data from D-PLACE. We’ll use the GitHub repository to get these data, to show you how that works, but you should probably use a published version from Zenodo for proper work. Take the GitHub repository link https://github.com/D-PLACE/dplace-cldf and give that to rcldf too:

dplace <- cldf("https://github.com/D-PLACE/dplace-cldf")
summary(dplace)
#> A Cross-Linguistic Data Format (CLDF) dataset:
#> Name: D-PLACE aggregated dataset
#> Path: /Users/simon/Library/Caches/org.R-project.R/R/rcldf/github_D_PLACE_dplace_cldf_266fef649c0d27ffcdc6502c39bad1fa/D-PLACE-dplace-cldf-c00e58b/cldf
#> Type: http://cldf.clld.org/v1.0/terms.rdf#StructureDataset
#> Tables:
#>   1/7: CodeTable (5 columns, 15450 rows)
#>   2/7: ContributionTable (8 columns, 122 rows)
#>   3/7: LanguageTable (20 columns, 6174 rows)
#>   4/7: MediaTable (6 columns, 109 rows)
#>   5/7: ParameterTable (11 columns, 2987 rows)
#>   6/7: TreeTable (9 columns, 109 rows)
#>   7/7: ValueTable (11 columns, 631668 rows)
#> Sources: 0
#> Cite:
#>  Kathryn R. Kirby, Russell D. Gray, Simon J. Greenhill, Fiona M. Jordan, Stephanie Gomes-Ng, Hans-Jörg Bibiko, Damián E. Blasi, Carlos A. Botero, Claire Bowern, Carol R. Ember, Dan Leehr, Bobbi S. Low, Joe McCarter, William Divale, and Michael C. Gavin. (2016). D-PLACE: A Global Database of Cultural, Linguistic and Environmental Diversity. PLoS ONE, 11(7): e0158391. doi:10.1371/journal.pone.0158391.

Great, let’s get the values for the parameter we want. D-PLACE indexes the variables in a column Var_ID:

dplace$tables$ValueTable |> filter(Var_ID=='EA113')
#> # A tibble: 1,291 × 11
#>    ID        Soc_ID Var_ID Value    Code_ID  Comment Source sub_case       year 
#>    <chr>     <chr>  <chr>  <chr>    <chr>    <chr>   <chr>  <chr>          <chr>
#>  1 ea-117482 Aa1    EA113  Flexible EA113-2  <NA>    <NA>   Nyai Nyae reg… 1950 
#>  2 ea-117483 Aa2    EA113  <NA>     EA113-NA <NA>    <NA>   <NA>           <NA> 
#>  3 ea-117484 Aa3    EA113  <NA>     EA113-NA <NA>    <NA>   <NA>           <NA> 
#>  4 ea-117485 Aa4    EA113  <NA>     EA113-NA <NA>    <NA>   <NA>           <NA> 
#>  5 ea-117486 Aa5    EA113  Flexible EA113-2  <NA>    <NA>   Epulu net-hun… 1930 
#>  6 ea-117487 Aa6    EA113  <NA>     EA113-NA <NA>    <NA>   <NA>           <NA> 
#>  7 ea-117488 Aa7    EA113  <NA>     EA113-NA <NA>    <NA>   <NA>           <NA> 
#>  8 ea-117489 Aa8    EA113  <NA>     EA113-NA <NA>    <NA>   <NA>           <NA> 
#>  9 ea-117490 Aa9    EA113  <NA>     EA113-NA <NA>    <NA>   <NA>           <NA> 
#> 10 ea-117491 Ab1    EA113  <NA>     EA113-NA <NA>    <NA>   <NA>           <NA> 
#> # ℹ 1,281 more rows
#> # ℹ 2 more variables: source_coded_data <chr>, admin_comment <chr>

We want to merge this with our Grambank data, but we need a shared key to do so. D-PLACE stores glottocode information that maps each society to a language; this column also appears in Grambank’s LanguageTable. So, let’s get the trait values for EA113 and add the glottocodes from the D-PLACE LanguageTable using a join:

# get languages from DPLACE
dplanguages <- dplace$tables$LanguageTable |> select(ID, Glottocode)

# get values for EA113 and merge with language information
ea113 <- dplace$tables$ValueTable |> 
    filter(Var_ID=='EA113') |> 
    select(Soc_ID, Value) |>
    left_join(dplanguages, join_by(Soc_ID==ID))

# rename `Value` to EA113
ea113 <- ea113 |> mutate(EA113=Value) |> select(Glottocode, EA113)

head(ea113)
#> # A tibble: 6 × 2
#>   Glottocode EA113   
#>   <chr>      <chr>   
#> 1 juho1239   Flexible
#> 2 okie1245   <NA>    
#> 3 nama1265   <NA>    
#> 4 dama1270   <NA>    
#> 5 bila1255   Flexible
#> 6 sand1273   <NA>

Now get the Grambank data into the same format:

gb028 <- values |> 
    mutate(Glottocode=Language_ID, GB028=Value) |>
    select(Glottocode, GB028)
head(gb028)
#> # A tibble: 6 × 2
#>   Glottocode GB028
#>   <chr>      <chr>
#> 1 abad1241   1    
#> 2 abar1238   0    
#> 3 abau1245   0    
#> 4 abee1242   0    
#> 5 aben1249   1    
#> 6 abip1241   0

…and finally join them up on the shared column ‘Glottocode’. I’ll use an inner join here to keep only the languages/societies that are in both datasets, and na.omit to keep only rows that have data for both variables:

df <- gb028 |> inner_join(ea113) |> na.omit()
#> Joining with `by = join_by(Glottocode)`
head(df)
#> # A tibble: 6 × 3
#>   Glottocode GB028 EA113   
#>   <chr>      <chr> <chr>   
#> 1 bila1255   0     Flexible
#> 2 chig1238   0     Flexible
#> 3 fefe1239   0     Flexible
#> 4 gand1255   1     Rigid   
#> 5 hehe1240   0     Flexible
#> 6 juku1254   0     Rigid

Ok, it looks like we only have 17 rows for this pairing. We can confirm that directly:
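nrow(df)
#> [1] 17

That’s a little small, but let’s plot the data: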

p1 <- ggplot(df, aes(x=GB028)) + geom_bar()
p2 <- ggplot(df, aes(x=EA113)) + geom_bar()

p1 / p2   #  patchwork

Hmm. It looks like societies that do not mark this distinction between inclusive and exclusive tend to be those with flexible social structure. Looking good for our hypothesis, but let’s test it formally to make sure we’re not seeing a chance pattern:

tab <- table(df$GB028, df$EA113)
tab
#>    
#>     Flexible Rigid
#>   0        9     4
#>   1        3     1

chisq.test(tab)
#> Warning in chisq.test(tab): Chi-squared approximation may be incorrect
#> 
#>  Pearson's Chi-squared test with Yates' continuity correction
#> 
#> data:  tab
#> X-squared = 4.986e-31, df = 1, p-value = 1
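
Note the warning: with cell counts this small, the chi-squared approximation is unreliable. On a 2×2 table like this, Fisher’s exact test is the safer choice (output omitted; on these counts it is likewise non-significant):

fisher.test(tab)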

…and the probability of getting a pattern like this by chance is 1.0. So yes, this is just a chance result, and there is no evidence in these data that languages which distinguish inclusive from exclusive also tend to be those with rigid social structures. There goes our big paper in Nature/Science.

References:

Forkel, Robert, Johann-Mattis List, Simon J. Greenhill, Christoph Rzymski, Sebastian Bank, Michael Cysouw, Harald Hammarström, Martin Haspelmath, Gereon A. Kaiping, and Russell D. Gray. 2018. “Cross-Linguistic Data Formats, Advancing Data Sharing and Re-Use in Comparative Linguistics.” Scientific Data 5: 180205.

McElhanon, K. A. 1967. “Preliminary Observations on Huon Peninsula Languages.” Oceanic Linguistics 6 (1): 1–45.