library(rcldf)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(ggplot2)
library(patchwork)
theme_set(theme_classic())
The rcldf package provides tools to read and interact with CLDF datasets in R. This vignette demonstrates how to install the package, load datasets, and use them.
Cross-Linguistic Data Formats (CLDF, Forkel et al. 2018) is a standardized data format designed to handle cross-linguistic and cross-cultural datasets. CLDF provides a consistent specification and package format for common types of linguistic and cultural data (e.g. word lists, grammatical features, cultural traits). The aim of the format is to provide a simple, reliable data format that facilitates the sharing and re-use of these data, making new analyses possible.
There are currently more than 250 CLDF datasets available containing data from the world’s languages and cultures, including everything from catalogues of linguistic metadata (e.g. Glottolog, EndangeredLanguages.com), to word lists of lexical data (e.g. Lexibank), grammatical features (e.g. WALS and Grambank), phonetic information (e.g. Phoible), geographic information (e.g. Wurm & Hattori 1981/1983), and religious and cultural databases (e.g. D-PLACE and Pulotu).
You can install rcldf directly from GitHub using the devtools package:
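The installation call is missing from the extracted text; a minimal sketch, assuming the repository lives under Simon Greenhill's GitHub account (check the package README for the canonical location):

```r
# install.packages("devtools")  # if devtools is not yet installed
# repository path is an assumption -- see the rcldf README
devtools::install_github("SimonGreenhill/rcldf", dependencies = TRUE)
```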
Once installed, load a CLDF dataset by creating a cldf object from a dataset. You can point to the dataset using a path or URL where the dataset is located:
library(rcldf)
# Load from a local directory:
ds <- cldf('/path/to/dir/wals_1a_cldf')
# or load from a specific metadata file:
ds <- cldf('/path/to/dir/wals_1a_cldf/StructureDataset-metadata.json')
# or load from Zenodo:
ds <- cldf("https://zenodo.org/record/7844558/files/grambank/grambank-v1.0.3.zip?download=1")
# or load from GitHub:
ds <- cldf('https://github.com/lexibank/abvd')
Once loaded, a cldf object has various bits of information. To show this I’ll use a small dataset that comes packaged with rcldf. This dataset contains a wordlist from a number of Huon Peninsula languages from New Guinea, originally detailed in McElhanon (1967):
library(rcldf)
ds <- cldf(system.file("extdata/huon", package="rcldf"))
# this dataset has 4 tables:
ds
#> A CLDF dataset with 4 tables (CognateTable, FormTable, LanguageTable, ParameterTable)
#>
#> McElhanon, K.A. 1967. Preliminary Observations on Huon Peninsula Languages. Oceanic Linguistics. 6, 1-45.
# more details:
summary(ds)
#> A Cross-Linguistic Data Format (CLDF) dataset:
#> Name: McElhanon 1967 Huon Peninsula data
#> Path: /Users/simon/src/rlib/rcldf/extdata/huon
#> Type: http://cldf.clld.org/v1.0/terms.rdf#Wordlist
#> Tables:
#> 1/4: CognateTable (10 columns, 1960 rows)
#> 2/4: FormTable (11 columns, 1960 rows)
#> 3/4: LanguageTable (9 columns, 14 rows)
#> 4/4: ParameterTable (4 columns, 140 rows)
#> Sources: 0
#> Cite:
#> McElhanon, K.A. 1967. Preliminary Observations on Huon Peninsula Languages. Oceanic Linguistics. 6, 1-45.
So here we have a dataset with tables of languages, parameters (= words), forms (= the lexical items), and cognates (= cognacy information showing how the lexical items are related). There is also other information here, e.g. the citation for where the dataset came from, the path where the dataset is stored, and which Type of CLDF specification this dataset adheres to.
Each table is attached to the ds$tables list, so to access them you call ds$tables$<tablename>. These are simply dataframes (or tibbles), so you can then do anything you want with them:
names(ds$tables)
#> [1] "FormTable" "LanguageTable" "ParameterTable" "CognateTable"
# let's look at the languages --
head(ds$tables$LanguageTable)
#> # A tibble: 6 × 9
#> ID Name Glottocode Glottolog_Name ISO639P3code Macroarea Latitude
#> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
#> 1 borong Kosorong boro1279 <NA> ksr <NA> NA
#> 2 burum Burum buru1306 <NA> bmu <NA> NA
#> 3 dedua Dedua dedu1240 <NA> ded <NA> NA
#> 4 kate Kâte kate1253 <NA> kmg <NA> NA
#> 5 komba Komba komb1273 <NA> kpf <NA> NA
#> 6 kube Hube kube1244 <NA> kgf <NA> NA
#> # ℹ 2 more variables: Longitude <dbl>, Family <chr>
# and the parameters - in this case the words in the wordlist
head(ds$tables$ParameterTable)
#> # A tibble: 6 × 4
#> ID Name Concepticon_ID Concepticon_Gloss
#> <chr> <chr> <chr> <chr>
#> 1 i I 1209 I
#> 2 thou thou 1215 THOU
#> 3 we we 1212 WE
#> 4 who who 1235 WHO
#> 5 what what 1236 WHAT
#> 6 that that 78 THAT
# note that tables which are not part of this dataset return NULL:
ds$tables$ValueTable
#> NULL
CLDF datasets have sources stored in BibTeX format. We don’t load them by default, as it can take a long time to parse the BibTeX file correctly. You can load and access them like this, and the sources are then available in ds$sources in bib2df format:
ds <- cldf(system.file("extdata/huon", package="rcldf"), load_bib=TRUE)
# or if you loaded the CLDF without sources the first time you can add them now:
ds <- read_bib(ds)
ds$sources
#> # A tibble: 1 × 26
#> CATEGORY BIBTEXKEY ADDRESS ANNOTE AUTHOR BOOKTITLE CHAPTER CROSSREF EDITION
#> <chr> <chr> <chr> <chr> <list> <chr> <chr> <chr> <chr>
#> 1 ARTICLE McElhanon19… <NA> <NA> <chr> <NA> <NA> <NA> <NA>
#> # ℹ 17 more variables: EDITOR <list>, HOWPUBLISHED <chr>, INSTITUTION <chr>,
#> # JOURNAL <chr>, KEY <chr>, MONTH <chr>, NOTE <chr>, NUMBER <chr>,
#> # ORGANIZATION <chr>, PAGES <chr>, PUBLISHER <chr>, SCHOOL <chr>,
#> # SERIES <chr>, TITLE <chr>, TYPE <chr>, VOLUME <chr>, YEAR <dbl>
Sometimes people want to have all the data from a CLDF dataset as one dataframe. Use as.cldf.wide to do this, passing it the name of a table to act as the base.
This will take the base table and resolve all foreign keys (usually *_ID) into their own columns. Name clashes between tables are resolved by appending the table name (e.g. the column Name in the original CodeTable will become Name.CodeTable).
For example, this dataset has a FormTable which connects to the ParameterTable via Parameter_ID and to the LanguageTable via Language_ID.
Using as.cldf.wide we can combine all the information from ParameterTable and LanguageTable into the FormTable:
as.cldf.wide(ds, 'FormTable')
#> Joining Language_ID -> LanguageTable -> ID
#> Joining Parameter_ID -> ParameterTable -> ID
#> # A tibble: 1,960 × 22
#> ID Local_ID Language_ID Parameter_ID Value Form Segments Comment Source
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 borong… 90452 borong i ni ni n i I McElh…
#> 2 borong… 91684 borong where daaŋ… daaŋ… d a a ŋ… where McElh…
#> 3 borong… 90466 borong thou gi gi g i thou McElh…
#> 4 borong… 91978 borong itsflower dzur… dzur… d z u r… (its) … McElh…
#> 5 borong… 90480 borong we nono nono n o n o we McElh…
#> 6 borong… 91810 borong hethrows mesa… mesa… m e s a… (he) t… McElh…
#> 7 borong… 90578 borong hesays dzo- dzo- d z o - (he) s… McElh…
#> 8 borong… 92076 borong itswing eŋga… eŋga… e ŋ g a… (its) … McElh…
#> 9 borong… 90704 borong hegivesme non- non- n o n - (he) g… McElh…
#> 10 borong… 92202 borong dull dzit… dzit… d z i t… dull McElh…
#> # ℹ 1,950 more rows
#> # ℹ 13 more variables: Cognacy <chr>, Loan <lgl>, Name.LanguageTable <chr>,
#> # Glottocode <chr>, Glottolog_Name <chr>, ISO639P3code <chr>,
#> # Macroarea <chr>, Latitude <dbl>, Longitude <dbl>, Family <chr>,
#> # Name.ParameterTable <chr>, Concepticon_ID <chr>, Concepticon_Gloss <chr>
Sometimes you just want to get one table. To do this, call get_table_from with the table type and the dataset path:
get_table_from('LanguageTable', system.file("extdata/huon", package="rcldf"))
#> # A tibble: 14 × 9
#> ID Name Glottocode Glottolog_Name ISO639P3code Macroarea Latitude
#> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
#> 1 borong Kosorong boro1279 <NA> ksr <NA> NA
#> 2 burum Burum buru1306 <NA> bmu <NA> NA
#> 3 dedua Dedua dedu1240 <NA> ded <NA> NA
#> 4 kate Kâte kate1253 <NA> kmg <NA> NA
#> 5 komba Komba komb1273 <NA> kpf <NA> NA
#> 6 kube Hube kube1244 <NA> kgf <NA> NA
#> 7 mape Mape mape1249 <NA> mlh <NA> NA
#> 8 mesem Mese mese1244 <NA> mci <NA> NA
#> 9 mindik Mindik buru1306 <NA> bmu <NA> NA
#> 10 nabak Nabak naba1256 <NA> naf <NA> NA
#> 11 ono Ono onoo1246 <NA> ons <NA> NA
#> 12 selepet Selepet sele1250 <NA> spl <NA> NA
#> 13 timbe Timbe timb1251 <NA> tim <NA> NA
#> 14 tobo Tobo tobo1251 <NA> kgf <NA> NA
#> # ℹ 2 more variables: Longitude <dbl>, Family <chr>
Perhaps you want to know which version of a dataset you have without loading the whole dataset. Use get_details:
get_details(system.file("extdata/huon", package="rcldf"))
#> Title Path
#> 1 McElhanon 1967 Huon Peninsula data /Users/simon/src/rlib/rcldf/extdata/huon
#> Size
#> 1 299051
#> Citation
#> 1 McElhanon, K.A. 1967. Preliminary Observations on Huon Peninsula Languages. Oceanic Linguistics. 6, 1-45.
#> ConformsTo
#> 1 http://cldf.clld.org/v1.0/terms.rdf#Wordlist
rcldf has a couple of utility functions to get the latest versions of the Glottolog and Concepticon CLDF reference catalogues:
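The calls are missing from the extracted text; a sketch, assuming the helpers are named get_glottolog() and get_concepticon() (these names are assumptions -- check the rcldf package index for the exact API):

```r
# hypothetical helper names -- verify against the rcldf documentation
glottolog <- get_glottolog()      # latest Glottolog CLDF release
concepticon <- get_concepticon()  # latest Concepticon CLDF release
```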
When you load a dataset from a URL, rcldf downloads the dataset and unpacks it to a cache directory. By default this is a temporary directory which will be deleted when you close R. However, by specifying a directory or using tools::R_user_dir("rcldf", which = "cache") you can re-use the dataset later.
To see where downloads will be saved:
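The call is missing from the extracted text; a plausible sketch, assuming a get_cache_dir() helper (the name is an assumption -- check the package index):

```r
# hypothetical helper -- returns the directory downloads are unpacked to
get_cache_dir()
```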
You can set the cache directory for the current session like this:
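The code here is missing from the extracted text; since the package reads the RCLDF_CACHE_DIR environment variable, one way to set it for the session is:

```r
# set the cache directory via the environment variable rcldf reads
Sys.setenv(RCLDF_CACHE_DIR = "/path/somewhere")
```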
If you want to set this permanently, edit your R environ file to add the line RCLDF_CACHE_DIR=/path/somewhere.
To see what datasets you’ve downloaded:
list_cache_files()
#> Title
#> 1 NA
#> 2 NA
#> Path
#> 1 /Users/simon/src/cldf/zenodo_api_records_10997741_files_cldf_clts_clts_v2_3_0_zip__b57347f1b05b04c6d6108a111ca3720a/cldf-clts-clts-1c0b886/cldf-metadata.json
#> 2 /Users/simon/src/cldf/zenodo_api_records_10997741_files_cldf_clts_clts_v2_3_0_zip__b57347f1b05b04c6d6108a111ca3720a/cldf-clts-clts-1c0b886/pkg/transcriptionsystems/transcription-system-metadata.json
#> Size Citation ConformsTo
#> 1 27744 NA http://cldf.clld.org/v1.0/terms.rdf#Generic
#> 2 14636 NA <NA>
You can re-use datasets in your cache:
cldf(list_cache_files()[1, 'Path'])
#> A CLDF dataset with 4 tables (features.tsv, graphemes.tsv, index.tsv, sounds.tsv)
#>
#>
Or just save them to a particular directory:
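The call is missing from the extracted text; a sketch, assuming cldf() accepts a cache_dir argument (the argument name is an assumption -- check ?cldf):

```r
# cache_dir is a hypothetical argument name -- see ?cldf for the exact API
ds <- cldf(
  "https://github.com/lexibank/abvd",
  cache_dir = "~/data/cldf"
)
```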
To show you how to use rcldf to analyse datasets, we’re going to test whether languages that distinguish inclusive pronouns from exclusive pronouns (i.e. clusivity, https://en.wikipedia.org/wiki/Clusivity) also tend to have high rigidity in social structure.
To do this, we will use the Grambank Feature GB028: Is there a distinction between inclusive and exclusive?, and the D-PLACE Trait EA113: Degree of rigidity in social structures.
First, let’s get the published version of Grambank off Zenodo. It’s always a good idea to use the published version as this is a citable and versioned product which makes replicating your analysis easier for other researchers. To do this, go to the Zenodo page for Grambank, and find the download link under the ‘Files’ section and copy it.
It will look like this: https://zenodo.org/records/7844558/files/grambank/grambank-v1.0.3.zip?download=1. Give that to rcldf:
grambank <- cldf("https://zenodo.org/records/7844558/files/grambank/grambank-v1.0.3.zip?download=1")
grambank
#> A CLDF dataset with 6 tables (CodeTable, contributors.csv, families.csv, LanguageTable, ParameterTable, ValueTable)
#>
#> Skirgård, Hedvig and Haynie, Hannah J. and Blasi, Damián E. and Hammarström, Harald and Collins, Jeremy and Latarche, Jay J. and Lesage, Jakob and Weber, Tobias and Witzlack-Makarevich, Alena and Passmore, Sam and Chira, Angela and Maurits, Luke and Dinnage, Russell and Dunn, Michael and Reesink, Ger and Singer, Ruth and Bowern, Claire and Epps, Patience and Hill, Jane and Vesakoski, Outi and Robbeets, Martine and Abbas, Noor Karolin and Auer, Daniel and Bakker, Nancy A. and Barbos, Giulia and Borges, Robert D. and Danielsen, Swintha and Dorenbusch, Luise and Dorn, Ella and Elliott, John and Falcone, Giada and Fischer, Jana and Ghanggo Ate, Yustinus and Gibson, Hannah and Göbel, Hans-Philipp and Goodall, Jemima A. and Gruner, Victoria and Harvey, Andrew and Hayes, Rebekah and Heer, Leonard and Herrera Miranda, Roberto E. and Hübler, Nataliia and Huntington-Rainey, Biu and Ivani, Jessica K. and Johns, Marilen and Just, Erika and Kashima, Eri and Kipf, Carolina and Klingenberg, Janina V. and König, Nikita and Koti, Aikaterina and Kowalik, Richard G. A. and Krasnoukhova, Olga and Lindvall, Nora L.M. and Lorenzen, Mandy and Lutzenberger, Hannah and Martins, Tônia R.A. and Mata German, Celia and van der Meer, Suzanne and Montoya Samamé, Jaime and Müller, Michael and Muradoglu, Saliha and Neely, Kelsey and Nickel, Johanna and Norvik, Miina and Oluoch, Cheryl Akinyi and Peacock, Jesse and Pearey, India O.C. and Peck, Naomi and Petit, Stephanie and Pieper, Sören and Poblete, Mariana and Prestipino, Daniel and Raabe, Linda and Raja, Amna and Reimringer, Janis and Rey, Sydney C. and Rizaew, Julia and Ruppert, Eloisa and Salmon, Kim K. and Sammet, Jill and Schembri, Rhiannon and Schlabbach, Lars and Schmidt, Frederick W.P. 
and Skilton, Amalia and Smith, Wikaliler Daniel and de Sousa, Hilário and Sverredal, Kristin and Valle, Daniel and Vera, Javier and Voß, Judith and Witte, Tim and Wu, Henry and Yam, Stephanie and Ye 葉婧婷, Jingting and Yong, Maisie and Yuditha, Tessa and Zariquiey, Roberto and Forkel, Robert and Evans, Nicholas and Levinson, Stephen C. and Haspelmath, Martin and Greenhill, Simon J. and Atkinson, Quentin D. and Gray, Russell D. (in press). Grambank reveals the importance of genealogical constraints on linguistic diversity and highlights the impact of language loss. Science Advances.
Now, let’s use dplyr to get the variable we want from Grambank. We use summary to see what the dataset looks like. It is a CLDF ‘Structure’ dataset with six tables:
summary(grambank)
#> A Cross-Linguistic Data Format (CLDF) dataset:
#> Name: Grambank v1.0
#> Path: /Users/simon/Library/Caches/org.R-project.R/R/rcldf/zenodo_records_7844558_files_grambank_grambank_v1_0_3_zip_5ba67f1e8557c79b5e4fffaab8f58bcc/grambank-grambank-7ae000c/cldf
#> Type: http://cldf.clld.org/v1.0/terms.rdf#StructureDataset
#> Tables:
#> 1/6: CodeTable (4 columns, 398 rows)
#> 2/6: contributors.csv (5 columns, 139 rows)
#> 3/6: families.csv (2 columns, 215 rows)
#> 4/6: LanguageTable (13 columns, 2467 rows)
#> 5/6: ParameterTable (12 columns, 195 rows)
#> 6/6: ValueTable (9 columns, 441663 rows)
#> Sources: 0
#> Cite:
#> Skirgård, Hedvig and Haynie, Hannah J. and Blasi, Damián E. and Hammarström, Harald and Collins, Jeremy and Latarche, Jay J. and Lesage, Jakob and Weber, Tobias and Witzlack-Makarevich, Alena and Passmore, Sam and Chira, Angela and Maurits, Luke and Dinnage, Russell and Dunn, Michael and Reesink, Ger and Singer, Ruth and Bowern, Claire and Epps, Patience and Hill, Jane and Vesakoski, Outi and Robbeets, Martine and Abbas, Noor Karolin and Auer, Daniel and Bakker, Nancy A. and Barbos, Giulia and Borges, Robert D. and Danielsen, Swintha and Dorenbusch, Luise and Dorn, Ella and Elliott, John and Falcone, Giada and Fischer, Jana and Ghanggo Ate, Yustinus and Gibson, Hannah and Göbel, Hans-Philipp and Goodall, Jemima A. and Gruner, Victoria and Harvey, Andrew and Hayes, Rebekah and Heer, Leonard and Herrera Miranda, Roberto E. and Hübler, Nataliia and Huntington-Rainey, Biu and Ivani, Jessica K. and Johns, Marilen and Just, Erika and Kashima, Eri and Kipf, Carolina and Klingenberg, Janina V. and König, Nikita and Koti, Aikaterina and Kowalik, Richard G. A. and Krasnoukhova, Olga and Lindvall, Nora L.M. and Lorenzen, Mandy and Lutzenberger, Hannah and Martins, Tônia R.A. and Mata German, Celia and van der Meer, Suzanne and Montoya Samamé, Jaime and Müller, Michael and Muradoglu, Saliha and Neely, Kelsey and Nickel, Johanna and Norvik, Miina and Oluoch, Cheryl Akinyi and Peacock, Jesse and Pearey, India O.C. and Peck, Naomi and Petit, Stephanie and Pieper, Sören and Poblete, Mariana and Prestipino, Daniel and Raabe, Linda and Raja, Amna and Reimringer, Janis and Rey, Sydney C. and Rizaew, Julia and Ruppert, Eloisa and Salmon, Kim K. and Sammet, Jill and Schembri, Rhiannon and Schlabbach, Lars and Schmidt, Frederick W.P. 
and Skilton, Amalia and Smith, Wikaliler Daniel and de Sousa, Hilário and Sverredal, Kristin and Valle, Daniel and Vera, Javier and Voß, Judith and Witte, Tim and Wu, Henry and Yam, Stephanie and Ye 葉婧婷, Jingting and Yong, Maisie and Yuditha, Tessa and Zariquiey, Roberto and Forkel, Robert and Evans, Nicholas and Levinson, Stephen C. and Haspelmath, Martin and Greenhill, Simon J. and Atkinson, Quentin D. and Gray, Russell D. (in press). Grambank reveals the importance of genealogical constraints on linguistic diversity and highlights the impact of language loss. Science Advances.
Let’s start by extracting a list of languages in this dataset:
languages <- grambank$tables$LanguageTable |>
select(ID, Name, Macroarea, Latitude, Longitude)
# only selecting some columns above to make it easier to see
languages
#> # A tibble: 2,467 × 5
#> ID Name Macroarea Latitude Longitude
#> <chr> <chr> <chr> <dbl> <dbl>
#> 1 abad1241 Abadi Papunesia -9.03 147.
#> 2 abar1238 Mungbam Africa 6.58 10.2
#> 3 abau1245 Abau Papunesia -3.97 141.
#> 4 abee1242 Abé Africa 5.60 -4.38
#> 5 aben1249 Abenlen Ayta Papunesia 15.4 120.
#> 6 abip1241 Abipon South America -29 -61
#> 7 abkh1244 Abkhaz Eurasia 43.1 41.2
#> 8 abua1245 Abu' Arapesh Papunesia -3.46 143.
#> 9 abui1241 Abui Papunesia -8.31 125.
#> 10 abun1252 Abun Papunesia -0.571 132.
#> # ℹ 2,457 more rows
We can also look at the ParameterTable to see information on our parameter of interest: “GB028: Is there a distinction between inclusive and exclusive?”. The ID of this feature is “GB028”, so let’s just see that one:
grambank$tables$ParameterTable |> filter(ID=='GB028')
#> # A tibble: 1 × 12
#> ID Name Description ColumnSpec Patrons Grambank_ID_desc Boundness
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 GB028 Is there a di… "## Is the… <NA> HJH GB028 Clusivity <NA>
#> # ℹ 5 more variables: Flexivity <chr>, Gender_or_Noun_Class <chr>,
#> # Locus_of_Marking <chr>, Word_Order <chr>, Informativity <chr>
Ok, now let’s get all the Values for this Parameter in ValueTable. Following the CLDF standard, the Parameter is referenced in the ValueTable via the Parameter_ID column, so let’s select all the rows in ValueTable with a Parameter_ID of GB028:
values <- grambank$tables$ValueTable |>
filter(Parameter_ID=='GB028') |>
select(ID, Language_ID, Parameter_ID, Value, Source)
Now we need to get the data from D-PLACE. We’ll use the GitHub repository to get these data to show you how that works, but you should probably use the published version on Zenodo for proper work. Get the GitHub repository link https://github.com/D-PLACE/dplace-cldf and give that to rcldf too:
dplace <- cldf("https://github.com/D-PLACE/dplace-cldf")
summary(dplace)
#> A Cross-Linguistic Data Format (CLDF) dataset:
#> Name: D-PLACE aggregated dataset
#> Path: /Users/simon/Library/Caches/org.R-project.R/R/rcldf/github_D_PLACE_dplace_cldf_266fef649c0d27ffcdc6502c39bad1fa/D-PLACE-dplace-cldf-c00e58b/cldf
#> Type: http://cldf.clld.org/v1.0/terms.rdf#StructureDataset
#> Tables:
#> 1/7: CodeTable (5 columns, 15450 rows)
#> 2/7: ContributionTable (8 columns, 122 rows)
#> 3/7: LanguageTable (20 columns, 6174 rows)
#> 4/7: MediaTable (6 columns, 109 rows)
#> 5/7: ParameterTable (11 columns, 2987 rows)
#> 6/7: TreeTable (9 columns, 109 rows)
#> 7/7: ValueTable (11 columns, 631668 rows)
#> Sources: 0
#> Cite:
#> Kathryn R. Kirby, Russell D. Gray, Simon J. Greenhill, Fiona M. Jordan, Stephanie Gomes-Ng, Hans-Jörg Bibiko, Damián E. Blasi, Carlos A. Botero, Claire Bowern, Carol R. Ember, Dan Leehr, Bobbi S. Low, Joe McCarter, William Divale, and Michael C. Gavin. (2016). D-PLACE: A Global Database of Cultural, Linguistic and Environmental Diversity. PLoS ONE, 11(7): e0158391. doi:10.1371/journal.pone.0158391.
Great, let’s get the values for the parameter we want. D-PLACE indexes the variables in a column Var_ID:
dplace$tables$ValueTable |> filter(Var_ID=='EA113')
#> # A tibble: 1,291 × 11
#> ID Soc_ID Var_ID Value Code_ID Comment Source sub_case year
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 ea-117482 Aa1 EA113 Flexible EA113-2 <NA> <NA> Nyai Nyae reg… 1950
#> 2 ea-117483 Aa2 EA113 <NA> EA113-NA <NA> <NA> <NA> <NA>
#> 3 ea-117484 Aa3 EA113 <NA> EA113-NA <NA> <NA> <NA> <NA>
#> 4 ea-117485 Aa4 EA113 <NA> EA113-NA <NA> <NA> <NA> <NA>
#> 5 ea-117486 Aa5 EA113 Flexible EA113-2 <NA> <NA> Epulu net-hun… 1930
#> 6 ea-117487 Aa6 EA113 <NA> EA113-NA <NA> <NA> <NA> <NA>
#> 7 ea-117488 Aa7 EA113 <NA> EA113-NA <NA> <NA> <NA> <NA>
#> 8 ea-117489 Aa8 EA113 <NA> EA113-NA <NA> <NA> <NA> <NA>
#> 9 ea-117490 Aa9 EA113 <NA> EA113-NA <NA> <NA> <NA> <NA>
#> 10 ea-117491 Ab1 EA113 <NA> EA113-NA <NA> <NA> <NA> <NA>
#> # ℹ 1,281 more rows
#> # ℹ 2 more variables: source_coded_data <chr>, admin_comment <chr>
We want to merge this with our Grambank data but need a way to do this. D-PLACE stores glottocode information that maps each society to a language. This column is also in the Languages table in Grambank. So, let’s get the trait values for EA113 and add the glottocodes from LanguageTable using a join:
# get languages from DPLACE
dplanguages <- dplace$tables$LanguageTable |> select(ID, Glottocode)
# get values for EA113 and merge with language information
ea113 <- dplace$tables$ValueTable |>
filter(Var_ID=='EA113') |>
select(Soc_ID, Value) |>
left_join(dplanguages, join_by(Soc_ID==ID))
# rename `Value` to EA113
ea113 <- ea113 |> mutate(EA113=Value) |> select(Glottocode, EA113)
head(ea113)
#> # A tibble: 6 × 2
#> Glottocode EA113
#> <chr> <chr>
#> 1 juho1239 Flexible
#> 2 okie1245 <NA>
#> 3 nama1265 <NA>
#> 4 dama1270 <NA>
#> 5 bila1255 Flexible
#> 6 sand1273 <NA>
Now get the grambank data into the same format:
gb028 <- values |>
mutate(Glottocode=Language_ID, GB028=Value) |>
select(Glottocode, GB028)
head(gb028)
#> # A tibble: 6 × 2
#> Glottocode GB028
#> <chr> <chr>
#> 1 abad1241 1
#> 2 abar1238 0
#> 3 abau1245 0
#> 4 abee1242 0
#> 5 aben1249 1
#> 6 abip1241 0
…and finally join them up using the shared column ‘Glottocode’. I’ll use an inner join here to only keep the languages/societies that are in both datasets, and na.omit to only keep rows that have data for both variables:
df <- gb028 |> inner_join(ea113) |> na.omit()
#> Joining with `by = join_by(Glottocode)`
head(df)
#> # A tibble: 6 × 3
#> Glottocode GB028 EA113
#> <chr> <chr> <chr>
#> 1 bila1255 0 Flexible
#> 2 chig1238 0 Flexible
#> 3 fefe1239 0 Flexible
#> 4 gand1255 1 Rigid
#> 5 hehe1240 0 Flexible
#> 6 juku1254 0 Rigid
Ok, it looks like we only have 17 rows for this pairing. That’s a little small, but let’s plot the data:
p1 <- ggplot(df, aes(x=GB028)) + geom_bar()
p2 <- ggplot(df, aes(x=EA113)) + geom_bar()
p1 / p2 # patchwork stacks the two plots vertically
Hmm. It looks like societies that do not mark this distinction between inclusive and exclusive tend to be those with flexible social structure. Looking good for our hypothesis, but let’s test it formally to make sure we’re not seeing a chance pattern:
tab <- table(df$GB028, df$EA113)
tab
#>
#> Flexible Rigid
#> 0 9 4
#> 1 3 1
chisq.test(tab)
#> Warning in chisq.test(tab): Chi-squared approximation may be incorrect
#>
#> Pearson's Chi-squared test with Yates' continuity correction
#>
#> data: tab
#> X-squared = 4.986e-31, df = 1, p-value = 1
…and the probability of getting a pattern like this by chance is 1.0. So yes, this is just a chance result, and there is no evidence in these data that languages which distinguish inclusive from exclusive also tend to be those that have rigid social structures. There goes our big paper in Nature/Science.