library(repr)
options(repr.plot.width=50, repr.plot.height=35)
Statistical Typology
35th European Summer School in Logic, Language and Information
Leuven, August 2024
Gerhard Jäger, Tübingen University
Lecture 1
Typology
- subfield of linguistics
- studies and classifies languages according to their structural features (as opposed to genealogical classification)
- aims:
- delineate the diversity of human languages
- identify common properties of all (or most) languages, so-called universals
What is a possible human language? And what is a probable human language?
Statistical typology
- identify robust correlations between typological features of languages or between linguistic and non-linguistic properties of populations and their languages
- do so in a statistically sound way
Some history
August Wilhelm von Schlegel (1818)
<p>
- tripartite classification of languages
-
driven by morphological characteristics
- fusional languages
- agglutinative languages
- isolating languages
<li>later added: <ul> <li>polysynthetic languages</li> </ul> </li>
Schlegel’s morphological types
(src: https://opentextbc.ca/psyclanguage/) https://www.reddit.com/r/MapPorn/comments/46wdqb/morphological_typology_tendency_of_languages_in/#lightbox
Schlegel’s morphological types
(src: https://www.languagesoftheworld.info/historical-linguistics/more-on-word-order-morphological-types-and-historical-change.html)
Georg von der Gabelentz (1840-1893)
Georg von der Gabelentz (1840-1893)
‘Dürfte man ein ungeborenes Kind taufen, ich würde den Namen Typologie wählen’ (Gabelentz 1901:481)
[If one were permitted to christen an unborn child, I would choose the name typology].
Joseph Greenberg (1915-2001)
- 1963: Some Universals of Grammar with Particular Reference to the Order of Meaningful Elements
- founding document of modern typology
- genealogically and geographically diverse sample of 30 languages
- 45 universally or near-universally valid statements
<img src="_img/greenberg_map_modified.svg" width="1000">
library(tidyverse)
library(leaflet)
library(tmap)
library(sf)
library(svglite)
library(ggthemes)
library(brms)
library(rstan)
library(viridis)
── Attaching packages ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.1 ──
✔ ggplot2 3.3.3 ✔ purrr 0.3.4
✔ tibble 3.1.2 ✔ dplyr 1.0.6
✔ tidyr 1.1.3 ✔ stringr 1.4.0
✔ readr 1.4.0 ✔ forcats 0.5.1
── Conflicts ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
Linking to GEOS 3.9.1, GDAL 3.3.0, PROJ 8.0.0
Loading required package: Rcpp
Loading 'brms' package (version 2.14.4). Useful instructions
can be found by typing help('brms'). A more detailed introduction
to the package is available through vignette('brms_overview').
Attaching package: ‘brms’
The following object is masked from ‘package:stats’:
ar
Loading required package: StanHeaders
rstan (Version 2.21.2, GitRev: 2e1f913d3ca3)
For execution on a local, multicore CPU with excess RAM we recommend calling
options(mc.cores = parallel::detectCores()).
To avoid recompilation of unchanged Stan programs, we recommend calling
rstan_options(auto_write = TRUE)
Attaching package: ‘rstan’
The following object is masked from ‘package:tidyr’:
extract
Loading required package: viridisLite
<- read_csv("../data/greenberg30_glot.csv") %>%
greenberg30 select(Language, Latitude, Longitude) %>%
st_as_sf(coords = c("Longitude", "Latitude"))
── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
cols(
ID = col_character(),
Name = col_character(),
Macroarea = col_character(),
Latitude = col_double(),
Longitude = col_double(),
Glottocode = col_character(),
ISO639P3code = col_character(),
Level = col_character(),
Countries = col_character(),
Family_ID = col_character(),
Language_ID = col_character(),
Closest_ISO369P3code = col_character(),
First_Year_Of_Documentation = col_double(),
Last_Year_Of_Documentation = col_double(),
Is_Isolate = col_logical(),
Language = col_character()
)
.160e <- read_sf("../data/world_160e.gpkg") world
<- world.160e %>%
bg.map st_transform("+proj=eqearth lon_0=160") %>%
tm_shape() +
tm_fill()
st_crs(greenberg30) <- 4326
tmap_options(bg.color = "lightblue", legend.text.color = "black")
<- bg.map +
greenberg_map %>%
greenberg30 tm_shape() +
tm_symbols(
col="red",
style = "quantile",
size=2,
border.lwd=0.5
+
) tm_text("Language", col = "black", size = 2, xmod=0, ymod = -1, auto.placement=FALSE)
tmap_save(greenberg_map, filename = "_img/greenberg_map.svg", width = 21, height = 14)
The argument 'units' has been set to "in" since the specified width or height is less than or equal to 50. Specify units = "px" to change this.
Map saved to /mnt/c/Users/gerha/OneDrive - UT Cloud/shareAcrossMachines/_lehre/ws2425/esslli2024_statistical_typology/slides/_img/greenberg_map.svg
Size: 21 by 14 inches
Unconditional vs. conditional universals
(src: Evans & Levinson 2009)
Unconditional vs. conditional universals
- unconditional:
- 1: In declarative sentences with nominal subject and object, the dominant order is almost always one in which the subject precedes the object.
- 14: In conditional statements, the conditional clause precedes the conclusion as the normal order in all languages.
- 35: There is no language in which the plural does not have some nonzero allomorphs, whereas there are languages in which the singular is expressed only by zero. The dual and the trial are almost never expressed only by zero.
- …
- conditional:
- 2: In languages with prepositions, the genitive almost always follows the governing noun, while in languages with postpositions it almost always precedes it.
- 3: Languages with dominant VSO order are always prepositional.
- 4: With overwhelmingly greater than chance frequency, languages with normal S0V order are postpositional.
- 13: If the nominal object always precedes the verb, then verb forms subordinate to the main verb also precede it.
- …
Absolute vs. statistical universals
- absolute:
- 14: In conditional statements, the conditional clause precedes the conclusion as the normal order in all languages.
- 44: If a language has gender distinctions in die first person, it always has gender distinctions in the second or third person, or in both.
- 45: If there are any gender distinctions in the plural of the pronoun, there are some gender distinctions in the singular also.
- …
- statistical:
- 17: With overwhelmingly more than chance frequency, languages with dominant order VSO have the adjective after the noun.
- 18: When the descriptive adjective precedes the noun, the demonstrative and the numeral, with overwhelmingly more than chance frequency, do likewise.
- 41: If in a language the verb follows both the nominal subject and nominal object as the dominant order, the language almost always has a case system.
- …
Since Greenberg
- plethora of similar studies with larger language samples
- important emerging generalization:
- If XP is head initial, then YP is head initial.
- If XP is head final, then YP is head final.
- several possible explanations:
- simplifies parsing (Hawkins 1983)
- side effect of diachronic processes
- part of some innate Universal Grammar
- important emerging generalization:
- multitude of large-scale comparative databases
- World Atlas of Language Structures: https://wals.info/
- Autotyp: https://github.com/autotyp/autotyp-data
- Grambank: https://grambank.clld.org/
Bottom line: There are nore absolute universals beyond trivial ones, such as all spoken languages have vowels.
Why is this even interesting
- Language universals reflect
- constraints on information encoding in communication
- human cognitive capacities for production, processing and acquisition
- Important source of information about individual and social cognition
Methodological issues
- Languages are not independent samples
- Galton’s Problem
- If related languages share a trait, they could have inherited it from their common ancestor.
- If geographically close languages share a trait, it could be the result of language contact.
\(\Rightarrow\) Standard statistical tests are not applicable.
The following plots illustrate the spatially and genealogically biased distribution of some features from Grambank.
<- read_csv("../data/grambank-v1.0.3/grambank-grambank-7ae000c/cldf/languages.csv")
languages_gb <- read_csv("../data/grambank-v1.0.3/grambank-grambank-7ae000c/cldf/parameters.csv")
parameters_gb <- read_csv("../data/grambank-v1.0.3/grambank-grambank-7ae000c/cldf/values.csv") values_gb
── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
cols(
ID = col_character(),
Name = col_character(),
Macroarea = col_character(),
Latitude = col_double(),
Longitude = col_double(),
Glottocode = col_character(),
ISO639P3code = col_logical(),
provenance = col_character(),
Family_name = col_character(),
Family_level_ID = col_character(),
Language_level_ID = col_character(),
level = col_character(),
lineage = col_character()
)
── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
cols(
ID = col_character(),
Name = col_character(),
Description = col_character(),
ColumnSpec = col_logical(),
Patrons = col_character(),
Grambank_ID_desc = col_character(),
Boundness = col_double(),
Flexivity = col_double(),
Gender_or_Noun_Class = col_double(),
Locus_of_Marking = col_double(),
Word_Order = col_double(),
Informativity = col_character()
)
── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
cols(
ID = col_character(),
Language_ID = col_character(),
Parameter_ID = col_character(),
Value = col_character(),
Code_ID = col_character(),
Comment = col_character(),
Source = col_character(),
Source_comment = col_character(),
Coders = col_character()
)
library(repr)
options(repr.plot.width=20, repr.plot.height=14)
<- parameters_gb$ID %>% unique features_gb
<- sample(features_gb, 1)
f <- parameters_gb %>%
nm filter(ID == f) %>%
select(Name) %>% pull()
# Filter and join data
<- values_gb %>%
f_df filter(Parameter_ID == f) %>%
left_join(languages_gb, by = c("Language_ID" = "ID")) %>%
left_join(parameters_gb, by = c("Parameter_ID" = "ID")) %>%
select(Language_ID, Parameter_ID, Value, Name.x, Latitude, Longitude) %>%
drop_na()
# Convert to sf object
<- f_df %>%
f_sf st_as_sf(coords = c("Longitude", "Latitude"))
st_crs(f_sf) <- 4326
# Plot the map
<- bg.map +
f_map tm_shape(f_sf) +
tm_symbols(
col = "Value",
title.col = nm,
style = "quantile",
palette = "viridis",
size = 0.4,
border.lwd = 0.1
+
) tm_layout(
bg.color = "lightblue",
legend.outside=T,
legend.outside.position = "bottom",
legend.bg.color="grey",
legend.text.size = 10,
legend.title.size = 10
)tmap_save(f_map, filename = "_img/f_map.svg", width = 14, height = 9)
#print(nm)
Map saved to /mnt/c/Users/gerha/OneDrive - UT Cloud/shareAcrossMachines/_lehre/ws2425/esslli2024_statistical_typology/slides/_img/f_map.svg
Size: 14 by 9 inches
<- sample(features_gb, 1)
f <- parameters_gb %>%
nm filter(ID == f) %>%
select(Name) %>% pull()
print(nm)
<- c(
selected_families "Austronesian",
"Austro-Asiatic",
"Indo-European",
"Niger-Congo",
"Afro-Asiatic",
"Altaic",
"Arawakan",
"Trans-New Guinea",
"Pama-Nyungan"
)%>%
values_gb filter(Parameter_ID==f) %>%
select(Language_ID, Parameter_ID, Value) %>%
left_join(languages_gb, by=c("Language_ID" = "ID")) %>%
select(Value, Family_name) %>%
filter(Value != "?") %>%
mutate(Value = as.factor(Value)) %>%
group_by(Value, Family_name) %>%
summarize(count = n(), .groups = 'drop') %>%
group_by(Family_name) %>%
mutate(total = sum(count), relative_freq = count / total) %>%
select(Value, Family_name, relative_freq) %>%
pivot_wider(names_from = Value, values_from = relative_freq, values_fill = 0) %>%
pivot_longer(cols = -Family_name, names_to = "Value", values_to = "relative_freq") %>%
filter(Family_name %in% selected_families) %>%
ggplot(aes(x = "", y = relative_freq, fill = Value)) +
geom_bar(stat = "identity", width = 1) +
coord_polar("y") +
facet_wrap(~ Family_name) +
labs(title = paste0(nm, "\n\n"),
x = NULL, y = NULL) +
scale_fill_colorblind() + # Using a colorblind-friendly palette
theme_void() +
theme(
legend.position = "bottom",
text = element_text(size = 18), # Increase font size for all text elements
plot.title = element_text(size = 24), # Center and increase the title size
strip.text = element_text(size = 18), # Increase the facet label size
legend.text = element_text(size = 18), # Increase legend text size
legend.title = element_text(size = 18), # Increase legend title size
plot.margin = margin(t = 20, r = 20, b = 20, l = 20) # Adjust margins for the plot
-> f_families_pie
) ggsave(f_families_pie, filename="_img/f_families_pie.svg")
[1] "Are there several nouns (more than three) which are suppletive for number?"
Saving 6.67 x 6.67 in image
<- read_csv("../data/wals-v2020.3/cldf-datasets-wals-878ea47/cldf/languages.csv")
languages_wals <- read_csv("../data/wals-v2020.3/cldf-datasets-wals-878ea47/cldf/parameters.csv")
parameters_wals <- read_csv("../data/wals-v2020.3/cldf-datasets-wals-878ea47/cldf/values.csv")
values_wals <- read_csv("../data/wals-v2020.3/cldf-datasets-wals-878ea47/cldf/codes.csv") codes_wals
── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
cols(
ID = col_character(),
Name = col_character(),
Macroarea = col_character(),
Latitude = col_double(),
Longitude = col_double(),
Glottocode = col_character(),
ISO639P3code = col_character(),
Family = col_character(),
Subfamily = col_character(),
Genus = col_character(),
GenusIcon = col_logical(),
ISO_codes = col_character(),
Samples_100 = col_logical(),
Samples_200 = col_logical(),
Country_ID = col_character(),
Source = col_character(),
Parent_ID = col_character()
)
Warning message:
“625 parsing failures.
row col expected actual file
2664 GenusIcon 1/0/T/F/TRUE/FALSE cCC8C51 '../data/wals-v2020.3/cldf-datasets-wals-878ea47/cldf/languages.csv'
2666 GenusIcon 1/0/T/F/TRUE/FALSE cCC6851 '../data/wals-v2020.3/cldf-datasets-wals-878ea47/cldf/languages.csv'
2667 GenusIcon 1/0/T/F/TRUE/FALSE cCC7E51 '../data/wals-v2020.3/cldf-datasets-wals-878ea47/cldf/languages.csv'
2668 GenusIcon 1/0/T/F/TRUE/FALSE c8FCC51 '../data/wals-v2020.3/cldf-datasets-wals-878ea47/cldf/languages.csv'
2670 GenusIcon 1/0/T/F/TRUE/FALSE cCC8051 '../data/wals-v2020.3/cldf-datasets-wals-878ea47/cldf/languages.csv'
.... ......... .................. ....... ....................................................................
See problems(...) for more details.
”
── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
cols(
ID = col_character(),
Name = col_character(),
Description = col_logical(),
ColumnSpec = col_logical(),
Chapter_ID = col_double()
)
── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
cols(
ID = col_character(),
Language_ID = col_character(),
Parameter_ID = col_character(),
Value = col_double(),
Code_ID = col_character(),
Comment = col_character(),
Source = col_character(),
Example_ID = col_character()
)
── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
cols(
ID = col_character(),
Parameter_ID = col_character(),
Name = col_character(),
Description = col_character(),
Number = col_double(),
icon = col_character()
)
<- c(
selected_families "Austronesian",
"Austro-Asiatic",
"Indo-European",
"Niger-Congo",
"Afro-Asiatic",
"Altaic",
"Arawakan",
"Trans-New Guinea",
"Pama-Nyungan"
)
<- values_wals %>%
mwo_plot filter(Parameter_ID=="81A") %>%
select(Language_ID, Parameter_ID, Value) %>%
left_join(languages_wals, by=c("Language_ID" = "ID")) %>%
select(Language_ID, Parameter_ID, Value, Family) %>%
left_join(codes_wals, by=c("Parameter_ID" = "Parameter_ID", "Value" = "Number")) %>%
select(Name, Family) %>%
group_by(Name, Family) %>%
summarize(count = n(), .groups = 'drop') %>%
group_by(Family) %>%
mutate(total = sum(count), relative_freq = count / total) %>%
select(Name, Family, relative_freq) %>%
pivot_wider(names_from = Name, values_from = relative_freq, values_fill = 0) %>%
pivot_longer(cols = -Family, names_to = "Name", values_to = "relative_freq") %>%
filter(Family %in% selected_families) %>%
ggplot(aes(x = "", y = relative_freq, fill = Name)) +
geom_bar(stat = "identity", width = 1) +
coord_polar("y") +
facet_wrap(~ Family) +
labs(title = "Relative Frequencies of Dominant Orders by Family",
x = NULL, y = NULL) +
scale_fill_colorblind() + # Using a colorblind-friendly palette
theme_void() +
theme(
legend.position = "bottom",
text = element_text(size = 18), # Increase font size for all text elements
plot.title = element_text(hjust = 0.5, size = 18, margin = margin(b = 40)), # Center, increase size, and add space below title
strip.text = element_text(size = 18), # Increase the facet label size
legend.text = element_text(size = 18), # Increase legend text size
legend.title = element_text(size = 18) # Increase legend title size
)
ggsave(mwo_plot, filename = "_img/mwo_plot.svg", width = 10, height = 10)
<- values_wals %>%
mwo_macroarea_plot filter(Parameter_ID=="81A") %>%
select(Language_ID, Parameter_ID, Value) %>%
left_join(languages_wals, by=c("Language_ID" = "ID")) %>%
select(Language_ID, Parameter_ID, Value, Macroarea) %>%
left_join(codes_wals, by=c("Parameter_ID" = "Parameter_ID", "Value" = "Number")) %>%
select(Name, Macroarea) %>%
group_by(Name, Macroarea) %>%
summarize(count = n(), .groups = 'drop')%>%
group_by(Macroarea) %>%
mutate(total = sum(count), relative_freq = count / total) %>%
select(Name, Macroarea, relative_freq) %>%
pivot_wider(names_from = Name, values_from = relative_freq, values_fill = 0)%>%
pivot_longer(cols = -Macroarea, names_to = "Name", values_to = "relative_freq") %>%
ggplot(aes(x = "", y = relative_freq, fill = Name)) +
geom_bar(stat = "identity", width = 1) +
coord_polar("y") +
facet_wrap(~ Macroarea) +
labs(title = "Relative Frequencies of Dominant Orders by Macroarea",
x = NULL, y = NULL) +
scale_fill_colorblind() + # Using a colorblind-friendly palette
theme_void() +
theme(
legend.position = "bottom",
text = element_text(size = 18), # Increase font size for all text elements
plot.title = element_text(hjust = 0.5, size = 18, margin = margin(b = 40)),
strip.text = element_text(size = 18), # Increase the facet label size
legend.text = element_text(size = 18), # Increase legend text size
legend.title = element_text(size = 18) # Increase legend title size
)ggsave(mwo_macroarea_plot, filename = "_img/mwo_macroarea.svg", width = 10, height = 10)
Where do universals come from?
A popular approach to explain universals is functionalistic.
Grammars provide the most economical coding mechanism … for those speech functions which speakers most often need to perform. More succinctly: Grammars code best what speakers do most. (Du Bois 1985: 362-363)
- similar to evolution by natural selection in biology
- However, evolutionary biology provides a causal mechanism for apparently teleological patterns.
- The current distribution of values for some typological feature is the result of language change.
Under this perspective, a conditional universal
If a language has feature A, it also has feature B (with more than chance frequency)
could result, e.g., from
- languages with feature A are likely to develop feature B
- languages lacking feature B are likely to loose feature A
How to identify universals
naive approach
- Simply count different types. There should be much more languages conforming to the putative universal than those violating it.
- Example:
Greenberg’s Universal 1: In declarative sentences with nominal subject and object, the dominant order is almost always one in which the subject precedes the object.
Frequency in WALS:
Type | Frequency |
---|---|
SO | 1147 |
OS | 40 |
%>%
values_wals filter(Parameter_ID=="81A") %>%
select(Language_ID, Parameter_ID, Value) %>%
left_join(languages_wals, by=c("Language_ID" = "ID")) %>%
select(Language_ID, Parameter_ID, Value) %>%
left_join(codes_wals, by=c("Parameter_ID" = "Parameter_ID", "Value" = "Number")) %>%
select(Name) %>%
filter(Name != "No dominant order") %>%
mutate(SO = str_replace_all(Name, "V", "")) %>%
group_by(SO) %>%
summarize(count = n())
SO | count |
---|---|
<chr> | <int> |
OS | 40 |
SO | 1147 |
How to identify universals
This approach can be deceptive, especially when considering conditional universals.
Running example
Greenberg’s Universal 17: With overwhelmingly more than chance frequency, languages with dominant order VSO have the adjective after the noun.
<- values_wals %>%
universal17_wals filter(Parameter_ID %in% c("81A", "87A")) %>%
select(Language_ID, Parameter_ID, Value) %>%
left_join(languages_wals, by=c("Language_ID" = "ID")) %>%
select(Name, Parameter_ID, Value, Macroarea, Family, Latitude, Longitude) %>%
mutate(Language = Name, .keep="unused") %>%
left_join(codes_wals, by=c("Parameter_ID" = "Parameter_ID", "Value" = "Number")) %>%
select(Language, Parameter_ID, Name, Macroarea, Family, Latitude, Longitude) %>%
mutate(Value = Name, .keep="unused") %>%
pivot_wider(names_from = Parameter_ID, values_from = Value, values_fill = NA) %>%
drop_na() %>%
mutate(VSO = 1 * (`81A` == "VSO")) %>%
mutate(AN = 1 * (`87A` == "Noun-Adjective")) %>%
select(-`81A`, -`87A`) %>%
select(Language, Macroarea, Family, VSO, AN, Latitude, Longitude) %>%
st_as_sf(coords = c("Longitude", "Latitude"))
<- universal17_wals %>%
universal17_wals mutate(type = as.factor(1+c(2 * VSO + AN)))
levels(universal17_wals$type) <- c("-VSO, AN", "-VSO, NA", "VSO, AN", "VSO, NA")
st_crs(universal17_wals) <- 4326
<- bg.map +
universal17_map tm_shape(universal17_wals) +
tm_symbols(
col = "type",
style = "quantile",
palette = viridis(4, option = "viridis"),
size = 0.1,
border.lwd = 0.01,
legend.col.show = FALSE
+
) tm_layout(
bg.color = "lightblue",
legend.outside=T,
legend.outside.position = "right",
legend.bg.color="grey",
+
) tm_add_legend(
type = "fill",
labels = c("-VSO, AN", "-VSO, NA", "VSO, AN", "VSO, NA"),
col = viridis(4, option = "viridis"),
)tmap_save(universal17_map, filename = "_img/universal17_map.svg")
Map saved to /mnt/c/Users/gerha/OneDrive - UT Cloud/shareAcrossMachines/_lehre/ws2425/esslli2024_statistical_typology/slides/_img/universal17_map.svg
Size: 9.765879 by 5.01747 inches
%>%
universal17_wals st_set_geometry(NULL) %>%
group_by(type) %>%
summarize(count = n())
type | count |
---|---|
<fct> | <int> |
-VSO, AN | 380 |
-VSO, NA | 713 |
VSO, AN | 28 |
VSO, NA | 61 |
%>%
universal17_wals st_set_geometry(NULL) %>%
group_by(type) %>%
summarize(count = n()) %>%
ggplot(aes(x = "", y = count, fill = type)) +
geom_bar(stat = "identity", width = 1) +
coord_polar("y") +
labs(title = "Count in WALS",
x = NULL, y = NULL) +
scale_fill_colorblind() + # Using a colorblind-friendly palette
theme_void() +
theme(
legend.position = "bottom",
text = element_text(size = 18), # Increase font size for all text elements
plot.title = element_text(hjust = 0.5, size = 18), # Center and increase the title size
strip.text = element_text(size = 18), # Increase the facet label size
legend.text = element_text(size = 18), # Increase legend text size
legend.title = element_text(size = 18) # Increase legend title size
-> universal17_pie
) ggsave(universal17_pie, filename = "_img/universal17_pie.svg")
Saving 6.67 x 6.67 in image
%>%
universal17_wals st_set_geometry(NULL) %>%
group_by(type, Family) %>%
summarize(count = n(), .groups = 'drop') %>%
group_by(Family) %>%
mutate(total = sum(count), relative_freq = count / total) %>%
arrange(Family) %>%
select(type, Family, relative_freq) %>%
pivot_wider(names_from = type, values_from = relative_freq, values_fill = 0) %>%
pivot_longer(cols = -Family, names_to = "type", values_to = "relative_freq") %>%
filter(Family %in% selected_families) %>%
ggplot(aes(x = "", y = relative_freq, fill = type)) +
geom_bar(stat = "identity", width = 1) +
coord_polar("y") +
facet_wrap(~ Family) +
labs(title = "Relative Frequencies of type by Family",
x = NULL, y = NULL) +
scale_fill_colorblind() + # Using a colorblind-friendly palette
theme_void() +
theme(
legend.position = "bottom",
text = element_text(size = 18), # Increase font size for all text elements
plot.title = element_text(hjust = 0.5, size = 18, margin = margin(b = 40)),
strip.text = element_text(size = 18), # Increase the facet label size
legend.text = element_text(size = 18), # Increase legend text size
legend.title = element_text(size = 18) # Increase legend title size
-> universal17_family
) ggsave(universal17_family, filename = "_img/universal17_family.svg")
Saving 6.67 x 6.67 in image
%>%
universal17_wals st_set_geometry(NULL) %>%
group_by(type, Macroarea) %>%
summarize(count = n(), .groups = 'drop') %>%
group_by(Macroarea) %>%
mutate(total = sum(count), relative_freq = count / total) %>%
select(type, Macroarea, relative_freq) %>%
pivot_wider(names_from = type, values_from = relative_freq, values_fill = 0) %>%
pivot_longer(cols = -Macroarea, names_to = "type", values_to = "relative_freq") %>%
ggplot(aes(x = "", y = relative_freq, fill = type)) +
geom_bar(stat = "identity", width = 1) +
coord_polar("y") +
facet_wrap(~ Macroarea) +
labs(title = "Relative Frequencies of Dominant Orders by Macroarea",
x = NULL, y = NULL) +
scale_fill_colorblind() + # Using a colorblind-friendly palette
theme_void() +
theme(
legend.position = "bottom",
text = element_text(size = 18), # Increase font size for all text elements
plot.title = element_text(hjust = 0.5, size = 18, margin = margin(b = 40)),
strip.text = element_text(size = 18), # Increase the facet label size
legend.text = element_text(size = 18), # Increase legend text size
legend.title = element_text(size = 18) # Increase legend title size
-> universal17_macroarea
) ggsave(universal17_macroarea, filename = "_img/universal17_macroarea.svg")
Saving 6.67 x 6.67 in image
Statistical validation
A first stab at it: - There are underlying probabilities \(p(\mathrm{-VSO}, \mathrm{AN}), p(\mathrm{-VSO}, \mathrm{NA)}, p(\mathrm{VSO}, \mathrm{AN}), p(\mathrm{VSO}, \mathrm{NA})\). - Each language is a sample from this categorical distribution. - We test the null hypothesis \(p(\mathrm{NA}|\mathrm{VSO}) > p(\mathrm{AN}|\mathrm{VSO})\).
<- universal17_wals %>%
type_counts st_set_geometry(NULL) %>%
group_by(type) %>%
summarize(count = n())
# Calculate the probabilities
<- sum(type_counts$count)
total_count <- sum(type_counts$count[type_counts$type %in% c("VSO, AN", "VSO, NA")]) / total_count
p_VSO <- type_counts$count[type_counts$type == "VSO, NA"] / total_count
p_NA_and_VSO <- p_NA_and_VSO / p_VSO
p_NA_given_VSO
# Number of VSO observations
<- sum(type_counts$count[type_counts$type %in% c("VSO, AN", "VSO, NA")])
n_VSO
# Number of NA and VSO observations
<- type_counts$count[type_counts$type == "VSO, NA"]
n_NA_and_VSO
# Perform a binomial test
binom.test(x = n_NA_and_VSO, n = n_VSO, p = 0.5, alternative = "greater")
Exact binomial test
data: n_NA_and_VSO and n_VSO
number of successes = 61, number of trials = 89, p-value = 0.0003051
alternative hypothesis: true probability of success is greater than 0.5
95 percent confidence interval:
0.5949397 1.0000000
sample estimates:
probability of success
0.6853933
This indicates statistical support for Universal 17.
We can also do a Bayesian test.
<- universal17_wals %>% st_set_geometry(NULL)
d levels(d$type) <- c("VSO_AN", "VSO_NA", "Minus_VSO_AN", "Minus_VSO_NA")
<- list(
stan_data N = nrow(d),
K = length(unique(d$type)),
y = as.numeric(d$type)
)
<- "
stan_code data {
int<lower=1> N; // Number of observations
int<lower=1> K; // Number of categories
int<lower=1,upper=K> y[N]; // Observations
}
parameters {
simplex[K] theta; // Probabilities for each category
}
model {
// Dirichlet(1,1,1,1) prior
theta ~ dirichlet(rep_vector(1.0, K));
// Likelihood
y ~ categorical(theta);
}
generated quantities {
vector[K] theta_draw;
for (k in 1:K) {
theta_draw[k] = theta[k];
}
}
"
<- stan_model(model_code=stan_code) model_categorical
<- capture.output(
outout <- rstan::sampling(
fit_categorical
model_categorical,data=stan_data,
chains=4,
iter=2000,
) )
<- extract(fit_categorical, pars = "theta_draw", permuted = TRUE)$theta_draw
theta_samples as_tibble(theta_samples) %>%
set_names(levels(universal17_wals$type)) %>%
mutate("NA|VSO" = `VSO, NA`/(`VSO, NA` + `VSO, AN`)) %>%
{<- .$`NA|VSO`
na_vso_samples
# Calculate the 95% credible interval
<- quantile(na_vso_samples, 0.025)
ci_lower <- quantile(na_vso_samples, 0.975)
ci_upper
# Create the density plot
ggplot(data.frame(na_vso_samples), aes(x = na_vso_samples)) +
geom_density(fill = "skyblue", alpha = 0.5) +
geom_vline(xintercept = ci_lower, linetype = "dashed", color = "red", size = 1) +
geom_vline(xintercept = ci_upper, linetype = "dashed", color = "red", size = 1) +
labs(title = "P(NA|VSO) with 95% Credible Interval",
x = "NA|VSO Probability",
y = "Density") +
theme_minimal() +
theme(text = element_text(size = 16))
-> universal17_posterior
} ggsave(universal17_posterior, filename="_img/universal17_posterior.svg")
Warning message:
“The `x` argument of `as_tibble.matrix()` must have unique column names if `.name_repair` is omitted as of tibble 2.0.0.
Using compatibility `.name_repair`.
This warning is displayed once every 8 hours.
Call `lifecycle::last_warnings()` to see where this warning was generated.”
Saving 6.67 x 6.67 in image
This also seems to confirm Universal 17.