Statistical Typology

35th European Summer School in Logic, Language and Information

Leuven, August 2024

Gerhard Jäger, Tübingen University

Lecture 1

Typology

subfield of linguistics
studies and classifies languages according to their structural features (as opposed to genealogical classification)
aims:
- delineate the diversity of human languages
- identify common properties of all (or most) languages, so-called universals

What is a possible human language? And what is a probable human language?

Statistical typology

identify robust correlations between typological features of languages or between linguistic and non-linguistic properties of populations and their languages
do so in a statistically sound way

Some history

August Wilhelm von Schlegel (1818)

<p>

tripartite classification of languages

driven by morphological characteristics

fusional languages
agglutinative languages
isolating languages

    <li>later added:
        <ul>
            <li>polysynthetic languages</li>
        </ul>
</li>

August Schlegel

Schlegel’s morphological types

(src: https://opentextbc.ca/psyclanguage/) https://www.reddit.com/r/MapPorn/comments/46wdqb/morphological_typology_tendency_of_languages_in/#lightbox

Schlegel’s morphological types

(src: https://www.languagesoftheworld.info/historical-linguistics/more-on-word-order-morphological-types-and-historical-change.html)

Georg von der Gabelentz (1840-1893)

(Gabelentz 1901)

August Schlegel

Georg von der Gabelentz (1840-1893)

‘Dürfte man ein ungeborenes Kind taufen, ich würde den Namen Typologie wählen’ (Gabelentz 1901:481)

[If one were permitted to christen an unborn child, I would choose the name typology].

Joseph Greenberg (1915-2001)

1963: Some Universals of Grammar with Particular Reference to the Order of Meaningful Elements
founding document of modern typology
genealogically and geographically diverse sample of 30 languages
45 universally or near-universally valid statements

<img src="_img/greenberg_map_modified.svg" width="1000">

Joseph Greenberg

library(repr)
options(repr.plot.width=50, repr.plot.height=35)

library(tidyverse)
library(leaflet)
library(tmap)
library(sf)
library(svglite)
library(ggthemes)
library(brms)
library(rstan)
library(viridis)

── Attaching packages ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.3     ✔ purrr   0.3.4
✔ tibble  3.1.2     ✔ dplyr   1.0.6
✔ tidyr   1.1.3     ✔ stringr 1.4.0
✔ readr   1.4.0     ✔ forcats 0.5.1

── Conflicts ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Linking to GEOS 3.9.1, GDAL 3.3.0, PROJ 8.0.0

Loading required package: Rcpp

Loading 'brms' package (version 2.14.4). Useful instructions
can be found by typing help('brms'). A more detailed introduction
to the package is available through vignette('brms_overview').


Attaching package: ‘brms’


The following object is masked from ‘package:stats’:

    ar


Loading required package: StanHeaders

rstan (Version 2.21.2, GitRev: 2e1f913d3ca3)

For execution on a local, multicore CPU with excess RAM we recommend calling
options(mc.cores = parallel::detectCores()).
To avoid recompilation of unchanged Stan programs, we recommend calling
rstan_options(auto_write = TRUE)


Attaching package: ‘rstan’


The following object is masked from ‘package:tidyr’:

    extract


Loading required package: viridisLite

greenberg30 <- read_csv("../data/greenberg30_glot.csv") %>%
    select(Language, Latitude, Longitude) %>%
    st_as_sf(coords = c("Longitude", "Latitude"))


── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
cols(
  ID = col_character(),
  Name = col_character(),
  Macroarea = col_character(),
  Latitude = col_double(),
  Longitude = col_double(),
  Glottocode = col_character(),
  ISO639P3code = col_character(),
  Level = col_character(),
  Countries = col_character(),
  Family_ID = col_character(),
  Language_ID = col_character(),
  Closest_ISO369P3code = col_character(),
  First_Year_Of_Documentation = col_double(),
  Last_Year_Of_Documentation = col_double(),
  Is_Isolate = col_logical(),
  Language = col_character()
)

world.160e <- read_sf("../data/world_160e.gpkg")

bg.map <- world.160e %>% 
    st_transform("+proj=eqearth lon_0=160") %>%
    tm_shape() +
    tm_fill()

st_crs(greenberg30) <- 4326

tmap_options(bg.color = "lightblue", legend.text.color = "black")
greenberg_map <- bg.map +
    greenberg30 %>%
    tm_shape() +
    tm_symbols(
        col="red",
        style = "quantile",
        size=2,
        border.lwd=0.5
    ) +
    tm_text("Language", col = "black", size = 2, xmod=0, ymod = -1, auto.placement=FALSE)
tmap_save(greenberg_map, filename = "_img/greenberg_map.svg",  width = 21, height = 14)

The argument 'units' has been set to "in" since the specified width or height is less than or equal to 50. Specify units = "px" to change this.

Map saved to /mnt/c/Users/gerha/OneDrive - UT Cloud/shareAcrossMachines/_lehre/ws2425/esslli2024_statistical_typology/slides/_img/greenberg_map.svg

Size: 21 by 14 inches

Unconditional vs. conditional universals

(src: Evans & Levinson 2009)

Unconditional vs. conditional universals

unconditional:
- 1: In declarative sentences with nominal subject and object, the dominant order is almost always one in which the subject precedes the object.
- 14: In conditional statements, the conditional clause precedes the conclusion as the normal order in all languages.
- 35: There is no language in which the plural does not have some nonzero allomorphs, whereas there are languages in which the singular is expressed only by zero. The dual and the trial are almost never expressed only by zero.
- …
conditional:
- 2: In languages with prepositions, the genitive almost always follows the governing noun, while in languages with postpositions it almost always precedes it.
- 3: Languages with dominant VSO order are always prepositional.
- 4: With overwhelmingly greater than chance frequency, languages with normal S0V order are postpositional.
- 13: If the nominal object always precedes the verb, then verb forms subordinate to the main verb also precede it.
- …

Absolute vs. statistical universals

absolute:
- 14: In conditional statements, the conditional clause precedes the conclusion as the normal order in all languages.
- 44: If a language has gender distinctions in die first person, it always has gender distinctions in the second or third person, or in both.
- 45: If there are any gender distinctions in the plural of the pronoun, there are some gender distinctions in the singular also.
- …
statistical:
- 17: With overwhelmingly more than chance frequency, languages with dominant order VSO have the adjective after the noun.
- 18: When the descriptive adjective precedes the noun, the demonstrative and the numeral, with overwhelmingly more than chance frequency, do likewise.
- 41: If in a language the verb follows both the nominal subject and nominal object as the dominant order, the language almost always has a case system.
- …

Since Greenberg

plethora of similar studies with larger language samples
- important emerging generalization:
  - If XP is head initial, then YP is head initial.
  - If XP is head final, then YP is head final.
- several possible explanations:
  - simplifies parsing (Hawkins 1983)
  - side effect of diachronic processes
  - part of some innate Universal Grammar
multitude of large-scale comparative databases
- World Atlas of Language Structures: https://wals.info/
- Autotyp: https://github.com/autotyp/autotyp-data
- Grambank: https://grambank.clld.org/

link to paper

Bottom line: There are nore absolute universals beyond trivial ones, such as all spoken languages have vowels.

Why is this even interesting

Language universals reflect
- constraints on information encoding in communication
- human cognitive capacities for production, processing and acquisition
Important source of information about individual and social cognition

Methodological issues

Languages are not independent samples
Galton’s Problem
- If related languages share a trait, they could have inherited it from their common ancestor.
- If geographically close languages share a trait, it could be the result of language contact.

\(\Rightarrow\) Standard statistical tests are not applicable.

The following plots illustrate the spatially and genealogically biased distribution of some features from Grambank.

languages_gb <- read_csv("../data/grambank-v1.0.3/grambank-grambank-7ae000c/cldf/languages.csv")
parameters_gb <- read_csv("../data/grambank-v1.0.3/grambank-grambank-7ae000c/cldf/parameters.csv")
values_gb <- read_csv("../data/grambank-v1.0.3/grambank-grambank-7ae000c/cldf/values.csv")


── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
cols(
  ID = col_character(),
  Name = col_character(),
  Macroarea = col_character(),
  Latitude = col_double(),
  Longitude = col_double(),
  Glottocode = col_character(),
  ISO639P3code = col_logical(),
  provenance = col_character(),
  Family_name = col_character(),
  Family_level_ID = col_character(),
  Language_level_ID = col_character(),
  level = col_character(),
  lineage = col_character()
)



── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
cols(
  ID = col_character(),
  Name = col_character(),
  Description = col_character(),
  ColumnSpec = col_logical(),
  Patrons = col_character(),
  Grambank_ID_desc = col_character(),
  Boundness = col_double(),
  Flexivity = col_double(),
  Gender_or_Noun_Class = col_double(),
  Locus_of_Marking = col_double(),
  Word_Order = col_double(),
  Informativity = col_character()
)



── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
cols(
  ID = col_character(),
  Language_ID = col_character(),
  Parameter_ID = col_character(),
  Value = col_character(),
  Code_ID = col_character(),
  Comment = col_character(),
  Source = col_character(),
  Source_comment = col_character(),
  Coders = col_character()
)

library(repr)
options(repr.plot.width=20, repr.plot.height=14)

features_gb <- parameters_gb$ID %>% unique

f <- sample(features_gb, 1)
nm <- parameters_gb %>%
    filter(ID == f) %>%
    select(Name) %>% pull()

# Filter and join data
f_df <- values_gb %>%
    filter(Parameter_ID == f) %>%
    left_join(languages_gb, by = c("Language_ID" = "ID")) %>%
    left_join(parameters_gb, by = c("Parameter_ID" = "ID")) %>%
    select(Language_ID, Parameter_ID, Value, Name.x, Latitude, Longitude) %>%
    drop_na()


# Convert to sf object
f_sf <- f_df %>%
    st_as_sf(coords = c("Longitude", "Latitude"))
st_crs(f_sf) <- 4326
# Plot the map
f_map <- bg.map +
    tm_shape(f_sf) +
    tm_symbols(
        col = "Value",
        title.col = nm,
        style = "quantile",
        palette = "viridis",
        size = 0.4,
        border.lwd = 0.1
    ) + 
    tm_layout(
        bg.color = "lightblue",
        legend.outside=T,
        legend.outside.position = "bottom",
        legend.bg.color="grey",
        legend.text.size = 10, 
        legend.title.size = 10 
    )
tmap_save(f_map, filename = "_img/f_map.svg",  width = 14, height = 9)

#print(nm)

Map saved to /mnt/c/Users/gerha/OneDrive - UT Cloud/shareAcrossMachines/_lehre/ws2425/esslli2024_statistical_typology/slides/_img/f_map.svg

Size: 14 by 9 inches


f <- sample(features_gb, 1)
nm <- parameters_gb %>%
    filter(ID == f) %>%
    select(Name) %>% pull()

print(nm)

selected_families <- c(
    "Austronesian",
    "Austro-Asiatic",
    "Indo-European",
    "Niger-Congo",
    "Afro-Asiatic",
    "Altaic",
    "Arawakan",
    "Trans-New Guinea",
    "Pama-Nyungan"
)
values_gb %>%
    filter(Parameter_ID==f) %>%
    select(Language_ID, Parameter_ID, Value) %>%
    left_join(languages_gb, by=c("Language_ID" = "ID")) %>%
    select(Value, Family_name) %>%
    filter(Value != "?") %>%
    mutate(Value = as.factor(Value)) %>%
    group_by(Value, Family_name) %>%
    summarize(count = n(), .groups = 'drop') %>%
    group_by(Family_name) %>%
    mutate(total = sum(count), relative_freq = count / total) %>%
    select(Value, Family_name, relative_freq) %>%
    pivot_wider(names_from = Value, values_from = relative_freq, values_fill = 0) %>%
    pivot_longer(cols = -Family_name, names_to = "Value", values_to = "relative_freq") %>%
    filter(Family_name %in% selected_families) %>%
    ggplot(aes(x = "", y = relative_freq, fill = Value)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y") +
  facet_wrap(~ Family_name) +
  labs(title = paste0(nm, "\n\n"),
       x = NULL, y = NULL) +
  scale_fill_colorblind() +  # Using a colorblind-friendly palette
  theme_void() +
  theme(
    legend.position = "bottom",
    text = element_text(size = 18),  # Increase font size for all text elements
    plot.title = element_text(size = 24),  # Center and increase the title size
    strip.text = element_text(size = 18),  # Increase the facet label size
    legend.text = element_text(size = 18),  # Increase legend text size
    legend.title = element_text(size = 18),  # Increase legend title size
    plot.margin = margin(t = 20, r = 20, b = 20, l = 20)  # Adjust margins for the plot

  ) -> f_families_pie
ggsave(f_families_pie, filename="_img/f_families_pie.svg")

[1] "Are there several nouns (more than three) which are suppletive for number?"

Saving 6.67 x 6.67 in image

languages_wals <- read_csv("../data/wals-v2020.3/cldf-datasets-wals-878ea47/cldf/languages.csv")
parameters_wals <- read_csv("../data/wals-v2020.3/cldf-datasets-wals-878ea47/cldf/parameters.csv")
values_wals <- read_csv("../data/wals-v2020.3/cldf-datasets-wals-878ea47/cldf/values.csv")
codes_wals <-  read_csv("../data/wals-v2020.3/cldf-datasets-wals-878ea47/cldf/codes.csv")


── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
cols(
  ID = col_character(),
  Name = col_character(),
  Macroarea = col_character(),
  Latitude = col_double(),
  Longitude = col_double(),
  Glottocode = col_character(),
  ISO639P3code = col_character(),
  Family = col_character(),
  Subfamily = col_character(),
  Genus = col_character(),
  GenusIcon = col_logical(),
  ISO_codes = col_character(),
  Samples_100 = col_logical(),
  Samples_200 = col_logical(),
  Country_ID = col_character(),
  Source = col_character(),
  Parent_ID = col_character()
)


Warning message:
“625 parsing failures.
 row       col           expected  actual                                                                 file
2664 GenusIcon 1/0/T/F/TRUE/FALSE cCC8C51 '../data/wals-v2020.3/cldf-datasets-wals-878ea47/cldf/languages.csv'
2666 GenusIcon 1/0/T/F/TRUE/FALSE cCC6851 '../data/wals-v2020.3/cldf-datasets-wals-878ea47/cldf/languages.csv'
2667 GenusIcon 1/0/T/F/TRUE/FALSE cCC7E51 '../data/wals-v2020.3/cldf-datasets-wals-878ea47/cldf/languages.csv'
2668 GenusIcon 1/0/T/F/TRUE/FALSE c8FCC51 '../data/wals-v2020.3/cldf-datasets-wals-878ea47/cldf/languages.csv'
2670 GenusIcon 1/0/T/F/TRUE/FALSE cCC8051 '../data/wals-v2020.3/cldf-datasets-wals-878ea47/cldf/languages.csv'
.... ......... .................. ....... ....................................................................
See problems(...) for more details.
”

── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
cols(
  ID = col_character(),
  Name = col_character(),
  Description = col_logical(),
  ColumnSpec = col_logical(),
  Chapter_ID = col_double()
)



── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
cols(
  ID = col_character(),
  Language_ID = col_character(),
  Parameter_ID = col_character(),
  Value = col_double(),
  Code_ID = col_character(),
  Comment = col_character(),
  Source = col_character(),
  Example_ID = col_character()
)



── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
cols(
  ID = col_character(),
  Parameter_ID = col_character(),
  Name = col_character(),
  Description = col_character(),
  Number = col_double(),
  icon = col_character()
)

selected_families <- c(
    "Austronesian",
    "Austro-Asiatic",
    "Indo-European",
    "Niger-Congo",
    "Afro-Asiatic",
    "Altaic",
    "Arawakan",
    "Trans-New Guinea",
    "Pama-Nyungan"
)

mwo_plot <- values_wals %>%
    filter(Parameter_ID=="81A") %>%
    select(Language_ID, Parameter_ID, Value) %>%
    left_join(languages_wals, by=c("Language_ID" = "ID")) %>%
    select(Language_ID, Parameter_ID, Value, Family) %>%
    left_join(codes_wals, by=c("Parameter_ID" = "Parameter_ID", "Value" = "Number")) %>%
    select(Name, Family) %>%
    group_by(Name, Family) %>%
    summarize(count = n(), .groups = 'drop') %>%
    group_by(Family) %>%
    mutate(total = sum(count), relative_freq = count / total) %>%
    select(Name, Family, relative_freq) %>%
    pivot_wider(names_from = Name, values_from = relative_freq, values_fill = 0) %>%
    pivot_longer(cols = -Family, names_to = "Name", values_to = "relative_freq") %>%
    filter(Family %in% selected_families) %>%
    ggplot(aes(x = "", y = relative_freq, fill = Name)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y") +
  facet_wrap(~ Family) +
  labs(title = "Relative Frequencies of Dominant Orders by Family",
       x = NULL, y = NULL) +
  scale_fill_colorblind() +  # Using a colorblind-friendly palette
  theme_void() +
  theme(
    legend.position = "bottom",
    text = element_text(size = 18),  # Increase font size for all text elements
    plot.title = element_text(hjust = 0.5, size = 18, margin = margin(b = 40)),  # Center, increase size, and add space below title
    strip.text = element_text(size = 18),  # Increase the facet label size
    legend.text = element_text(size = 18),  # Increase legend text size
    legend.title = element_text(size = 18)  # Increase legend title size
  )

ggsave(mwo_plot, filename = "_img/mwo_plot.svg", width = 10, height = 10)

mwo_macroarea_plot <- values_wals %>%
    filter(Parameter_ID=="81A") %>%
    select(Language_ID, Parameter_ID, Value) %>%
    left_join(languages_wals, by=c("Language_ID" = "ID")) %>%
    select(Language_ID, Parameter_ID, Value, Macroarea)  %>%
    left_join(codes_wals, by=c("Parameter_ID" = "Parameter_ID", "Value" = "Number")) %>%
    select(Name, Macroarea) %>%
    group_by(Name, Macroarea) %>%
    summarize(count = n(), .groups = 'drop')%>%
    group_by(Macroarea) %>%
    mutate(total = sum(count), relative_freq = count / total) %>%
    select(Name, Macroarea, relative_freq) %>%
    pivot_wider(names_from = Name, values_from = relative_freq, values_fill = 0)%>%
    pivot_longer(cols = -Macroarea, names_to = "Name", values_to = "relative_freq") %>%
    ggplot(aes(x = "", y = relative_freq, fill = Name)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y") +
  facet_wrap(~ Macroarea) +
  labs(title = "Relative Frequencies of Dominant Orders by Macroarea",
       x = NULL, y = NULL) +
  scale_fill_colorblind() +  # Using a colorblind-friendly palette
  theme_void() +
  theme(
    legend.position = "bottom",
    text = element_text(size = 18),  # Increase font size for all text elements
    plot.title = element_text(hjust = 0.5, size = 18, margin = margin(b = 40)), 
    strip.text = element_text(size = 18),  # Increase the facet label size
    legend.text = element_text(size = 18),  # Increase legend text size
    legend.title = element_text(size = 18)  # Increase legend title size
  )
ggsave(mwo_macroarea_plot, filename = "_img/mwo_macroarea.svg", width = 10, height = 10)

Where do universals come from?

A popular approach to explain universals is functionalistic.

Grammars provide the most economical coding mechanism … for those speech functions which speakers most often need to perform. More succinctly: Grammars code best what speakers do most. (Du Bois 1985: 362-363)

similar to evolution by natural selection in biology
However, evolutionary biology provides a causal mechanism for apparently teleological patterns.
The current distribution of values for some typological feature is the result of language change.

Under this perspective, a conditional universal

If a language has feature A, it also has feature B (with more than chance frequency)

could result, e.g., from

languages with feature A are likely to develop feature B
languages lacking feature B are likely to loose feature A

How to identify universals

naive approach

Simply count different types. There should be much more languages conforming to the putative universal than those violating it.
Example:

Greenberg’s Universal 1: In declarative sentences with nominal subject and object, the dominant order is almost always one in which the subject precedes the object.

Frequency in WALS:

Type	Frequency
SO	1147
OS	40

values_wals %>%
    filter(Parameter_ID=="81A") %>%
    select(Language_ID, Parameter_ID, Value) %>%
    left_join(languages_wals, by=c("Language_ID" = "ID")) %>%
    select(Language_ID, Parameter_ID, Value)  %>%
    left_join(codes_wals, by=c("Parameter_ID" = "Parameter_ID", "Value" = "Number")) %>%
    select(Name) %>%
    filter(Name != "No dominant order") %>%
    mutate(SO = str_replace_all(Name, "V", "")) %>%
    group_by(SO) %>%
    summarize(count = n())

A tibble: 2 × 2
SO	count
<chr>	<int>
OS	40
SO	1147

How to identify universals

This approach can be deceptive, especially when considering conditional universals.

Running example

Greenberg’s Universal 17: With overwhelmingly more than chance frequency, languages with dominant order VSO have the adjective after the noun.

universal17_wals <- values_wals %>%
    filter(Parameter_ID %in% c("81A", "87A")) %>%
    select(Language_ID, Parameter_ID, Value) %>%
    left_join(languages_wals, by=c("Language_ID" = "ID")) %>%
    select(Name, Parameter_ID, Value, Macroarea, Family, Latitude, Longitude) %>%
    mutate(Language = Name, .keep="unused") %>%
    left_join(codes_wals, by=c("Parameter_ID" = "Parameter_ID", "Value" = "Number")) %>%
    select(Language, Parameter_ID, Name, Macroarea, Family, Latitude, Longitude) %>%
    mutate(Value = Name, .keep="unused") %>%
    pivot_wider(names_from = Parameter_ID, values_from = Value, values_fill = NA) %>%
    drop_na() %>%
    mutate(VSO = 1 * (`81A` == "VSO")) %>%
    mutate(AN = 1 * (`87A` == "Noun-Adjective")) %>%
    select(-`81A`, -`87A`) %>%
    select(Language, Macroarea, Family, VSO, AN, Latitude, Longitude) %>%
    st_as_sf(coords = c("Longitude", "Latitude"))

universal17_wals <- universal17_wals %>%
    mutate(type = as.factor(1+c(2 * VSO + AN)))
levels(universal17_wals$type) <- c("-VSO, AN", "-VSO, NA", "VSO, AN", "VSO, NA")
st_crs(universal17_wals) <- 4326

universal17_map <- bg.map +
    tm_shape(universal17_wals) +
    tm_symbols(
        col = "type",
        style = "quantile",
        palette = viridis(4, option = "viridis"),
        size = 0.1,
        border.lwd = 0.01,
        legend.col.show = FALSE
    ) +
    tm_layout(
        bg.color = "lightblue",
        legend.outside=T,
        legend.outside.position = "right",
        legend.bg.color="grey",

    ) +
    tm_add_legend(
        type = "fill",
        labels = c("-VSO, AN", "-VSO, NA", "VSO, AN", "VSO, NA"),
        col = viridis(4, option = "viridis"),
    )
tmap_save(universal17_map, filename = "_img/universal17_map.svg")

Map saved to /mnt/c/Users/gerha/OneDrive - UT Cloud/shareAcrossMachines/_lehre/ws2425/esslli2024_statistical_typology/slides/_img/universal17_map.svg

Size: 9.765879 by 5.01747 inches

universal17_wals %>%
    st_set_geometry(NULL) %>%
    group_by(type) %>%
    summarize(count = n())

A tibble: 4 × 2
type	count
<fct>	<int>
-VSO, AN	380
-VSO, NA	713
VSO, AN	28
VSO, NA	61

universal17_wals %>%
    st_set_geometry(NULL) %>%
    group_by(type) %>%
    summarize(count = n()) %>%
ggplot(aes(x = "", y = count, fill = type)) +
geom_bar(stat = "identity", width = 1) +
  coord_polar("y") +
  labs(title = "Count in WALS",
       x = NULL, y = NULL) +
  scale_fill_colorblind() +  # Using a colorblind-friendly palette
  theme_void() +
  theme(
    legend.position = "bottom",
    text = element_text(size = 18),  # Increase font size for all text elements
    plot.title = element_text(hjust = 0.5, size = 18),  # Center and increase the title size
    strip.text = element_text(size = 18),  # Increase the facet label size
    legend.text = element_text(size = 18),  # Increase legend text size
    legend.title = element_text(size = 18)  # Increase legend title size
  ) -> universal17_pie
ggsave(universal17_pie, filename = "_img/universal17_pie.svg")

Saving 6.67 x 6.67 in image

universal17_wals %>%
    st_set_geometry(NULL) %>%
    group_by(type, Family) %>%
    summarize(count = n(), .groups = 'drop') %>%
    group_by(Family) %>%
    mutate(total = sum(count), relative_freq = count / total) %>%
    arrange(Family) %>%
    select(type, Family, relative_freq) %>%
    pivot_wider(names_from = type, values_from = relative_freq, values_fill = 0) %>%
    pivot_longer(cols = -Family, names_to = "type", values_to = "relative_freq") %>%
    filter(Family %in% selected_families) %>%
    ggplot(aes(x = "", y = relative_freq, fill = type)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y") +
  facet_wrap(~ Family) +
  labs(title = "Relative Frequencies of type by Family",
       x = NULL, y = NULL) +
  scale_fill_colorblind() +  # Using a colorblind-friendly palette
  theme_void() +
  theme(
    legend.position = "bottom",
    text = element_text(size = 18),  # Increase font size for all text elements
    plot.title = element_text(hjust = 0.5, size = 18, margin = margin(b = 40)), 
    strip.text = element_text(size = 18),  # Increase the facet label size
    legend.text = element_text(size = 18),  # Increase legend text size
    legend.title = element_text(size = 18)  # Increase legend title size
  ) -> universal17_family
ggsave(universal17_family, filename = "_img/universal17_family.svg")

Saving 6.67 x 6.67 in image

universal17_wals %>%
    st_set_geometry(NULL) %>%
    group_by(type, Macroarea) %>%
    summarize(count = n(), .groups = 'drop') %>%
    group_by(Macroarea) %>%
    mutate(total = sum(count), relative_freq = count / total) %>%
    select(type, Macroarea, relative_freq) %>%
    pivot_wider(names_from = type, values_from = relative_freq, values_fill = 0) %>%
    pivot_longer(cols = -Macroarea, names_to = "type", values_to = "relative_freq") %>%
    ggplot(aes(x = "", y = relative_freq, fill = type)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y") +
  facet_wrap(~ Macroarea) +
  labs(title = "Relative Frequencies of Dominant Orders by Macroarea",
       x = NULL, y = NULL) +
  scale_fill_colorblind() +  # Using a colorblind-friendly palette
  theme_void() +
  theme(
    legend.position = "bottom",
    text = element_text(size = 18),  # Increase font size for all text elements
    plot.title = element_text(hjust = 0.5, size = 18, margin = margin(b = 40)), 
    strip.text = element_text(size = 18),  # Increase the facet label size
    legend.text = element_text(size = 18),  # Increase legend text size
    legend.title = element_text(size = 18)  # Increase legend title size
  ) -> universal17_macroarea
ggsave(universal17_macroarea, filename = "_img/universal17_macroarea.svg")

Saving 6.67 x 6.67 in image

Statistical validation

A first stab at it: - There are underlying probabilities \(p(\mathrm{-VSO}, \mathrm{AN}), p(\mathrm{-VSO}, \mathrm{NA)}, p(\mathrm{VSO}, \mathrm{AN}), p(\mathrm{VSO}, \mathrm{NA})\). - Each language is a sample from this categorical distribution. - We test the null hypothesis \(p(\mathrm{NA}|\mathrm{VSO}) > p(\mathrm{AN}|\mathrm{VSO})\).

type_counts <- universal17_wals %>%
  st_set_geometry(NULL) %>%
  group_by(type) %>%
  summarize(count = n())

# Calculate the probabilities
total_count <- sum(type_counts$count)
p_VSO <- sum(type_counts$count[type_counts$type %in% c("VSO, AN", "VSO, NA")]) / total_count
p_NA_and_VSO <- type_counts$count[type_counts$type == "VSO, NA"] / total_count
p_NA_given_VSO <- p_NA_and_VSO / p_VSO

# Number of VSO observations
n_VSO <- sum(type_counts$count[type_counts$type %in% c("VSO, AN", "VSO, NA")])

# Number of NA and VSO observations
n_NA_and_VSO <- type_counts$count[type_counts$type == "VSO, NA"]

# Perform a binomial test
binom.test(x = n_NA_and_VSO, n = n_VSO, p = 0.5, alternative = "greater")


    Exact binomial test

data:  n_NA_and_VSO and n_VSO
number of successes = 61, number of trials = 89, p-value = 0.0003051
alternative hypothesis: true probability of success is greater than 0.5
95 percent confidence interval:
 0.5949397 1.0000000
sample estimates:
probability of success 
             0.6853933

This indicates statistical support for Universal 17.

We can also do a Bayesian test.

d <- universal17_wals %>% st_set_geometry(NULL)
levels(d$type) <- c("VSO_AN", "VSO_NA", "Minus_VSO_AN", "Minus_VSO_NA")

stan_data <- list(
  N = nrow(d),
  K = length(unique(d$type)),
  y = as.numeric(d$type)
)

stan_code <- "
data {
  int<lower=1> N;  // Number of observations
  int<lower=1> K;  // Number of categories
  int<lower=1,upper=K> y[N];  // Observations
}
parameters {
  simplex[K] theta;  // Probabilities for each category
}
model {
  // Dirichlet(1,1,1,1) prior
  theta ~ dirichlet(rep_vector(1.0, K));
  
  // Likelihood
  y ~ categorical(theta);
}
generated quantities {
  vector[K] theta_draw;
  for (k in 1:K) {
    theta_draw[k] = theta[k];
  }
}

"

model_categorical <- stan_model(model_code=stan_code)

outout <- capture.output(
    fit_categorical <- rstan::sampling(
        model_categorical,
        data=stan_data,
        chains=4,
        iter=2000,
    )
)

theta_samples <- extract(fit_categorical, pars = "theta_draw", permuted = TRUE)$theta_draw
as_tibble(theta_samples) %>%
    set_names(levels(universal17_wals$type)) %>%
    mutate("NA|VSO" = `VSO, NA`/(`VSO, NA` + `VSO, AN`)) %>%
{
    na_vso_samples <- .$`NA|VSO`
    
    # Calculate the 95% credible interval
    ci_lower <- quantile(na_vso_samples, 0.025)
    ci_upper <- quantile(na_vso_samples, 0.975)
    
    # Create the density plot
    ggplot(data.frame(na_vso_samples), aes(x = na_vso_samples)) +
      geom_density(fill = "skyblue", alpha = 0.5) +
      geom_vline(xintercept = ci_lower, linetype = "dashed", color = "red", size = 1) +
      geom_vline(xintercept = ci_upper, linetype = "dashed", color = "red", size = 1) +
      labs(title = "P(NA|VSO) with 95% Credible Interval",
           x = "NA|VSO Probability",
           y = "Density") +
      theme_minimal() +
      theme(text = element_text(size = 16))
  } -> universal17_posterior
ggsave(universal17_posterior, filename="_img/universal17_posterior.svg")

Warning message:
“The `x` argument of `as_tibble.matrix()` must have unique column names if `.name_repair` is omitted as of tibble 2.0.0.
Using compatibility `.name_repair`.
This warning is displayed once every 8 hours.
Call `lifecycle::last_warnings()` to see where this warning was generated.”
Saving 6.67 x 6.67 in image

This also seems to confirm Universal 17.