35th European Summer School in Logic, Language and Information
Leuven, August 2024
Gerhard Jäger, Tübingen University
Lecture 2
Correlation between linguistic and non-linguistic traits
Another major focus of research in statistical typology is the correlation between linguistic and non-linguistic traits. The idea is that languages are shaped by the social and ecological environment in which they are spoken. This is a very old idea, going back to the 19th century, but it has been given a new lease of life by the availability of large-scale databases that allow us to test these ideas in a more systematic way.
A few examples from https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0070902
So why do we then get the correlation? It is unlikely that it is due to mere chance. But as has been pointed out by several of the bloggers cited above, cultural phenomena come in bundles, and typically, there is no very profound explanation behind this, rather the bundling is due to several features happening to take part in the same historical process of diffusion. To add to the examples already given elsewhere, consider the fact that all the nine full member countries of the International Cricket Council also have left-hand driving. (I am not counting the tenth member, the West Indies, which is not a country.) There is no universal principle connecting cricket and left-hand driving, rather both these phenomena happened to be part of British culture at a specific point in time and spread with the growth of the British empire.
Östen Dahl 2013, https://dlc.hypotheses.org/360
The problem of statistical independence
Both tests used in the previous lecture are based on the assumption that each language is an independent sample from the same distribution. This is patently false, due both to the genealogical relationships between languages and to their mutual dependencies arising from contact.
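A small simulation illustrates why this matters (a sketch with made-up data, not real typological counts): two binary traits evolve independently, but daughter languages inherit trait values from their proto-language, so a chi-squared test that treats languages as independent observations rejects the true null far more often than its nominal level.

```r
# Sketch: two binary traits evolve independently, but languages within
# a family inherit them from a common proto-language. A chi-squared
# test that treats the 200 languages as independent samples then
# rejects the null far more often than its nominal 5% level.
set.seed(123)
n_fam <- 20    # number of families
per_fam <- 10  # languages per family

pvals <- replicate(1000, {
  # independent ancestral values for traits x and y in each family
  anc_x <- rbinom(n_fam, 1, 0.5)
  anc_y <- rbinom(n_fam, 1, 0.5)
  # daughter languages copy the ancestral value, flipping it with prob. 0.1
  inherit <- function(anc) {
    vals <- rep(anc, each = per_fam)
    flips <- rbinom(n_fam * per_fam, 1, 0.1)
    ifelse(flips == 1, 1 - vals, vals)
  }
  x <- inherit(anc_x)
  y <- inherit(anc_y)
  suppressWarnings(chisq.test(table(x, y))$p.value)
})

mean(pvals < 0.05)  # well above the nominal 0.05
```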
The issue of creating a suitable sample of languages has been discussed intensely in the literature.
Two basic strategies:
Diversity Sampling: Select languages so that the range of existing variation is covered as faithfully as possible.
Probability Sampling: Select languages in such a way that stochastic independence is approximated as well as possible.
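A minimal sketch of the second strategy: draw one language per family at random. The toy data frame and its column names are hypothetical; real probability samples additionally balance macro-areas and control for contact.

```r
# Sketch: approximate a probability sample by drawing one language
# per family at random (toy data; column names are hypothetical).
set.seed(42)
d_toy <- data.frame(
  Language = c("A1", "A2", "A3", "B1", "B2", "C1"),
  Family   = c("FamA", "FamA", "FamA", "FamB", "FamB", "FamC")
)
prob_sample <- do.call(rbind, lapply(split(d_toy, d_toy$Family), function(fam) {
  fam[sample(nrow(fam), 1), ]  # one random language from this family
}))
prob_sample$Language
```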
WALS contains a 100-language probability sample. Let us use it to test Universal 17 (Greenberg 1963): "With overwhelmingly more than chance frequency, languages with dominant order VSO have the adjective after the noun."
[Map: Universal 17 in the WALS 100-language sample (_img/universal17_100_map.svg)]
universal17_100 %>%
  st_set_geometry(NULL) %>%
  group_by(type) %>%
  summarize(count = n()) %>%
  ggplot(aes(x = "", y = count, fill = type)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y") +
  labs(title = "Count in WALS-100 sample", x = NULL, y = NULL) +
  scale_fill_manual(values = cbbPalette) +
  theme_void() +
  theme(
    legend.position = "bottom",
    text = element_text(size = 18),                     # increase font size for all text elements
    plot.title = element_text(hjust = 0.5, size = 18),  # center and enlarge the title
    strip.text = element_text(size = 18),               # facet label size
    legend.text = element_text(size = 18),
    legend.title = element_text(size = 18)
  ) -> universal17_100_pie
ggsave(universal17_100_pie, filename = "_img/universal17_100_pie.svg")
No need to apply statistical tests here:
The sample clearly violates Universal 17.
Sample size is so small that not much can be inferred from this.
One big downside of sampling is that you have to throw away a lot of data. This is why many researchers prefer to use all available data and then correct for the non-independence of the data in the statistical analysis.
Hierarchical modeling
The effect of genealogical and geographic non-independence can be ameliorated by adding random effects for
- language families
- linguistic macro-areas
First I fit a baseline model with all data from WALS.
I am interested in the effect of VSO on adjective-noun order. Therefore I use VSO as independent and N_A as dependent variable.
As the dependent variable is binary, a logistic regression is called for.
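Concretely, the model estimates the log-odds of noun-adjective order as a linear function of the VSO indicator:

\[
\log \frac{P(\text{N_A} = 1)}{1 - P(\text{N_A} = 1)} = \beta_0 + \beta_1 \cdot \text{VSO}
\]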
# fit_1 <- brm(
#   N_A ~ VSO,
#   data = d,
#   family = bernoulli(),
#   save_pars = save_pars(all = TRUE)
# )
# saveRDS(fit_1, file = "fit_1.rds")
fit_1 <- readRDS("fit_1.rds")
summary(fit_1)
Family: bernoulli
Links: mu = logit
Formula: N_A ~ VSO
Data: d (Number of observations: 1182)
Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup samples = 4000
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept 0.63 0.07 0.50 0.76 1.00 3666 2681
VSO 0.15 0.23 -0.30 0.62 1.00 3662 2670
Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
# extract posterior samples
posterior_samples <- as.mcmc(fit_1)
# MCMC density plot for the VSO coefficient with 95% HPD interval
mcmc_areas(
  posterior_samples,
  pars = "b_VSO",
  prob = 0.95
) +
  ggtitle("MCMC Density Plot for VSO") +
  xlab("VSO Coefficient") -> sound_inventories_slope_hpd
ggsave(sound_inventories_slope_hpd, file = "_img/universal17_slope_hpd_1.svg")
The result seems to indicate that VSO has no credible impact on N_A.
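The coefficients are on the log-odds scale. To read them as probabilities, apply the inverse logit (plogis in R) to the posterior means from the summary above:

```r
# Posterior means from summary(fit_1), converted from log-odds to
# probabilities via the inverse logit:
plogis(0.63)         # P(N_A = 1) for non-VSO languages, about 0.65
plogis(0.63 + 0.15)  # P(N_A = 1) for VSO languages, about 0.69
```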
Now let us include random effects. We start with macro-areas.
# fit_2 <- brm(
#   N_A ~ VSO + (1 | Macroarea),
#   data = d,
#   family = bernoulli(),
#   save_pars = save_pars(all = TRUE)
# )
# saveRDS(fit_2, file = "fit_2.rds")
fit_2 <- readRDS("fit_2.rds")
summary(fit_2)
Family: bernoulli
Links: mu = logit
Formula: N_A ~ VSO + (1 | Macroarea)
Data: d (Number of observations: 1182)
Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup samples = 4000
Group-Level Effects:
~Macroarea (Number of levels: 6)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept) 1.04 0.40 0.52 2.09 1.00 824 1261
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept 0.54 0.43 -0.38 1.43 1.00 740 913
VSO 0.21 0.27 -0.30 0.74 1.00 1953 2221
Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
# extract posterior samples
posterior_samples <- as.mcmc(fit_2)
# MCMC density plot for the VSO coefficient with 95% HPD interval
mcmc_areas(
  posterior_samples,
  pars = "b_VSO",
  prob = 0.95
) +
  ggtitle("MCMC Density Plot for VSO") +
  xlab("VSO Coefficient") -> sound_inventories_slope_hpd
ggsave(sound_inventories_slope_hpd, file = "_img/universal17_slope_hpd_2.svg")
Language family as random effect:
# fit_3 <- brm(
#   N_A ~ VSO + (1 | Family),
#   data = d,
#   family = bernoulli(),
#   save_pars = save_pars(all = TRUE)
# )
# saveRDS(fit_3, file = "fit_3.rds")
fit_3 <- readRDS("fit_3.rds")
summary(fit_3)
Family: bernoulli
Links: mu = logit
Formula: N_A ~ VSO + (1 | Family)
Data: d (Number of observations: 1182)
Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup samples = 4000
Group-Level Effects:
~Family (Number of levels: 179)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept) 3.18 0.53 2.32 4.35 1.00 791 1558
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept 0.70 0.32 0.10 1.36 1.01 444 833
VSO -0.05 0.38 -0.78 0.70 1.00 3361 2900
Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
# extract posterior samples
posterior_samples <- as.mcmc(fit_3)
# MCMC density plot for the VSO coefficient with 95% HPD interval
mcmc_areas(
  posterior_samples,
  pars = "b_VSO",
  prob = 0.95
) +
  ggtitle("MCMC Density Plot for VSO") +
  xlab("VSO Coefficient") -> sound_inventories_slope_hpd
ggsave(sound_inventories_slope_hpd, file = "_img/universal17_slope_hpd_3.svg")
And now both random effects simultaneously.
# fit_4 <- brm(
#   N_A ~ VSO + (1 | Macroarea) + (1 | Family),
#   data = d,
#   family = bernoulli(),
#   save_pars = save_pars(all = TRUE),
#   control = list(adapt_delta = 0.99, max_treedepth = 15)
# )
# saveRDS(fit_4, "fit_4.rds")
fit_4 <- readRDS("fit_4.rds")
summary(fit_4)
Family: bernoulli
Links: mu = logit
Formula: N_A ~ VSO + (1 | Macroarea) + (1 | Family)
Data: d (Number of observations: 1182)
Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup samples = 4000
Group-Level Effects:
~Family (Number of levels: 179)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept) 3.05 0.55 2.14 4.24 1.01 884 1734
~Macroarea (Number of levels: 6)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept) 0.56 0.43 0.02 1.67 1.00 798 1652
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept 0.68 0.43 -0.18 1.58 1.00 1293 2182
VSO -0.06 0.38 -0.79 0.71 1.00 5485 3101
Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
# extract posterior samples
posterior_samples <- as.mcmc(fit_4)
# MCMC density plot for the VSO coefficient with 95% HPD interval
mcmc_areas(
  posterior_samples,
  pars = "b_VSO",
  prob = 0.95
) +
  ggtitle("MCMC Density Plot for VSO") +
  xlab("VSO Coefficient") -> sound_inventories_slope_hpd
ggsave(sound_inventories_slope_hpd, file = "_img/universal17_slope_hpd_4.svg")
Model comparison
I will use two techniques for model comparison:
Pareto-smoothed leave-one-out cross-validation
Bayes Factor
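The Bayes factor side of the comparison is computed via bridge sampling, which is why the models above were fitted with save_pars(all = TRUE). A sketch (it requires the fitted models from above and is computationally demanding):

```r
# Sketch: Bayes factor comparing the family-random-effect model (fit_3)
# against the baseline (fit_1) via bridge sampling.
bf_31 <- brms::bayes_factor(fit_3, fit_1)
print(bf_31)
```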
# compute LOO for each model
loo_1 <- loo(fit_1)
loo_2 <- loo(fit_2)
loo_3 <- loo(fit_3)
loo_4 <- loo(fit_4)
Warning message:
“Found 52 observations with a pareto_k > 0.7 in model 'fit_3'. It is recommended to set 'moment_match = TRUE' in order to perform moment matching for problematic observations. ”
Warning message:
“Found 44 observations with a pareto_k > 0.7 in model 'fit_4'. It is recommended to set 'moment_match = TRUE' in order to perform moment matching for problematic observations. ”
# compare the models
loo_comparison <- loo_compare(loo_1, loo_2, loo_3, loo_4)
# print the comparison
print(loo_comparison)
The Bayes factor results are qualitatively identical to the PSIS-LOO model comparison.
Let us check whether a random slope for families makes a difference.
# fit_5 <- brm(
#   N_A ~ VSO + (VSO | Family),
#   data = d,
#   family = bernoulli(),
#   save_pars = save_pars(all = TRUE)
# )
# saveRDS(fit_5, file = "fit_5.rds")
fit_5 <- readRDS("fit_5.rds")
summary(fit_5)
Family: bernoulli
Links: mu = logit
Formula: N_A ~ VSO + (VSO | Family)
Data: d (Number of observations: 1182)
Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup samples = 4000
Group-Level Effects:
~Family (Number of levels: 179)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept) 3.31 0.58 2.35 4.60 1.00 874 1554
sd(VSO) 2.89 1.40 1.11 6.47 1.00 1418 1641
cor(Intercept,VSO) -0.10 0.38 -0.80 0.62 1.00 2598 2595
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept 0.69 0.33 0.08 1.38 1.01 903 1504
VSO 0.65 1.02 -1.36 2.92 1.00 1660 1258
Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
# extract posterior samples
posterior_samples <- as.mcmc(fit_5)
# MCMC density plot for the VSO coefficient with 95% HPD interval
mcmc_areas(
  posterior_samples,
  pars = "b_VSO",
  prob = 0.95
) +
  ggtitle("MCMC Density Plot for VSO") +
  xlab("VSO Coefficient") -> sound_inventories_slope_hpd
ggsave(sound_inventories_slope_hpd, file = "_img/universal17_slope_hpd_5.svg")
Warning message:
“Found 58 observations with a pareto_k > 0.7 in model 'fit_5'. It is recommended to set 'moment_match = TRUE' in order to perform moment matching for problematic observations. ”
Including a random slope for families clearly improves the model.
Now suppose we discover a new language family. What is the posterior distribution of the conditional probability \(P(\text{N_A}|\text{VSO})\) within this family?
new_data <- data.frame(
  VSO = 1,     # set VSO to 1
  Family = NA  # placeholder for the new family
)
# make predictions:
# posterior_epred gives the expected value (mean) of the posterior distribution
posterior_predictions <- posterior_epred(fit_5, newdata = new_data, re_formula = ~ (VSO | Family))
posterior_tibble <- tibble(posterior = posterior_predictions[, 1])
summary(posterior_predictions)
V1
Min. :0.0000
1st Qu.:0.2074
Median :0.7950
Mean :0.6214
3rd Qu.:0.9828
Max. :1.0000
mcmc_areas(
  posterior_tibble,
  pars = "posterior",
  prob = 0.95
) +
  ggtitle("95% Highest Posterior Density Interval for P(N_A = 1 | VSO = 1)") -> universal17_hpd
ggsave(universal17_hpd, filename = "_img/universal17_hpd.svg")
So in an unknown family, we expect VSO languages to have either a strong noun-adjective or a strong adjective-noun bias.
Now let us check what we can expect if we sample an unknown VSO language from an unknown family.
So, this modeling approach lends support to Universal 17, if we interpret "with overwhelmingly more than chance frequency" as \(>0.5\).
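Under that reading, the degree of support can be quantified as the share of posterior draws above 0.5. A sketch; `draws` stands in for posterior_predictions[, 1] computed above and is simulated here purely for illustration:

```r
# Sketch: posterior probability that P(N_A = 1 | VSO = 1) exceeds 0.5.
# `draws` is a stand-in for posterior_predictions[, 1] from above,
# simulated here purely for illustration.
set.seed(7)
draws <- rbeta(4000, 1.6, 1.0)
mean(draws > 0.5)  # share of posterior mass above chance level
```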
Let us check what the model predicts for non-VSO languages.
new_data <- data.frame(
  VSO = 0,     # set VSO to 0 (non-VSO)
  Family = NA  # placeholder for the new family
)
samples <- posterior_predict(fit_5, newdata = new_data, re_formula = ~ (VSO | Family))
mean(samples)
0.62425
This is virtually identical to the estimate for VSO languages. So if we interpret "with overwhelmingly more than chance frequency" as "more than what you would expect if you knew nothing about the language", then the model does not support Universal 17.
Of course, this is also the interpretation we get directly from the regression analysis, since the credible interval of the slope of the main effect includes 0.
summary(fit_5)
Family: bernoulli
Links: mu = logit
Formula: N_A ~ VSO + (VSO | Family)
Data: d (Number of observations: 1182)
Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup samples = 4000
Group-Level Effects:
~Family (Number of levels: 179)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept) 3.31 0.58 2.35 4.60 1.00 874 1554
sd(VSO) 2.89 1.40 1.11 6.47 1.00 1418 1641
cor(Intercept,VSO) -0.10 0.38 -0.80 0.62 1.00 2598 2595
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept 0.69 0.33 0.08 1.38 1.01 903 1504
VSO 0.65 1.02 -1.36 2.92 1.00 1660 1258
Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
[Map: sound inventory sizes (_img/sound_inventory_map.svg)]
[Map: log population sizes (_img/logpop_map.svg)]
ggplot(sound_inventory_population, aes(x = logpop, y = nSegments)) +
  geom_point(aes(color = Macroarea), alpha = 0.6) +         # scatter plot, colored by macro-area
  geom_smooth(method = "lm", se = TRUE, color = "black") +  # trend line for the entire data frame
  scale_y_log10() +
  geom_smooth(aes(group = Macroarea, color = Macroarea),
              method = "lm", se = FALSE) +                  # trend lines for each macro-area
  labs(
    title = "Scatter plot of n_segments vs logpop",
    x = "Population (log)",
    y = "Number of Segments",
    color = "Macroarea"
  ) +
  theme_minimal() -> sound_pop_scatter
ggsave(sound_pop_scatter, file = "_img/sound_pop_scatter.svg")
sound_pop_median_scatter <- ggplot(medians, aes(x = median_logpop, y = median_n_segments)) +
  scale_y_log10() +
  geom_point(aes(color = Macroarea), size = 4) +   # scatter plot, colored by macro-area
  geom_text(aes(label = Macroarea), vjust = "inward", hjust = "inward",
            check_overlap = TRUE) +                # label each macro-area
  geom_smooth(method = "lm", se = TRUE, color = "black",
              fill = "lightgray") +                # trend line with uncertainty interval
  labs(
    title = "Scatter plot of medians by Macroarea",
    x = "Median Log Population",
    y = "Median Number of Segments",
    color = "Macroarea"
  ) +
  theme_minimal()
ggsave(sound_pop_median_scatter, file = "_img/sound_pop_median_scatter.svg")
The correlation seems to be mostly between macroareas.
posterior_samples <- as.mcmc(fit_c_4)
mcmc_areas(
  posterior_samples,
  pars = "b_logpop",
  prob = 0.95
) +
  ggtitle("MCMC Density Plot for logpop") +
  xlab("logpop Coefficient") -> sound_inventories_slope_hpd_4
ggsave(sound_inventories_slope_hpd_4, file = "_img/sound_inventories_slope_hpd_4.svg")