Statistical Typology

35th European Summer School in Logic, Language and Information

Leuven, August 2024

Gerhard Jäger, Tübingen University

Lecture 2

Correlation between linguistic and non-linguistic traits

Another major focus of research in statistical typology is the correlation between linguistic and non-linguistic traits. The idea is that languages are shaped by the social and ecological environment in which they are spoken. This is a very old idea, going back to the 19th century, but it has been given a new lease of life by the availability of large-scale databases that allow us to test these ideas in a more systematic way.

A few examples from https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0070902

So why do we then get the correlation? It is unlikely that it is due to mere chance. But as has been pointed out by several of the bloggers cited above, cultural phenomena come in bundles, and typically, there is no very profound explanation behind this, rather the bundling is due to several features happening to take part in the same historical process of diffusion. To add to the examples already given elsewhere, consider the fact that all the nine full member countries of the International Cricket Council also have left-hand driving. (I am not counting the tenth member, the West Indies, which is not a country.) There is no universal principle connecting cricket and left-hand driving, rather both these phenomena happened to be part of British culture at a specific point in time and spread with the growth of the British empire.

Östen Dahl 2013, https://dlc.hypotheses.org/360

The problem of statistical independence

  • Both tests used in the previous lecture are based on the assumption that each language is an independent sample from the same distribution. This is patently false, both because of the genealogical relationships between languages and because of their mutual dependence due to contact.
  • The issue of creating a suitable sample of languages has been discussed intensely in the literature.
  • Two basic strategies:
    • Diversity Sampling: Select languages so that the range of existing variation is covered as faithfully as possible.
    • Probability Sampling: Select languages in such a way that stochastic independence is approximated as well as possible.
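As a toy illustration of the probability-sampling idea, one can draw a single language per family. This is only a sketch with made-up data (the `langs` tibble is hypothetical); the real samples discussed below are constructed with far more care.

```r
library(dplyr)

# Toy data: five languages from two families (hypothetical)
langs <- tibble::tibble(
  Language = c("Irish", "Welsh", "English", "Maori", "Hawaiian"),
  Family   = c("Indo-European", "Indo-European", "Indo-European",
               "Austronesian", "Austronesian")
)

# One randomly drawn language per family -- a crude way to reduce
# genealogical non-independence
set.seed(1)
sample_one_per_family <- langs %>%
  group_by(Family) %>%
  slice_sample(n = 1) %>%
  ungroup()

nrow(sample_one_per_family)  # one row per family, i.e. 2
```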

WALS contains a 100-language probability sample. Let us use it to test Universal 17 ("With overwhelmingly more than chance frequency, languages with dominant order VSO have the adjective after the noun").

library(repr)
options(repr.plot.width=30, repr.plot.height=21)
     
library(tidyverse)
library(leaflet)
library(tmap)
library(sf)
library(svglite)
library(ggthemes)
library(viridis)
library(brms)
library(loo)
library(bridgesampling)
library(coda)
library(bayesplot)
library(RColorBrewer)
languages_wals <- read_csv("../data/wals-v2020.3/cldf-datasets-wals-878ea47/cldf/languages.csv")
parameters_wals <- read_csv("../data/wals-v2020.3/cldf-datasets-wals-878ea47/cldf/parameters.csv")
values_wals <- read_csv("../data/wals-v2020.3/cldf-datasets-wals-878ea47/cldf/values.csv")
codes_wals <-  read_csv("../data/wals-v2020.3/cldf-datasets-wals-878ea47/cldf/codes.csv")

world.160e <- read_sf("../data/world_160e.gpkg")
bg.map <- world.160e %>% 
    st_transform("+proj=eqearth lon_0=160") %>%
    tm_shape() +
    tm_fill()
universal17_100 <- values_wals %>%
    # 81A: order of subject, object and verb; 87A: order of adjective and noun
    filter(Parameter_ID %in% c("81A", "87A")) %>%
    select(Language_ID, Parameter_ID, Value) %>%
    left_join(languages_wals, by=c("Language_ID" = "ID")) %>%
    # restrict to the 100-language probability sample
    filter(Samples_100 == TRUE) %>%
    select(Name, Parameter_ID, Value, Macroarea, Family, Latitude, Longitude) %>%
    mutate(Language = Name, .keep="unused") %>%
    # map numeric feature values to their code names
    left_join(codes_wals, by=c("Parameter_ID" = "Parameter_ID", "Value" = "Number")) %>%
    select(Language, Parameter_ID, Name, Macroarea, Family, Latitude, Longitude) %>%
    mutate(Value = Name, .keep="unused") %>%
    pivot_wider(names_from = Parameter_ID, values_from = Value, values_fill = NA) %>%
    # keep only languages with values for both features
    drop_na() %>%
    mutate(VSO = 1 * (`81A` == "VSO")) %>%
    # AN = 1 iff the adjective follows the noun (noun-adjective order)
    mutate(AN = 1 * (`87A` == "Noun-Adjective")) %>%
    select(-`81A`, -`87A`) %>%
    select(Language, Macroarea, Family, VSO, AN, Latitude, Longitude) %>%
    st_as_sf(coords = c("Longitude", "Latitude"))
universal17_100 <- universal17_100 %>%
    mutate(type = as.factor(1+c(2 * VSO + AN)))
levels(universal17_100$type) <- c("-VSO, AN", "-VSO, NA", "VSO, AN", "VSO, NA")
st_crs(universal17_100) <- 4326
cbbPalette <- c("#E69F00", "#009E73", "#F0E442","#CC79A7")
universal17_100_map <- bg.map +
    tm_shape(universal17_100) +
    tm_symbols(
        col = "type",
        palette = cbbPalette,
        size = 0.5,
        border.lwd = 0.2,
        border.col = 'black',
        alpha = 0.7,
        legend.col.show = TRUE
    ) +
    tm_layout(
        bg.color = "lightblue",
        legend.outside=T,
        legend.outside.position = "right",
        legend.bg.color="grey",

    )
    
tmap_save(universal17_100_map, "_img/universal17_100_map.svg")
Map saved to /mnt/c/Users/gerha/OneDrive - UT Cloud/shareAcrossMachines/_lehre/ws2425/esslli2024_statistical_typology/slides/_img/universal17_100_map.svg

Size: 9.765879 by 5.01747 inches

universal17_100 %>%
    st_set_geometry(NULL) %>%
    group_by(type) %>%
    summarize(count = n()) 
A tibble: 4 × 2
  type      count
  <fct>     <int>
  -VSO, AN     33
  -VSO, NA     50
  VSO, AN       4
  VSO, NA       4
universal17_100 %>%
    st_set_geometry(NULL) %>%
    group_by(type) %>%
    summarize(count = n()) %>%
ggplot(aes(x = "", y = count, fill = type)) +
geom_bar(stat = "identity", width = 1) +
  coord_polar("y") +
  labs(title = "Count in WALS-100 sample",
       x = NULL, y = NULL) +
  scale_fill_manual(values = cbbPalette) +
  theme_void() +
  theme(
    legend.position = "bottom",
    text = element_text(size = 18),  # Increase font size for all text elements
    plot.title = element_text(hjust = 0.5, size = 18),  # Center and increase the title size
    strip.text = element_text(size = 18),  # Increase the facet label size
    legend.text = element_text(size = 18),  # Increase legend text size
    legend.title = element_text(size = 18)  # Increase legend title size
  ) -> universal17_100_pie
ggsave(universal17_100_pie, filename="_img/universal17_100_pie.svg")
Saving 6.67 x 6.67 in image

  • No need to apply statistical tests here:
    • The sample clearly violates Universal 17.
    • The sample size is so small that not much can be inferred from it.

One big downside of sampling is that you have to throw away a lot of data. This is why many researchers prefer to use all available data and then correct for the non-independence of the data in the statistical analysis.

Hierarchical modeling

The effect of genealogical and geographic non-independence can be ameliorated by adding random effects for

  • language families
  • linguistic macro-areas
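Written out, the model with both grouping factors is a Bernoulli regression with a random intercept for the family \(f(i)\) and the macro-area \(m(i)\) of language \(i\):

```latex
\begin{aligned}
\text{N\_A}_i &\sim \mathrm{Bernoulli}(p_i) \\
\operatorname{logit}(p_i) &= \beta_0 + \beta_1\,\text{VSO}_i + u_{f(i)} + v_{m(i)} \\
u_f \sim \mathcal{N}(0, \sigma_u^2), &\qquad v_m \sim \mathcal{N}(0, \sigma_v^2)
\end{aligned}
```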

universal17_wals <- values_wals %>%
    filter(Parameter_ID %in% c("81A", "87A")) %>%
    select(Language_ID, Parameter_ID, Value) %>%
    left_join(languages_wals, by=c("Language_ID" = "ID")) %>%
    select(Name, Parameter_ID, Value, Macroarea, Family, Latitude, Longitude) %>%
    mutate(Language = Name, .keep="unused") %>%
    left_join(codes_wals, by=c("Parameter_ID" = "Parameter_ID", "Value" = "Number")) %>%
    select(Language, Parameter_ID, Name, Macroarea, Family, Latitude, Longitude) %>%
    mutate(Value = Name, .keep="unused") %>%
    pivot_wider(names_from = Parameter_ID, values_from = Value, values_fill = NA) %>%
    drop_na() %>%
    mutate(VSO = 1 * (`81A` == "VSO")) %>%
    mutate(N_A = 1 * (`87A` == "Noun-Adjective")) %>%
    select(-`81A`, -`87A`) %>%
    select(Language, Macroarea, Family, VSO, N_A, Latitude, Longitude) %>%
    st_as_sf(coords = c("Longitude", "Latitude"))
d <- universal17_wals %>%
    st_set_geometry(NULL) %>%
    mutate(Macroarea = as.factor(Macroarea)) %>%
    mutate(Family = as.factor(Family)) %>%
    select(VSO, N_A, Family, Macroarea)

First I fit a baseline model with all data from WALS.

I am interested in the effect of VSO on adjective-noun order. Therefore I use VSO as independent and N_A as dependent variable.

As the dependent variable is binary, a logistic regression is called for.

# fit_1 <- brm(N_A ~ VSO , data = d, family = bernoulli(), save_pars = save_pars(all = TRUE))
# saveRDS(fit_1, file = "fit_1.rds")
fit_1 <- readRDS("fit_1.rds")
summary(fit_1)
 Family: bernoulli 
  Links: mu = logit 
Formula: N_A ~ VSO 
   Data: d (Number of observations: 1182) 
Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup samples = 4000

Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept     0.63      0.07     0.50     0.76 1.00     3666     2681
VSO           0.15      0.23    -0.30     0.62 1.00     3662     2670

Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
# Extract posterior samples of the regression coefficients
posterior_samples <- as.mcmc(fit_1)
# Density plot of the VSO coefficient with its 95% HPD interval
mcmc_areas(
  posterior_samples, 
  pars = "b_VSO", 
  prob = 0.95  # 95% HPD interval
) +
  ggtitle("MCMC Density Plot for VSO") +
  xlab("VSO Coefficient") -> universal17_slope_hpd
ggsave(universal17_slope_hpd, file = "_img/universal17_slope_hpd_1.svg")
Saving 6.67 x 6.67 in image

The result seems to indicate that VSO has no credible impact on N_A.

Now let us include random effects. We start with macro-areas.


#     fit_2 <- brm(
#         N_A ~ VSO + (1|Macroarea),
#         data = d,
#         family = bernoulli(),
#         save_pars = save_pars(all = TRUE)
#     )
# saveRDS(fit_2, file = "fit_2.rds")
fit_2 <- readRDS("fit_2.rds")
summary(fit_2)
 Family: bernoulli 
  Links: mu = logit 
Formula: N_A ~ VSO + (1 | Macroarea) 
   Data: d (Number of observations: 1182) 
Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup samples = 4000

Group-Level Effects: 
~Macroarea (Number of levels: 6) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     1.04      0.40     0.52     2.09 1.00      824     1261

Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept     0.54      0.43    -0.38     1.43 1.00      740      913
VSO           0.21      0.27    -0.30     0.74 1.00     1953     2221

Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
# Extract posterior samples of the regression coefficients
posterior_samples <- as.mcmc(fit_2)
# Density plot of the VSO coefficient with its 95% HPD interval
mcmc_areas(
  posterior_samples, 
  pars = "b_VSO", 
  prob = 0.95  # 95% HPD interval
) +
  ggtitle("MCMC Density Plot for VSO") +
  xlab("VSO Coefficient") -> universal17_slope_hpd
ggsave(universal17_slope_hpd, file = "_img/universal17_slope_hpd_2.svg")
Saving 6.67 x 6.67 in image

Language family as random effect:

# fit_3 <- brm(
#     N_A ~ VSO + (1|Family),
#     data = d,
#     family = bernoulli(),
#     save_pars = save_pars(all = TRUE)
# )
# saveRDS(fit_3, file="fit_3.rds")
fit_3 <- readRDS("fit_3.rds")
summary(fit_3)
 Family: bernoulli 
  Links: mu = logit 
Formula: N_A ~ VSO + (1 | Family) 
   Data: d (Number of observations: 1182) 
Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup samples = 4000

Group-Level Effects: 
~Family (Number of levels: 179) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     3.18      0.53     2.32     4.35 1.00      791     1558

Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept     0.70      0.32     0.10     1.36 1.01      444      833
VSO          -0.05      0.38    -0.78     0.70 1.00     3361     2900

Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
# Extract posterior samples of the regression coefficients
posterior_samples <- as.mcmc(fit_3)
# Density plot of the VSO coefficient with its 95% HPD interval
mcmc_areas(
  posterior_samples, 
  pars = "b_VSO", 
  prob = 0.95  # 95% HPD interval
) +
  ggtitle("MCMC Density Plot for VSO") +
  xlab("VSO Coefficient") -> universal17_slope_hpd
ggsave(universal17_slope_hpd, file = "_img/universal17_slope_hpd_3.svg")
Saving 6.67 x 6.67 in image

And now both random effects simultaneously.

# fit_4 <- brm(
#     N_A ~ VSO + (1|Macroarea) + (1|Family),
#     data = d,
#     family = bernoulli(),
#     save_pars = save_pars(all = TRUE),
#     control = list(adapt_delta = 0.99, max_treedepth = 15)
# )
# saveRDS(fit_4, "fit_4.rds")
fit_4 <- readRDS("fit_4.rds")
summary(fit_4)
 Family: bernoulli 
  Links: mu = logit 
Formula: N_A ~ VSO + (1 | Macroarea) + (1 | Family) 
   Data: d (Number of observations: 1182) 
Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup samples = 4000

Group-Level Effects: 
~Family (Number of levels: 179) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     3.05      0.55     2.14     4.24 1.01      884     1734

~Macroarea (Number of levels: 6) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     0.56      0.43     0.02     1.67 1.00      798     1652

Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept     0.68      0.43    -0.18     1.58 1.00     1293     2182
VSO          -0.06      0.38    -0.79     0.71 1.00     5485     3101

Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
# Extract posterior samples of the regression coefficients
posterior_samples <- as.mcmc(fit_4)
# Density plot of the VSO coefficient with its 95% HPD interval
mcmc_areas(
  posterior_samples, 
  pars = "b_VSO", 
  prob = 0.95  # 95% HPD interval
) +
  ggtitle("MCMC Density Plot for VSO") +
  xlab("VSO Coefficient") -> universal17_slope_hpd
ggsave(universal17_slope_hpd, file = "_img/universal17_slope_hpd_4.svg")
Saving 6.67 x 6.67 in image

Model comparison

I will use two techniques for model comparison:

  • Pareto-smoothed leave-one-out cross-validation
  • Bayes Factor
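In formulas: PSIS-LOO estimates the expected log pointwise predictive density under leave-one-out cross-validation, and the Bayes factor is the ratio of marginal likelihoods, obtainable from the log marginal likelihoods that bridge sampling returns:

```latex
\operatorname{elpd}_{\text{loo}} = \sum_{i=1}^{n} \log p(y_i \mid y_{-i}),
\qquad
\mathrm{BF}_{12} = \frac{p(y \mid M_1)}{p(y \mid M_2)}
 = \exp\!\left(\log \operatorname{ml}_1 - \log \operatorname{ml}_2\right)
```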
# Compute LOO for each model
loo_1 <- loo(fit_1)
loo_2 <- loo(fit_2)
loo_3 <- loo(fit_3)
loo_4 <- loo(fit_4)
Warning message:
“Found 52 observations with a pareto_k > 0.7 in model 'fit_3'. It is recommended to set 'moment_match = TRUE' in order to perform moment matching for problematic observations. ”
Warning message:
“Found 44 observations with a pareto_k > 0.7 in model 'fit_4'. It is recommended to set 'moment_match = TRUE' in order to perform moment matching for problematic observations. ”
# Compare the models
loo_comparison <- loo_compare(loo_1, loo_2, loo_3, loo_4)

# Print the comparison
print(loo_comparison)
      elpd_diff se_diff
fit_3    0.0       0.0 
fit_4   -1.8       1.1 
fit_2 -121.4      13.8 
fit_1 -203.9      16.9 

Including language families as a random effect leads to a clear improvement. Adding macro-areas on top of that makes the fit to the data slightly worse.

Now we estimate the marginal log-likelihoods of the models to compute the Bayes Factors.

output <- capture.output(ll_1 <- bridge_sampler(fit_1))
output <- capture.output(ll_2 <- bridge_sampler(fit_2))
output <- capture.output(ll_3 <- bridge_sampler(fit_3))
output <- capture.output(ll_4 <- bridge_sampler(fit_4))
tibble(
    model_1 = ll_1$logml,
    model_2 = ll_2$logml,
    model_3 = ll_3$logml,
    model_4 = ll_4$logml
) %>% pivot_longer(
    cols = c("model_1", "model_2", "model_3", "model_4"),
    names_to = "model", values_to = "marginal_loglikelihood"
) %>% arrange(desc(marginal_loglikelihood))
A tibble: 4 × 2
  model   marginal_loglikelihood
  <chr>                    <dbl>
  model_3              -606.7754
  model_4              -608.7576
  model_2              -690.1308
  model_1              -765.8460

The results are qualitatively identical to PSIS-LOO model comparison.
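On this scale, Bayes factors follow directly: the Bayes factor of one model over another is the exponentiated difference of their log marginal likelihoods. A small sketch, plugging in the values reported above for fit_3 and fit_4 (they vary slightly between bridge-sampling runs):

```r
# Bayes factor of fit_3 over fit_4, computed from the
# log marginal likelihoods returned by bridge_sampler()
logml_3 <- -606.7754
logml_4 <- -608.7576
bf_34 <- exp(logml_3 - logml_4)
round(bf_34, 2)  # roughly 7.3 in favor of the family-only model
```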

Let us check whether a random slope for families makes a difference.

# fit_5 <- brm(
#     N_A ~ VSO + (VSO|Family),
#     data = d,
#     family = bernoulli(),
#     save_pars = save_pars(all = TRUE)
# )
# saveRDS(fit_5, file="fit_5.rds")
fit_5 <- readRDS("fit_5.rds")
summary(fit_5)
 Family: bernoulli 
  Links: mu = logit 
Formula: N_A ~ VSO + (VSO | Family) 
   Data: d (Number of observations: 1182) 
Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup samples = 4000

Group-Level Effects: 
~Family (Number of levels: 179) 
                   Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)          3.31      0.58     2.35     4.60 1.00      874     1554
sd(VSO)                2.89      1.40     1.11     6.47 1.00     1418     1641
cor(Intercept,VSO)    -0.10      0.38    -0.80     0.62 1.00     2598     2595

Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept     0.69      0.33     0.08     1.38 1.01      903     1504
VSO           0.65      1.02    -1.36     2.92 1.00     1660     1258

Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
# Extract posterior samples of the regression coefficients
posterior_samples <- as.mcmc(fit_5)
# Density plot of the VSO coefficient with its 95% HPD interval
mcmc_areas(
  posterior_samples, 
  pars = "b_VSO", 
  prob = 0.95  # 95% HPD interval
) +
  ggtitle("MCMC Density Plot for VSO") +
  xlab("VSO Coefficient") -> universal17_slope_hpd
ggsave(universal17_slope_hpd, file = "_img/universal17_slope_hpd_5.svg")
Saving 6.67 x 6.67 in image

loo_5 <- loo(fit_5)
loo_comparison <- loo_compare(loo_1, loo_2, loo_3, loo_4, loo_5)
loo_comparison
Warning message:
“Found 58 observations with a pareto_k > 0.7 in model 'fit_5'. It is recommended to set 'moment_match = TRUE' in order to perform moment matching for problematic observations. ”
A compare.loo: 5 × 8 of type dbl
        elpd_diff   se_diff  elpd_loo se_elpd_loo     p_loo  se_p_loo    looic se_looic
fit_5     0.00000  0.000000 -548.6753    18.41976 94.810705 6.2567796 1097.351 36.83952
fit_3   -10.98718  5.002760 -559.6625    18.48225 89.500034 5.8725805 1119.325 36.96449
fit_4   -12.80774  5.080454 -561.4831    18.46092 89.196145 5.7341311 1122.966 36.92185
fit_2  -132.41430 14.129699 -681.0896    15.39807  7.110228 0.2704566 1362.179 30.79614
fit_1  -214.89195 17.058385 -763.5673    10.52670  2.075960 0.1283622 1527.135 21.05340
output <- capture.output(ll_5 <- bridge_sampler(fit_5))
# tibble(
#     model_1 = ll_1$logml,
#     model_2 = ll_2$logml,
#     model_3 = ll_3$logml,
#     model_4 = ll_4$logml,
#     model_5 = ll_5$logml
# ) %>% pivot_longer(
#     cols = c("model_1", "model_2", "model_3", "model_4", "model_5"),
#     names_to = "model", values_to = "marginal_loglikelihood"
# ) %>% arrange(desc(marginal_loglikelihood))
  model   marginal_loglikelihood
  model_5              -598.7606
  model_3              -606.9392
  model_4              -608.6612
  model_2              -690.1678
  model_1              -765.8465

Including a random slope for families clearly improves the model.
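To see what the population-level estimates of fit_5 mean on the probability scale, one can pass them through the inverse logit. This is only a rough sketch using the posterior means from the summary above, and it ignores the family-level random effects, which is precisely what the posterior predictions below take into account:

```r
# Fixed-effects-only probabilities implied by fit_5
# (posterior means from summary(fit_5); random effects ignored)
b_intercept <- 0.69
b_vso <- 0.65
p_nonvso <- plogis(b_intercept)       # P(N_A = 1 | VSO = 0)
p_vso <- plogis(b_intercept + b_vso)  # P(N_A = 1 | VSO = 1)
c(p_nonvso, p_vso)  # roughly 0.67 and 0.79
```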

Now suppose we discover a new language family. What is the posterior distribution of the conditional probability \(P(\text{N\_A} \mid \text{VSO})\) within this family?

new_data <- data.frame(
  VSO = 1,   # Set VSO to 1
  Family = NA  # Placeholder for the new family
)

# Make predictions
# posterior_epred gives you the expected value (mean) of the posterior distribution
posterior_predictions <- posterior_epred(fit_5, newdata = new_data, re_formula = ~ (VSO|Family))

posterior_tibble <- tibble(posterior = posterior_predictions[,1])

summary(posterior_predictions)
       V1        
 Min.   :0.0000  
 1st Qu.:0.2074  
 Median :0.7950  
 Mean   :0.6214  
 3rd Qu.:0.9828  
 Max.   :1.0000  
mcmc_areas(
  posterior_tibble, 
  pars = "posterior",
  prob = 0.95
) +
  ggtitle("95% Highest Posterior Density Interval for P(N_A = 1 | VSO = 1)") -> universal17_hpd
  ggsave(universal17_hpd, filename="_img/universal17_hpd.svg")
Saving 6.67 x 6.67 in image

So in an unknown family, we expect a VSO language either to have a strong noun-adjective or a strong adjective-noun bias.

Now let us check what we can expect if we sample an unknown VSO language from an unknown family.

samples <- posterior_predict(fit_5, newdata = new_data, re_formula = ~ (VSO|Family))
mean(samples)
0.6175

So, this modeling approach lends support to Universal 17, if we interpret "with overwhelmingly more than chance frequency" as \(>0.5\).

Let us check what the model predicts for non-VSO languages.

new_data <- data.frame(
  VSO = 0,   # Set VSO to 0 for non-VSO languages
  Family = NA  # Placeholder for the new family
)

samples <- posterior_predict(fit_5, newdata = new_data, re_formula = ~ (VSO|Family))
mean(samples)
0.62425

This is virtually identical to the estimate for VSO languages. So if we interpret "with overwhelmingly more than chance frequency" as "more frequently than you would expect if you knew nothing about the language", then the model does not support Universal 17.

Of course, this is also the interpretation we get directly from the regression analysis, since the credible interval of the slope of the main effect includes 0.

summary(fit_5)
 Family: bernoulli 
  Links: mu = logit 
Formula: N_A ~ VSO + (VSO | Family) 
   Data: d (Number of observations: 1182) 
Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup samples = 4000

Group-Level Effects: 
~Family (Number of levels: 179) 
                   Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)          3.31      0.58     2.35     4.60 1.00      874     1554
sd(VSO)                2.89      1.40     1.11     6.47 1.00     1418     1641
cor(Intercept,VSO)    -0.10      0.38    -0.80     0.62 1.00     2598     2595

Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept     0.69      0.33     0.08     1.38 1.01      903     1504
VSO           0.65      1.02    -1.36     2.92 1.00     1660     1258

Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

Continuous variables

  • A much-discussed example: do languages with larger speaker populations have larger sound inventories? Let’s check this out
  • data sources:
sound_inventory_population <- read_csv("../data/soundinventory_population.csv") %>%
    drop_na("Family")
sound_inventory_population


A tibble: 1645 × 7 (head and tail of the display shown)
  Glottocode nSegments population Macroarea     Latitude Longitude Family
  <chr>          <dbl>      <dbl> <chr>            <dbl>     <dbl> <chr>
  huaa1248       161.0       2500 Africa      -24.200220  20.79583 Tuu
  juho1239       141.0      45500 Africa      -19.688000  20.76730 Kxa
  kild1236       128.0        600 Eurasia      68.215800  35.83480 Uralic
  kung1261       120.0      16500 Africa      -21.920000  18.00000 Kxa
  tigo1236       116.0      60000 Africa        6.983995  10.70849 Atlantic-Congo
  ...
  abau1245        14.0       7500 Papunesia    -3.972220 141.32400 Sepik
  toar1246        14.0      25200 Papunesia    -8.072310 146.20600 Eleman
  waim1251        14.0      12000 Papunesia    -8.671450 146.54400 Austronesian
  roto1249        11.0      11600 Papunesia    -5.943390 155.15400 North Bougainville
sound_inventory_population <- sound_inventory_population %>%
    mutate(logpop = log10(population)) %>%
    st_as_sf(coords = c("Longitude", "Latitude"), crs = 4326)
tmap_options(bg.color = "black", legend.text.color = "black")
bg.map +
    sound_inventory_population %>%
    tm_shape() +
    tm_symbols(
        col="nSegments",
        style = "quantile",
        title.col = "Number of segments",
        palette="RdYlBu",
        size=.05,
        border.lwd=0.1
    ) + 
    tm_layout(legend.outside=T,
              legend.outside.position = "bottom", scale=1, 
              legend.bg.color="grey") -> sound_inventory_map

tmap_save(sound_inventory_map, "_img/sound_inventory_map.svg")
Map saved to /mnt/c/Users/gerha/OneDrive - UT Cloud/shareAcrossMachines/_lehre/ws2425/esslli2024_statistical_typology/slides/_img/sound_inventory_map.svg

Size: 9.765879 by 5.01747 inches

tmap_options(bg.color = "black", legend.text.color = "black")
bg.map +
    sound_inventory_population %>%
    tm_shape() +
    tm_symbols(
        col="logpop",
        style = "quantile",
        title.col = "Population size (log)",
        palette="RdYlBu",
        size=.05,
        border.lwd=0.1
    ) + 
    tm_layout(legend.outside=T,
              legend.outside.position = "bottom", scale=1, 
              legend.bg.color="grey") -> logpop_map

tmap_save(logpop_map, "_img/logpop_map.svg")
Map saved to /mnt/c/Users/gerha/OneDrive - UT Cloud/shareAcrossMachines/_lehre/ws2425/esslli2024_statistical_typology/slides/_img/logpop_map.svg

Size: 9.765879 by 5.01747 inches

ggplot(sound_inventory_population, aes(x = logpop, y = nSegments)) +
    geom_point(aes(color = Macroarea), alpha = 0.6) +      # Scatter plot with colors by Macroarea
    geom_smooth(method = "lm", se = TRUE, color = "black") +  # Trend line for the entire dataframe
    scale_y_log10() +
    geom_smooth(aes(group = Macroarea, color = Macroarea), method = "lm", se = FALSE) +  # Trend lines for each Macroarea
    labs(title = "Number of segments vs. population size (log)",
         x = "Population (log)",
         y = "Number of Segments",
         color = "Macroarea") +                            # Label for the legend
    theme_minimal() -> sound_pop_scatter
ggsave(sound_pop_scatter, file = "_img/sound_pop_scatter.svg")
Saving 6.67 x 6.67 in image

`geom_smooth()` using formula 'y ~ x'

`geom_smooth()` using formula 'y ~ x'

medians <- sound_inventory_population %>%
    st_set_geometry(NULL) %>%
    group_by(Macroarea) %>%
    summarize(median_n_segments = median(nSegments, na.rm = TRUE),
              median_logpop = median(logpop, na.rm = TRUE))
sound_pop_median_scatter <- ggplot(medians, aes(x = median_logpop, y = median_n_segments)) +
    scale_y_log10() +
    geom_point(aes(color = Macroarea), size = 4) +             # Scatter plot with colors by Macroarea
    geom_text(aes(label = Macroarea), vjust="inward",hjust="inward", check_overlap = TRUE) + # Adding labels for each macroarea
    geom_smooth(method = "lm", se = TRUE, color = "black", fill = "lightgray") +  # Trend line with uncertainty interval
    labs(title = "Scatter plot of medians by Macroarea",
         x = "Median Log Population",
         y = "Median Number of Segments",
         color = "Macroarea") +                                 # Label for the legend
    theme_minimal()

ggsave(sound_pop_median_scatter, file = "_img/sound_pop_median_scatter.svg")
Saving 6.67 x 6.67 in image

`geom_smooth()` using formula 'y ~ x'

The correlation seems to hold mostly between macroareas, not within them.
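This between-versus-within pattern is an instance of Simpson's paradox: group-level differences can produce a pooled correlation that vanishes (or reverses) once group membership is controlled for. A minimal simulation with entirely made-up numbers (base R only, six artificial "macroareas") illustrates the point:

```r
set.seed(1)
## six artificial areas: baseline y and baseline x both differ between areas,
## but x has no effect on y within any single area
group_x <- c(2, 3, 4, 5, 6, 7)
group_y <- c(40, 35, 30, 25, 20, 15)
d <- do.call(rbind, lapply(1:6, function(g) data.frame(
    area = factor(g),
    x = rnorm(50, group_x[g], 0.3),
    y = rnorm(50, group_y[g], 2)
)))
pooled <- coef(lm(y ~ x, data = d))["x"]        # strongly negative
within <- coef(lm(y ~ x + area, data = d))["x"] # near zero: no real effect
pooled; within
```

The pooled regression picks up the between-area gradient; adding area as a covariate removes it, which is what the varying intercepts in the hierarchical models below do in a partially pooled way.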

sound_inventory_population <- sound_inventory_population %>%
    mutate(nSegments = as.integer(nSegments))
# fit_c_1 <- brm(nSegments ~ logpop, family=poisson(), data = sound_inventory_population, save_pars = save_pars(all = TRUE))
# saveRDS(fit_c_1, "fit_c_1.rds")

Hierarchical Bayesian modeling

Baseline model

Since the response variable holds count data, a Poisson regression is a natural choice.

fit_c_1 <- readRDS("fit_c_1.rds")
summary(fit_c_1)
 Family: poisson 
  Links: mu = log 
Formula: nSegments ~ logpop 
   Data: sound_inventory_population (Number of observations: 1645) 
Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup samples = 4000

Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept     3.28      0.01     3.26     3.31 1.00     2588     2482
logpop        0.07      0.00     0.06     0.07 1.00     3025     2854

Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
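With the log link, coefficients act multiplicatively on the expected count, so the estimates above have a direct reading: a one-unit increase in logpop (i.e. a tenfold increase in population, since logpop is log10) multiplies the expected number of segments by exp(0.07) ≈ 1.07, roughly a 7% increase. A quick back-of-the-envelope check, plugging in the posterior means from the summary above:

```r
b0 <- 3.28  # posterior mean of the intercept (from the summary above)
b1 <- 0.07  # posterior mean of the logpop slope
exp(b1)            # multiplicative effect of a tenfold population increase
exp(b0 + b1 * 3)   # expected segments for a language with 10^3 speakers, ~33
exp(b0 + b1 * 6)   # ... and with 10^6 speakers, ~40
```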
posterior_samples <- as.mcmc(fit_c_1)
mcmc_areas(
  posterior_samples, 
  pars = "b_logpop", 
  prob = 0.95  # 95% HPD interval
) +
  ggtitle("Posterior density of the logpop coefficient (baseline model)") +
  xlab("logpop coefficient") -> sound_inventories_slope_hpd_1
ggsave(sound_inventories_slope_hpd_1, file = "_img/sound_inventories_slope_hpd_1.svg")
Saving 6.67 x 6.67 in image

# fit_c_2 <- brm(
#     nSegments ~ logpop + (1|Macroarea),
#     family=poisson(),
#     data = sound_inventory_population,
#     save_pars = save_pars(all = TRUE),
#     control = list(adapt_delta = 0.99, max_treedepth = 15)
# )
# saveRDS(fit_c_2, file = "fit_c_2.rds")
fit_c_2 <- readRDS("fit_c_2.rds")
summary(fit_c_2)
 Family: poisson 
  Links: mu = log 
Formula: nSegments ~ logpop + (1 | Macroarea) 
   Data: sound_inventory_population (Number of observations: 1645) 
Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup samples = 4000

Group-Level Effects: 
~Macroarea (Number of levels: 6) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     0.37      0.18     0.18     0.88 1.01      453      647

Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept     3.46      0.18     3.10     3.83 1.01      454      313
logpop       -0.01      0.00    -0.01     0.00 1.00     2059     1490

Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
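The macroarea intercepts are partially pooled: each area's intercept is a precision-weighted compromise between that area's own mean and the grand mean. A hand-computed normal-approximation sketch of this shrinkage (the grand mean and tau echo the summary above; the residual sd, area mean, and n are hypothetical):

```r
grand_mean <- 3.46   # population-level intercept (cf. summary above)
tau        <- 0.37   # sd of the macroarea intercepts (cf. summary above)
sigma      <- 0.30   # assumed residual sd on the log scale (hypothetical)
area_mean  <- 3.90   # raw mean of one hypothetical macroarea
n          <- 20     # hypothetical number of languages in that area
w <- (n / sigma^2) / (n / sigma^2 + 1 / tau^2)  # weight on the area's own data
shrunken <- w * area_mean + (1 - w) * grand_mean
w; shrunken   # the area estimate is pulled toward the grand mean
```

With many languages per area the weight w is close to 1, so shrinkage is mild, but the varying intercepts still absorb the between-area differences that drove the positive slope of the baseline model.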
posterior_samples <- as.mcmc(fit_c_2)
mcmc_areas(
  posterior_samples, 
  pars = "b_logpop", 
  prob = 0.95  # 95% HPD interval
) +
  ggtitle("Posterior density of the logpop coefficient (macroarea model)") +
  xlab("logpop coefficient") -> sound_inventories_slope_hpd_2
ggsave(sound_inventories_slope_hpd_2, file = "_img/sound_inventories_slope_hpd_2.svg")
Saving 6.67 x 6.67 in image

# fit_c_3 <- brm(
#     nSegments ~ logpop + (1|Family),
#     family = poisson(),
#     data = sound_inventory_population,
#     save_pars = save_pars(all = TRUE),
#     control = list(adapt_delta = 0.99, max_treedepth = 15)
# )
# saveRDS(fit_c_3, "fit_c_3.rds")
fit_c_3 <- readRDS("fit_c_3.rds")
summary(fit_c_3)
 Family: poisson 
  Links: mu = log 
Formula: nSegments ~ logpop + (1 | Family) 
   Data: sound_inventory_population (Number of observations: 1645) 
Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup samples = 4000

Group-Level Effects: 
~Family (Number of levels: 159) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     0.36      0.02     0.31     0.41 1.01      433      769

Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept     3.43      0.03     3.36     3.49 1.03      246      608
logpop       -0.01      0.00    -0.02    -0.00 1.00     4245     3218

Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
posterior_samples <- as.mcmc(fit_c_3)
mcmc_areas(
  posterior_samples, 
  pars = "b_logpop", 
  prob = 0.95  # 95% HPD interval
) +
  ggtitle("Posterior density of the logpop coefficient (family model)") +
  xlab("logpop coefficient") -> sound_inventories_slope_hpd_3
ggsave(sound_inventories_slope_hpd_3, file = "_img/sound_inventories_slope_hpd_3.svg")
Saving 6.67 x 6.67 in image

# fit_c_4 <- brm(
#     nSegments ~ logpop + (1|Macroarea) + (1|Family),
#     family = poisson(),
#     data = sound_inventory_population,
#     save_pars = save_pars(all = TRUE),
#     control = list(adapt_delta = 0.99, max_treedepth = 15)
# )
# saveRDS(fit_c_4, file="fit_c_4.rds")
fit_c_4 <- readRDS("fit_c_4.rds")
summary(fit_c_4)
posterior_samples <- as.mcmc(fit_c_4)
mcmc_areas(
  posterior_samples, 
  pars = "b_logpop", 
  prob = 0.95  # 95% HPD interval
) +
  ggtitle("Posterior density of the logpop coefficient (macroarea + family model)") +
  xlab("logpop coefficient") -> sound_inventories_slope_hpd_4
ggsave(sound_inventories_slope_hpd_4, file = "_img/sound_inventories_slope_hpd_4.svg")
Saving 6.67 x 6.67 in image

# Compute LOO for each model
loo_1 <- loo(fit_c_1)
loo_2 <- loo(fit_c_2)
loo_3 <- loo(fit_c_3)
loo_4 <- loo(fit_c_4)
Warning message:
“Found 52 observations with a pareto_k > 0.7 in model 'fit_c_3'. It is recommended to set 'moment_match = TRUE' in order to perform moment matching for problematic observations. ”
Warning message:
“Found 44 observations with a pareto_k > 0.7 in model 'fit_c_4'. It is recommended to set 'moment_match = TRUE' in order to perform moment matching for problematic observations. ”
# Compare the models
loo_comparison <- loo_compare(loo_1, loo_2, loo_3, loo_4)

# Print the comparison
print(loo_comparison)
        elpd_diff se_diff
fit_c_3    0.0       0.0 
fit_c_4   -1.8       1.1 
fit_c_2 -121.4      13.8 
fit_c_1 -203.9      16.9 
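elpd_diff is reported relative to the best-ranked model, and a common rule of thumb is that a difference within roughly two standard errors is not decisive. Applying that heuristic to the numbers in the table above:

```r
## elpd differences (and their SEs) of the lower-ranked models vs the best one
elpd_diff <- c(-1.8, -121.4, -203.9)
se_diff   <- c( 1.1,   13.8,   16.9)
abs(elpd_diff) / se_diff  # only the runner-up is within ~2 SEs of the best model
```

So the family model and the macroarea + family model are tied for all practical purposes, while the other two models are clearly worse.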
output <- capture.output(ll_1 <- bridge_sampler(fit_c_1))
output <- capture.output(ll_2 <- bridge_sampler(fit_c_2))
output <- capture.output(ll_3 <- bridge_sampler(fit_c_3))
output <- capture.output(ll_4 <- bridge_sampler(fit_c_4))
tibble(
    model_1 = ll_1$logml,
    model_2 = ll_2$logml,
    model_3 = ll_3$logml,
    model_4 = ll_4$logml
) %>% pivot_longer(
    cols = c("model_1", "model_2", "model_3", "model_4"),
    names_to = "model", values_to = "marginal_loglikelihood"
) %>% arrange(desc(marginal_loglikelihood))
A tibble: 4 × 2
model marginal_loglikelihood
<chr> <dbl>
model_3 -606.7577
model_4 -608.4277
model_2 -690.1292
model_1 -765.8476
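Differences of log marginal likelihoods translate directly into Bayes factors: BF(A, B) = exp(logml_A − logml_B). Using the values in the table above:

```r
logml <- c(model_3 = -606.7577, model_4 = -608.4277,
           model_2 = -690.1292, model_1 = -765.8476)
exp(logml["model_3"] - logml["model_4"])  # BF of model 3 over model 4: about 5
exp(logml["model_3"] - logml["model_2"])  # overwhelming evidence against model 2
```

A Bayes factor around 5 is only mild evidence for model 3 over model 4, consistent with the LOO comparison above.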

Model 3 and model 4 are virtually indistinguishable regarding their fit to the data, and both are clearly preferred over the baseline. Notably, once genealogical relatedness is controlled for, the positive population effect of the baseline model disappears: model 3 even estimates a slightly negative logpop coefficient.