Statistical Typology

35th European Summer School in Logic, Language and Information

Leuven, August 2024

Gerhard Jäger, Tübingen University

Lecture 3

Continuous variables

  • Let’s check this out
  • data sources:
library(repr)
options(repr.plot.width=30, repr.plot.height=21)
     
library(tidyverse)
library(leaflet)
library(tmap)
library(sf)
library(svglite)
library(ggthemes)
library(viridis)
library(brms)
library(loo)
library(bridgesampling)
library(coda)
library(bayesplot)
library(RColorBrewer)
── Attaching packages ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.3     ✔ purrr   0.3.4
✔ tibble  3.1.2     ✔ dplyr   1.0.6
✔ tidyr   1.1.3     ✔ stringr 1.4.0
✔ readr   1.4.0     ✔ forcats 0.5.1

── Conflicts ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Linking to GEOS 3.9.1, GDAL 3.3.0, PROJ 8.0.0

Loading required package: viridisLite

Loading required package: Rcpp

Loading 'brms' package (version 2.14.4). Useful instructions
can be found by typing help('brms'). A more detailed introduction
to the package is available through vignette('brms_overview').


Attaching package: ‘brms’


The following object is masked from ‘package:stats’:

    ar


This is loo version 2.4.1

- Online documentation and vignettes at mc-stan.org/loo

- As of v2.0.0 loo defaults to 1 core but we recommend using as many as possible. Use the 'cores' argument or set options(mc.cores = NUM_CORES) for an entire session. 


Attaching package: ‘bridgesampling’


The following object is masked from ‘package:brms’:

    bf


This is bayesplot version 1.8.0

- Online documentation and vignettes at mc-stan.org/bayesplot

- bayesplot theme set to bayesplot::theme_default()

   * Does _not_ affect other ggplot2 plots

   * See ?bayesplot_theme_set for details on theme setting
world.160e <- read_sf("../data/world_160e.gpkg")
bg.map <- world.160e %>% 
    st_transform("+proj=eqearth lon_0=160") %>%
    tm_shape() +
    tm_fill()
sound_inventory_population <- read_csv("../data/soundinventory_population.csv") %>%
    drop_na("Family")
sound_inventory_population

── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
cols(
  Glottocode = col_character(),
  nSegments = col_double(),
  population = col_double(),
  Macroarea = col_character(),
  Latitude = col_double(),
  Longitude = col_double(),
  Family = col_character()
)

A tibble: 1645 × 7
Glottocode nSegments population Macroarea Latitude Longitude Family
<chr> <dbl> <dbl> <chr> <dbl> <dbl> <chr>
huaa1248 161.0 2500 Africa -24.200220 20.79583 Tuu
juho1239 141.0 45500 Africa -19.688000 20.76730 Kxa
kild1236 128.0 600 Eurasia 68.215800 35.83480 Uralic
kung1261 120.0 16500 Africa -21.920000 18.00000 Kxa
tigo1236 116.0 60000 Africa 6.983995 10.70849 Atlantic-Congo
irig1241 95.0 40000 Africa 9.851970 8.65250 Atlantic-Congo
arch1244 91.0 970 Eurasia 42.323900 46.82670 Nakh-Daghestanian
suga1248 90.0 10000 Africa 7.129060 12.37790 Atlantic-Congo
yeyi1239 90.0 11180 Africa -18.917800 23.60880 Atlantic-Congo
mfum1238 89.0 24700 Africa 6.600000 10.93639 Atlantic-Congo
xhos1239 84.0 8183300 Africa -31.038900 28.07690 Atlantic-Congo
sera1259 83.0 26009000 Eurasia 29.553400 71.90600 Indo-European
scot1245 82.0 60130 Eurasia 56.757400 -5.24366 Indo-European
veng1238 82.0 27000 Africa 6.114030 10.41370 Atlantic-Congo
moro1292 78.0 28019650 Africa 32.500000 -7.50000 Afro-Asiatic
para1301 77.0 805700 Eurasia 22.865800 99.37500 Austroasiatic
kang1288 75.5 15000 Africa 11.017400 29.34630 Kadugli-Krongo
idom1241 75.0 927000 Africa 7.191390 7.71261 Atlantic-Congo
komc1235 75.0 233000 Africa 6.259600 10.33480 Atlantic-Congo
skol1241 75.0 320 Eurasia 68.832600 29.72040 Uralic
eten1239 74.0 40000 Africa 9.730730 8.61443 Atlantic-Congo
nort2671 74.0 25700 Eurasia 68.725000 22.11130 Uralic
sich1238 74.0 2000000 Eurasia 28.194700 102.12100 Sino-Tibetan
dime1235 73.0 1100 Africa 6.209510 36.33290 South Omotic
lezg1247 71.5 623710 Eurasia 41.515700 47.89510 Nakh-Daghestanian
kham1282 70.5 1380300 Eurasia 31.931300 91.70620 Sino-Tibetan
chen1255 70.0 26000 Eurasia 15.878700 79.25340 Dravidian
tada1238 70.0 159800 Africa 16.685000 2.32653 Songhay
huml1238 69.0 5000 Eurasia 30.145100 81.56590 Sino-Tibetan
lakk1252 69.0 156300 Eurasia 42.132800 47.08090 Nakh-Daghestanian
iwam1256 17.0 3000 Papunesia -4.273460 141.8170 Sepik
mara1404 17.0 866000 Papunesia 7.794140 124.1750 Austronesian
nucl1632 17.0 30000 Papunesia -2.601300 140.5120 Sentanic
orok1269 17.0 35000 Papunesia -8.790190 148.0900 Nuclear Trans New Guinea
para1310 17.0 340 South America -3.712630 -53.0657 Cariban
para1312 17.0 900 South America -4.646810 -50.0621 Tupian
saba1268 17.0 3 South America -12.985700 -60.3350 Nambiquaran
sout2895 17.0 1500 Papunesia -3.234110 129.1480 Austronesian
tatu1247 17.0 330 South America 0.555820 -70.5327 Tucanoan
tiga1245 17.0 10000 Papunesia -2.709510 150.8930 Austronesian
tong1325 17.0 189740 Papunesia -21.170000 -175.2500 Austronesian
yuru1263 17.0 1100 South America 0.735585 -69.6955 Tucanoan
ainu1240 16.5 2 Eurasia 43.633654 142.4617 Ainu
binu1245 16.0 1200 Papunesia -6.220680 146.1020 Nuclear Trans New Guinea
coca1259 16.0 250 South America -4.500000 -74.0000 Tupian
dyir1250 16.0 52 Australia -17.451600 145.5440 Pama-Nyungan
gras1249 16.0 1700 Papunesia -9.512180 147.4370 Koiarian
hawa1245 16.0 2000 Papunesia 19.629700 -155.4300 Austronesian
jama1261 16.0 780 South America -7.622230 -66.5662 Arawan
jung1240 16.0 11200 South America 0.888720 -76.6593 Quechuan
kuku1273 16.0 320 Australia -16.003600 145.1880 Pama-Nyungan
toca1235 16.0 380 South America -3.593590 -49.8991 Tupian
waya1269 16.0 1740 South America 2.775370 -54.4429 Cariban
naas1242 15.5 22000 Papunesia -6.460220 155.6320 South Bougainville
ekar1243 15.0 100000 Papunesia -3.889740 136.0180 Nuclear Trans New Guinea
nami1256 15.0 7000 Papunesia -3.855380 141.7880 Sepik
abau1245 14.0 7500 Papunesia -3.972220 141.3240 Sepik
toar1246 14.0 25200 Papunesia -8.072310 146.2060 Eleman
waim1251 14.0 12000 Papunesia -8.671450 146.5440 Austronesian
roto1249 11.0 11600 Papunesia -5.943390 155.1540 North Bougainville
sound_inventory_population <- sound_inventory_population %>%
    mutate(logpop = log10(population)) %>%
    st_as_sf(coords = c("Longitude", "Latitude"), crs = 4326)
tmap_options(bg.color = "black", legend.text.color = "black")
bg.map +
    sound_inventory_population %>%
    tm_shape() +
    tm_symbols(
        col="nSegments",
        style = "quantile",
        title.col = "Number of segments",
        palette="RdYlBu",
        size=.05,
        border.lwd=0.1
    ) + 
    tm_layout(legend.outside=T,
              legend.outside.position = "bottom", scale=1, 
              legend.bg.color="grey") -> sound_inventory_map

tmap_save(sound_inventory_map, "_img/sound_inventory_map.svg")
Map saved to /mnt/c/Users/gerha/OneDrive - UT Cloud/shareAcrossMachines/_lehre/ws2425/esslli2024_statistical_typology/slides/_img/sound_inventory_map.svg

Size: 9.765879 by 5.01747 inches

tmap_options(bg.color = "black", legend.text.color = "black")
bg.map +
    sound_inventory_population %>%
    tm_shape() +
    tm_symbols(
        col="logpop",
        style = "quantile",
        title.col = "Population size (log)",
        palette="RdYlBu",
        size=.05,
        border.lwd=0.1
    ) + 
    tm_layout(legend.outside=T,
              legend.outside.position = "bottom", scale=1, 
              legend.bg.color="grey") -> logpop_map

tmap_save(logpop_map, "_img/logpop_map.svg")
Map saved to /mnt/c/Users/gerha/OneDrive - UT Cloud/shareAcrossMachines/_lehre/ws2425/esslli2024_statistical_typology/slides/_img/logpop_map.svg

Size: 9.765879 by 5.01747 inches

ggplot(sound_inventory_population, aes(x = logpop, y = nSegments)) +
    geom_point(aes(color = Macroarea), alpha = 0.6) +      # Scatter plot with colors by Macroarea
    geom_smooth(method = "lm", se = TRUE, color = "black") +  # Trend line for the entire dataframe
    scale_y_log10() +
    geom_smooth(aes(group = Macroarea, color = Macroarea), method = "lm", se = FALSE) +  # Trend lines for each Macroarea
    labs(title = "Scatter plot of n_segments vs logpop",
         x = "Population (log)",
         y = "Number of Segments",
         color = "Macroarea") +                            # Label for the legend
    theme_minimal() -> sound_pop_scatter
ggsave(sound_pop_scatter, file = "_img/sound_pop_scatter.svg")
Saving 6.67 x 6.67 in image

`geom_smooth()` using formula 'y ~ x'

`geom_smooth()` using formula 'y ~ x'

medians <- sound_inventory_population %>%
    st_set_geometry(NULL) %>%
    group_by(Macroarea) %>%
    summarize(median_n_segments = median(nSegments, na.rm = TRUE),
              median_logpop = median(logpop, na.rm = TRUE))
sound_pop_median_scatter <- ggplot(medians, aes(x = median_logpop, y = median_n_segments)) +
    scale_y_log10() +
    geom_point(aes(color = Macroarea), size = 4) +             # Scatter plot with colors by Macroarea
    geom_text(aes(label = Macroarea), vjust="inward",hjust="inward", check_overlap = TRUE) + # Adding labels for each macroarea
    geom_smooth(method = "lm", se = TRUE, color = "black", fill = "lightgray") +  # Trend line with uncertainty interval
    labs(title = "Scatter plot of medians by Macroarea",
         x = "Median Log Population",
         y = "Median Number of Segments",
         color = "Macroarea") +                                 # Label for the legend
    theme_minimal()

ggsave(sound_pop_median_scatter, file = "_img/sound_pop_median_scatter.svg")
Saving 6.67 x 6.67 in image

`geom_smooth()` using formula 'y ~ x'

The correlation seems to be mostly between macroareas.

sound_inventory_population <- sound_inventory_population %>%
    mutate(nSegments = as.integer(nSegments))
# fit_c_1 <- brm(nSegments ~ logpop, family=poisson(), data = sound_inventory_population, save_pars = save_pars(all = TRUE))
# saveRDS(fit_c_1, "fit_c_1.rds")

Hierarchical Bayesian modeling

baseline model

Since the response variable holds count data, a Poisson regression is a natural choice.

fit_c_1 <- readRDS("fit_c_1.rds")
summary(fit_c_1)
 Family: poisson 
  Links: mu = log 
Formula: nSegments ~ logpop 
   Data: sound_inventory_population (Number of observations: 1645) 
Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup samples = 4000

Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept     3.28      0.01     3.26     3.31 1.00     2588     2482
logpop        0.07      0.00     0.06     0.07 1.00     3025     2854

Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
posterior_samples <- as.mcmc(fit_c_1)
mcmc_areas(
  posterior_samples, 
  pars = "b_logpop", 
  prob = 0.95  # 95% HPD interval
) +
  ggtitle("MCMC Density Plot") +
  xlab("log(population) Coefficient") -> sound_inventories_slope_hpd_1
ggsave(sound_inventories_slope_hpd_1, file = "_img/sound_inventories_slope_hpd_1.svg")
Saving 6.67 x 6.67 in image

# fit_c_2 <- brm(
#     nSegments ~ logpop + (1|Macroarea),
#     family=poisson(),
#     data = sound_inventory_population,
#     save_pars = save_pars(all = TRUE),
#     control = list(adapt_delta = 0.99, max_treedepth = 15)
# )
# saveRDS(fit_c_2, file = "fit_c_2.rds")
fit_c_2 <- readRDS("fit_c_2.rds")
summary(fit_c_2)
 Family: poisson 
  Links: mu = log 
Formula: nSegments ~ logpop + (1 | Macroarea) 
   Data: sound_inventory_population (Number of observations: 1645) 
Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup samples = 4000

Group-Level Effects: 
~Macroarea (Number of levels: 6) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     0.37      0.18     0.18     0.88 1.01      453      647

Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept     3.46      0.18     3.10     3.83 1.01      454      313
logpop       -0.01      0.00    -0.01     0.00 1.00     2059     1490

Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
posterior_samples <- as.mcmc(fit_c_2)
mcmc_areas(
  posterior_samples, 
  pars = "b_logpop", 
  prob = 0.95  # 95% HPD interval
) +
  ggtitle("MCMC Density Plot") +
  xlab("log(population) Coefficient") -> sound_inventories_slope_hpd_2
ggsave(sound_inventories_slope_hpd_2, file = "_img/sound_inventories_slope_hpd_2.svg")
Saving 6.67 x 6.67 in image

# fit_c_3 <- brm(
#     nSegments ~ logpop + (1|Family),
#     family = poisson(),
#     data = sound_inventory_population,
#     save_pars = save_pars(all = TRUE),
#     control = list(adapt_delta = 0.99, max_treedepth = 15)
# )
# saveRDS(fit_c_3, "fit_c_3.rds")
fit_c_3 <- readRDS("fit_c_3.rds")
summary(fit_c_3)
 Family: poisson 
  Links: mu = log 
Formula: nSegments ~ logpop + (1 | Family) 
   Data: sound_inventory_population (Number of observations: 1645) 
Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup samples = 4000

Group-Level Effects: 
~Family (Number of levels: 159) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     0.36      0.02     0.31     0.41 1.01      433      769

Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept     3.43      0.03     3.36     3.49 1.03      246      608
logpop       -0.01      0.00    -0.02    -0.00 1.00     4245     3218

Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
posterior_samples <- as.mcmc(fit_c_3)
mcmc_areas(
  posterior_samples, 
  pars = "b_logpop", 
  prob = 0.95  # 95% HPD interval
) +
  ggtitle("MCMC Density Plot") +
  xlab("log(population) Coefficient") -> sound_inventories_slope_hpd_3
ggsave(sound_inventories_slope_hpd_3, file = "_img/sound_inventories_slope_hpd_3.svg")
Saving 6.67 x 6.67 in image

# fit_c_4 <- brm(
#     nSegments ~ logpop + (1|Macroarea) + (1|Family),
#     family = poisson(),
#     data = sound_inventory_population,
#     save_pars = save_pars(all = TRUE),
#     control = list(adapt_delta = 0.99, max_treedepth = 15)
# )
# saveRDS(fit_c_4, file="fit_c_4.rds")
fit_c_4 <- readRDS("fit_c_4.rds")
summary(fit_c_4)
Warning message:
“There were 1 divergent transitions after warmup. Increasing adapt_delta above 0.99 may help. See http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup”
 Family: poisson 
  Links: mu = log 
Formula: nSegments ~ logpop + (1 | Macroarea) + (1 | Family) 
   Data: sound_inventory_population (Number of observations: 1645) 
Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup samples = 4000

Group-Level Effects: 
~Family (Number of levels: 159) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     0.27      0.02     0.24     0.32 1.00      957     1384

~Macroarea (Number of levels: 6) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     0.31      0.16     0.15     0.71 1.00     1164     1785

Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept     3.42      0.15     3.14     3.68 1.00     1258     1576
logpop       -0.01      0.00    -0.02    -0.00 1.00     7312     3511

Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
posterior_samples <- as.mcmc(fit_c_4)
mcmc_areas(
  posterior_samples, 
  pars = "b_logpop", 
  prob = 0.95  # 95% HPD interval
) +
  ggtitle("MCMC Density Plot") +
  xlab("log(population) Coefficient") -> sound_inventories_slope_hpd_4
ggsave(sound_inventories_slope_hpd_4, file = "_img/sound_inventories_slope_hpd_4.svg")
Saving 6.67 x 6.67 in image

# Compute LOO for each model
loo_1 <- loo(fit_c_1)
loo_2 <- loo(fit_c_2)
loo_3 <- loo(fit_c_3)
loo_4 <- loo(fit_c_4)
Warning message:
“Found 57 observations with a pareto_k > 0.7 in model 'fit_c_3'. It is recommended to set 'moment_match = TRUE' in order to perform moment matching for problematic observations. ”
Warning message:
“Found 41 observations with a pareto_k > 0.7 in model 'fit_c_4'. It is recommended to set 'moment_match = TRUE' in order to perform moment matching for problematic observations. ”
# Compare the models
loo_comparison <- loo_compare(loo_1, loo_2, loo_3, loo_4)

# Print the comparison
print(loo_comparison)
        elpd_diff se_diff
fit_c_4     0.0       0.0
fit_c_3   -15.4       8.6
fit_c_2  -519.9     134.6
fit_c_1 -1577.5     182.3
output <- capture.output(ll_1 <- bridge_sampler(fit_c_1))
output <- capture.output(ll_2 <- bridge_sampler(fit_c_2))
output <- capture.output(ll_3 <- bridge_sampler(fit_c_3))
output <- capture.output(ll_4 <- bridge_sampler(fit_c_4))
tibble(
    model_1 = ll_1$logml,
    model_2 = ll_2$logml,
    model_3 = ll_3$logml,
    model_4 = ll_4$logml
) %>% pivot_longer(
    cols = c("model_1", "model_2", "model_3", "model_4"),
    names_to = "model", values_to = "marginal_loglikelihood"
) %>% arrange(desc(marginal_loglikelihood))
A tibble: 4 × 2
model marginal_loglikelihood
<chr> <dbl>
model_4 -6355.114
model_3 -6387.788
model_2 -6800.491
model_1 -7846.969

Model 4 provides the best fit for the data. It predicts a negative coefficient for log(population). So contrary to the initial impression, large languages tend to have slightly smaller phoneme inventories if we control for family and macroarea.