Statistical Typology

35th European Summer School in Logic, Language and Information

Leuven, August 2024

Gerhard Jäger, Tübingen University

Lecture 3

Continuous variables

Let’s check this out
data sources:
- phoneme inventory size: Phoible
- population size: ASJP
- Family affiliation an macro-area: Glottolog

library(repr)
options(repr.plot.width=30, repr.plot.height=21)
     
library(tidyverse)
library(leaflet)
library(tmap)
library(sf)
library(svglite)
library(ggthemes)
library(viridis)
library(brms)
library(loo)
library(bridgesampling)
library(coda)
library(bayesplot)
library(RColorBrewer)

── Attaching packages ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.3     ✔ purrr   0.3.4
✔ tibble  3.1.2     ✔ dplyr   1.0.6
✔ tidyr   1.1.3     ✔ stringr 1.4.0
✔ readr   1.4.0     ✔ forcats 0.5.1

── Conflicts ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Linking to GEOS 3.9.1, GDAL 3.3.0, PROJ 8.0.0

Loading required package: viridisLite

Loading required package: Rcpp

Loading 'brms' package (version 2.14.4). Useful instructions
can be found by typing help('brms'). A more detailed introduction
to the package is available through vignette('brms_overview').


Attaching package: ‘brms’


The following object is masked from ‘package:stats’:

    ar


This is loo version 2.4.1

- Online documentation and vignettes at mc-stan.org/loo

- As of v2.0.0 loo defaults to 1 core but we recommend using as many as possible. Use the 'cores' argument or set options(mc.cores = NUM_CORES) for an entire session. 


Attaching package: ‘bridgesampling’


The following object is masked from ‘package:brms’:

    bf


This is bayesplot version 1.8.0

- Online documentation and vignettes at mc-stan.org/bayesplot

- bayesplot theme set to bayesplot::theme_default()

   * Does _not_ affect other ggplot2 plots

   * See ?bayesplot_theme_set for details on theme setting

world.160e <- read_sf("../data/world_160e.gpkg")
bg.map <- world.160e %>% 
    st_transform("+proj=eqearth lon_0=160") %>%
    tm_shape() +
    tm_fill()

sound_inventory_population <- read_csv("../data/soundinventory_population.csv") %>%
    drop_na("Family")
sound_inventory_population


── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
cols(
  Glottocode = col_character(),
  nSegments = col_double(),
  population = col_double(),
  Macroarea = col_character(),
  Latitude = col_double(),
  Longitude = col_double(),
  Family = col_character()
)

A tibble: 1645 × 7
Glottocode	nSegments	population	Macroarea	Latitude	Longitude	Family
<chr>	<dbl>	<dbl>	<chr>	<dbl>	<dbl>	<chr>
huaa1248	161.0	2500	Africa	-24.200220	20.79583	Tuu
juho1239	141.0	45500	Africa	-19.688000	20.76730	Kxa
kild1236	128.0	600	Eurasia	68.215800	35.83480	Uralic
kung1261	120.0	16500	Africa	-21.920000	18.00000	Kxa
tigo1236	116.0	60000	Africa	6.983995	10.70849	Atlantic-Congo
irig1241	95.0	40000	Africa	9.851970	8.65250	Atlantic-Congo
arch1244	91.0	970	Eurasia	42.323900	46.82670	Nakh-Daghestanian
suga1248	90.0	10000	Africa	7.129060	12.37790	Atlantic-Congo
yeyi1239	90.0	11180	Africa	-18.917800	23.60880	Atlantic-Congo
mfum1238	89.0	24700	Africa	6.600000	10.93639	Atlantic-Congo
xhos1239	84.0	8183300	Africa	-31.038900	28.07690	Atlantic-Congo
sera1259	83.0	26009000	Eurasia	29.553400	71.90600	Indo-European
scot1245	82.0	60130	Eurasia	56.757400	-5.24366	Indo-European
veng1238	82.0	27000	Africa	6.114030	10.41370	Atlantic-Congo
moro1292	78.0	28019650	Africa	32.500000	-7.50000	Afro-Asiatic
para1301	77.0	805700	Eurasia	22.865800	99.37500	Austroasiatic
kang1288	75.5	15000	Africa	11.017400	29.34630	Kadugli-Krongo
idom1241	75.0	927000	Africa	7.191390	7.71261	Atlantic-Congo
komc1235	75.0	233000	Africa	6.259600	10.33480	Atlantic-Congo
skol1241	75.0	320	Eurasia	68.832600	29.72040	Uralic
eten1239	74.0	40000	Africa	9.730730	8.61443	Atlantic-Congo
nort2671	74.0	25700	Eurasia	68.725000	22.11130	Uralic
sich1238	74.0	2000000	Eurasia	28.194700	102.12100	Sino-Tibetan
dime1235	73.0	1100	Africa	6.209510	36.33290	South Omotic
lezg1247	71.5	623710	Eurasia	41.515700	47.89510	Nakh-Daghestanian
kham1282	70.5	1380300	Eurasia	31.931300	91.70620	Sino-Tibetan
chen1255	70.0	26000	Eurasia	15.878700	79.25340	Dravidian
tada1238	70.0	159800	Africa	16.685000	2.32653	Songhay
huml1238	69.0	5000	Eurasia	30.145100	81.56590	Sino-Tibetan
lakk1252	69.0	156300	Eurasia	42.132800	47.08090	Nakh-Daghestanian
⋮	⋮	⋮	⋮	⋮	⋮	⋮
iwam1256	17.0	3000	Papunesia	-4.273460	141.8170	Sepik
mara1404	17.0	866000	Papunesia	7.794140	124.1750	Austronesian
nucl1632	17.0	30000	Papunesia	-2.601300	140.5120	Sentanic
orok1269	17.0	35000	Papunesia	-8.790190	148.0900	Nuclear Trans New Guinea
para1310	17.0	340	South America	-3.712630	-53.0657	Cariban
para1312	17.0	900	South America	-4.646810	-50.0621	Tupian
saba1268	17.0	3	South America	-12.985700	-60.3350	Nambiquaran
sout2895	17.0	1500	Papunesia	-3.234110	129.1480	Austronesian
tatu1247	17.0	330	South America	0.555820	-70.5327	Tucanoan
tiga1245	17.0	10000	Papunesia	-2.709510	150.8930	Austronesian
tong1325	17.0	189740	Papunesia	-21.170000	-175.2500	Austronesian
yuru1263	17.0	1100	South America	0.735585	-69.6955	Tucanoan
ainu1240	16.5	2	Eurasia	43.633654	142.4617	Ainu
binu1245	16.0	1200	Papunesia	-6.220680	146.1020	Nuclear Trans New Guinea
coca1259	16.0	250	South America	-4.500000	-74.0000	Tupian
dyir1250	16.0	52	Australia	-17.451600	145.5440	Pama-Nyungan
gras1249	16.0	1700	Papunesia	-9.512180	147.4370	Koiarian
hawa1245	16.0	2000	Papunesia	19.629700	-155.4300	Austronesian
jama1261	16.0	780	South America	-7.622230	-66.5662	Arawan
jung1240	16.0	11200	South America	0.888720	-76.6593	Quechuan
kuku1273	16.0	320	Australia	-16.003600	145.1880	Pama-Nyungan
toca1235	16.0	380	South America	-3.593590	-49.8991	Tupian
waya1269	16.0	1740	South America	2.775370	-54.4429	Cariban
naas1242	15.5	22000	Papunesia	-6.460220	155.6320	South Bougainville
ekar1243	15.0	100000	Papunesia	-3.889740	136.0180	Nuclear Trans New Guinea
nami1256	15.0	7000	Papunesia	-3.855380	141.7880	Sepik
abau1245	14.0	7500	Papunesia	-3.972220	141.3240	Sepik
toar1246	14.0	25200	Papunesia	-8.072310	146.2060	Eleman
waim1251	14.0	12000	Papunesia	-8.671450	146.5440	Austronesian
roto1249	11.0	11600	Papunesia	-5.943390	155.1540	North Bougainville

sound_inventory_population <- sound_inventory_population %>%
    mutate(logpop = log10(population)) %>%
    st_as_sf(coords = c("Longitude", "Latitude"), crs = 4326)

tmap_options(bg.color = "black", legend.text.color = "black")
bg.map +
    sound_inventory_population %>%
    tm_shape() +
    tm_symbols(
        col="nSegments",
        style = "quantile",
        title.col = "Number of segments",
        palette="RdYlBu",
        size=.05,
        border.lwd=0.1
    ) + 
    tm_layout(legend.outside=T,
              legend.outside.position = "bottom", scale=1, 
              legend.bg.color="grey") -> sound_inventory_map

tmap_save(sound_inventory_map, "_img/sound_inventory_map.svg")

Map saved to /mnt/c/Users/gerha/OneDrive - UT Cloud/shareAcrossMachines/_lehre/ws2425/esslli2024_statistical_typology/slides/_img/sound_inventory_map.svg

Size: 9.765879 by 5.01747 inches

tmap_options(bg.color = "black", legend.text.color = "black")
bg.map +
    sound_inventory_population %>%
    tm_shape() +
    tm_symbols(
        col="logpop",
        style = "quantile",
        title.col = "Population size (log)",
        palette="RdYlBu",
        size=.05,
        border.lwd=0.1
    ) + 
    tm_layout(legend.outside=T,
              legend.outside.position = "bottom", scale=1, 
              legend.bg.color="grey") -> logpop_map

tmap_save(logpop_map, "_img/logpop_map.svg")

Map saved to /mnt/c/Users/gerha/OneDrive - UT Cloud/shareAcrossMachines/_lehre/ws2425/esslli2024_statistical_typology/slides/_img/logpop_map.svg

Size: 9.765879 by 5.01747 inches

ggplot(sound_inventory_population, aes(x = logpop, y = nSegments)) +
    geom_point(aes(color = Macroarea), alpha = 0.6) +      # Scatter plot with colors by Macroarea
    geom_smooth(method = "lm", se = TRUE, color = "black") +  # Trend line for the entire dataframe
    scale_y_log10() +
    geom_smooth(aes(group = Macroarea, color = Macroarea), method = "lm", se = FALSE) +  # Trend lines for each Macroarea
    labs(title = "Scatter plot of n_segments vs logpop",
         x = "Population (log)",
         y = "Number of Segments",
         color = "Macroarea") +                            # Label for the legend
    theme_minimal() -> sound_pop_scatter
ggsave(sound_pop_scatter, file = "_img/sound_pop_scatter.svg")

Saving 6.67 x 6.67 in image

`geom_smooth()` using formula 'y ~ x'

`geom_smooth()` using formula 'y ~ x'

medians <- sound_inventory_population %>%
    st_set_geometry(NULL) %>%
    group_by(Macroarea) %>%
    summarize(median_n_segments = median(nSegments, na.rm = TRUE),
              median_logpop = median(logpop, na.rm = TRUE))

sound_pop_median_scatter <- ggplot(medians, aes(x = median_logpop, y = median_n_segments)) +
    scale_y_log10() +
    geom_point(aes(color = Macroarea), size = 4) +             # Scatter plot with colors by Macroarea
    geom_text(aes(label = Macroarea), vjust="inward",hjust="inward", check_overlap = TRUE) + # Adding labels for each macroarea
    geom_smooth(method = "lm", se = TRUE, color = "black", fill = "lightgray") +  # Trend line with uncertainty interval
    labs(title = "Scatter plot of medians by Macroarea",
         x = "Median Log Population",
         y = "Median Number of Segments",
         color = "Macroarea") +                                 # Label for the legend
    theme_minimal()

ggsave(sound_pop_median_scatter, file = "_img/sound_pop_median_scatter.svg")

Saving 6.67 x 6.67 in image

`geom_smooth()` using formula 'y ~ x'

The correlation seems to be mostly between macroareas.

sound_inventory_population <- sound_inventory_population %>%
    mutate(nSegments = as.integer(nSegments))

# fit_c_1 <- brm(nSegments ~ logpop, family=poisson(), data = sound_inventory_population, save_pars = save_pars(all = TRUE))
# saveRDS(fit_c_1, "fit_c_1.rds")

Hierarchical Bayesian modeling

baseline model

Since the response variable holds count data, a Poisson regression is a natural choice.

fit_c_1 <- readRDS("fit_c_1.rds")

summary(fit_c_1)

 Family: poisson 
  Links: mu = log 
Formula: nSegments ~ logpop 
   Data: sound_inventory_population (Number of observations: 1645) 
Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup samples = 4000

Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept     3.28      0.01     3.26     3.31 1.00     2588     2482
logpop        0.07      0.00     0.06     0.07 1.00     3025     2854

Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

posterior_samples <- as.mcmc(fit_c_1)
mcmc_areas(
  posterior_samples, 
  pars = "b_logpop", 
  prob = 0.95  # 95% HPD interval
) +
  ggtitle("MCMC Density Plot") +
  xlab("log(population) Coefficient") -> sound_inventories_slope_hpd_1
ggsave(sound_inventories_slope_hpd_1, file = "_img/sound_inventories_slope_hpd_1.svg")

Saving 6.67 x 6.67 in image

# fit_c_2 <- brm(
#     nSegments ~ logpop + (1|Macroarea),
#     family=poisson(),
#     data = sound_inventory_population,
#     save_pars = save_pars(all = TRUE),
#     control = list(adapt_delta = 0.99, max_treedepth = 15)
# )
# saveRDS(fit_c_2, file = "fit_c_2.rds")

fit_c_2 <- readRDS("fit_c_2.rds")

summary(fit_c_2)

 Family: poisson 
  Links: mu = log 
Formula: nSegments ~ logpop + (1 | Macroarea) 
   Data: sound_inventory_population (Number of observations: 1645) 
Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup samples = 4000

Group-Level Effects: 
~Macroarea (Number of levels: 6) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     0.37      0.18     0.18     0.88 1.01      453      647

Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept     3.46      0.18     3.10     3.83 1.01      454      313
logpop       -0.01      0.00    -0.01     0.00 1.00     2059     1490

Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

posterior_samples <- as.mcmc(fit_c_2)
mcmc_areas(
  posterior_samples, 
  pars = "b_logpop", 
  prob = 0.95  # 95% HPD interval
) +
  ggtitle("MCMC Density Plot") +
  xlab("log(population) Coefficient") -> sound_inventories_slope_hpd_2
ggsave(sound_inventories_slope_hpd_2, file = "_img/sound_inventories_slope_hpd_2.svg")

Saving 6.67 x 6.67 in image

# fit_c_3 <- brm(
#     nSegments ~ logpop + (1|Family),
#     family = poisson(),
#     data = sound_inventory_population,
#     save_pars = save_pars(all = TRUE),
#     control = list(adapt_delta = 0.99, max_treedepth = 15)
# )
# saveRDS(fit_c_3, "fit_c_3.rds")

fit_c_3 <- readRDS("fit_c_3.rds")

summary(fit_c_3)

 Family: poisson 
  Links: mu = log 
Formula: nSegments ~ logpop + (1 | Family) 
   Data: sound_inventory_population (Number of observations: 1645) 
Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup samples = 4000

Group-Level Effects: 
~Family (Number of levels: 159) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     0.36      0.02     0.31     0.41 1.01      433      769

Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept     3.43      0.03     3.36     3.49 1.03      246      608
logpop       -0.01      0.00    -0.02    -0.00 1.00     4245     3218

Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

posterior_samples <- as.mcmc(fit_c_3)
mcmc_areas(
  posterior_samples, 
  pars = "b_logpop", 
  prob = 0.95  # 95% HPD interval
) +
  ggtitle("MCMC Density Plot") +
  xlab("log(population) Coefficient") -> sound_inventories_slope_hpd_3
ggsave(sound_inventories_slope_hpd_3, file = "_img/sound_inventories_slope_hpd_3.svg")

Saving 6.67 x 6.67 in image

# fit_c_4 <- brm(
#     nSegments ~ logpop + (1|Macroarea) + (1|Family),
#     family = poisson(),
#     data = sound_inventory_population,
#     save_pars = save_pars(all = TRUE),
#     control = list(adapt_delta = 0.99, max_treedepth = 15)
# )
# saveRDS(fit_c_4, file="fit_c_4.rds")

fit_c_4 <- readRDS("fit_c_4.rds")

summary(fit_c_4)

Warning message:
“There were 1 divergent transitions after warmup. Increasing adapt_delta above 0.99 may help. See http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup”

 Family: poisson 
  Links: mu = log 
Formula: nSegments ~ logpop + (1 | Macroarea) + (1 | Family) 
   Data: sound_inventory_population (Number of observations: 1645) 
Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup samples = 4000

Group-Level Effects: 
~Family (Number of levels: 159) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     0.27      0.02     0.24     0.32 1.00      957     1384

~Macroarea (Number of levels: 6) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     0.31      0.16     0.15     0.71 1.00     1164     1785

Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept     3.42      0.15     3.14     3.68 1.00     1258     1576
logpop       -0.01      0.00    -0.02    -0.00 1.00     7312     3511

Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

posterior_samples <- as.mcmc(fit_c_4)
mcmc_areas(
  posterior_samples, 
  pars = "b_logpop", 
  prob = 0.95  # 95% HPD interval
) +
  ggtitle("MCMC Density Plot") +
  xlab("log(population) Coefficient") -> sound_inventories_slope_hpd_4
ggsave(sound_inventories_slope_hpd_4, file = "_img/sound_inventories_slope_hpd_4.svg")

Saving 6.67 x 6.67 in image

# Compute LOO for each model
loo_1 <- loo(fit_c_1)
loo_2 <- loo(fit_c_2)
loo_3 <- loo(fit_c_3)
loo_4 <- loo(fit_c_4)

Warning message:
“Found 57 observations with a pareto_k > 0.7 in model 'fit_c_3'. It is recommended to set 'moment_match = TRUE' in order to perform moment matching for problematic observations. ”
Warning message:
“Found 41 observations with a pareto_k > 0.7 in model 'fit_c_4'. It is recommended to set 'moment_match = TRUE' in order to perform moment matching for problematic observations. ”

# Compare the models
loo_comparison <- loo_compare(loo_1, loo_2, loo_3, loo_4)

# Print the comparison
print(loo_comparison)

        elpd_diff se_diff
fit_c_4     0.0       0.0
fit_c_3   -15.4       8.6
fit_c_2  -519.9     134.6
fit_c_1 -1577.5     182.3

output <- capture.output(ll_1 <- bridge_sampler(fit_c_1))
output <- capture.output(ll_2 <- bridge_sampler(fit_c_2))
output <- capture.output(ll_3 <- bridge_sampler(fit_c_3))
output <- capture.output(ll_4 <- bridge_sampler(fit_c_4))

tibble(
    model_1 = ll_1$logml,
    model_2 = ll_2$logml,
    model_3 = ll_3$logml,
    model_4 = ll_4$logml
) %>% pivot_longer(
    cols = c("model_1", "model_2", "model_3", "model_4"),
    names_to = "model", values_to = "marginal_loglikelihood"
) %>% arrange(desc(marginal_loglikelihood))

A tibble: 4 × 2
model	marginal_loglikelihood
<chr>	<dbl>
model_4	-6355.114
model_3	-6387.788
model_2	-6800.491
model_1	-7846.969

Model 4 provides the best fit for the data. It predicts a negative coefficient for log(population). So contrary to the initial impression, large languages tend to have slightly smaller phoneme inventories if we control for family and macroarea.