35th European Summer School in Logic, Language and Information
Leuven, August 2024
Gerhard Jäger, Tübingen University
Lecture 2
Correlation between linguistic and non-linguistic traits
Another major focus of research in statistical typology is the correlation between linguistic and non-linguistic traits. The idea is that languages are shaped by the social and ecological environment in which they are spoken. This is a very old idea, going back to the 19th century, but it has been given a new lease of life by the availability of large-scale databases that allow us to test these ideas in a more systematic way.
A few examples from https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0070902
So why do we then get the correlation? It is unlikely that it is due to mere chance. But as has been pointed out by several of the bloggers cited above, cultural phenomena come in bundles, and typically, there is no very profound explanation behind this, rather the bundling is due to several features happening to take part in the same historical process of diffusion. To add to the examples already given elsewhere, consider the fact that all the nine full member countries of the International Cricket Council also have left-hand driving. (I am not counting the tenth member, the West Indies, which is not a country.) There is no universal principle connecting cricket and left-hand driving, rather both these phenomena happened to be part of British culture at a specific point in time and spread with the growth of the British empire.
Östen Dahl 2013, https://dlc.hypotheses.org/360
The problem of statistical independence
Both tests used in the previous lecture are based on the assumption that each language is an independent sample from the same distribution. This is patently false, due both to the genealogical relationships between languages and to their mutual dependencies arising from contact.
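A small simulation illustrates why this matters (a sketch with made-up data, not real typological counts): two binary traits evolve independently, but daughter languages inherit trait values from their proto-language, so a chi-squared test that treats languages as independent observations rejects the true null far more often than its nominal level.

```r
# Sketch: two binary traits evolve independently, but languages within
# a family inherit them from a common proto-language. A chi-squared
# test that treats the 200 languages as independent samples then
# rejects the null far more often than its nominal 5% level.
set.seed(123)
n_fam <- 20    # number of families
per_fam <- 10  # languages per family

pvals <- replicate(1000, {
  # independent ancestral values for traits x and y in each family
  anc_x <- rbinom(n_fam, 1, 0.5)
  anc_y <- rbinom(n_fam, 1, 0.5)
  # daughter languages copy the ancestral value, flipping it with prob. 0.1
  inherit <- function(anc) {
    vals <- rep(anc, each = per_fam)
    flips <- rbinom(n_fam * per_fam, 1, 0.1)
    ifelse(flips == 1, 1 - vals, vals)
  }
  x <- inherit(anc_x)
  y <- inherit(anc_y)
  suppressWarnings(chisq.test(table(x, y))$p.value)
})

mean(pvals < 0.05)  # well above the nominal 0.05
```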
The issue of creating a suitable sample of languages has been discussed intensely in the literature.
Two basic strategies:
Diversity Sampling: Select languages so that the range of existing variation is covered as faithfully as possible.
Probability Sampling: Select languages in such a way that stochastic independence is approximated as well as possible.
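A minimal sketch of the second strategy: draw one language per family at random. The toy data frame and its column names are hypothetical; real probability samples additionally balance macro-areas and control for contact.

```r
# Sketch: approximate a probability sample by drawing one language
# per family at random (toy data; column names are hypothetical).
set.seed(42)
d_toy <- data.frame(
  Language = c("A1", "A2", "A3", "B1", "B2", "C1"),
  Family   = c("FamA", "FamA", "FamA", "FamB", "FamB", "FamC")
)
prob_sample <- do.call(rbind, lapply(split(d_toy, d_toy$Family), function(fam) {
  fam[sample(nrow(fam), 1), ]  # one random language from this family
}))
prob_sample$Language
```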
WALS contains a 100-language probability sample. Let us use it to test Universal 17 (Greenberg 1963): "With overwhelmingly more than chance frequency, languages with dominant order VSO have the adjective after the noun."
[Map: Universal 17 in the WALS 100-language sample (_img/universal17_100_map.svg)]
universal17_100 %>%
  st_set_geometry(NULL) %>%
  group_by(type) %>%
  summarize(count = n()) %>%
  ggplot(aes(x = "", y = count, fill = type)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y") +
  labs(title = "Count in WALS-100 sample", x = NULL, y = NULL) +
  scale_fill_manual(values = cbbPalette) +
  theme_void() +
  theme(
    legend.position = "bottom",
    text = element_text(size = 18),                     # increase font size for all text elements
    plot.title = element_text(hjust = 0.5, size = 18),  # center and enlarge the title
    strip.text = element_text(size = 18),               # facet label size
    legend.text = element_text(size = 18),
    legend.title = element_text(size = 18)
  ) -> universal17_100_pie
ggsave(universal17_100_pie, filename = "_img/universal17_100_pie.svg")
No need to apply statistical tests here:
The sample clearly violates Universal 17.
Sample size is so small that not much can be inferred from this.
One big downside of sampling is that you have to throw away a lot of data. This is why many researchers prefer to use all available data and then correct for the non-independence of the data in the statistical analysis.
Hierarchical modeling
The effect of genealogical and geographic non-independence can be ameliorated by adding random effects for
- language families
- linguistic macro-areas
First I fit a baseline model with all data from WALS.
I am interested in the effect of VSO on adjective-noun order. Therefore I use VSO as independent and N_A as dependent variable.
As the dependent variable is binary, a logistic regression is called for.
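Concretely, the model estimates the log-odds of noun-adjective order as a linear function of the VSO indicator:

\[
\log \frac{P(\text{N_A} = 1)}{1 - P(\text{N_A} = 1)} = \beta_0 + \beta_1 \cdot \text{VSO}
\]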
# fit_1 <- brm(
#   N_A ~ VSO,
#   data = d,
#   family = bernoulli(),
#   save_pars = save_pars(all = TRUE)
# )
# saveRDS(fit_1, file = "fit_1.rds")
fit_1 <- readRDS("fit_1.rds")
summary(fit_1)
Family: bernoulli
Links: mu = logit
Formula: N_A ~ VSO
Data: d (Number of observations: 1182)
Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup samples = 4000
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept 0.63 0.07 0.50 0.76 1.00 3666 2681
VSO 0.15 0.23 -0.30 0.62 1.00 3662 2670
Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
# extract posterior samples
posterior_samples <- as.mcmc(fit_1)
# MCMC density plot for the VSO coefficient with 95% HPD interval
mcmc_areas(
  posterior_samples,
  pars = "b_VSO",
  prob = 0.95
) +
  ggtitle("MCMC Density Plot for VSO") +
  xlab("VSO Coefficient") -> sound_inventories_slope_hpd
ggsave(sound_inventories_slope_hpd, file = "_img/universal17_slope_hpd_1.svg")
The result seems to indicate that VSO has no credible impact on N_A.
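The coefficients are on the log-odds scale. To read them as probabilities, apply the inverse logit (plogis in R) to the posterior means from the summary above:

```r
# Posterior means from summary(fit_1), converted from log-odds to
# probabilities via the inverse logit:
plogis(0.63)         # P(N_A = 1) for non-VSO languages, about 0.65
plogis(0.63 + 0.15)  # P(N_A = 1) for VSO languages, about 0.69
```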
Now let us include random effects. We start with macro-areas.
# fit_2 <- brm(
#   N_A ~ VSO + (1 | Macroarea),
#   data = d,
#   family = bernoulli(),
#   save_pars = save_pars(all = TRUE)
# )
# saveRDS(fit_2, file = "fit_2.rds")
fit_2 <- readRDS("fit_2.rds")
summary(fit_2)
Family: bernoulli
Links: mu = logit
Formula: N_A ~ VSO + (1 | Macroarea)
Data: d (Number of observations: 1182)
Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup samples = 4000
Group-Level Effects:
~Macroarea (Number of levels: 6)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept) 1.04 0.40 0.52 2.09 1.00 824 1261
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept 0.54 0.43 -0.38 1.43 1.00 740 913
VSO 0.21 0.27 -0.30 0.74 1.00 1953 2221
Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
# extract posterior samples
posterior_samples <- as.mcmc(fit_2)
# MCMC density plot for the VSO coefficient with 95% HPD interval
mcmc_areas(
  posterior_samples,
  pars = "b_VSO",
  prob = 0.95
) +
  ggtitle("MCMC Density Plot for VSO") +
  xlab("VSO Coefficient") -> sound_inventories_slope_hpd
ggsave(sound_inventories_slope_hpd, file = "_img/universal17_slope_hpd_2.svg")
Language family as random effect:
# fit_3 <- brm(
#   N_A ~ VSO + (1 | Family),
#   data = d,
#   family = bernoulli(),
#   save_pars = save_pars(all = TRUE)
# )
# saveRDS(fit_3, file = "fit_3.rds")
fit_3 <- readRDS("fit_3.rds")
summary(fit_3)
Family: bernoulli
Links: mu = logit
Formula: N_A ~ VSO + (1 | Family)
Data: d (Number of observations: 1182)
Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup samples = 4000
Group-Level Effects:
~Family (Number of levels: 179)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept) 3.18 0.53 2.32 4.35 1.00 791 1558
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept 0.70 0.32 0.10 1.36 1.01 444 833
VSO -0.05 0.38 -0.78 0.70 1.00 3361 2900
Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
# extract posterior samples
posterior_samples <- as.mcmc(fit_3)
# MCMC density plot for the VSO coefficient with 95% HPD interval
mcmc_areas(
  posterior_samples,
  pars = "b_VSO",
  prob = 0.95
) +
  ggtitle("MCMC Density Plot for VSO") +
  xlab("VSO Coefficient") -> sound_inventories_slope_hpd
ggsave(sound_inventories_slope_hpd, file = "_img/universal17_slope_hpd_3.svg")
And now both random effects simultaneously.
# fit_4 <- brm(
#   N_A ~ VSO + (1 | Macroarea) + (1 | Family),
#   data = d,
#   family = bernoulli(),
#   save_pars = save_pars(all = TRUE),
#   control = list(adapt_delta = 0.99, max_treedepth = 15)
# )
# saveRDS(fit_4, "fit_4.rds")
fit_4 <- readRDS("fit_4.rds")
summary(fit_4)
Family: bernoulli
Links: mu = logit
Formula: N_A ~ VSO + (1 | Macroarea) + (1 | Family)
Data: d (Number of observations: 1182)
Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup samples = 4000
Group-Level Effects:
~Family (Number of levels: 179)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept) 3.05 0.55 2.14 4.24 1.01 884 1734
~Macroarea (Number of levels: 6)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept) 0.56 0.43 0.02 1.67 1.00 798 1652
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept 0.68 0.43 -0.18 1.58 1.00 1293 2182
VSO -0.06 0.38 -0.79 0.71 1.00 5485 3101
Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
# extract posterior samples
posterior_samples <- as.mcmc(fit_4)
# MCMC density plot for the VSO coefficient with 95% HPD interval
mcmc_areas(
  posterior_samples,
  pars = "b_VSO",
  prob = 0.95
) +
  ggtitle("MCMC Density Plot for VSO") +
  xlab("VSO Coefficient") -> sound_inventories_slope_hpd
ggsave(sound_inventories_slope_hpd, file = "_img/universal17_slope_hpd_4.svg")
Model comparison
I will use two techniques for model comparison:
Pareto-smoothed leave-one-out cross-validation
Bayes Factor
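The Bayes factor side of the comparison is computed via bridge sampling, which is why the models above were fitted with save_pars(all = TRUE). A sketch (it requires the fitted models from above and is computationally demanding):

```r
# Sketch: Bayes factor comparing the family-random-effect model (fit_3)
# against the baseline (fit_1) via bridge sampling.
bf_31 <- brms::bayes_factor(fit_3, fit_1)
print(bf_31)
```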
# compute LOO for each model
loo_1 <- loo(fit_1)
loo_2 <- loo(fit_2)
loo_3 <- loo(fit_3)
loo_4 <- loo(fit_4)
Warning message:
“Found 52 observations with a pareto_k > 0.7 in model 'fit_3'. It is recommended to set 'moment_match = TRUE' in order to perform moment matching for problematic observations. ”
Warning message:
“Found 44 observations with a pareto_k > 0.7 in model 'fit_4'. It is recommended to set 'moment_match = TRUE' in order to perform moment matching for problematic observations. ”
# compare the models
loo_comparison <- loo_compare(loo_1, loo_2, loo_3, loo_4)
# print the comparison
print(loo_comparison)
The Bayes factor results are qualitatively identical to the PSIS-LOO model comparison.
Let us check whether a random slope for families makes a difference.
# fit_5 <- brm(
#   N_A ~ VSO + (VSO | Family),
#   data = d,
#   family = bernoulli(),
#   save_pars = save_pars(all = TRUE)
# )
# saveRDS(fit_5, file = "fit_5.rds")
fit_5 <- readRDS("fit_5.rds")
summary(fit_5)
Family: bernoulli
Links: mu = logit
Formula: N_A ~ VSO + (VSO | Family)
Data: d (Number of observations: 1182)
Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup samples = 4000
Group-Level Effects:
~Family (Number of levels: 179)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept) 3.31 0.58 2.35 4.60 1.00 874 1554
sd(VSO) 2.89 1.40 1.11 6.47 1.00 1418 1641
cor(Intercept,VSO) -0.10 0.38 -0.80 0.62 1.00 2598 2595
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept 0.69 0.33 0.08 1.38 1.01 903 1504
VSO 0.65 1.02 -1.36 2.92 1.00 1660 1258
Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
# extract posterior samples
posterior_samples <- as.mcmc(fit_5)
# MCMC density plot for the VSO coefficient with 95% HPD interval
mcmc_areas(
  posterior_samples,
  pars = "b_VSO",
  prob = 0.95
) +
  ggtitle("MCMC Density Plot for VSO") +
  xlab("VSO Coefficient") -> sound_inventories_slope_hpd
ggsave(sound_inventories_slope_hpd, file = "_img/universal17_slope_hpd_5.svg")
Warning message:
“Found 58 observations with a pareto_k > 0.7 in model 'fit_5'. It is recommended to set 'moment_match = TRUE' in order to perform moment matching for problematic observations. ”
Including a random slope for families clearly improves the model.
Now suppose we discover a new language family. What is the posterior distribution of the conditional probability \(P(\text{N_A}|\text{VSO})\) within this family?
new_data <- data.frame(
  VSO = 1,     # set VSO to 1
  Family = NA  # placeholder for the new family
)
# make predictions:
# posterior_epred gives the expected value (mean) of the posterior distribution
posterior_predictions <- posterior_epred(fit_5, newdata = new_data, re_formula = ~ (VSO | Family))
posterior_tibble <- tibble(posterior = posterior_predictions[, 1])
summary(posterior_predictions)
V1
Min. :0.0000
1st Qu.:0.2074
Median :0.7950
Mean :0.6214
3rd Qu.:0.9828
Max. :1.0000
mcmc_areas(
  posterior_tibble,
  pars = "posterior",
  prob = 0.95
) +
  ggtitle("95% Highest Posterior Density Interval for P(N_A = 1 | VSO = 1)") -> universal17_hpd
ggsave(universal17_hpd, filename = "_img/universal17_hpd.svg")
So in an unknown family, we expect VSO languages to have either a strong noun-adjective or a strong adjective-noun bias.
Now let us check what we can expect if we sample an unknown VSO language from an unknown family.
So, this modeling approach lends support to Universal 17, if we interpret "with overwhelmingly more than chance frequency" as \(>0.5\).
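Under that reading, the degree of support can be quantified as the share of posterior draws above 0.5. A sketch; `draws` stands in for posterior_predictions[, 1] computed above and is simulated here purely for illustration:

```r
# Sketch: posterior probability that P(N_A = 1 | VSO = 1) exceeds 0.5.
# `draws` is a stand-in for posterior_predictions[, 1] from above,
# simulated here purely for illustration.
set.seed(7)
draws <- rbeta(4000, 1.6, 1.0)
mean(draws > 0.5)  # share of posterior mass above chance level
```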
Let us check what the model predicts for non-VSO languages.
new_data <- data.frame(
  VSO = 0,     # set VSO to 0 (non-VSO)
  Family = NA  # placeholder for the new family
)
samples <- posterior_predict(fit_5, newdata = new_data, re_formula = ~ (VSO | Family))
mean(samples)
0.62425
This is virtually identical to the estimate for VSO languages. So if we interpret "with overwhelmingly more than chance frequency" as "more than what you would expect if you knew nothing about the language", then the model does not support Universal 17.
Of course, this is also the interpretation we get directly from the regression analysis, since the credible interval of the slope of the main effect includes 0.
summary(fit_5)
Family: bernoulli
Links: mu = logit
Formula: N_A ~ VSO + (VSO | Family)
Data: d (Number of observations: 1182)
Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup samples = 4000
Group-Level Effects:
~Family (Number of levels: 179)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept) 3.31 0.58 2.35 4.60 1.00 874 1554
sd(VSO) 2.89 1.40 1.11 6.47 1.00 1418 1641
cor(Intercept,VSO) -0.10 0.38 -0.80 0.62 1.00 2598 2595
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept 0.69 0.33 0.08 1.38 1.01 903 1504
VSO 0.65 1.02 -1.36 2.92 1.00 1660 1258
Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
[Map: sound inventory sizes (_img/sound_inventory_map.svg)]
[Map: log population sizes (_img/logpop_map.svg)]
ggplot(sound_inventory_population, aes(x = logpop, y = nSegments)) +
  geom_point(aes(color = Macroarea), alpha = 0.6) +         # scatter plot, colored by macro-area
  geom_smooth(method = "lm", se = TRUE, color = "black") +  # trend line for the entire data frame
  scale_y_log10() +
  geom_smooth(aes(group = Macroarea, color = Macroarea),
              method = "lm", se = FALSE) +                  # trend lines for each macro-area
  labs(
    title = "Scatter plot of n_segments vs logpop",
    x = "Population (log)",
    y = "Number of Segments",
    color = "Macroarea"
  ) +
  theme_minimal() -> sound_pop_scatter
ggsave(sound_pop_scatter, file = "_img/sound_pop_scatter.svg")
sound_pop_median_scatter <- ggplot(medians, aes(x = median_logpop, y = median_n_segments)) +
  scale_y_log10() +
  geom_point(aes(color = Macroarea), size = 4) +   # scatter plot, colored by macro-area
  geom_text(aes(label = Macroarea), vjust = "inward", hjust = "inward",
            check_overlap = TRUE) +                # label each macro-area
  geom_smooth(method = "lm", se = TRUE, color = "black",
              fill = "lightgray") +                # trend line with uncertainty interval
  labs(
    title = "Scatter plot of medians by Macroarea",
    x = "Median Log Population",
    y = "Median Number of Segments",
    color = "Macroarea"
  ) +
  theme_minimal()
ggsave(sound_pop_median_scatter, file = "_img/sound_pop_median_scatter.svg")
The correlation seems to be mostly between macroareas.
posterior_samples <- as.mcmc(fit_c_4)
mcmc_areas(
  posterior_samples,
  pars = "b_logpop",
  prob = 0.95
) +
  ggtitle("MCMC Density Plot for logpop") +
  xlab("logpop Coefficient") -> sound_inventories_slope_hpd_4
ggsave(sound_inventories_slope_hpd_4, file = "_img/sound_inventories_slope_hpd_4.svg")