ggplot(sound_inventory_population, aes(x = logpop, y = nSegments)) +
  # Scatter plot with colors by Macroarea
  geom_point(aes(color = Macroarea), alpha = 0.6) +
  # Trend line for the entire dataframe
  geom_smooth(method = "lm", se = TRUE, color = "black") +
  scale_y_log10() +
  # Trend lines for each Macroarea
  geom_smooth(aes(group = Macroarea, color = Macroarea), method = "lm", se = FALSE) +
  labs(
    title = "Scatter plot of n_segments vs logpop",
    x = "Population (log)",
    y = "Number of Segments",
    color = "Macroarea" # Label for the legend
  ) +
  theme_minimal() -> sound_pop_scatter
ggsave(sound_pop_scatter, file = "_img/sound_pop_scatter.svg")
sound_pop_median_scatter <- ggplot(medians, aes(x = median_logpop, y = median_n_segments)) +
  scale_y_log10() +
  # Scatter plot with colors by Macroarea
  geom_point(aes(color = Macroarea), size = 4) +
  # Adding labels for each macroarea
  geom_text(aes(label = Macroarea), vjust = "inward", hjust = "inward", check_overlap = TRUE) +
  # Trend line with uncertainty interval
  geom_smooth(method = "lm", se = TRUE, color = "black", fill = "lightgray") +
  labs(
    title = "Scatter plot of medians by Macroarea",
    x = "Median Log Population",
    y = "Median Number of Segments",
    color = "Macroarea" # Label for the legend
  ) +
  theme_minimal()
ggsave(sound_pop_median_scatter, file = "_img/sound_pop_median_scatter.svg")
The positive correlation seems to hold mostly between macroareas rather than within them.
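To make the between/within distinction concrete, here is a minimal sketch on simulated data. It is not the sound inventory data: the six groups, their means, and the within-group slope of -0.5 are all made up for illustration. It shows how a pooled correlation can be strongly positive because group means line up, even when the trend inside each group points the other way.

```r
# Simulated illustration only: these are NOT the sound inventory data.
set.seed(1)
n_groups <- 6
n_per_group <- 200
group <- rep(seq_len(n_groups), each = n_per_group)

# Group means rise together, so the groups line up along a positive trend.
group_mean_x <- seq(2, 12, length.out = n_groups)
group_mean_y <- seq(20, 40, length.out = n_groups)

# Within each group, y falls slightly as x increases.
x_dev <- rnorm(n_groups * n_per_group, sd = 1)
x <- group_mean_x[group] + x_dev
y <- group_mean_y[group] - 0.5 * x_dev + rnorm(n_groups * n_per_group, sd = 1)

# Pooled correlation, ignoring the grouping
pooled_cor <- cor(x, y)

# Within-group correlation: subtract each group's mean before correlating
x_within <- x - ave(x, group)
y_within <- y - ave(y, group)
within_cor <- cor(x_within, y_within)

# Between-group correlation: correlate the group means
between_cor <- cor(
  as.numeric(tapply(x, group, mean)),
  as.numeric(tapply(y, group, mean))
)

print(c(pooled = pooled_cor, within = within_cor, between = between_cor))
```

In this toy setup the pooled and between-group correlations come out positive while the within-group correlation is negative, which is the pattern a model with group-level intercepts is meant to disentangle.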
 Family: poisson
  Links: mu = log
Formula: nSegments ~ logpop + (1 | Macroarea) + (1 | Family)
   Data: sound_inventory_population (Number of observations: 1645)
Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup samples = 4000

Group-Level Effects:
~Family (Number of levels: 159)
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     0.27      0.02     0.24     0.32 1.00      957     1384

~Macroarea (Number of levels: 6)
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     0.31      0.16     0.15     0.71 1.00     1164     1785

Population-Level Effects:
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept     3.42      0.15     3.14     3.68 1.00     1258     1576
logpop       -0.01      0.00    -0.02    -0.00 1.00     7312     3511

Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
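Because the model uses a Poisson likelihood with a log link, the estimates above are on the log scale, and exponentiating them gives multiplicative effects on the expected number of segments. The sketch below does that arithmetic with the posterior means copied from the summary. Those values are rounded to two decimals there, so the results are only approximate, and the base of the logarithm behind `logpop` is not stated in this section.

```r
# Posterior means copied from the summary (rounded to two decimals there)
intercept <- 3.42
slope_logpop <- -0.01

# Expected number of segments when logpop = 0 and both group-level
# effects (Macroarea, Family) are zero
baseline_segments <- exp(intercept)

# Multiplicative change in expected segments per one-unit increase in logpop
rate_ratio <- exp(slope_logpop)

# The same effect expressed as a percentage change
percent_change <- 100 * (rate_ratio - 1)

print(round(c(
  baseline = baseline_segments,
  rate_ratio = rate_ratio,
  percent_change = percent_change
), 2))
```

On these rounded numbers, a one-unit increase in `logpop` corresponds to roughly a 1% decrease in the expected inventory size, from a baseline of about 30.6 segments.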
posterior_samples <- as.mcmc(fit_c_4)
mcmc_areas(
  posterior_samples,
  pars = "b_logpop",
  prob = 0.95 # shade the central 95% interval (bayesplot uses quantile-based intervals, not HPD)
) +
  ggtitle("MCMC Density Plot") +
  xlab("log(population) Coefficient") -> sound_inventories_slope_hpd_4
ggsave(sound_inventories_slope_hpd_4, file = "_img/sound_inventories_slope_hpd_4.svg")
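A note on interval types: bayesplot documents the shaded region of `mcmc_areas()` as a central (quantile-based) interval, which coincides with a highest posterior density (HPD) interval only when the posterior is roughly symmetric. The sketch below contrasts the two on simulated, deliberately skewed draws. The draws are not from `fit_c_4`, and `hpd_interval()` is a hypothetical helper written for this illustration.

```r
# Simulated, right-skewed draws standing in for a posterior; NOT draws from fit_c_4
set.seed(42)
draws <- rgamma(4000, shape = 2, rate = 1)

# Central (equal-tailed) 95% interval: cut 2.5% from each tail
central_interval <- quantile(draws, probs = c(0.025, 0.975), names = FALSE)

# Hypothetical helper: shortest interval containing `prob` of the sorted draws,
# an empirical HPD interval that assumes a unimodal posterior
hpd_interval <- function(x, prob = 0.95) {
  sorted <- sort(x)
  n <- length(sorted)
  window <- ceiling(prob * n)        # draws each candidate interval must contain
  starts <- seq_len(n - window + 1)  # every possible starting index
  widths <- sorted[starts + window - 1] - sorted[starts]
  best <- which.min(widths)
  c(sorted[best], sorted[best + window - 1])
}

hpd <- hpd_interval(draws)
print(rbind(central = central_interval, hpd = hpd))
```

For a near-symmetric posterior such as a well-identified slope, the two intervals should be close. The coda package also provides `HPDinterval()` for `mcmc` objects if an HPD interval is what is wanted.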
# Compute LOO for each model
loo_1 <- loo(fit_c_1)
loo_2 <- loo(fit_c_2)
loo_3 <- loo(fit_c_3)
loo_4 <- loo(fit_c_4)

# Compare the models
loo_comparison <- loo_compare(loo_1, loo_2, loo_3, loo_4)

# Print the comparison
print(loo_comparison)
Model 4 provides the best fit for the data. Its estimated coefficient for log(population) is small but negative (-0.01, with a 95% interval from -0.02 to just below 0). So, contrary to the initial impression, languages with larger populations tend to have slightly smaller phoneme inventories once we control for family and macroarea.