Computational Typology
  • Home
  • About

Case study I: affix and adposition types

load libraries

library(repr)
options(repr.plot.width=15, repr.plot.height=9)
library(tidyverse)
library(svglite)
library(ggthemes)
library(brms)
library(viridis)
library(loo)
library(bayesplot)
library(RColorBrewer)
library(ape)
library(phytools)
library(Matrix)
library(posterior)
library(rstan)
options(brms.backend = "rstan")
library(bridgesampling)
library(mgcv)
library(tidybayes)
library(patchwork)
library(ggtree)
library(paletteer)
library(ggthemes)
library(cowplot)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Loading required package: Rcpp

Loading 'brms' package (version 2.22.0). Useful instructions
can be found by typing help('brms'). A more detailed introduction
to the package is available through vignette('brms_overview').


Attaching package: ‘brms’


The following object is masked from ‘package:stats’:

    ar


Loading required package: viridisLite

This is loo version 2.8.0

- Online documentation and vignettes at mc-stan.org/loo

- As of v2.0.0 loo defaults to 1 core but we recommend using as many as possible. Use the 'cores' argument or set options(mc.cores = NUM_CORES) for an entire session. 

This is bayesplot version 1.11.1

- Online documentation and vignettes at mc-stan.org/bayesplot

- bayesplot theme set to bayesplot::theme_default()

   * Does _not_ affect other ggplot2 plots

   * See ?bayesplot_theme_set for details on theme setting


Attaching package: ‘bayesplot’


The following object is masked from ‘package:brms’:

    rhat



Attaching package: ‘ape’


The following object is masked from ‘package:dplyr’:

    where


Loading required package: maps


Attaching package: ‘maps’


The following object is masked from ‘package:viridis’:

    unemp


The following object is masked from ‘package:purrr’:

    map



Attaching package: ‘Matrix’


The following objects are masked from ‘package:tidyr’:

    expand, pack, unpack


This is posterior version 1.6.1


Attaching package: ‘posterior’


The following object is masked from ‘package:bayesplot’:

    rhat


The following objects are masked from ‘package:stats’:

    mad, sd, var


The following objects are masked from ‘package:base’:

    %in%, match


Loading required package: StanHeaders


rstan version 2.32.6 (Stan version 2.32.2)


For execution on a local, multicore CPU with excess RAM we recommend calling
options(mc.cores = parallel::detectCores()).
To avoid recompilation of unchanged Stan programs, we recommend calling
rstan_options(auto_write = TRUE)
For within-chain threading using `reduce_sum()` or `map_rect()` Stan functions,
change `threads_per_chain` option:
rstan_options(threads_per_chain = 1)



Attaching package: ‘rstan’


The following objects are masked from ‘package:posterior’:

    ess_bulk, ess_tail


The following object is masked from ‘package:tidyr’:

    extract



Attaching package: ‘bridgesampling’


The following object is masked from ‘package:brms’:

    bf


Loading required package: nlme


Attaching package: ‘nlme’


The following object is masked from ‘package:dplyr’:

    collapse


This is mgcv 1.9-3. For overview type 'help("mgcv-package")'.


Attaching package: ‘mgcv’


The following objects are masked from ‘package:brms’:

    s, t2



Attaching package: ‘tidybayes’


The following objects are masked from ‘package:brms’:

    dstudent_t, pstudent_t, qstudent_t, rstudent_t


ggtree v3.14.0 Learn more at https://yulab-smu.top/contribution-tree-data/

Please cite:

Guangchuang Yu.  Data Integration, Manipulation and Visualization of
Phylogenetic Trees (1st edition). Chapman and Hall/CRC. 2022,
doi:10.1201/9781003279242, ISBN: 9781032233574


Attaching package: ‘ggtree’


The following object is masked from ‘package:nlme’:

    collapse


The following object is masked from ‘package:Matrix’:

    expand


The following object is masked from ‘package:ape’:

    rotate


The following object is masked from ‘package:tidyr’:

    expand



Attaching package: ‘cowplot’


The following object is masked from ‘package:patchwork’:

    align_plots


The following object is masked from ‘package:ggthemes’:

    theme_map


The following object is masked from ‘package:lubridate’:

    stamp

Load WALS data

Filter and process data

d = read_csv("../data/affix_adposition.csv")
d %>% head()
Rows: 488 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): Glottocode, Name, Macroarea, Family
dbl (4): Longitude, Latitude, Affix, Adposition

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
A tibble: 6 × 8
Glottocode Name Macroarea Longitude Latitude Family Affix Adposition
<chr> <chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl>
aari1239 Aari Africa 36.583333 6.000000 Afro-Asiatic 2 1
abkh1244 Abkhaz Eurasia 41.000000 43.083333 Northwest Caucasian 4 1
acha1250 Achagua South America -72.250000 4.416667 Arawakan 4 1
acol1236 Acholi Africa 32.666667 3.000000 Eastern Sudanic 4 2
west2632 Acoma North America -107.583333 34.916667 Keresan 4 1
adio1239 Adioukrou Africa -4.583333 5.416667 Niger-Congo 5 1

load tree and scale its height

tree <- read.tree("../data/affix_adposition.tre")

tree$edge.length <- tree$edge.length / median(tree$edge.length)
edge_matrix <- tree$edge
internal_nodes <- intersect(edge_matrix[, 1], edge_matrix[, 2])
# Map: for each node, at what row does it first appear as a child?
child_indices <- match(internal_nodes, edge_matrix[,2])

# Map: for each node, at what row does it first appear as a parent?
parent_indices <- match(internal_nodes, edge_matrix[,1])
all(child_indices < parent_indices)
TRUE

arrange data in order of tree tip.labels

d <- d[match(tree$tip.label, d$Glottocode), ]

create numerical features for affix and adposition types, and family affiliation

d$Affix_num <- as.integer(d$Affix - 1)  # 1, 2, 3
d$Glottocode <- factor(d$Glottocode, levels = tree$tip.label)
d$family_index <- as.integer(factor(d$Family))
# Prepare ordinal and binary variables
d$affix_ord <- as.integer(factor(d$Affix))  # 1-based ordinal
d$adp_bin <- as.integer(d$Adposition) - 1   # 0/1

store relevant data in list for use by Stan

Nnodes <- tree$edge %>% as.vector %>% unique %>% length
stan_data <- list(
    Ntip = length(tree$tip.label),
    Nnodes = Nnodes,
    Nedges = nrow(tree$edge),
    edge_matrix = tree$edge,
    edge_lengths = tree$edge.length,
    tip_indices = match(d$Glottocode, tree$tip.label),
    K = max(d$affix_ord),
    x = d$affix_ord,
    y = d$adp_bin,
    root_index = setdiff(tree$edge[,1], tree$edge[, 2])[1],
    Nfamilies = length(unique(d$Family)),
    family_index = d$family_index
)

fit and assess the vanilla bivariate model

Stan code bivariate_logistic.stan:

data {
  int<lower=1> Ntip;
  int<lower=2> K;
  array[Ntip] int<lower=1, upper=K> x;
  array[Ntip] int<lower=0, upper=1> y;
}

parameters {
  // Non-centered latent traits
  matrix[2, Ntip] z_raw;

  // Correlation and scale of latent traits
  cholesky_factor_corr[2] L;
  vector<lower=0>[2] sigma;

  // Ordinal cutpoints
  ordered[K - 1] cutpoints;  // Directly model ordered cutpoints
}

transformed parameters {
  matrix[2, Ntip] z_scaled;
  matrix[2, Ntip] z = diag_pre_multiply(sigma, L) * z_raw;  // Non-centered z

  // Transform latent z to real scale
  z_scaled = z;
}

model {
  // Priors
  to_vector(z_raw) ~ normal(0, 1);  // Standard normal prior for non-centered z
  sigma ~ lognormal(0, 1);  // Lognormal ensures positivity and avoids boundary issues
  L ~ lkj_corr_cholesky(1);  // Weaker prior on correlation structure
  cutpoints ~ normal(0, 2);  // Prior for ordered cutpoints
  cutpoint1 ~ normal(0, 2);  // Broader prior for cutpoint1

  // Likelihood
  for (i in 1:Ntip) {
    x[i] ~ ordered_logistic(z_scaled[1, i], cutpoints);
    y[i] ~ bernoulli_logit(z_scaled[2, i]);
  }
}

generated quantities {
  vector[Ntip] log_lik;
  matrix[2, 2] Corr = multiply_lower_tri_self_transpose(L);
  real rho = Corr[1, 2];  // Posterior correlation between latent traits

  for (i in 1:Ntip) {
    log_lik[i] =
      ordered_logistic_lpmf(x[i] | z_scaled[1, i], cutpoints) +
      bernoulli_logit_lpmf(y[i] | z_scaled[2, i]);
  }
}
if (file.exists("../models/fit_bivariate.rds")) {
    fit_bivariate <- readRDS("../models/fit_bivariate.rds")
} else {
    model_bivariate <- stan_model("bivariate_logistic.stan")
    # Fit model
    fit_bivariate <- sampling(
    model_bivariate,
    data = stan_data,
    chains = 4,
    iter = 20000,
    cores = 4,
    seed = 123,
    control = list(adapt_delta = 0.99)
    )
    saveRDS(fit_bivariate, file = "../models/fit_bivariate.rds")
}
# Summarize parameters of interest
summary(fit_bivariate, pars = c("cutpoints", "sigma", "rho"))$summary
A matrix: 7 × 10 of type dbl
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
cutpoints[1] -0.3612366 0.001381992 0.14743378 -0.6610041 -0.4574764 -0.3568101 -0.2618225 -0.08146378 11381.074 1.000402
cutpoints[2] 0.5362430 0.001448392 0.15625856 0.2497027 0.4284973 0.5296169 0.6366265 0.86225787 11638.987 1.000004
cutpoints[3] 2.0264992 0.004521351 0.27439448 1.5516343 1.8308464 2.0029655 2.2016056 2.61805409 3683.106 1.000226
cutpoints[4] 3.4973167 0.007356807 0.41809605 2.7841887 3.1929916 3.4586498 3.7701682 4.38584360 3229.784 1.000388
sigma[1] 2.0270076 0.007450687 0.38189697 1.3406703 1.7483028 2.0003625 2.2888186 2.81242138 2627.240 1.000594
sigma[2] 2.1063941 0.005999209 0.41223323 1.3535314 1.7982527 2.0895968 2.4097841 2.89240318 4721.695 1.000447
rho 0.8437225 0.001252780 0.07543255 0.6820014 0.7941615 0.8508169 0.9014640 0.96583299 3625.500 1.001170
if (file.exists("../models/loo_bivariate.rds")) {
    loo_bivariate <- readRDS("../models/loo_bivariate.rds")
} else {
    loo_bivariate <- loo(fit_bivariate, resp = c("x", "y"), moment_matching = TRUE)
    saveRDS(loo_bivariate, file = "../models/loo_bivariate.rds")
}
if (file.exists("../models/marginal_bivariate.rds")) {
    marginal_bivariate <- readRDS("../models/marginal_bivariate.rds")
} else {
    marginal_bivariate <- bridge_sampler(
        fit_bivariate,
        method = "warp3",
        silent = FALSE,
        repetitions = 10
    )
    saveRDS(marginal_bivariate, file = "../models/marginal_bivariate.rds")
}

fit and assess the bivariate model with family-level random effects

Stan code bivariate_family_logistic.stan:

data {
  int<lower=1> Ntip;
  int<lower=2> K;                         // ordinal levels
  array[Ntip] int<lower=1, upper=K> x;    // ordinal outcome
  array[Ntip] int<lower=0, upper=1> y;    // binary outcome

  int<lower=1> Nfamilies;
  array[Ntip] int<lower=1, upper=Nfamilies> family_index;
}

parameters {
  // Latent residuals (non-centered)
  matrix[2, Ntip] z_std;

  // Family intercepts (non-centered)
  matrix[2, Nfamilies] z_family;

  // Scales and correlations
  vector<lower=0>[2] sigma_resid;
  cholesky_factor_corr[2] L_resid;

  vector<lower=0>[2] sigma_family;
  cholesky_factor_corr[2] L_family;

  // Ordinal thresholds
  real cutpoint1;
  vector<lower=0, upper=10>[K - 2] zeta;
}

transformed parameters {
  matrix[2, Nfamilies] mu_family;
  matrix[2, Ntip] z;
  ordered[K - 1] cutpoints;

  // Construct cutpoints
  cutpoints[1] = cutpoint1;
  for (k in 2:(K - 1)) {
    cutpoints[k] = cutpoints[k - 1] + zeta[k - 1] + 1e-2;
  }

  // Transform family-level effects and latent variables
  mu_family = diag_pre_multiply(sigma_family, L_family) * z_family;
  for (i in 1:Ntip) {
    vector[2] eps = diag_pre_multiply(sigma_resid, L_resid) * col(z_std, i);
    z[, i] = mu_family[, family_index[i]] + eps;
  }
}

model {
  // Priors
  to_vector(z_std) ~ normal(0, 1);
  to_vector(z_family) ~ normal(0, 1);

  sigma_resid ~ lognormal(0, 1);
  sigma_family ~ lognormal(0, 1);

  L_resid ~ lkj_corr_cholesky(1);
  L_family ~ lkj_corr_cholesky(1);

  cutpoint1 ~ normal(0, 2);
  zeta ~ lognormal(0, 2);

  // Likelihood
  for (i in 1:Ntip) {
    x[i] ~ ordered_logistic(z[1, i], cutpoints);
    y[i] ~ bernoulli_logit(z[2, i]);
  }
}

generated quantities {
  vector[Ntip] log_lik;
  array[Ntip] int x_rep;
  array[Ntip] int y_rep;

  matrix[2, 2] Corr_family = multiply_lower_tri_self_transpose(L_family);
  matrix[2, 2] Corr_resid = multiply_lower_tri_self_transpose(L_resid);

  real rho_family = Corr_family[1, 2];
  real rho_resid = Corr_resid[1, 2];

  for (i in 1:Ntip) {
    log_lik[i] =
      ordered_logistic_lpmf(x[i] | z[1, i], cutpoints) +
      bernoulli_logit_lpmf(y[i] | z[2, i]);

    // Posterior predictive draws
    x_rep[i] = ordered_logistic_rng(z[1, i], cutpoints);
    y_rep[i] = bernoulli_logit_rng(z[2, i]);
  }
}
if (file.exists("../models/fit_family_bivariate.rds")) {
    fit_family_bivariate <- readRDS("../models/fit_family_bivariate.rds")
} else {

    model_family_bivariate <- stan_model("bivariate_family_logistic.stan")

    # Fit model
    fit_family_bivariate <- sampling(
        model_family_bivariate,
        data = stan_data,
        chains = 4,
        iter = 10000,
        cores = 4,
        seed = 123
    )
    saveRDS(fit_family_bivariate, file = "../models/fit_family_bivariate.rds")
}
# Summarize parameters of interest
summary(fit_family_bivariate, pars = c("cutpoints", "sigma_family", "sigma_resid", "rho_resid", "rho_family"))$summary
A matrix: 10 × 10 of type dbl
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
cutpoints[1] -1.1856743 0.013858485 0.4693442 -2.2753330 -1.44014616 -1.1358417 -0.8726794 -0.3992866 1146.9682 1.002513
cutpoints[2] 0.1867401 0.007490110 0.4090960 -0.5485988 -0.07587707 0.1661927 0.4229213 1.0905802 2983.1436 1.001756
cutpoints[3] 2.3901725 0.032736298 0.7747425 1.3936330 1.89284076 2.2285565 2.6778149 4.4484078 560.0872 1.005723
cutpoints[4] 4.3823746 0.057602101 1.2413287 2.9428348 3.58716823 4.0644670 4.7982681 7.8109779 464.4051 1.006646
sigma_family[1] 3.1615361 0.038064180 0.8856670 2.0291648 2.57110143 2.9655321 3.5129988 5.5893180 541.3872 1.004996
sigma_family[2] 6.9272514 0.058218474 2.7816597 3.6256113 5.11538903 6.3156670 7.9888019 13.9861596 2282.8972 1.000915
sigma_resid[1] 1.7233305 0.042106957 0.8638183 0.6354722 1.13301222 1.5168173 2.0907242 4.0297206 420.8595 1.006450
sigma_resid[2] 2.0862514 0.028465494 1.1662196 0.7484309 1.33112329 1.8073942 2.5172434 5.0600078 1678.5073 1.001250
rho_resid 0.8108989 0.003073503 0.1438497 0.4766518 0.72115122 0.8405711 0.9278055 0.9934487 2190.5386 1.002305
rho_family 0.6360709 0.001731704 0.1101610 0.3901907 0.56881673 0.6482308 0.7150946 0.8164705 4046.7705 1.001075
if (file.exists("../models/loo_bivariate_family.rds")) {
    loo_bivariate_family <- readRDS("../models/loo_bivariate_family.rds")
} else {
    loo_bivariate_family <- loo(fit_family_bivariate, resp = c("x", "y"))
    saveRDS(loo_bivariate_family, file = "../models/loo_bivariate_family.rds")
}
if (file.exists("../models/marginal_family_bivariate.rds")) {
    marginal_family_bivariate <- readRDS("../models/marginal_family_bivariate.rds")
} else {
    marginal_family_bivariate <- bridge_sampler(
        fit_family_bivariate,
        method = "warp3",
        repetitions = 10
    )
    saveRDS(marginal_family_bivariate, file = "../models/marginal_family_bivariate.rds")
}

fit and assess the bivariate model with family-level random effects and phylogenetic control

Stan code bivariate_ou_logistic.stan:

data {
  int<lower=1> Ntip;
  int<lower=1> Nnodes;
  int<lower=1> Nedges;
  array[Nedges, 2] int<lower=1, upper=Nnodes> edge_matrix; // [parent, child]
  vector[Nedges] edge_lengths;

  array[Ntip] int<lower=1, upper=Nnodes> tip_indices;
  int<lower=2> K;                         // ordinal levels in x
  array[Ntip] int<lower=1, upper=K> x;    // ordinal outcome
  array[Ntip] int<lower=0, upper=1> y;    // binary response
  int<lower=1> Nfamilies;
  array[Ntip] int<lower=1, upper=Nfamilies> family_index;
  int<lower=1, upper=Nnodes> root_index;
}

parameters {
  // Family-specific correlated intercepts
  matrix[2, Nfamilies] mu_family;
  cholesky_factor_corr[2] L_family;
  vector<lower=0>[2] sigma_family;

  // OU latent state
  vector[2] root_state;
  vector[Nnodes - 1] z1_raw;
  vector[Nnodes - 1] z2_raw;
  vector[2] mu;                      // OU process stationary mean
  vector<lower=0>[2] sigma;         // OU process stationary sd
  vector<lower=0>[2] lambda;        // OU strength
  cholesky_factor_corr[2] L_OU;     // OU trait correlation

  // Ordinal cutpoints
  real cutpoint1;
  vector<lower=0, upper=10>[K - 2] zeta;
}

transformed parameters {
  ordered[K - 1] cutpoints;
  matrix[Nnodes, 2] z;

  // Construct cutpoints
  cutpoints[1] = cutpoint1;
  for (k in 2:(K - 1)) {
    cutpoints[k] = cutpoints[k - 1] + zeta[k - 1] + 1e-2;
  }

  // Assign root node
  z[root_index, 1] = root_state[1];
  z[root_index, 2] = root_state[2];

  // Forward simulation of OU process over tree
  {
    int counter = 1;
    for (e in 1:Nedges) {
      int parent = edge_matrix[e, 1];
      int child = edge_matrix[e, 2];
      real t = edge_lengths[e];

      real decay1 = exp(-lambda[1] * t);
      real decay2 = exp(-lambda[2] * t);

      vector[2] z_parent = to_vector(z[parent]);
      vector[2] mean = mu + [decay1 * (z_parent[1] - mu[1]),
                             decay2 * (z_parent[2] - mu[2])]';

      real ou_var_1 = (1 - exp(-2 * lambda[1] * t)) / (2 * lambda[1]);
      real ou_var_2 = (1 - exp(-2 * lambda[2] * t)) / (2 * lambda[2]);

      matrix[2, 2] L = diag_pre_multiply(sqrt([ou_var_1, ou_var_2]'), L_OU);
      vector[2] noise = [z1_raw[counter], z2_raw[counter]]';
      vector[2] scaled_noise = sigma .* noise;

      z[child] = (mean + L * scaled_noise)';
      counter += 1;
    }
  }
}

model {
  // OU process priors
  mu ~ normal(0, 2);
  sigma ~ lognormal(0,1);
  lambda ~ lognormal(0, 1);
  L_OU ~ lkj_corr_cholesky(1);
  z1_raw ~ std_normal();
  z2_raw ~ std_normal();
  root_state[1] ~ normal(mu[1], sigma[1] / sqrt(2 * lambda[1]));
  root_state[2] ~ normal(mu[2], sigma[2] / sqrt(2 * lambda[2]));

  // Family-level intercept priors
  L_family ~ lkj_corr_cholesky(1);
  sigma_family ~ lognormal(0, 1);
  for (j in 1:Nfamilies) {
    mu_family[, j] ~ multi_normal_cholesky(rep_vector(0, 2),
                                           diag_pre_multiply(sigma_family, L_family));
  }

  // Ordinal cutpoints
  cutpoint1 ~ normal(0, 2);
  zeta ~ lognormal(0, 2);

  // Likelihood
  for (i in 1:Ntip) {
    int node = tip_indices[i];
    x[i] ~ ordered_logistic(z[node, 1] + mu_family[1, family_index[i]], cutpoints);
    y[i] ~ bernoulli_logit(z[node, 2] + mu_family[2, family_index[i]]);
  }
}

generated quantities {
  corr_matrix[2] Rho = multiply_lower_tri_self_transpose(L_OU);
  corr_matrix[2] Rho_family = multiply_lower_tri_self_transpose(L_family);
  real rho = Rho[1, 2];
  real rho_family = Rho_family[1, 2];

  vector[Ntip] log_lik;
  for (i in 1:Ntip) {
    int node = tip_indices[i];
    log_lik[i] =
      ordered_logistic_lpmf(x[i] | z[node, 1] + mu_family[1, family_index[i]], cutpoints) +
      bernoulli_logit_lpmf(y[i] | z[node, 2] + mu_family[2, family_index[i]]);
  }
  matrix[Nnodes, 2] z_copy;
  for (i in 1:Nnodes)
    for (j in 1:2)
      z_copy[i, j] = z[i, j];
}
if (file.exists("../models/fit_bivariate_ou.rds")) {
    fit_bivariate_ou <- readRDS("../models/fit_bivariate_ou.rds")
} else {
    
    model_bivariate_ou <- stan_model("bivariate_ou_logistic.stan")

    # Fit model
    fit_bivariate_ou <- sampling(
    model_bivariate_ou,
    data = stan_data,
    chains = 4,
    iter = 20000,
    cores = 4,
    seed = 123
    )
    saveRDS(fit_bivariate_ou, file = "../models/fit_bivariate_ou.rds")
}
# Summarize parameters of interest
summary(fit_bivariate_ou, pars = c("cutpoints", "mu", "sigma", "lambda", "rho", "rho_family"))$summary
A matrix: 12 × 10 of type dbl
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
cutpoints[1] -0.51067868 0.0088035918 1.61489789 -3.65562437 -1.59734548 -0.51565737 0.56947688 2.6527419 33648.8587 0.9999770
cutpoints[2] 2.01415396 0.0204515026 1.72679378 -1.29352973 0.84325689 1.98767503 3.14733245 5.5002450 7129.0311 1.0008513
cutpoints[3] 6.05658152 0.0554957854 2.28301011 2.04372302 4.50581998 5.88163330 7.38368028 11.1161462 1692.3714 1.0033496
cutpoints[4] 9.69765811 0.0858397722 2.98196238 4.87964163 7.67657089 9.34574562 11.29231806 16.6540466 1206.7775 1.0046950
mu[1] 0.50537460 0.0085172032 1.61627063 -2.67676372 -0.57651308 0.50931091 1.59380308 3.6667175 36010.9139 1.0000067
mu[2] -0.75476986 0.0089048679 1.92647876 -4.49358697 -2.07660052 -0.76321190 0.52789808 3.0652263 46802.9779 0.9999748
sigma[1] 2.61783253 0.0296963429 0.88614194 1.41251485 2.00083538 2.44884891 3.03708042 4.8645997 890.4318 1.0062441
sigma[2] 3.94993762 0.0387977888 1.89803893 1.71755573 2.70439341 3.51352642 4.65564967 8.8786473 2393.2953 1.0007204
lambda[1] 0.11504302 0.0004618056 0.03918786 0.04805974 0.08707168 0.11157667 0.13945999 0.2007775 7200.8659 1.0005974
lambda[2] 0.05053272 0.0001349285 0.02144987 0.01780592 0.03506052 0.04748585 0.06280257 0.1009523 25272.1311 1.0000468
rho 0.54368963 0.0032397454 0.18820218 0.19814333 0.40250766 0.54049899 0.68487980 0.8961702 3374.6408 1.0004673
rho_family 0.55375172 0.0019144207 0.15319816 0.21457507 0.45942306 0.56834375 0.66324003 0.8109723 6403.7204 1.0003641
posterior_df <- as_draws_df(fit_bivariate_ou)

mcmc_areas(posterior_df, pars=c("rho", "rho_family"), prob=0.95) +
    labs(title="Posterior distributions of correlation parameters") +
    theme(plot.title = element_text(hjust = 0.5))

if (file.exists("../models/loo_bivariate_ou.rds")) {
    loo_bivariate_ou <- readRDS("../models/loo_bivariate_ou.rds")
} else {
    loo_bivariate_ou <- loo(fit_bivariate_ou, resp = c("x", "y"))
    saveRDS(loo_bivariate_ou, file = "../models/loo_bivariate_ou.rds")
}
if (file.exists("../models/marginal_ou_bivariate.rds")) {
    marginal_ou_bivariate <- readRDS("../models/marginal_ou_bivariate.rds")
} else {
    marginal_ou_bivariate <- bridge_sampler(
        fit_bivariate_ou,
        method = "warp3",
        silent = FALSE,
        repetitions = 10
    )
    saveRDS(marginal_ou_bivariate, file = "../models/marginal_ou_bivariate.rds")
}

extract posterior means for latent variables and family-level interecepts



z_summary <- posterior_df %>%
  spread_draws(z_copy[i, j]) %>%
  group_by(i, j) %>%
  summarise(mean_z = mean(z_copy), .groups = "drop")
mu_summary <- posterior_df %>%
  spread_draws(mu_family[i, j]) %>%
  group_by(i, j) %>%
  summarise(mean_mu = mean(mu_family), .groups = "drop")


posterior_means <- bind_rows(
  z_summary %>% rename(value = mean_z) %>% mutate(type = "z"),
  mu_summary %>% rename(value = mean_mu) %>% mutate(type = "mu")
)

write_csv(posterior_means, "../data/z_mu_posterior_means.csv")
z_affix <- posterior_means %>%
    filter(type == "z") %>%
    filter(j == 1) %>%
    pull(value)

z_adposition <- posterior_means %>%
    filter(type == "z") %>%
    filter(j == 2) %>%
    pull(value)
affix_levels <- c(
  "1" = "Strongly suffixing",
  "2" = "Weakly suffixing",
  "3" = "Equal prefixing and suffixing",
  "4" = "Weakly prefixing",
  "5" = "Strongly prefixing"
)
affix_factor <- factor(
  as.character(d$affix_ord),
  levels = names(affix_levels),
  labels = affix_levels
)

adposition_levels <- c(
    "0" = "Postposition",
    "1" = "Preposition"
)

adposition_factor <- factor(
  as.character(d$adp_bin),
  levels = names(adposition_levels),
  labels = adposition_levels
)

get the Indo-European subset of the data

ie_index <- d %>%
    filter(Family == "Indo-European") %>%
    pull(family_index) %>% unique()


ie_intercepts <- posterior_means %>%
    filter(type == "mu") %>%
    filter(j == ie_index) %>%
    pull(value)
tree_longnames <- read.tree("../data/affix_adposition_longnames.tre")

longnames <- tree_longnames$tip.label
names(longnames) <- tree$tip.label

create tibble with latent values for interior nodes

# Add node labels to the tree based on node numbering
tree$node.label <- as.character((length(tree$tip.label) + 1):(length(tree$tip.label) + tree$Nnode))


# 1. Subset auf Indo-European
d_ie <- d %>% filter(Family == "Indo-European")
glottocodes_ie <- d_ie$Glottocode

# 2. Baum prunen
ie_tree <- drop.tip(tree, setdiff(tree$tip.label, glottocodes_ie), 
                        keep.node.label = TRUE)
ie_tree_plot <- ggtree(ie_tree)
ie_tree_info <- ie_tree_plot$data



# Add  column node.label
ie_tree_info$node.label <- as.character(ie_tree_info$label)

for (i in 1:nrow(ie_tree_info)) {
    if (ie_tree_info$isTip[i]) {
        # If it's a tip, get the corresponding Glottocode
        glottocode <- ie_tree_info$label[i]
        # Find the index of the Glottocode in d_ie
        index <- which(tree$tip.label == glottocode)
        # Assign the Affix value to the node label
        ie_tree_info$node.label[i] <- index
    } else {
        # If it's not a tip, copy label
        ie_tree_info$node.label[i] <- as.integer(ie_tree_info$node.label[i])
    }
}

# trait vectors with original node ID (as in the full tree)
node_data <- tibble(
  node.label = as.character(1:length(z_affix)),
  z_affix = z_affix + ie_intercepts[1],
  z_adposition = z_adposition + ie_intercepts[2]
)

# merge with ie_tree_info
ie_tree_data <- ie_tree_info %>%
  left_join(node_data, by = "node.label")

ie_tree_data <- ie_tree_data %>%
    left_join(
        rename(select(d, Glottocode, affix_ord, adp_bin),
        label=Glottocode)
    ) %>% 
    mutate(Affix = factor(affix_ord, levels = names(affix_levels), labels = affix_levels),
           Adposition = factor(adp_bin, levels = names(adposition_levels), labels = adposition_levels)
    )
Joining with `by = join_by(label)`

clean up language names

ie_longnames <- longnames[ie_tree$tip.label] %>% as.character()
ie_longnames <- gsub("_2", "", ie_longnames)
ie_longnames <- gsub("_3", "", ie_longnames)
ie_names <- sapply(strsplit(ie_longnames, "\\."), `[`, 3)
ie_tree$tip.label <- ie_names
name2gname <- readRDS("../data/name2gname.rds")
ie_tree$tip.label <- name2gname[ie_tree$tip.label]

create tree plots

# Set up the custom palette and tree depth
custom_palette <- paletteer_c("ggthemes::Classic Orange-Blue", 30)
max_depth <- max(nodeHeights(ie_tree))
x_max <- max_depth + 3  # Adjust for spacing


# Extrahiere z. B. 3 Farben aus der Classic Orange-Blue Palette
affix_colors <- paletteer::paletteer_c("ggthemes::Classic Orange-Blue", 30)[c(5, 20, 25)]
adposition_colors <- paletteer::paletteer_c("ggthemes::Classic Orange-Blue", 30)[c(5, 30)]

names(affix_colors) <- c("Strongly suffixing", "Weakly suffixing", "Equal prefixing and suffixing")



# Create the first plot (z_affix) with transparency
affix_plot <- ggtree(ie_tree) %<+% ie_tree_data +
    geom_tree(aes(color = z_affix), size = 1.5) +
    geom_tiplab(size = 4, align = FALSE, linetype = NA, offset = .7) +
    scale_color_gradientn(colors = custom_palette) +
    theme_tree2() +
    xlim(0, x_max) +
    theme(
        panel.background = element_rect(fill = NA, color = NA),  # Transparent background
        legend.position = "left",
        axis.line.x = element_blank(),  # Remove x-axis line
        axis.text.x = element_blank(),  # Remove x-axis text
        axis.ticks.x = element_blank(),  # Remove x-axis ticks
        axis.title.x = element_blank()   # Remove x-axis title
    ) +
    geom_tippoint(aes(shape = Affix, fill=Affix), size = 4, color = "black", stroke = 0.5, position = position_nudge(x = 0.2)) +
    scale_shape_manual(values = c(21, 22, 23, 24, 25)) +
    scale_fill_manual(values = affix_colors)

# Create the second plot (z_adposition) with transparency
adposition_plot <- ggtree(ie_tree) %<+% ie_tree_data +
    geom_tree(aes(color = z_adposition), size = 1.5) +
    scale_color_gradientn(colors = custom_palette) +
    theme_tree2() +
    scale_x_reverse(limits = c(x_max, 0)) +
    theme(
        panel.background = element_rect(fill = NA, color = NA),  # Transparent background
        legend.position = "right",
        axis.line.x = element_blank(),  # Remove x-axis line
        axis.text.x = element_blank(),  # Remove x-axis text
        axis.ticks.x = element_blank(),  # Remove x-axis ticks
        axis.title.x = element_blank()   # Remove x-axis title
    ) +
    geom_tippoint(aes(shape = Adposition, fill=Adposition), size = 4, color = "black", stroke = 0.5, position = position_nudge(x = -0.2)) +
    scale_shape_manual(values = c(21, 24)) +
    scale_fill_manual(values = adposition_colors)

# Combine the two plots side by side with adjusted spacing
combined_plot <- cowplot::ggdraw() + 
    cowplot::draw_plot(adposition_plot, x = 0.4, y = 0, width = 0.5, height = 1) +
    cowplot::draw_plot(affix_plot, x = 0, y = 0, width = 0.5, height = 1) 
    
combined_plot

ggsave("../img/affix_adposition_tree.svg", combined_plot, width = 15, height = 9)

model comparison

loo_compare(loo_bivariate, loo_bivariate_family, loo_bivariate_ou)
A compare.loo: 3 × 8 of type dbl
elpd_diff se_diff elpd_loo se_elpd_loo p_loo se_p_loo looic se_looic
model3 0.0000 0.00000 -574.3539 20.88888 343.4233 14.374998 1148.708 41.77776
model2 -146.0132 11.00278 -720.3671 22.93316 274.3254 11.379858 1440.734 45.86632
model1 -407.9336 15.76019 -982.2875 15.79018 335.2332 5.832259 1964.575 31.58035
summary(marginal_bivariate)
summary(marginal_family_bivariate)
summary(marginal_ou_bivariate)
A summary.bridge_list: 1 × 6
Logml_Estimate Min Max Interquartile_Range Method Repetitions
<dbl> <dbl> <dbl> <dbl> <chr> <dbl>
-102.5171 -104.27 -101.6612 1.222981 warp3 10
A summary.bridge_list: 1 × 6
Logml_Estimate Min Max Interquartile_Range Method Repetitions
<dbl> <dbl> <dbl> <dbl> <chr> <dbl>
282.2459 280.8282 288.1325 2.016838 warp3 10
A summary.bridge_list: 1 × 6
Logml_Estimate Min Max Interquartile_Range Method Repetitions
<dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1207.759 1203.447 1213.099 5.06966 warp3 10
# Compute Bayes factors
bf_bivariate_vs_ou <- bayes_factor(marginal_bivariate, marginal_ou_bivariate, log=TRUE)
bf_bivariate_vs_ou
Estimated log Bayes factor (based on medians of log marginal likelihood estimates)
 in favor of x1 over x2: -1310.27602
Range of estimates: -1316.56516 to -1305.51547
Interquartile range: 5.96180
# Compute Bayes factors
bf_hierarchial_vs_ou <- bayes_factor(marginal_family_bivariate, marginal_ou_bivariate, log=TRUE)
bf_hierarchial_vs_ou
Estimated log Bayes factor (based on medians of log marginal likelihood estimates)
 in favor of x1 over x2: -925.51298
Range of estimates: -931.28727 to -916.06188
Interquartile range: 6.42169