Phylogenetic control with OU

Running a Phylogenetic OU Regression with R and Stan

In this notebook, I will illustrate how one can set up a logistic regression with a phylogenetic random effect following an Ornstein-Uhlenbeck process using R and Stan.

This kind of analysis has conceptual advantages over alternatives such as phylogenetic random effects following a Brownian motion, or correlated evolution of discrete characters under a CTMC model. However, while off-the-shelf solutions exist for those two approaches, the OU regression is, to my knowledge, not yet implemented in a dedicated function or package. Such a model can be set up quite easily in Stan, however, and this notebook illustrates how.

Preparing the environment

The git repository https://github.com/gerhardJaeger/OU_logistic_example holds the source code for this notebook. It contains a file OU_logistic_example.yml. You can use it to set up a conda environment containing all the required packages. (Of course you can also install the packages with install.packages if you don’t use conda.) At the command prompt, in the home directory of the repository, do

conda env create -f OU_logistic_example.yml
conda activate OU_logistic_example

Loading the required packages

suppressPackageStartupMessages({
  suppressWarnings({
    library(tidyverse)
    library(rstan)
    library(brms)
    library(parallel)
    rstan_options(auto_write = TRUE)
    options(mc.cores = parallel::detectCores())  # Enable parallel Stan sampling
    library(ape)
    library(R.utils)
    library(bridgesampling)
    library(loo)
    library(bayesplot)
    library(codetools)
    library(ggtree)
    library(posterior)
    library(tidybayes)
    library(ggthemes)
    library(paletteer)
    library(ggnewscale)
    library(cowplot)
    library(scico)
    options(repr.plot.width = 12, repr.plot.height = 8)  # For base R plots in Jupyter
    theme_set(theme_minimal(base_size = 24))  # Increase base font size
  })
})

Now let’s get some data. I will use the EDGE tree (Bouckaert et al. 2022).

# Create data directory if it doesn't exist
if (!dir.exists("../data")) dir.create("../data", recursive = TRUE)

# Download the file if it doesn't exist
tree_gz <- "../data/global-language-tree-MCC-labelled.tree.gz"
tree_file <- "../data/global-language-tree-MCC-labelled.tree"
if (!file.exists(tree_file)) {
    download.file(
        url = "https://github.com/rbouckaert/global-language-tree-pipeline/releases/download/v1.0.0/global-language-tree-MCC-labelled.tree.gz",
        destfile = tree_gz,
        mode = "wb"
    )
    R.utils::gunzip(tree_gz, destname = tree_file, overwrite = TRUE)
    # Remove the gzipped file after extraction
    if (file.exists(tree_gz)) file.remove(tree_gz)
}

edge_tree <- read.nexus(tree_file)

edge_tree$tip.label <- sapply(strsplit(edge_tree$tip.label, "_"), `[`, 1)

It is advisable to rescale the tree so that the mean branch length is 1; this keeps the OU parameters \(\lambda\) and \(\sigma\) on a numerically convenient scale.

edge_tree$edge.length <- edge_tree$edge.length / (mean(edge_tree$edge.length))

Grambank data

I will look at the somewhat boring feature pair:

  • GB193: What is the order of adnominal property word and noun?
  • GB133: Is a pragmatically unmarked constituent order verb-final for transitive clauses?

First we get the grambank data.

grambank_file = "../data/grambank_vals.csv"
if (!file.exists(grambank_file)) {
    values_url <- "https://raw.githubusercontent.com/grambank/grambank/refs/heads/master/cldf/values.csv"
    vals <- read_csv(values_url, na = "")
    write_csv(vals, grambank_file)
} else {
    vals <- read_csv(grambank_file)
}
Rows: 441663 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): ID, Language_ID, Parameter_ID, Value, Code_ID, Comment, Source, Sou...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Now we extract the data points for the two parameters of interest and the languages in the tree. I only consider two values per parameter, and only languages having a relevant value for both parameters.

d <- vals %>%
    filter(Language_ID %in% edge_tree$tip.label) %>%
    filter(Parameter_ID %in% c("GB193", "GB133")) %>%
    select(Language_ID, Parameter_ID, Value) %>%
    filter(Value %in% c(0, 1, 2)) %>%
    mutate(Value = as.numeric(Value)) %>%
    pivot_wider(names_from = Parameter_ID, values_from = Value) %>%
    filter(GB193 > 0) %>%
    drop_na()

Now we prune the tree to only include the languages in the data.

pruned_tree <- drop.tip(edge_tree, setdiff(edge_tree$tip.label, d$Language_ID))

# Check that d$Language_ID and pruned_tree$tip.label are the same set
if (!setequal(d$Language_ID, pruned_tree$tip.label)) {
    warning("Mismatch between d$Language_ID and pruned_tree$tip.label")
    print(setdiff(d$Language_ID, pruned_tree$tip.label))
    print(setdiff(pruned_tree$tip.label, d$Language_ID))
} else {
    message("d$Language_ID and pruned_tree$tip.label match.")
}
d$Language_ID and pruned_tree$tip.label match.

Now the rows in d are rearranged to match the order of the tips in the pruned tree.

d <- d[match(pruned_tree$tip.label, d$Language_ID), ]

Finally, we recode both features as 0/1 variables: x (from GB193) and y (from GB133).

d <- d %>%
    mutate(x = GB193 - 1,
           y = GB133 
    )

For now, I keep the data set small and restrict myself to 100 taxa.

set.seed(123)  # You can replace 123 with any integer
d <- d[sample(nrow(d), 100), ]
pruned_tree <- drop.tip(pruned_tree, setdiff(pruned_tree$tip.label, d$Language_ID))
d <- d[match(pruned_tree$tip.label, d$Language_ID), ]

Let’s have a look at the tree and the data.

ggtree(pruned_tree, layout = "circular") +  # ggtree() already draws the tree layer
    geom_tiplab() +            # Adds tip labels
    theme_tree2()              # Adds scale axis (x-axis for branch lengths)


d %>% head(10)
Warning messages (from ggtree):
“`aes_()` was deprecated in ggplot2 3.0.0. Please use tidy evaluation idioms with `aes()`. The deprecated feature was likely used in the ggtree package. Please report the issue at <https://github.com/YuLab-SMU/ggtree/issues>.”
“Arguments in `...` must be used. Problematic arguments: as.Date, yscale_mapping, hang. Did you misspell an argument name?”
“`aes_string()` was deprecated in ggplot2 3.0.0. Please use tidy evaluation idioms with `aes()`. See also `vignette("ggplot2-in-packages")` for more information.”
A tibble: 10 × 5
  Language_ID GB133 GB193     x     y
  <chr>       <dbl> <dbl> <dbl> <dbl>
  adan1251        1     2     1     1
  ajas1235        0     1     0     0
  alab1254        1     1     0     1
  alya1239        0     2     1     0
  arch1244        1     1     0     1
  aree1239        1     2     1     1
  arig1246        0     1     0     0
  awar1249        1     2     1     1
  band1344        0     1     0     0
  bayb1234        0     1     0     0

Let’s save the data for later use.

write_csv(d, "../data/grambank_vals_pruned.csv")

Stan model

The tree is represented as a phylo object in ape. Its edge matrix and branch lengths can be passed to Stan more or less directly.

Just to be on the safe side, we make sure that the branches are in postorder.

pruned_tree <- reorder(pruned_tree, "postorder")
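Postorder matters because it lets us process the tree iteratively: in a postorder edge matrix, all edges inside a node's subtree precede the edge leading into that node, so traversing the edges in reverse visits every parent before its children. A minimal base-R check on a hypothetical 3-tip tree (tips 1-3, root 4, internal node 5):

```r
# Hypothetical 3-tip tree; edges (parent, child) in postorder:
# edges within a node's subtree come before the edge into that node.
edges <- matrix(c(
  5, 1,   # internal node 5 -> tip 1
  5, 2,   # internal node 5 -> tip 2
  4, 5,   # root 4 -> internal node 5
  4, 3    # root 4 -> tip 3
), ncol = 2, byrow = TRUE)
root <- 4

# Iterate in reverse: each edge's parent must be the root or a node
# already reached by an earlier (reversed) edge.
seen <- root
ok <- TRUE
for (e in rev(seq_len(nrow(edges)))) {
  ok <- ok && edges[e, 1] %in% seen
  seen <- c(seen, edges[e, 2])
}
ok  # TRUE: reversed postorder is a valid root-to-tip traversal
```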

Now we code the tree and the data as a data list.

stan_data = list(
    N = nrow(d),  # Number of languages
    Nnodes = length(unique(as.vector(pruned_tree$edge))),
    root_node = length(pruned_tree$tip.label) + 1,
    x = d$x,  # Predictor variable (adnominal property word-noun order, GB193)
    y = d$y,  # Response variable (verb-final order, GB133)
    Nedges = nrow(pruned_tree$edge), # number of edges
    edges = pruned_tree$edge,  # Edges of the tree,
    edge_lengths = pruned_tree$edge.length  # Lengths of the edges
)

Vanilla logistic regression

We start with a simple logistic regression without any phylogenetic effects, just to see how it works.

The model formula is:

\[ \begin{align} \alpha &\sim \text{Student-t}(3, 0, 2.5)\\ \beta &\sim \mathcal N(0, 2)\\ \mu_x &:= \frac{1}{|x|}\sum_{i}x_i\\ y &\sim \text{Bernoulli}(\text{logit}^{-1}(\alpha + \beta (x-\mu_x))) \end{align} \]

The choice of priors follows the defaults used by brms for logistic regression.
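As a quick base-R illustration of why centering only affects the intercept (simulated data, with `glm` standing in for the Bayesian fit): the slope is unchanged, and the intercepts differ by exactly \(\beta\,\mu_x\).

```r
set.seed(42)
n <- 500
x <- rbinom(n, 1, 0.4)                       # binary predictor
y <- rbinom(n, 1, plogis(-0.5 - 1.0 * x))    # true alpha = -0.5, beta = -1

m_raw <- glm(y ~ x, family = binomial)
m_ctr <- glm(y ~ I(x - mean(x)), family = binomial)

b_raw <- unname(coef(m_raw))   # (alpha, beta) on the raw scale
b_ctr <- unname(coef(m_ctr))   # (alpha', beta) on the centered scale

# Slope is invariant; intercepts differ by beta * mean(x)
c(slope_diff     = b_ctr[2] - b_raw[2],
  intercept_diff = b_ctr[1] - (b_raw[1] + b_raw[2] * mean(x)))
```

Both differences are zero up to numerical precision.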

Let’s run this in Stan.

stan_code_vanilla <- "
data {
    int<lower=1> N;                      // Number of observations
    vector<lower=0, upper=1>[N] x;       // Predictor variable
    array[N] int<lower=0,upper=1> y;     // Response variable (binary)
}

transformed data {
  vector[N] Xc;  // centered version of X without an intercept
  real means_X;  // mean of X before centering
  means_X = mean(x);
  Xc = x - means_X;
}

parameters {
    real alpha;      // Intercept
    real beta;       // Slope
}
model {
    // Priors
    alpha ~ student_t(3, 0, 2.5);
    beta ~ normal(0, 2);

    // Likelihood
    // equivalent sampling statement: y ~ bernoulli_logit(alpha + beta * Xc);
    target += bernoulli_logit_lpmf(y | alpha + beta * Xc);
    
}
generated quantities {
    vector[N] log_lik;
    array[N] int y_rep;      // Posterior predictive samples
    array[N] int y_prior;    // Prior predictive samples
    // One prior draw per iteration (shared across observations)
    real alpha_prior = student_t_rng(3, 0, 2.5);
    real beta_prior = normal_rng(0, 2);

    for (n in 1:N) {
        // Log-likelihood
        log_lik[n] = bernoulli_logit_lpmf(y[n] | alpha + beta * Xc[n]);

        // Posterior predictive
        y_rep[n] = bernoulli_logit_rng(alpha + beta * Xc[n]);

        // Prior predictive
        y_prior[n] = bernoulli_logit_rng(alpha_prior + beta_prior * Xc[n]);
    }
}
"
system.time(stan_model_vanilla <- stan_model(model_code = stan_code_vanilla))
   user  system elapsed 
 49.456   1.517  45.835 
system.time(
    fit_vanilla <- sampling(stan_model_vanilla, data = stan_data, iter = 2000, chains = 4)
)
   user  system elapsed 
  3.068   0.331   2.775 
print(fit_vanilla, pars = c("alpha", "beta"))
Inference for Stan model: anon_model.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

       mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
alpha -0.47    0.00 0.21 -0.88 -0.62 -0.47 -0.33 -0.06  3803    1
beta  -1.00    0.01 0.44 -1.87 -1.31 -0.99 -0.71 -0.15  3735    1

Samples were drawn using NUTS(diag_e) at Sun Nov 16 15:37:06 2025.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).
mcmc_areas(
    as.matrix(fit_vanilla),
    pars="beta",
    prob = 0.95
)

The 95% interval of the slope \(\beta\) excludes zero, indicating a credible association between the predictor x and the response y. The direction is negative, meaning that N-ANM order disfavors OV order.

For model comparison, we can use the bridgesampling package to compute the log marginal likelihood.

system.time(
    capture.output({
    log_marginal_likelihood_vanilla_regression <- bridge_sampler(fit_vanilla)
    }, file = "/dev/null")
)
print(log_marginal_likelihood_vanilla_regression)
   user  system elapsed 
  2.503   0.000   0.574 
Bridge sampling estimate of the log marginal likelihood: -64.79445
Estimate obtained in 5 iteration(s) via method "normal".

Now we do visual checks of the model using prior and posterior predictive simulations.

predictive_checks_regression <- function(y_prior, y_rep, stan_data) {
    x0_idx <- which(stan_data$x == 0)
    x1_idx <- which(stan_data$x == 1)
    obs_p_y1_x0 <- mean(stan_data$y[stan_data$x == 0])
    obs_p_y1_x1 <- mean(stan_data$y[stan_data$x == 1])

    y_prior_df <- as_tibble(y_prior) %>%
        setNames(1:ncol(y_prior)) %>%
        mutate(sample_no = 1:nrow(.)) %>%
        pivot_longer(-sample_no, names_to = "obs", values_to = "y") %>%
        mutate(type = "y_prior") %>%
        mutate(obs = as.integer(obs))

    y_rep_df <- as_tibble(y_rep) %>%
        setNames(1:ncol(y_rep)) %>%
        mutate(sample_no = 1:nrow(.)) %>%
        pivot_longer(-sample_no, names_to = "obs", values_to = "y") %>%
        mutate(type = "y_rep") %>%
        mutate(obs = as.integer(obs))

    y_obs_df <- tibble(
        sample_no = 0,
        obs = 1:stan_data$N,
        y = stan_data$y,
        type = "y_obs"
    )


    df <- bind_rows(y_prior_df, y_rep_df, y_obs_df)

    df <- df %>%
        mutate(x_obs = stan_data$x[df$obs])

    empirical_df <- df %>%
        filter(sample_no == 0) %>%
        group_by(x_obs) %>%
        summarise(p_y1 = mean(y), .groups = "drop")         

    p <- df %>%
        filter(sample_no != 0) %>%
        group_by(sample_no, type, x_obs) %>%
        summarise(p_y1 = mean(y), .groups = "drop") %>%
        ggplot(aes(x = p_y1, fill = type, color = type)) +
        geom_density(alpha = 0.3, adjust = 1.2) +
        geom_vline(
        data = empirical_df,
            aes(xintercept = p_y1),
            color = "red",
            linewidth = 0.7,
            linetype = "dashed"
        ) +
        facet_wrap(
            ~x_obs,
            nrow = 1,
            labeller = labeller(x_obs = c(`0` = "x = 0", `1` = "x = 1"))
        ) +
        labs(
            title = "Predictive Check: Prior vs Posterior vs Observed",
            x = "Proportion of y = 1 (per draw)",
            y = "Density"
        ) +
        theme_minimal(base_size = 14)
    return(p)
}
post_vanilla_regression <- rstan::extract(fit_vanilla)

predictive_checks_regression(post_vanilla_regression$y_prior, post_vanilla_regression$y_rep, stan_data)
Warning message:
“The `x` argument of `as_tibble.matrix()` must have unique column names if `.name_repair` is omitted as of tibble 2.0.0. ℹ Using compatibility `.name_repair`.”

Vanilla logistic correlation study

A regression model assumes an asymmetry between the predictor variable (which is not modeled) and the response variable (which is modeled). This is not really justified in observational studies, where we have no control over the predictor variable. A more appropriate model is one in which both variables are generated by some stochastic process.

This can be modeled as follows:

\[ \begin{align} z &\sim \mathcal N(\mu, \Sigma)\\ \Sigma &= \text{diag}(\sigma) \cdot R \cdot \text{diag}(\sigma)\\ R &\sim \text{LKJ}(2)\\ \mu_k &\sim \mathcal N(0, 2)\\ \sigma_k &\sim \text{Lognormal}(0, 1)\\ x &\sim \text{Bernoulli}(\text{logit}^{-1}(z_1))\\ y &\sim \text{Bernoulli}(\text{logit}^{-1}(z_2)) \end{align} \]

Here \(z\) is a bivariate normal variable with mean \(\mu\) and covariance \(\Sigma\). The covariance matrix \(\Sigma\) is parameterized as the product of a diagonal matrix of standard deviations \(\sigma = [\sigma_1, \sigma_2]\) and a correlation matrix \(R\). This allows the two dimensions to have different scales while still modeling their correlation. The correlation between them is given by \(\rho = R_{12} = R_{21}\). The correlation matrix is generated from an LKJ distribution, which is a flexible way to model correlations. The two variables \(x\) and \(y\) are then generated from the first and second components of \(z\), respectively, using the logistic function.

The interesting question is then whether the correlation between the latent variables behind \(x\) and \(y\), i.e., \(\rho = R_{21}\), is credibly different from zero. If it is, this indicates an association between the two variables, which is what we are interested in.
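This generative story can be forward-simulated in a few lines of base R (all parameter values below are hypothetical; `t(chol(R))` plays the role of the Cholesky factor \(L_R\)): with a strongly positive latent correlation, the binary variables come out positively associated as well, though attenuated by the Bernoulli step.

```r
set.seed(1)
n     <- 50000
mu    <- c(0.5, -0.5)                 # hypothetical latent means
sigma <- c(1, 1)                      # hypothetical latent scales
rho   <- 0.8                          # hypothetical latent correlation
R     <- matrix(c(1, rho, rho, 1), 2, 2)

# z = mu + diag(sigma) %*% L_R %*% z_raw, with L_R = t(chol(R))
z_raw <- matrix(rnorm(2 * n), 2, n)
z     <- mu + diag(sigma) %*% t(chol(R)) %*% z_raw

x <- rbinom(n, 1, plogis(z[1, ]))
y <- rbinom(n, 1, plogis(z[2, ]))
cor(x, y)  # clearly positive, but smaller than rho
```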

stan_code_vanilla_corr <- "
data {
  int<lower=0> N;
  array[N] int<lower=0, upper=1> x;       // first binary variable
  array[N] int<lower=0, upper=1> y;       // second binary variable
}

parameters {
  vector[2] mu;                          // Mean of latent variables
  vector<lower=0>[2] sigma;                 // Standard deviation of latent variables
  cholesky_factor_corr[2] L_R;           // Cholesky factor of correlation matrix
  matrix[2, N] z_raw;                    // Standard normal latent variables
}

transformed parameters {
  real rho = L_R[2, 1];                  // Extract the correlation coefficient from the Cholesky factor
  matrix[2, N] z;                        // Latent variables with correlation
  z = rep_matrix(mu, N) + diag_matrix(sigma) * L_R * z_raw;
}

model {
  // Priors
  mu ~ normal(0, 2);                       // Prior for the mean of latent variables
  sigma ~ lognormal(0,1);                  // Prior for the standard deviation of latent variables
  L_R ~ lkj_corr_cholesky(2);              // Prior for the Cholesky factor of the correlation matrix
  to_vector(z_raw) ~ normal(0, 1);         // standard normal prior for latent

  // Likelihood
  for (n in 1:N) {
    // equivalent sampling statements:
    // x[n] ~ bernoulli_logit(z[1, n]);
    // y[n] ~ bernoulli_logit(z[2, n]);
    target += bernoulli_logit_lpmf(x[n] | z[1, n]);
    target += bernoulli_logit_lpmf(y[n] | z[2, n]);
  }
  
}

generated quantities {
  vector[N] log_lik_x;
  vector[N] log_lik_y;
  array[N] int x_rep;
  array[N] int y_rep;
  array[N] int x_prior;
  array[N] int y_prior;

  for (n in 1:N) {
    log_lik_x[n] = bernoulli_logit_lpmf(x[n] | z[1, n]);
    log_lik_y[n] = bernoulli_logit_lpmf(y[n] | z[2, n]);
    x_rep[n] = bernoulli_logit_rng(z[1, n]);
    y_rep[n] = bernoulli_logit_rng(z[2, n]);
  }

  // Prior predictive simulation
  {
    vector[2] mu_prior;
    vector[2] sigma_prior;
    mu_prior[1] = normal_rng(0, 2);
    mu_prior[2] = normal_rng(0, 2);
    sigma_prior[1] = lognormal_rng(0, 1);
    sigma_prior[2] = lognormal_rng(0, 1);

    matrix[2, 2] R_prior = lkj_corr_rng(2, 2.0);  // Sample from LKJ prior
    matrix[2, 2] L_prior = cholesky_decompose(R_prior);
    matrix[2, N] z_raw_prior;
    matrix[2, N] z_prior;

    for (n in 1:N)
      for (d in 1:2)
        z_raw_prior[d, n] = normal_rng(0, 1);

    z_prior = rep_matrix(mu_prior, N) + diag_matrix(sigma_prior) * L_prior * z_raw_prior;

    for (n in 1:N) {
      x_prior[n] = bernoulli_logit_rng(z_prior[1, n]);
      y_prior[n] = bernoulli_logit_rng(z_prior[2, n]);
    }
  }
}

"
system.time(stan_model_vanilla_corr <- stan_model(model_code = stan_code_vanilla_corr))
   user  system elapsed 
 56.529   1.559  54.962 
system.time(
    fit_vanilla_corr <- sampling(
        stan_model_vanilla_corr,
        data = stan_data,
        iter = 2000,
        chains = 4,
        control = list(adapt_delta = 0.95)
    )
)
Warning message:
“Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
Running the chains for more iterations may help. See
https://mc-stan.org/misc/warnings.html#bulk-ess”
   user  system elapsed 
 15.789   0.663   9.616 
print(fit_vanilla_corr, pars = c("mu", "rho", "sigma"))
Inference for Stan model: anon_model.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

          mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
mu[1]     1.36    0.03 0.65  0.54  0.93  1.21  1.63  3.07   639 1.00
mu[2]    -0.88    0.03 0.69 -2.89 -1.10 -0.70 -0.46 -0.06   454 1.00
rho      -0.44    0.01 0.28 -0.89 -0.65 -0.48 -0.29  0.27   861 1.00
sigma[1]  2.00    0.07 1.57  0.24  0.90  1.57  2.68  6.32   443 1.01
sigma[2]  2.92    0.16 3.12  0.23  1.00  2.01  3.75 11.04   362 1.00

Samples were drawn using NUTS(diag_e) at Sun Nov 16 15:38:12 2025.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).
mcmc_areas(
    as.matrix(fit_vanilla_corr),
    pars = "rho",
    prob = 0.95
)

The 95% interval of \(\rho\) includes 0, so there is no clear evidence for a correlation.

predictive checks

We compare the behavior of data sampled from the prior, and data sampled from the posterior, to the observed data.

First check: proportion of 1s in the \(x\)-variable

predictive_check_marginal_binary <- function(x_prior, x_rep, x_obs,
                                             varname = "x", bins = 30) {
  library(ggplot2)
  library(dplyr)

  # Compute posterior and prior predictive means (per draw)
  post_prop <- rowMeans(x_rep)
  prior_prop <- rowMeans(x_prior)
  obs_prop <- mean(x_obs)

  # Combine into tidy data frame
  df <- data.frame(
    prop = c(prior_prop, post_prop),
    source = rep(c("prior", "posterior"), times = c(length(prior_prop), length(post_prop)))
  )

  # Plot
  ggplot(df, aes(x = prop, fill = source)) +
    geom_histogram(position = "identity", alpha = 0.4, bins = bins) +
    geom_vline(xintercept = obs_prop, color = "red", linetype = "dashed", linewidth = 1) +
    labs(
      title = paste("Sanity Check:", varname, "~ Bernoulli"),
      x = paste("Proportion of", varname, "= 1 (per draw)"),
      y = "Frequency"
    ) +
    scale_fill_manual(values = c("prior" = "skyblue", "posterior" = "seagreen")) +
    theme_minimal(base_size = 14)
}
post_vanilla_corr <- rstan::extract(fit_vanilla_corr)
predictive_check_marginal_binary(
  x_prior = post_vanilla_corr$x_prior,
  x_rep = post_vanilla_corr$x_rep,
  x_obs = stan_data$x,
  varname = "x"
)

Second check: the conditional probabilities \(P(y=1 \mid x_{\text{observed}}=0)\) and \(P(y=1 \mid x_{\text{observed}}=1)\)

predictive_checks_regression(post_vanilla_corr$y_prior, post_vanilla_corr$y_rep, stan_data)

For both checks, the prior is close to a uniform distribution, and the posterior is peaked around the empirical values. This is as it should be.

model comparison

We compute the log marginal likelihood for this model for later use.

system.time(
    capture.output({
        log_marginal_likelihood_vanilla_corr <- bridge_sampler(fit_vanilla_corr)
    }, file = "/dev/null")
)
print(log_marginal_likelihood_vanilla_corr)
   user  system elapsed 
  5.326   0.078   2.214 
Bridge sampling estimate of the log marginal likelihood: 55.95731
Estimate obtained in 16 iteration(s) via method "normal".

A direct model comparison via Bayes factor is not possible because the regression model describes the conditional distribution of y given x, while the correlation model describes the joint distribution of x and y. However, we can apply Pareto-smoothed leave-one-out cross-validation (PSIS-LOO) if we restrict attention to the pointwise log-likelihood of y in the correlation model.

# extract log-likelihood matrix: draws x observations
log_lik1 <- extract_log_lik(fit_vanilla, parameter_name = "log_lik")
log_lik2 <- extract_log_lik(fit_vanilla_corr, parameter_name = "log_lik_y")

# compute LOO
loo1 <- loo(log_lik1)
loo2 <- loo(log_lik2)


loo_list <- list(
    "vanilla_regression" = loo1, 
    "vanilla_correlation" = loo2
  )

# compare
loo_compare(loo_list)
Warning message:
“Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
”
A compare.loo: 2 × 8 of type dbl
                     elpd_diff se_diff  elpd_loo se_elpd_loo     p_loo  se_p_loo    looic se_looic
vanilla_correlation   0.000000 0.00000 -62.07471    2.278148 27.944938 1.2846688 124.1494 4.556297
vanilla_regression   -4.047377 1.21631 -66.12208    3.149916  2.011269 0.1372601 132.2442 6.299831

So the correlation model is clearly preferred.

The Ornstein-Uhlenbeck process

The OU process is a stochastic process, similar to Brownian motion. The crucial difference is that OU is a random walk on a leash. Technically, it combines a stochastic Brownian-motion component with a deterministic trajectory converging exponentially to a mean value. It is characterized by the following transition distribution (the second argument of \(\mathcal N\) is the standard deviation): \[ X_t \sim \mathcal N\left(x_0e^{-\lambda t} + \mu (1-e^{-\lambda t}), \frac{\sigma}{\sqrt{2\lambda}}\sqrt{1-e^{-2\lambda t}}\right) \]

The parameters are:

  • \(x_0\): the initial value of the process
  • \(\mu\): the mean value to which the process converges
  • \(\lambda\): the rate of convergence to the mean value
  • \(\sigma\): the volatility of the process
  • \(t\): the time at which the process is evaluated

As \(t \to \infty\), the process converges to the equilibrium distribution: \[ X_\infty \sim \mathcal N\left(\mu, \frac{\sigma}{\sqrt{2\lambda}}\right) \]
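Both formulas can be checked numerically in base R (all parameter values hypothetical): simulate the SDE \(dX = -\lambda(X-\mu)\,dt + \sigma\,dW\) with small Euler steps and compare the sample mean and standard deviation at time \(t\) with the closed form.

```r
set.seed(7)
x0 <- 2; mu <- 0; lambda <- 1; sigma <- 1   # hypothetical parameters
t_end   <- 0.5
n_steps <- 500
dt      <- t_end / n_steps
n_paths <- 20000

# Euler-Maruyama simulation of the OU SDE, vectorized over paths
x <- rep(x0, n_paths)
for (s in seq_len(n_steps)) {
  x <- x - lambda * (x - mu) * dt + sigma * sqrt(dt) * rnorm(n_paths)
}

exact_mean <- x0 * exp(-lambda * t_end) + mu * (1 - exp(-lambda * t_end))
exact_sd   <- sigma / sqrt(2 * lambda) * sqrt(1 - exp(-2 * lambda * t_end))
c(sim_mean = mean(x), exact_mean = exact_mean,
  sim_sd   = sd(x),   exact_sd   = exact_sd)
```

The simulated moments match the closed-form transition distribution to within Monte Carlo error.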

Logistic regression with a phylogenetic random effect

We now add a phylogenetic random effect to the logistic regression from above:

\[ \begin{align} \alpha &\sim \text{Student-t}(3, 0, 2.5)\\ \beta &\sim \mathcal N(0, 2)\\ \mu &\sim \mathcal N(0, 2)\\ \sigma, \lambda &\sim \text{Lognormal}(0, 1)\\ \epsilon &\sim \mathcal N(\mu\mathbf 1, \Sigma)\\ y &\sim \text{Bernoulli}(\text{logit}^{-1}(\alpha + \beta (x - \mu_x) + \epsilon)) \end{align} \]

The covariance matrix \(\Sigma\) of the phylogenetic random effect is given by

\[ \Sigma_{ij} = \frac{\sigma^2}{2\lambda} e^{-\lambda t_{ij}}, \]

where \(t_{ij}\) is the patristic distance between language \(i\) and language \(j\) in the phylogeny.
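For illustration, here is the direct construction of \(\Sigma\) in base R for a hypothetical ultrametric patristic distance matrix over three languages; that `chol()` succeeds confirms \(\Sigma\) is a valid (positive-definite) covariance matrix:

```r
lambda <- 1; sigma <- 1                         # hypothetical OU parameters

# Hypothetical patristic distances: a and b are sisters, c is the outgroup
t_ij <- matrix(c(0, 2, 4,
                 2, 0, 4,
                 4, 4, 0), 3, 3,
               dimnames = list(c("a", "b", "c"), c("a", "b", "c")))

# Sigma_ij = sigma^2 / (2 * lambda) * exp(-lambda * t_ij)
Sigma <- sigma^2 / (2 * lambda) * exp(-lambda * t_ij)
Sigma
chol(Sigma)  # succeeds, so Sigma is positive definite
```

The diagonal of \(\Sigma\) is the equilibrium variance \(\sigma^2/(2\lambda)\), and off-diagonal entries decay exponentially with patristic distance.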

A direct implementation of this formula in Stan is possible but inefficient. A better approach directly simulates the evolution of the latent variable \(\epsilon\) along the phylogeny. This means that \(\epsilon\) has a value not only at the tips but also at the internal nodes. The value at the root is sampled from the equilibrium distribution, and the values at the other nodes are drawn from the transition distribution above, repeated here:

\[ X_t \sim \mathcal N\left(x_0e^{-\lambda t} + \mu (1-e^{-\lambda t}), \frac{\sigma}{\sqrt{2\lambda}}\sqrt{1-e^{-2\lambda t}}\right). \]

\(X_t\) is now the value at the daughter node, \(x_0\) at the mother node, and \(t\) is the branch length.
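Before turning to Stan, the recursion can be sketched in base R (hypothetical 3-tip tree with a postorder edge matrix; `expm1` is used for numerical stability, as in the Stan code below):

```r
set.seed(3)
lambda <- 1; sigma <- 1; mu <- 0      # hypothetical OU parameters

# Hypothetical 3-tip tree: tips 1-3, root 4, internal node 5; postorder edges
edges        <- matrix(c(5, 1,  5, 2,  4, 5,  4, 3), ncol = 2, byrow = TRUE)
edge_lengths <- c(1, 1, 0.5, 1.5)
n_nodes      <- 5
root         <- 4

z <- numeric(n_nodes)
z[root] <- rnorm(1, mu, sigma / sqrt(2 * lambda))   # equilibrium draw at the root

# Reversed postorder visits each parent before its children
for (e in rev(seq_len(nrow(edges)))) {
  parent <- edges[e, 1]; child <- edges[e, 2]; len <- edge_lengths[e]
  decay  <- exp(-lambda * len)
  s      <- sigma * sqrt(-expm1(-2 * lambda * len) / (2 * lambda))
  z[child] <- rnorm(1, mu + (z[parent] - mu) * decay, s)
}
z  # latent OU values at all five nodes
```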

stan_model_code_OU_regression = "
data {
  int<lower=1> N;                                         // number of tips with data
  int<lower=1> Nnodes;                                    // total number of nodes
  int<lower=1> Nedges;                                    // number of edges in the tree
  vector<lower=0, upper=1>[N] x;                          // predictor at tips
  array[N] int<lower=0, upper=1> y;                       // binary response at tips
  array[Nedges, 2] int<lower=1, upper=Nnodes> edges;      // parent → child
  vector<lower=0>[Nedges] edge_lengths;                   // edge lengths
  int<lower=1, upper=Nnodes> root_node;                   // index of root node
}
transformed data {
  vector[N] Xc;  // centered version of X without an intercept
  real means_X;  // mean of X before centering
  means_X = mean(x);
  Xc = x - means_X;
}


parameters {
  vector[Nnodes] z_std;       // standard normal reparam for latent z
  real<lower=0> sigma;        // OU diffusion
  real<lower=0> lambda;       // OU strength
  real mu;                    // OU mean
  real alpha;                 // intercept
  real beta;                  // slope
}

transformed parameters {
  vector[Nnodes] z;

  // root node
  z[root_node] = mu + (sigma / sqrt(2 * lambda)) * z_std[root_node];

  // recursive evolution
  for (e in 1:Nedges) {
    int edge_index = Nedges - e + 1; // reverse order for recursion
    int parent = edges[edge_index, 1];
    int child = edges[edge_index, 2];
    real len = edge_lengths[edge_index];

    real decay = exp(-lambda * len);
    real s = sigma * sqrt(-expm1(-2 * lambda * len) / (2 * lambda));
    real mn = mu + (z[parent] - mu) * decay;

    z[child] = mn + s * z_std[child];
  }
}

model {
  // Priors
  alpha ~ student_t(3, 0, 2.5);
  beta ~ normal(0, 2);
  sigma ~ lognormal(0, 1);
  lambda ~ lognormal(0, 1);
  mu ~ normal(0, 2);
  z_std ~ normal(0, 1);  // standard normal prior

  // Likelihood
  for (i in 1:N) {
    // equivalent sampling statement: y[i] ~ bernoulli_logit(alpha + beta * Xc[i] + z[i]);
    target += bernoulli_logit_lpmf(y[i] | alpha + beta * Xc[i] + z[i]);
  }

}

generated quantities {
  vector[N] log_lik;
  array[N] int y_rep;     // posterior predictive
  array[N] int y_prior;   // prior predictive

  // Posterior predictive samples (uses fitted alpha, beta, z)
  for (i in 1:N) {
    real eta = alpha + beta * Xc[i] + z[i];
    log_lik[i] = bernoulli_logit_lpmf(y[i] | eta);
    y_rep[i] = bernoulli_logit_rng(eta);
  }

  // Prior predictive samples: draw alpha, beta, z_prior from prior
  {
    real alpha_prior = student_t_rng(3, 0, 2.5);
    real beta_prior = normal_rng(0, 2);

    vector[Nnodes] z_prior_std;
    for (j in 1:Nnodes)
      z_prior_std[j] = normal_rng(0, 1);

    real mu_prior = normal_rng(0, 2);
    real lambda_prior = lognormal_rng(0, 1);
    real sigma_prior = lognormal_rng(0, 1);

    vector[Nnodes] z_prior;
    z_prior[root_node] = mu_prior + (sigma_prior / sqrt(2 * lambda_prior)) * z_prior_std[root_node];

    for (e in 1:Nedges) {
      int edge_index = Nedges - e + 1;
      int parent = edges[edge_index, 1];
      int child = edges[edge_index, 2];
      real len = edge_lengths[edge_index];
      real decay = exp(-lambda_prior * len);
      real s = sigma_prior * sqrt(-expm1(-2 * lambda_prior * len) / (2 * lambda_prior));
      real mn = mu_prior + (z_prior[parent] - mu_prior) * decay;
      z_prior[child] = mn + s * z_prior_std[child];
    }

    for (i in 1:N) {
      real eta_prior = alpha_prior + beta_prior * Xc[i] + z_prior[i];
      y_prior[i] = bernoulli_logit_rng(eta_prior);
    }
  }
}

"
system.time(stan_model_OU_regression <- stan_model(model_code = stan_model_code_OU_regression))
   user  system elapsed 
 48.893   1.458  47.463 
system.time(
    fit_OU_regression <- sampling(
        stan_model_OU_regression,
        data = stan_data,
        iter = 2000,
        chains = 4,
        control = list(adapt_delta = 0.95, max_treedepth = 15)
    )
)
   user  system elapsed 
 37.399   0.752  14.567 
print(fit_OU_regression, pars = c("mu", "sigma", "lambda", "alpha", "beta"))
Inference for Stan model: anon_model.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

        mean se_mean   sd  2.5%   25%   50%  75% 97.5% n_eff Rhat
mu      0.35    0.03 1.76 -3.15 -0.86  0.37 1.54  3.78  4825    1
sigma   5.00    0.10 4.12  1.07  2.46  3.89 6.18 16.41  1635    1
lambda  0.11    0.00 0.04  0.04  0.08  0.10 0.13  0.20  2172    1
alpha   0.63    0.03 2.04 -3.27 -0.65  0.57 1.83  4.93  4186    1
beta   -0.77    0.03 1.63 -3.91 -1.80 -0.81 0.24  2.54  3540    1

Samples were drawn using NUTS(diag_e) at Sun Nov 16 15:39:31 2025.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).
mcmc_areas(
    as.matrix(fit_OU_regression),
    pars = "beta",
    prob = 0.95
)

predictive checks

We perform the same predictive checks as for the regression model above.

post_ou_regression <- rstan::extract(fit_OU_regression)
predictive_checks_regression(post_ou_regression$y_prior, post_ou_regression$y_rep, stan_data)

This looks perfect: the prior is essentially flat, and the posterior is peaked around the observed value.

model comparison

system.time(
    capture.output({
        log_marginal_likelihood_OU_regression <- bridge_sampler(fit_OU_regression)
    }, file = "/dev/null")
)
print(log_marginal_likelihood_OU_regression)
   user  system elapsed 
  4.905   0.015   1.929 
Bridge sampling estimate of the log marginal likelihood: 132.3693
Estimate obtained in 10 iteration(s) via method "normal".

Now we can compute the Bayes factor between the vanilla regression and the OU phylogenetic regression.

log_marginal_likelihood_OU_regression$logml - log_marginal_likelihood_vanilla_regression$logml
197.163753428719

This is very strong evidence in favor of the OU model.

The HPD interval for \(\beta\) is very broad and includes 0. So there is no evidence for a credible effect of the order of adnominal property words on verb position.

Brownian motion

Next I will fit a logistic regression with a Brownian-motion phylogenetic random effect.

A <- ape::vcv(pruned_tree, corr = TRUE)
d$Language_ID <- factor(d$Language_ID, levels = rownames(A))
system.time(
    fit_brownian_regression <- brm(
    formula = y ~ x + (1 | gr(Language_ID, cov = A)),
    data = d,
    family = bernoulli(link = "logit"),
    prior = c(
        prior(normal(0, 2), class = "b")
    ),
    data2 = list(A = A),
    control = list(adapt_delta = 0.99),
    iter = 2000,
    cores = 4,
    chains = 4,
    save_pars = save_pars(all = TRUE)
    )
)
Compiling Stan program...

Start sampling
   user  system elapsed 
 66.622   2.227  58.459 
summary(fit_brownian_regression)
 Family: bernoulli 
  Links: mu = logit 
Formula: y ~ x + (1 | gr(Language_ID, cov = A)) 
   Data: d (Number of observations: 100) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Multilevel Hyperparameters:
~Language_ID (Number of levels: 100) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     3.83      2.64     1.45     9.95 1.00      634     1040

Regression Coefficients:
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept     0.55      1.42    -2.12     3.45 1.00     1992     2083
x            -0.93      0.79    -2.51     0.61 1.00     4284     2740

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
mcmc_areas(
    as.matrix(fit_brownian_regression),
    pars = "b_x",
    prob = 0.95
)

The HPD interval for the slope includes 0, so there is no credible evidence for an impact of x on y.

predictive checks

With brms, prior predictive simulation cannot be done during model fitting. Instead, we have to fit the model again, this time with sample_prior = "only".

y_rep_brownian_regression <- posterior_predict(fit_brownian_regression)       # matrix: draws × observations

system.time(
    fit_prior <- brm(
    formula = y ~ x + (1 | gr(Language_ID, cov = A)),
    data = d,
    family = bernoulli(),
    data2 = list(A = A),
    prior = c(
        prior(normal(0, 2.5), class = "b")
    ),
    sample_prior = "only",
    control = list(adapt_delta = 0.95),
    cores = 4, chains = 4
    )
)
Compiling Stan program...

Start sampling
   user  system elapsed 
 58.697   2.215  55.196 
y_prior_brownian_regression <- posterior_predict(fit_prior)
predictive_checks_regression(y_prior_brownian_regression, y_rep_brownian_regression, stan_data)

The priors and posteriors look healthy.

model comparison

system.time(
    capture.output({
        log_marginal_likelihood_brownian_regression <- bridge_sampler(fit_brownian_regression)
    }, file = "/dev/null")
)
print(log_marginal_likelihood_brownian_regression)
   user  system elapsed 
  4.097   0.009   1.142 
Bridge sampling estimate of the log marginal likelihood: -55.03535
Estimate obtained in 9 iteration(s) via method "normal".
log_marginal_likelihood_OU_regression$logml - log_marginal_likelihood_brownian_regression$logml
187.404649315818

The OU model is favored over the Brownian-motion model by a wide margin.

Next we look at PSIS-LOOCV.

# extract log-likelihood matrix: draws x observations

log_lik3 <- extract_log_lik(fit_OU_regression, parameter_name = "log_lik")
log_lik4 <- log_lik(fit_brownian_regression)


# compute LOO

loo3 <- loo(log_lik3)
loo4 <- loo(log_lik4)

loo_list$OU_regression <- loo3
loo_list$brownian_regression <- loo4

# compare
loo_compare(loo_list)
Warning message:
“Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
”
Warning message:
“Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
”
A compare.loo: 4 × 8 of type dbl
                     elpd_diff  se_diff  elpd_loo se_elpd_loo     p_loo  se_p_loo     looic  se_looic
OU_regression          0.00000 0.000000 -33.03397    3.831264 24.718425 3.2070885  66.06794  7.662528
brownian_regression  -16.03540 2.983188 -49.06937    6.343331 19.969692 2.8638241  98.13874 12.686661
vanilla_correlation  -29.04074 3.564611 -62.07471    2.278148 27.944938 1.2846688 124.14941  4.556297
vanilla_regression   -33.08811 3.811224 -66.12208    3.149916  2.011269 0.1372601 132.24417  6.299831

So both the Bayes factor and PSIS-LOOCV favor OU over Brownian motion, and both favor the versions with phylogenetic control over the models without it.

So much for the various flavors of regression.

However, I find it hard to wrap my mind around what such a model actually means. It says that there is a latent variable \(\epsilon\) evolving along the tree, and the probability of a language being verb-final depends both on this latent variable and on the adjective-noun order.

A more realistic model, it seems to me, is one where there are

  • a latent variable determining adjective-noun order, and
  • a latent variable determining verb position.

These latent variables evolve simultaneously along the tree. If a dependency between the two observed variables results from diachronic processes, it manifests as a correlation between their diachronic changes.

This can be implemented in an extension of the bivariate correlation model used above.

The model specification is as follows, where \(N\) is the number of languages, and \(t_{l_1, l_2}\) is the length of the path from the root of the tree to the most recent common ancestor of \(l_1\) and \(l_2\):

\[ \begin{align} R &\sim \text{LKJ}(2)\\ \Sigma_{(l_1, k_1), (l_2, k_2)} &= R_{k_1, k_2}\cdot \frac{\sigma_{k_1}\sigma_{k_2}}{\lambda_{k_1}+\lambda_{k_2}}\left(1-e^{-(\lambda_{k_1}+\lambda_{k_2})t_{l_1, l_2}}\right) \\ \mu_i &\sim \mathcal N(0, 2)\\ \sigma_i &\sim \text{Lognormal}(0,1)\\ \lambda_i &\sim \text{Lognormal}(0,1)\\ z &\sim \mathcal N(\text{repeat}(\mu, N), \Sigma)\\ x_i &\sim \text{Bernoulli}(\text{logit}^{-1}(z_{i,1}))\\ y_i &\sim \text{Bernoulli}(\text{logit}^{-1}(z_{i,2})) \end{align} \]
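The \(\Sigma\) entries can be written directly in R. The following sketch (with hypothetical parameter values; `ou_cov_entry` is my own helper name) builds the covariance between latent dimension \(k_1\) in language \(l_1\) and \(k_2\) in \(l_2\) from the formula above:

```r
# OU cross-covariance between (l1, k1) and (l2, k2), following the formula
# above; t is the shared time from the root to the MRCA of l1 and l2
ou_cov_entry <- function(R, sigma, lambda, k1, k2, t) {
  R[k1, k2] * sigma[k1] * sigma[k2] / (lambda[k1] + lambda[k2]) *
    (1 - exp(-(lambda[k1] + lambda[k2]) * t))
}

# Hypothetical parameter values for illustration
R      <- matrix(c(1, -0.3, -0.3, 1), 2, 2)
sigma  <- c(1.5, 2.0)
lambda <- c(0.2, 0.1)

# For large t this approaches the stationary cross-covariance
# R[1,2] * sigma[1] * sigma[2] / (lambda[1] + lambda[2]) = -3
ou_cov_entry(R, sigma, lambda, 1, 2, t = 1e6)
```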

stan_model_code_OU_correlation <- 
"
data {
  int<lower=1> N;                                        // number of tips with data
  int<lower=1> Nnodes;                                   // total number of nodes
  int<lower=1> Nedges;                                   // number of edges in the tree
  array[N] int<lower=0, upper=1> x;                      // first binary response
  array[N] int<lower=0, upper=1> y;                      // second binary response
  array[Nedges, 2] int<lower=1, upper=Nnodes> edges;     // parent → child
  vector<lower=0>[Nedges] edge_lengths;                  // edge lengths
  int<lower=1, upper=Nnodes> root_node;                  // index of root node
}

parameters {
  matrix[Nnodes, 2] z_std;                   // standard-normal latent variables
  vector<lower=0>[2] sigma;                  // OU diffusion parameters
  vector<lower=0>[2] lambda;                 // OU pull strengths
  vector[2] mu;                              // OU stationary means
  cholesky_factor_corr[2] L_std;             // Cholesky factor of correlation matrix
}

transformed parameters {
  matrix[Nnodes, 2] z;     // latent values
  real rho = L_std[2, 1];  // correlation coefficient

  // Root node, drawn from OU stationary distribution
  z[root_node] = (mu + (sigma ./ sqrt(2 * lambda)) .* (L_std * to_vector(z_std[root_node])))';



  // Recursive evolution
  for (e in 1:Nedges) {
    int edge_index = Nedges - e + 1;   // reverse order for recursion, root to tips
    int parent = edges[edge_index, 1];
    int child  = edges[edge_index, 2];
    real len = edge_lengths[edge_index];

    // Vectorized decay and scale
    vector[2] decay = exp(-lambda * len);
    vector[2] s = sigma .* sqrt(-expm1(-2 * lambda * len) ./ (2 * lambda));
    vector[2] mean = mu + (to_vector(z[parent]) - mu) .* decay;
    vector[2] eps  = L_std * z_std[child]';
    z[child] = (mean + s .* eps)';
  }
}

model {
  // Priors
  sigma ~ lognormal(0, 1);
  lambda ~ lognormal(0, 1);
  mu ~ normal(0, 2);
  L_std ~ lkj_corr_cholesky(2.0);
  to_vector(z_std) ~ normal(0, 1);

  // Likelihood
  for (i in 1:N) {
    // x[i] ~ bernoulli_logit(z[i, 1]);
    target += bernoulli_logit_lpmf(x[i] | z[i, 1]);
    // y[i] ~ bernoulli_logit(z[i, 2]);
    target += bernoulli_logit_lpmf(y[i] | z[i, 2]);
  }
}


generated quantities {
  vector[N] log_lik_x;
  vector[N] log_lik_y;
  array[N] int x_rep;
  array[N] int y_rep;
  array[N] int x_prior;
  array[N] int y_prior;

  // Posterior predictive
  for (i in 1:N) {
    x_rep[i] = bernoulli_logit_rng(z[i, 1]);
    y_rep[i] = bernoulli_logit_rng(z[i, 2]);
    log_lik_x[i] = bernoulli_logit_lpmf(x[i] | z[i, 1]);
    log_lik_y[i] = bernoulli_logit_lpmf(y[i] | z[i, 2]);
  }

  // Prior predictive simulation
  {
    vector[2] mu_prior;
    vector[2] sigma_prior;
    vector[2] lambda_prior;
    matrix[2, 2] L_prior;
    matrix[Nnodes, 2] z_std_prior;
    matrix[Nnodes, 2] z_prior;

    for (j in 1:2) {
      mu_prior[j] = normal_rng(0.0, 2.0);
      sigma_prior[j] = lognormal_rng(0.0, 1.0);
      lambda_prior[j] = lognormal_rng(0.0, 1.0);
    }
    L_prior = cholesky_decompose(lkj_corr_rng(2, 2.0));

    for (i in 1:Nnodes)
      z_std_prior[i] = to_row_vector(normal_rng(rep_vector(0.0, 2), rep_vector(1.0, 2)));

    // Root node
    z_prior[root_node] = (mu_prior + (sigma_prior ./ sqrt(2 * lambda_prior)) .* (L_prior * to_vector(z_std_prior[root_node])))';

    // Recursion
    for (e in 1:Nedges) {
      int edge_index = Nedges - e + 1;
      int parent = edges[edge_index, 1];
      int child  = edges[edge_index, 2];
      real len = edge_lengths[edge_index];

      vector[2] decay = exp(-lambda_prior * len);
      vector[2] s = sigma_prior .* sqrt(-expm1(-2 * lambda_prior * len) ./ (2 * lambda_prior));
      vector[2] mean = mu_prior + (to_vector(z_prior[parent]) - mu_prior) .* decay;
      vector[2] eps = L_prior * z_std_prior[child]';
      z_prior[child] = (mean + s .* eps)';
    }

    for (i in 1:N) {
      x_prior[i] = bernoulli_logit_rng(z_prior[i, 1]);
      y_prior[i] = bernoulli_logit_rng(z_prior[i, 2]);
    }
  }

  // Simulate using posterior parameters but fresh noise z_std_prior
  array[N] int x_replicated_noise;
  array[N] int y_replicated_noise;
  {
    matrix[Nnodes, 2] z_std_resample;
    matrix[Nnodes, 2] z_resample;

    for (i in 1:Nnodes)
      z_std_resample[i] = to_row_vector(normal_rng(rep_vector(0.0, 2), rep_vector(1.0, 2)));

    // Root node
    z_resample[root_node] = (mu + (sigma ./ sqrt(2 * lambda)) .* (L_std * to_vector(z_std_resample[root_node])))';

    // Recursion
    for (e in 1:Nedges) {
      int edge_index = Nedges - e + 1;
      int parent = edges[edge_index, 1];
      int child  = edges[edge_index, 2];
      real len = edge_lengths[edge_index];

      vector[2] decay = exp(-lambda * len);
      vector[2] s = sigma .* sqrt(-expm1(-2 * lambda * len) ./ (2 * lambda));
      vector[2] mean = mu + (to_vector(z_resample[parent]) - mu) .* decay;
      vector[2] eps = L_std * z_std_resample[child]';
      z_resample[child] = (mean + s .* eps)';
    }

    for (i in 1:N) {
      x_replicated_noise[i] = bernoulli_logit_rng(z_resample[i, 1]);
      y_replicated_noise[i] = bernoulli_logit_rng(z_resample[i, 2]);
    }
  }

}



"
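The per-edge OU update in the transformed parameters block (the `decay` and `s` lines) can be sanity-checked in plain R: for a long branch, the conditional standard deviation should approach the stationary sd \(\sigma/\sqrt{2\lambda}\). A sketch with made-up parameter values:

```r
# Made-up OU parameters for illustration
sigma  <- 1.5
lambda <- 0.4

# Conditional sd after evolving for a branch of length len
# (mirrors the "s" line in the Stan code above)
ou_step_sd <- function(len) sigma * sqrt(-expm1(-2 * lambda * len) / (2 * lambda))

# Short branch: little variance has accumulated
ou_step_sd(0.01)

# Long branch: approaches the stationary sd sigma / sqrt(2 * lambda)
c(ou_step_sd(100), sigma / sqrt(2 * lambda))
```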

system.time(stan_model_OU_correlation <- stan_model(model_code = stan_model_code_OU_correlation))
   user  system elapsed 
 59.206   1.520  57.629 
system.time(
    fit_OU_correlation <- sampling(
        stan_model_OU_correlation,
        data = stan_data,
        iter = 10000,
        thin = 10,
        chains = 4,
        control = list(adapt_delta = 0.95, max_treedepth = 15)
    )
)
   user  system elapsed 
922.331   0.950 340.447 
print(fit_OU_correlation, pars=c("mu", "sigma", "lambda", "rho"))
Inference for Stan model: anon_model.
4 chains, each with iter=10000; warmup=5000; thin=10; 
post-warmup draws per chain=500, total post-warmup draws=2000.

           mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
mu[1]      1.75    0.03 1.09  0.02  1.05  1.61  2.37  4.21  1860    1
mu[2]      0.67    0.03 1.24 -1.77 -0.10  0.62  1.41  3.23  2083    1
sigma[1]   3.29    0.06 2.44  0.74  1.80  2.71  4.06  9.23  1857    1
sigma[2]   4.16    0.08 3.45  0.90  2.05  3.18  5.11 13.02  1898    1
lambda[1]  0.22    0.01 0.50  0.04  0.11  0.16  0.24  0.58  1901    1
lambda[2]  0.11    0.00 0.04  0.04  0.08  0.10  0.13  0.20  2087    1
rho       -0.27    0.00 0.21 -0.65 -0.41 -0.28 -0.13  0.13  1793    1

Samples were drawn using NUTS(diag_e) at Sun Nov 16 13:38:17 2025.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).
mcmc_areas(
    as.matrix(fit_OU_correlation),
    pars = "rho",
    prob = 0.95
)

Since the HPD of \(\rho\) includes 0, we conclude that there is no credible evidence for a diachronic correlation between the two variables.

predictive checks

post_OU_correlation <- rstan::extract(fit_OU_correlation)

predictive_check_marginal_binary(
  x_prior = post_OU_correlation$x_prior,
  x_rep = post_OU_correlation$x_rep,
  x_obs = stan_data$x,
  varname = "x"
)

predictive_checks_regression(post_OU_correlation$y_prior, post_OU_correlation$y_rep, stan_data)

The prior and posterior simulations look okay.

It seems tempting to also use the posterior predictive simulation to estimate the equilibrium probabilities. However, this would be wrong, because the posterior predictive simulation also uses the posterior of the latent variable z. Instead, we have to combine the posterior of the other parameters with freshly simulated values of z, obtaining posterior predictions for a replicate of the tree that disregards its actual history.

I added a block to this effect in the Stan code, producing x_replicated_noise and y_replicated_noise.

x_replicated_noise = post_OU_correlation$x_replicated_noise
y_replicated_noise = post_OU_correlation$y_replicated_noise


# Create a function to compute joint probabilities for one sample
joint_freqs <- function(x, y) {
  tab <- table(factor(x, levels = 0:1), factor(y, levels = 0:1))
  as.vector(tab) / length(x)  # column-major proportions, in order: (0,0), (1,0), (0,1), (1,1)
}
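Because `as.vector` flattens the x-by-y table column-major, the output order is (x,y) = (0,0), (1,0), (0,1), (1,1). A quick toy check:

```r
joint_freqs <- function(x, y) {
  tab <- table(factor(x, levels = 0:1), factor(y, levels = 0:1))
  as.vector(tab) / length(x)  # column-major: (0,0), (1,0), (0,1), (1,1)
}

# (0,0) once, (1,0) twice, (1,1) once, (0,1) never
res <- joint_freqs(c(0, 1, 1, 1), c(0, 0, 0, 1))
res
# -> 0.25 0.50 0.00 0.25
```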

# Apply over all 2000 posterior samples
joint_probs <- t(mapply(joint_freqs, 
                        split(x_replicated_noise, row(x_replicated_noise)), 
                        split(y_replicated_noise, row(y_replicated_noise))))

colnames(joint_probs) <- c("(0,0)", "(1,0)", "(0,1)", "(1,1)")

x_range = c(0,1)
df <- as.data.frame(joint_probs) %>%
  mutate(draw = 1:n()) %>%
  pivot_longer(-draw, names_to = "Combination", values_to = "Probability")

ggplot(df, aes(x = Probability, fill = Combination)) +
  geom_density(alpha = 0.6) +
  facet_wrap(~ Combination, scales = "fixed") +  # enforce same scale
  scale_x_continuous(limits = x_range) +         # set common x-axis limits
  theme_minimal() +
  labs(title = "Equilibrium Distribution of Joint Outcomes",
       x = "Estimated Probability",
       y = "Density")

model comparison

Bayes factor:

system.time(
    capture.output({
        log_marginal_likelihood_OU_correlation <- bridge_sampler(
            fit_OU_correlation, silent = TRUE, maxiter = 10000
        )
    }, file = "/dev/null"
    )
)
print(log_marginal_likelihood_OU_correlation)
Warning message:
“logml could not be estimated within maxiter, rerunning with adjusted starting value. 
Estimate might be more variable than usual.”
   user  system elapsed 
 76.032   0.116  72.267 
Bridge sampling estimate of the log marginal likelihood: 264.3048
Estimate obtained in 10001 iteration(s) via method "normal".
log_marginal_likelihood_OU_correlation$logml - log_marginal_likelihood_vanilla_corr$logml
208.347436955044

Again we find overwhelming evidence in favor of the phylogenetic model.

We can complement model comparison via Bayes Factor with Pareto-smoothed leave-one-out cross-validation.

# extract log-likelihood matrix: draws x observations

log_lik5 <- extract_log_lik(fit_OU_correlation, parameter_name = "log_lik_y")


# compute LOO

loo5 <- loo(log_lik5)

loo_list$OU_correlation <- loo5

# compare
loo_compare(loo_list)
Warning message:
“Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
”
A compare.loo: 5 × 8 of type dbl
                       elpd_diff  se_diff  elpd_loo se_elpd_loo     p_loo  se_p_loo     looic  se_looic
OU_correlation        0.00000000 0.000000 -32.95478    3.754743 22.998565 2.9770741  65.90957  7.509485
OU_regression        -0.07918904 1.051616 -33.03397    3.831264 24.718425 3.2070885  66.06794  7.662528
brownian_regression -14.85402549 3.109386 -47.80881    6.247695 19.809112 2.8749336  95.61762 12.495390
vanilla_correlation -29.11992435 3.385216 -62.07471    2.278148 27.944938 1.2846688 124.14941  4.556297
vanilla_regression  -33.16730118 3.577467 -66.12208    3.149916  2.011269 0.1372601 132.24417  6.299831

The three phylogenetic models perform massively better than the vanilla models. Furthermore, if our goal is to predict the verb position of an unseen language from its noun-adjective order and its position in the tree, the OU regression model performs essentially on a par with the correlation model (the elpd difference is far smaller than its standard error).

As a nice side effect of the explicit modeling of the diachronic histories, we can plot the posterior means of the latent variables at the internal nodes, thereby visualizing the evolution of the latent variables.

posterior_df <- as_draws_df(fit_OU_correlation)
z_posterior <- posterior_df %>%
  spread_draws(z[i, j]) %>%
  group_by(i, j) %>%
  summarise(mean_z = mean(z), .groups = "drop")
z_adjective <- z_posterior %>%
    filter(j == 1) %>%
    pull(mean_z)
z_vo <- z_posterior %>%
    filter(j == 2) %>%
    pull(mean_z)
pruned_tree$node.label <- as.character(
    (length(pruned_tree$tip.label) + 1):(length(pruned_tree$tip.label) + pruned_tree$Nnode)
)


tree_plot <- ggtree(pruned_tree)
tree_info <- tree_plot$data
tree_info$node.label <- as.character(tree_info$label)
Warning message in fortify(data, ...):

“Arguments in `...` must be used.

✖ Problematic arguments:

• as.Date = as.Date

• yscale_mapping = yscale_mapping

• hang = hang

ℹ Did you misspell an argument name?”
for (i in 1:nrow(tree_info)) {
    if (tree_info$isTip[i]) {
        # If it's a tip, get the corresponding Glottocode
        glottocode <- tree_info$label[i]
        # Find the index of the Glottocode among the tip labels
        index <- which(pruned_tree$tip.label == glottocode)
        # Assign the Affix value to the node label
        tree_info$node.label[i] <- index
    } else {
        # If it's not a tip, copy label
        tree_info$node.label[i] <- as.integer(tree_info$node.label[i])
    }
}


node_data <- tibble(
  node.label = as.character(1:length(z_adjective)),
  z_adjective = z_adjective,
  z_vo = z_vo
)


tree_data <- tree_info %>%
  left_join(node_data, by = "node.label")

tree_data <- tree_data %>%
    left_join(
        rename(select(d, Language_ID, GB193, GB133),
        label=Language_ID)
    ) 
tree_data <- tree_data %>%
  mutate(GB193 = factor(GB193)) %>%
  mutate(GB133 = factor(GB133))
Joining with `by = join_by(label)`
custom_palette <- paletteer_c("scico::roma", 30)
p1 <- ggtree(pruned_tree, layout = "circular", branch.length = "none") %<+% tree_data +
  geom_tree(aes(color = plogis(z_adjective)), size = 1) +
  scale_color_gradientn(colors = custom_palette, name = "p(NAdj)") +
  ggnewscale::new_scale_color() +
  geom_tippoint(aes(color = GB193), size = 3) +
  scale_color_manual(values = c("1" = custom_palette[1], "2" = custom_palette[30])) +
  guides(color = guide_legend(override.aes = list(size = 4))) +
  theme(
    legend.position = "left",
    plot.margin = margin(0, 0, 0, 0),
    text = element_text(size = 18),
    legend.text = element_text(size = 18)
  )

Warning message:

“Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.

ℹ Please use `linewidth` instead.

ℹ The deprecated feature was likely used in the ggtree package.

  Please report the issue at <https://github.com/YuLab-SMU/ggtree/issues>.”
p2 <- ggtree(pruned_tree, layout = "circular", branch.length = "none") %<+% tree_data +
  geom_tree(aes(color = plogis(z_vo)), size = 1) +
  scale_color_gradientn(colors = custom_palette, name = "p(VO)") +
  ggnewscale::new_scale_color() +
  geom_tippoint(aes(color = GB133), size = 3) +
  scale_color_manual(values = c("0" = custom_palette[1], "1" = custom_palette[30])) +
  guides(color = guide_legend(override.aes = list(size = 4))) +
  theme(
    plot.margin = margin(0, 0, 0, 0),
    text = element_text(size = 18),
    legend.text = element_text(size = 18)
  )
options(repr.plot.width = 18, repr.plot.height = 10) 
plot_grid(p1, p2, ncol = 2, align = "hv", axis = "tb", rel_widths = c(1, 1))

options(repr.plot.width = 12, repr.plot.height = 8)  # For base R plots in Jupyter

Another advantage of this approach (both the regression and the correlation variant) is that it can easily be combined with family-level or macroarea-level random effects; geographic random effects can be added as well.

More details can be found in my arXiv paper Computational Typology (https://arxiv.org/abs/2504.15642).

Pagel & Meade style test for correlation

The test for correlation using CTMC, as developed by Pagel & Meade (2006) and implemented in BayesTraits, can also be implemented in Stan, using Felsenstein’s pruning algorithm.
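Before writing this in Stan, the pruning recursion itself is perhaps easiest to digest in plain R. The following is a minimal sketch, not the model below: a single binary character (rather than the four-state pair), a made-up toy tree, and made-up rates, using the closed-form two-state transition matrix. The check at the end confirms that the tip-configuration likelihoods form a proper probability distribution.

```r
# Closed-form 2-state CTMC transition matrix; rows = "from" state
ctmc_P <- function(t, q01, q10) {
  q   <- q01 + q10
  e   <- exp(-q * t)
  pi0 <- q10 / q
  pi1 <- q01 / q
  matrix(c(pi0 + pi1 * e, pi1 * (1 - e),
           pi0 * (1 - e), pi1 + pi0 * e),
         nrow = 2, byrow = TRUE)
}

# Felsenstein pruning on the toy tree ((A:1,B:1):1,C:2)
prune_loglik <- function(states, q01, q10) {
  # nodes: 1 = A, 2 = B, 3 = C (tips); 4 = (A,B); 5 = root
  edges <- rbind(c(4, 1), c(4, 2), c(5, 4), c(5, 3))  # parent, child
  len   <- c(1, 1, 1, 2)
  L <- matrix(1, 5, 2)                  # partial likelihoods
  for (i in 1:3) L[i, ] <- as.numeric(0:1 == states[i])
  for (e in 1:4) {                      # children are processed before parents
    p  <- edges[e, 1]
    ch <- edges[e, 2]
    L[p, ] <- L[p, ] * as.vector(ctmc_P(len[e], q01, q10) %*% L[ch, ])
  }
  pi <- c(q10, q01) / (q01 + q10)       # stationary root distribution
  log(sum(pi * L[5, ]))
}

# Sanity check: the likelihoods of all 2^3 tip configurations sum to 1
configs <- expand.grid(A = 0:1, B = 0:1, C = 0:1)
total <- sum(apply(configs, 1,
                   function(s) exp(prune_loglik(as.numeric(s), 0.3, 0.5))))
total
```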

stan_model_code_ctmc_independent = "

data {
    int<lower=1> N;
    int<lower=1> Nnodes;
    int<lower=1> Nedges;
    array[N] int<lower=0, upper=1> x;
    array[N] int<lower=0, upper=1> y;
    array[Nedges, 2] int<lower=1, upper=Nnodes> edges;
    vector<lower=0>[Nedges] edge_lengths;
    int<lower=1> root_node;
}

parameters {
    simplex[4] rel_rates;
    real<lower=0> total_rate;
}

transformed parameters {
    matrix[4, 4] Q = rep_matrix(0.0, 4, 4);

    Q[1, 2] = total_rate * rel_rates[1];
    Q[2, 1] = total_rate * rel_rates[2];
    Q[1, 3] = total_rate * rel_rates[3];
    Q[3, 1] = total_rate * rel_rates[4];
    Q[2, 4] = total_rate * rel_rates[3];
    Q[4, 2] = total_rate * rel_rates[4];
    Q[3, 4] = total_rate * rel_rates[1];
    Q[4, 3] = total_rate * rel_rates[2];

    for (i in 1:4) {
        Q[i, i] = -sum(Q[i, ]);
    }

    // stationary-ish distribution via long-time transition
    matrix[4, 4] P = matrix_exp(Q * 1e3);
    vector[4] stat_dist = to_vector(P[1]);
    stat_dist /= sum(stat_dist);

    matrix[Nnodes, 4] loglikelihood;
    loglikelihood = rep_matrix(0.0, Nnodes, 4);

    // tip likelihoods
    for (i in 1:N) {
        int idx = 2 * x[i] + y[i] + 1;
        loglikelihood[i] = rep_row_vector(negative_infinity(), 4);
        loglikelihood[i, idx] = 0;
    }

    // pruning recursion
    for (e in 1:Nedges) {
        int parent = edges[e, 1];
        int child = edges[e, 2];
        real t = edge_lengths[e];
        matrix[4, 4] Plocal = matrix_exp(Q * t);

        for (k in 1:4) {
            loglikelihood[parent, k] += log_sum_exp(
                to_vector(log(Plocal[k]) + loglikelihood[child])
            );
        }
    }
    real ll = log_sum_exp(loglikelihood[root_node] + log(stat_dist)');

}

model {
    // priors
    rel_rates ~ dirichlet(rep_vector(1.0, 4));
    total_rate ~ lognormal(-1, 0.5);

    target += ll;
}

generated quantities {
    real total_rate_prior = lognormal_rng(-1, 0.5);
    vector[4] rel_rates_prior = dirichlet_rng(rep_vector(1.0, 4));
    matrix[4, 4] Qprior = rep_matrix(0.0, 4, 4);

    Qprior[1, 2] = total_rate_prior * rel_rates_prior[1];
    Qprior[2, 1] = total_rate_prior * rel_rates_prior[2];
    Qprior[1, 3] = total_rate_prior * rel_rates_prior[3];
    Qprior[3, 1] = total_rate_prior * rel_rates_prior[4];
    Qprior[2, 4] = total_rate_prior * rel_rates_prior[3];
    Qprior[4, 2] = total_rate_prior * rel_rates_prior[4];
    Qprior[3, 4] = total_rate_prior * rel_rates_prior[1];
    Qprior[4, 3] = total_rate_prior * rel_rates_prior[2];
    
    for (i in 1:4) {
        Qprior[i, i] = -sum(Qprior[i, ]);
    }

    matrix[4, 4] Pprior = matrix_exp(Qprior * 1e3);
    vector[4] stat_dist_prior = to_vector(Pprior[1]);
    stat_dist_prior /= sum(stat_dist_prior);
    
    array[Nnodes] int<lower=1, upper=4> sim_states;
    sim_states[root_node] = categorical_rng(stat_dist_prior);
    
    // Loop backward through post-ordered edges
    for (i in 1:Nedges) {
        int e = Nedges - i + 1;
        int parent = edges[e, 1];
        int child = edges[e, 2];
        real t = edge_lengths[e];
        matrix[4, 4] Plocal = matrix_exp(Qprior * t);
        sim_states[child] = categorical_rng(to_vector(Plocal[sim_states[parent]]));
    }
    
    array[N] int<lower=0, upper=1> x_prior;
    array[N] int<lower=0, upper=1> y_prior;
    for (i in 1:N) {
        int state = sim_states[i] - 1;
        x_prior[i] = state / 2;
        y_prior[i] = state % 2;
    }

    
    // Posterior predictive for each tip
    array[N] real y_log_lik;  // Pointwise log-likelihood
    vector[4] ll_loo;
    array[N] int<lower=0, upper=1> x_rep;
    array[N] int<lower=0, upper=1> y_rep;
    for (tip in 1:N) {
        for (s in 1:4) {
            // Recompute likelihood WITHOUT this tip
            matrix[Nnodes, 4] loglik_loo = rep_matrix(0.0, Nnodes, 4);
            
            // Set tip likelihoods (excluding tip 'tip')
            for (i in 1:N) {
                loglik_loo[i] = rep_row_vector(negative_infinity(), 4);
                if (i != tip) {
                    int idx = 2 * x[i] + y[i] + 1;
                    loglik_loo[i, idx] = 0;
                } else {
                    loglik_loo[i, s] = 0;
                }
            }
            
            // Pruning recursion (same as model block)
            for (e in 1:Nedges) {
                int parent = edges[e, 1];
                int child = edges[e, 2];
                real t = edge_lengths[e];
                matrix[4, 4] Plocal = matrix_exp(Q * t);
                
                for (k in 1:4) {
                    loglik_loo[parent, k] += log_sum_exp(
                        to_vector(log(Plocal[k]) + loglik_loo[child])
                    );
                }
            }
            
            ll_loo[s] = log_sum_exp(loglik_loo[root_node] + log(stat_dist)');
        }
        int observed_state = 2 * x[tip] + y[tip] + 1;
        int same_x_state = 2 * x[tip] + (1 - y[tip]) + 1;
        y_log_lik[tip] = ll_loo[observed_state] - log_sum_exp(ll_loo[observed_state] , ll_loo[same_x_state]);
        int s_rep = categorical_rng(softmax(ll_loo));
        x_rep[tip] = (s_rep - 1) / 2;
        y_rep[tip] = (s_rep - 1) % 2;
    }
}

"
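The Q matrix and its stationary distribution can be checked in R. The sketch below (with hypothetical rates) rebuilds the independent-model rate matrix from above and solves \(\pi Q = 0\) directly instead of mimicking the long-time matrix-exponential trick:

```r
# Independent-model rate matrix over the four state combinations
# (1 = (0,0), 2 = (0,1), 3 = (1,0), 4 = (1,1)); rates are hypothetical
rel_rates  <- c(0.2, 0.2, 0.4, 0.2)
total_rate <- 0.3
r <- total_rate * rel_rates

Q <- matrix(0, 4, 4)
Q[1, 2] <- r[1]; Q[2, 1] <- r[2]   # y flips while x = 0
Q[1, 3] <- r[3]; Q[3, 1] <- r[4]   # x flips while y = 0
Q[2, 4] <- r[3]; Q[4, 2] <- r[4]   # x flips while y = 1
Q[3, 4] <- r[1]; Q[4, 3] <- r[2]   # y flips while x = 1
diag(Q) <- -rowSums(Q)

# Stationary distribution: solve pi Q = 0 subject to sum(pi) = 1
A <- rbind(t(Q), rep(1, 4))
stat <- qr.solve(A, c(0, 0, 0, 0, 1))
stat

# Since the two characters evolve independently here, stat factors into the
# per-character stationary distributions, e.g. stat[4] = 2/3 * 1/2 = 1/3
```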

system.time(stan_model_ctmc_independent <- stan_model(model_code= stan_model_code_ctmc_independent))
   user  system elapsed 
 63.513   1.487  61.832 
system.time(
    fit_ctmc_independent <- sampling(
        stan_model_ctmc_independent,
        data = stan_data,
        iter = 2000,
        chains = 4
    )
)
    user   system  elapsed 
1319.705    1.671  427.752 
print(fit_ctmc_independent, pars=c("stat_dist", "rel_rates", "total_rate"))
Inference for Stan model: anon_model.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

             mean se_mean   sd 2.5%  25%  50%  75% 97.5% n_eff Rhat
stat_dist[1] 0.15       0 0.04 0.08 0.12 0.15 0.18  0.25  4125    1
stat_dist[2] 0.19       0 0.05 0.11 0.16 0.19 0.22  0.30  3344    1
stat_dist[3] 0.29       0 0.06 0.18 0.25 0.29 0.33  0.41  3905    1
stat_dist[4] 0.36       0 0.07 0.24 0.32 0.36 0.41  0.49  4365    1
rel_rates[1] 0.22       0 0.07 0.10 0.17 0.21 0.27  0.37  2298    1
rel_rates[2] 0.18       0 0.06 0.07 0.13 0.17 0.21  0.31  2547    1
rel_rates[3] 0.40       0 0.09 0.21 0.33 0.40 0.46  0.58  2110    1
rel_rates[4] 0.21       0 0.05 0.12 0.17 0.21 0.24  0.31  3192    1
total_rate   0.27       0 0.08 0.16 0.22 0.26 0.31  0.45  2025    1

Samples were drawn using NUTS(diag_e) at Sun Nov 16 13:47:51 2025.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

predictive checks

post_ctmc_independent <- rstan::extract(fit_ctmc_independent)
predictive_check_marginal_binary(
  x_prior = post_ctmc_independent$x_prior,
  x_rep = post_ctmc_independent$x_rep,
  x_obs = stan_data$x,
  varname = "x"
)

predictive_checks_regression(post_ctmc_independent$y_prior, post_ctmc_independent$y_rep, stan_data)

The predictive checks indicate that the prior is already quite informative (which is bad). The posterior is still shifted towards the empirical value.

system.time(
    capture.output({
        log_marginal_likelihood_ctmc_independent <- bridge_sampler(fit_ctmc_independent)
    }, file = "/dev/null")
)
print(log_marginal_likelihood_ctmc_independent)
   user  system elapsed 
 12.846   0.018  10.964 
Bridge sampling estimate of the log marginal likelihood: -106.5887
Estimate obtained in 5 iteration(s) via method "normal".

stan_model_code_ctmc_dependent <- "
    data {
    int<lower=1> N;
    int<lower=1> Nnodes;
    int<lower=1> Nedges;
    array[N] int<lower=0, upper=1> x;
    array[N] int<lower=0, upper=1> y;
    array[Nedges, 2] int<lower=1, upper=Nnodes> edges;
    vector<lower=0>[Nedges] edge_lengths;
    int<lower=1> root_node;
}

parameters {
    simplex[8] rel_rates;
    real<lower=0> total_rate;
}

transformed parameters {
    matrix[4, 4] Q = rep_matrix(0.0, 4, 4);

    Q[1, 2] = total_rate * rel_rates[1];
    Q[2, 1] = total_rate * rel_rates[2];
    Q[1, 3] = total_rate * rel_rates[3];
    Q[3, 1] = total_rate * rel_rates[4];
    Q[2, 4] = total_rate * rel_rates[5];
    Q[4, 2] = total_rate * rel_rates[6];
    Q[3, 4] = total_rate * rel_rates[7];
    Q[4, 3] = total_rate * rel_rates[8];

    for (i in 1:4) {
        Q[i, i] = -sum(Q[i, ]);
    }

    // stationary-ish distribution via long-time transition
    matrix[4, 4] P = matrix_exp(Q * 1e3);
    vector[4] stat_dist = to_vector(P[1]);
    stat_dist /= sum(stat_dist);

    matrix[Nnodes, 4] loglikelihood;
    loglikelihood = rep_matrix(0.0, Nnodes, 4);

    // tip likelihoods
    for (i in 1:N) {
        int idx = 2 * x[i] + y[i] + 1;
        loglikelihood[i] = rep_row_vector(negative_infinity(), 4);
        loglikelihood[i, idx] = 0;
    }

    // pruning recursion
    for (e in 1:Nedges) {
        int parent = edges[e, 1];
        int child = edges[e, 2];
        real t = edge_lengths[e];
        matrix[4, 4] Plocal = matrix_exp(Q * t);

        for (k in 1:4) {
            loglikelihood[parent, k] += log_sum_exp(
                to_vector(log(Plocal[k]) + loglikelihood[child])
            );
        }
    }
    real ll = log_sum_exp(loglikelihood[root_node] + log(stat_dist)');

}

model {
    // priors
    rel_rates ~ dirichlet(rep_vector(1.0, 8));
    total_rate ~ lognormal(-1, 0.5);

    target += ll;
}

generated quantities {
    real total_rate_prior = lognormal_rng(-1, 0.5);
    vector[8] rel_rates_prior = dirichlet_rng(rep_vector(1.0, 8));
    matrix[4, 4] Qprior = rep_matrix(0.0, 4, 4);

    Qprior[1, 2] = total_rate_prior * rel_rates_prior[1];
    Qprior[2, 1] = total_rate_prior * rel_rates_prior[2];
    Qprior[1, 3] = total_rate_prior * rel_rates_prior[3];
    Qprior[3, 1] = total_rate_prior * rel_rates_prior[4];
    Qprior[2, 4] = total_rate_prior * rel_rates_prior[5];
    Qprior[4, 2] = total_rate_prior * rel_rates_prior[6];
    Qprior[3, 4] = total_rate_prior * rel_rates_prior[7];
    Qprior[4, 3] = total_rate_prior * rel_rates_prior[8];
    
    for (i in 1:4) {
        Qprior[i, i] = -sum(Qprior[i, ]);
    }

    matrix[4, 4] Pprior = matrix_exp(Qprior * 1e3);
    vector[4] stat_dist_prior = to_vector(Pprior[1]);
    stat_dist_prior /= sum(stat_dist_prior);
    
    array[Nnodes] int<lower=1, upper=4> sim_states;
    sim_states[root_node] = categorical_rng(stat_dist_prior);
    
    // Loop backward through post-ordered edges
    for (i in 1:Nedges) {
        int e = Nedges - i + 1;
        int parent = edges[e, 1];
        int child = edges[e, 2];
        real t = edge_lengths[e];
        matrix[4, 4] Plocal = matrix_exp(Qprior * t);
        sim_states[child] = categorical_rng(to_vector(Plocal[sim_states[parent]]));
    }
    
    array[N] int<lower=0, upper=1> x_prior;
    array[N] int<lower=0, upper=1> y_prior;
    for (i in 1:N) {
        int state = sim_states[i] - 1;
        x_prior[i] = state / 2;
        y_prior[i] = state % 2;
    }

    
    // Posterior predictive for each tip
    array[N] real y_log_lik;  // Pointwise log-likelihood
    vector[4] ll_loo;
    array[N] int<lower=0, upper=1> x_rep;
    array[N] int<lower=0, upper=1> y_rep;
    for (tip in 1:N) {
        for (s in 1:4) {
            // Recompute likelihood WITHOUT this tip
            matrix[Nnodes, 4] loglik_loo = rep_matrix(0.0, Nnodes, 4);
            
            // Set tip likelihoods (excluding tip 'tip')
            for (i in 1:N) {
                loglik_loo[i] = rep_row_vector(negative_infinity(), 4);
                if (i != tip) {
                    int idx = 2 * x[i] + y[i] + 1;
                    loglik_loo[i, idx] = 0;
                } else {
                    loglik_loo[i, s] = 0;
                }
            }
            
            // Pruning recursion (same as model block)
            for (e in 1:Nedges) {
                int parent = edges[e, 1];
                int child = edges[e, 2];
                real t = edge_lengths[e];
                matrix[4, 4] Plocal = matrix_exp(Q * t);
                
                for (k in 1:4) {
                    loglik_loo[parent, k] += log_sum_exp(
                        to_vector(log(Plocal[k]) + loglik_loo[child])
                    );
                }
            }
            
            ll_loo[s] = log_sum_exp(loglik_loo[root_node] + log(stat_dist)');
        }
        int observed_state = 2 * x[tip] + y[tip] + 1;
        int same_x_state = 2 * x[tip] + (1 - y[tip]) + 1;
        y_log_lik[tip] = ll_loo[observed_state] - log_sum_exp(ll_loo[observed_state], ll_loo[same_x_state]);
        int s_rep = categorical_rng(softmax(ll_loo));
        x_rep[tip] = (s_rep - 1) / 2;
        y_rep[tip] = (s_rep - 1) % 2;
    }
}

"
system.time(    
    stan_model_ctmc_dependent <- stan_model(model_code = stan_model_code_ctmc_dependent)
)
   user  system elapsed 
 64.618   1.462  62.203 
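As an aside, the four-state encoding that the Stan program uses throughout (state index idx = 2 * x + y + 1, decoded with integer division and modulo) round-trips cleanly. A quick check — in Python, purely for illustration:

```python
# Check the (x, y) <-> state encoding used in the Stan program above:
# idx = 2 * x + y + 1 maps (x, y) in {0,1}^2 to states 1..4;
# decoding uses x = (idx - 1) // 2 and y = (idx - 1) % 2.
for x in (0, 1):
    for y in (0, 1):
        idx = 2 * x + y + 1
        assert 1 <= idx <= 4
        state = idx - 1
        assert (state // 2, state % 2) == (x, y)
print("encoding round-trips for all four (x, y) pairs")
```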
system.time(
    fit_ctmc_dependent <- sampling(
        stan_model_ctmc_dependent,
        data = stan_data,
        iter = 2000,
        chains = 4
    )
)
    user   system  elapsed 
1529.162    1.601  404.243 
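The pruning recursion in the model block can be sanity-checked outside Stan. The following Python sketch — a hypothetical three-tip tree with made-up rates, branch lengths, and tip states — compares Felsenstein pruning in log space against brute-force summation over the internal node states:

```python
import numpy as np
from itertools import product
from scipy.linalg import expm

# Hypothetical 3-tip tree in postorder: node 4 is the parent of tips 2 and 3,
# node 5 (the root) is the parent of tip 1 and node 4. All rates, branch
# lengths, and tip states below are made up for illustration.
Q = np.array([[-0.4, 0.4],
              [0.6, -0.6]])            # 2-state rate matrix
pi = np.array([0.6, 0.4])              # root state distribution
tips = {1: 0, 2: 1, 3: 0}              # observed tip states
edges = [(4, 2, 0.3), (4, 3, 0.8), (5, 1, 0.5), (5, 4, 0.2)]  # (parent, child, length)

# Felsenstein pruning in log space, mirroring the Stan recursion.
loglik = np.zeros((6, 2))              # row 0 unused; internal rows start at 0
for tip, s in tips.items():
    loglik[tip] = -np.inf
    loglik[tip, s] = 0.0
for parent, child, t in edges:
    P = expm(Q * t)
    for k in range(2):
        loglik[parent, k] += np.logaddexp.reduce(np.log(P[k]) + loglik[child])
ll_pruning = np.logaddexp.reduce(loglik[5] + np.log(pi))

# Brute force: sum over all joint states of root (r) and internal node (a).
Pmat = {t: expm(Q * t) for (_, _, t) in edges}
total = sum(pi[r] * Pmat[0.5][r, tips[1]] * Pmat[0.2][r, a]
            * Pmat[0.3][a, tips[2]] * Pmat[0.8][a, tips[3]]
            for r, a in product(range(2), repeat=2))
print(ll_pruning, np.log(total))       # the two values agree
```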
print(fit_ctmc_dependent, pars=c("total_rate", "rel_rates", "stat_dist"))
Inference for Stan model: anon_model.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

             mean se_mean   sd 2.5%  25%  50%  75% 97.5% n_eff Rhat
total_rate   0.47       0 0.13 0.26 0.37 0.45 0.54  0.78  2592    1
rel_rates[1] 0.10       0 0.08 0.00 0.04 0.08 0.14  0.28  2711    1
rel_rates[2] 0.07       0 0.05 0.00 0.03 0.06 0.09  0.18  2727    1
rel_rates[3] 0.29       0 0.12 0.07 0.21 0.29 0.37  0.52  2441    1
rel_rates[4] 0.11       0 0.04 0.04 0.08 0.10 0.13  0.20  3812    1
rel_rates[5] 0.10       0 0.07 0.00 0.04 0.09 0.14  0.26  1887    1
rel_rates[6] 0.09       0 0.04 0.02 0.05 0.08 0.11  0.19  2850    1
rel_rates[7] 0.13       0 0.05 0.05 0.09 0.13 0.16  0.24  3275    1
rel_rates[8] 0.12       0 0.05 0.03 0.08 0.12 0.15  0.24  3397    1
stat_dist[1] 0.13       0 0.05 0.06 0.10 0.12 0.15  0.24  4914    1
stat_dist[2] 0.25       0 0.07 0.13 0.20 0.24 0.29  0.41  4922    1
stat_dist[3] 0.31       0 0.07 0.19 0.26 0.31 0.36  0.45  5103    1
stat_dist[4] 0.31       0 0.07 0.19 0.26 0.31 0.36  0.46  5406    1

Samples were drawn using NUTS(diag_e) at Sun Nov 16 15:30:33 2025.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).
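The generated quantities block obtains the stationary distribution by pushing the chain to near-equilibrium with matrix_exp(Qprior * 1e3). The same quantity can be computed as the normalized left null vector of Q; a minimal Python sketch with an arbitrary, made-up 2-state rate matrix shows that the two routes agree:

```python
import numpy as np
from scipy.linalg import expm, null_space

# Arbitrary 2-state rate matrix (rows sum to zero); the rates are made up.
Q = np.array([[-0.3, 0.3],
              [0.7, -0.7]])

# Route 1: run the chain for a long time, as in the Stan code.
P_long = expm(Q * 1e3)
stat_via_expm = P_long[0] / P_long[0].sum()

# Route 2: left null vector of Q (pi @ Q = 0), normalized to sum to 1.
v = null_space(Q.T)[:, 0]
stat_via_null = v / v.sum()

print(stat_via_expm, stat_via_null)   # both ≈ [0.7, 0.3]
```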

Predictive checks

post_ctmc_dependent <- rstan::extract(fit_ctmc_dependent)
predictive_check_marginal_binary(
  x_prior = post_ctmc_dependent$x_prior,
  x_rep = post_ctmc_dependent$x_rep,
  x_obs = stan_data$x,
  varname = "x"
)

predictive_checks_regression(post_ctmc_dependent$y_prior, post_ctmc_dependent$y_rep, stan_data)

Here the prior is flat. Oddly, there seems to be no difference in the distributions for x=0 and x=1.

Model comparison

system.time(
    capture.output({
        log_marginal_likelihood_ctmc_dependent <- bridge_sampler(fit_ctmc_dependent)
    }, file = "/dev/null")
)
print(log_marginal_likelihood_ctmc_dependent)
   user  system elapsed 
 15.159   0.034  13.283 
Bridge sampling estimate of the log marginal likelihood: -112.9313
Estimate obtained in 5 iteration(s) via method "normal".

And now the Bayes Factor:

(log_marginal_likelihood_ctmc_independent$logml - log_marginal_likelihood_ctmc_dependent$logml)
6.34258152206782

A log Bayes Factor of about 6.3 thus constitutes strong evidence in favor of the independent model.
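On the natural scale, a log Bayes Factor of about 6.34 corresponds to a factor of roughly 570 in favor of the independent model; converting is a one-liner:

```python
import math

# Difference of the two bridge-sampling log marginal likelihoods from above.
log_bf = 6.34258152206782
bf = math.exp(log_bf)
print(round(bf))  # ≈ 568
```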

In this connection it is worth pointing out that the OU/logistic model provides a massively better fit than the CTMC model.

(log_marginal_likelihood_OU_correlation$logml - log_marginal_likelihood_ctmc_independent$logml)
370.893464994719

Now let’s do PSIS-LOOCV for all models.

# extract log-likelihood matrix: draws x observations
log_lik6 <- extract_log_lik(fit_ctmc_independent, parameter_name = "y_log_lik")
log_lik7 <- extract_log_lik(fit_ctmc_dependent, parameter_name = "y_log_lik")


# compute LOO
loo6 <- loo(log_lik6)
loo7 <- loo(log_lik7)

loo_list$ctmc_independent <- loo6
loo_list$ctmc_dependent <- loo7

# compare
loo_compare(loo_list)
A compare.loo: 7 × 8 of type dbl
elpd_diff se_diff elpd_loo se_elpd_loo p_loo se_p_loo looic se_looic
OU_correlation 0.00000000 0.000000 -32.95478 3.754743 22.998565 2.9770741 65.90957 7.509485
OU_regression -0.07918904 1.051616 -33.03397 3.831264 24.718425 3.2070885 66.06794 7.662528
ctmc_independent -8.81393327 2.274076 -41.76872 4.930736 1.370367 0.2513402 83.53743 9.861471
ctmc_dependent -9.31603837 2.294833 -42.27082 5.097658 2.316448 0.3997178 84.54164 10.195317
brownian_regression -14.85402549 3.109386 -47.80881 6.247695 19.809112 2.8749336 95.61762 12.495390
vanilla_correlation -29.11992435 3.385216 -62.07471 2.278148 27.944938 1.2846688 124.14941 4.556297
vanilla_regression -33.16730118 3.577467 -66.12208 3.149916 2.011269 0.1372601 132.24417 6.299831

Quite surprisingly, Pareto-smoothed leave-one-out cross-validation leads to a substantially different result than the Bayes Factor comparison. It should be kept in mind that the two types of model comparison compare different things:

  • The Bayes Factor compares models that model both variables jointly.
  • The LOOIC compares how well the models predict verb-object order if adjective-noun order is known.
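The pointwise quantity y_log_lik underlying the LOOIC is such a conditional log-likelihood: the joint log-likelihoods of the two states that share the observed x value are normalized against each other. In generic terms (Python, with made-up joint log-likelihoods for one tip):

```python
import math

# Hypothetical joint log-likelihoods for the four (x, y) states of one tip.
ll = {(0, 0): -2.1, (0, 1): -1.3, (1, 0): -3.0, (1, 1): -2.4}

def cond_loglik_y_given_x(ll, x, y):
    """log P(y | x) = ll[x, y] - log(exp(ll[x, 0]) + exp(ll[x, 1])),
    computed stably via the log-sum-exp trick, as in the Stan code."""
    a, b = ll[(x, y)], ll[(x, 1 - y)]
    m = max(a, b)
    return a - (m + math.log(math.exp(a - m) + math.exp(b - m)))

# The probabilities of y = 0 and y = 1 given x = 0 sum to one.
p0 = math.exp(cond_loglik_y_given_x(ll, 0, 0))
p1 = math.exp(cond_loglik_y_given_x(ll, 0, 1))
print(p0 + p1)  # 1.0 up to floating point
```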
marginal_loglikelihoods <- tibble(
    model = names(loo_list),
    type = c(
        'regression',
        'correlation',
        'regression',
        'regression',
        'correlation',
        'correlation',
        'correlation'
    ),
    marginal_loglikelihood = c(
        log_marginal_likelihood_vanilla_regression$logml,
        log_marginal_likelihood_vanilla_corr$logml,
        log_marginal_likelihood_OU_regression$logml,
        log_marginal_likelihood_brownian_regression$logml,
        log_marginal_likelihood_OU_correlation$logml,
        log_marginal_likelihood_ctmc_independent$logml,
        log_marginal_likelihood_ctmc_dependent$logml
    )
)
marginal_loglikelihoods %>%
    filter(type == "regression") %>%
    arrange(desc(marginal_loglikelihood))
A tibble: 3 × 3
model type marginal_loglikelihood
<chr> <chr> <dbl>
OU_regression regression 132.36930
brownian_regression regression -54.97201
vanilla_regression regression -64.79445
marginal_loglikelihoods %>%
    filter(type == "correlation") %>%
    arrange(desc(marginal_loglikelihood))
A tibble: 4 × 3
model type marginal_loglikelihood
<chr> <chr> <dbl>
OU_correlation correlation 264.30475
vanilla_correlation correlation 55.95731
ctmc_independent correlation -106.58872
ctmc_dependent correlation -112.93130

For the features and the tree used in this case study, the models using continuous latent variables are clearly superior to the CTMC models, which assume evolution of discrete characters.

Also, the Ornstein-Uhlenbeck process seems to provide a better fit to the data than a Brownian motion process.

Final thoughts

All the models discussed here use a single phylogeny. It is possible to use a sample of phylogenies to factor in phylogenetic uncertainty in this framework, but it makes the models more complex and the fitting computationally more demanding.

Essentially, what you have to do is:

  • Construct a 3-dimensional edge array (number of edges \(\times\) 2 \(\times\) number of trees), where edges[e, 1, k] and edges[e, 2, k] give the parent and the child node of edge number e in tree number k. You have to make sure that all trees are in postorder.

  • Construct a number of edges \(\times\) number of trees matrix edge_lengths, where edge_lengths[e, k] holds the length of edge number e in tree number k.

  • In the Stan model, you have to loop over all edges of every tree and collect one log-likelihood per tree. The phylogenetic recursion should then look like

vector[Ntrees] ll_tree;
for (tr in 1:Ntrees) {
    for (e in 1:Nedges) {
        int parent = edges[e, 1, tr];
        int child = edges[e, 2, tr];
        real t = edge_lengths[e, tr];
        ...
    }
    ll_tree[tr] = ...; // root log-likelihood of tree number tr
}
  • Inside the model block, add the line
target += log_sum_exp(ll_tree) - log(Ntrees);

This increments the target by the log of the average likelihood across the tree sample, i.e., it treats the sample as a uniform mixture over trees. (Simply summing the per-tree log-likelihoods and subtracting log(Ntrees) would multiply the likelihoods instead of averaging them.) Since the likelihood has to be computed for every tree in the sample, this will slow down model fitting quite a bit.
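Averaging over a tree sample amounts to adding the log of the mean per-tree likelihood to the target, computed stably as the log-sum-exp of the per-tree log-likelihoods minus log(Ntrees). A small Python check with made-up per-tree log-likelihoods:

```python
import math

# Made-up per-tree log-likelihoods for a sample of four trees.
ll_tree = [-120.4, -118.9, -119.7, -121.2]
n = len(ll_tree)

# Stable log of the average likelihood: log_sum_exp(ll_tree) - log(n).
m = max(ll_tree)
log_avg = m + math.log(sum(math.exp(v - m) for v in ll_tree)) - math.log(n)

# Direct computation for comparison; works here only because these
# log-likelihoods are small enough in magnitude not to underflow.
direct = math.log(sum(math.exp(v) for v in ll_tree) / n)
print(log_avg, direct)  # the two values agree
```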

References

Bouckaert, R., Redding, D., Sheehan, O., Kyritsis, T., Gray, R., Jones, K. E., & Atkinson, Q. (2022). Global language diversification is linked to socio-ecology and threat status. SocArXiv. https://osf.io/f8tr6/download

Pagel, M., & Meade, A. (2006). Bayesian analysis of correlated evolution of discrete characters by reversible-jump Markov chain Monte Carlo. The American Naturalist, 167(6), 808–825. https://doi.org/10.1086/503444