Phylogenetic control with OU

Running a Phylogenetic OU Regression with R and Stan

Phylogenetic comparative methods for binary typological features have long relied on two approaches: the continuous-time Markov chain (CTMC) model of Pagel & Meade (2006), which treats features as evolving between discrete states, and Brownian motion random effects, which model continuous phylogenetic drift. Both have shortcomings when applied to typological data. CTMC discards gradient information by forcing evolution into a discrete state space. Brownian motion assumes unbounded drift, yet typological features cluster around cross-linguistic norms rather than diverging without limit — a pattern consistent with stabilizing selection toward typological attractors.

The Ornstein-Uhlenbeck (OU) process directly models this stabilization. It is a Brownian motion with a restoring force toward an attractor, parameterized by a mean-reversion rate \(\lambda\). As \(\lambda \to 0\) it reduces to Brownian motion; for \(\lambda > 0\) it captures the tendency of typological features to remain near cross-linguistic norms regardless of genealogical distance.
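To build intuition, the OU transition distribution can be simulated directly: conditional on a start value \(z_0\), the value after time \(t\) is normal with mean \(\theta + (z_0-\theta)e^{-\lambda t}\) and variance \(\frac{\sigma^2}{2\lambda}(1-e^{-2\lambda t})\). A minimal base-R sketch (not part of the analysis below; all parameter values are arbitrary):

```r
# Exact OU transition along a branch of length t:
# z(t) | z(0) = z0 ~ Normal(theta + (z0 - theta) * exp(-lambda * t),
#                           sigma^2 / (2 * lambda) * (1 - exp(-2 * lambda * t)))
ou_step <- function(z0, t, lambda, sigma, theta = 0) {
  m <- theta + (z0 - theta) * exp(-lambda * t)
  v <- sigma^2 / (2 * lambda) * (1 - exp(-2 * lambda * t))
  rnorm(length(z0), mean = m, sd = sqrt(v))
}

set.seed(1)
z0 <- 5                               # start far away from the attractor theta = 0
weak   <- ou_step(rep(z0, 1e4), t = 10, lambda = 0.01, sigma = 1)
strong <- ou_step(rep(z0, 1e4), t = 10, lambda = 2,    sigma = 1)
c(mean(weak), mean(strong))   # weak reversion stays near z0; strong is pulled to theta
```

With a small \(\lambda\) the process behaves like Brownian motion and remembers its starting point; with a large \(\lambda\) the starting point is forgotten and draws cluster around the attractor, which is exactly the stabilizing behavior described above.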

This notebook makes the case for OU-based models in phylogenetic typology. We fit a progression of models — from simple baselines through the standard CTMC approach to Brownian and OU phylogenetic models — and compare them via Bayes factors and leave-one-out cross-validation. The OU models consistently outperform the alternatives, both statistically and conceptually. The Stan implementations are provided in full; rather than inverting a large covariance matrix, we simulate the OU process node by node along the tree, which is efficient and easy to extend with additional random effects.

Preparing the environment

The git repository https://github.com/gerhardJaeger/OU_logistic_example holds the source code for this notebook. It contains a file OU_logistic_example.yml. You can use it to set up a conda environment containing all the required packages. (Of course you can also install the packages with install.packages if you don’t use conda.) At the command prompt, in the home directory of the repository, do

#| eval: false
conda env create -f OU_logistic_example.yml
conda activate OU_logistic_example

Loading the required packages

suppressPackageStartupMessages({
  suppressWarnings({
    library(tidyverse)
    library(rstan)
    library(brms)
    library(parallel)
    rstan_options(auto_write = TRUE)
    options(mc.cores = parallel::detectCores())  # Enable parallel Stan sampling
    # Workaround for rstan 2.32+ sink conflict with knitr output capture
    stan_model <- function(...) {
      while (sink.number() > 0) sink()
      rstan::stan_model(...)
    }
    library(ape)
    library(R.utils)
    library(bridgesampling)
    library(loo)
    library(bayesplot)
    library(codetools)
    library(ggtree)
    # Workaround for yulab.utils cache initialization bug in non-interactive rendering
    local({
      unlockBinding("is_ggtree_interactive", asNamespace("ggtree"))
      assign("is_ggtree_interactive", function() FALSE, envir = asNamespace("ggtree"))
    })
    library(posterior)
    library(tidybayes)
    library(ggthemes)
    library(paletteer)
    library(ggnewscale)
    library(cowplot)
    library(scico)
    options(repr.plot.width = 12, repr.plot.height = 8)  # For base R plots in Jupyter
    theme_set(theme_minimal(base_size = 24))  # Increase base font size
  })
})

Now let’s get some data. I will use the EDGE tree (Bouckaert et al. 2022).

# Create data directory if it doesn't exist
if (!dir.exists("../data")) dir.create("../data", recursive = TRUE)

# Download the file if it doesn't exist
tree_gz <- "../data/global-language-tree-MCC-labelled.tree.gz"
tree_file <- "../data/global-language-tree-MCC-labelled.tree"
if (!file.exists(tree_file)) {
    download.file(
        url = "https://github.com/rbouckaert/global-language-tree-pipeline/releases/download/v1.0.0/global-language-tree-MCC-labelled.tree.gz",
        destfile = tree_gz,
        mode = "wb"
    )
    R.utils::gunzip(tree_gz, destname = tree_file, overwrite = TRUE)
}

edge_tree <- read.nexus(tree_file)

edge_tree$tip.label <- sapply(strsplit(edge_tree$tip.label, "_"), `[`, 1)

It is advisable to rescale the tree so that the mean branch length is 1.

edge_tree$edge.length <- edge_tree$edge.length / (mean(edge_tree$edge.length))

Grambank data

I will look at the following feature pair, which is a classic test case in word-order typology: adjective-noun order and verb-object order are among the most frequently studied cross-linguistic correlates (Dryer 1992), making them a natural benchmark for phylogenetic methods.

  • GB193: What is the order of adnominal property word and noun? (1 = NAdj, 2 = AdjN)
  • GB133: Is a pragmatically unmarked constituent order verb-final for transitive clauses? (0 = no, 1 = yes)

First we get the grambank data.

grambank_file = "../data/grambank_vals.csv"
if (!file.exists(grambank_file)) {
    values_url <- "https://raw.githubusercontent.com/grambank/grambank/refs/heads/master/cldf/values.csv"
    vals <- read_csv(values_url, na = "")
    write_csv(vals, grambank_file)
} else {
    vals <- read_csv(grambank_file)
}
Rows: 441663 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): ID, Language_ID, Parameter_ID, Value, Code_ID, Comment, Source, Sou...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Now we extract the datapoints for the two parameters of interest, restricted to languages present in the tree. I only consider two values per parameter, and only languages that have a relevant value for both parameters.

d <- vals %>%
    filter(Language_ID %in% edge_tree$tip.label) %>%
    filter(Parameter_ID %in% c("GB193", "GB133")) %>%
    select(Language_ID, Parameter_ID, Value) %>%
    filter(Value %in% c(0, 1, 2)) %>%
    mutate(Value = as.numeric(Value)) %>%
    pivot_wider(names_from = Parameter_ID, values_from = Value) %>%
    filter(GB193 > 0) %>%
    drop_na()

Now we prune the tree to only include the languages in the data.

pruned_tree <- drop.tip(edge_tree, setdiff(edge_tree$tip.label, d$Language_ID))

# Check that d$Language_ID and pruned_tree$tip.label are the same set
if (!setequal(d$Language_ID, pruned_tree$tip.label)) {
    warning("Mismatch between d$Language_ID and pruned_tree$tip.label")
    print(setdiff(d$Language_ID, pruned_tree$tip.label))
    print(setdiff(pruned_tree$tip.label, d$Language_ID))
} else {
    message("d$Language_ID and pruned_tree$tip.label match.")
}
d$Language_ID and pruned_tree$tip.label match.

Now the rows in d are rearranged to match the order of the tips in the pruned tree.

d <- d[match(pruned_tree$tip.label, d$Language_ID), ]

Finally we recode both features as 0/1 variables: y = GB133 is already binary, and x = GB193 − 1 maps NAdj to 0 and AdjN to 1.

d <- d %>%
    mutate(x = GB193 - 1,
           y = GB133 
    )

For now, I keep the data set small and restrict myself to 100 taxa.

set.seed(123)  # You can replace 123 with any integer
d <- d[sample(nrow(d), 100), ]
pruned_tree <- drop.tip(pruned_tree, setdiff(pruned_tree$tip.label, d$Language_ID))
d <- d[match(pruned_tree$tip.label, d$Language_ID), ]

Let’s have a look at the tree and the data.

ggtree(pruned_tree, layout="circular") +
    geom_tree() +
    geom_tiplab() +            # Adds tip labels
    theme_tree2()              # Adds scale axis (x-axis for branch lengths)


d %>% head(10)
A tibble: 10 × 5
Language_ID GB133 GB193     x     y
<chr>       <dbl> <dbl> <dbl> <dbl>
adan1251        1     2     1     1
ajas1235        0     1     0     0
alab1254        1     1     0     1
alya1239        0     2     1     0
arch1244        1     1     0     1
aree1239        1     2     1     1
arig1246        0     1     0     0
awar1249        1     2     1     1
band1344        0     1     0     0
bayb1234        0     1     0     0

Let’s save the data for later use.

write_csv(d, "../data/grambank_vals_pruned.csv")

Baseline models (no phylogenetic control)

We begin with models that ignore phylogenetic structure entirely. These serve as baselines and let us confirm that the data contain a signal before asking whether phylogenetic control is warranted.

We first encode the tree as a Stan data list. All subsequent models — vanilla, CTMC, Brownian, and OU — draw from this same data structure.

Just to be on the safe side, we make sure that the branches are in postorder.

pruned_tree <- reorder(pruned_tree, "postorder")

Now we code the tree and the data as a data list.

stan_data = list(
    N = nrow(d),  # Number of languages
    Nnodes = length(unique(as.vector(pruned_tree$edge))),
    root_node = length(pruned_tree$tip.label) + 1,
    x = d$x,  # Predictor variable (adjective-noun order, 1 = AdjN)
    y = d$y,  # Response variable (verb-final order)
    Nedges = nrow(pruned_tree$edge),  # Number of edges
    edges = pruned_tree$edge,  # Edges of the tree
    edge_lengths = pruned_tree$edge.length  # Lengths of the edges
)

Vanilla logistic regression

We start with a simple logistic regression without any phylogenetic effects, just to see how it works.

The model formula is:

\[ \begin{align} \alpha &\sim \text{Student-t}(3, 0, 2.5)\\ \beta &\sim \mathcal N(0, 2)\\ \mu_x &:= \frac{1}{|x|}\sum_{i}x_i\\ y &\sim \text{Bernoulli}(\text{logit}^{-1}(\alpha + \beta (x-\mu_x))) \end{align} \]

The choice of priors follows the defaults used by brms for logistic regression.
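The likelihood in the model formula can be spelled out by hand in R. The following is just a sanity check with made-up toy data and one arbitrary parameter value, not part of the analysis:

```r
# Log-likelihood of the vanilla model, computed by hand for toy values.
x_toy <- c(0, 0, 1, 1); y_toy <- c(1, 0, 0, 0)   # hypothetical data
alpha <- -0.5; beta <- -1                        # one candidate parameter value
p <- plogis(alpha + beta * (x_toy - mean(x_toy)))  # logit^{-1}(alpha + beta * Xc)
log_lik_toy <- sum(dbinom(y_toy, size = 1, prob = p, log = TRUE))
log_lik_toy
```

This mirrors the Stan code below: the predictor is centered (`Xc = x - mean(x)`) so that the intercept \(\alpha\) refers to a language with an average value of x.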

Let’s run this in Stan.

stan_code_vanilla <- "
data {
    int<lower=1> N;                      // Number of observations
    vector<lower=0, upper=1>[N] x;       // Predictor variable
    array[N] int<lower=0,upper=1> y;     // Response variable (binary)
}

transformed data {
  vector[N] Xc;  // centered version of X without an intercept
  real means_X;  // mean of X before centering
  means_X = mean(x);
  Xc = x - means_X;
}

parameters {
    real alpha;      // Intercept
    real beta;       // Slope
}
model {
    // Priors
    alpha ~ student_t(3, 0, 2.5);
    beta ~ normal(0, 2);

    // Likelihood: target += keeps the normalizing constants that the
    // sampling statement (y ~ bernoulli_logit(...)) would drop;
    // bridge sampling below needs them
    target += bernoulli_logit_lpmf(y | alpha + beta * Xc);
    
}
generated quantities {
    vector[N] log_lik;
    array[N] int y_rep;      // Posterior predictive samples
    array[N] int y_prior;    // Prior predictive samples
    real alpha_prior;
    real beta_prior;

    for (n in 1:N) {
        // Log-likelihood
        log_lik[n] = bernoulli_logit_lpmf(y[n] | alpha + beta * Xc[n]);

        // Posterior predictive
        y_rep[n] = bernoulli_logit_rng(alpha + beta * Xc[n]);

        // Prior predictive: draw new alpha and beta from prior, then simulate
        alpha_prior = student_t_rng(3, 0, 2.5);
        beta_prior = normal_rng(0, 2);
        y_prior[n] = bernoulli_logit_rng(alpha_prior + beta_prior * Xc[n]);
    }
}
"
system.time(
  if (file.exists("stan_model_vanilla.rds")) {
    stan_model_vanilla <- readRDS("stan_model_vanilla.rds")
  } else {
    stan_model_vanilla <- stan_model(model_code = stan_code_vanilla)
    saveRDS(stan_model_vanilla, "stan_model_vanilla.rds")
  }
)
   user  system elapsed 
  0.014   0.000   0.014 
system.time(
    fit_vanilla <- sampling(stan_model_vanilla, data = stan_data, iter = 2000, chains = 4)
)
   user  system elapsed 
  3.023   0.362   2.738 
print(fit_vanilla, pars = c("alpha", "beta"))
Inference for Stan model: anon_model.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

       mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
alpha -0.47    0.00 0.21 -0.88 -0.62 -0.47 -0.33 -0.06  3803    1
beta  -1.00    0.01 0.44 -1.87 -1.31 -0.99 -0.71 -0.15  3735    1

Samples were drawn using NUTS(diag_e) at Sat Apr  4 07:44:14 2026.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).
mcmc_areas(
    as.matrix(fit_vanilla),
    pars="beta",
    prob = 0.95
)

The 95% credible interval of the slope \(\beta\) does not include zero, indicating a credible relationship between the predictor variable x and the response variable y. The direction is negative, meaning that AdjN order (x = 1) disfavors verb-final (OV) order.

For model comparison, we can use the bridgesampling package to compute the log marginal likelihood.

system.time(
    capture.output({
    log_marginal_likelihood_vanilla_regression <- bridge_sampler(fit_vanilla)
    }, file = "/dev/null")
)
print(log_marginal_likelihood_vanilla_regression)
   user  system elapsed 
  2.461   0.005   0.537 
Bridge sampling estimate of the log marginal likelihood: -64.79445
Estimate obtained in 5 iteration(s) via method "normal".

Now we do visual checks of the model using prior and posterior predictive simulations.

predictive_checks_regression <- function(y_prior, y_rep, stan_data) {
    x0_idx <- which(stan_data$x == 0)
    x1_idx <- which(stan_data$x == 1)
    obs_p_y1_x0 <- mean(stan_data$y[stan_data$x == 0])
    obs_p_y1_x1 <- mean(stan_data$y[stan_data$x == 1])

    y_prior_df <- as_tibble(y_prior) %>%
        setNames(1:ncol(y_prior)) %>%
        mutate(sample_no = 1:nrow(.)) %>%
        pivot_longer(-sample_no, names_to = "obs", values_to = "y") %>%
        mutate(type = "y_prior") %>%
        mutate(obs = as.integer(obs))

    y_rep_df <- as_tibble(y_rep) %>%
        setNames(1:ncol(y_rep)) %>%
        mutate(sample_no = 1:nrow(.)) %>%
        pivot_longer(-sample_no, names_to = "obs", values_to = "y") %>%
        mutate(type = "y_rep") %>%
        mutate(obs = as.integer(obs))

    y_obs_df <- tibble(
        sample_no = 0,
        obs = 1:stan_data$N,
        y = stan_data$y,
        type = "y_obs"
    )


    df <- bind_rows(y_prior_df, y_rep_df, y_obs_df)

    df <- df %>%
        mutate(x_obs = stan_data$x[df$obs])

    empirical_df <- df %>%
        filter(sample_no == 0) %>%
        group_by(x_obs) %>%
        summarise(p_y1 = mean(y), .groups = "drop")         

    p <- df %>%
        filter(sample_no != 0) %>%
        group_by(sample_no, type, x_obs) %>%
        summarise(p_y1 = mean(y), .groups = "drop") %>%
        ggplot(aes(x = p_y1, fill = type, color = type)) +
        geom_density(alpha = 0.3, adjust = 1.2) +
        geom_vline(
        data = empirical_df,
            aes(xintercept = p_y1),
            color = "red",
            linewidth = 0.7,
            linetype = "dashed"
        ) +
        facet_wrap(
            ~x_obs,
            nrow = 1,
            labeller = labeller(x_obs = c(`0` = "x = 0", `1` = "x = 1"))
        ) +
        labs(
            title = "Predictive Check: Prior vs Posterior vs Observed",
            x = "Proportion of y = 1 (per draw)",
            y = "Density"
        ) +
        theme_minimal(base_size = 14)
    return(p)
}
post_vanilla_regression <- rstan::extract(fit_vanilla)

predictive_checks_regression(post_vanilla_regression$y_prior, post_vanilla_regression$y_rep, stan_data)
Warning message:
“The `x` argument of `as_tibble.matrix()` must have unique column names if
`.name_repair` is omitted as of tibble 2.0.0.
ℹ Using compatibility `.name_repair`.”

Vanilla logistic correlation study

A regression model assumes an asymmetry between the predictor variable (which is not modeled) and the response variable (which is modeled). This asymmetry is hard to justify in observational studies, where we have no control over the predictor variable. A more appropriate model treats both variables as generated by a common stochastic process.

This can be modeled as follows:

\[ \begin{align} z &\sim \mathcal N(\mu, \Sigma)\\ \Sigma &= \text{diag}(\sigma) \cdot R \cdot \text{diag}(\sigma)\\ R &\sim \text{LKJ}(2)\\ \mu_k &\sim \mathcal N(0, 2)\\ \sigma_k &\sim \text{Lognormal}(0, 1)\\ x &\sim \text{Bernoulli}(\text{logit}^{-1}(z_1))\\ y &\sim \text{Bernoulli}(\text{logit}^{-1}(z_2)) \end{align} \]

Here \(z\) is a bivariate normal variable with mean \(\mu\) and covariance \(\Sigma\). The covariance matrix \(\Sigma\) is parameterized as the product of a diagonal matrix of standard deviations \(\sigma = [\sigma_1, \sigma_2]\) and a correlation matrix \(R\). This allows the two dimensions to have different scales while still modeling their correlation. The correlation between them is given by \(\rho = R_{12} = R_{21}\). The correlation matrix is generated from an LKJ distribution, which is a flexible way to model correlations. The two variables \(x\) and \(y\) are then generated from the first and second components of \(z\), respectively, using the logistic function.
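A forward simulation makes the mechanism concrete. The sketch below fixes \(\rho\) at an arbitrary value instead of drawing \(R\) from the LKJ prior, and shows how a negative latent correlation induces a negative association between the observed binary variables (illustrative only, not part of the analysis):

```r
# Forward-simulate the latent-correlation model with fixed parameters.
set.seed(42)
N_sim <- 1e5
mu <- c(1, -1); sigma <- c(1.5, 1.5); rho <- -0.7   # arbitrary illustrative values
Sigma <- diag(sigma) %*% matrix(c(1, rho, rho, 1), 2) %*% diag(sigma)
L <- t(chol(Sigma))                  # lower-triangular Cholesky factor, as in the Stan code
z <- matrix(mu, 2, N_sim) + L %*% matrix(rnorm(2 * N_sim), 2, N_sim)
x_sim <- rbinom(N_sim, 1, plogis(z[1, ]))
y_sim <- rbinom(N_sim, 1, plogis(z[2, ]))
# a negative latent correlation makes y = 1 rarer when x = 1
c(mean(y_sim[x_sim == 0]), mean(y_sim[x_sim == 1]))
```

Note that the association between the binary variables is attenuated relative to \(\rho\): thresholding the latent scale discards information, which is why the model infers \(\rho\) on the latent scale rather than correlating x and y directly.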

The interesting question is then whether the correlation between \(x\) and \(y\), i.e., \(\rho = R_{21}\), is credibly different from zero. If it is, this indicates an association between the two variables at the latent level, which is what we are interested in.

stan_code_vanilla_corr <- "
data {
  int<lower=0> N;
  array[N] int<lower=0, upper=1> x;       // first binary variable
  array[N] int<lower=0, upper=1> y;       // second binary variable
}

parameters {
  vector[2] mu;                          // Mean of latent variables
  vector<lower=0>[2] sigma;                 // Standard deviation of latent variables
  cholesky_factor_corr[2] L_R;           // Cholesky factor of correlation matrix
  matrix[2, N] z_raw;                    // Standard normal latent variables
}

transformed parameters {
  real rho = L_R[2, 1];                  // Extract the correlation coefficient from the Cholesky factor
  matrix[2, N] z;                        // Latent variables with correlation
  z = rep_matrix(mu, N) + diag_matrix(sigma) * L_R * z_raw;
}

model {
  // Priors
  mu ~ normal(0, 2);                       // Prior for the mean of latent variables
  sigma ~ lognormal(0,1);                  // Prior for the standard deviation of latent variables
  L_R ~ lkj_corr_cholesky(2);              // Prior for the Cholesky factor of the correlation matrix
  to_vector(z_raw) ~ normal(0, 1);         // standard normal prior for latent

  // Likelihood (target += keeps normalizing constants, needed for bridge sampling)
  for (n in 1:N) {
    target += bernoulli_logit_lpmf(x[n] | z[1, n]);
    target += bernoulli_logit_lpmf(y[n] | z[2, n]);
  }
  
}

generated quantities {
  vector[N] log_lik_x;
  vector[N] log_lik_y;
  array[N] int x_rep;
  array[N] int y_rep;
  array[N] int x_prior;
  array[N] int y_prior;

  for (n in 1:N) {
    log_lik_x[n] = bernoulli_logit_lpmf(x[n] | z[1, n]);
    log_lik_y[n] = bernoulli_logit_lpmf(y[n] | z[2, n]);
    x_rep[n] = bernoulli_logit_rng(z[1, n]);
    y_rep[n] = bernoulli_logit_rng(z[2, n]);
  }

  // Prior predictive simulation
  {
    vector[2] mu_prior;
    vector[2] sigma_prior;
    mu_prior[1] = normal_rng(0, 2);
    mu_prior[2] = normal_rng(0, 2);
    sigma_prior[1] = lognormal_rng(0, 1);
    sigma_prior[2] = lognormal_rng(0, 1);

    matrix[2, 2] R_prior = lkj_corr_rng(2, 2.0);  // Sample from LKJ prior
    matrix[2, 2] L_prior = cholesky_decompose(R_prior);
    matrix[2, N] z_raw_prior;
    matrix[2, N] z_prior;

    for (n in 1:N)
      for (d in 1:2)
        z_raw_prior[d, n] = normal_rng(0, 1);

    z_prior = rep_matrix(mu_prior, N) + diag_matrix(sigma_prior) * L_prior * z_raw_prior;

    for (n in 1:N) {
      x_prior[n] = bernoulli_logit_rng(z_prior[1, n]);
      y_prior[n] = bernoulli_logit_rng(z_prior[2, n]);
    }
  }
}

"
system.time(
  if (file.exists("stan_model_vanilla_corr.rds")) {
    stan_model_vanilla_corr <- readRDS("stan_model_vanilla_corr.rds")
  } else {
    stan_model_vanilla_corr <- stan_model(model_code = stan_code_vanilla_corr)
    saveRDS(stan_model_vanilla_corr, "stan_model_vanilla_corr.rds")
  }
)
   user  system elapsed 
  0.017   0.000   0.017 
system.time(
    fit_vanilla_corr <- sampling(
        stan_model_vanilla_corr,
        data = stan_data,
        iter = 2000,
        chains = 4,
        control = list(adapt_delta = 0.95)
    )
)
Warning message:
“Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
Running the chains for more iterations may help. See
https://mc-stan.org/misc/warnings.html#bulk-ess”
   user  system elapsed 
 15.653   0.547  10.313 
print(fit_vanilla_corr, pars = c("mu", "rho", "sigma"))
Inference for Stan model: anon_model.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

          mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
mu[1]     1.36    0.03 0.65  0.54  0.93  1.21  1.63  3.07   639 1.00
mu[2]    -0.88    0.03 0.69 -2.89 -1.10 -0.70 -0.46 -0.06   454 1.00
rho      -0.44    0.01 0.28 -0.89 -0.65 -0.48 -0.29  0.27   861 1.00
sigma[1]  2.00    0.07 1.57  0.24  0.90  1.57  2.68  6.32   443 1.01
sigma[2]  2.92    0.16 3.12  0.23  1.00  2.01  3.75 11.04   362 1.00

Samples were drawn using NUTS(diag_e) at Sat Apr  4 07:44:26 2026.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).
mcmc_areas(
    as.matrix(fit_vanilla_corr),
    pars = "rho",
    prob = 0.95
)

The 95% credible interval of \(\rho\) includes 0, so there is no clear evidence for a correlation.

Predictive checks

We compare the behavior of data sampled from the prior, and data sampled from the posterior, to the observed data.

First check: proportion of 1s in the \(x\)-variable

predictive_check_marginal_binary <- function(x_prior, x_rep, x_obs,
                                             varname = "x", bins = 30) {
  library(ggplot2)
  library(dplyr)

  # Compute posterior and prior predictive means (per draw)
  post_prop <- rowMeans(x_rep)
  prior_prop <- rowMeans(x_prior)
  obs_prop <- mean(x_obs)

  # Combine into tidy data frame
  df <- data.frame(
    prop = c(prior_prop, post_prop),
    source = rep(c("prior", "posterior"), times = c(length(prior_prop), length(post_prop)))
  )

  # Plot
  ggplot(df, aes(x = prop, fill = source)) +
    geom_histogram(position = "identity", alpha = 0.4, bins = bins) +
    geom_vline(xintercept = obs_prop, color = "red", linetype = "dashed", linewidth = 1) +
    labs(
      title = paste("Sanity Check:", varname, "~ Bernoulli"),
      x = paste("Proportion of", varname, "= 1 (per draw)"),
      y = "Frequency"
    ) +
    scale_fill_manual(values = c("prior" = "skyblue", "posterior" = "seagreen")) +
    theme_minimal(base_size = 14)
}
post_vanilla_corr <- rstan::extract(fit_vanilla_corr)
predictive_check_marginal_binary(
  x_prior = post_vanilla_corr$x_prior,
  x_rep = post_vanilla_corr$x_rep,
  x_obs = stan_data$x,
  varname = "x"
)

Second check: conditional probabilities \(P(y=1 \mid x_{\text{observed}}=0)\) and \(P(y=1 \mid x_{\text{observed}}=1)\)

predictive_checks_regression(post_vanilla_corr$y_prior, post_vanilla_corr$y_rep, stan_data)

For both checks, the prior is close to a uniform distribution, and the posterior is peaked around the empirical values. This is as it should be.

Model comparison

We compute the log marginal likelihood for this model for later use.

system.time(
    capture.output({
        log_marginal_likelihood_vanilla_corr <- bridge_sampler(fit_vanilla_corr)
    }, file = "/dev/null")
)
print(log_marginal_likelihood_vanilla_corr)
   user  system elapsed 
  5.203   0.008   2.193 
Bridge sampling estimate of the log marginal likelihood: 55.95731
Estimate obtained in 16 iteration(s) via method "normal".

A direct model comparison via Bayes factor is not possible because the regression model specifies the conditional distribution of y given x, while the correlation model specifies the joint distribution of both variables. However, we can apply Pareto-smoothed leave-one-out cross-validation if we marginalize over x in the correlation model.

# extract log-likelihood matrix: draws x observations
log_lik1 <- extract_log_lik(fit_vanilla, parameter_name = "log_lik")
log_lik2 <- extract_log_lik(fit_vanilla_corr, parameter_name = "log_lik_y")

# compute LOO
loo1 <- loo(log_lik1)
loo2 <- loo(log_lik2)


loo_list <- list(
    "vanilla_regression" = loo1, 
    "vanilla_correlation" = loo2
  )

# compare
loo_compare(loo_list)
Warning message:
“Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.”
A compare.loo: 2 × 8 of type dbl
                    elpd_diff se_diff elpd_loo  se_elpd_loo p_loo     se_p_loo  looic    se_looic
vanilla_correlation  0.000000 0.00000 -62.07471 2.278148    27.944938 1.2846688 124.1494 4.556297
vanilla_regression  -4.047377 1.21631 -66.12208 3.149916     2.011269 0.1372601 132.2442 6.299831

So the correlation model is clearly preferred. Both vanilla models ignore phylogenetic structure, however. We now turn to phylogenetically controlled approaches, starting with the method that is most widely used in the field.

Pagel & Meade style test for correlation

The CTMC-based test for correlated evolution developed by Pagel & Meade (2006) is the dominant method in phylogenetic linguistics for assessing whether two binary features co-evolve. It models the joint evolution of \((x, y)\) as transitions among the four combined states \(\{(0,0),(0,1),(1,0),(1,1)\}\) via a continuous-time Markov chain with rate matrix \(Q\). Under the independent model, transition rates satisfy independence constraints (four free parameters); under the dependent model, all eight off-diagonal rates are free. A Bayes factor between the two quantifies evidence for correlated evolution.
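The independence constraint can be stated algebraically: with the joint states ordered \((0,0),(0,1),(1,0),(1,1)\), the independent model's rate matrix is the Kronecker sum of the two marginal \(2\times 2\) rate matrices. A quick check in R, using the same rate-tying scheme as the Stan code below (the rate values themselves are arbitrary):

```r
# The independent model's 4x4 rate matrix equals the Kronecker sum of the
# marginal 2x2 rate matrices under the state order (0,0), (0,1), (1,0), (1,1).
r <- c(0.22, 0.18, 0.39, 0.21)   # illustrative rates: y 0->1, y 1->0, x 0->1, x 1->0
Qy <- matrix(c(-r[1],  r[1],
                r[2], -r[2]), 2, 2, byrow = TRUE)
Qx <- matrix(c(-r[3],  r[3],
                r[4], -r[4]), 2, 2, byrow = TRUE)

Q_ind <- matrix(0, 4, 4)
Q_ind[1, 2] <- r[1]; Q_ind[2, 1] <- r[2]   # y transitions while x = 0
Q_ind[3, 4] <- r[1]; Q_ind[4, 3] <- r[2]   # y transitions while x = 1
Q_ind[1, 3] <- r[3]; Q_ind[3, 1] <- r[4]   # x transitions while y = 0
Q_ind[2, 4] <- r[3]; Q_ind[4, 2] <- r[4]   # x transitions while y = 1
diag(Q_ind) <- -rowSums(Q_ind)

Q_kron <- kronecker(Qx, diag(2)) + kronecker(diag(2), Qy)  # Kronecker sum
all.equal(Q_ind, Q_kron)
```

The dependent model breaks this structure by letting all eight off-diagonal rates vary freely, so the rate at which one feature changes may depend on the current state of the other.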

We implement this directly in Stan using Felsenstein’s pruning algorithm. One consequence of doing so — rather than using specialized software — is that the full Bayesian workflow comes at no extra cost: prior and posterior predictive checks, Bayes factor comparison via bridge sampling, and leave-one-out cross-validation are all available within the same reproducible script.
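Felsenstein's pruning algorithm computes the tree likelihood bottom-up, so the sum over all ancestral state assignments never has to be enumerated explicitly. As a sanity check, independent of the Stan code, here is a toy example on the three-taxon tree ((A:1,B:1):1,C:2) with a symmetric two-state CTMC, where brute-force enumeration over the two unobserved ancestral nodes is still feasible:

```r
# Pruning vs. brute force on a 3-taxon tree with a symmetric 2-state CTMC.
Pt <- function(t, r = 0.5) {   # transition matrix; closed form for 2 symmetric states
  e <- exp(-2 * r * t)
  matrix(c(0.5 * (1 + e), 0.5 * (1 - e),
           0.5 * (1 - e), 0.5 * (1 + e)), 2, 2, byrow = TRUE)
}
tipA <- c(1, 0); tipB <- c(1, 0); tipC <- c(0, 1)  # partial likelihoods at the tips

# pruning: combine children bottom-up
L_int  <- (Pt(1) %*% tipA) * (Pt(1) %*% tipB)      # internal node above A and B
L_root <- (Pt(1) %*% L_int) * (Pt(2) %*% tipC)
lik_pruning <- sum(0.5 * L_root)                   # uniform root distribution

# brute force: sum over root state s and internal-node state d
lik_brute <- 0
for (s in 1:2) for (d in 1:2)
  lik_brute <- lik_brute +
    0.5 * Pt(1)[s, d] * Pt(1)[d, 1] * Pt(1)[d, 1] * Pt(2)[s, 2]
all.equal(lik_pruning, lik_brute)
```

The Stan implementation below does the same thing on the log scale with four states, accumulating per-node log-likelihood vectors along the post-ordered edge list.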

The joint state space of the two binary variables is \(\{(x,y)\} = \{(0,0),(0,1),(1,0),(1,1)\}\), which we encode as states \(\{1,2,3,4\}\) via the bijection \(\text{idx} = 2x + y + 1\). The CTMC models transitions among these four joint states.
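The encoding and its inverse, spelled out in R (the same decoding appears in the generated quantities block of the Stan code):

```r
# Encode (x, y) pairs as states 1..4 and decode them again.
xy <- expand.grid(y = 0:1, x = 0:1)[, c("x", "y")]  # all four (x, y) pairs
idx <- 2 * xy$x + xy$y + 1                          # encode: states 1, 2, 3, 4
x_back <- (idx - 1) %/% 2                           # decode x (integer division)
y_back <- (idx - 1) %% 2                            # decode y (remainder)
rbind(idx = idx, x = x_back, y = y_back)
```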

stan_model_code_ctmc_independent = "

data {
    int<lower=1> N;
    int<lower=1> Nnodes;
    int<lower=1> Nedges;
    array[N] int<lower=0, upper=1> x;
    array[N] int<lower=0, upper=1> y;
    array[Nedges, 2] int<lower=1, upper=Nnodes> edges;
    vector<lower=0>[Nedges] edge_lengths;
    int<lower=1> root_node;
}

parameters {
    simplex[4] rel_rates;
    real<lower=0> total_rate;
}

transformed parameters {
    matrix[4, 4] Q = rep_matrix(0.0, 4, 4);

    Q[1, 2] = total_rate * rel_rates[1];
    Q[2, 1] = total_rate * rel_rates[2];
    Q[1, 3] = total_rate * rel_rates[3];
    Q[3, 1] = total_rate * rel_rates[4];
    Q[2, 4] = total_rate * rel_rates[3];
    Q[4, 2] = total_rate * rel_rates[4];
    Q[3, 4] = total_rate * rel_rates[1];
    Q[4, 3] = total_rate * rel_rates[2];

    for (i in 1:4) {
        Q[i, i] = -sum(Q[i, ]);
    }

    // stationary-ish distribution via long-time transition
    matrix[4, 4] P = matrix_exp(Q * 1e3);
    vector[4] stat_dist = to_vector(P[1]);
    stat_dist /= sum(stat_dist);

    matrix[Nnodes, 4] loglikelihood;
    loglikelihood = rep_matrix(0.0, Nnodes, 4);

    // tip likelihoods
    for (i in 1:N) {
        int idx = 2 * x[i] + y[i] + 1;
        loglikelihood[i] = rep_row_vector(negative_infinity(), 4);
        loglikelihood[i, idx] = 0;
    }

    // pruning recursion
    for (e in 1:Nedges) {
        int parent = edges[e, 1];
        int child = edges[e, 2];
        real t = edge_lengths[e];
        matrix[4, 4] Plocal = matrix_exp(Q * t);

        for (k in 1:4) {
            loglikelihood[parent, k] += log_sum_exp(
                to_vector(log(Plocal[k]) + loglikelihood[child])
            );
        }
    }
    real ll = log_sum_exp(loglikelihood[root_node] + log(stat_dist)');

}

model {
    // priors
    rel_rates ~ dirichlet(rep_vector(1.0, 4));
    total_rate ~ lognormal(-1, 0.5);

    target += ll;
}

generated quantities {
    real total_rate_prior = lognormal_rng(-1, 0.5);
    vector[4] rel_rates_prior = dirichlet_rng(rep_vector(1.0, 4));
    matrix[4, 4] Qprior = rep_matrix(0.0, 4, 4);

    Qprior[1, 2] = total_rate_prior * rel_rates_prior[1];
    Qprior[2, 1] = total_rate_prior * rel_rates_prior[2];
    Qprior[1, 3] = total_rate_prior * rel_rates_prior[3];
    Qprior[3, 1] = total_rate_prior * rel_rates_prior[4];
    Qprior[2, 4] = total_rate_prior * rel_rates_prior[3];
    Qprior[4, 2] = total_rate_prior * rel_rates_prior[4];
    Qprior[3, 4] = total_rate_prior * rel_rates_prior[1];
    Qprior[4, 3] = total_rate_prior * rel_rates_prior[2];
    
    for (i in 1:4) {
        Qprior[i, i] = -sum(Qprior[i, ]);
    }

    matrix[4, 4] Pprior = matrix_exp(Qprior * 1e3);
    vector[4] stat_dist_prior = to_vector(Pprior[1]);
    stat_dist_prior /= sum(stat_dist_prior);
    
    array[Nnodes] int<lower=1, upper=4> sim_states;
    sim_states[root_node] = categorical_rng(stat_dist_prior);
    
    // Loop backward through post-ordered edges
    for (i in 1:Nedges) {
        int e = Nedges - i + 1;
        int parent = edges[e, 1];
        int child = edges[e, 2];
        real t = edge_lengths[e];
        matrix[4, 4] Plocal = matrix_exp(Qprior * t);
        sim_states[child] = categorical_rng(to_vector(Plocal[sim_states[parent]]));
    }
    
    array[N] int<lower=0, upper=1> x_prior;
    array[N] int<lower=0, upper=1> y_prior;
    for (i in 1:N) {
        int state = sim_states[i] - 1;
        x_prior[i] = state / 2;
        y_prior[i] = state % 2;
    }

    
    // Posterior predictive for each tip
    array[N] real y_log_lik;  // Pointwise log-likelihood of y[tip] given x[tip] and the other tips
    vector[4] ll_loo;
    array[N] int<lower=0, upper=1> x_rep;
    array[N] int<lower=0, upper=1> y_rep;
    for (tip in 1:N) {
        for (s in 1:4) {
            // Recompute likelihood WITHOUT this tip
            matrix[Nnodes, 4] loglik_loo = rep_matrix(0.0, Nnodes, 4);
            
            // Set tip likelihoods (excluding tip 'tip')
            for (i in 1:N) {
                loglik_loo[i] = rep_row_vector(negative_infinity(), 4);
                if (i != tip) {
                    int idx = 2 * x[i] + y[i] + 1;
                    loglik_loo[i, idx] = 0;
                } else {
                    loglik_loo[i, s] = 0;
                }
            }
            
            // Pruning recursion (same as model block)
            for (e in 1:Nedges) {
                int parent = edges[e, 1];
                int child = edges[e, 2];
                real t = edge_lengths[e];
                matrix[4, 4] Plocal = matrix_exp(Q * t);
                
                for (k in 1:4) {
                    loglik_loo[parent, k] += log_sum_exp(
                        to_vector(log(Plocal[k]) + loglik_loo[child])
                    );
                }
            }
            
            ll_loo[s] = log_sum_exp(loglik_loo[root_node] + log(stat_dist)');
        }
        int observed_state = 2 * x[tip] + y[tip] + 1;
        int same_x_state = 2 * x[tip] + (1 - y[tip]) + 1;
        y_log_lik[tip] = ll_loo[observed_state] - log_sum_exp(ll_loo[observed_state] , ll_loo[same_x_state]);
        int s_rep = categorical_rng(softmax(ll_loo));
        x_rep[tip] = (s_rep - 1) / 2;
        y_rep[tip] = (s_rep - 1) % 2;
    }
}

"

system.time(
  if (file.exists("ctmc_independent.rds")) {
    stan_model_ctmc_independent <- readRDS("ctmc_independent.rds")
  } else {
    stan_model_ctmc_independent <- stan_model(model_code = stan_model_code_ctmc_independent)
    saveRDS(stan_model_ctmc_independent, "ctmc_independent.rds")
  }
)
   user  system elapsed 
  0.019   0.001   0.020 
system.time(
    fit_ctmc_independent <- sampling(
        stan_model_ctmc_independent,
        data = stan_data,
        iter = 2000,
        chains = 4
    )
)
    user   system  elapsed 
2073.146    1.013  564.488 
print(fit_ctmc_independent, pars=c("stat_dist", "rel_rates", "total_rate"))
Inference for Stan model: anon_model.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

             mean se_mean   sd 2.5%  25%  50%  75% 97.5% n_eff Rhat
stat_dist[1] 0.15       0 0.04 0.08 0.13 0.15 0.18  0.25  4107    1
stat_dist[2] 0.19       0 0.05 0.11 0.16 0.19 0.22  0.29  3553    1
stat_dist[3] 0.29       0 0.06 0.18 0.25 0.29 0.33  0.42  4493    1
stat_dist[4] 0.36       0 0.07 0.24 0.32 0.36 0.40  0.50  4551    1
rel_rates[1] 0.22       0 0.07 0.10 0.17 0.22 0.26  0.37  2334    1
rel_rates[2] 0.18       0 0.06 0.08 0.13 0.17 0.22  0.32  2824    1
rel_rates[3] 0.39       0 0.09 0.22 0.33 0.39 0.46  0.57  2183    1
rel_rates[4] 0.21       0 0.05 0.12 0.17 0.20 0.24  0.31  3236    1
total_rate   0.27       0 0.07 0.16 0.22 0.26 0.31  0.44  2416    1

Samples were drawn using NUTS(diag_e) at Sat Apr  4 07:54:05 2026.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

predictive checks

post_ctmc_independent <- rstan::extract(fit_ctmc_independent)
predictive_check_marginal_binary(
  x_prior = post_ctmc_independent$x_prior,
  x_rep = post_ctmc_independent$x_rep,
  x_obs = stan_data$x,
  varname = "x"
)

predictive_checks_regression(post_ctmc_independent$y_prior, post_ctmc_independent$y_rep, stan_data)

The predictive checks indicate that the prior is already quite informative, which is undesirable: the prior predictive distribution should not anticipate the data. The posterior is still shifted towards the empirical value.

system.time(
    capture.output({
        log_marginal_likelihood_ctmc_independent <- bridge_sampler(fit_ctmc_independent)
    }, file = "/dev/null")
)
print(log_marginal_likelihood_ctmc_independent)
   user  system elapsed 
 11.390   0.013   9.481 
Bridge sampling estimate of the log marginal likelihood: -106.5915
Estimate obtained in 5 iteration(s) via method "normal".
stan_model_code_ctmc_dependent = "
    data {
    int<lower=1> N;
    int<lower=1> Nnodes;
    int<lower=1> Nedges;
    array[N] int<lower=0, upper=1> x;
    array[N] int<lower=0, upper=1> y;
    array[Nedges, 2] int<lower=1, upper=Nnodes> edges;
    vector<lower=0>[Nedges] edge_lengths;
    int<lower=1> root_node;
}

parameters {
    simplex[8] rel_rates;
    real<lower=0> total_rate;
}

transformed parameters {
    matrix[4, 4] Q = rep_matrix(0.0, 4, 4);

    Q[1, 2] = total_rate * rel_rates[1];
    Q[2, 1] = total_rate * rel_rates[2];
    Q[1, 3] = total_rate * rel_rates[3];
    Q[3, 1] = total_rate * rel_rates[4];
    Q[2, 4] = total_rate * rel_rates[5];
    Q[4, 2] = total_rate * rel_rates[6];
    Q[3, 4] = total_rate * rel_rates[7];
    Q[4, 3] = total_rate * rel_rates[8];

    for (i in 1:4) {
        Q[i, i] = -sum(Q[i, ]);
    }

    // stationary-ish distribution via long-time transition
    matrix[4, 4] P = matrix_exp(Q * 1e3);
    vector[4] stat_dist = to_vector(P[1]);
    stat_dist /= sum(stat_dist);

    matrix[Nnodes, 4] loglikelihood = rep_matrix(0.0, Nnodes, 4);

    // tip likelihoods
    for (i in 1:N) {
        int idx = 2 * x[i] + y[i] + 1;
        loglikelihood[i] = rep_row_vector(negative_infinity(), 4);
        loglikelihood[i, idx] = 0;
    }

    // pruning recursion
    for (e in 1:Nedges) {
        int parent = edges[e, 1];
        int child = edges[e, 2];
        real t = edge_lengths[e];
        matrix[4, 4] Plocal = matrix_exp(Q * t);

        for (k in 1:4) {
            loglikelihood[parent, k] += log_sum_exp(
                to_vector(log(Plocal[k]) + loglikelihood[child])
            );
        }
    }
    real ll = log_sum_exp(loglikelihood[root_node] + log(stat_dist)');

}

model {
    // priors
    rel_rates ~ dirichlet(rep_vector(1.0, 8));
    total_rate ~ lognormal(-1, 0.5);

    target += ll;
}

generated quantities {
    real total_rate_prior = lognormal_rng(-1, 0.5);
    vector[8] rel_rates_prior = dirichlet_rng(rep_vector(1.0, 8));
    matrix[4, 4] Qprior = rep_matrix(0.0, 4, 4);

    Qprior[1, 2] = total_rate_prior * rel_rates_prior[1];
    Qprior[2, 1] = total_rate_prior * rel_rates_prior[2];
    Qprior[1, 3] = total_rate_prior * rel_rates_prior[3];
    Qprior[3, 1] = total_rate_prior * rel_rates_prior[4];
    Qprior[2, 4] = total_rate_prior * rel_rates_prior[5];
    Qprior[4, 2] = total_rate_prior * rel_rates_prior[6];
    Qprior[3, 4] = total_rate_prior * rel_rates_prior[7];
    Qprior[4, 3] = total_rate_prior * rel_rates_prior[8];
    
    for (i in 1:4) {
        Qprior[i, i] = -sum(Qprior[i, ]);
    }

    matrix[4, 4] Pprior = matrix_exp(Qprior * 1e3);
    vector[4] stat_dist_prior = to_vector(Pprior[1]);
    stat_dist_prior /= sum(stat_dist_prior);
    
    array[Nnodes] int<lower=1, upper=4> sim_states;
    sim_states[root_node] = categorical_rng(stat_dist_prior);
    
    // Loop backward through post-ordered edges
    for (i in 1:Nedges) {
        int e = Nedges - i + 1;
        int parent = edges[e, 1];
        int child = edges[e, 2];
        real t = edge_lengths[e];
        matrix[4, 4] Plocal = matrix_exp(Qprior * t);
        sim_states[child] = categorical_rng(to_vector(Plocal[sim_states[parent]]));
    }
    
    array[N] int<lower=0, upper=1> x_prior;
    array[N] int<lower=0, upper=1> y_prior;
    for (i in 1:N) {
        int state = sim_states[i] - 1;
        x_prior[i] = state / 2;
        y_prior[i] = state % 2;
    }

    
    // Posterior predictive for each tip
    array[N] real y_log_lik;  // Pointwise log-likelihood
    vector[4] ll_loo;
    array[N] int<lower=0, upper=1> x_rep;
    array[N] int<lower=0, upper=1> y_rep;
    for (tip in 1:N) {
        for (s in 1:4) {
            // Recompute likelihood WITHOUT this tip
            matrix[Nnodes, 4] loglik_loo = rep_matrix(0.0, Nnodes, 4);
            
            // Set tip likelihoods (excluding tip 'tip')
            for (i in 1:N) {
                loglik_loo[i] = rep_row_vector(negative_infinity(), 4);
                if (i != tip) {
                    int idx = 2 * x[i] + y[i] + 1;
                    loglik_loo[i, idx] = 0;
                } else {
                    loglik_loo[i, s] = 0;
                }
            }
            
            // Pruning recursion (same as model block)
            for (e in 1:Nedges) {
                int parent = edges[e, 1];
                int child = edges[e, 2];
                real t = edge_lengths[e];
                matrix[4, 4] Plocal = matrix_exp(Q * t);
                
                for (k in 1:4) {
                    loglik_loo[parent, k] += log_sum_exp(
                        to_vector(log(Plocal[k]) + loglik_loo[child])
                    );
                }
            }
            
            ll_loo[s] = log_sum_exp(loglik_loo[root_node] + log(stat_dist)');
        }
        int observed_state = 2 * x[tip] + y[tip] + 1;
        int same_x_state = 2 * x[tip] + (1 - y[tip]) + 1;
        y_log_lik[tip] = ll_loo[observed_state] - log_sum_exp(ll_loo[observed_state] , ll_loo[same_x_state]);
        int s_rep = categorical_rng(softmax(ll_loo));
        x_rep[tip] = (s_rep - 1) / 2;
        y_rep[tip] = (s_rep - 1) % 2;
    }
}
"
system.time(
  if (file.exists("ctmc_dependent.rds")) {
    stan_model_ctmc_dependent <- readRDS("ctmc_dependent.rds")
  } else {
    stan_model_ctmc_dependent <- stan_model(model_code = stan_model_code_ctmc_dependent)
    saveRDS(stan_model_ctmc_dependent, "ctmc_dependent.rds")
  }
)
   user  system elapsed 
  0.021   0.000   0.022 
system.time(
    fit_ctmc_dependent <- sampling(
        stan_model_ctmc_dependent,
        data = stan_data,
        iter = 2000,
        chains = 4
    )
)
    user   system  elapsed 
2248.899    1.232  648.557 
print(fit_ctmc_dependent, pars=c("total_rate", "rel_rates", "stat_dist"))
Inference for Stan model: anon_model.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

             mean se_mean   sd 2.5%  25%  50%  75% 97.5% n_eff Rhat
total_rate   0.46       0 0.13 0.27 0.37 0.45 0.53  0.76  3294    1
rel_rates[1] 0.10       0 0.08 0.00 0.03 0.08 0.14  0.29  3168    1
rel_rates[2] 0.07       0 0.05 0.00 0.03 0.06 0.09  0.17  3187    1
rel_rates[3] 0.29       0 0.12 0.07 0.21 0.29 0.37  0.52  2713    1
rel_rates[4] 0.11       0 0.04 0.04 0.07 0.10 0.13  0.20  3768    1
rel_rates[5] 0.10       0 0.07 0.00 0.04 0.09 0.14  0.26  2490    1
rel_rates[6] 0.09       0 0.05 0.02 0.05 0.08 0.11  0.20  3311    1
rel_rates[7] 0.13       0 0.05 0.04 0.09 0.13 0.16  0.24  3371    1
rel_rates[8] 0.12       0 0.05 0.03 0.08 0.11 0.15  0.24  3206    1
stat_dist[1] 0.13       0 0.05 0.06 0.10 0.12 0.16  0.24  4370    1
stat_dist[2] 0.25       0 0.07 0.12 0.20 0.24 0.29  0.40  5282    1
stat_dist[3] 0.31       0 0.07 0.19 0.27 0.31 0.36  0.46  5188    1
stat_dist[4] 0.31       0 0.07 0.19 0.26 0.31 0.36  0.46  5224    1

Samples were drawn using NUTS(diag_e) at Sat Apr  4 08:05:12 2026.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

predictive checks

post_ctmc_dependent <- rstan::extract(fit_ctmc_dependent)
predictive_check_marginal_binary(
  x_prior = post_ctmc_dependent$x_prior,
  x_rep = post_ctmc_dependent$x_rep,
  x_obs = stan_data$x,
  varname = "x"
)

predictive_checks_regression(post_ctmc_dependent$y_prior, post_ctmc_dependent$y_rep, stan_data)

Here the prior is flat. Oddly, there seems to be no difference in the distributions for x=0 and x=1.

model comparison

system.time(
    capture.output({
        log_marginal_likelihood_ctmc_dependent <- bridge_sampler(fit_ctmc_dependent)
    }, file = "/dev/null")
)
print(log_marginal_likelihood_ctmc_dependent)
   user  system elapsed 
 10.853   0.016   9.385 
Bridge sampling estimate of the log marginal likelihood: -112.9271
Estimate obtained in 4 iteration(s) via method "normal".

And now the Bayes Factor:

(log_marginal_likelihood_ctmc_independent$logml - log_marginal_likelihood_ctmc_dependent$logml)
6.33556205649873

So there is strong evidence in favor of the independent model: a log Bayes factor of about 6.3 corresponds to a Bayes factor of roughly 560.

The CTMC approach has one fundamental limitation: it treats typological features as discrete characters with a small number of states. This forces the model to ignore the gradient nature of typological tendencies — the fact that a language is not simply “verb-final” or “not verb-final” but sits on a continuum of probability governed by its history and relatedness to other languages. We now turn to continuous models that avoid this limitation.

Logistic regression with a phylogenetic random effect

The Ornstein-Uhlenbeck process

The OU process is a stochastic process, similar to Brownian motion. The crucial difference is that OU is a “random walk on a leash”. Technically, it is the superposition of a stochastic Brownian motion and a deterministic trajectory that converges exponentially to a mean value. For typological applications, this mean value corresponds to the cross-linguistic attractor: the value a language tends toward over long time scales regardless of ancestry. The OU process is characterized by the following transition distribution: \[ X_t \sim \mathcal N\left(x_0e^{-\lambda t} + \mu (1-e^{-\lambda t}), \frac{\sigma}{\sqrt{2\lambda}}\sqrt{1-e^{-2\lambda t}}\right) \]

The parameters are:

  • \(x_0\): the initial value of the process
  • \(\mu\): the long-run mean to which the process is attracted (the typological attractor)
  • \(\lambda\): the rate of mean-reversion (strength of the attractor)
  • \(\sigma\): the diffusion coefficient (instantaneous noise amplitude in the SDE \(dX = \lambda(\mu-X)dt + \sigma\,dW\))
  • \(t\): the time at which the process is evaluated

Note that \(\sigma\) here is a diffusion coefficient, not a standard deviation. The equilibrium standard deviation is \(\sigma/\sqrt{2\lambda}\). When \(\lambda = 0\), the OU process reduces to Brownian motion, so Brownian motion is a special (limiting) case of OU rather than a competing framework.

The long-term equilibrium distribution is: \[ X_\infty \sim \mathcal N\!\left(\mu,\, \frac{\sigma}{\sqrt{2\lambda}}\right) \]
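As a sanity check on these formulas, here is a minimal base-R sketch (the helper names ou_mean, ou_sd, and rou_step are ours, not part of the notebook's code) that confirms both limits numerically: as \(\lambda \to 0\) the transition standard deviation approaches the Brownian-motion value \(\sigma\sqrt t\), and for large \(\lambda t\) it approaches the equilibrium value \(\sigma/\sqrt{2\lambda}\).

```r
# Minimal sketch of the OU transition distribution (helper names are ours)
ou_mean <- function(x0, mu, lambda, t) {
  x0 * exp(-lambda * t) + mu * (1 - exp(-lambda * t))
}
ou_sd <- function(sigma, lambda, t) {
  # -expm1(u) computes 1 - exp(u) without cancellation for small u
  sigma * sqrt(-expm1(-2 * lambda * t) / (2 * lambda))
}
# One exact simulation step along a branch of length t
rou_step <- function(x0, mu, lambda, sigma, t) {
  rnorm(1, ou_mean(x0, mu, lambda, t), ou_sd(sigma, lambda, t))
}

# Brownian-motion limit: for lambda -> 0, the sd approaches sigma * sqrt(t)
ou_sd(sigma = 1, lambda = 1e-10, t = 2)  # ~ sqrt(2) ≈ 1.414
# Strong attraction: for large lambda * t, it approaches sigma / sqrt(2 * lambda)
ou_sd(sigma = 1, lambda = 5, t = 100)    # ~ 1 / sqrt(10) ≈ 0.316
```

The same two limits motivate the modeling choice: \(\lambda\) near 0 recovers Brownian drift, while large \(\lambda\) pins languages to the equilibrium distribution around \(\mu\).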

We first fit a regression version (one trait as predictor, one as response), then a symmetric correlation version (both traits modeled jointly as co-evolving). We also include Brownian motion as a weaker baseline to show that mean-reversion provides a better fit. The OU models are implemented directly in Stan; the Brownian model uses the brms package for comparison.

We now add a phylogenetic random effect to the logistic regression from above:

\[ \begin{align} \alpha &\sim \text{Student-t}(3, 0, 2.5)\\ \beta &\sim \mathcal N(0, 2)\\ \epsilon &\sim \mathcal N(0, \Sigma)\\ y &\sim \text{Bernoulli}(\text{logit}^{-1}(\alpha + \beta x + \epsilon)) \end{align} \]

The covariance matrix \(\Sigma\) of the phylogenetic random effect is given by

\[ \Sigma_{ij} = \frac{\sigma^2}{2\lambda} e^{-\lambda t_{ij}}, \]

where \(t_{ij}\) is the patristic distance between language \(i\) and language \(j\) in the phylogeny.
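As a concrete illustration (with a hypothetical hand-coded patristic distance matrix for three languages, not part of the notebook's pipeline), \(\Sigma\) can be assembled directly:

```r
# Hypothetical patristic distances between three languages
t_ij <- matrix(c(0, 2, 6,
                 2, 0, 6,
                 6, 6, 0), nrow = 3, byrow = TRUE)
sigma  <- 1.5
lambda <- 0.3

# Sigma_ij = sigma^2 / (2 * lambda) * exp(-lambda * t_ij)
Sigma <- sigma^2 / (2 * lambda) * exp(-lambda * t_ij)

diag(Sigma)         # equilibrium variance sigma^2 / (2 * lambda) = 3.75 throughout
isSymmetric(Sigma)  # TRUE; covariance decays with patristic distance
```

With real data, the distance matrix would come from the phylogeny, e.g. via ape's cophenetic.phylo applied to pruned_tree.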

A direct implementation of this formula in Stan is possible but inefficient. A better approach simulates the evolution of the latent variable \(\epsilon\) along the phylogeny directly. This means that \(\epsilon\) has a value not only at the tips but also at the internal nodes. The value at the root is sampled from the equilibrium distribution, and the values at the other nodes are drawn from the transition distribution above, repeated here:

\[ X_t \sim \mathcal N\!\left(x_0e^{-\lambda t} + \mu (1-e^{-\lambda t}),\; \frac{\sigma}{\sqrt{2\lambda}}\sqrt{1-e^{-2\lambda t}}\right). \]

\(X_t\) is now the value at the daughter node, \(x_0\) the value at the mother node, and \(t\) the branch length. (This matches the Stan code directly: s = sigma * sqrt(-expm1(-2 * lambda * len) / (2 * lambda)), where -expm1(u) is a numerically stable way of computing 1 - exp(u).)

stan_model_code_OU_regression = "
data {
  int<lower=1> N;                                         // number of tips with data
  int<lower=1> Nnodes;                                    // total number of nodes
  int<lower=1> Nedges;                                    // number of edges in the tree
  vector<lower=0, upper=1>[N] x;                          // predictor at tips
  array[N] int<lower=0, upper=1> y;                       // binary response at tips
  array[Nedges, 2] int<lower=1, upper=Nnodes> edges;      // parent → child
  vector<lower=0>[Nedges] edge_lengths;                   // edge lengths
  int<lower=1, upper=Nnodes> root_node;                   // index of root node
}
transformed data {
  vector[N] Xc;  // centered version of X without an intercept
  real means_X;  // mean of X before centering
  means_X = mean(x);
  Xc = x - means_X;
}


parameters {
  vector[Nnodes] z_std;       // standard normal reparam for latent z
  real<lower=0> sigma;        // OU diffusion
  real<lower=0> lambda;       // OU strength
  real mu;                    // OU mean
  real alpha;                 // intercept
  real beta;                  // slope
}

transformed parameters {
  vector[Nnodes] z;

  // root node
  z[root_node] = mu + (sigma / sqrt(2 * lambda)) * z_std[root_node];

  // recursive evolution
  for (e in 1:Nedges) {
    int edge_index = Nedges - e + 1; // reverse order for recursion
    int parent = edges[edge_index, 1];
    int child = edges[edge_index, 2];
    real len = edge_lengths[edge_index];

    real decay = exp(-lambda * len);
    real s = sigma * sqrt(-expm1(-2 * lambda * len) / (2 * lambda));
    real mn = mu + (z[parent] - mu) * decay;

    z[child] = mn + s * z_std[child];
  }
}

model {
  // Priors
  alpha ~ student_t(3, 0, 2.5);
  beta ~ normal(0, 2);
  sigma ~ lognormal(0, 1);
  lambda ~ lognormal(0, 1);
  mu ~ normal(0, 2);
  z_std ~ normal(0, 1);  // standard normal prior

  // Likelihood
  for (i in 1:N) {
    // y[i] ~ bernoulli_logit(alpha + beta * Xc[i] + z[i]);
    target += bernoulli_logit_lpmf(y[i] | alpha + beta * Xc[i] + z[i]);
  }

}

generated quantities {
  vector[N] log_lik;
  array[N] int y_rep;     // posterior predictive
  array[N] int y_prior;   // prior predictive

  // Posterior predictive samples (uses fitted alpha, beta, z)
  for (i in 1:N) {
    real eta = alpha + beta * Xc[i] + z[i];
    log_lik[i] = bernoulli_logit_lpmf(y[i] | eta);
    y_rep[i] = bernoulli_logit_rng(eta);
  }

  // Prior predictive samples: draw alpha, beta, z_prior from prior
  {
    real alpha_prior = student_t_rng(3, 0, 2.5);
    real beta_prior = normal_rng(0, 2);

    vector[Nnodes] z_prior_std;
    for (j in 1:Nnodes)
      z_prior_std[j] = normal_rng(0, 1);

    real mu_prior = normal_rng(0, 2);
    real lambda_prior = lognormal_rng(0, 1);
    real sigma_prior = lognormal_rng(0, 1);

    vector[Nnodes] z_prior;
    z_prior[root_node] = mu_prior + (sigma_prior / sqrt(2 * lambda_prior)) * z_prior_std[root_node];

    for (e in 1:Nedges) {
      int edge_index = Nedges - e + 1;
      int parent = edges[edge_index, 1];
      int child = edges[edge_index, 2];
      real len = edge_lengths[edge_index];
      real decay = exp(-lambda_prior * len);
      real s = sigma_prior * sqrt(-expm1(-2 * lambda_prior * len) / (2 * lambda_prior));
      real mn = mu_prior + (z_prior[parent] - mu_prior) * decay;
      z_prior[child] = mn + s * z_prior_std[child];
    }

    for (i in 1:N) {
      real eta_prior = alpha_prior + beta_prior * Xc[i] + z_prior[i];
      y_prior[i] = bernoulli_logit_rng(eta_prior);
    }
  }
}

"
system.time(
  if (file.exists("OU_regression.rds")) {
    stan_model_OU_regression <- readRDS("OU_regression.rds")
  } else {
    stan_model_OU_regression <- stan_model(model_code = stan_model_code_OU_regression)
    saveRDS(stan_model_OU_regression, "OU_regression.rds")
  }
)
   user  system elapsed 
  0.018   0.000   0.019 
system.time(
    fit_OU_regression <- sampling(
        stan_model_OU_regression,
        data = stan_data,
        iter = 2000,
        chains = 4,
        control = list(adapt_delta = 0.95, max_treedepth = 15)
    )
)
   user  system elapsed 
 43.497   0.790  16.307 
print(fit_OU_regression, pars = c("mu", "sigma", "lambda", "alpha", "beta"))
Inference for Stan model: anon_model.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

        mean se_mean   sd  2.5%   25%   50%  75% 97.5% n_eff Rhat
mu      0.35    0.02 1.81 -3.23 -0.84  0.36 1.59  3.86  5901    1
sigma   5.34    0.17 4.47  1.25  2.67  4.03 6.53 16.88   719    1
lambda  0.11    0.00 0.04  0.04  0.08  0.10 0.13  0.20  2697    1
alpha   0.64    0.03 2.19 -3.60 -0.72  0.58 1.88  5.33  4204    1
beta   -0.82    0.02 1.62 -3.93 -1.89 -0.83 0.19  2.47  4732    1

Samples were drawn using NUTS(diag_e) at Sat Apr  4 08:05:46 2026.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).
mcmc_areas(
    as.matrix(fit_OU_regression),
    pars = "beta",
    prob = 0.95
)

predictive checks

We perform the same predictive checks as for the regression model above.

post_ou_regression <- rstan::extract(fit_OU_regression)
predictive_checks_regression(post_ou_regression$y_prior, post_ou_regression$y_rep, stan_data)

This looks perfect: the prior is essentially flat, and the posterior is peaked around the observed value.

model comparison

system.time(
    capture.output({
        log_marginal_likelihood_OU_regression <- bridge_sampler(fit_OU_regression)
    }, file = "/dev/null")
)
print(log_marginal_likelihood_OU_regression)
   user  system elapsed 
 22.094   0.032   2.355 
Bridge sampling estimate of the log marginal likelihood: 131.7628
Estimate obtained in 13 iteration(s) via method "normal".

Now we can compute the Bayes factor between the vanilla regression and the OU phylogenetic regression.

log_marginal_likelihood_OU_regression$logml - log_marginal_likelihood_vanilla_regression$logml
196.557213674028

This is very strong evidence in favor of the OU model.

The HPD interval for \(\beta\) is very broad and includes 0. So there is no evidence for a significant effect of the order of adnominal property words on verb position.

Brownian motion

Before turning to the full OU model, we fit a Brownian motion baseline. Brownian motion is the \(\lambda \to 0\) limit of OU: it models phylogenetic drift without any restoring force. In brms, a Brownian motion random effect is available via gr(..., cov = A) where A is the phylogenetic covariance matrix. This is convenient, but the brms formulation is difficult to extend — adding geographic random effects, switching to OU, or modifying the covariance structure all require stepping outside the package. The Stan OU implementation in the next section is only marginally more code and offers full flexibility.

We use brms here purely as a convenient baseline implementation of Brownian motion.

A <- ape::vcv(pruned_tree, corr = TRUE)
d$Language_ID <- factor(d$Language_ID, levels = rownames(A))
system.time(
    fit_brownian_regression <- brm(
    formula = y ~ x + (1 | gr(Language_ID, cov = A)),
    data = d,
    family = bernoulli(link = "logit"),
    prior = c(
        prior(normal(0, 2), class = "b")
    ),
    data2 = list(A = A),
    control = list(adapt_delta = 0.99),
    iter = 2000,
    cores = 4,
    chains = 4,
    save_pars = save_pars(all = TRUE)
    )
)
Compiling Stan program...

Start sampling
   user  system elapsed 
 54.533   2.004  52.498 
summary(fit_brownian_regression)
 Family: bernoulli 
  Links: mu = logit 
Formula: y ~ x + (1 | gr(Language_ID, cov = A)) 
   Data: d (Number of observations: 100) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Multilevel Hyperparameters:
~Language_ID (Number of levels: 100) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     4.09      2.76     1.52    11.50 1.00      778      947

Regression Coefficients:
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept     0.55      1.43    -2.30     3.63 1.00     3067     2605
x            -0.91      0.85    -2.50     0.90 1.00     3869     2026

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
mcmc_areas(
    as.matrix(fit_brownian_regression),
    pars = "b_x",
    prob = 0.95
)

The HPD interval for the slope includes 0, so there is no credible evidence for an impact of x on y.

predictive checks

With brms, prior predictive simulation cannot be run as part of the main model fit. Instead, we have to fit the model again, this time with sample_prior = "only".

y_rep_brownian_regression <- posterior_predict(fit_brownian_regression)       # matrix: draws × observations

system.time(
    fit_prior <- brm(
    formula = y ~ x + (1 | gr(Language_ID, cov = A)),
    data = d,
    family = bernoulli(),
    data2 = list(A = A),
    prior = c(
        prior(normal(0, 2.5), class = "b")
    ),
    sample_prior = "only",
    control = list(adapt_delta = 0.95),
    cores = 4, chains = 4
    )
)
Compiling Stan program...

Start sampling
   user  system elapsed 
 54.572   1.837  52.192 
y_prior_brownian_regression <- posterior_predict(fit_prior)
predictive_checks_regression(y_prior_brownian_regression, y_rep_brownian_regression, stan_data)

The priors and posteriors look healthy.

model comparison

system.time(
    capture.output({
        log_marginal_likelihood_brownian_regression <- bridge_sampler(fit_brownian_regression)
    }, file = "/dev/null")
)
print(log_marginal_likelihood_brownian_regression)
   user  system elapsed 
  3.729   0.002   0.738 
Bridge sampling estimate of the log marginal likelihood: -54.88183
Estimate obtained in 6 iteration(s) via method "normal".
log_marginal_likelihood_OU_regression$logml - log_marginal_likelihood_brownian_regression$logml
186.644586782689

The OU model is favored over the Brownian-motion model by a wide margin.

Next we look at PSIS-LOOCV.

# extract log-likelihood matrix: draws x observations

log_lik3 <- extract_log_lik(fit_OU_regression, parameter_name = "log_lik")
log_lik4 <- log_lik(fit_brownian_regression)


# compute LOO

loo3 <- loo(log_lik3)
loo4 <- loo(log_lik4)

loo_list$OU_regression <- loo3
loo_list$brownian_regression <- loo4

# compare
loo_compare(loo_list)
Warning message:
“Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
”
Warning message:
“Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
”
A compare.loo: 4 × 8 of type dbl
                     elpd_diff   se_diff  elpd_loo se_elpd_loo     p_loo  se_p_loo     looic  se_looic
OU_regression          0.00000  0.000000 -31.22161    3.449315 23.477541 2.8654072  62.44322  6.898630
brownian_regression  -17.11986  3.363718 -48.34147    6.378125 20.098800 2.9554648  96.68295 12.756250
vanilla_correlation  -30.85310  3.287856 -62.07471    2.278148 27.944938 1.2846688 124.14941  4.556297
vanilla_regression   -34.90047  3.606842 -66.12208    3.149916  2.011269 0.1372601 132.24417  6.299831

So both the Bayes factor and PSIS-LOOCV favor OU over Brownian motion, and both favor models with phylogenetic control over models without it.

The regression models above treat one variable as a predictor and the other as a response — an asymmetry that is hard to justify in observational typological data where neither variable is experimentally controlled. A conceptually more natural model treats both features symmetrically as co-evolving: there is a latent variable governing adjective-noun order and a separate latent variable governing verb position, and both evolve jointly along the phylogeny. If a diachronic dependency exists between the two features, it will manifest as a correlation between their evolutionary trajectories.

This is the bivariate OU correlation model.

The model specification is as follows, where \(N\) is the number of languages, and \(t_{l_1, l_2}\) is the length of the path from the root of the tree to the most recent common ancestor of \(l_1\) and \(l_2\):

\[ \begin{align} R &\sim \text{LKJ}(2)\\ \Sigma_{(l_1, k_1), (l_2, k_2)} &= R_{k_1, k_2}\cdot \frac{\sigma_{k_1}\sigma_{k_2}}{\lambda_{k_1}+\lambda_{k_2}}\left(1-e^{-(\lambda_{k_1}+\lambda_{k_2})t_{l_1, l_2}}\right) \\ \mu_k &\sim \mathcal N(0, 2)\\ \sigma_k &\sim \text{Lognormal}(0,1)\\ \lambda_k &\sim \text{Lognormal}(0,1)\\ z &\sim \mathcal N(\text{repeat}(\mu, N), \Sigma)\\ x_i &\sim \text{Bernoulli}(\text{logit}^{-1}(z_{i,1}))\\ y_i &\sim \text{Bernoulli}(\text{logit}^{-1}(z_{i,2})) \end{align} \]
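To make the indexing concrete, here is a minimal base-R sketch assembling \(\Sigma\) for two languages and two traits (all numbers hypothetical; the actual model never forms \(\Sigma\) explicitly but simulates the bivariate process along the tree instead):

```r
# Shared time from root to the MRCA of each language pair (hypothetical);
# diagonal entries are the root-to-tip times of an ultrametric tree
t_shared <- matrix(c(5, 2,
                     2, 5), nrow = 2, byrow = TRUE)

sigma  <- c(1.0, 1.5)    # per-trait diffusion
lambda <- c(0.3, 0.6)    # per-trait mean reversion
R <- matrix(c( 1.0, -0.4,
              -0.4,  1.0), nrow = 2)  # trait correlation (LKJ prior in the model)

# Sigma over the 4 (language, trait) pairs, row (l1, k1) x column (l2, k2)
Sigma <- matrix(0, 4, 4)
for (l1 in 1:2) for (k1 in 1:2) for (l2 in 1:2) for (k2 in 1:2) {
  lam <- lambda[k1] + lambda[k2]
  Sigma[2 * (l1 - 1) + k1, 2 * (l2 - 1) + k2] <-
    R[k1, k2] * sigma[k1] * sigma[k2] / lam * (1 - exp(-lam * t_shared[l1, l2]))
}

isSymmetric(Sigma)            # TRUE
all(eigen(Sigma)$values > 0)  # positive definite for these values
```

The block structure is visible in the result: within-language blocks carry the trait correlation \(R\), while between-language blocks shrink with decreasing shared history.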

The Stan code for this model is in OU_correlation.stan. It is kept in a separate file because it is substantially longer than the regression model: the latent variable \(z\) is now \(2N\)-dimensional (two traits, \(N\) languages), requiring a full bivariate OU recursion and a corresponding prior predictive block.

stan_model_OU_correlation <- stan_model(file = "OU_correlation.stan")
system.time(
    fit_OU_correlation <- sampling(
        stan_model_OU_correlation,
        data = stan_data,
        iter = 10000,
        thin = 10,
        chains = 4,
        control = list(adapt_delta = 0.95, max_treedepth = 15)
    )
)
   user  system elapsed 
  5.692   0.473 268.376 
print(fit_OU_correlation, pars=c("mu", "sigma", "lambda", "rho"))
Inference for Stan model: anon_model.
4 chains, each with iter=10000; warmup=5000; thin=10; 
post-warmup draws per chain=500, total post-warmup draws=2000.

           mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
mu[1]      1.81    0.03 1.09  0.07  1.08  1.67  2.37  4.44  1837    1
mu[2]      0.71    0.03 1.28 -1.82 -0.05  0.64  1.43  3.44  1907    1
sigma[1]   3.19    0.05 2.25  0.72  1.82  2.67  3.91  9.06  1886    1
sigma[2]   4.33    0.08 3.55  0.95  2.10  3.31  5.36 13.38  1835    1
lambda[1]  0.21    0.01 0.28  0.05  0.11  0.16  0.24  0.62  1345    1
lambda[2]  0.11    0.00 0.04  0.04  0.08  0.10  0.13  0.21  2154    1
rho       -0.28    0.00 0.20 -0.67 -0.42 -0.28 -0.16  0.11  1687    1

Samples were drawn using NUTS(diag_e) at Sat Apr  4 08:12:14 2026.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).
mcmc_areas(
    as.matrix(fit_OU_correlation),
    pars = "rho",
    prob = 0.95
)

Since the HPD of \(\rho\) includes 0, we conclude that there is no credible evidence for a diachronic correlation between the two variables.

predictive checks

post_OU_correlation <- rstan::extract(fit_OU_correlation)

predictive_check_marginal_binary(
  x_prior = post_OU_correlation$x_prior,
  x_rep = post_OU_correlation$x_rep,
  x_obs = stan_data$x,
  varname = "x"
)

predictive_checks_regression(post_OU_correlation$y_prior, post_OU_correlation$y_rep, stan_data)

The prior and posterior simulations look okay.

It seems tempting to also use the posterior predictive simulation to estimate the equilibrium probabilities. However, this would be wrong, because the posterior predictive simulation also uses the posterior of the latent variable z. Instead, we have to use the posterior of the other parameters and simulate z afresh, obtaining posterior predictions for a replicate of the tree that disregards the actual history at its nodes.

I added a block to this effect in the Stan code, producing x_replicated_noise and y_replicated_noise.

x_replicated_noise = post_OU_correlation$x_replicated_noise
y_replicated_noise = post_OU_correlation$y_replicated_noise


# Create a function to compute joint probabilities for one sample
joint_freqs <- function(x, y) {
  tab <- table(factor(x, levels = 0:1), factor(y, levels = 0:1))
  as.vector(tab) / length(x)  # returns proportions in order: (0,0), (0,1), (1,0), (1,1)
}

# Apply over all 2000 posterior samples
joint_probs <- t(mapply(joint_freqs, 
                        split(x_replicated_noise, row(x_replicated_noise)), 
                        split(y_replicated_noise, row(y_replicated_noise))))

colnames(joint_probs) <- c("(0,0)", "(0,1)", "(1,0)", "(1,1)")

x_range = c(0,1)
df <- as.data.frame(joint_probs) %>%
  mutate(draw = 1:n()) %>%
  pivot_longer(-draw, names_to = "Combination", values_to = "Probability")

ggplot(df, aes(x = Probability, fill = Combination)) +
  geom_density(alpha = 0.6) +
  facet_wrap(~ Combination, scales = "fixed") +  # enforce same scale
  scale_x_continuous(limits = x_range) +         # set common x-axis limits
  theme_minimal() +
  labs(title = "Equilibrium Distribution of Joint Outcomes",
       x = "Estimated Probability",
       y = "Density")

model comparison

Bayes factor:

system.time(
    capture.output({
        log_marginal_likelihood_OU_correlation <- bridge_sampler(
            fit_OU_correlation, silent = TRUE, maxiter = 10000
        )
    }, file = "/dev/null"
    )
)
print(log_marginal_likelihood_OU_correlation)
Warning message:
“logml could not be estimated within maxiter, rerunning with adjusted starting value. 
Estimate might be more variable than usual.”
   user  system elapsed 
 55.551   0.043  51.657 
Bridge sampling estimate of the log marginal likelihood: 263.1094
Estimate obtained in 10001 iteration(s) via method "normal".
log_marginal_likelihood_OU_correlation$logml - log_marginal_likelihood_vanilla_corr$logml
207.152106739818

Again we find overwhelming evidence in favor of the phylogenetic model.

We can complement model comparison via Bayes factors with Pareto-smoothed importance sampling leave-one-out cross-validation (PSIS-LOOCV).

# extract log-likelihood matrix: draws x observations

log_lik5 <- extract_log_lik(fit_OU_correlation, parameter_name = "log_lik_y")


# compute LOO

loo5 <- loo(log_lik5)

loo_list$OU_correlation <- loo5

# compare
loo_compare(loo_list)
Warning message:
“Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
”
A compare.loo: 5 × 8 of type dbl
                      elpd_diff   se_diff  elpd_loo se_elpd_loo     p_loo  se_p_loo     looic  se_looic
OU_regression          0.000000  0.000000 -31.22161    3.449315 23.477541 2.8654072  62.44322  6.898630
OU_correlation        -2.066913  1.378074 -33.28852    4.210245 23.622574 3.4623539  66.57705  8.420491
brownian_regression  -17.119864  3.363718 -48.34147    6.378125 20.098800 2.9554648  96.68295 12.756250
vanilla_correlation  -30.853097  3.287856 -62.07471    2.278148 27.944938 1.2846688 124.14941  4.556297
vanilla_regression   -34.900474  3.606842 -66.12208    3.149916  2.011269 0.1372601 132.24417  6.299831

The three phylogenetic models perform massively better than the vanilla models. Furthermore, if our goal is to predict the verb position of an unseen language from its noun-adjective order and its position in the tree, the OU regression model is superior to the correlation model.

As a nice side effect of explicitly modeling the diachronic histories, we can plot the posterior means of the latent variables at the internal nodes, thereby visualizing their inferred evolution along the tree.

posterior_df <- as_draws_df(fit_OU_correlation)
z_posterior <- posterior_df %>%
  spread_draws(z[i, j]) %>%
  group_by(i, j) %>%
  summarise(mean_z = mean(z), .groups = "drop")
z_adjective <- z_posterior %>%
    filter(j == 1) %>%
    pull(mean_z)
z_vo <- z_posterior %>%
    filter(j == 2) %>%
    pull(mean_z)
pruned_tree$node.label <- as.character(
    (length(pruned_tree$tip.label) + 1):(length(pruned_tree$tip.label) + pruned_tree$Nnode)
)


tree_plot <- ggtree(pruned_tree)
tree_info <- tree_plot$data
tree_info$node.label <- as.character(tree_info$label)
for (i in 1:nrow(tree_info)) {
    if (tree_info$isTip[i]) {
        # If it's a tip, look up its Glottocode among the tree's tip labels
        glottocode <- tree_info$label[i]
        index <- which(pruned_tree$tip.label == glottocode)
        # and use the tip index as the node label
        tree_info$node.label[i] <- index
    } else {
        # If it's an internal node, keep the numeric label assigned above
        tree_info$node.label[i] <- as.integer(tree_info$node.label[i])
    }
}


node_data <- tibble(
  node.label = as.character(1:length(z_adjective)),
  z_adjective = z_adjective,
  z_vo = z_vo
)


tree_data <- tree_info %>%
  left_join(node_data, by = "node.label")

tree_data <- tree_data %>%
    left_join(
        rename(select(d, Language_ID, GB193, GB133),
        label=Language_ID)
    ) 
tree_data <- tree_data %>%
  mutate(GB193 = factor(GB193)) %>%
  mutate(GB133 = factor(GB133))
Joining with `by = join_by(label)`
custom_palette <- paletteer_c("scico::roma", 30)
p1 <- ggtree(pruned_tree, layout = "circular", branch.length = "none") %<+% tree_data +
  geom_tree(aes(color = plogis(z_adjective)), size = 1) +
  scale_color_gradientn(colors = custom_palette, name = "p(NAdj)") +
  ggnewscale::new_scale_color() +
  geom_tippoint(aes(color = GB193), size = 3) +
  scale_color_manual(values = c("1" = custom_palette[1], "2" = custom_palette[30])) +
  guides(color = guide_legend(override.aes = list(size = 4))) +
  theme(
    legend.position = "left",
    plot.margin = margin(0, 0, 0, 0),
    text = element_text(size = 18),
    legend.text = element_text(size = 18)
  )
Warning message:
“Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
ℹ The deprecated feature was likely used in the ggtree package.
  Please report the issue at <https://github.com/YuLab-SMU/ggtree/issues>.”
p2 <- ggtree(pruned_tree, layout = "circular", branch.length = "none") %<+% tree_data +
  geom_tree(aes(color = plogis(z_vo)), size = 1) +
  scale_color_gradientn(colors = custom_palette, name = "p(VO)") +
  ggnewscale::new_scale_color() +
  geom_tippoint(aes(color = GB133), size = 3) +
  scale_color_manual(values = c("0" = custom_palette[1], "1" = custom_palette[30])) +
  guides(color = guide_legend(override.aes = list(size = 4))) +
  theme(
    plot.margin = margin(0, 0, 0, 0),
    text = element_text(size = 18),
    legend.text = element_text(size = 18)
  )
options(repr.plot.width = 18, repr.plot.height = 10) 
plot_grid(p1, p2, ncol = 2, align = "hv", axis = "tb", rel_widths = c(1, 1))

options(repr.plot.width = 12, repr.plot.height = 8)  # For base R plots in Jupyter

Another advantage of this approach (both the regression and the correlation variant) is that it can easily be combined with additional random effects, such as family-level, macroarea-level, or geographic effects.
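As a minimal sketch of the R-side data preparation such a random effect would need (the Family_ID values below are made up for illustration, and family_index / n_families are hypothetical names for the extra data passed to Stan):

```r
# Toy family assignments; in the real data this would be a Glottolog
# family column aligned with the languages in the tree.
family_id <- c("indo1319", "indo1319", "ural1272", "ural1272", "turk1311")

# Stan expects a 1-based integer index plus the number of levels.
family_factor <- factor(family_id)
family_index  <- as.integer(family_factor)
n_families    <- nlevels(family_factor)

# In the Stan model one would then declare vector[n_families] family_effect
# with, e.g., a normal(0, sigma_family) prior, and add
# family_effect[family_index[i]] to the linear predictor of language i.
family_index
n_families
```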

More details can be found in my arXiv paper Computational Typology (https://arxiv.org/abs/2504.15642).

Final comparison

We now compare all models together. First, the Bayes factor between the best OU model and the CTMC approach:

(log_marginal_likelihood_OU_correlation$logml - log_marginal_likelihood_ctmc_independent$logml)
369.700948887178

The OU correlation model is overwhelmingly preferred over the CTMC approach. Now let’s do PSIS-LOOCV for all models.

One methodological note: for the OU and Brownian models, the LOO log-likelihoods are extracted from the posterior samples and the leave-one-out scores are estimated via Pareto-smoothed importance sampling (PSIS), which is an approximation. For the CTMC models, we instead computed exact leave-one-out log-likelihoods directly in the generated quantities block via a full Felsenstein pruning run with each tip held out in turn. Despite the different computation strategies, loo() treats both identically for the final comparison.

# extract log-likelihood matrix: draws x observations
log_lik6 <- extract_log_lik(fit_ctmc_independent, parameter_name = "y_log_lik")
log_lik7 <- extract_log_lik(fit_ctmc_dependent, parameter_name = "y_log_lik")


# compute LOO
loo6 <- loo(log_lik6)
loo7 <- loo(log_lik7)

loo_list$ctmc_independent <- loo6
loo_list$ctmc_dependent <- loo7

# compare
loo_compare(loo_list)
A compare.loo: 7 × 8 of type dbl

                      elpd_diff    se_diff   elpd_loo  se_elpd_loo      p_loo   se_p_loo      looic   se_looic
OU_regression          0.000000   0.000000  -31.22161     3.449315  23.477541  2.8654072   62.44322   6.898630
OU_correlation        -2.066913   1.378074  -33.28852     4.210245  23.622574  3.4623539   66.57705   8.420491
ctmc_independent     -10.602767   2.177939  -41.82438     4.928832   1.406254  0.2599591   83.64875   9.857664
ctmc_dependent       -11.106186   2.278025  -42.32780     5.127281   2.338103  0.4079971   84.65559  10.254562
brownian_regression  -17.119864   3.363718  -48.34147     6.378125  20.098800  2.9554648   96.68295  12.756250
vanilla_correlation  -30.853097   3.287856  -62.07471     2.278148  27.944938  1.2846688  124.14941   4.556297
vanilla_regression   -34.900474   3.606842  -66.12208     3.149916   2.011269  0.1372601  132.24417   6.299831

Pareto-smoothed leave-one-out cross-validation leads to a substantially different ranking than the Bayes factor comparison. The two metrics measure different things:

  • The Bayes factor evaluates the joint model of both variables \((x, y)\) together. A model that fits the joint distribution better — capturing how \(x\) and \(y\) co-evolve — will be favored, regardless of whether it is particularly useful for predicting one variable from the other.
  • The LOOIC here measures predictive accuracy for \(y\) alone (verb-final order), given the observed \(x\) (adjective-noun order) for each language. The CTMC models, despite their coarser representation, happen to make relatively sharp predictions about \(y\) given \(x\) for individual left-out languages, because the discrete state space limits the range of predicted probabilities.

In short: the OU correlation model is a better generative model of the data, but the CTMC model is not necessarily worse at the specific prediction task of inferring verb position from word order in a leave-one-out setting.

# NB: the order of `type` and of the marginal log-likelihoods below
# must match names(loo_list)
marginal_loglikelihoods <- tibble(
    model = names(loo_list),
    type = c(
        'regression',
        'correlation',
        'regression',
        'regression',
        'correlation',
        'correlation',
        'correlation'
    ),
    marginal_loglikelihood = c(
        log_marginal_likelihood_vanilla_regression$logml,
        log_marginal_likelihood_vanilla_corr$logml,
        log_marginal_likelihood_OU_regression$logml,
        log_marginal_likelihood_brownian_regression$logml,
        log_marginal_likelihood_OU_correlation$logml,
        log_marginal_likelihood_ctmc_independent$logml,
        log_marginal_likelihood_ctmc_dependent$logml
    )
)
marginal_loglikelihoods %>%
    filter(type == "regression") %>%
    arrange(desc(marginal_loglikelihood))
A tibble: 3 × 3
model                type        marginal_loglikelihood
OU_regression        regression               131.76276
brownian_regression  regression               -54.88183
vanilla_regression   regression               -64.79445
marginal_loglikelihoods %>%
    filter(type == "correlation") %>%
    arrange(desc(marginal_loglikelihood))
A tibble: 4 × 3
model                type         marginal_loglikelihood
OU_correlation       correlation               263.10942
vanilla_correlation  correlation                55.95731
ctmc_independent     correlation              -106.59153
ctmc_dependent       correlation              -112.92709

Both metrics agree on the main point: the OU models come out on top, outperforming the CTMC models as well as the Brownian and vanilla baselines. The match between the statistical results and the conceptual argument is reassuring: OU is not just computationally convenient but a genuinely better-fitting model for typological data.

Practical recommendation: For binary typological features, the OU correlation model (bivariate, both traits modeled jointly) is the preferred choice when the research question concerns co-evolution or correlation. The OU regression model is appropriate when the question is directional — whether one feature predicts another. In both cases the Stan implementation provided here is straightforward to adapt to other feature pairs, other phylogenies, or additional random effects (geographic, family-level, etc.).

Final thoughts

All the models discussed here use a single phylogeny. It is possible to use a sample of phylogenies to factor in phylogenetic uncertainty in this framework, but it makes the models more complex and the fitting computationally more demanding.

Essentially, what you have to do is:

  • Construct a 3-dimensional edge array (number of edges \(\times\) 2 \(\times\) number of trees), where edges[e, 1, k] and edges[e, 2, k] hold the parent and child node of edge e in tree number k. You have to make sure that all trees are in postorder.

  • Construct a (number of edges \(\times\) number of trees) matrix edge_lengths holding the branch length of each edge in each tree.

  • In the Stan model, you have to loop over all edges of all trees, accumulating each tree's log-probability contribution separately. So the phylogenetic recursion should look like

vector[Ntrees] tree_lp = rep_vector(0, Ntrees);
for (tr in 1:Ntrees) {
    for (e in 1:Nedges) {
        int parent = edges[e, 1, tr];
        int child = edges[e, 2, tr];
        real t = edge_lengths[e, tr];
        ...
    }
}
  • Inside the model block, combine the per-tree contributions as an equal-weight mixture:
target += log_sum_exp(tree_lp) - log(Ntrees);

This averages over the likelihoods of all trees in the tree sample. (Note that adding every tree's log-likelihood directly to target and then merely subtracting the constant log(Ntrees) would multiply the likelihoods rather than average them.) Since the likelihoods of all trees in the sample have to be computed, this will slow down model fitting quite a bit.
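The first two bullet points can be sketched in base R. Below is a toy illustration with two hard-coded three-tip trees; in the real analysis, the trees would be read with ape and brought into postorder with reorder(tree, order = "postorder"):

```r
# Toy stand-in for a posterior tree sample. Each tree is a list with an $edge
# matrix (parent, child) already in postorder and a vector of branch lengths.
# Nodes 1-3 are tips, 4 is the root, 5 an internal node.
trees <- list(
  list(edge = matrix(c(5L, 1L, 5L, 2L, 4L, 5L, 4L, 3L), ncol = 2, byrow = TRUE),
       edge.length = c(0.1, 0.1, 0.2, 0.3)),
  list(edge = matrix(c(5L, 2L, 5L, 3L, 4L, 5L, 4L, 1L), ncol = 2, byrow = TRUE),
       edge.length = c(0.15, 0.05, 0.25, 0.35))
)

n_trees <- length(trees)
n_edges <- nrow(trees[[1]]$edge)   # identical across trees over the same taxa

# number-of-edges x 2 x number-of-trees array of parent/child node indices
edges <- array(NA_integer_, dim = c(n_edges, 2, n_trees))
# number-of-edges x number-of-trees matrix of branch lengths
edge_lengths <- matrix(NA_real_, nrow = n_edges, ncol = n_trees)

for (k in seq_len(n_trees)) {
  edges[, , k] <- trees[[k]]$edge
  edge_lengths[, k] <- trees[[k]]$edge.length
}
```

These two objects can then be passed to Stan directly as data.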

A working implementation of this multi-tree approach — using the 902 posterior EDGE trees included in the repository — is provided in the companion notebook phylogenetic_OU_regression_julia.qmd, which uses Turing.jl instead of Stan. The Julia version also demonstrates how to use Pigeons.jl for more efficient sampling when the posterior is multi-modal due to phylogenetic uncertainty.

References

Bouckaert, R., Redding, D., Sheehan, O., Kyritsis, T., Gray, R., Jones, K. E., & Atkinson, Q. (2022). Global language diversification is linked to socio-ecology and threat status. SocArXiv. https://osf.io/f8tr6/download

Dryer, M. S. (1992). The Greenbergian word order correlations. Language, 68(1), 81–138. https://doi.org/10.2307/416463

Pagel, M., & Meade, A. (2006). Bayesian analysis of correlated evolution of discrete characters by reversible-jump Markov chain Monte Carlo. The American Naturalist, 167(6), 808–825. https://doi.org/10.1086/503444