suppressPackageStartupMessages({
suppressWarnings({
library(tidyverse)
library(rstan)
library(brms)
library(parallel)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores()) # Enable parallel Stan sampling
library(ape)
library(R.utils)
library(bridgesampling)
library(loo)
library(bayesplot)
library(codetools)
library(ggtree)
library(posterior)
library(tidybayes)
library(ggthemes)
library(paletteer)
library(ggnewscale)
library(cowplot)
library(scico)
options(repr.plot.width = 12, repr.plot.height = 8) # Plot size in Jupyter
theme_set(theme_minimal(base_size = 24)) # Increase base font size
})
})
Running a Phylogenetic OU Regression with R and Stan
In this notebook, I will illustrate how one can set up a logistic regression with a phylogenetic random effect following an Ornstein-Uhlenbeck process using R and Stan.
This kind of analysis has conceptual advantages compared with alternatives - such as phylogenetic random effects following a Brownian motion, or correlated evolution using discrete characters and the CTMC model. However, while there are off-the-shelf solutions for the latter two, the OU regression is, to my knowledge, not yet implemented in a dedicated function or package. Such a model can be set up quite easily in Stan, however. This notebook serves to illustrate how this can be done.
Preparing the environment
The git repository https://github.com/gerhardJaeger/OU_logistic_example holds the source code for this notebook. It contains a file OU_logistic_example.yml. You can use it to set up a conda environment containing all the required packages. (Of course you can also install the packages with install.packages if you don’t use conda.) At the command prompt, in the home directory of the repository, do
conda env create -f OU_logistic_example.yml
conda activate OU_logistic_example
Loading the required packages
Data
Now let’s get some data. I will use the EDGE tree (Bouckaert et al. 2022).
# Create data directory if it doesn't exist
if (!dir.exists("../data")) dir.create("../data", recursive = TRUE)
# Download the file if it doesn't exist
tree_gz <- "../data/global-language-tree-MCC-labelled.tree.gz"
tree_file <- "../data/global-language-tree-MCC-labelled.tree"
if (!file.exists(tree_file)) {
download.file(
url = "https://github.com/rbouckaert/global-language-tree-pipeline/releases/download/v1.0.0/global-language-tree-MCC-labelled.tree.gz",
destfile = tree_gz,
mode = "wb"
)
R.utils::gunzip(tree_gz, destname = tree_file, overwrite = TRUE)
# Remove the gzipped file after extraction
if (file.exists(tree_gz)) file.remove(tree_gz)
}
edge_tree <- read.nexus(tree_file)
edge_tree$tip.label <- sapply(strsplit(edge_tree$tip.label, "_"), `[`, 1)
It is advisable to rescale the tree so that the mean branch length is 1.
edge_tree$edge.length <- edge_tree$edge.length / mean(edge_tree$edge.length)
Grambank data
I will look at the somewhat boring feature pair:
- GB193: What is the order of adnominal property word and noun?
- GB133: Is a pragmatically unmarked constituent order verb-final for transitive clauses?
First we get the grambank data.
grambank_file <- "../data/grambank_vals.csv"
if (!file.exists(grambank_file)) {
values_url <- "https://raw.githubusercontent.com/grambank/grambank/refs/heads/master/cldf/values.csv"
vals <- read_csv(values_url, na = "")
write_csv(vals, grambank_file)
} else {
vals <- read_csv(grambank_file)
}
Rows: 441663 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): ID, Language_ID, Parameter_ID, Value, Code_ID, Comment, Source, Sou...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Now we filter out the datapoints for the two parameters of interest and the languages in the tree. I only consider two values per parameter, and only languages having a relevant value for both parameters.
d <- vals %>%
filter(Language_ID %in% edge_tree$tip.label) %>%
filter(Parameter_ID %in% c("GB193", "GB133")) %>%
select(Language_ID, Parameter_ID, Value) %>%
filter(Value %in% c(0, 1, 2)) %>%
mutate(Value = as.numeric(Value)) %>%
pivot_wider(names_from = Parameter_ID, values_from = Value) %>%
filter(GB193 > 0) %>%
drop_na()
Now we prune the tree to only include the languages in the data.
pruned_tree <- drop.tip(edge_tree, setdiff(edge_tree$tip.label, d$Language_ID))
# Check that d$Language_ID and pruned_tree$tip.label are the same set
if (!setequal(d$Language_ID, pruned_tree$tip.label)) {
warning("Mismatch between d$Language_ID and pruned_tree$tip.label")
print(setdiff(d$Language_ID, pruned_tree$tip.label))
print(setdiff(pruned_tree$tip.label, d$Language_ID))
} else {
message("d$Language_ID and pruned_tree$tip.label match.")
}
d$Language_ID and pruned_tree$tip.label match.
Now the rows in d are rearranged to match the order of the tips in the pruned tree.
d <- d[match(pruned_tree$tip.label, d$Language_ID), ]
Finally we recode the two features as 0/1 variables.
d <- d %>%
mutate(x = GB193 - 1,
y = GB133
)
For now, I keep the data set small and restrict myself to 100 taxa.
set.seed(123) # fix the RNG seed for reproducibility
d <- d[sample(nrow(d), 100), ]
pruned_tree <- drop.tip(pruned_tree, setdiff(pruned_tree$tip.label, d$Language_ID))
d <- d[match(pruned_tree$tip.label, d$Language_ID), ]
Let’s have a look at the tree and the data.
ggtree(pruned_tree, layout="circular") +
geom_tree() +
geom_tiplab() + # Adds tip labels
theme_tree2() # Adds scale axis (x-axis for branch lengths)
d %>% head(10)
| Language_ID | GB133 | GB193 | x | y |
|---|---|---|---|---|
| <chr> | <dbl> | <dbl> | <dbl> | <dbl> |
| adan1251 | 1 | 2 | 1 | 1 |
| ajas1235 | 0 | 1 | 0 | 0 |
| alab1254 | 1 | 1 | 0 | 1 |
| alya1239 | 0 | 2 | 1 | 0 |
| arch1244 | 1 | 1 | 0 | 1 |
| aree1239 | 1 | 2 | 1 | 1 |
| arig1246 | 0 | 1 | 0 | 0 |
| awar1249 | 1 | 2 | 1 | 1 |
| band1344 | 0 | 1 | 0 | 0 |
| bayb1234 | 0 | 1 | 0 | 0 |

Let’s save the data for later use.
write_csv(d, "../data/grambank_vals_pruned.csv")
Stan model
In ape, the tree is represented as a phylo object; its edge matrix and edge lengths translate directly into Stan data structures.
Just to be on the safe side, we make sure that the branches are in postorder.
pruned_tree <- reorder(pruned_tree, "postorder")
Now we code the tree and the data as a data list.
stan_data <- list(
N = nrow(d), # number of languages
Nnodes = length(unique(as.vector(pruned_tree$edge))),
root_node = length(pruned_tree$tip.label) + 1,
x = d$x, # predictor (adnominal property word-noun order, GB193)
y = d$y, # response (verb-final order, GB133)
Nedges = nrow(pruned_tree$edge), # number of edges
edges = pruned_tree$edge, # edges of the tree
edge_lengths = pruned_tree$edge.length # lengths of the edges
)
Vanilla logistic regression
We start with a simple logistic regression without any phylogenetic effects, just to see how it works.
The model formula is:
\[ \begin{align} \alpha &\sim \text{Student-t}(3, 0, 2.5)\\ \beta &\sim \mathcal N(0, 2)\\ \mu_x &:= \frac{1}{|x|}\sum_{i}x_i\\ y &\sim \text{Bernoulli}(\text{logit}^{-1}(\alpha + \beta (x-\mu_x))) \end{align} \]
The intercept prior follows the default used by brms for logistic regression; for the slope, the improper flat default of brms is replaced by a weakly informative normal prior.
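If you want to check what brms itself would assume, its defaults can be inspected directly (a quick sketch, assuming the data frame d constructed above):

```r
# Show the priors brms would use by default for this model;
# typically student_t(3, 0, 2.5) on the Intercept and an
# improper flat prior on the population-level slope.
get_prior(y ~ x, data = d, family = bernoulli())
```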
Let’s run this in Stan.
stan_code_vanilla <- "
data {
int<lower=1> N; // Number of observations
vector<lower=0, upper=1>[N] x; // Predictor variable
array[N] int<lower=0,upper=1> y; // Response variable (binary)
}
transformed data {
vector[N] Xc; // centered version of X without an intercept
real means_X; // mean of X before centering
means_X = mean(x);
Xc = x - means_X;
}
parameters {
real alpha; // Intercept
real beta; // Slope
}
model {
// Priors
alpha ~ student_t(3, 0, 2.5);
beta ~ normal(0, 2);
// Likelihood
// y ~ bernoulli_logit(alpha + beta * Xc);
target += bernoulli_logit_lpmf(y | alpha + beta * Xc);
}
generated quantities {
vector[N] log_lik;
array[N] int y_rep; // Posterior predictive samples
array[N] int y_prior; // Prior predictive samples
real alpha_prior;
real beta_prior;
for (n in 1:N) {
// Log-likelihood
log_lik[n] = bernoulli_logit_lpmf(y[n] | alpha + beta * Xc[n]);
// Posterior predictive
y_rep[n] = bernoulli_logit_rng(alpha + beta * Xc[n]);
// Prior predictive: draw new alpha and beta from prior, then simulate
alpha_prior = student_t_rng(3, 0, 2.5);
beta_prior = normal_rng(0, 2);
y_prior[n] = bernoulli_logit_rng(alpha_prior + beta_prior * Xc[n]);
}
}
"
system.time(stan_model_vanilla <- stan_model(model_code = stan_code_vanilla))
user system elapsed
49.456 1.517 45.835
system.time(
fit_vanilla <- sampling(stan_model_vanilla, data = stan_data, iter = 2000, chains = 4)
)
user system elapsed
3.068 0.331 2.775
print(fit_vanilla, pars = c("alpha", "beta"))
Inference for Stan model: anon_model.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
alpha -0.47 0.00 0.21 -0.88 -0.62 -0.47 -0.33 -0.06 3803 1
beta -1.00 0.01 0.44 -1.87 -1.31 -0.99 -0.71 -0.15 3735 1
Samples were drawn using NUTS(diag_e) at Sun Nov 16 15:37:06 2025.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
mcmc_areas(
as.matrix(fit_vanilla),
pars="beta",
prob = 0.95
)
The 95% interval of the posterior distribution of the slope \(\beta\) excludes zero, indicating a credible relationship between the predictor variable x and the response variable y. The direction is negative, meaning that N-ANM order disfavors OV order.
For model comparison, we can use the bridgesampling package to compute the log marginal likelihood.
system.time(
capture.output({
log_marginal_likelihood_vanilla_regression <- bridge_sampler(fit_vanilla)
}, file = "/dev/null")
)
print(log_marginal_likelihood_vanilla_regression)
user system elapsed
2.503 0.000 0.574
Bridge sampling estimate of the log marginal likelihood: -64.79445
Estimate obtained in 5 iteration(s) via method "normal".
Now we do visual checks of the model using prior and posterior predictive simulations.
predictive_checks_regression <- function(y_prior, y_rep, stan_data) {
x0_idx <- which(stan_data$x == 0)
x1_idx <- which(stan_data$x == 1)
obs_p_y1_x0 <- mean(stan_data$y[stan_data$x == 0])
obs_p_y1_x1 <- mean(stan_data$y[stan_data$x == 1])
y_prior_df <- as_tibble(y_prior) %>%
setNames(1:ncol(y_prior)) %>%
mutate(sample_no = 1:nrow(.)) %>%
pivot_longer(-sample_no, names_to = "obs", values_to = "y") %>%
mutate(type = "y_prior") %>%
mutate(obs = as.integer(obs))
y_rep_df <- as_tibble(y_rep) %>%
setNames(1:ncol(y_rep)) %>%
mutate(sample_no = 1:nrow(.)) %>%
pivot_longer(-sample_no, names_to = "obs", values_to = "y") %>% mutate(type = "y_rep") %>%
mutate(obs = as.integer(obs))
y_obs_df <- tibble(
sample_no = 0,
obs = 1:stan_data$N,
y = stan_data$y,
type = "y_obs"
)
df <- bind_rows(y_prior_df, y_rep_df, y_obs_df)
df <- df %>%
mutate(x_obs = stan_data$x[df$obs])
empirical_df <- df %>%
filter(sample_no == 0) %>%
group_by(x_obs) %>%
summarise(p_y1 = mean(y), .groups = "drop")
p <- df %>%
filter(sample_no != 0) %>%
group_by(sample_no, type, x_obs) %>%
summarise(p_y1 = mean(y), .groups = "drop") %>%
ggplot(aes(x = p_y1, fill = type, color = type)) +
geom_density(alpha = 0.3, adjust = 1.2) +
geom_vline(
data = empirical_df,
aes(xintercept = p_y1),
color = "red",
linewidth = 0.7,
linetype = "dashed"
) +
facet_wrap(
~x_obs,
nrow = 1,
labeller = labeller(x_obs = c(`0` = "x = 0", `1` = "x = 1"))
) +
labs(
title = "Predictive Check: Prior vs Posterior vs Observed",
x = "Proportion of y = 1 (per draw)",
y = "Density"
) +
theme_minimal(base_size = 14)
return(p)
}
post_vanilla_regression <- rstan::extract(fit_vanilla)
predictive_checks_regression(post_vanilla_regression$y_prior, post_vanilla_regression$y_rep, stan_data)

Vanilla logistic correlation study
A regression model assumes an asymmetry between the predictor variable (which is not modeled) and the response variable (which is modeled). This is not really justified in observational studies, where we have no control over the predictor variable. A more appropriate model is one where both variables are generated by some stochastic process.
This can be modeled as follows:
\[ \begin{align} z &\sim \mathcal N(\mu, \Sigma)\\ \Sigma &= \text{diag}(\sigma) \cdot R \cdot \text{diag}(\sigma)\\ R &\sim \text{LKJ}(2)\\ \mu_k &\sim \mathcal N(0, 2)\\ \sigma_k &\sim \text{Lognormal}(0, 1)\\ x &\sim \text{Bernoulli}(\text{logit}^{-1}(z_1))\\ y &\sim \text{Bernoulli}(\text{logit}^{-1}(z_2)) \end{align} \]
Here \(z\) is a bivariate normal variable with mean \(\mu\) and covariance \(\Sigma\). The covariance matrix \(\Sigma\) is parameterized as the product of a diagonal matrix of standard deviations \(\sigma = [\sigma_1, \sigma_2]\) and a correlation matrix \(R\). This allows the two dimensions to have different scales while still modeling their correlation. The correlation between them is given by \(\rho = R_{12} = R_{21}\). The correlation matrix is generated from an LKJ distribution, which is a flexible way to model correlations. The two variables \(x\) and \(y\) are then generated from the first and second components of \(z\), respectively, using the logistic function.
The interesting question is then whether the correlation between \(x\) and \(y\), i.e., \(\rho = R_{21}\), is credibly different from zero. If it is, this indicates a correlation between the two variables, which is what we are interested in.
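Before fitting, the generative process can be sketched in base R; the parameter values below are purely illustrative, not estimates:

```r
# Simulate from the latent-correlation model with made-up parameters.
set.seed(1)
N <- 100
mu <- c(0, 0)
sigma <- c(1, 1)
rho <- 0.5
# Sigma = diag(sigma) * R * diag(sigma)
Sigma <- diag(sigma) %*% matrix(c(1, rho, rho, 1), 2, 2) %*% diag(sigma)
# t(chol(Sigma)) is the lower-triangular Cholesky factor
z <- matrix(mu, 2, N) + t(chol(Sigma)) %*% matrix(rnorm(2 * N), 2, N)
x_sim <- rbinom(N, 1, plogis(z[1, ]))
y_sim <- rbinom(N, 1, plogis(z[2, ]))
```

Fitting the Stan model below to x_sim and y_sim should recover a posterior for rho concentrated near 0.5.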
stan_code_vanilla_corr <- "
data {
int<lower=0> N;
array[N] int<lower=0, upper=1> x; // first binary variable
array[N] int<lower=0, upper=1> y; // second binary variable
}
parameters {
vector[2] mu; // Mean of latent variables
vector<lower=0>[2] sigma; // Standard deviation of latent variables
cholesky_factor_corr[2] L_R; // Cholesky factor of correlation matrix
matrix[2, N] z_raw; // Standard normal latent variables
}
transformed parameters {
real rho = L_R[2, 1]; // Extract the correlation coefficient from the Cholesky factor
matrix[2, N] z; // Latent variables with correlation
z = rep_matrix(mu, N) + diag_matrix(sigma) * L_R * z_raw;
}
model {
// Priors
mu ~ normal(0, 2); // Prior for the mean of latent variables
sigma ~ lognormal(0,1); // Prior for the standard deviation of latent variables
L_R ~ lkj_corr_cholesky(2); // Prior for the Cholesky factor of the correlation matrix
to_vector(z_raw) ~ normal(0, 1); // standard normal prior for latent
// Likelihood
for (n in 1:N) {
// x[n] ~ bernoulli_logit(z[1, n]);
// y[n] ~ bernoulli_logit(z[2, n]);
target += bernoulli_logit_lpmf(x[n] | z[1, n]);
target += bernoulli_logit_lpmf(y[n] | z[2, n]);
}
}
generated quantities {
vector[N] log_lik_x;
vector[N] log_lik_y;
array[N] int x_rep;
array[N] int y_rep;
array[N] int x_prior;
array[N] int y_prior;
for (n in 1:N) {
log_lik_x[n] = bernoulli_logit_lpmf(x[n] | z[1, n]);
log_lik_y[n] = bernoulli_logit_lpmf(y[n] | z[2, n]);
x_rep[n] = bernoulli_logit_rng(z[1, n]);
y_rep[n] = bernoulli_logit_rng(z[2, n]);
}
// Prior predictive simulation
{
vector[2] mu_prior;
vector[2] sigma_prior;
mu_prior[1] = normal_rng(0, 2);
mu_prior[2] = normal_rng(0, 2);
sigma_prior[1] = lognormal_rng(0, 1);
sigma_prior[2] = lognormal_rng(0, 1);
matrix[2, 2] R_prior = lkj_corr_rng(2, 2.0); // Sample from LKJ prior
matrix[2, 2] L_prior = cholesky_decompose(R_prior);
matrix[2, N] z_raw_prior;
matrix[2, N] z_prior;
for (n in 1:N)
for (d in 1:2)
z_raw_prior[d, n] = normal_rng(0, 1);
z_prior = rep_matrix(mu_prior, N) + diag_matrix(sigma_prior) * L_prior * z_raw_prior;
for (n in 1:N) {
x_prior[n] = bernoulli_logit_rng(z_prior[1, n]);
y_prior[n] = bernoulli_logit_rng(z_prior[2, n]);
}
}
}
"
system.time(stan_model_vanilla_corr <- stan_model(model_code = stan_code_vanilla_corr))
user system elapsed
56.529 1.559 54.962
system.time(
fit_vanilla_corr <- sampling(
stan_model_vanilla_corr,
data = stan_data,
iter = 2000,
chains = 4,
control = list(adapt_delta = 0.95)
)
)
Warning message:
“Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
Running the chains for more iterations may help. See
https://mc-stan.org/misc/warnings.html#bulk-ess”
user system elapsed
15.789 0.663 9.616
print(fit_vanilla_corr, pars = c("mu", "rho", "sigma"))
Inference for Stan model: anon_model.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
mu[1] 1.36 0.03 0.65 0.54 0.93 1.21 1.63 3.07 639 1.00
mu[2] -0.88 0.03 0.69 -2.89 -1.10 -0.70 -0.46 -0.06 454 1.00
rho -0.44 0.01 0.28 -0.89 -0.65 -0.48 -0.29 0.27 861 1.00
sigma[1] 2.00 0.07 1.57 0.24 0.90 1.57 2.68 6.32 443 1.01
sigma[2] 2.92 0.16 3.12 0.23 1.00 2.01 3.75 11.04 362 1.00
Samples were drawn using NUTS(diag_e) at Sun Nov 16 15:38:12 2025.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
mcmc_areas(
as.matrix(fit_vanilla_corr),
pars = "rho",
prob = 0.95
)
The posterior of \(\rho\) includes 0, so no clear evidence for a correlation.
predictive checks
We compare the behavior of data sampled from the prior, and data sampled from the posterior, to the observed data.
First check: proportion of 1s in the \(x\)-variable
predictive_check_marginal_binary <- function(x_prior, x_rep, x_obs,
varname = "x", bins = 30) {
library(ggplot2)
library(dplyr)
# Compute posterior and prior predictive means (per draw)
post_prop <- rowMeans(x_rep)
prior_prop <- rowMeans(x_prior)
obs_prop <- mean(x_obs)
# Combine into tidy data frame
df <- data.frame(
prop = c(prior_prop, post_prop),
source = rep(c("prior", "posterior"), times = c(length(prior_prop), length(post_prop)))
)
# Plot
ggplot(df, aes(x = prop, fill = source)) +
geom_histogram(position = "identity", alpha = 0.4, bins = bins) +
geom_vline(xintercept = obs_prop, color = "red", linetype = "dashed", linewidth = 1) +
labs(
title = paste("Sanity Check:", varname, "~ Bernoulli"),
x = paste("Proportion of", varname, "= 1 (per draw)"),
y = "Frequency"
) +
scale_fill_manual(values = c("prior" = "skyblue", "posterior" = "seagreen")) +
theme_minimal(base_size = 14)
}
post_vanilla_corr <- rstan::extract(fit_vanilla_corr)
predictive_check_marginal_binary(
x_prior = post_vanilla_corr$x_prior,
x_rep = post_vanilla_corr$x_rep,
x_obs = stan_data$x,
varname = "x"
)
Second check: the conditional probabilities \(P(y=1 \mid x_{\text{observed}}=0)\) and \(P(y=1 \mid x_{\text{observed}}=1)\)
predictive_checks_regression(post_vanilla_corr$y_prior, post_vanilla_corr$y_rep, stan_data)
For both checks, the prior is close to a uniform distribution, and the posterior is peaked around the empirical values. This is as it should be.
model comparison
We compute the log marginal likelihood for this model for later use.
system.time(
capture.output({
log_marginal_likelihood_vanilla_corr <- bridge_sampler(fit_vanilla_corr)
}, file = "/dev/null")
)
print(log_marginal_likelihood_vanilla_corr)
user system elapsed
5.326 0.078 2.214
Bridge sampling estimate of the log marginal likelihood: 55.95731
Estimate obtained in 16 iteration(s) via method "normal".
A direct model comparison via Bayes factor is not possible because the regression model describes the distribution of y given x, while the correlation model describes the joint distribution of x and y. However, we can apply Pareto-smoothed leave-one-out cross-validation if we marginalize over x in the correlation model.
# extract log-likelihood matrix: draws x observations
log_lik1 <- extract_log_lik(fit_vanilla, parameter_name = "log_lik")
log_lik2 <- extract_log_lik(fit_vanilla_corr, parameter_name = "log_lik_y")
# compute LOO
loo1 <- loo(log_lik1)
loo2 <- loo(log_lik2)
loo_list <- list(
"vanilla_regression" = loo1,
"vanilla_correlation" = loo2
)
# compare
loo_compare(loo_list)
Warning message:
“Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
”
| elpd_diff | se_diff | elpd_loo | se_elpd_loo | p_loo | se_p_loo | looic | se_looic | |
|---|---|---|---|---|---|---|---|---|
| vanilla_correlation | 0.000000 | 0.00000 | -62.07471 | 2.278148 | 27.944938 | 1.2846688 | 124.1494 | 4.556297 |
| vanilla_regression | -4.047377 | 1.21631 | -66.12208 | 3.149916 | 2.011269 | 0.1372601 | 132.2442 | 6.299831 |
So the correlation model is clearly preferred.
The Ornstein-Uhlenbeck process
The OU process is a stochastic process, similar to Brownian motion. The crucial difference is that OU is a random walk on a leash. Technically, it combines a stochastic Brownian-motion component with a deterministic trajectory that converges exponentially to a mean value. It is characterized by the following transition distribution (the second argument being the standard deviation): \[
X_t \sim \mathcal N\left(x_0e^{-\lambda t} + \mu (1-e^{-\lambda t}),\; \frac{\sigma}{\sqrt{2\lambda}}\sqrt{1-e^{-2\lambda t}}\right)
\]
The parameters are:
- \(x_0\): the initial value of the process
- \(\mu\): the mean value to which the process converges
- \(\lambda\): the rate of convergence to the mean value
- \(\sigma\): the volatility of the process
- \(t\): the time at which the process is evaluated
The long-term equilibrium distribution is: \[ X_t \sim \mathcal N(\mu, \frac{\sigma}{\sqrt{2\lambda}}) \]
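This equilibrium follows from the transition distribution by letting \(t \to \infty\), which removes the dependence on the starting value:

\[
\lim_{t\to\infty} \left(x_0 e^{-\lambda t} + \mu\left(1-e^{-\lambda t}\right)\right) = \mu,
\qquad
\lim_{t\to\infty} \frac{\sigma}{\sqrt{2\lambda}}\sqrt{1-e^{-2\lambda t}} = \frac{\sigma}{\sqrt{2\lambda}}.
\]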
Logistic regression with a phylogenetic random effect
We now add a phylogenetic random effect to the logistic regression from above:
\[ \begin{align} \alpha &\sim \text{Student-t}(3, 0, 2.5)\\ \beta &\sim \mathcal N(0, 2)\\ \epsilon &\sim \mathcal N(0, \Sigma)\\ y &\sim \text{Bernoulli}(\text{logit}^{-1}(\alpha + \beta (x - \mu_x) + \epsilon)) \end{align} \]
The covariance matrix \(\Sigma\) of the phylogenetic random effect is given by
\[ \Sigma_{ij} = \frac{\sigma^2}{2\lambda} e^{-\lambda t_{ij}}, \]
where \(t_{ij}\) is the patristic distance between language \(i\) and language \(j\) in the phylogeny.
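For reference, this covariance matrix can be assembled directly in R from the patristic distances (a sketch; lambda_hat and sigma_hat are placeholder values, not estimates):

```r
# Dense OU covariance from tip-to-tip patristic distances (ape).
lambda_hat <- 0.1 # placeholder for a posterior estimate of lambda
sigma_hat <- 4    # placeholder for a posterior estimate of sigma
t_ij <- cophenetic.phylo(pruned_tree) # patristic distance matrix
Sigma <- (sigma_hat^2 / (2 * lambda_hat)) * exp(-lambda_hat * t_ij)
```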
A direct implementation of this formula in Stan is possible but inefficient. A better approach directly simulates the evolution of the latent variable \(\epsilon\) along the phylogeny. This means that \(\epsilon\) has a value not only at the tips but also at the internal nodes. The value at the root is sampled from the equilibrium distribution, and the values at the other nodes are drawn from the transition distribution above, repeated here:
\[ X_t \sim \mathcal N\left(x_0e^{-\lambda t} + \mu (1-e^{-\lambda t}),\; \frac{\sigma}{\sqrt{2\lambda}}\sqrt{1-e^{-2\lambda t}}\right). \]
\(X_t\) is now the value at the daughter node, \(x_0\) at the mother node, and \(t\) is the branch length.
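The per-branch update can be sketched in R, mirroring the recursion used in the Stan code (ou_step is a hypothetical helper, not part of the model):

```r
# One parent-to-child OU step along a branch of length len.
ou_step <- function(z_parent, mu, lambda, sigma, len) {
  decay <- exp(-lambda * len)                                # e^(-lambda t)
  s <- sigma * sqrt((1 - exp(-2 * lambda * len)) / (2 * lambda)) # sd
  mu + (z_parent - mu) * decay + s * rnorm(1)
}
```

For example, ou_step(0, mu = 0, lambda = 0.1, sigma = 4, len = 1) draws one child value given a parent value of 0.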
stan_model_code_OU_regression <- "
data {
int<lower=1> N; // number of tips with data
int<lower=1> Nnodes; // total number of nodes
int<lower=1> Nedges; // number of edges in the tree
vector<lower=0, upper=1>[N] x; // predictor at tips
array[N] int<lower=0, upper=1> y; // binary response at tips
array[Nedges, 2] int<lower=1, upper=Nnodes> edges; // parent → child
vector<lower=0>[Nedges] edge_lengths; // edge lengths
int<lower=1, upper=Nnodes> root_node; // index of root node
}
transformed data {
vector[N] Xc; // centered version of X without an intercept
real means_X; // mean of X before centering
means_X = mean(x);
Xc = x - means_X;
}
parameters {
vector[Nnodes] z_std; // standard normal reparam for latent z
real<lower=0> sigma; // OU diffusion
real<lower=0> lambda; // OU strength
real mu; // OU mean
real alpha; // intercept
real beta; // slope
}
transformed parameters {
vector[Nnodes] z;
// root node
z[root_node] = mu + (sigma / sqrt(2 * lambda)) * z_std[root_node];
// recursive evolution
for (e in 1:Nedges) {
int edge_index = Nedges - e + 1; // reverse postorder: parents before children
int parent = edges[edge_index, 1];
int child = edges[edge_index, 2];
real len = edge_lengths[edge_index];
real decay = exp(-lambda * len);
real s = sigma * sqrt(-expm1(-2 * lambda * len) / (2 * lambda));
real mn = mu + (z[parent] - mu) * decay;
z[child] = mn + s * z_std[child];
}
}
model {
// Priors
alpha ~ student_t(3, 0, 2.5);
beta ~ normal(0, 2);
sigma ~ lognormal(0, 1);
lambda ~ lognormal(0, 1);
mu ~ normal(0, 2);
z_std ~ normal(0, 1); // standard normal prior
// Likelihood
for (i in 1:N) {
// y[i] ~ bernoulli_logit(alpha + beta * Xc[i] + z[i]);
target += bernoulli_logit_lpmf(y[i] | alpha + beta * Xc[i] + z[i]);
}
}
generated quantities {
vector[N] log_lik;
array[N] int y_rep; // posterior predictive
array[N] int y_prior; // prior predictive
// Posterior predictive samples (uses fitted alpha, beta, z)
for (i in 1:N) {
real eta = alpha + beta * Xc[i] + z[i];
log_lik[i] = bernoulli_logit_lpmf(y[i] | eta);
y_rep[i] = bernoulli_logit_rng(eta);
}
// Prior predictive samples: draw alpha, beta, z_prior from prior
{
real alpha_prior = student_t_rng(3, 0, 2.5);
real beta_prior = normal_rng(0, 2);
vector[Nnodes] z_prior_std;
for (j in 1:Nnodes)
z_prior_std[j] = normal_rng(0, 1);
real mu_prior = normal_rng(0, 2);
real lambda_prior = lognormal_rng(0, 1);
real sigma_prior = lognormal_rng(0, 1);
vector[Nnodes] z_prior;
z_prior[root_node] = mu_prior + (sigma_prior / sqrt(2 * lambda_prior)) * z_prior_std[root_node];
for (e in 1:Nedges) {
int edge_index = Nedges - e + 1;
int parent = edges[edge_index, 1];
int child = edges[edge_index, 2];
real len = edge_lengths[edge_index];
real decay = exp(-lambda_prior * len);
real s = sigma_prior * sqrt(-expm1(-2 * lambda_prior * len) / (2 * lambda_prior));
real mn = mu_prior + (z_prior[parent] - mu_prior) * decay;
z_prior[child] = mn + s * z_prior_std[child];
}
for (i in 1:N) {
real eta_prior = alpha_prior + beta_prior * Xc[i] + z_prior[i];
y_prior[i] = bernoulli_logit_rng(eta_prior);
}
}
}
"
system.time(stan_model_OU_regression <- stan_model(model_code = stan_model_code_OU_regression))
user system elapsed
48.893 1.458 47.463
system.time(
fit_OU_regression <- sampling(
stan_model_OU_regression,
data = stan_data,
iter = 2000,
chains = 4,
control = list(adapt_delta = 0.95, max_treedepth = 15)
)
)
user system elapsed
37.399 0.752 14.567
print(fit_OU_regression, pars = c("mu", "sigma", "lambda", "alpha", "beta"))
Inference for Stan model: anon_model.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
mu 0.35 0.03 1.76 -3.15 -0.86 0.37 1.54 3.78 4825 1
sigma 5.00 0.10 4.12 1.07 2.46 3.89 6.18 16.41 1635 1
lambda 0.11 0.00 0.04 0.04 0.08 0.10 0.13 0.20 2172 1
alpha 0.63 0.03 2.04 -3.27 -0.65 0.57 1.83 4.93 4186 1
beta -0.77 0.03 1.63 -3.91 -1.80 -0.81 0.24 2.54 3540 1
Samples were drawn using NUTS(diag_e) at Sun Nov 16 15:39:31 2025.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
mcmc_areas(
as.matrix(fit_OU_regression),
pars = "beta",
prob = 0.95
)
predictive checks
We perform the same predictive checks as for the regression model above.
post_ou_regression <- rstan::extract(fit_OU_regression)
predictive_checks_regression(post_ou_regression$y_prior, post_ou_regression$y_rep, stan_data)
This looks perfect: the prior is essentially flat, and the posterior is peaked around the observed value.
model comparison
system.time(
capture.output({
log_marginal_likelihood_OU_regression <- bridge_sampler(fit_OU_regression)
}, file = "/dev/null")
)
print(log_marginal_likelihood_OU_regression)
user system elapsed
4.905 0.015 1.929
Bridge sampling estimate of the log marginal likelihood: 132.3693
Estimate obtained in 10 iteration(s) via method "normal".
Now we can compute the Bayes factor between the vanilla regression and the OU phylogenetic regression.
log_marginal_likelihood_OU_regression$logml - log_marginal_likelihood_vanilla_regression$logml
This difference (about 197 on the log scale) is very strong evidence in favor of the OU model.
The HPD interval for \(\beta\) is very broad and includes 0. So there is no evidence for a credible effect of the order of adnominal property words on verb position.
Brownian motion
Next I will fit a logistic regression with a Brownian-motion phylogenetic random effect.
A <- ape::vcv(pruned_tree, corr = TRUE)
d$Language_ID <- factor(d$Language_ID, levels = rownames(A))
system.time(
fit_brownian_regression <- brm(
formula = y ~ x + (1 | gr(Language_ID, cov = A)),
data = d,
family = bernoulli(link = "logit"),
prior = c(
prior(normal(0, 2), class = "b")
),
data2 = list(A = A),
control = list(adapt_delta = 0.99),
iter = 2000,
cores = 4,
chains = 4,
save_pars = save_pars(all = TRUE)
)
)
Compiling Stan program...
Start sampling
user system elapsed
66.622 2.227 58.459
summary(fit_brownian_regression)
Family: bernoulli
Links: mu = logit
Formula: y ~ x + (1 | gr(Language_ID, cov = A))
Data: d (Number of observations: 100)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Multilevel Hyperparameters:
~Language_ID (Number of levels: 100)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept) 3.83 2.64 1.45 9.95 1.00 634 1040
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept 0.55 1.42 -2.12 3.45 1.00 1992 2083
x -0.93 0.79 -2.51 0.61 1.00 4284 2740
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
mcmc_areas(
as.matrix(fit_brownian_regression),
pars = "b_x",
prob = 0.95
)
The HPD interval for the slope includes 0, so there is no credible evidence for an impact of x on y.
predictive checks
With brms, prior predictive simulations cannot be obtained from the same fit as the posterior. Instead, we have to fit the model again, this time with sample_prior = "only".
y_rep_brownian_regression <- posterior_predict(fit_brownian_regression) # matrix: draws × observations
system.time(
fit_prior <- brm(
formula = y ~ x + (1 | gr(Language_ID, cov = A)),
data = d,
family = bernoulli(),
data2 = list(A = A),
prior = c(
prior(normal(0, 2.5), class = "b")
),
sample_prior = "only",
control = list(adapt_delta = 0.95),
cores = 4, chains = 4
)
)
Compiling Stan program...
Start sampling
user system elapsed
58.697 2.215 55.196
y_prior_brownian_regression <- posterior_predict(fit_prior)
predictive_checks_regression(y_prior_brownian_regression, y_rep_brownian_regression, stan_data)
The priors and posteriors look healthy.
model comparison
system.time(
capture.output({
log_marginal_likelihood_brownian_regression <- bridge_sampler(fit_brownian_regression)
}, file = "/dev/null")
)
print(log_marginal_likelihood_brownian_regression)
user system elapsed
4.097 0.009 1.142
Bridge sampling estimate of the log marginal likelihood: -55.03535
Estimate obtained in 9 iteration(s) via method "normal".
log_marginal_likelihood_OU_regression$logml - log_marginal_likelihood_brownian_regression$logml
The OU model is favored over the Brownian-motion model by a wide margin (about 187 log units).
Next we look at PSIS-LOO cross-validation.
# extract log-likelihood matrix: draws x observations
log_lik3 <- extract_log_lik(fit_OU_regression, parameter_name = "log_lik")
log_lik4 <- log_lik(fit_brownian_regression)
# compute LOO
loo3 <- loo(log_lik3)
loo4 <- loo(log_lik4)
loo_list$OU_regression <- loo3
loo_list$brownian_regression <- loo4
# compare
loo_compare(loo_list)
Warning message:
“Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.”
Warning message:
“Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.”
| | elpd_diff | se_diff | elpd_loo | se_elpd_loo | p_loo | se_p_loo | looic | se_looic |
|---|---|---|---|---|---|---|---|---|
| OU_regression | 0.00000 | 0.000000 | -33.03397 | 3.831264 | 24.718425 | 3.2070885 | 66.06794 | 7.662528 |
| brownian_regression | -16.03540 | 2.983188 | -49.06937 | 6.343331 | 19.969692 | 2.8638241 | 98.13874 | 12.686661 |
| vanilla_correlation | -29.04074 | 3.564611 | -62.07471 | 2.278148 | 27.944938 | 1.2846688 | 124.14941 | 4.556297 |
| vanilla_regression | -33.08811 | 3.811224 | -66.12208 | 3.149916 | 2.011269 | 0.1372601 | 132.24417 | 6.299831 |
So both Bayes Factor and PSIS-LOOCV favor OU over Brownian motion, and they favor both versions with phylogenetic control over models without phylogenetic control.
So much for the various flavors of regression.
However, I find it hard to wrap my mind around what such a model actually means. It says that there is a latent variable \(\epsilon\) evolving along the tree, and that the probability of a language being verb-final depends both on this latent variable and on the order of adjective and noun.
A more realistic model, it seems to me, is one where there are
- a latent variable determining adjective-noun order, and
- a latent variable determining verb position.
These latent variables evolve simultaneously along the tree. If there is a dependency between the two observed variables resulting from diachronic processes, there is a correlation between the diachronic changes of the variables.
This can be implemented in an extension of the bivariate correlation model used above.
The model specification is as follows, where \(N\) is the number of languages, and \(t_{l_1, l_2}\) is the length of the path from the root of the tree to the most recent common ancestor of \(l_1\) and \(l_2\):
\[ \begin{align} R &\sim \text{LKJ}(2)\\ \Sigma_{(l_1, k_1), (l_2, k_2)} &= R_{k_1, k_2}\cdot \frac{\sigma_{k_1}\sigma_{k_2}}{\lambda_{k_1}+\lambda_{k_2}}\left(1-e^{-(\lambda_{k_1}+\lambda_{k_2})t_{l_1, l_2}}\right) \\ \mu_i &\sim \mathcal N(0, 2)\\ \sigma_i &\sim \text{Lognormal}(0,1)\\ \lambda_i &\sim \text{Lognormal}(0,1)\\ z &\sim \mathcal N(\text{repeat}(\mu, N), \Sigma)\\ x_i &\sim \text{Bernoulli}(\text{logit}^{-1}(z_{i,1}))\\ y_i &\sim \text{Bernoulli}(\text{logit}^{-1}(z_{i,2})) \end{align} \]
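To make the covariance formula concrete, here is a small helper (the name ou_cov and the parameter values are mine, chosen for illustration) computing a single entry of \(\Sigma\):

```r
# One entry of the OU cross-covariance matrix from the specification above:
# R12 is the correlation entry, t the shared path length from the root
ou_cov <- function(R12, sigma1, sigma2, lambda1, lambda2, t) {
  R12 * (sigma1 * sigma2 / (lambda1 + lambda2)) *
    (1 - exp(-(lambda1 + lambda2) * t))
}
ou_cov(0.5, 1, 2, 0.3, 0.7, 0)   # t = 0, no shared history: covariance 0
ou_cov(0.5, 1, 2, 0.3, 0.7, 100) # large t: ~ R12 * sigma1 * sigma2 / (lambda1 + lambda2) = 1
```

For \(t = 0\) (no shared history) the covariance vanishes, and for large \(t\) it saturates at the stationary cross-covariance, which is the hallmark of the OU process.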
stan_model_code_OU_correlation <-
"
data {
int<lower=1> N; // number of tips with data
int<lower=1> Nnodes; // total number of nodes
int<lower=1> Nedges; // number of edges in the tree
array[N] int<lower=0, upper=1> x; // first binary response
array[N] int<lower=0, upper=1> y; // second binary response
array[Nedges, 2] int<lower=1, upper=Nnodes> edges; // parent → child
vector<lower=0>[Nedges] edge_lengths; // edge lengths
int<lower=1, upper=Nnodes> root_node; // index of root node
}
parameters {
matrix[Nnodes, 2] z_std; // standard-normal latent variables
vector<lower=0>[2] sigma; // OU diffusion parameters
vector<lower=0>[2] lambda; // OU pull strengths
vector[2] mu; // OU stationary means
cholesky_factor_corr[2] L_std; // Cholesky factor of correlation matrix
}
transformed parameters {
matrix[Nnodes, 2] z; // latent values
real rho = L_std[2, 1]; // correlation coefficient
// Root node, drawn from OU stationary distribution
z[root_node] = (mu + (sigma ./ sqrt(2 * lambda)) .* (L_std * to_vector(z_std[root_node])))';
// Recursive evolution
for (e in 1:Nedges) {
int edge_index = Nedges - e + 1; // reverse order for recursion, root to tips
int parent = edges[edge_index, 1];
int child = edges[edge_index, 2];
real len = edge_lengths[edge_index];
// Vectorized decay and scale
vector[2] decay = exp(-lambda * len);
vector[2] s = sigma .* sqrt(-expm1(-2 * lambda * len) ./ (2 * lambda));
vector[2] mean = mu + (to_vector(z[parent]) - mu) .* decay;
vector[2] eps = L_std * z_std[child]';
z[child] = (mean + s .* eps)';
}
}
model {
// Priors
sigma ~ lognormal(0, 1);
lambda ~ lognormal(0, 1);
mu ~ normal(0, 2);
L_std ~ lkj_corr_cholesky(2.0);
to_vector(z_std) ~ normal(0, 1);
// Likelihood
for (i in 1:N) {
// x[i] ~ bernoulli_logit(z[i, 1]);
target += bernoulli_logit_lpmf(x[i] | z[i, 1]);
// y[i] ~ bernoulli_logit(z[i, 2]);
target += bernoulli_logit_lpmf(y[i] | z[i, 2]);
}
}
generated quantities {
vector[N] log_lik_x;
vector[N] log_lik_y;
array[N] int x_rep;
array[N] int y_rep;
array[N] int x_prior;
array[N] int y_prior;
// Posterior predictive
for (i in 1:N) {
x_rep[i] = bernoulli_logit_rng(z[i, 1]);
y_rep[i] = bernoulli_logit_rng(z[i, 2]);
log_lik_x[i] = bernoulli_logit_lpmf(x[i] | z[i, 1]);
log_lik_y[i] = bernoulli_logit_lpmf(y[i] | z[i, 2]);
}
// Prior predictive simulation
{
vector[2] mu_prior;
vector[2] sigma_prior;
vector[2] lambda_prior;
matrix[2, 2] L_prior;
matrix[Nnodes, 2] z_std_prior;
matrix[Nnodes, 2] z_prior;
for (j in 1:2) {
mu_prior[j] = normal_rng(0.0, 2.0);
sigma_prior[j] = lognormal_rng(0.0, 1.0);
lambda_prior[j] = lognormal_rng(0.0, 1.0);
}
L_prior = cholesky_decompose(lkj_corr_rng(2, 2.0));
for (i in 1:Nnodes)
z_std_prior[i] = to_row_vector(normal_rng(rep_vector(0.0, 2), rep_vector(1.0, 2)));
// Root node
z_prior[root_node] = (mu_prior + (sigma_prior ./ sqrt(2 * lambda_prior)) .* (L_prior * to_vector(z_std_prior[root_node])))';
// Recursion
for (e in 1:Nedges) {
int edge_index = Nedges - e + 1;
int parent = edges[edge_index, 1];
int child = edges[edge_index, 2];
real len = edge_lengths[edge_index];
vector[2] decay = exp(-lambda_prior * len);
vector[2] s = sigma_prior .* sqrt(-expm1(-2 * lambda_prior * len) ./ (2 * lambda_prior));
vector[2] mean = mu_prior + (to_vector(z_prior[parent]) - mu_prior) .* decay;
vector[2] eps = L_prior * z_std_prior[child]';
z_prior[child] = (mean + s .* eps)';
}
for (i in 1:N) {
x_prior[i] = bernoulli_logit_rng(z_prior[i, 1]);
y_prior[i] = bernoulli_logit_rng(z_prior[i, 2]);
}
}
// Simulate using posterior parameters but fresh noise z_std_prior
array[N] int x_replicated_noise;
array[N] int y_replicated_noise;
{
matrix[Nnodes, 2] z_std_resample;
matrix[Nnodes, 2] z_resample;
for (i in 1:Nnodes)
z_std_resample[i] = to_row_vector(normal_rng(rep_vector(0.0, 2), rep_vector(1.0, 2)));
// Root node
z_resample[root_node] = (mu + (sigma ./ sqrt(2 * lambda)) .* (L_std * to_vector(z_std_resample[root_node])))';
// Recursion
for (e in 1:Nedges) {
int edge_index = Nedges - e + 1;
int parent = edges[edge_index, 1];
int child = edges[edge_index, 2];
real len = edge_lengths[edge_index];
vector[2] decay = exp(-lambda * len);
vector[2] s = sigma .* sqrt(-expm1(-2 * lambda * len) ./ (2 * lambda));
vector[2] mean = mu + (to_vector(z_resample[parent]) - mu) .* decay;
vector[2] eps = L_std * z_std_resample[child]';
z_resample[child] = (mean + s .* eps)';
}
for (i in 1:N) {
x_replicated_noise[i] = bernoulli_logit_rng(z_resample[i, 1]);
y_replicated_noise[i] = bernoulli_logit_rng(z_resample[i, 2]);
}
}
}
"
system.time(stan_model_OU_correlation <- stan_model(model_code = stan_model_code_OU_correlation))
   user  system elapsed 
 59.206   1.520  57.629
system.time(
fit_OU_correlation <- sampling(
stan_model_OU_correlation,
data = stan_data,
iter = 10000,
thin = 10,
chains = 4,
control = list(adapt_delta = 0.95, max_treedepth = 15)
)
)
   user  system elapsed 
922.331   0.950 340.447
print(fit_OU_correlation, pars = c("mu", "sigma", "lambda", "rho"))
Inference for Stan model: anon_model.
4 chains, each with iter=10000; warmup=5000; thin=10;
post-warmup draws per chain=500, total post-warmup draws=2000.
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
mu[1] 1.75 0.03 1.09 0.02 1.05 1.61 2.37 4.21 1860 1
mu[2] 0.67 0.03 1.24 -1.77 -0.10 0.62 1.41 3.23 2083 1
sigma[1] 3.29 0.06 2.44 0.74 1.80 2.71 4.06 9.23 1857 1
sigma[2] 4.16 0.08 3.45 0.90 2.05 3.18 5.11 13.02 1898 1
lambda[1] 0.22 0.01 0.50 0.04 0.11 0.16 0.24 0.58 1901 1
lambda[2] 0.11 0.00 0.04 0.04 0.08 0.10 0.13 0.20 2087 1
rho -0.27 0.00 0.21 -0.65 -0.41 -0.28 -0.13 0.13 1793 1
Samples were drawn using NUTS(diag_e) at Sun Nov 16 13:38:17 2025.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
mcmc_areas(
as.matrix(fit_OU_correlation),
pars = "rho",
prob = 0.95
)
Since the 95% credible interval of \(\rho\) includes 0, we conclude that there is no credible evidence for a diachronic correlation between the two variables.
predictive checks
post_OU_correlation <- rstan::extract(fit_OU_correlation)
predictive_check_marginal_binary(
x_prior = post_OU_correlation$x_prior,
x_rep = post_OU_correlation$x_rep,
x_obs = stan_data$x,
varname = "x"
)
predictive_checks_regression(post_OU_correlation$y_prior, post_OU_correlation$y_rep, stan_data)
The prior and posterior simulations look okay.
It seems tempting to also use the posterior predictive simulation to estimate the equilibrium probabilities. This would be wrong, however, because the posterior predictive simulation also uses the posterior of the latent variable z. Instead, we have to keep the posterior of the other parameters but simulate z afresh, obtaining the posterior prediction for a copy of the tree that disregards the actual latent history at the nodes.
I added a block to this effect in the Stan code, producing x_replicated_noise and y_replicated_noise.
x_replicated_noise = post_OU_correlation$x_replicated_noise
y_replicated_noise = post_OU_correlation$y_replicated_noise
# Function computing the joint outcome proportions for one posterior draw
joint_freqs <- function(x, y) {
  tab <- table(factor(x, levels = 0:1), factor(y, levels = 0:1))
  as.vector(tab) / length(x) # column-major order: (0,0), (1,0), (0,1), (1,1)
}
# Apply over all 2000 posterior samples
joint_probs <- t(mapply(joint_freqs,
  split(x_replicated_noise, row(x_replicated_noise)),
  split(y_replicated_noise, row(y_replicated_noise))))
colnames(joint_probs) <- c("(0,0)", "(1,0)", "(0,1)", "(1,1)")
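A quick sanity check of joint_freqs on toy vectors (restated here so the snippet is self-contained). Note that as.vector flattens the 2×2 table in column-major order, so the middle two cells are (1,0) and (0,1):

```r
joint_freqs <- function(x, y) {
  tab <- table(factor(x, levels = 0:1), factor(y, levels = 0:1))
  as.vector(tab) / length(x) # column-major: (0,0), (1,0), (0,1), (1,1)
}
# one (0,0), one (0,1), one (1,0), two (1,1) pairs
p <- joint_freqs(c(0, 0, 1, 1, 1), c(0, 1, 0, 1, 1))
p # 0.2 0.2 0.2 0.4
```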
x_range = c(0,1)
df <- as.data.frame(joint_probs) %>%
mutate(draw = 1:n()) %>%
pivot_longer(-draw, names_to = "Combination", values_to = "Probability")
ggplot(df, aes(x = Probability, fill = Combination)) +
geom_density(alpha = 0.6) +
facet_wrap(~ Combination, scales = "fixed") + # enforce same scale
scale_x_continuous(limits = x_range) + # set common x-axis limits
theme_minimal() +
labs(title = "Equilibrium Distribution of Joint Outcomes",
x = "Estimated Probability",
y = "Density")
model comparison
Bayes factor:
system.time(
capture.output({
log_marginal_likelihood_OU_correlation <- bridge_sampler(
fit_OU_correlation, silent = TRUE, maxiter = 10000
)
}, file = "/dev/null"
)
)
print(log_marginal_likelihood_OU_correlation)
Warning message:
“logml could not be estimated within maxiter, rerunning with adjusted starting value.
Estimate might be more variable than usual.”
   user  system elapsed 
 76.032   0.116  72.267
Bridge sampling estimate of the log marginal likelihood: 264.3048
Estimate obtained in 10001 iteration(s) via method "normal".
log_marginal_likelihood_OU_correlation$logml - log_marginal_likelihood_vanilla_corr$logml
Again we find overwhelming evidence in favor of the phylogenetic model.
We can complement model comparison via Bayes Factor with Pareto-smoothed leave-one-out cross-validation.
# extract log-likelihood matrix: draws x observations
log_lik5 <- extract_log_lik(fit_OU_correlation, parameter_name = "log_lik_y")
# compute LOO
loo5 <- loo(log_lik5)
loo_list$OU_correlation <- loo5
# compare
loo_compare(loo_list)
Warning message:
“Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.”
| | elpd_diff | se_diff | elpd_loo | se_elpd_loo | p_loo | se_p_loo | looic | se_looic |
|---|---|---|---|---|---|---|---|---|
| OU_correlation | 0.00000000 | 0.000000 | -32.95478 | 3.754743 | 22.998565 | 2.9770741 | 65.90957 | 7.509485 |
| OU_regression | -0.07918904 | 1.051616 | -33.03397 | 3.831264 | 24.718425 | 3.2070885 | 66.06794 | 7.662528 |
| brownian_regression | -14.85402549 | 3.109386 | -47.80881 | 6.247695 | 19.809112 | 2.8749336 | 95.61762 | 12.495390 |
| vanilla_correlation | -29.11992435 | 3.385216 | -62.07471 | 2.278148 | 27.944938 | 1.2846688 | 124.14941 | 4.556297 |
| vanilla_regression | -33.16730118 | 3.577467 | -66.12208 | 3.149916 | 2.011269 | 0.1372601 | 132.24417 | 6.299831 |
The three phylogenetic models perform massively better than the vanilla models. Furthermore, if our goal is to predict the verb position of an unseen language from its noun-adjective order and its position in the tree, the OU regression model is superior to the correlation model.
As a nice side effect of the explicit modeling of the diachronic histories, we can plot the posterior means of the latent variables at the internal nodes, thereby visualizing the evolution of the latent variables.
posterior_df <- as_draws_df(fit_OU_correlation)
z_posterior <- posterior_df %>%
spread_draws(z[i, j]) %>%
group_by(i, j) %>%
summarise(mean_z = mean(z), .groups = "drop")
z_adjective <- z_posterior %>%
filter(j == 1) %>%
pull(mean_z)
z_vo <- z_posterior %>%
filter(j == 2) %>%
pull(mean_z)
pruned_tree$node.label <- as.character(
(length(pruned_tree$tip.label) + 1):(length(pruned_tree$tip.label) + pruned_tree$Nnode)
)
tree_plot <- ggtree(pruned_tree)
tree_info <- tree_plot$data
tree_info$node.label <- as.character(tree_info$label)
Warning message in fortify(data, ...):
“Arguments in `...` must be used.
✖ Problematic arguments:
• as.Date = as.Date
• yscale_mapping = yscale_mapping
• hang = hang
ℹ Did you misspell an argument name?”
for (i in 1:nrow(tree_info)) {
if (tree_info$isTip[i]) {
# If it's a tip, get the corresponding Glottocode
glottocode <- tree_info$label[i]
# Find the index of the Glottocode in d_ie
index <- which(pruned_tree$tip.label == glottocode)
# Assign the Affix value to the node label
tree_info$node.label[i] <- index
} else {
# If it's not a tip, copy label
tree_info$node.label[i] <- as.integer(tree_info$node.label[i])
}
}
node_data <- tibble(
node.label = as.character(1:length(z_adjective)),
z_adjective = z_adjective,
z_vo = z_vo
)
tree_data <- tree_info %>%
left_join(node_data, by = "node.label")
tree_data <- tree_data %>%
left_join(
rename(select(d, Language_ID, GB193, GB133),
label=Language_ID)
)
tree_data <- tree_data %>%
mutate(GB193 = factor(GB193)) %>%
mutate(GB133 = factor(GB133))
Joining with `by = join_by(label)`
custom_palette <- paletteer_c("scico::roma", 30)
p1 <- ggtree(pruned_tree, layout = "circular", branch.length = "none") %<+% tree_data +
geom_tree(aes(color = plogis(z_adjective)), size = 1) +
scale_color_gradientn(colors = custom_palette, name = "p(NAdj)") +
ggnewscale::new_scale_color() +
geom_tippoint(aes(color = GB193), size = 3) +
scale_color_manual(values = c("1" = custom_palette[1], "2" = custom_palette[30])) +
guides(color = guide_legend(override.aes = list(size = 4))) +
theme(
legend.position = "left",
plot.margin = margin(0, 0, 0, 0),
text = element_text(size = 18),
legend.text = element_text(size = 18)
)
Warning message in fortify(data, ...):
“Arguments in `...` must be used.
✖ Problematic arguments:
• as.Date = as.Date
• yscale_mapping = yscale_mapping
• hang = hang
ℹ Did you misspell an argument name?”
Warning message:
“Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
ℹ The deprecated feature was likely used in the ggtree package. Please report the issue at <https://github.com/YuLab-SMU/ggtree/issues>.”
p2 <- ggtree(pruned_tree, layout = "circular", branch.length = "none") %<+% tree_data +
geom_tree(aes(color = plogis(z_vo)), size = 1) +
scale_color_gradientn(colors = custom_palette, name = "p(VO)") +
ggnewscale::new_scale_color() +
geom_tippoint(aes(color = GB133), size = 3) +
scale_color_manual(values = c("0" = custom_palette[1], "1" = custom_palette[30])) +
guides(color = guide_legend(override.aes = list(size = 4))) +
theme(
plot.margin = margin(0, 0, 0, 0),
text = element_text(size = 18),
legend.text = element_text(size = 18)
)
Warning message in fortify(data, ...):
“Arguments in `...` must be used.
✖ Problematic arguments:
• as.Date = as.Date
• yscale_mapping = yscale_mapping
• hang = hang
ℹ Did you misspell an argument name?”
options(repr.plot.width = 18, repr.plot.height = 10)
plot_grid(p1, p2, ncol = 2, align = "hv", axis = "tb", rel_widths = c(1, 1))
options(repr.plot.width = 12, repr.plot.height = 8) # For base R plots in Jupyter
Another advantage of this approach (in both the regression and the correlation variant) is that it can easily be combined with family-level or macroarea-level random effects. Geographic random effects can be added just as easily.
More details can be found in my arXiv paper Computational Typology (https://arxiv.org/abs/2504.15642).
Pagel & Meade style test for correlation
The test for correlation using CTMC, as developed by Pagel & Meade (2006) and implemented in BayesTraits, can also be implemented in Stan, using Felsenstein’s pruning algorithm.
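Before the four-state model, it helps to see the machinery for a single binary character, where the CTMC transition probabilities have a closed form. A standalone sketch (the function name and rates are mine, not part of the analysis below):

```r
# Two-state CTMC with rate a (0 -> 1) and b (1 -> 0):
# closed-form transition matrix P(t) = expm(Q * t)
ctmc2_P <- function(a, b, t) {
  pi0 <- b / (a + b) # stationary probability of state 0
  pi1 <- a / (a + b) # stationary probability of state 1
  e <- exp(-(a + b) * t)
  matrix(c(pi0 + pi1 * e, pi1 - pi1 * e,
           pi0 - pi0 * e, pi1 + pi0 * e),
         nrow = 2, byrow = TRUE)
}
P <- ctmc2_P(0.3, 0.7, 2)
rowSums(P)             # both rows sum to 1
ctmc2_P(0.3, 0.7, 1e6) # long branches forget the start state: rows ~ (0.7, 0.3)
```

For the four-state models below no such closed form is used; the Stan code computes P(t) numerically with matrix_exp, and approximates the stationary distribution by running the chain for a very long time (t = 1e3), just as the long-branch limit does here.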
stan_model_code_ctmc_independent = "
data {
int<lower=1> N;
int<lower=1> Nnodes;
int<lower=1> Nedges;
array[N] int<lower=0, upper=1> x;
array[N] int<lower=0, upper=1> y;
array[Nedges, 2] int<lower=1, upper=Nnodes> edges;
vector<lower=0>[Nedges] edge_lengths;
int<lower=1> root_node;
}
parameters {
simplex[4] rel_rates;
real<lower=0> total_rate;
}
transformed parameters {
matrix[4, 4] Q = rep_matrix(0.0, 4, 4);
Q[1, 2] = total_rate * rel_rates[1];
Q[2, 1] = total_rate * rel_rates[2];
Q[1, 3] = total_rate * rel_rates[3];
Q[3, 1] = total_rate * rel_rates[4];
Q[2, 4] = total_rate * rel_rates[3];
Q[4, 2] = total_rate * rel_rates[4];
Q[3, 4] = total_rate * rel_rates[1];
Q[4, 3] = total_rate * rel_rates[2];
for (i in 1:4) {
Q[i, i] = -sum(Q[i, ]);
}
// stationary-ish distribution via long-time transition
matrix[4, 4] P = matrix_exp(Q * 1e3);
vector[4] stat_dist = to_vector(P[1]);
stat_dist /= sum(stat_dist);
matrix[Nnodes, 4] loglikelihood;
loglikelihood = rep_matrix(0.0, Nnodes, 4);
// tip likelihoods
for (i in 1:N) {
int idx = 2 * x[i] + y[i] + 1;
loglikelihood[i] = rep_row_vector(negative_infinity(), 4);
loglikelihood[i, idx] = 0;
}
// pruning recursion
for (e in 1:Nedges) {
int parent = edges[e, 1];
int child = edges[e, 2];
real t = edge_lengths[e];
matrix[4, 4] Plocal = matrix_exp(Q * t);
for (k in 1:4) {
loglikelihood[parent, k] += log_sum_exp(
to_vector(log(Plocal[k]) + loglikelihood[child])
);
}
}
real ll = log_sum_exp(loglikelihood[root_node] + log(stat_dist)');
}
model {
// priors
rel_rates ~ dirichlet(rep_vector(1.0, 4));
total_rate ~ lognormal(-1, 0.5);
target += ll;
}
generated quantities {
real total_rate_prior = lognormal_rng(-1, 0.5);
vector[4] rel_rates_prior = dirichlet_rng(rep_vector(1.0, 4));
matrix[4, 4] Qprior = rep_matrix(0.0, 4, 4);
Qprior[1, 2] = total_rate_prior * rel_rates_prior[1];
Qprior[2, 1] = total_rate_prior * rel_rates_prior[2];
Qprior[1, 3] = total_rate_prior * rel_rates_prior[3];
Qprior[3, 1] = total_rate_prior * rel_rates_prior[4];
Qprior[2, 4] = total_rate_prior * rel_rates_prior[3];
Qprior[4, 2] = total_rate_prior * rel_rates_prior[4];
Qprior[3, 4] = total_rate_prior * rel_rates_prior[1];
Qprior[4, 3] = total_rate_prior * rel_rates_prior[2];
for (i in 1:4) {
Qprior[i, i] = -sum(Qprior[i, ]);
}
matrix[4, 4] Pprior = matrix_exp(Qprior * 1e3);
vector[4] stat_dist_prior = to_vector(Pprior[1]);
stat_dist_prior /= sum(stat_dist_prior);
array[Nnodes] int<lower=1, upper=4> sim_states;
sim_states[root_node] = categorical_rng(stat_dist_prior);
// Loop backward through post-ordered edges
for (i in 1:Nedges) {
int e = Nedges - i + 1;
int parent = edges[e, 1];
int child = edges[e, 2];
real t = edge_lengths[e];
matrix[4, 4] Plocal = matrix_exp(Qprior * t);
sim_states[child] = categorical_rng(to_vector(Plocal[sim_states[parent]]));
}
array[N] int<lower=0, upper=1> x_prior;
array[N] int<lower=0, upper=1> y_prior;
for (i in 1:N) {
int state = sim_states[i] - 1;
x_prior[i] = state / 2;
y_prior[i] = state % 2;
}
// Posterior predictive for each tip
array[N] real y_log_lik; // Pointwise log-likelihood
vector[4] ll_loo;
array[N] int<lower=0, upper=1> x_rep;
array[N] int<lower=0, upper=1> y_rep;
for (tip in 1:N) {
for (s in 1:4) {
// Recompute likelihood WITHOUT this tip
matrix[Nnodes, 4] loglik_loo = rep_matrix(0.0, Nnodes, 4);
// Set tip likelihoods (excluding tip 'tip')
for (i in 1:N) {
loglik_loo[i] = rep_row_vector(negative_infinity(), 4);
if (i != tip) {
int idx = 2 * x[i] + y[i] + 1;
loglik_loo[i, idx] = 0;
} else {
loglik_loo[i, s] = 0;
}
}
// Pruning recursion (same as model block)
for (e in 1:Nedges) {
int parent = edges[e, 1];
int child = edges[e, 2];
real t = edge_lengths[e];
matrix[4, 4] Plocal = matrix_exp(Q * t);
for (k in 1:4) {
loglik_loo[parent, k] += log_sum_exp(
to_vector(log(Plocal[k]) + loglik_loo[child])
);
}
}
ll_loo[s] = log_sum_exp(loglik_loo[root_node] + log(stat_dist)');
}
int observed_state = 2 * x[tip] + y[tip] + 1;
int same_x_state = 2 * x[tip] + (1 - y[tip]) + 1;
y_log_lik[tip] = ll_loo[observed_state] - log_sum_exp(ll_loo[observed_state] , ll_loo[same_x_state]);
int s_rep = categorical_rng(softmax(ll_loo));
x_rep[tip] = (s_rep - 1) / 2;
y_rep[tip] = (s_rep - 1) % 2;
}
}
"
system.time(stan_model_ctmc_independent <- stan_model(model_code = stan_model_code_ctmc_independent))
   user  system elapsed 
 63.513   1.487  61.832
system.time(
fit_ctmc_independent <- sampling(
stan_model_ctmc_independent,
data = stan_data,
iter = 2000,
chains = 4
)
)
    user   system  elapsed 
1319.705    1.671  427.752
print(fit_ctmc_independent, pars = c("stat_dist", "rel_rates", "total_rate"))
Inference for Stan model: anon_model.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
stat_dist[1] 0.15 0 0.04 0.08 0.12 0.15 0.18 0.25 4125 1
stat_dist[2] 0.19 0 0.05 0.11 0.16 0.19 0.22 0.30 3344 1
stat_dist[3] 0.29 0 0.06 0.18 0.25 0.29 0.33 0.41 3905 1
stat_dist[4] 0.36 0 0.07 0.24 0.32 0.36 0.41 0.49 4365 1
rel_rates[1] 0.22 0 0.07 0.10 0.17 0.21 0.27 0.37 2298 1
rel_rates[2] 0.18 0 0.06 0.07 0.13 0.17 0.21 0.31 2547 1
rel_rates[3] 0.40 0 0.09 0.21 0.33 0.40 0.46 0.58 2110 1
rel_rates[4] 0.21 0 0.05 0.12 0.17 0.21 0.24 0.31 3192 1
total_rate 0.27 0 0.08 0.16 0.22 0.26 0.31 0.45 2025 1
Samples were drawn using NUTS(diag_e) at Sun Nov 16 13:47:51 2025.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
predictive checks
post_ctmc_independent <- rstan::extract(fit_ctmc_independent)
predictive_check_marginal_binary(
x_prior = post_ctmc_independent$x_prior,
x_rep = post_ctmc_independent$x_rep,
x_obs = stan_data$x,
varname = "x"
)
predictive_checks_regression(post_ctmc_independent$y_prior, post_ctmc_independent$y_rep, stan_data)
The predictive checks indicate that the prior is already quite informative (which is bad). The posterior is still shifted towards the empirical value.
system.time(
capture.output({
log_marginal_likelihood_ctmc_independent <- bridge_sampler(fit_ctmc_independent)
}, file = "/dev/null")
)
print(log_marginal_likelihood_ctmc_independent)
   user  system elapsed 
 12.846   0.018  10.964
Bridge sampling estimate of the log marginal likelihood: -106.5887
Estimate obtained in 5 iteration(s) via method "normal".
stan_model_code_ctmc_dependent <- "
data {
int<lower=1> N;
int<lower=1> Nnodes;
int<lower=1> Nedges;
array[N] int<lower=0, upper=1> x;
array[N] int<lower=0, upper=1> y;
array[Nedges, 2] int<lower=1, upper=Nnodes> edges;
vector<lower=0>[Nedges] edge_lengths;
int<lower=1> root_node;
}
parameters {
simplex[8] rel_rates;
real<lower=0> total_rate;
}
transformed parameters {
matrix[4, 4] Q = rep_matrix(0.0, 4, 4);
Q[1, 2] = total_rate * rel_rates[1];
Q[2, 1] = total_rate * rel_rates[2];
Q[1, 3] = total_rate * rel_rates[3];
Q[3, 1] = total_rate * rel_rates[4];
Q[2, 4] = total_rate * rel_rates[5];
Q[4, 2] = total_rate * rel_rates[6];
Q[3, 4] = total_rate * rel_rates[7];
Q[4, 3] = total_rate * rel_rates[8];
for (i in 1:4) {
Q[i, i] = -sum(Q[i, ]);
}
// stationary-ish distribution via long-time transition
matrix[4, 4] P = matrix_exp(Q * 1e3);
vector[4] stat_dist = to_vector(P[1]);
stat_dist /= sum(stat_dist);
matrix[Nnodes, 4] loglikelihood;
loglikelihood = rep_matrix(0.0, Nnodes, 4);
// tip likelihoods
for (i in 1:N) {
int idx = 2 * x[i] + y[i] + 1;
loglikelihood[i] = rep_row_vector(negative_infinity(), 4);
loglikelihood[i, idx] = 0;
}
// pruning recursion
for (e in 1:Nedges) {
int parent = edges[e, 1];
int child = edges[e, 2];
real t = edge_lengths[e];
matrix[4, 4] Plocal = matrix_exp(Q * t);
for (k in 1:4) {
loglikelihood[parent, k] += log_sum_exp(
to_vector(log(Plocal[k]) + loglikelihood[child])
);
}
}
real ll = log_sum_exp(loglikelihood[root_node] + log(stat_dist)');
}
model {
// priors
rel_rates ~ dirichlet(rep_vector(1.0, 8));
total_rate ~ lognormal(-1, 0.5);
target += ll;
}
generated quantities {
real total_rate_prior = lognormal_rng(-1, 0.5);
vector[8] rel_rates_prior = dirichlet_rng(rep_vector(1.0, 8));
matrix[4, 4] Qprior = rep_matrix(0.0, 4, 4);
Qprior[1, 2] = total_rate_prior * rel_rates_prior[1];
Qprior[2, 1] = total_rate_prior * rel_rates_prior[2];
Qprior[1, 3] = total_rate_prior * rel_rates_prior[3];
Qprior[3, 1] = total_rate_prior * rel_rates_prior[4];
Qprior[2, 4] = total_rate_prior * rel_rates_prior[5];
Qprior[4, 2] = total_rate_prior * rel_rates_prior[6];
Qprior[3, 4] = total_rate_prior * rel_rates_prior[7];
Qprior[4, 3] = total_rate_prior * rel_rates_prior[8];
for (i in 1:4) {
Qprior[i, i] = -sum(Qprior[i, ]);
}
matrix[4, 4] Pprior = matrix_exp(Qprior * 1e3);
vector[4] stat_dist_prior = to_vector(Pprior[1]);
stat_dist_prior /= sum(stat_dist_prior);
array[Nnodes] int<lower=1, upper=4> sim_states;
sim_states[root_node] = categorical_rng(stat_dist_prior);
// Loop backward through post-ordered edges
for (i in 1:Nedges) {
int e = Nedges - i + 1;
int parent = edges[e, 1];
int child = edges[e, 2];
real t = edge_lengths[e];
matrix[4, 4] Plocal = matrix_exp(Qprior * t);
sim_states[child] = categorical_rng(to_vector(Plocal[sim_states[parent]]));
}
array[N] int<lower=0, upper=1> x_prior;
array[N] int<lower=0, upper=1> y_prior;
for (i in 1:N) {
int state = sim_states[i] - 1;
x_prior[i] = state / 2;
y_prior[i] = state % 2;
}
// Posterior predictive for each tip
array[N] real y_log_lik; // Pointwise log-likelihood
vector[4] ll_loo;
array[N] int<lower=0, upper=1> x_rep;
array[N] int<lower=0, upper=1> y_rep;
for (tip in 1:N) {
for (s in 1:4) {
// Recompute likelihood WITHOUT this tip
matrix[Nnodes, 4] loglik_loo = rep_matrix(0.0, Nnodes, 4);
// Set tip likelihoods (excluding tip 'tip')
for (i in 1:N) {
loglik_loo[i] = rep_row_vector(negative_infinity(), 4);
if (i != tip) {
int idx = 2 * x[i] + y[i] + 1;
loglik_loo[i, idx] = 0;
} else {
loglik_loo[i, s] = 0;
}
}
// Pruning recursion (same as model block)
for (e in 1:Nedges) {
int parent = edges[e, 1];
int child = edges[e, 2];
real t = edge_lengths[e];
matrix[4, 4] Plocal = matrix_exp(Q * t);
for (k in 1:4) {
loglik_loo[parent, k] += log_sum_exp(
to_vector(log(Plocal[k]) + loglik_loo[child])
);
}
}
ll_loo[s] = log_sum_exp(loglik_loo[root_node] + log(stat_dist)');
}
int observed_state = 2 * x[tip] + y[tip] + 1;
int same_x_state = 2 * x[tip] + (1 - y[tip]) + 1;
y_log_lik[tip] = ll_loo[observed_state] - log_sum_exp(ll_loo[observed_state] , ll_loo[same_x_state]);
int s_rep = categorical_rng(softmax(ll_loo));
x_rep[tip] = (s_rep - 1) / 2;
y_rep[tip] = (s_rep - 1) % 2;
}
}
"
system.time(
stan_model_ctmc_dependent <- stan_model(model_code= stan_model_code_ctmc_dependent)
)
   user  system elapsed 
 64.618   1.462  62.203
system.time(
fit_ctmc_dependent <- sampling(
stan_model_ctmc_dependent,
data = stan_data,
iter = 2000,
chains = 4
)
)
    user   system  elapsed 
1529.162    1.601  404.243
print(fit_ctmc_dependent, pars = c("total_rate", "rel_rates", "stat_dist"))
Inference for Stan model: anon_model.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
total_rate 0.47 0 0.13 0.26 0.37 0.45 0.54 0.78 2592 1
rel_rates[1] 0.10 0 0.08 0.00 0.04 0.08 0.14 0.28 2711 1
rel_rates[2] 0.07 0 0.05 0.00 0.03 0.06 0.09 0.18 2727 1
rel_rates[3] 0.29 0 0.12 0.07 0.21 0.29 0.37 0.52 2441 1
rel_rates[4] 0.11 0 0.04 0.04 0.08 0.10 0.13 0.20 3812 1
rel_rates[5] 0.10 0 0.07 0.00 0.04 0.09 0.14 0.26 1887 1
rel_rates[6] 0.09 0 0.04 0.02 0.05 0.08 0.11 0.19 2850 1
rel_rates[7] 0.13 0 0.05 0.05 0.09 0.13 0.16 0.24 3275 1
rel_rates[8] 0.12 0 0.05 0.03 0.08 0.12 0.15 0.24 3397 1
stat_dist[1] 0.13 0 0.05 0.06 0.10 0.12 0.15 0.24 4914 1
stat_dist[2] 0.25 0 0.07 0.13 0.20 0.24 0.29 0.41 4922 1
stat_dist[3] 0.31 0 0.07 0.19 0.26 0.31 0.36 0.45 5103 1
stat_dist[4] 0.31 0 0.07 0.19 0.26 0.31 0.36 0.46 5406 1
Samples were drawn using NUTS(diag_e) at Sun Nov 16 15:30:33 2025.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
predictive checks
post_ctmc_dependent <- rstan::extract(fit_ctmc_dependent)
predictive_check_marginal_binary(
x_prior = post_ctmc_dependent$x_prior,
x_rep = post_ctmc_dependent$x_rep,
x_obs = stan_data$x,
varname = "x"
)
predictive_checks_regression(post_ctmc_dependent$y_prior, post_ctmc_dependent$y_rep, stan_data)
Here the prior is flat. Oddly, there seems to be no difference in the distributions for x=0 and x=1.
model comparison
system.time(
capture.output({
log_marginal_likelihood_ctmc_dependent <- bridge_sampler(fit_ctmc_dependent)
}, file = "/dev/null")
)
print(log_marginal_likelihood_ctmc_dependent)
   user  system elapsed 
 15.159   0.034  13.283
Bridge sampling estimate of the log marginal likelihood: -112.9313
Estimate obtained in 5 iteration(s) via method "normal".
And now the Bayes Factor:
(log_marginal_likelihood_ctmc_independent$logml - log_marginal_likelihood_ctmc_dependent$logml)
So there is evidence in favor of the independent model.
It is worth pointing out in this connection that the OU/logistic model provides a massively better fit than the CTMC model.
(log_marginal_likelihood_OU_correlation$logml - log_marginal_likelihood_ctmc_independent$logml)
Now let’s do PSIS-LOOCV for all models.
# extract log-likelihood matrix: draws x observations
log_lik6 <- extract_log_lik(fit_ctmc_independent, parameter_name = "y_log_lik")
log_lik7 <- extract_log_lik(fit_ctmc_dependent, parameter_name = "y_log_lik")
# compute LOO
loo6 <- loo(log_lik6)
loo7 <- loo(log_lik7)
loo_list$ctmc_independent <- loo6
loo_list$ctmc_dependent <- loo7
# compare
loo_compare(loo_list)
| | elpd_diff | se_diff | elpd_loo | se_elpd_loo | p_loo | se_p_loo | looic | se_looic |
|---|---|---|---|---|---|---|---|---|
| OU_correlation | 0.00000000 | 0.000000 | -32.95478 | 3.754743 | 22.998565 | 2.9770741 | 65.90957 | 7.509485 |
| OU_regression | -0.07918904 | 1.051616 | -33.03397 | 3.831264 | 24.718425 | 3.2070885 | 66.06794 | 7.662528 |
| ctmc_independent | -8.81393327 | 2.274076 | -41.76872 | 4.930736 | 1.370367 | 0.2513402 | 83.53743 | 9.861471 |
| ctmc_dependent | -9.31603837 | 2.294833 | -42.27082 | 5.097658 | 2.316448 | 0.3997178 | 84.54164 | 10.195317 |
| brownian_regression | -14.85402549 | 3.109386 | -47.80881 | 6.247695 | 19.809112 | 2.8749336 | 95.61762 | 12.495390 |
| vanilla_correlation | -29.11992435 | 3.385216 | -62.07471 | 2.278148 | 27.944938 | 1.2846688 | 124.14941 | 4.556297 |
| vanilla_regression | -33.16730118 | 3.577467 | -66.12208 | 3.149916 | 2.011269 | 0.1372601 | 132.24417 | 6.299831 |
Quite surprisingly, Pareto-smoothed leave-one-out cross-validation leads to a substantially different result than the Bayes factor comparison. It should be kept in mind that the two types of model comparison compare different things:
- The Bayes factor compares models that model both variables jointly.
- The LOOIC compares how well the models predict verb-object order if adjective-noun order is known.
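A common heuristic for reading the comparison table is to treat an elpd difference as clearly supported only when it exceeds roughly twice its standard error. A sketch based on the `loo_list` from above:

```r
# Flag models whose elpd deficit relative to the best model is large
# compared with its standard error (a rough 2-SE rule of thumb).
cmp <- loo_compare(loo_list)
tibble(
  model = rownames(cmp),
  elpd_diff = cmp[, "elpd_diff"],
  se_diff = cmp[, "se_diff"],
  clearly_worse = abs(cmp[, "elpd_diff"]) > 2 * cmp[, "se_diff"]
)
```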
marginal_loglikelihoods <- tibble(
model = names(loo_list),
type = c(
'regression',
'correlation',
'regression',
'regression',
'correlation',
'correlation',
'correlation'
),
marginal_loglikelhood = c(
log_marginal_likelihood_vanilla_regression$logml,
log_marginal_likelihood_vanilla_corr$logml,
log_marginal_likelihood_OU_regression$logml,
log_marginal_likelihood_brownian_regression$logml,
log_marginal_likelihood_OU_correlation$logml,
log_marginal_likelihood_ctmc_independent$logml,
log_marginal_likelihood_ctmc_dependent$logml
)
)
marginal_loglikelihoods %>%
filter(type == "regression") %>%
  arrange(desc(marginal_loglikelhood))
| model | type | marginal_loglikelhood |
|---|---|---|
| <chr> | <chr> | <dbl> |
| OU_regression | regression | 132.36930 |
| brownian_regression | regression | -54.97201 |
| vanilla_regression | regression | -64.79445 |
marginal_loglikelihoods %>%
filter(type == "correlation") %>%
  arrange(desc(marginal_loglikelhood))
| model | type | marginal_loglikelhood |
|---|---|---|
| <chr> | <chr> | <dbl> |
| OU_correlation | correlation | 264.30475 |
| vanilla_correlation | correlation | 55.95731 |
| ctmc_independent | correlation | -106.58872 |
| ctmc_dependent | correlation | -112.93130 |
For the features and the tree used in this case study, the models using continuous latent variables are clearly superior to the CTMC models, which assume evolution of discrete characters.
Also, the Ornstein-Uhlenbeck process provides a better fit to the data than a Brownian motion process.
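To make the comparison easier to scan, the marginal log-likelihoods can also be plotted per model. A sketch based on the `marginal_loglikelihoods` tibble from above:

```r
# Dot plot of log marginal likelihoods, grouped by model type.
marginal_loglikelihoods %>%
  ggplot(aes(
    x = marginal_loglikelhood,
    y = reorder(model, marginal_loglikelhood),
    color = type
  )) +
  geom_point(size = 4) +
  labs(x = "log marginal likelihood", y = NULL)
```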
Final thoughts
All the models discussed here use a single phylogeny. It is possible to use a sample of phylogenies to factor in phylogenetic uncertainty in this framework, but it makes the models more complex and the fitting computationally more demanding.
Essentially, what you have to do is:
- Construct a 3-dimensional `edges` array (number of edges \(\times\) 2 \(\times\) number of trees), where `edges[e, 1, k]` and `edges[e, 2, k]` are the parent and child node of edge `e` in tree number `k`. You have to make sure that all trees are in postorder.
- Construct a (number of edges \(\times\) number of trees) matrix `edge_lengths`.
- In the Stan model, loop over all edges in all trees, so the phylogenetic recursion should look like:
for (tr in 1:Ntrees) {
for (e in 1:Nedges) {
int parent = edges[e, 1, tr];
int child = edges[e, 2, tr];
real t = edge_lengths[e, tr];
...
}
}
- Accumulate each tree's log-likelihood in a vector `tree_ll` and, inside the model block, add the line
target += log_sum_exp(tree_ll) - log(Ntrees);
Together, this averages over the likelihoods of all trees in the tree sample (a mixture over trees); note that simply adding every tree's log-likelihood to `target` would multiply the likelihoods rather than average them. Since the likelihoods of all trees in the sample have to be computed, this will slow down model fitting quite a bit.
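On the R side, the `edges` array and `edge_lengths` matrix can be assembled from a tree sample roughly as follows. This is a sketch; `tree_sample` is a hypothetical `multiPhylo` object whose trees all share the same taxa and number of edges:

```r
library(ape)

# Bring every tree into postorder, as required by the recursion above.
tree_list <- lapply(tree_sample, reorder, order = "postorder")

Ntrees <- length(tree_list)
Nedges <- nrow(tree_list[[1]]$edge)

edges <- array(0L, dim = c(Nedges, 2, Ntrees))
edge_lengths <- matrix(0, nrow = Nedges, ncol = Ntrees)

for (k in seq_len(Ntrees)) {
  edges[, , k] <- tree_list[[k]]$edge             # parent/child node indices
  edge_lengths[, k] <- tree_list[[k]]$edge.length # branch lengths
}
```

These objects can then be passed to Stan alongside `Ntrees` and `Nedges` in the data list.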
References
Bouckaert, R., Redding, D., Sheehan, O., Kyritsis, T., Gray, R., Jones, K. E., & Atkinson, Q. (2022). Global language diversification is linked to socio-ecology and threat status. SocArXiv. https://osf.io/f8tr6/download
Pagel, M., & Meade, A. (2006). Bayesian analysis of correlated evolution of discrete characters by reversible-jump Markov chain Monte Carlo. The American Naturalist, 167(6), 808–825. https://doi.org/10.1086/503444