library(repr)
options(repr.plot.width=12, repr.plot.height=10)
Finding the best marginal model for each variable
Before I investigate the correlations between the variables, I want to check which phylogenetic model provides the best fit for each individual variable.
All variables are ordinal.
I will test three models for each variable:
- ordinal regression
- ordinal regression with phylogenetic random effect
- ordinal regression with phylogenetic random effect and uncorrelated random effect
For all models, Bayesian inference will be used, implemented in Stan.
preparation
In my experience, R is the best environment for working with Stan.
I start with loading relevant packages.
library(tidyverse)
library(gridExtra)
library(patchwork)
library(stringr)
library(reshape2)
library(rstan)
library(bridgesampling)
library(loo)
library(ape)
library(bayesplot)
library(shinystan)
library(coda)
library(codetools)
library(brms)
library(abind)
options(mc.cores = parallel::detectCores())
rstan_options("auto_write" = TRUE)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.4.4 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Attaching package: ‘gridExtra’
The following object is masked from ‘package:dplyr’:
combine
Attaching package: ‘reshape2’
The following object is masked from ‘package:tidyr’:
smiths
Loading required package: StanHeaders
rstan version 2.32.3 (Stan version 2.26.1)
For execution on a local, multicore CPU with excess RAM we recommend calling
options(mc.cores = parallel::detectCores()).
To avoid recompilation of unchanged Stan programs, we recommend calling
rstan_options(auto_write = TRUE)
For within-chain threading using `reduce_sum()` or `map_rect()` Stan functions,
change `threads_per_chain` option:
rstan_options(threads_per_chain = 1)
Attaching package: ‘rstan’
The following object is masked from ‘package:tidyr’:
extract
This is loo version 2.6.0
- Online documentation and vignettes at mc-stan.org/loo
- As of v2.0.0 loo defaults to 1 core but we recommend using as many as possible. Use the 'cores' argument or set options(mc.cores = NUM_CORES) for an entire session.
Attaching package: ‘loo’
The following object is masked from ‘package:rstan’:
loo
Attaching package: ‘ape’
The following object is masked from ‘package:dplyr’:
where
This is bayesplot version 1.10.0
- Online documentation and vignettes at mc-stan.org/bayesplot
- bayesplot theme set to bayesplot::theme_default()
* Does _not_ affect other ggplot2 plots
* See ?bayesplot_theme_set for details on theme setting
Loading required package: shiny
This is shinystan version 2.6.0
Attaching package: ‘coda’
The following object is masked from ‘package:rstan’:
traceplot
Loading required package: Rcpp
Loading 'brms' package (version 2.20.4). Useful instructions
can be found by typing help('brms'). A more detailed introduction
to the package is available through vignette('brms_overview').
Attaching package: ‘brms’
The following object is masked from ‘package:bayesplot’:
rhat
The following object is masked from ‘package:bridgesampling’:
bf
The following object is masked from ‘package:rstan’:
loo
The following object is masked from ‘package:stats’:
ar
karminrot <- rgb(206/255, 143/255, 137/255)
gold <- rgb(217/255, 205/255, 177/255)
anthrazit <- rgb(143/255, 143/255, 149/255)
Next, the data are loaded into a tibble called d.
d <- read_csv("../data/data_with_families.csv") %>%
distinct(glottocode, .keep_all = TRUE)
taxa <- d$glottocode
head(d)
Rows: 1291 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): soc_id, glottocode, Family
dbl (7): political_complexity, hierarchy_within, domestic_organisation, agri...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
| soc_id | glottocode | political_complexity | hierarchy_within | domestic_organisation | agricultureLevel | settlement_strategy | exogamy | crop_type | Family |
|---|---|---|---|---|---|---|---|---|---|
| <chr> | <chr> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <chr> |
| Aa1 | juho1239 | 1 | 3 | 3 | 0 | 1 | 1 | 1 | Kxa |
| Aa2 | okie1245 | 1 | 2 | 1 | 0 | 1 | 1 | 1 | Nilotic |
| Aa3 | nama1265 | 2 | 2 | 1 | 0 | 1 | 1 | 1 | Khoe_Kwadi |
| Aa4 | dama1270 | NA | NA | 2 | 0 | NA | 1 | NA | Khoe_Kwadi |
| Aa5 | bila1255 | 1 | 2 | 1 | 0 | 1 | 1 | 1 | Atlantic_Congo |
| Aa6 | sand1273 | 2 | 2 | 1 | 5 | 3 | 1 | 6 | Sandawe |
In the next step, the maximum clade credibility trees are loaded as phylo objects.
mcc_trees <- list.files("../data/mcc_trees", pattern = ".tre", full.names = TRUE)
trees <- purrr::map(mcc_trees, read.tree)
# get a vector of tip labels of all trees in trees
tips <- purrr::map(trees, function(x) x$tip.label) %>% unlist %>% unique
A brief sanity check to ensure that all tips of the trees correspond to a Glottocode in d.
if (length(setdiff(tips, taxa)) == 0 ) {
print("All tips in trees are in taxa")
} else {
print("Not all tips in trees are in taxa")
}
[1] "All tips in trees are in taxa"
All Glottocodes in d that do not occur as a tip are isolates (for the purpose of this study).
For each isolate I create a list object with the fields "tip.label" and "Nnode", to make them compatible with the phylo objects for language families. These list objects are added to the list trees.
isolates <- setdiff(taxa, tips)
# convert isolates to one-node trees and add to trees
for (i in isolates) {
tr <- list()
tr[["tip.label"]] <- i
tr[["Nnode"]] <- 0
trees[[length(trees) + 1]] <- tr
}
Just to make sure, I test whether the tips of this forest are exactly the Glottocodes in the data.
tips <- c()
for (tr in trees) {
tips <- c(tips, tr$tip.label)
}
if (all(sort(tips) == sort(taxa))) {
print("taxa and tips are identical")
} else {
print("taxa and tips are not identical")
}
[1] "taxa and tips are identical"
Now I re-order the rows in d to make sure they are in the same order as tips.
d <- d[match(tips, d$glottocode), ]
The trees in trees form a forest, i.e., a disconnected directed acyclic graph. I want to construct a representation of this graph which I can pass to Stan, consisting of
- the number of nodes N,
- an Nx2 matrix edges,
- a vector of edge lengths, edge_lengths,
- a vector of root nodes, and
- a vector of tip nodes.
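As a concrete illustration, here is what this representation might look like for a hypothetical miniature forest consisting of one two-tip tree and one isolate (all node numbers and branch lengths made up):

```r
# Hypothetical forest: nodes 1-3 form a tree (tips 1 and 2, root 3), node 4 is an isolate
N <- 4
edges <- rbind(c(3, 1), c(3, 2))   # each row: (mother node, daughter node)
edge_lengths <- c(1.0, 1.5)        # one length per edge
roots <- c(3, 4)                   # root of the tree plus the isolate
tips <- c(1, 2, 4)                 # observed taxa
```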
First I create an Nx4 matrix where each row represents one node. The columns are
- node number
- tree number
- tree-internal node number
- label
For internal nodes, the label is the empty string.
node_matrix <- c()
node_number <- 1
for (tn in 1:length(trees)) {
tr <- trees[[tn]]
tr_nodes <- length(tr$tip.label) + tr$Nnode
for (i in 1:tr_nodes) {
tl <- ""
if (i <= length(tr$tip.label)) {
tl <- tr$tip.label[i]
}
node_matrix <- rbind(node_matrix, c(node_number, tn, i, tl))
node_number <- node_number+1
}
}
I convert the matrix to a tibble.
nodes <- tibble(
node_number = as.numeric(node_matrix[,1]),
tree_number = as.numeric(node_matrix[,2]),
internal_number = as.numeric(node_matrix[,3]),
label = node_matrix[,4]
)
head(nodes)
tail(nodes)
| node_number | tree_number | internal_number | label |
|---|---|---|---|
| <dbl> | <dbl> | <dbl> | <chr> |
| 1 | 1 | 1 | kron1241 |
| 2 | 1 | 2 | tuli1249 |
| 3 | 1 | 3 | |
| 4 | 2 | 1 | zarm1239 |
| 5 | 2 | 2 | koyr1242 |
| 6 | 2 | 3 | |
| node_number | tree_number | internal_number | label |
|---|---|---|---|
| <dbl> | <dbl> | <dbl> | <chr> |
| 2231 | 179 | 1 | qawa1238 |
| 2232 | 180 | 1 | cham1315 |
| 2233 | 181 | 1 | nort2971 |
| 2234 | 182 | 1 | trum1247 |
| 2235 | 183 | 1 | sout2994 |
| 2236 | 184 | 1 | guat1253 |
Next I define a function that maps a tree number and an internal node number to the global node number.
get_global_id <- function(tree_number, local_id) which((nodes$tree_number == tree_number) & (nodes$internal_number == local_id))
Now I create a matrix of edges and, concomitantly, a vector of edge lengths.
edges <- c()
edge_lengths <- c()
for (tn in 1:length(trees)) {
tr <- trees[[tn]]
if (length(tr$edge) > 0 ) {
for (en in 1:nrow(tr$edge)) {
mother <- tr$edge[en, 1]
daughter <- tr$edge[en, 2]
mother_id <- get_global_id(tn, mother)
daughter_id <- get_global_id(tn, daughter)
el <- tr$edge.length[en]
edges <- rbind(edges, c(mother_id, daughter_id))
edge_lengths <- c(edge_lengths, el)
}
}
}
Finally I identify the root nodes and tip nodes.
roots <- c(sort(setdiff(edges[,1], edges[,2])), match(isolates, nodes$label))
tips <- which(nodes$label != "")
Next I define a function computing the post-order.
get_postorder <- function(roots, edges) {
input <- roots
output <- c()
while (length(input) > 0) {
pivot <- input[1]
daughters <- setdiff(edges[edges[,1] == pivot, 2], output)
if (length(daughters) == 0) {
input <- input[-1]
output <- c(output, pivot)
} else {
input <- c(daughters, input)
}
}
return(output)
}
Ordinal regression
For ordinal regression, the model specifications are as follows. \(K\) is the number of possible values.
\[ \begin{aligned} C_k &\sim \mathcal N(0, 5)\\ Y_i=1 &\sim \mathrm{Bernoulli}(\mathrm{logit}^{-1}(C_1))\\ Y_i=k &\sim \mathrm{Bernoulli}(\mathrm{logit}^{-1}(C_k) - \mathrm{logit}^{-1}(C_{k-1}))\\ Y_i=K &\sim \mathrm{Bernoulli}(1-\mathrm{logit}^{-1}(C_{K-1}))\\ \end{aligned} \]
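To illustrate how the thresholds translate into category probabilities, here is a minimal R sketch with made-up threshold values for \(K=5\):

```r
inv_logit <- function(x) 1 / (1 + exp(-x))
C <- c(-0.25, 1.06, 2.07, 3.18)   # hypothetical thresholds, K - 1 = 4 of them
cum_p <- c(0, inv_logit(C), 1)    # cumulative probabilities P(Y <= k)
p <- diff(cum_p)                  # category probabilities P(Y = k)
sum(p)                            # the probabilities sum to 1
```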
Variable 1: Political complexity
variable <- "political_complexity"
Y <- d %>%
select(all_of(variable)) %>%
na.omit %>%
pull
Y <- match(Y, sort(unique(Y)))
ordinal_data <- list()
ordinal_data$Ndata <- length(Y)
ordinal_data$nthres <- max(Y) -1
ordinal_data$Y <- Y
I generated Stan code for ordinal regression from brms and then did some manual adjustments.
ordinal_code <- "
data {
int<lower=1> Ndata; // total number of observations
int Y[Ndata]; // response variable
int<lower=2> nthres; // number of thresholds
}
parameters {
ordered[nthres] Intercept; // temporary thresholds for centered predictors
}
model {
for (n in 1:Ndata) {
target += ordered_logistic_lpmf(Y[n] | 0, Intercept);
}
target += normal_lpdf(Intercept | 0, 5);
}
generated quantities {
int y_prior[Ndata];
int y_posterior[Ndata];
vector[nthres] Intercept_sample;
ordered[nthres] Intercept_prior;
vector[Ndata] log_lik;
for (k in 1:nthres) {
Intercept_sample[k] = normal_rng(0, 5);
}
Intercept_prior = sort_asc(Intercept_sample);
for (n in 1:Ndata) {
y_prior[n] = ordered_logistic_rng(0, Intercept_prior);
y_posterior[n] = ordered_logistic_rng(0, Intercept);
log_lik[n] = ordered_logistic_lpmf(Y[n] | 0, Intercept);
}
}
"
model_ordinal <- stan_model(model_code=ordinal_code)
fit_ordinal <- rstan::sampling(
model_ordinal,
data=ordinal_data,
chains=4,
iter=2000,
)
print(fit_ordinal, pars=c("Intercept", "lp__"))
Inference for Stan model: anon_model.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.
mean se_mean sd 2.5% 25% 50% 75% 97.5%
Intercept[1] -0.25 0.00 0.06 -0.37 -0.29 -0.25 -0.20 -0.12
Intercept[2] 1.06 0.00 0.07 0.93 1.02 1.06 1.11 1.20
Intercept[3] 2.07 0.00 0.09 1.88 2.00 2.06 2.13 2.25
Intercept[4] 3.18 0.00 0.15 2.89 3.07 3.18 3.28 3.49
lp__ -1436.95 0.03 1.39 -1440.41 -1437.64 -1436.68 -1435.94 -1435.19
n_eff Rhat
Intercept[1] 3469 1
Intercept[2] 5482 1
Intercept[3] 4885 1
Intercept[4] 4118 1
lp__ 2333 1
Samples were drawn using NUTS(diag_e) at Sun Feb 11 15:59:38 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
(loo_ordinal <- loo(fit_ordinal))
Computed from 4000 by 1077 log-likelihood matrix
Estimate SE
elpd_loo -1428.9 21.9
p_loo 4.0 0.2
looic 2857.8 43.7
------
Monte Carlo SE of elpd_loo is 0.0.
All Pareto k estimates are good (k < 0.5).
See help('pareto-k-diagnostic') for details.
# Extract the generated quantities
generated_quantities <- extract(fit_ordinal)
Predictive checks
Following Jäger & Wahle (2021), I define three statistics to be used in prior and posterior predictive checks:
- total variance (tv)
- mean lineage-wise variance (lv)
- cross-family variance (cv)
Here are the corresponding R functions.
total_variance <- function(Y) var(Y)
lineage_wise_variance <- function(Y, variable=variable) {
out <- d %>%
filter(!is.na(d[[variable]])) %>%
mutate(Y = Y) %>%
group_by(Family) %>%
summarize(variance = var(Y, na.rm = TRUE)) %>%
summarize(mean_variance = mean(variance, na.rm = TRUE)) %>%
pull
return(out)
}
crossfamily_variance <- function(Y, variable=variable) {
out <- d %>%
filter(!is.na(d[[variable]])) %>%
mutate(Y = Y) %>%
group_by(Family) %>%
summarize(centroids = mean(Y, na.rm = TRUE)) %>%
pull(centroids) %>%
var
return(out)
}
First we get the empirical values.
Y <- ordinal_data$Y
empirical <- tibble(
tv = total_variance(Y),
lv = lineage_wise_variance(Y, variable),
cv = crossfamily_variance(Y, variable)
)
Thanks to the generated quantities block in the Stan program, fitting the model simultaneously produced prior and posterior predictive samples.
prior_pc <- tibble(
tv = apply(generated_quantities$y_prior, 1, total_variance),
lv = apply(
generated_quantities$y_prior, 1, function(x) lineage_wise_variance(x, variable)
),
cv = apply(
generated_quantities$y_prior, 1, function(x) crossfamily_variance(x, variable)
),
)
predictive_plot <- function(prior_pc) {
p1 <- prior_pc %>%
mutate(
x_jittered = 1 + rnorm(n(), mean = 0, sd = 0.01),
y_jittered = tv + rnorm(n(), mean = 0, sd = sd(prior_pc$tv)/10)
) %>%
ggplot(aes(x = x_jittered, y = tv)) +
geom_boxplot(aes(x = factor(1), y = tv), width = 0.5, fill=karminrot) +
geom_point(alpha = 0.5, size = 0.1, position = position_identity()) +
theme_grey(base_size = 16) +
theme(
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_text(size = 18),
plot.title = element_text(size = 20, hjust = 0.5)
) +
labs(title = "total variance", y = "") +
geom_point(aes(x=1, y=empirical$tv, color="empirical"), size=5, fill='red') +
guides(color = "none")
p2 <- prior_pc %>%
mutate(
x_jittered = 1 + rnorm(n(), mean = 0, sd = 0.01),
y_jittered = lv + rnorm(n(), mean = 0, sd = sd(prior_pc$lv)/10)
) %>%
ggplot(aes(x = x_jittered, y = lv)) +
geom_boxplot(aes(x = factor(1), y = lv), width = 0.5, fill=gold) +
geom_point(alpha = 0.5, size = 0.1, position = position_identity()) +
theme_grey(base_size = 16) +
theme(
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_text(size = 18),
plot.title = element_text(size = 20, hjust = 0.5)
) +
labs(title = "lineage-wise variance", y = "") +
geom_point(aes(x=1, y=empirical$lv, color="empirical"), size=5, fill='red') +
guides(color = "none")
p3 <- prior_pc %>%
mutate(
x_jittered = 1 + rnorm(n(), mean = 0, sd = 0.01),
y_jittered = cv + rnorm(n(), mean = 0, sd = sd(prior_pc$cv)/10)
) %>%
ggplot(aes(x = x_jittered, y = cv)) +
geom_boxplot(aes(x = factor(1), y = cv), width = 0.5, fill=anthrazit) +
geom_point(alpha = 0.5, size = 0.1, position = position_identity()) +
theme_grey(base_size = 16) +
theme(
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_text(size = 18),
plot.title = element_text(size = 20, hjust = 0.5)
) +
labs(title = "cross-lineage variance", y = "") +
geom_point(aes(x=1, y=empirical$cv, color="empirical"), size=5, fill='red') +
labs(color = "") +
guides(color = "none")
return(grid.arrange(p1, p2, p3, ncol=3))
}
predictive_plot(prior_pc)
We see that the prior predictive distributions cover the empirical values quite well. Here is what happens after the model is fitted:
posterior_pc <- tibble(
tv = apply(generated_quantities$y_posterior, 1, total_variance),
lv = apply(
generated_quantities$y_posterior, 1, function(x) lineage_wise_variance(x, variable)
),
cv = apply(
generated_quantities$y_posterior, 1, function(x) crossfamily_variance(x, variable)
),
)
predictive_plot(posterior_pc)
Something is seriously wrong with this model, obviously.
Setting up a phylogenetic model
Brownian motion
Next I will consider the phylogenetic Brownian motion model, combined with a logistic component. Here the basic assumptions are
- there is a continuous latent variable \(z\) which has a value at each node
- at the root nodes, \(z\) is drawn from a normal distribution \(\mathcal N(\mu, \sigma_{\mathrm{root}})\)
- once the value of \(z\) at a mother node is fixed, the value at each daughter node is drawn from \(\mathcal N(z_i, \sigma \sqrt{t_j})\), where \(z_i\) is the value at the mother node and \(t_j\) is the length of the branch leading to the daughter
- the observed values \(y_i\) at tip \(i\) are distributed as
\[ \begin{aligned} Y_i=1 &\sim \mathrm{Bernoulli}(\mathrm{logit}^{-1}(C_1-z_i))\\ Y_i=k &\sim \mathrm{Bernoulli}(\mathrm{logit}^{-1}(C_k-z_i) - \mathrm{logit}^{-1}(C_{k-1}-z_i))\\ Y_i=K &\sim \mathrm{Bernoulli}(1-\mathrm{logit}^{-1}(C_{K-1}-z_i))\\ \end{aligned} \]
The parameter \(\sigma\) can be interpreted as the rate of evolution.
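The generative process can be sketched in a few lines of R, simulating the latent variable down a toy tree in the same root-to-tip fashion as the Stan model below (node numbering and parameter values are made up):

```r
set.seed(1)
# Toy tree: tips 1 and 2, root 3
edges <- rbind(c(3, 1), c(3, 2))       # each row: (mother node, daughter node)
edge_lengths <- c(0.5, 2.0)            # branch lengths t_j
mu <- 0; sigma_root <- 1; sigma <- 1.9 # hypothetical parameter values
z <- numeric(3)
z[3] <- rnorm(1, mu, sigma_root)       # draw the root value
for (j in 1:nrow(edges)) {             # walk the edges from root to tips
  z[edges[j, 2]] <- rnorm(1, z[edges[j, 1]], sigma * sqrt(edge_lengths[j]))
}
z                                      # latent values at all three nodes
```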
defined <- which(!is.na(d[[variable]]))
undefined <- which(is.na(d[[variable]]))
po <- c()
for (i in get_postorder(roots, edges)) {
if ((i %in% edges[,2]) & !(i %in% undefined)) {
po <- c(po, i)
}
}
edge_po <- match(po, edges[,2])
phylo_data <- list()
phylo_data$Ntrees <- length(trees)
phylo_data$Nnodes <- nrow(nodes)
phylo_data$Y <- ordinal_data$Y
phylo_data$Nobservations <- length(phylo_data$Y)
phylo_data$Nedges <- length(edge_po)
phylo_data$observations <- match(pull(d[defined, "glottocode"]), nodes$label)
phylo_data$roots <- roots
phylo_data$edges <- edges[edge_po,]
phylo_data$edge_lengths <- edge_lengths[edge_po]
phylo_data$nthres <- length(unique(phylo_data$Y)) - 1
brown_code <- "
data {
int<lower=1> Ntrees;
int<lower=0> Nnodes;
int<lower=1> Nobservations;
int<lower=1> Nedges;
int<lower=1,upper=Nnodes> observations[Nobservations]; // indices of the observed tips
int<lower=1, upper=Nnodes> roots[Ntrees];
int<lower=1, upper=Nnodes> edges[Nedges, 2];
vector[Nedges] edge_lengths;
int<lower=2> nthres; // number of thresholds
int Y[Nobservations]; // response variable
}
parameters {
vector[Nnodes] z_normal;
real mu;
real<lower=0> sigma;
real<lower=0> sigma_root;
ordered[nthres] Intercept; // temporary thresholds for centered predictors
}
model {
vector[Nnodes] z = rep_vector(0, Nnodes);
target += normal_lpdf(mu | 0, 1);
target += exponential_lpdf(sigma | 1);
target += exponential_lpdf(sigma_root | 1);
target += normal_lpdf(z_normal | 0, 1);
target += normal_lpdf(Intercept | 0, 5);
for (i in 1:Ntrees) {
int idx = roots[i];
z[idx] = mu + z_normal[idx] * sigma_root;
}
for (i in 1:Nedges) {
int j = Nedges - i + 1;
int mother_node = edges[j, 1];
int daughter_node = edges[j, 2];
real local_mu = z[mother_node];
real local_sigma = sigma * sqrt(edge_lengths[j]);
z[daughter_node] = local_mu + z_normal[daughter_node] * local_sigma;
}
for (n in 1:Nobservations) {
target += ordered_logistic_lpmf(Y[n] | z[observations[n]], Intercept);
}
}
generated quantities {
vector[Nnodes] z = rep_vector(0, Nnodes);
int y_prior[Nobservations]; // Prior predictive samples
int y_posterior[Nobservations]; // Posterior predictive samples
vector[Nnodes] z_prior = rep_vector(0, Nnodes);
vector[Nnodes] z_posterior = rep_vector(0, Nnodes);
real mu_prior;
real<lower=0> sigma_prior;
real<lower=0> sigma_root_prior;
real log_lik[Nobservations];
vector[nthres] Intercept_sample;
ordered[nthres] Intercept_prior;
for (i in 1:Ntrees) {
int idx = roots[i];
z[idx] = mu + z_normal[idx] * sigma_root;
}
for (i in 1:Nedges) {
int j = Nedges - i + 1;
int mother_node = edges[j, 1];
int daughter_node = edges[j, 2];
real local_mu = z[mother_node];
real local_sigma = sigma * sqrt(edge_lengths[j]);
z[daughter_node] = local_mu + z_normal[daughter_node] * local_sigma;
}
for (k in 1:nthres) {
Intercept_sample[k] = normal_rng(0, 5);
}
Intercept_prior = sort_asc(Intercept_sample);
mu_prior = normal_rng(0, 10);
sigma_prior = exponential_rng(1);
sigma_root_prior = exponential_rng(1);
for (i in 1:Ntrees) {
z_prior[roots[i]] = normal_rng(mu_prior, sigma_root_prior);
}
for (i in 1:Ntrees) {
int idx = roots[i];
z_posterior[idx] = mu + z_normal[idx] * sigma_root;
}
for (i in 1:Nedges) {
int j = Nedges - i + 1;
int mother_node = edges[j, 1];
int daughter_node = edges[j, 2];
real local_mu = z_posterior[mother_node];
real local_sigma = sigma * sqrt(edge_lengths[j]);
z_prior[daughter_node] = normal_rng(
z_prior[mother_node],
sigma_prior * sqrt(edge_lengths[j])
);
z_posterior[daughter_node] = local_mu + z_normal[daughter_node] * local_sigma;
}
for (n in 1:Nobservations) {
y_prior[n] = ordered_logistic_rng(z_prior[observations[n]], Intercept_prior);
y_posterior[n] = ordered_logistic_rng(z[observations[n]], Intercept);
log_lik[n] = ordered_logistic_lpmf(Y[n] | z[observations[n]], Intercept);
}
}
"
model_brownian <- stan_model(model_code=brown_code)
fit_brownian <- rstan::sampling(
model_brownian,
data = phylo_data,
chains = 4,
iter=2000
)
print(fit_brownian, pars=c("mu", "sigma", "sigma_root", "Intercept", "lp__"))
Inference for Stan model: anon_model.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.
mean se_mean sd 2.5% 25% 50% 75%
mu -1.46 0.01 0.28 -2.02 -1.64 -1.45 -1.26
sigma 1.90 0.01 0.27 1.38 1.72 1.89 2.07
sigma_root 1.90 0.01 0.27 1.42 1.72 1.90 2.07
Intercept[1] -0.65 0.00 0.15 -0.95 -0.75 -0.65 -0.55
Intercept[2] 1.32 0.00 0.16 1.01 1.21 1.32 1.43
Intercept[3] 2.72 0.00 0.20 2.33 2.58 2.72 2.85
Intercept[4] 4.07 0.00 0.25 3.60 3.89 4.06 4.24
lp__ -4319.19 1.47 41.59 -4404.31 -4345.90 -4317.98 -4290.17
97.5% n_eff Rhat
mu -0.94 1581 1.00
sigma 2.43 743 1.01
sigma_root 2.47 1203 1.00
Intercept[1] -0.34 3592 1.00
Intercept[2] 1.65 5223 1.00
Intercept[3] 3.13 3012 1.00
Intercept[4] 4.58 2774 1.00
lp__ -4239.83 800 1.00
Samples were drawn using NUTS(diag_e) at Sun Feb 11 16:04:29 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
log_lik <- extract_log_lik(
fit_brownian,
parameter_name = "log_lik",
merge_chains = FALSE
)
reff <- relative_eff(exp(log_lik))
(loo_brownian <- loo(log_lik, r_eff=reff))
Warning message:
“Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
”
Computed from 4000 by 1077 log-likelihood matrix
Estimate SE
elpd_loo -1261.6 25.4
p_loo 209.2 6.9
looic 2523.2 50.8
------
Monte Carlo SE of elpd_loo is NA.
Pareto k diagnostic values:
Count Pct. Min. n_eff
(-Inf, 0.5] (good) 963 89.4% 350
(0.5, 0.7] (ok) 104 9.7% 174
(0.7, 1] (bad) 10 0.9% 72
(1, Inf) (very bad) 0 0.0% <NA>
See help('pareto-k-diagnostic') for details.
generated_quantities <- extract(fit_brownian)
prior predictive check
prior_pc <- tibble(
tv = apply(generated_quantities$y_prior, 1, total_variance),
lv = apply(
generated_quantities$y_prior, 1, function(x) lineage_wise_variance(x, variable)
),
cv = apply(
generated_quantities$y_prior, 1, function(x) crossfamily_variance(x, variable)
),
)
predictive_plot(prior_pc)
posterior predictive check
posterior_pc <- tibble(
tv = apply(generated_quantities$y_posterior, 1, total_variance),
lv = apply(
generated_quantities$y_posterior, 1, function(x) lineage_wise_variance(x, variable)
),
cv = apply(
generated_quantities$y_posterior, 1, function(x) crossfamily_variance(x, variable)
),
)
predictive_plot(posterior_pc)
Mixed model
Next I will fit a model with two random effects, one phylogenetic and one uncorrelated. The variable \(z\) is as in the Brownian motion model. Additionally, there is a random effect variable \(\epsilon\), drawn from a normal distribution with mean 0 and standard deviation \(\sigma_\epsilon\).
\[ \begin{aligned} \sigma_\epsilon^2 &\sim \text{Inverse\_Gamma}(2, 0.1)\\ \epsilon_i &\sim \mathcal N(0, \sigma_\epsilon)\\ Y_i=1 &\sim \mathrm{Bernoulli}(\mathrm{logit}^{-1}(C_1-z_i-\epsilon_i))\\ Y_i=k &\sim \mathrm{Bernoulli}(\mathrm{logit}^{-1}(C_k-z_i-\epsilon_i) - \mathrm{logit}^{-1}(C_{k-1}-z_i-\epsilon_i))\\ Y_i=K &\sim \mathrm{Bernoulli}(1-\mathrm{logit}^{-1}(C_{K-1}-z_i-\epsilon_i))\\ \end{aligned} \]
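As a quick sanity check of this prior, \(\sigma_\epsilon\) can be drawn in plain R by exploiting the fact that the reciprocal of a Gamma(a, rate = b) variate is Inverse-Gamma(a, scale = b); this matches the inv_gamma(2, 0.1) prior on \(\sigma_\epsilon^2\) used in the Stan model block below:

```r
set.seed(1)
# sigma_error^2 ~ Inverse-Gamma(2, 0.1), drawn as 1 / Gamma(2, rate = 0.1)
sigma_error_sq <- 1 / rgamma(10000, shape = 2, rate = 0.1)
sigma_error <- sqrt(sigma_error_sq)
quantile(sigma_error, c(0.025, 0.5, 0.975))  # spread of the implied prior on sigma_error
```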
mixed_code <- "
data {
int<lower=1> Ntrees;
int<lower=0> Nnodes;
int<lower=1> Nobservations;
int<lower=1> Nedges;
int<lower=1,upper=Nnodes> observations[Nobservations]; // indices of the observed tips
int<lower=1, upper=Nnodes> roots[Ntrees];
int<lower=1, upper=Nnodes> edges[Nedges, 2];
vector[Nedges] edge_lengths;
int<lower=2> nthres; // number of thresholds
int Y[Nobservations]; // response variable
}
parameters {
vector[Nnodes] z_normal;
real mu;
real<lower=0> sigma;
real<lower=0> sigma_root;
ordered[nthres] Intercept; // temporary thresholds for centered predictors
real<lower=0> sigma_error;
vector[Nobservations] z_error;
}
model {
vector[Nnodes] z = rep_vector(0, Nnodes);
target += normal_lpdf(mu | 0, 10);
target += exponential_lpdf(sigma | 1);
target += exponential_lpdf(sigma_root | 1);
target += normal_lpdf(z_normal | 0, 1);
target += normal_lpdf(Intercept | 0, 5);
target += inv_gamma_lpdf(square(sigma_error) | 2, 0.1);
target += normal_lpdf(z_error | 0, sigma_error);
for (i in 1:Ntrees) {
int idx = roots[i];
z[idx] = mu + z_normal[idx] * sigma_root;
}
for (i in 1:Nedges) {
int j = Nedges - i + 1;
int mother_node = edges[j, 1];
int daughter_node = edges[j, 2];
real local_mu = z[mother_node];
real local_sigma = sigma * sqrt(edge_lengths[j]);
z[daughter_node] = local_mu + z_normal[daughter_node] * local_sigma;
}
for (n in 1:Nobservations) {
target += ordered_logistic_lpmf(Y[n] | z[observations[n]] + z_error[n], Intercept);
}
}
generated quantities {
vector[Nnodes] z = rep_vector(0, Nnodes);
int y_prior[Nobservations]; // Prior predictive samples
int y_posterior[Nobservations]; // Posterior predictive samples
vector[Nnodes] z_prior = rep_vector(0, Nnodes);
vector[Nnodes] z_posterior = rep_vector(0, Nnodes);
real mu_prior;
real<lower=0> sigma_prior;
real<lower=0> sigma_root_prior;
real log_lik[Nobservations];
vector[nthres] Intercept_sample;
ordered[nthres] Intercept_prior;
vector[Nobservations] z_error_prior;
real sigma_error_sq = 1 / gamma_rng(2, 0.1); // 1/Gamma(a, rate b) is Inverse-Gamma(a, scale b), matching inv_gamma(2, 0.1) above
real sigma_error_prior = sqrt(sigma_error_sq);
for (i in 1:Ntrees) {
int idx = roots[i];
z[idx] = mu + z_normal[idx] * sigma_root;
}
for (n in 1:Nobservations) {
z_error_prior[n] = normal_rng(0, sigma_error_prior);
}
for (i in 1:Nedges) {
int j = Nedges - i + 1;
int mother_node = edges[j, 1];
int daughter_node = edges[j, 2];
real local_mu = z[mother_node];
real local_sigma = sigma * sqrt(edge_lengths[j]);
z[daughter_node] = local_mu + z_normal[daughter_node] * local_sigma;
}
for (k in 1:nthres) {
Intercept_sample[k] = normal_rng(0, 5);
}
Intercept_prior = sort_asc(Intercept_sample);
mu_prior = normal_rng(0, 10);
sigma_prior = exponential_rng(1);
sigma_root_prior = exponential_rng(1);
for (i in 1:Ntrees) {
z_prior[roots[i]] = normal_rng(mu_prior, sigma_root_prior);
}
for (i in 1:Ntrees) {
int idx = roots[i];
z_posterior[idx] = mu + z_normal[idx] * sigma_root;
}
for (i in 1:Nedges) {
int j = Nedges - i + 1;
int mother_node = edges[j, 1];
int daughter_node = edges[j, 2];
real local_mu = z_posterior[mother_node];
real local_sigma = sigma * sqrt(edge_lengths[j]);
z_prior[daughter_node] = normal_rng(
z_prior[mother_node],
sigma_prior * sqrt(edge_lengths[j])
);
z_posterior[daughter_node] = local_mu + z_normal[daughter_node] * local_sigma;
}
for (n in 1:Nobservations) {
y_prior[n] = ordered_logistic_rng(z_prior[observations[n]] + z_error_prior[n], Intercept_prior);
y_posterior[n] = ordered_logistic_rng(z[observations[n]] + z_error[n], Intercept);
log_lik[n] = ordered_logistic_lpmf(Y[n] | z[observations[n]] + z_error[n], Intercept);
}
}
"
model_mixed <- stan_model(model_code=mixed_code)
## fit_mixed1 <- readRDS("models/fit_mixed1.rds")
fit_mixed1 <- rstan::sampling(
model_mixed,
data = phylo_data,
chains = 4,
iter=40000,
thin=20,
control=list(max_treedepth=15)
)
Warning message:
“There were 4 chains where the estimated Bayesian Fraction of Missing Information was low. See
https://mc-stan.org/misc/warnings.html#bfmi-low”
Warning message:
“Examine the pairs() plot to diagnose sampling problems
”
Warning message:
“Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
Running the chains for more iterations may help. See
https://mc-stan.org/misc/warnings.html#bulk-ess”
Warning message:
“Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
Running the chains for more iterations may help. See
https://mc-stan.org/misc/warnings.html#tail-ess”
saveRDS(fit_mixed1, "models/fit_mixed1.rds")
print(fit_mixed1, pars=c("mu", "sigma", "sigma_root", "Intercept", "sigma_error", "lp__"))
Inference for Stan model: anon_model.
4 chains, each with iter=40000; warmup=20000; thin=20;
post-warmup draws per chain=1000, total post-warmup draws=4000.
mean se_mean sd 2.5% 25% 50% 75%
mu -1.61 0.01 0.31 -2.25 -1.81 -1.60 -1.39
sigma 1.93 0.00 0.28 1.42 1.73 1.93 2.12
sigma_root 1.98 0.00 0.28 1.48 1.78 1.96 2.16
Intercept[1] -0.70 0.00 0.16 -1.01 -0.80 -0.70 -0.59
Intercept[2] 1.31 0.00 0.17 0.99 1.19 1.31 1.42
Intercept[3] 2.72 0.00 0.21 2.34 2.58 2.72 2.86
Intercept[4] 4.08 0.00 0.26 3.59 3.90 4.08 4.25
sigma_error 0.22 0.01 0.09 0.13 0.17 0.20 0.25
lp__ -4158.44 22.95 340.75 -4931.29 -4346.85 -4120.79 -3926.49
97.5% n_eff Rhat
mu -1.03 3705 1.00
sigma 2.51 3930 1.00
sigma_root 2.59 3823 1.00
Intercept[1] -0.38 3815 1.00
Intercept[2] 1.65 3938 1.00
Intercept[3] 3.13 3395 1.00
Intercept[4] 4.59 3134 1.00
sigma_error 0.43 188 1.01
lp__ -3608.28 220 1.01
Samples were drawn using NUTS(diag_e) at Sun Feb 11 17:10:02 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
log_lik <- extract_log_lik(
fit_mixed1,
parameter_name = "log_lik",
merge_chains = FALSE
)
reff <- relative_eff(exp(log_lik))
(loo_mixed <- loo(log_lik, r_eff=reff))
Warning message:
“Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
”
Computed from 4000 by 1077 log-likelihood matrix
Estimate SE
elpd_loo -1260.7 25.4
p_loo 222.6 7.2
looic 2521.3 50.9
------
Monte Carlo SE of elpd_loo is NA.
Pareto k diagnostic values:
Count Pct. Min. n_eff
(-Inf, 0.5] (good) 952 88.4% 423
(0.5, 0.7] (ok) 117 10.9% 239
(0.7, 1] (bad) 8 0.7% 49
(1, Inf) (very bad) 0 0.0% <NA>
See help('pareto-k-diagnostic') for details.
generated_quantities <- extract(fit_mixed1)
prior predictive check
prior_pc <- tibble(
tv = apply(generated_quantities$y_prior, 1, total_variance),
lv = apply(
generated_quantities$y_prior, 1, function(x) lineage_wise_variance(x, variable)
),
cv = apply(
generated_quantities$y_prior, 1, function(x) crossfamily_variance(x, variable)
),
)
predictive_plot(prior_pc)
posterior predictive check
posterior_pc <- tibble(
tv = apply(generated_quantities$y_posterior, 1, total_variance),
lv = apply(
generated_quantities$y_posterior, 1, function(x) lineage_wise_variance(x, variable)
),
cv = apply(
generated_quantities$y_posterior, 1, function(x) crossfamily_variance(x, variable)
),
)
predictive_plot(posterior_pc)
model comparison
Let us summarize the model comparisons. The looic differences below provide decisive evidence in favor of the phylogenetic models over the plain ordinal model.
tibble(model = c("ordinal", "brownian", "mixed"),
looic = c(
loo_ordinal$estimates[3,1],
loo_brownian$estimates[3,1],
loo_mixed$estimates[3,1]
)
) %>%
mutate(delta_looic = round(looic-min(looic), 3)) %>%
arrange(delta_looic)
| model | looic | delta_looic |
|---|---|---|
| <chr> | <dbl> | <dbl> |
| mixed | 2521.317 | 0.000 |
| brownian | 2523.228 | 1.911 |
| ordinal | 2857.780 | 336.463 |
We can conclude that phylogenetic control massively improves the fit to the data.
Variable 2: hierarchy_within
variable <- "hierarchy_within"
Y <- d %>%
select(all_of(variable)) %>%
na.omit %>%
pull
Y <- match(Y, sort(unique(Y)))
ordinal_data <- list()
ordinal_data$Ndata <- length(Y)
ordinal_data$nthres <- max(Y) - 1
ordinal_data$Y <- Y
fit_ordinal <- rstan::sampling(
model_ordinal,
data=ordinal_data,
chains=4,
iter=2000,
)
print(fit_ordinal, pars=c("Intercept", "lp__"))
Inference for Stan model: anon_model.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.
mean se_mean sd 2.5% 25% 50% 75% 97.5%
Intercept[1] -0.86 0.00 0.07 -0.99 -0.9 -0.85 -0.81 -0.72
Intercept[2] 1.86 0.00 0.09 1.69 1.8 1.86 1.92 2.04
lp__ -1043.10 0.03 1.04 -1045.95 -1043.5 -1042.78 -1042.36 -1042.08
n_eff Rhat
Intercept[1] 1730 1
Intercept[2] 3359 1
lp__ 1529 1
Samples were drawn using NUTS(diag_e) at Sun Feb 11 17:15:35 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
(loo_ordinal <- loo(fit_ordinal))
Computed from 4000 by 1089 log-likelihood matrix
Estimate SE
elpd_loo -1040.0 16.7
p_loo 2.1 0.1
looic 2080.0 33.3
------
Monte Carlo SE of elpd_loo is 0.0.
All Pareto k estimates are good (k < 0.5).
See help('pareto-k-diagnostic') for details.
# Extract the generated quantities
generated_quantities <- extract(fit_ordinal)
Y <- ordinal_data$Y
empirical <- tibble(
tv = total_variance(Y),
lv = lineage_wise_variance(Y, variable),
cv = crossfamily_variance(Y, variable)
)
prior_pc <- tibble(
tv = apply(generated_quantities$y_prior, 1, total_variance),
lv = apply(
generated_quantities$y_prior, 1, function(x) lineage_wise_variance(x, variable)
),
cv = apply(
generated_quantities$y_prior, 1, function(x) crossfamily_variance(x, variable)
),
)
predictive_plot(prior_pc)
posterior_pc <- tibble(
tv = apply(generated_quantities$y_posterior, 1, total_variance),
lv = apply(
generated_quantities$y_posterior, 1, function(x) lineage_wise_variance(x, variable)
),
cv = apply(
generated_quantities$y_posterior, 1, function(x) crossfamily_variance(x, variable)
),
)
predictive_plot(posterior_pc)
defined <- which(!is.na(d[[variable]]))
undefined <- which(is.na(d[[variable]]))
po <- c()
for (i in get_postorder(roots, edges)) {
if ((i %in% edges[,2]) && !(i %in% undefined)) {
po <- c(po, i)
}
}
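As an aside, growing `po` with `c()` inside the loop is quadratic in the number of nodes; the same post-order filter can be vectorized. A sketch with toy values (in the notebook the real inputs would come from `get_postorder(roots, edges)`, `edges[,2]`, and `undefined`):

```r
# Toy stand-ins for the notebook's objects (hypothetical values):
postorder <- c(5L, 3L, 8L, 2L, 7L)   # get_postorder(roots, edges)
child_nodes <- c(3L, 2L, 7L, 5L)     # edges[, 2]
undefined <- 2L                      # indices with missing data

# Keep post-order nodes that are edge children and have observed data:
po <- postorder[postorder %in% child_nodes & !(postorder %in% undefined)]
po  # 5 3 7
```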
edge_po <- match(po, edges[,2])
phylo_data <- list()
phylo_data$Ntrees <- length(trees)
phylo_data$Nnodes <- nrow(nodes)
phylo_data$Y <- ordinal_data$Y
phylo_data$Nobservations <- length(phylo_data$Y)
phylo_data$Nedges <- length(edge_po)
phylo_data$observations <- match(pull(d[defined, "glottocode"]), nodes$label)
phylo_data$roots <- roots
phylo_data$edges <- edges[edge_po,]
phylo_data$edge_lengths <- edge_lengths[edge_po]
phylo_data$nthres <- length(unique(phylo_data$Y)) - 1
fit_brownian <- rstan::sampling(
model_brownian,
data = phylo_data,
chains = 4,
iter=2000
)
print(fit_brownian, pars=c("mu", "sigma", "sigma_root", "Intercept", "lp__"))
Inference for Stan model: anon_model.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.
mean se_mean sd 2.5% 25% 50% 75%
mu 0.38 0.00 0.22 -0.05 0.24 0.38 0.53
sigma 1.95 0.01 0.29 1.42 1.74 1.94 2.13
sigma_root 1.14 0.01 0.21 0.75 1.00 1.13 1.28
Intercept[1] -0.63 0.00 0.15 -0.93 -0.73 -0.63 -0.53
Intercept[2] 3.15 0.01 0.23 2.69 2.99 3.14 3.30
lp__ -3996.47 1.49 41.60 -4077.98 -4024.97 -3997.11 -3969.64
97.5% n_eff Rhat
mu 0.82 2841 1
sigma 2.57 755 1
sigma_root 1.57 875 1
Intercept[1] -0.33 3752 1
Intercept[2] 3.61 1494 1
lp__ -3911.79 781 1
Samples were drawn using NUTS(diag_e) at Sun Feb 11 17:20:01 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
(loo_brownian <- loo(fit_brownian))
Warning message:
“Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
”
Computed from 4000 by 1089 log-likelihood matrix
Estimate SE
elpd_loo -937.2 18.3
p_loo 201.7 6.7
looic 1874.4 36.6
------
Monte Carlo SE of elpd_loo is NA.
Pareto k diagnostic values:
Count Pct. Min. n_eff
(-Inf, 0.5] (good) 1023 93.9% 848
(0.5, 0.7] (ok) 63 5.8% 297
(0.7, 1] (bad) 3 0.3% 92
(1, Inf) (very bad) 0 0.0% <NA>
See help('pareto-k-diagnostic') for details.
# Extract the generated quantities
generated_quantities <- extract(fit_brownian)
Y <- ordinal_data$Y
empirical <- tibble(
tv = total_variance(Y),
lv = lineage_wise_variance(Y, variable),
cv = crossfamily_variance(Y, variable)
)
prior_pc <- tibble(
tv = apply(generated_quantities$y_prior, 1, total_variance),
lv = apply(
generated_quantities$y_prior, 1, function(x) lineage_wise_variance(x, variable)
),
cv = apply(
generated_quantities$y_prior, 1, function(x) crossfamily_variance(x, variable)
),
)
predictive_plot(prior_pc)
posterior_pc <- tibble(
tv = apply(generated_quantities$y_posterior, 1, total_variance),
lv = apply(
generated_quantities$y_posterior, 1, function(x) lineage_wise_variance(x, variable)
),
cv = apply(
generated_quantities$y_posterior, 1, function(x) crossfamily_variance(x, variable)
),
)
predictive_plot(posterior_pc)
# fit_mixed2 <- readRDS("models/fit_mixed2.rds")
fit_mixed2 <- rstan::sampling(
model_mixed,
data = phylo_data,
chains = 4,
iter=40000,
thin=20,
control=list(max_treedepth=15)
)
Warning message:
“There were 4 chains where the estimated Bayesian Fraction of Missing Information was low. See
https://mc-stan.org/misc/warnings.html#bfmi-low”
Warning message:
“Examine the pairs() plot to diagnose sampling problems
”
Warning message:
“Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
Running the chains for more iterations may help. See
https://mc-stan.org/misc/warnings.html#bulk-ess”
Warning message:
“Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
Running the chains for more iterations may help. See
https://mc-stan.org/misc/warnings.html#tail-ess”
saveRDS(fit_mixed2, "models/fit_mixed2.rds")
print(fit_mixed2, pars=c("mu", "sigma", "sigma_root", "Intercept", "sigma_error", "lp__"))
Inference for Stan model: anon_model.
4 chains, each with iter=40000; warmup=20000; thin=20;
post-warmup draws per chain=1000, total post-warmup draws=4000.
mean se_mean sd 2.5% 25% 50% 75%
mu 0.41 0.00 0.23 -0.03 0.25 0.41 0.56
sigma 1.96 0.00 0.28 1.45 1.77 1.96 2.14
sigma_root 1.16 0.00 0.21 0.77 1.01 1.15 1.30
Intercept[1] -0.63 0.00 0.15 -0.93 -0.74 -0.63 -0.53
Intercept[2] 3.18 0.00 0.23 2.75 3.02 3.18 3.33
sigma_error 0.23 0.01 0.08 0.13 0.18 0.22 0.27
lp__ -3900.68 25.61 344.83 -4637.51 -4108.79 -3876.92 -3654.71
97.5% n_eff Rhat
mu 0.85 4120 1.00
sigma 2.55 3891 1.00
sigma_root 1.60 4123 1.00
Intercept[1] -0.33 4113 1.00
Intercept[2] 3.67 4076 1.00
sigma_error 0.44 181 1.03
lp__ -3311.09 181 1.03
Samples were drawn using NUTS(diag_e) at Sun Feb 11 18:23:05 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
(loo_mixed <- loo(fit_mixed2))
Warning message:
“Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
”
Computed from 4000 by 1089 log-likelihood matrix
Estimate SE
elpd_loo -936.6 18.3
p_loo 211.7 6.9
looic 1873.3 36.6
------
Monte Carlo SE of elpd_loo is NA.
Pareto k diagnostic values:
Count Pct. Min. n_eff
(-Inf, 0.5] (good) 1027 94.3% 383
(0.5, 0.7] (ok) 59 5.4% 220
(0.7, 1] (bad) 3 0.3% 194
(1, Inf) (very bad) 0 0.0% <NA>
See help('pareto-k-diagnostic') for details.
# Extract the generated quantities
generated_quantities <- extract(fit_mixed2)
Y <- ordinal_data$Y
empirical <- tibble(
tv = total_variance(Y),
lv = lineage_wise_variance(Y, variable),
cv = crossfamily_variance(Y, variable)
)
prior_pc <- tibble(
tv = apply(generated_quantities$y_prior, 1, total_variance),
lv = apply(
generated_quantities$y_prior, 1, function(x) lineage_wise_variance(x, variable)
),
cv = apply(
generated_quantities$y_prior, 1, function(x) crossfamily_variance(x, variable)
),
)
predictive_plot(prior_pc)
posterior_pc <- tibble(
tv = apply(generated_quantities$y_posterior, 1, total_variance),
lv = apply(
generated_quantities$y_posterior, 1, function(x) lineage_wise_variance(x, variable)
),
cv = apply(
generated_quantities$y_posterior, 1, function(x) crossfamily_variance(x, variable)
),
)
predictive_plot(posterior_pc)
model comparison
tibble(model = c("ordinal", "brownian", "mixed"),
looic = c(
loo_ordinal$estimates[3,1],
loo_brownian$estimates[3,1],
loo_mixed$estimates[3,1]
)
) %>%
mutate(delta_looic = round(looic-min(looic), 3)) %>%
arrange(delta_looic)
| model | looic | delta_looic |
|---|---|---|
| <chr> | <dbl> | <dbl> |
| mixed | 1873.273 | 0.000 |
| brownian | 1874.419 | 1.147 |
| ordinal | 2080.010 | 206.737 |
Variable 3: domestic_organisation
variable <- "domestic_organisation"
Y <- d %>%
select(all_of(variable)) %>%
na.omit %>%
pull
Y <- match(Y, sort(unique(Y)))
ordinal_data <- list()
ordinal_data$Ndata <- length(Y)
ordinal_data$nthres <- max(Y) - 1
ordinal_data$Y <- Y
fit_ordinal <- rstan::sampling(
model_ordinal,
data=ordinal_data,
chains=4,
iter=2000,
)
print(fit_ordinal, pars=c("Intercept", "lp__"))
Inference for Stan model: anon_model.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.
mean se_mean sd 2.5% 25% 50% 75% 97.5%
Intercept[1] -0.93 0.00 0.06 -1.05 -0.97 -0.93 -0.88 -0.80
Intercept[2] 0.05 0.00 0.06 -0.06 0.01 0.05 0.09 0.17
Intercept[3] 1.45 0.00 0.07 1.31 1.40 1.45 1.50 1.60
lp__ -1631.06 0.03 1.23 -1634.26 -1631.63 -1630.73 -1630.16 -1629.66
n_eff Rhat
Intercept[1] 2747 1
Intercept[2] 3929 1
Intercept[3] 3925 1
lp__ 1634 1
Samples were drawn using NUTS(diag_e) at Sun Feb 11 18:28:43 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
(loo_ordinal <- loo(fit_ordinal))
Computed from 4000 by 1183 log-likelihood matrix
Estimate SE
elpd_loo -1625.3 5.9
p_loo 3.1 0.0
looic 3250.5 11.8
------
Monte Carlo SE of elpd_loo is 0.0.
All Pareto k estimates are good (k < 0.5).
See help('pareto-k-diagnostic') for details.
# Extract the generated quantities
generated_quantities <- extract(fit_ordinal)
Y <- ordinal_data$Y
empirical <- tibble(
tv = total_variance(Y),
lv = lineage_wise_variance(Y, variable),
cv = crossfamily_variance(Y, variable)
)
prior_pc <- tibble(
tv = apply(generated_quantities$y_prior, 1, total_variance),
lv = apply(
generated_quantities$y_prior, 1, function(x) lineage_wise_variance(x, variable)
),
cv = apply(
generated_quantities$y_prior, 1, function(x) crossfamily_variance(x, variable)
),
)
predictive_plot(prior_pc)
posterior_pc <- tibble(
tv = apply(generated_quantities$y_posterior, 1, total_variance),
lv = apply(
generated_quantities$y_posterior, 1, function(x) lineage_wise_variance(x, variable)
),
cv = apply(
generated_quantities$y_posterior, 1, function(x) crossfamily_variance(x, variable)
),
)
predictive_plot(posterior_pc)
defined <- which(!is.na(d[[variable]]))
undefined <- which(is.na(d[[variable]]))
po <- c()
for (i in get_postorder(roots, edges)) {
if ((i %in% edges[,2]) & !(i %in% undefined)) {
po <- c(po, i)
}
}
edge_po <- match(po, edges[,2])
phylo_data <- list()
phylo_data$Ntrees <- length(trees)
phylo_data$Nnodes <- nrow(nodes)
phylo_data$Y <- ordinal_data$Y
phylo_data$Nobservations <- length(phylo_data$Y)
phylo_data$Nedges <- length(edge_po)
phylo_data$observations <- match(pull(d[defined, "glottocode"]), nodes$label)
phylo_data$roots <- roots
phylo_data$edges <- edges[edge_po,]
phylo_data$edge_lengths <- edge_lengths[edge_po]
phylo_data$nthres <- length(unique(phylo_data$Y)) - 1
fit_brownian <- rstan::sampling(
model_brownian,
data = phylo_data,
chains = 4,
iter=2000
)
print(fit_brownian, pars=c("mu", "sigma", "sigma_root", "Intercept", "lp__"))
Inference for Stan model: anon_model.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.
mean se_mean sd 2.5% 25% 50% 75%
mu -0.17 0.01 0.30 -0.74 -0.37 -0.17 0.03
sigma 1.54 0.01 0.22 1.12 1.38 1.53 1.69
sigma_root 0.97 0.01 0.18 0.65 0.84 0.96 1.09
Intercept[1] -1.23 0.01 0.28 -1.78 -1.41 -1.23 -1.04
Intercept[2] 0.01 0.01 0.28 -0.54 -0.17 0.01 0.20
Intercept[3] 1.79 0.01 0.29 1.20 1.60 1.79 1.98
lp__ -4598.51 1.46 42.43 -4680.89 -4626.97 -4599.23 -4569.67
97.5% n_eff Rhat
mu 0.40 2039 1.00
sigma 2.01 649 1.01
sigma_root 1.37 787 1.01
Intercept[1] -0.68 2109 1.00
Intercept[2] 0.53 2203 1.00
Intercept[3] 2.36 1908 1.00
lp__ -4513.59 844 1.01
Samples were drawn using NUTS(diag_e) at Sun Feb 11 18:33:54 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
(loo_brownian <- loo(fit_brownian))
Warning message:
“Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
”
Computed from 4000 by 1183 log-likelihood matrix
Estimate SE
elpd_loo -1539.0 13.8
p_loo 211.4 5.0
looic 3077.9 27.5
------
Monte Carlo SE of elpd_loo is NA.
Pareto k diagnostic values:
Count Pct. Min. n_eff
(-Inf, 0.5] (good) 1156 97.7% 671
(0.5, 0.7] (ok) 25 2.1% 253
(0.7, 1] (bad) 2 0.2% 233
(1, Inf) (very bad) 0 0.0% <NA>
See help('pareto-k-diagnostic') for details.
# Extract the generated quantities
generated_quantities <- extract(fit_brownian)
Y <- ordinal_data$Y
empirical <- tibble(
tv = total_variance(Y),
lv = lineage_wise_variance(Y, variable),
cv = crossfamily_variance(Y, variable)
)
prior_pc <- tibble(
tv = apply(generated_quantities$y_prior, 1, total_variance),
lv = apply(
generated_quantities$y_prior, 1, function(x) lineage_wise_variance(x, variable)
),
cv = apply(
generated_quantities$y_prior, 1, function(x) crossfamily_variance(x, variable)
),
)
predictive_plot(prior_pc)
posterior_pc <- tibble(
tv = apply(generated_quantities$y_posterior, 1, total_variance),
lv = apply(
generated_quantities$y_posterior, 1, function(x) lineage_wise_variance(x, variable)
),
cv = apply(
generated_quantities$y_posterior, 1, function(x) crossfamily_variance(x, variable)
),
)
predictive_plot(posterior_pc)
# fit_mixed3 <- readRDS("models/fit_mixed3.rds")
fit_mixed3 <- rstan::sampling(
model_mixed,
data = phylo_data,
chains = 4,
iter = 40000,
thin = 20,
control=list(max_treedepth=15)
)
Warning message:
“There were 4 chains where the estimated Bayesian Fraction of Missing Information was low. See
https://mc-stan.org/misc/warnings.html#bfmi-low”
Warning message:
“Examine the pairs() plot to diagnose sampling problems
”
Warning message:
“Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
Running the chains for more iterations may help. See
https://mc-stan.org/misc/warnings.html#bulk-ess”
Warning message:
“Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
Running the chains for more iterations may help. See
https://mc-stan.org/misc/warnings.html#tail-ess”
saveRDS(fit_mixed3, "models/fit_mixed3.rds")
print(fit_mixed3, pars=c("mu", "sigma", "sigma_root", "Intercept", "sigma_error", "lp__"))
Inference for Stan model: anon_model.
4 chains, each with iter=40000; warmup=20000; thin=20;
post-warmup draws per chain=1000, total post-warmup draws=4000.
mean se_mean sd 2.5% 25% 50% 75%
mu -0.19 0.01 0.32 -0.83 -0.40 -0.20 0.02
sigma 1.54 0.00 0.23 1.12 1.38 1.53 1.68
sigma_root 0.99 0.00 0.18 0.66 0.86 0.98 1.11
Intercept[1] -1.26 0.00 0.29 -1.85 -1.45 -1.26 -1.07
Intercept[2] 0.00 0.00 0.29 -0.59 -0.19 -0.01 0.19
Intercept[3] 1.80 0.01 0.31 1.20 1.59 1.79 2.01
sigma_error 0.26 0.01 0.11 0.13 0.18 0.23 0.30
lp__ -4572.68 34.85 442.38 -5545.25 -4847.57 -4534.39 -4243.60
97.5% n_eff Rhat
mu 0.43 3847 1.00
sigma 2.01 3835 1.00
sigma_root 1.37 3165 1.00
Intercept[1] -0.70 3648 1.00
Intercept[2] 0.56 3828 1.00
Intercept[3] 2.41 3630 1.00
sigma_error 0.55 151 1.03
lp__ -3833.28 161 1.03
Samples were drawn using NUTS(diag_e) at Sun Feb 11 20:09:05 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
(loo_mixed <- loo(fit_mixed3))
Warning message:
“Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
”
Computed from 4000 by 1183 log-likelihood matrix
Estimate SE
elpd_loo -1539.3 13.8
p_loo 228.6 5.3
looic 3078.6 27.5
------
Monte Carlo SE of elpd_loo is NA.
Pareto k diagnostic values:
Count Pct. Min. n_eff
(-Inf, 0.5] (good) 1151 97.3% 293
(0.5, 0.7] (ok) 30 2.5% 229
(0.7, 1] (bad) 2 0.2% 205
(1, Inf) (very bad) 0 0.0% <NA>
See help('pareto-k-diagnostic') for details.
# Extract the generated quantities
generated_quantities <- extract(fit_mixed3)
Y <- ordinal_data$Y
empirical <- tibble(
tv = total_variance(Y),
lv = lineage_wise_variance(Y, variable),
cv = crossfamily_variance(Y, variable)
)
prior_pc <- tibble(
tv = apply(generated_quantities$y_prior, 1, total_variance),
lv = apply(
generated_quantities$y_prior, 1, function(x) lineage_wise_variance(x, variable)
),
cv = apply(
generated_quantities$y_prior, 1, function(x) crossfamily_variance(x, variable)
),
)
predictive_plot(prior_pc)
posterior_pc <- tibble(
tv = apply(generated_quantities$y_posterior, 1, total_variance),
lv = apply(
generated_quantities$y_posterior, 1, function(x) lineage_wise_variance(x, variable)
),
cv = apply(
generated_quantities$y_posterior, 1, function(x) crossfamily_variance(x, variable)
),
)
predictive_plot(posterior_pc)
model comparison
tibble(model = c("ordinal", "brownian", "mixed"),
looic = c(
loo_ordinal$estimates[3,1],
loo_brownian$estimates[3,1],
loo_mixed$estimates[3,1]
)
) %>%
mutate(delta_looic = round(looic-min(looic), 3)) %>%
arrange(delta_looic)
| model | looic | delta_looic |
|---|---|---|
| <chr> | <dbl> | <dbl> |
| brownian | 3077.910 | 0.000 |
| mixed | 3078.557 | 0.647 |
| ordinal | 3250.511 | 172.601 |
Variable 4: agricultureLevel
variable <- "agricultureLevel"
Y <- d %>%
select(all_of(variable)) %>%
na.omit %>%
pull
Y <- match(Y, sort(unique(Y)))
ordinal_data <- list()
ordinal_data$Ndata <- length(Y)
ordinal_data$nthres <- max(Y) - 1
ordinal_data$Y <- Y
fit_ordinal <- rstan::sampling(
model_ordinal,
data=ordinal_data,
chains=4,
iter=2000,
)
print(fit_ordinal, pars=c("Intercept", "lp__"))
Inference for Stan model: anon_model.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.
mean se_mean sd 2.5% 25% 50% 75% 97.5%
Intercept[1] -1.53 0.00 0.08 -1.68 -1.58 -1.53 -1.48 -1.38
Intercept[2] -1.32 0.00 0.07 -1.47 -1.37 -1.32 -1.28 -1.19
Intercept[3] -1.19 0.00 0.07 -1.32 -1.23 -1.18 -1.14 -1.05
Intercept[4] -1.02 0.00 0.07 -1.15 -1.07 -1.02 -0.98 -0.89
Intercept[5] -0.67 0.00 0.06 -0.79 -0.72 -0.67 -0.63 -0.55
Intercept[6] 0.03 0.00 0.06 -0.08 -0.01 0.03 0.07 0.14
Intercept[7] 1.08 0.00 0.07 0.95 1.04 1.08 1.12 1.21
Intercept[8] 2.35 0.00 0.10 2.15 2.28 2.35 2.42 2.55
Intercept[9] 4.12 0.00 0.23 3.70 3.96 4.11 4.27 4.59
lp__ -2448.95 0.05 2.12 -2453.95 -2450.09 -2448.57 -2447.47 -2445.82
n_eff Rhat
Intercept[1] 3772 1
Intercept[2] 4469 1
Intercept[3] 4537 1
Intercept[4] 4666 1
Intercept[5] 5386 1
Intercept[6] 4908 1
Intercept[7] 4597 1
Intercept[8] 5763 1
Intercept[9] 6636 1
lp__ 1880 1
Samples were drawn using NUTS(diag_e) at Sun Feb 11 20:15:19 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
(loo_ordinal <- loo(fit_ordinal))
Computed from 4000 by 1209 log-likelihood matrix
Estimate SE
elpd_loo -2424.1 23.3
p_loo 8.9 0.3
looic 4848.1 46.6
------
Monte Carlo SE of elpd_loo is 0.0.
All Pareto k estimates are good (k < 0.5).
See help('pareto-k-diagnostic') for details.
# Extract the generated quantities
generated_quantities <- extract(fit_ordinal)
Y <- ordinal_data$Y
empirical <- tibble(
tv = total_variance(Y),
lv = lineage_wise_variance(Y, variable),
cv = crossfamily_variance(Y, variable)
)
prior_pc <- tibble(
tv = apply(generated_quantities$y_prior, 1, total_variance),
lv = apply(
generated_quantities$y_prior, 1, function(x) lineage_wise_variance(x, variable)
),
cv = apply(
generated_quantities$y_prior, 1, function(x) crossfamily_variance(x, variable)
),
)
predictive_plot(prior_pc)
posterior_pc <- tibble(
tv = apply(generated_quantities$y_posterior, 1, total_variance),
lv = apply(
generated_quantities$y_posterior, 1, function(x) lineage_wise_variance(x, variable)
),
cv = apply(
generated_quantities$y_posterior, 1, function(x) crossfamily_variance(x, variable)
),
)
predictive_plot(posterior_pc)
defined <- which(!is.na(d[[variable]]))
undefined <- which(is.na(d[[variable]]))
po <- c()
for (i in get_postorder(roots, edges)) {
if ((i %in% edges[,2]) & !(i %in% undefined)) {
po <- c(po, i)
}
}
edge_po <- match(po, edges[,2])
phylo_data <- list()
phylo_data$Ntrees <- length(trees)
phylo_data$Nnodes <- nrow(nodes)
phylo_data$Y <- ordinal_data$Y
phylo_data$Nobservations <- length(phylo_data$Y)
phylo_data$Nedges <- length(edge_po)
phylo_data$observations <- match(pull(d[defined, "glottocode"]), nodes$label)
phylo_data$roots <- roots
phylo_data$edges <- edges[edge_po,]
phylo_data$edge_lengths <- edge_lengths[edge_po]
phylo_data$nthres <- length(unique(phylo_data$Y)) - 1
fit_brownian <- rstan::sampling(
model_brownian,
data = phylo_data,
chains = 4,
iter=2000
)
print(fit_brownian, pars=c("mu", "sigma", "sigma_root", "Intercept", "lp__"))
Inference for Stan model: anon_model.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.
mean se_mean sd 2.5% 25% 50% 75%
mu 0.13 0.01 0.81 -1.46 -0.39 0.13 0.66
sigma 1.77 0.01 0.24 1.32 1.61 1.76 1.92
sigma_root 3.03 0.01 0.29 2.51 2.83 3.02 3.21
Intercept[1] -1.42 0.01 0.83 -3.09 -1.96 -1.42 -0.88
Intercept[2] -0.89 0.01 0.82 -2.55 -1.42 -0.89 -0.34
Intercept[3] -0.56 0.01 0.83 -2.21 -1.09 -0.56 -0.01
Intercept[4] -0.18 0.01 0.83 -1.84 -0.73 -0.19 0.36
Intercept[5] 0.58 0.01 0.82 -1.10 0.03 0.57 1.12
Intercept[6] 1.93 0.01 0.82 0.28 1.39 1.94 2.49
Intercept[7] 3.64 0.01 0.84 1.95 3.10 3.65 4.19
Intercept[8] 5.41 0.01 0.85 3.71 4.85 5.42 5.97
Intercept[9] 7.49 0.01 0.89 5.71 6.91 7.50 8.08
lp__ -5016.46 1.70 45.46 -5106.60 -5046.41 -5016.48 -4985.85
97.5% n_eff Rhat
mu 1.71 5781 1.00
sigma 2.27 725 1.00
sigma_root 3.65 1206 1.00
Intercept[1] 0.16 5330 1.00
Intercept[2] 0.69 5316 1.00
Intercept[3] 1.04 5296 1.00
Intercept[4] 1.41 5341 1.00
Intercept[5] 2.14 5302 1.00
Intercept[6] 3.52 5211 1.00
Intercept[7] 5.26 4847 1.00
Intercept[8] 7.07 4545 1.00
Intercept[9] 9.23 4381 1.00
lp__ -4927.68 713 1.01
Samples were drawn using NUTS(diag_e) at Sun Feb 11 20:24:00 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
(loo_brownian <- loo(fit_brownian))
Warning message:
“Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
”
Computed from 4000 by 1209 log-likelihood matrix
Estimate SE
elpd_loo -2009.1 34.5
p_loo 313.3 9.4
looic 4018.2 69.0
------
Monte Carlo SE of elpd_loo is NA.
Pareto k diagnostic values:
Count Pct. Min. n_eff
(-Inf, 0.5] (good) 1045 86.4% 357
(0.5, 0.7] (ok) 112 9.3% 116
(0.7, 1] (bad) 50 4.1% 54
(1, Inf) (very bad) 2 0.2% 42
See help('pareto-k-diagnostic') for details.
# Extract the generated quantities
generated_quantities <- extract(fit_brownian)
Y <- ordinal_data$Y
empirical <- tibble(
tv = total_variance(Y),
lv = lineage_wise_variance(Y, variable),
cv = crossfamily_variance(Y, variable)
)
prior_pc <- tibble(
tv = apply(generated_quantities$y_prior, 1, total_variance),
lv = apply(
generated_quantities$y_prior, 1, function(x) lineage_wise_variance(x, variable)
),
cv = apply(
generated_quantities$y_prior, 1, function(x) crossfamily_variance(x, variable)
),
)
predictive_plot(prior_pc)
posterior_pc <- tibble(
tv = apply(generated_quantities$y_posterior, 1, total_variance),
lv = apply(
generated_quantities$y_posterior, 1, function(x) lineage_wise_variance(x, variable)
),
cv = apply(
generated_quantities$y_posterior, 1, function(x) crossfamily_variance(x, variable)
),
)
predictive_plot(posterior_pc)
# fit_mixed4 <- readRDS("models/fit_mixed4.rds")
fit_mixed4 <- rstan::sampling(
model_mixed,
data = phylo_data,
chains = 4,
iter=40000,
thin=20,
control=list(max_treedepth=15)
)
Warning message:
“There were 4 chains where the estimated Bayesian Fraction of Missing Information was low. See
https://mc-stan.org/misc/warnings.html#bfmi-low”
Warning message:
“Examine the pairs() plot to diagnose sampling problems
”
Warning message:
“Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
Running the chains for more iterations may help. See
https://mc-stan.org/misc/warnings.html#bulk-ess”
Warning message:
“Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
Running the chains for more iterations may help. See
https://mc-stan.org/misc/warnings.html#tail-ess”
saveRDS(fit_mixed4, "models/fit_mixed4.rds")
print(fit_mixed4, pars=c("mu", "sigma", "sigma_root", "Intercept", "sigma_error", "lp__"))
Inference for Stan model: anon_model.
4 chains, each with iter=40000; warmup=20000; thin=20;
post-warmup draws per chain=1000, total post-warmup draws=4000.
mean se_mean sd 2.5% 25% 50% 75%
mu 0.33 0.02 1.41 -2.49 -0.59 0.32 1.28
sigma 1.79 0.00 0.24 1.34 1.62 1.78 1.95
sigma_root 3.05 0.00 0.30 2.53 2.85 3.03 3.25
Intercept[1] -1.24 0.02 1.39 -4.06 -2.16 -1.23 -0.32
Intercept[2] -0.71 0.02 1.39 -3.51 -1.63 -0.69 0.21
Intercept[3] -0.37 0.02 1.39 -3.17 -1.31 -0.35 0.56
Intercept[4] 0.01 0.02 1.39 -2.77 -0.92 0.04 0.94
Intercept[5] 0.77 0.02 1.39 -2.04 -0.17 0.80 1.71
Intercept[6] 2.14 0.02 1.39 -0.66 1.22 2.16 3.08
Intercept[7] 3.87 0.02 1.40 1.08 2.92 3.89 4.82
Intercept[8] 5.65 0.02 1.41 2.84 4.70 5.66 6.61
Intercept[9] 7.74 0.02 1.42 4.91 6.78 7.76 8.72
sigma_error 0.23 0.01 0.09 0.12 0.17 0.21 0.26
lp__ -4852.29 29.82 398.07 -5800.10 -5077.76 -4811.82 -4572.81
97.5% n_eff Rhat
mu 3.05 4003 1.00
sigma 2.29 3995 1.00
sigma_root 3.65 4056 1.00
Intercept[1] 1.42 4045 1.00
Intercept[2] 2.00 4008 1.00
Intercept[3] 2.29 4007 1.00
Intercept[4] 2.67 4014 1.00
Intercept[5] 3.45 4012 1.00
Intercept[6] 4.82 3999 1.00
Intercept[7] 6.56 3928 1.00
Intercept[8] 8.38 3927 1.00
Intercept[9] 10.44 3921 1.00
sigma_error 0.47 151 1.02
lp__ -4200.04 178 1.01
Samples were drawn using NUTS(diag_e) at Sun Feb 11 23:14:35 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
(loo_mixed <- loo(fit_mixed4))
Warning message:
“Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
”
Computed from 4000 by 1209 log-likelihood matrix
Estimate SE
elpd_loo -2009.8 34.5
p_loo 327.9 9.7
looic 4019.5 69.0
------
Monte Carlo SE of elpd_loo is NA.
Pareto k diagnostic values:
Count Pct. Min. n_eff
(-Inf, 0.5] (good) 1036 85.7% 379
(0.5, 0.7] (ok) 105 8.7% 149
(0.7, 1] (bad) 68 5.6% 42
(1, Inf) (very bad) 0 0.0% <NA>
See help('pareto-k-diagnostic') for details.
# Extract the generated quantities
generated_quantities <- extract(fit_mixed4)
Y <- ordinal_data$Y
empirical <- tibble(
tv = total_variance(Y),
lv = lineage_wise_variance(Y, variable),
cv = crossfamily_variance(Y, variable)
)
prior_pc <- tibble(
tv = apply(generated_quantities$y_prior, 1, total_variance),
lv = apply(
generated_quantities$y_prior, 1, function(x) lineage_wise_variance(x, variable)
),
cv = apply(
generated_quantities$y_prior, 1, function(x) crossfamily_variance(x, variable)
),
)
predictive_plot(prior_pc)
posterior_pc <- tibble(
tv = apply(generated_quantities$y_posterior, 1, total_variance),
lv = apply(
generated_quantities$y_posterior, 1, function(x) lineage_wise_variance(x, variable)
),
cv = apply(
generated_quantities$y_posterior, 1, function(x) crossfamily_variance(x, variable)
),
)
predictive_plot(posterior_pc)
model comparison
tibble(model = c("ordinal", "brownian", "mixed"),
looic = c(
loo_ordinal$estimates[3,1],
loo_brownian$estimates[3,1],
loo_mixed$estimates[3,1]
)
) %>%
mutate(delta_looic = round(looic-min(looic), 3)) %>%
arrange(delta_looic)
| model | looic | delta_looic |
|---|---|---|
| <chr> | <dbl> | <dbl> |
| brownian | 4018.195 | 0.00 |
| mixed | 4019.505 | 1.31 |
| ordinal | 4848.144 | 829.95 |
Variable 5: settlement_strategy
variable <- "settlement_strategy"
Y <- d %>%
select(all_of(variable)) %>%
na.omit %>%
pull
Y <- match(Y, sort(unique(Y)))
ordinal_data <- list()
ordinal_data$Ndata <- length(Y)
ordinal_data$nthres <- max(Y) - 1
ordinal_data$Y <- Y
fit_ordinal <- rstan::sampling(
model_ordinal,
data=ordinal_data,
chains=4,
iter=2000,
)
print(fit_ordinal, pars=c("Intercept", "lp__"))
Inference for Stan model: anon_model.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.
mean se_mean sd 2.5% 25% 50% 75% 97.5%
Intercept[1] -1.42 0.00 0.07 -1.56 -1.47 -1.42 -1.37 -1.28
Intercept[2] -0.88 0.00 0.06 -1.01 -0.93 -0.88 -0.84 -0.76
Intercept[3] 0.06 0.00 0.06 -0.06 0.02 0.06 0.10 0.18
lp__ -1371.80 0.03 1.21 -1374.83 -1372.33 -1371.48 -1370.92 -1370.45
n_eff Rhat
Intercept[1] 2225 1
Intercept[2] 3110 1
Intercept[3] 5618 1
lp__ 1803 1
Samples were drawn using NUTS(diag_e) at Sun Feb 11 23:20:48 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
(loo_ordinal <- loo(fit_ordinal))
Computed from 4000 by 1107 log-likelihood matrix
Estimate SE
elpd_loo -1364.9 18.0
p_loo 2.9 0.1
looic 2729.9 36.0
------
Monte Carlo SE of elpd_loo is 0.0.
All Pareto k estimates are good (k < 0.5).
See help('pareto-k-diagnostic') for details.
# Extract the generated quantities
generated_quantities <- extract(fit_ordinal)
Y <- ordinal_data$Y
empirical <- tibble(
tv = total_variance(Y),
lv = lineage_wise_variance(Y, variable),
cv = crossfamily_variance(Y, variable)
)
prior_pc <- tibble(
tv = apply(generated_quantities$y_prior, 1, total_variance),
lv = apply(
generated_quantities$y_prior, 1, function(x) lineage_wise_variance(x, variable)
),
cv = apply(
generated_quantities$y_prior, 1, function(x) crossfamily_variance(x, variable)
),
)
predictive_plot(prior_pc)
posterior_pc <- tibble(
tv = apply(generated_quantities$y_posterior, 1, total_variance),
lv = apply(
generated_quantities$y_posterior, 1, function(x) lineage_wise_variance(x, variable)
),
cv = apply(
generated_quantities$y_posterior, 1, function(x) crossfamily_variance(x, variable)
),
)
predictive_plot(posterior_pc)
defined <- which(!is.na(d[[variable]]))
undefined <- which(is.na(d[[variable]]))
po <- c()
for (i in get_postorder(roots, edges)) {
if ((i %in% edges[,2]) & !(i %in% undefined)) {
po <- c(po, i)
}
}
edge_po <- match(po, edges[,2])
phylo_data <- list()
phylo_data$Ntrees <- length(trees)
phylo_data$Nnodes <- nrow(nodes)
phylo_data$Y <- ordinal_data$Y
phylo_data$Nobservations <- length(phylo_data$Y)
phylo_data$Nedges <- length(edge_po)
phylo_data$observations <- match(pull(d[defined, "glottocode"]), nodes$label)
phylo_data$roots <- roots
phylo_data$edges <- edges[edge_po,]
phylo_data$edge_lengths <- edge_lengths[edge_po]
phylo_data$nthres <- length(unique(phylo_data$Y)) - 1
fit_brownian <- rstan::sampling(
model_brownian,
data = phylo_data,
chains = 4,
iter=2000
)
print(fit_brownian, pars=c("mu", "sigma", "sigma_root", "Intercept", "lp__"))
Inference for Stan model: anon_model.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.
mean se_mean sd 2.5% 25% 50% 75%
mu 0.34 0.00 0.24 -0.14 0.19 0.35 0.50
sigma 1.35 0.02 0.36 0.69 1.10 1.33 1.58
sigma_root 1.75 0.01 0.26 1.29 1.57 1.74 1.92
Intercept[1] -1.17 0.00 0.17 -1.51 -1.28 -1.17 -1.06
Intercept[2] -0.39 0.00 0.16 -0.71 -0.49 -0.39 -0.29
Intercept[3] 0.95 0.00 0.16 0.63 0.84 0.94 1.05
lp__ -4306.59 1.70 42.79 -4391.52 -4335.71 -4306.72 -4277.15
97.5% n_eff Rhat
mu 0.82 2607 1.00
sigma 2.08 492 1.01
sigma_root 2.31 1159 1.00
Intercept[1] -0.85 3356 1.00
Intercept[2] -0.08 5316 1.00
Intercept[3] 1.27 5119 1.00
lp__ -4224.04 637 1.00
Samples were drawn using NUTS(diag_e) at Sun Feb 11 23:25:39 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
(loo_brownian <- loo(fit_brownian))
Warning message:
“Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
”
Computed from 4000 by 1107 log-likelihood matrix
Estimate SE
elpd_loo -1232.0 21.8
p_loo 182.0 5.9
looic 2464.0 43.5
------
Monte Carlo SE of elpd_loo is NA.
Pareto k diagnostic values:
Count Pct. Min. n_eff
(-Inf, 0.5] (good) 1017 91.9% 562
(0.5, 0.7] (ok) 85 7.7% 288
(0.7, 1] (bad) 5 0.5% 228
(1, Inf) (very bad) 0 0.0% <NA>
See help('pareto-k-diagnostic') for details.
# Extract the generated quantities
generated_quantities <- extract(fit_brownian)
Y <- ordinal_data$Y
empirical <- tibble(
tv = total_variance(Y),
lv = lineage_wise_variance(Y, variable),
cv = crossfamily_variance(Y, variable)
)
prior_pc <- tibble(
tv = apply(generated_quantities$y_prior, 1, total_variance),
lv = apply(
generated_quantities$y_prior, 1, function(x) lineage_wise_variance(x, variable)
),
cv = apply(
generated_quantities$y_prior, 1, function(x) crossfamily_variance(x, variable)
),
)
predictive_plot(prior_pc)
posterior_pc <- tibble(
tv = apply(generated_quantities$y_posterior, 1, total_variance),
lv = apply(
generated_quantities$y_posterior, 1, function(x) lineage_wise_variance(x, variable)
),
cv = apply(
generated_quantities$y_posterior, 1, function(x) crossfamily_variance(x, variable)
),
)
predictive_plot(posterior_pc)
# fit_mixed5 <- readRDS("models/fit_mixed5.rds")
fit_mixed5 <- rstan::sampling(
model_mixed,
data = phylo_data,
chains = 4,
iter=40000,
thin=20,
control=list(max_treedepth=15)
)
Warning message:
“There were 1 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 15. See
https://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded”
Warning message:
“There were 4 chains where the estimated Bayesian Fraction of Missing Information was low. See
https://mc-stan.org/misc/warnings.html#bfmi-low”
Warning message:
“Examine the pairs() plot to diagnose sampling problems
”
Warning message:
“Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
Running the chains for more iterations may help. See
https://mc-stan.org/misc/warnings.html#bulk-ess”
Warning message:
“Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
Running the chains for more iterations may help. See
https://mc-stan.org/misc/warnings.html#tail-ess”
saveRDS(fit_mixed5, "models/fit_mixed5.rds")
print(fit_mixed5, pars=c("mu", "sigma", "sigma_root", "Intercept", "sigma_error", "lp__"))
Inference for Stan model: anon_model.
4 chains, each with iter=40000; warmup=20000; thin=20;
post-warmup draws per chain=1000, total post-warmup draws=4000.
mean se_mean sd 2.5% 25% 50% 75%
mu 0.36 0.00 0.24 -0.12 0.19 0.36 0.53
sigma 1.36 0.01 0.37 0.67 1.11 1.35 1.60
sigma_root 1.77 0.00 0.27 1.29 1.58 1.75 1.94
Intercept[1] -1.18 0.00 0.17 -1.53 -1.29 -1.18 -1.06
Intercept[2] -0.39 0.00 0.16 -0.71 -0.50 -0.39 -0.27
Intercept[3] 0.97 0.00 0.17 0.64 0.85 0.97 1.08
sigma_error 0.24 0.01 0.10 0.13 0.17 0.21 0.28
lp__ -4216.27 28.05 389.25 -5089.27 -4451.84 -4168.79 -3933.01
97.5% n_eff Rhat
mu 0.84 3964 1.00
sigma 2.11 3787 1.00
sigma_root 2.37 3910 1.00
Intercept[1] -0.86 3959 1.00
Intercept[2] -0.08 4015 1.00
Intercept[3] 1.30 3593 1.00
sigma_error 0.50 173 1.02
lp__ -3587.28 193 1.01
Samples were drawn using NUTS(diag_e) at Mon Feb 12 00:51:07 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
(loo_mixed <- loo(fit_mixed5))
Warning message:
“Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
”
Computed from 4000 by 1107 log-likelihood matrix
Estimate SE
elpd_loo -1231.1 21.7
p_loo 194.1 5.9
looic 2462.1 43.4
------
Monte Carlo SE of elpd_loo is NA.
Pareto k diagnostic values:
Count Pct. Min. n_eff
(-Inf, 0.5] (good) 1025 92.6% 376
(0.5, 0.7] (ok) 79 7.1% 215
(0.7, 1] (bad) 3 0.3% 167
(1, Inf) (very bad) 0 0.0% <NA>
See help('pareto-k-diagnostic') for details.
# Extract the generated quantities
generated_quantities <- extract(fit_mixed5)
Y <- ordinal_data$Y
empirical <- tibble(
tv = total_variance(Y),
lv = lineage_wise_variance(Y, variable),
cv = crossfamily_variance(Y, variable)
)
prior_pc <- tibble(
tv = apply(generated_quantities$y_prior, 1, total_variance),
lv = apply(
generated_quantities$y_prior, 1, function(x) lineage_wise_variance(x, variable)
),
cv = apply(
generated_quantities$y_prior, 1, function(x) crossfamily_variance(x, variable)
),
)
predictive_plot(prior_pc)
posterior_pc <- tibble(
tv = apply(generated_quantities$y_posterior, 1, total_variance),
lv = apply(
generated_quantities$y_posterior, 1, function(x) lineage_wise_variance(x, variable)
),
cv = apply(
generated_quantities$y_posterior, 1, function(x) crossfamily_variance(x, variable)
),
)
predictive_plot(posterior_pc)
model comparison
tibble(model = c("ordinal", "brownian", "mixed"),
looic = c(
loo_ordinal$estimates[3,1],
loo_brownian$estimates[3,1],
loo_mixed$estimates[3,1]
)
) %>%
mutate(delta_looic = round(looic-min(looic), 3)) %>%
arrange(delta_looic)

| model | looic | delta_looic |
|---|---|---|
| <chr> | <dbl> | <dbl> |
| mixed | 2462.106 | 0.000 |
| brownian | 2464.050 | 1.944 |
| ordinal | 2729.880 | 267.774 |
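Raw looic differences come without an uncertainty estimate. A minimal sketch (assuming the `loo_ordinal`, `loo_brownian`, and `loo_mixed` objects computed above) would compare the models on the elpd scale with standard errors via `loo::loo_compare`:

```r
# Hedged sketch: elpd differences plus their standard errors.
# loo_compare orders the models by elpd; elpd_diff is relative to the best model.
comp <- loo::loo_compare(list(
  ordinal  = loo_ordinal,
  brownian = loo_brownian,
  mixed    = loo_mixed
))
print(comp, simplify = FALSE)  # shows elpd_diff and se_diff per model
```

If `se_diff` is of the same order as `elpd_diff` (as the small mixed-vs-brownian gap suggests), the two phylogenetic models are effectively tied.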
Variable 6: exogamy
variable <- "exogamy"
Y <- d %>%
select(all_of(variable)) %>%
na.omit %>%
pull
Y <- match(Y, sort(unique(Y)))

Since there are only two values for this variable, we need a logistic rather than an ordinal model.
binary_data <- list()
binary_data$Ndata <- length(Y)
binary_data$Y <- Y - 1
binary_code <- "
data {
int<lower=1> Ndata; // total number of observations
int<lower=0, upper=1> Y[Ndata]; // binary response variable
}
parameters {
real Intercept; // intercept for logistic regression
}
model {
target += normal_lpdf(Intercept | 0, 2);
for (n in 1:Ndata) {
target += bernoulli_logit_lpmf(Y[n] | Intercept);
}
}
generated quantities {
int y_prior[Ndata];
int y_posterior[Ndata];
vector[Ndata] log_lik;
real Intercept_prior = normal_rng(0, 2); // sample from the prior
for (n in 1:Ndata) {
y_prior[n] = bernoulli_logit_rng(Intercept_prior); // simulate prior predictive data
y_posterior[n] = bernoulli_logit_rng(Intercept); // simulate posterior predictive data
log_lik[n] = bernoulli_logit_lpmf(Y[n] | Intercept); // log likelihood for each observation
}
}
"
model_binary <- stan_model(model_code=binary_code)
fit_binary <- rstan::sampling(
model_binary,
data=binary_data,
chains=4,
iter=2000,
)
print(fit_binary, pars=c("Intercept", "lp__"))
Inference for Stan model: anon_model.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff
Intercept -0.66 0.00 0.06 -0.78 -0.70 -0.66 -0.62 -0.53 1447
lp__ -666.75 0.02 0.69 -668.83 -666.88 -666.50 -666.33 -666.28 1764
Rhat
Intercept 1
lp__ 1
Samples were drawn using NUTS(diag_e) at Mon Feb 12 00:58:07 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
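The intercept of a logistic regression is easier to read on the probability scale. A minimal sketch (assuming `fit_binary` from above) maps the posterior draws through the inverse logit:

```r
# Hedged sketch: transform intercept draws to probabilities with plogis().
p <- plogis(extract(fit_binary, pars = "Intercept")$Intercept)
mean(p)                        # roughly 0.34, given the posterior mean of -0.66
quantile(p, c(0.025, 0.975))   # posterior interval on the probability scale
```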
(loo_binary <- loo(fit_binary))
Computed from 4000 by 1036 log-likelihood matrix
Estimate SE
elpd_loo -665.6 10.1
p_loo 0.9 0.0
looic 1331.1 20.1
------
Monte Carlo SE of elpd_loo is 0.0.
All Pareto k estimates are good (k < 0.5).
See help('pareto-k-diagnostic') for details.
# Extract the generated quantities
generated_quantities <- extract(fit_binary)
Y <- binary_data$Y
empirical <- tibble(
tv = total_variance(Y),
lv = lineage_wise_variance(Y, variable),
cv = crossfamily_variance(Y, variable)
)
prior_pc <- tibble(
tv = apply(generated_quantities$y_prior, 1, total_variance),
lv = apply(
generated_quantities$y_prior, 1, function(x) lineage_wise_variance(x, variable)
),
cv = apply(
generated_quantities$y_prior, 1, function(x) crossfamily_variance(x, variable)
),
)
predictive_plot(prior_pc)
posterior_pc <- tibble(
tv = apply(generated_quantities$y_posterior, 1, total_variance),
lv = apply(
generated_quantities$y_posterior, 1, function(x) lineage_wise_variance(x, variable)
),
cv = apply(
generated_quantities$y_posterior, 1, function(x) crossfamily_variance(x, variable)
),
)
predictive_plot(posterior_pc)
defined <- which(!is.na(d[[variable]]))
undefined <- which(is.na(d[[variable]]))
po <- c()
for (i in get_postorder(roots, edges)) {
if ((i %in% edges[,2]) & !(i %in% undefined)) {
po <- c(po, i)
}
}
edge_po <- match(po, edges[,2])
phylo_data <- list()
phylo_data$Ntrees <- length(trees)
phylo_data$Nnodes <- nrow(nodes)
phylo_data$Y <- binary_data$Y
phylo_data$Nobservations <- length(phylo_data$Y)
phylo_data$Nedges <- length(edge_po)
phylo_data$observations <- match(pull(d[defined, "glottocode"]), nodes$label)
phylo_data$roots <- roots
phylo_data$edges <- edges[edge_po,]
phylo_data$edge_lengths <- edge_lengths[edge_po]
phylo_data$nthres <- length(unique(phylo_data$Y)) - 1

The phylogenetic ordinal regression models also have to be adjusted to a binary outcome variable.
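In the Brownian models, the latent trait at a daughter node is drawn around the mother's value with standard deviation `sigma * sqrt(edge_length)`, since Brownian variance grows linearly with branch length. A toy R sketch of this recursion (illustrative only, not the actual data):

```r
# Toy illustration: latent Brownian trait along a three-node chain,
# mirroring the edge loop in the Stan models below.
set.seed(1)
sigma <- 1
edge_lengths <- c(0.5, 2.0)  # mother -> child branch lengths
z <- numeric(3)
z[1] <- rnorm(1, 0, 1)       # root value
for (j in 1:2) {
  z[j + 1] <- rnorm(1, mean = z[j], sd = sigma * sqrt(edge_lengths[j]))
}
z  # longer branches permit larger jumps
```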
brownian_binary_code <- "
data {
int<lower=1> Ntrees;
int<lower=0> Nnodes;
int<lower=1> Nobservations;
int<lower=1> Nedges;
int<lower=1,upper=Nnodes> observations[Nobservations]; // indices of the observed tips
int<lower=1, upper=Nnodes> roots[Ntrees];
int<lower=1, upper=Nnodes> edges[Nedges, 2];
vector[Nedges] edge_lengths;
int<lower=0, upper=1> Y[Nobservations]; // response variable
}
parameters {
vector[Nnodes] z_normal;
real mu;
real<lower=0> sigma;
real<lower=0> sigma_root;
}
model {
vector[Nnodes] z = rep_vector(0, Nnodes);
target += normal_lpdf(mu | 0, 1);
target += exponential_lpdf(sigma | 1);
target += exponential_lpdf(sigma_root | 1); // matches sigma_root_prior in generated quantities
target += normal_lpdf(z_normal | 0, 1);
for (i in 1:Ntrees) {
int idx = roots[i];
z[idx] = mu + z_normal[idx] * sigma_root;
}
for (i in 1:Nedges) {
int j = Nedges - i + 1;
int mother_node = edges[j, 1];
int daughter_node = edges[j, 2];
real local_mu = z[mother_node];
real local_sigma = sigma * sqrt(edge_lengths[j]);
z[daughter_node] = local_mu + z_normal[daughter_node] * local_sigma;
}
for (n in 1:Nobservations) {
target += bernoulli_logit_lpmf(Y[n] | z[observations[n]]);
}
}
generated quantities {
vector[Nnodes] z = rep_vector(0, Nnodes);
int y_prior[Nobservations]; // Prior predictive samples
int y_posterior[Nobservations]; // Posterior predictive samples
vector[Nnodes] z_prior = rep_vector(0, Nnodes);
vector[Nnodes] z_posterior = rep_vector(0, Nnodes);
real mu_prior;
real<lower=0> sigma_prior;
real<lower=0> sigma_root_prior;
real log_lik[Nobservations];
for (i in 1:Ntrees) {
int idx = roots[i];
z[idx] = mu + z_normal[idx] * sigma_root;
}
for (i in 1:Nedges) {
int j = Nedges - i + 1;
int mother_node = edges[j, 1];
int daughter_node = edges[j, 2];
real local_mu = z[mother_node];
real local_sigma = sigma * sqrt(edge_lengths[j]);
z[daughter_node] = local_mu + z_normal[daughter_node] * local_sigma;
}
mu_prior = normal_rng(0, 1); // match the normal(0, 1) prior on mu in the model block
sigma_prior = exponential_rng(1);
sigma_root_prior = exponential_rng(1);
for (i in 1:Ntrees) {
z_prior[roots[i]] = normal_rng(mu_prior, sigma_root_prior);
}
for (i in 1:Ntrees) {
int idx = roots[i];
z_posterior[idx] = mu + z_normal[idx] * sigma_root;
}
for (i in 1:Nedges) {
int j = Nedges - i + 1;
int mother_node = edges[j, 1];
int daughter_node = edges[j, 2];
real local_mu = z_posterior[mother_node];
real local_sigma = sigma * sqrt(edge_lengths[j]);
z_prior[daughter_node] = normal_rng(
z_prior[mother_node],
sigma_prior * sqrt(edge_lengths[j])
);
z_posterior[daughter_node] = local_mu + z_normal[daughter_node] * local_sigma;
}
for (n in 1:Nobservations) {
y_prior[n] = bernoulli_logit_rng(z_prior[observations[n]]);
y_posterior[n] = bernoulli_logit_rng(z[observations[n]]);
log_lik[n] = bernoulli_logit_lpmf(Y[n] | z[observations[n]]);
}
}
"
model_brownian_binary <- stan_model(model_code=brownian_binary_code)
fit_brownian <- rstan::sampling(
model_brownian_binary,
data = phylo_data,
chains = 4,
iter=2000
)
print(fit_brownian, pars=c("mu", "sigma", "sigma_root", "lp__"))
Inference for Stan model: anon_model.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.
mean se_mean sd 2.5% 25% 50% 75% 97.5%
mu -1.10 0.01 0.24 -1.63 -1.25 -1.08 -0.93 -0.66
sigma 3.19 0.04 0.73 2.03 2.69 3.09 3.60 4.84
sigma_root 1.58 0.03 0.51 0.70 1.23 1.53 1.90 2.70
lp__ -3631.70 2.25 47.67 -3720.89 -3664.01 -3633.21 -3600.92 -3532.82
n_eff Rhat
mu 1091 1.00
sigma 401 1.01
sigma_root 402 1.01
lp__ 449 1.01
Samples were drawn using NUTS(diag_e) at Mon Feb 12 01:03:28 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
(loo_brownian <- loo(fit_brownian))
Warning message:
“Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
”
Computed from 4000 by 1036 log-likelihood matrix
Estimate SE
elpd_loo -597.9 13.3
p_loo 213.8 6.9
looic 1195.8 26.5
------
Monte Carlo SE of elpd_loo is NA.
Pareto k diagnostic values:
Count Pct. Min. n_eff
(-Inf, 0.5] (good) 614 59.3% 427
(0.5, 0.7] (ok) 299 28.9% 194
(0.7, 1] (bad) 44 4.2% 86
(1, Inf) (very bad) 79 7.6% 145
See help('pareto-k-diagnostic') for details.
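With 79 observations in the "very bad" range, the PSIS approximation underlying this loo estimate is not trustworthy. One option (a sketch, assuming the `fit_brownian` object above; it can be slow and may require the model to be recompiled) is moment matching:

```r
# Hedged sketch: refit the importance-sampling approximation for high-k
# observations via moment matching, which often rescues bad Pareto k values.
loo_brownian_mm <- loo(fit_brownian, moment_match = TRUE)
print(loo_brownian_mm)
```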
# Extract the generated quantities
generated_quantities <- extract(fit_brownian)
Y <- binary_data$Y
empirical <- tibble(
tv = total_variance(Y),
lv = lineage_wise_variance(Y, variable),
cv = crossfamily_variance(Y, variable)
)
prior_pc <- tibble(
tv = apply(generated_quantities$y_prior, 1, total_variance),
lv = apply(
generated_quantities$y_prior, 1, function(x) lineage_wise_variance(x, variable)
),
cv = apply(
generated_quantities$y_prior, 1, function(x) crossfamily_variance(x, variable)
),
)
predictive_plot(prior_pc)
posterior_pc <- tibble(
tv = apply(generated_quantities$y_posterior, 1, total_variance),
lv = apply(
generated_quantities$y_posterior, 1, function(x) lineage_wise_variance(x, variable)
),
cv = apply(
generated_quantities$y_posterior, 1, function(x) crossfamily_variance(x, variable)
),
)
predictive_plot(posterior_pc)
mixed_binary_code <- "
data {
int<lower=1> Ntrees;
int<lower=0> Nnodes;
int<lower=1> Nobservations;
int<lower=1> Nedges;
int<lower=1,upper=Nnodes> observations[Nobservations]; // indices of the observed tips
int<lower=1, upper=Nnodes> roots[Ntrees];
int<lower=1, upper=Nnodes> edges[Nedges, 2];
vector[Nedges] edge_lengths;
int<lower=0, upper=1> Y[Nobservations]; // response variable
}
parameters {
vector[Nnodes] z_normal;
real mu;
real<lower=0> sigma;
real<lower=0> sigma_root;
real<lower=0> sigma_error;
vector[Nobservations] z_error;
}
model {
vector[Nnodes] z = rep_vector(0, Nnodes);
target += normal_lpdf(mu | 0, 10);
target += exponential_lpdf(sigma | 1);
target += exponential_lpdf(sigma_root | 1); // matches sigma_root_prior in generated quantities
target += normal_lpdf(z_normal | 0, 1);
target += inv_gamma_lpdf(square(sigma_error) | 2, 0.1);
target += normal_lpdf(z_error | 0, sigma_error);
for (i in 1:Ntrees) {
int idx = roots[i];
z[idx] = mu + z_normal[idx] * sigma_root;
}
for (i in 1:Nedges) {
int j = Nedges - i + 1;
int mother_node = edges[j, 1];
int daughter_node = edges[j, 2];
real local_mu = z[mother_node];
real local_sigma = sigma * sqrt(edge_lengths[j]);
z[daughter_node] = local_mu + z_normal[daughter_node] * local_sigma;
}
for (n in 1:Nobservations) {
target += bernoulli_logit_lpmf(Y[n] | z[observations[n]] + z_error[n]);
}
}
generated quantities {
vector[Nnodes] z = rep_vector(0, Nnodes);
int y_prior[Nobservations]; // Prior predictive samples
int y_posterior[Nobservations]; // Posterior predictive samples
vector[Nnodes] z_prior = rep_vector(0, Nnodes);
vector[Nnodes] z_posterior = rep_vector(0, Nnodes);
real mu_prior;
real<lower=0> sigma_prior;
real<lower=0> sigma_root_prior;
real log_lik[Nobservations];
vector[Nobservations] z_error_prior;
// inv_gamma(2, 0.1) draw to match the prior on sigma_error^2:
// if X ~ gamma(alpha, beta) with rate beta, then 1/X ~ inv_gamma(alpha, beta)
real sigma_error_sq = 1 / gamma_rng(2, 0.1);
real sigma_error_prior = sqrt(sigma_error_sq);
for (i in 1:Ntrees) {
int idx = roots[i];
z[idx] = mu + z_normal[idx] * sigma_root;
}
for (n in 1:Nobservations) {
z_error_prior[n] = normal_rng(0, sigma_error_prior);
}
for (i in 1:Nedges) {
int j = Nedges - i + 1;
int mother_node = edges[j, 1];
int daughter_node = edges[j, 2];
real local_mu = z[mother_node];
real local_sigma = sigma * sqrt(edge_lengths[j]);
z[daughter_node] = local_mu + z_normal[daughter_node] * local_sigma;
}
mu_prior = normal_rng(0, 10);
sigma_prior = exponential_rng(1);
sigma_root_prior = exponential_rng(1);
for (i in 1:Ntrees) {
z_prior[roots[i]] = normal_rng(mu_prior, sigma_root_prior);
}
for (i in 1:Ntrees) {
int idx = roots[i];
z_posterior[idx] = mu + z_normal[idx] * sigma_root;
}
for (i in 1:Nedges) {
int j = Nedges - i + 1;
int mother_node = edges[j, 1];
int daughter_node = edges[j, 2];
real local_mu = z_posterior[mother_node];
real local_sigma = sigma * sqrt(edge_lengths[j]);
z_prior[daughter_node] = normal_rng(
z_prior[mother_node],
sigma_prior * sqrt(edge_lengths[j])
);
z_posterior[daughter_node] = local_mu + z_normal[daughter_node] * local_sigma;
}
for (n in 1:Nobservations) {
y_prior[n] = bernoulli_logit_rng(z_prior[observations[n]] + z_error_prior[n]);
y_posterior[n] = bernoulli_logit_rng(z[observations[n]] + z_error[n]);
log_lik[n] = bernoulli_logit_lpmf(Y[n] | z[observations[n]] + z_error[n]);
}
}
"
model_mixed_binary <- stan_model(model_code=mixed_binary_code)
# fit_mixed6 <- readRDS("models/fit_mixed6.rds")
fit_mixed6 <- rstan::sampling(
model_mixed_binary,
data = phylo_data,
chains = 4,
iter=40000,
thin=20,
control=list(max_treedepth=15)
)
Warning message:
“There were 4 chains where the estimated Bayesian Fraction of Missing Information was low. See
https://mc-stan.org/misc/warnings.html#bfmi-low”
Warning message:
“Examine the pairs() plot to diagnose sampling problems
”
Warning message:
“Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
Running the chains for more iterations may help. See
https://mc-stan.org/misc/warnings.html#bulk-ess”
Warning message:
“Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
Running the chains for more iterations may help. See
https://mc-stan.org/misc/warnings.html#tail-ess”
saveRDS(fit_mixed6, "models/fit_mixed6.rds")
print(fit_mixed6, pars=c("mu", "sigma", "sigma_root", "sigma_error", "lp__"))
Inference for Stan model: anon_model.
4 chains, each with iter=40000; warmup=20000; thin=20;
post-warmup draws per chain=1000, total post-warmup draws=4000.
mean se_mean sd 2.5% 25% 50% 75%
mu -1.21 0.00 0.28 -1.85 -1.38 -1.18 -1.02
sigma 3.38 0.01 0.79 2.12 2.83 3.28 3.83
sigma_root 1.73 0.01 0.56 0.80 1.34 1.67 2.04
sigma_error 0.24 0.01 0.10 0.13 0.18 0.21 0.27
lp__ -3525.01 24.65 353.58 -4337.38 -3721.87 -3482.58 -3284.97
97.5% n_eff Rhat
mu -0.73 3485 1.00
sigma 5.22 3537 1.00
sigma_root 2.99 4074 1.00
sigma_error 0.49 195 1.01
lp__ -2940.62 206 1.01
Samples were drawn using NUTS(diag_e) at Mon Feb 12 01:52:53 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
(loo_mixed <- loo(fit_mixed6))
Warning message:
“Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
”
Computed from 4000 by 1036 log-likelihood matrix
Estimate SE
elpd_loo -597.0 13.5
p_loo 228.1 7.4
looic 1193.9 27.1
------
Monte Carlo SE of elpd_loo is NA.
Pareto k diagnostic values:
Count Pct. Min. n_eff
(-Inf, 0.5] (good) 647 62.5% 493
(0.5, 0.7] (ok) 324 31.3% 211
(0.7, 1] (bad) 64 6.2% 43
(1, Inf) (very bad) 1 0.1% 39
See help('pareto-k-diagnostic') for details.
# Extract the generated quantities
generated_quantities <- extract(fit_mixed6)
Y <- binary_data$Y
empirical <- tibble(
tv = total_variance(Y),
lv = lineage_wise_variance(Y, variable),
cv = crossfamily_variance(Y, variable)
)
prior_pc <- tibble(
tv = apply(generated_quantities$y_prior, 1, total_variance),
lv = apply(
generated_quantities$y_prior, 1, function(x) lineage_wise_variance(x, variable)
),
cv = apply(
generated_quantities$y_prior, 1, function(x) crossfamily_variance(x, variable)
),
)
predictive_plot(prior_pc)
posterior_pc <- tibble(
tv = apply(generated_quantities$y_posterior, 1, total_variance),
lv = apply(
generated_quantities$y_posterior, 1, function(x) lineage_wise_variance(x, variable)
),
cv = apply(
generated_quantities$y_posterior, 1, function(x) crossfamily_variance(x, variable)
),
)
predictive_plot(posterior_pc)
model comparison
tibble(model = c("binary", "brownian", "mixed"),
looic = c(
loo_binary$estimates[3,1],
loo_brownian$estimates[3,1],
loo_mixed$estimates[3,1]
)
) %>%
mutate(delta_looic = round(looic-min(looic), 3)) %>%
arrange(delta_looic)

| model | looic | delta_looic |
|---|---|---|
| <chr> | <dbl> | <dbl> |
| mixed | 1193.911 | 0.000 |
| brownian | 1195.849 | 1.937 |
| binary | 1331.121 | 137.210 |
Variable 7: crop_type
variable <- "crop_type"
Y <- d %>%
select(all_of(variable)) %>%
na.omit %>%
pull
Y <- match(Y, sort(unique(Y)))
ordinal_data <- list()
ordinal_data$Ndata <- length(Y)
ordinal_data$nthres <- max(Y) - 1
ordinal_data$Y <- Y
fit_ordinal <- rstan::sampling(
model_ordinal,
data=ordinal_data,
chains=4,
iter=2000,
)
print(fit_ordinal, pars=c("Intercept", "lp__"))
Inference for Stan model: anon_model.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.
mean se_mean sd 2.5% 25% 50% 75% 97.5%
Intercept[1] -1.50 0.00 0.08 -1.66 -1.55 -1.49 -1.44 -1.35
Intercept[2] -1.48 0.00 0.08 -1.64 -1.53 -1.48 -1.43 -1.33
Intercept[3] -1.44 0.00 0.08 -1.60 -1.50 -1.44 -1.39 -1.29
Intercept[4] -1.04 0.00 0.07 -1.18 -1.09 -1.04 -1.00 -0.91
Intercept[5] -0.12 0.00 0.06 -0.24 -0.16 -0.12 -0.08 -0.01
lp__ -1342.38 0.04 1.60 -1346.35 -1343.20 -1342.06 -1341.21 -1340.28
n_eff Rhat
Intercept[1] 3286 1
Intercept[2] 3260 1
Intercept[3] 3411 1
Intercept[4] 4344 1
Intercept[5] 5025 1
lp__ 1827 1
Samples were drawn using NUTS(diag_e) at Mon Feb 12 01:59:13 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
(loo_ordinal <- loo(fit_ordinal))
Computed from 4000 by 1103 log-likelihood matrix
Estimate SE
elpd_loo -1323.4 24.0
p_loo 4.7 0.7
looic 2646.8 48.1
------
Monte Carlo SE of elpd_loo is 0.0.
All Pareto k estimates are good (k < 0.5).
See help('pareto-k-diagnostic') for details.
# Extract the generated quantities
generated_quantities <- extract(fit_ordinal)
Y <- ordinal_data$Y
empirical <- tibble(
tv = total_variance(Y),
lv = lineage_wise_variance(Y, variable),
cv = crossfamily_variance(Y, variable)
)
prior_pc <- tibble(
tv = apply(generated_quantities$y_prior, 1, total_variance),
lv = apply(
generated_quantities$y_prior, 1, function(x) lineage_wise_variance(x, variable)
),
cv = apply(
generated_quantities$y_prior, 1, function(x) crossfamily_variance(x, variable)
),
)
predictive_plot(prior_pc)
posterior_pc <- tibble(
tv = apply(generated_quantities$y_posterior, 1, total_variance),
lv = apply(
generated_quantities$y_posterior, 1, function(x) lineage_wise_variance(x, variable)
),
cv = apply(
generated_quantities$y_posterior, 1, function(x) crossfamily_variance(x, variable)
),
)
predictive_plot(posterior_pc)
defined <- which(!is.na(d[[variable]]))
undefined <- which(is.na(d[[variable]]))
po <- c()
for (i in get_postorder(roots, edges)) {
if ((i %in% edges[,2]) & !(i %in% undefined)) {
po <- c(po, i)
}
}
edge_po <- match(po, edges[,2])
phylo_data <- list()
phylo_data$Ntrees <- length(trees)
phylo_data$Nnodes <- nrow(nodes)
phylo_data$Y <- ordinal_data$Y
phylo_data$Nobservations <- length(phylo_data$Y)
phylo_data$Nedges <- length(edge_po)
phylo_data$observations <- match(pull(d[defined, "glottocode"]), nodes$label)
phylo_data$roots <- roots
phylo_data$edges <- edges[edge_po,]
phylo_data$edge_lengths <- edge_lengths[edge_po]
phylo_data$nthres <- length(unique(phylo_data$Y)) - 1
fit_brownian <- rstan::sampling(
model_brownian,
data = phylo_data,
chains = 4,
iter=2000
)
print(fit_brownian, pars=c("mu", "sigma", "sigma_root", "Intercept", "lp__"))
Inference for Stan model: anon_model.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.
mean se_mean sd 2.5% 25% 50% 75%
mu -0.93 0.01 0.33 -1.58 -1.15 -0.93 -0.72
sigma 1.71 0.01 0.26 1.25 1.53 1.69 1.87
sigma_root 3.27 0.01 0.41 2.54 2.99 3.25 3.53
Intercept[1] -2.43 0.00 0.20 -2.81 -2.56 -2.42 -2.29
Intercept[2] -2.39 0.00 0.20 -2.78 -2.52 -2.39 -2.26
Intercept[3] -2.32 0.00 0.20 -2.70 -2.45 -2.32 -2.18
Intercept[4] -1.56 0.00 0.18 -1.93 -1.69 -1.56 -1.43
Intercept[5] 0.08 0.00 0.17 -0.25 -0.03 0.08 0.19
lp__ -4111.76 1.22 39.06 -4186.64 -4138.22 -4112.04 -4084.78
97.5% n_eff Rhat
mu -0.29 1167 1.00
sigma 2.26 969 1.00
sigma_root 4.17 936 1.01
Intercept[1] -2.05 2563 1.00
Intercept[2] -2.01 2596 1.00
Intercept[3] -1.95 2708 1.00
Intercept[4] -1.21 3863 1.00
Intercept[5] 0.40 5636 1.00
lp__ -4034.93 1028 1.00
Samples were drawn using NUTS(diag_e) at Mon Feb 12 02:04:48 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
(loo_brownian <- loo(fit_brownian))
Warning message:
“Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
”
Computed from 4000 by 1103 log-likelihood matrix
Estimate SE
elpd_loo -1045.2 31.4
p_loo 200.5 8.3
looic 2090.4 62.8
------
Monte Carlo SE of elpd_loo is NA.
Pareto k diagnostic values:
Count Pct. Min. n_eff
(-Inf, 0.5] (good) 939 85.1% 303
(0.5, 0.7] (ok) 98 8.9% 135
(0.7, 1] (bad) 64 5.8% 48
(1, Inf) (very bad) 2 0.2% 54
See help('pareto-k-diagnostic') for details.
# Extract the generated quantities
generated_quantities <- extract(fit_brownian)
Y <- ordinal_data$Y
empirical <- tibble(
tv = total_variance(Y),
lv = lineage_wise_variance(Y, variable),
cv = crossfamily_variance(Y, variable)
)
prior_pc <- tibble(
tv = apply(generated_quantities$y_prior, 1, total_variance),
lv = apply(
generated_quantities$y_prior, 1, function(x) lineage_wise_variance(x, variable)
),
cv = apply(
generated_quantities$y_prior, 1, function(x) crossfamily_variance(x, variable)
),
)
predictive_plot(prior_pc)
posterior_pc <- tibble(
tv = apply(generated_quantities$y_posterior, 1, total_variance),
lv = apply(
generated_quantities$y_posterior, 1, function(x) lineage_wise_variance(x, variable)
),
cv = apply(
generated_quantities$y_posterior, 1, function(x) crossfamily_variance(x, variable)
),
)
predictive_plot(posterior_pc)
# fit_mixed7 <- readRDS("models/fit_mixed7.rds")
fit_mixed7 <- rstan::sampling(
model_mixed,
data = phylo_data,
chains = 4,
iter=40000,
thin=20,
control=list(max_treedepth=15)
)
Warning message:
“There were 4 chains where the estimated Bayesian Fraction of Missing Information was low. See
https://mc-stan.org/misc/warnings.html#bfmi-low”
Warning message:
“Examine the pairs() plot to diagnose sampling problems
”
Warning message:
“Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
Running the chains for more iterations may help. See
https://mc-stan.org/misc/warnings.html#bulk-ess”
Warning message:
“Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
Running the chains for more iterations may help. See
https://mc-stan.org/misc/warnings.html#tail-ess”
saveRDS(fit_mixed7, "models/fit_mixed7.rds")
print(fit_mixed7, pars=c("mu", "sigma", "sigma_root", "Intercept", "sigma_error", "lp__"))
Inference for Stan model: anon_model.
4 chains, each with iter=40000; warmup=20000; thin=20;
post-warmup draws per chain=1000, total post-warmup draws=4000.
mean se_mean sd 2.5% 25% 50% 75%
mu -1.06 0.01 0.35 -1.77 -1.30 -1.05 -0.82
sigma 1.75 0.00 0.27 1.28 1.56 1.74 1.92
sigma_root 3.33 0.01 0.41 2.60 3.03 3.30 3.59
Intercept[1] -2.47 0.00 0.21 -2.90 -2.61 -2.47 -2.33
Intercept[2] -2.44 0.00 0.21 -2.86 -2.57 -2.43 -2.29
Intercept[3] -2.36 0.00 0.21 -2.79 -2.50 -2.36 -2.22
Intercept[4] -1.60 0.00 0.19 -1.98 -1.72 -1.59 -1.47
Intercept[5] 0.06 0.00 0.17 -0.29 -0.05 0.07 0.18
sigma_error 0.23 0.01 0.09 0.12 0.17 0.21 0.26
lp__ -3960.97 29.75 373.59 -4782.49 -4175.64 -3922.39 -3694.76
97.5% n_eff Rhat
mu -0.38 4000 1.00
sigma 2.33 4011 1.00
sigma_root 4.23 4011 1.00
Intercept[1] -2.08 3931 1.00
Intercept[2] -2.04 3968 1.00
Intercept[3] -1.98 3961 1.00
Intercept[4] -1.24 3993 1.00
Intercept[5] 0.39 3929 1.00
sigma_error 0.45 144 1.03
lp__ -3329.02 158 1.03
Samples were drawn using NUTS(diag_e) at Mon Feb 12 03:24:58 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
(loo_mixed <- loo(fit_mixed7))
Warning message:
“Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
”
Computed from 4000 by 1103 log-likelihood matrix
Estimate SE
elpd_loo -1043.6 31.3
p_loo 210.0 8.3
looic 2087.1 62.7
------
Monte Carlo SE of elpd_loo is NA.
Pareto k diagnostic values:
Count Pct. Min. n_eff
(-Inf, 0.5] (good) 933 84.6% 366
(0.5, 0.7] (ok) 94 8.5% 216
(0.7, 1] (bad) 73 6.6% 41
(1, Inf) (very bad) 3 0.3% 27
See help('pareto-k-diagnostic') for details.
# Extract the generated quantities
generated_quantities <- extract(fit_mixed7)
Y <- ordinal_data$Y
empirical <- tibble(
tv = total_variance(Y),
lv = lineage_wise_variance(Y, variable),
cv = crossfamily_variance(Y, variable)
)
prior_pc <- tibble(
tv = apply(generated_quantities$y_prior, 1, total_variance),
lv = apply(
generated_quantities$y_prior, 1, function(x) lineage_wise_variance(x, variable)
),
cv = apply(
generated_quantities$y_prior, 1, function(x) crossfamily_variance(x, variable)
),
)
predictive_plot(prior_pc)
posterior_pc <- tibble(
tv = apply(generated_quantities$y_posterior, 1, total_variance),
lv = apply(
generated_quantities$y_posterior, 1, function(x) lineage_wise_variance(x, variable)
),
cv = apply(
generated_quantities$y_posterior, 1, function(x) crossfamily_variance(x, variable)
),
)
predictive_plot(posterior_pc)
model comparison
tibble(model = c("ordinal", "brownian", "mixed"),
looic = c(
loo_ordinal$estimates[3,1],
loo_brownian$estimates[3,1],
loo_mixed$estimates[3,1]
)
) %>%
mutate(delta_looic = round(looic-min(looic), 3)) %>%
arrange(delta_looic)

| model | looic | delta_looic |
|---|---|---|
| <chr> | <dbl> | <dbl> |
| mixed | 2087.136 | 0.000 |
| brownian | 2090.430 | 3.295 |
| ordinal | 2646.801 | 559.666 |
wrapping up
For all seven variables, the mixed model is clearly better than the non-phylogenetic baseline (ordinal or logistic) model, and it is about as good as the Brownian-motion model. It is thus justified to use the mixed model for the downstream tasks.
# # create directory "models"
# dir.create("models")
z_error <- list(
extract(fit_mixed1, pars="z_error")$z_error,
extract(fit_mixed2, pars="z_error")$z_error,
extract(fit_mixed3, pars="z_error")$z_error,
extract(fit_mixed4, pars="z_error")$z_error,
extract(fit_mixed5, pars="z_error")$z_error,
extract(fit_mixed6, pars="z_error")$z_error,
extract(fit_mixed7, pars="z_error")$z_error
)
z <- list(
extract(fit_mixed1, pars="z")$z,
extract(fit_mixed2, pars="z")$z,
extract(fit_mixed3, pars="z")$z,
extract(fit_mixed4, pars="z")$z,
extract(fit_mixed5, pars="z")$z,
extract(fit_mixed6, pars="z")$z,
extract(fit_mixed7, pars="z")$z
)
variables <- colnames(d)[3:9]

# For each of the seven variables, create a residuals column in d: rows where the
# variable is NA stay NA; the other rows get the posterior mean of z_error[[i]].
for (i in 1:7) {
d[[paste0(variables[i], "_residuals")]] <- NA
d[[paste0(variables[i], "_residuals")]][!is.na(d[[variables[i]]])] <- apply(z_error[[i]], 2, mean)
}
tip_indices <- match(d$glottocode, nodes$label)
for (i in 1:7) {
d[[paste0(variables[i], "_z")]] <- apply(z[[i]], 2, mean)[tip_indices]
}
write_csv(d, "../data/data_with_latent_variables.csv")
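Before using the exported file downstream, it is worth checking that the new residual columns line up with the missingness pattern of the original variables. A minimal sketch (assuming `d` and `variables` as above):

```r
# Hedged sanity check: residuals should be NA exactly where the raw variable is NA.
stopifnot(all(sapply(variables, function(v) {
  identical(is.na(d[[v]]), is.na(d[[paste0(v, "_residuals")]]))
})))
```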