library(repr)
options(repr.plot.width=12, repr.plot.height=10)
Finding the best marginal model for each variable
Before I investigate the correlations between the variables, I want to check which phylogenetic model provides the best fit of the individual variables.
All variables are ordinal.
I will test two models for each variable:
- ordinal regression
- Poisson regression
Both will be done in a non-phylogenetic and a phylogenetic setting.
For all models, Bayesian inference will be used, implemented in Stan.
preparation
Im my experience, R is the best environment for working with Stan.
I start with loading relevant packages.
library(tidyverse)
library(gridExtra)
library(patchwork)
library(stringr)
library(reshape2)
library(rstan)
library(bridgesampling)
library(loo)
library(ape)
library(bayesplot)
library(shinystan)
library(coda)
library(codetools)
library(brms)
library(abind)
options(mc.cores = parallel::detectCores())
rstan_options("auto_write" = TRUE)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.0 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Attaching package: ‘gridExtra’
The following object is masked from ‘package:dplyr’:
combine
Attaching package: ‘reshape2’
The following object is masked from ‘package:tidyr’:
smiths
Loading required package: StanHeaders
rstan version 2.32.6 (Stan version 2.32.2)
For execution on a local, multicore CPU with excess RAM we recommend calling
options(mc.cores = parallel::detectCores()).
To avoid recompilation of unchanged Stan programs, we recommend calling
rstan_options(auto_write = TRUE)
For within-chain threading using `reduce_sum()` or `map_rect()` Stan functions,
change `threads_per_chain` option:
rstan_options(threads_per_chain = 1)
Attaching package: ‘rstan’
The following object is masked from ‘package:tidyr’:
extract
This is loo version 2.7.0
- Online documentation and vignettes at mc-stan.org/loo
- As of v2.0.0 loo defaults to 1 core but we recommend using as many as possible. Use the 'cores' argument or set options(mc.cores = NUM_CORES) for an entire session.
Attaching package: ‘loo’
The following object is masked from ‘package:rstan’:
loo
Attaching package: ‘ape’
The following object is masked from ‘package:dplyr’:
where
This is bayesplot version 1.11.1
- Online documentation and vignettes at mc-stan.org/bayesplot
- bayesplot theme set to bayesplot::theme_default()
* Does _not_ affect other ggplot2 plots
* See ?bayesplot_theme_set for details on theme setting
Loading required package: shiny
This is shinystan version 2.6.0
Attaching package: ‘coda’
The following object is masked from ‘package:rstan’:
traceplot
Loading required package: Rcpp
Loading 'brms' package (version 2.20.4). Useful instructions
can be found by typing help('brms'). A more detailed introduction
to the package is available through vignette('brms_overview').
Attaching package: ‘brms’
The following object is masked from ‘package:bayesplot’:
rhat
The following object is masked from ‘package:bridgesampling’:
bf
The following object is masked from ‘package:rstan’:
loo
The following object is masked from ‘package:stats’:
ar
<- rgb(206/255, 143/255, 137/255)
karminrot <- rgb(217/255, 205/255, 177/255)
gold <- rgb(143/255, 143/255, 149/255) anthrazit
Next, the data are loaded into a tibble called d
.
<- read_csv("../data/data_with_families.csv") %>%
d distinct(glottocode, .keep_all = TRUE)
<- d$glottocode
taxa head(d)
Rows: 1291 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): soc_id, glottocode, Family
dbl (7): political_complexity, hierarchy_within, domestic_organisation, agri...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
soc_id | glottocode | political_complexity | hierarchy_within | domestic_organisation | agricultureLevel | settlement_strategy | exogamy | crop_type | Family |
---|---|---|---|---|---|---|---|---|---|
<chr> | <chr> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <chr> |
Aa1 | juho1239 | 1 | 3 | 3 | 0 | 1 | 1 | 1 | Kxa |
Aa2 | okie1245 | 1 | 2 | 1 | 0 | 1 | 1 | 1 | Nilotic |
Aa3 | nama1265 | 2 | 2 | 1 | 0 | 1 | 1 | 1 | Khoe_Kwadi |
Aa4 | dama1270 | NA | NA | 2 | 0 | NA | 1 | NA | Khoe_Kwadi |
Aa5 | bila1255 | 1 | 2 | 1 | 0 | 1 | 1 | 1 | Atlantic_Congo |
Aa6 | sand1273 | 2 | 2 | 1 | 5 | 3 | 1 | 6 | Sandawe |
In the next step, the maximum clade credibility trees are loaded as phylo
objects.
<- list.files("../data/mcc_trees", pattern = ".tre", full.names = TRUE)
mcc_trees
<- purrr::map(mcc_trees, read.tree)
trees
# get a vector of tip labels of all trees in trees
<- purrr::map(trees, function(x) x$tip.label) %>% unlist %>% unique tips
A brief sanity check to ensure that all tips of the trees correspond to a Glottocode in d
.
if (length(setdiff(tips, taxa)) == 0 ) {
print("All tips in trees are in taxa")
else {
} print("Not all tips in trees are in taxa")
}
[1] "All tips in trees are in taxa"
All Glottocodes in d
that do not occur as a tip are isolates (for the purpose of this study).
For each isolate I create a list
object with the fields "tip.lable"
and "Nnode"
, to make them compatible with the phylo
objects for language families. These list objects are added to the list trees
.
<- setdiff(taxa, tips)
isolates
# convert isolates to one-node trees and add to trees
for (i in isolates) {
<- list()
tr "tip.label"]] <- i
tr[["Nnode"]] <- 0
tr[[length(trees) + 1]] <- tr
trees[[ }
Just to make sure, I test whether the tips of this forest are exactly the Glottocodes in the data.
<- c()
tips for (tr in trees) {
<- c(tips, tr$tip.label)
tips
}
if (all(sort(tips) == sort(taxa))) {
print("taxa and tips are identical")
else {
} print("taxa and tips are not identical")
}
[1] "taxa and tips are identical"
Now I re-order the rows in d
to make sure they are in the same order as tips
.
<- d[match(tips, d$glottocode), ] d
The trees in trees
form a forest, i.e., a disconnected directed acyclic graph. I want to construct a representation of this graph which I can pass to Stan, consisting of
- the number of nodes
N
, - a Nx2-matrix
edges
, and - a length-N vector of edge lengths,
edge_lengths
. - a vector of root nodes
- a vector of tip nodes
First I create an Nx4 matrix where each row represents one node. The columns are
- node number
- tree number
- tree-internal node number
- label
For internal nodes, the label is the empty string.
<- c()
node_matrix <- 1
node_number for (tn in 1:length(trees)) {
<- trees[[tn]]
tr <- length(tr$tip.label) + tr$Nnode
tr_nodes for (i in 1:tr_nodes) {
<- ""
tl if (i <= length(tr$tip.label)) {
<- tr$tip.label[i]
tl
}<- rbind(node_matrix, c(node_number, tn, i, tl))
node_matrix <- node_number+1
node_number
} }
I convert the matrix to a tibble.
<- tibble(
nodes node_number = as.numeric(node_matrix[,1]),
tree_number = as.numeric(node_matrix[,2]),
internal_number = as.numeric(node_matrix[,3]),
label = node_matrix[,4]
)head(nodes)
tail(nodes)
node_number | tree_number | internal_number | label |
---|---|---|---|
<dbl> | <dbl> | <dbl> | <chr> |
1 | 1 | 1 | kron1241 |
2 | 1 | 2 | tuli1249 |
3 | 1 | 3 | |
4 | 2 | 1 | zarm1239 |
5 | 2 | 2 | koyr1242 |
6 | 2 | 3 |
node_number | tree_number | internal_number | label |
---|---|---|---|
<dbl> | <dbl> | <dbl> | <chr> |
2231 | 179 | 1 | qawa1238 |
2232 | 180 | 1 | cham1315 |
2233 | 181 | 1 | nort2971 |
2234 | 182 | 1 | trum1247 |
2235 | 183 | 1 | sout2994 |
2236 | 184 | 1 | guat1253 |
Next I define a function that maps a tree number and an internal node number to the global node number.
<- function(tree_number, local_id) which((nodes$tree_number == tree_number) & (nodes$internal_number == local_id)) get_global_id
Now I create a matrix of edges and, concomittantly, a vector of edge lengths.
<- c()
edges <- c()
edge_lengths for (tn in 1:length(trees)) {
<- trees[[tn]]
tr if (length(tr$edge) > 0 ) {
for (en in 1:nrow(tr$edge)) {
<- tr$edge[en, 1]
mother <- tr$edge[en, 2]
daughter <- get_global_id(tn, mother)
mother_id <- get_global_id(tn, daughter)
daughter_id <- tr$edge.length[en]
el <- rbind(edges, c(mother_id, daughter_id))
edges <- c(edge_lengths, el)
edge_lengths
}
} }
Finally I identify the root nodes and tip nodes.
<- c(sort(setdiff(edges[,1], edges[,2])), match(isolates, nodes$label))
roots <- which(nodes$label != "") tips
Next I define a function computing the post-order.
<- function(roots, edges) {
get_postorder <- roots
input <- c()
output while (length(input) > 0) {
<- input[1]
pivot <- setdiff(edges[edges[,1] == pivot, 2], output)
daughters if (length(daughters) == 0) {
<- input[-1]
input <- c(output, pivot)
output else {
} <- c(daughters, input)
input
}
}return(output)
}
orrdinal regression
For orrdinal regression, the model specifications are as follows. \(K\) is the number of possible values.
\[ \begin{aligned} C_k &\sim \mathcal N(0, 5)\\ Y_i=1 &\sim \mathrm{Bernoulli}(\mathrm{logit}^{-1}(C_1))\\ Y_i=k &\sim \mathrm{Bernoulli}(\mathrm{logit}^{-1}(C_k) - \mathrm{logit}^{-1}(C_{k-1}))\\ Y_K=K &\sim \mathrm{Bernoulli}(1-\mathrm{logit}^{-1}(C_{K-1}))\\ \end{aligned} \]
Variable 1: Political complexity
<- "political_complexity" variable
<- d %>%
Y select(all_of(variable)) %>%
%>%
na.omit
pull<- match(Y, sort(unique(Y))) Y
<- list()
ordinal_data $Ndata <- length(Y)
ordinal_data$nthres <- max(Y) -1
ordinal_data$Y <- Y ordinal_data
I generated Stan code for ordinal regression from brms and then did some manual adjustments.
<- "
ordinal_code data {
int<lower=1> Ndata; // total number of observations
int Y[Ndata]; // response variable
int<lower=2> nthres; // number of thresholds
}
parameters {
ordered[nthres] Intercept; // temporary thresholds for centered predictors
}
model {
for (n in 1:Ndata) {
target += ordered_logistic_lpmf(Y[n] | 0, Intercept);
}
target += normal_lpdf(Intercept | 0, 5);
}
generated quantities {
int y_prior[Ndata];
int y_posterior[Ndata];
vector[nthres] Intercept_sample;
ordered[nthres] Intercept_prior;
vector[Ndata] log_lik;
for (k in 1:nthres) {
Intercept_sample[k] = normal_rng(0, 5);
}
Intercept_prior = sort_asc(Intercept_sample);
for (n in 1:Ndata) {
y_prior[n] = ordered_logistic_rng(0, Intercept_prior);
y_posterior[n] = ordered_logistic_rng(0, Intercept);
log_lik[n] = ordered_logistic_lpmf(Y[n] | 0, Intercept);
}
}
"
<- stan_model(model_code=ordinal_code) model_ordinal
<- rstan::sampling(
fit_ordinal
model_ordinal,data=ordinal_data,
chains=4,
iter=2000,
)
print(fit_ordinal, pars=c("Intercept", "lp__"))
Inference for Stan model: anon_model.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.
mean se_mean sd 2.5% 25% 50% 75% 97.5%
Intercept[1] -0.25 0.00 0.06 -0.37 -0.29 -0.24 -0.20 -0.12
Intercept[2] 1.06 0.00 0.07 0.93 1.01 1.06 1.11 1.20
Intercept[3] 2.06 0.00 0.10 1.87 2.00 2.06 2.13 2.26
Intercept[4] 3.17 0.00 0.16 2.88 3.07 3.17 3.28 3.50
lp__ -1437.01 0.03 1.46 -1440.74 -1437.73 -1436.66 -1435.94 -1435.20
n_eff Rhat
Intercept[1] 3788 1
Intercept[2] 5320 1
Intercept[3] 5574 1
Intercept[4] 4919 1
lp__ 1775 1
Samples were drawn using NUTS(diag_e) at Sat Mar 16 20:09:19 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
<- loo(fit_ordinal)) (loo_ordinal
Computed from 4000 by 1077 log-likelihood matrix.
Estimate SE
elpd_loo -1429.0 21.8
p_loo 4.1 0.2
looic 2858.0 43.6
------
MCSE of elpd_loo is 0.0.
MCSE and ESS estimates assume MCMC draws (r_eff in [0.8, 1.2]).
All Pareto k estimates are good (k < 0.7).
See help('pareto-k-diagnostic') for details.
# Extract the generated quantities
<- extract(fit_ordinal) generated_quantities
Predictive checks
Following Jäger & Wahle (2021), I define three statistics to be used in prior and posterior predictive checks:
- total variance (tv)
- mean lineage-wise variance (lv)
- cross-family variance (cv)
Here are the corresponding R functions.
<- function(Y) var(Y) total_variance
<- function(Y, variable=variable) {
lineage_wise_variance <- d %>%
out filter(!is.na(d[[variable]])) %>%
mutate(Y = Y) %>%
group_by(Family) %>%
summarize(variance = var(Y, na.rm = TRUE)) %>%
summarize(mean_variance = mean(variance, na.rm = TRUE)) %>%
pullreturn(out)
}
<- function(Y, variable=variable) {
crossfamily_variance <- d %>%
out filter(!is.na(d[[variable]])) %>%
mutate(Y = Y) %>%
group_by(Family) %>%
summarize(centroids = mean(Y, na.rm = TRUE)) %>%
pull(centroids) %>%
varreturn(out)
}
First we get the empirical values.
<- ordinal_data$Y
Y <- tibble(
empirical tv = total_variance(Y),
lv = lineage_wise_variance(Y, variable),
cv = crossfamily_variance(Y, variable)
)
Thanks to the generated quantities
block in the Stan programm, fitting the model simultaneously produced prior and posterior predictive samples.
<- tibble(
prior_pc tv = apply(generated_quantities$y_prior, 1, total_variance),
lv = apply(
$y_prior, 1, function(x) lineage_wise_variance(x, variable)
generated_quantities
),cv = apply(
$y_prior, 1, function(x) crossfamily_variance(x, variable)
generated_quantities
), )
<- function(prior_pc) {
predictive_plot <- prior_pc %>%
p1 mutate(
x_jittered = 1 + rnorm(n(), mean = 0, sd = 0.01),
y_jittered = tv + rnorm(n(), mean = 0, sd = sd(prior_pc$tv)/10)
%>%
) ggplot(aes(x = x_jittered, y = tv)) +
geom_boxplot(aes(x = factor(1), y = tv), width = 0.5, fill=karminrot) +
geom_point(alpha = 0.5, size = 0.1, position = position_identity()) +
theme_grey(base_size = 16) +
theme(
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_text(size = 18),
plot.title = element_text(size = 20, hjust = 0.5)
+
) labs(title = "total variance", y = "") +
geom_point(aes(x=1, y=empirical$tv, color="empirical"), size=5, fill='red') +
guides(color = "none")
<- prior_pc %>%
p2 mutate(
x_jittered = 1 + rnorm(n(), mean = 0, sd = 0.01),
y_jittered = lv + rnorm(n(), mean = 0, sd = sd(prior_pc$lv)/10)
%>%
) ggplot(aes(x = x_jittered, y = lv)) +
geom_boxplot(aes(x = factor(1), y = lv), width = 0.5, fill=gold) +
geom_point(alpha = 0.5, size = 0.1, position = position_identity()) +
theme_grey(base_size = 16) +
theme(
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_text(size = 18),
plot.title = element_text(size = 20, hjust = 0.5)
+
) labs(title = "lineage-wise variance", y = "") +
geom_point(aes(x=1, y=empirical$lv, color="empirical"), size=5, fill='red') +
guides(color = "none")
<- prior_pc %>%
p3 mutate(
x_jittered = 1 + rnorm(n(), mean = 0, sd = 0.01),
y_jittered = cv + rnorm(n(), mean = 0, sd = sd(prior_pc$cv)/10)
%>%
) ggplot(aes(x = x_jittered, y = cv)) +
geom_boxplot(aes(x = factor(1), y = cv), width = 0.5, fill=anthrazit) +
geom_point(alpha = 0.5, size = 0.1, position = position_identity()) +
theme_grey(base_size = 16) +
theme(
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_text(size = 18),
plot.title = element_text(size = 20, hjust = 0.5)
+
) labs(title = "cross-lineage variance", y = "") +
geom_point(aes(x=1, y=empirical$cv, color="empirical"), size=5, fill='red') +
labs(color = "") +
guides(color = "none")
return(grid.arrange(p1, p2, p3, ncol=3))
}
predictive_plot(prior_pc)
Warning message in geom_point(aes(x = 1, y = empirical$tv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$lv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$cv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
We see that the prior cover the empirical values quite well. Here is what happens after the model is fitted:
<- tibble(
posterior_pc tv = apply(generated_quantities$y_posterior, 1, total_variance),
lv = apply(
$y_posterior, 1, function(x) lineage_wise_variance(x, variable)
generated_quantities
),cv = apply(
$y_posterior, 1, function(x) crossfamily_variance(x, variable)
generated_quantities
),
)predictive_plot(posterior_pc)
Warning message in geom_point(aes(x = 1, y = empirical$tv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$lv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$cv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Something is seriously wrong with this model, obviously.
Setting up a phylogenetic model
Brownian motion
Next I will consider the phylogenetic Brownian motion model, combined with a logistic component. Here the basic assumptions are
- there is a continuous latent variable \(z\) which has a value at each node
- at the root nodes, \(z\) is drawn from a distribution like \(\mathcal N(0, 1)\)
- once the value of \(z\) at a mother node is fixed, the values at the daughter nodes is drawn from \(\mathcal N(z_i, \sigma t_j)\), where \(z_i\) is the value at the root and \(t_j\) is the branch length to the daughter
- the observed values \(y_i\) at tip \(i\) are distributed as
\[ \begin{aligned} Y_i=1 &\sim \mathrm{Bernoulli}(\mathrm{logit}^{-1}(C_1-z_i))\\ Y_i=k &\sim \mathrm{Bernoulli}(\mathrm{logit}^{-1}(C_k-z_i) - \mathrm{logit}^{-1}(C_{k-1}-z_i))\\ Y_K=K &\sim \mathrm{Bernoulli}(1-\mathrm{logit}^{-1}(C_{K-1}-z_i))\\ \end{aligned} \]
The parameter \(\sigma\) can be interpreted as the rate of evolution.
<- which(!is.na(d[[variable]]))
defined <- which(is.na(d[[variable]])) undefined
<- c()
po for (i in get_postorder(roots, edges)) {
if ((i %in% edges[,2]) & !(i %in% undefined)) {
<- c(po, i)
po
}
}<- match(po, edges[,2]) edge_po
= list()
phylo_data $Ntrees <- length(trees)
phylo_data$Nnodes <- nrow(nodes)
phylo_data$Y <- ordinal_data$Y
phylo_data$Nobservations <- length(phylo_data$Y)
phylo_data$Nedges <- length(edge_po)
phylo_data$observations <- match(pull(d[defined, "glottocode"]), nodes$label)
phylo_data$roots <- roots
phylo_data$edges <- edges[edge_po,]
phylo_data$edge_lengths <- edge_lengths[edge_po]
phylo_data$nthres <- length(unique(phylo_data$Y)) - 1 phylo_data
<- "
brown_code data {
int<lower=1> Ntrees;
int<lower=0> Nnodes;
int<lower=1> Nobservations;
int<lower=1> Nedges;
int<lower=1,upper=Nnodes> observations[Nobservations]; // indices of the observed tips
int<lower=1, upper=Nnodes> roots[Ntrees];
int<lower=1, upper=Nnodes> edges[Nedges, 2];
vector[Nedges] edge_lengths;
int<lower=2> nthres; // number of thresholds
int Y[Nobservations]; // response variable
}
parameters {
vector[Nnodes] z_normal;
real<lower=0> sigma;
real mu_root;
real<lower=0> sigma_root;
ordered[nthres] Intercept; // temporary thresholds for centered predictors
}
model {
vector[Nnodes] z = rep_vector(0, Nnodes);
target += exponential_lpdf(sigma | 1);
target += normal_lpdf(z_normal | 0, 1);
target += normal_lpdf(Intercept | 0, 5);
for (i in 1:Ntrees) {
int idx = roots[i];
z[idx] = mu_root + z_normal[idx] * sigma_root;
}
for (i in 1:Nedges) {
int j = Nedges - i + 1;
int mother_node = edges[j, 1];
int daughter_node = edges[j, 2];
real local_mu = z[mother_node];
real local_sigma = sigma * sqrt(edge_lengths[j]);
z[daughter_node] = local_mu + z_normal[daughter_node] * local_sigma;
}
for (n in 1:Nobservations) {
target += ordered_logistic_lpmf(Y[n] | z[observations[n]], Intercept);
}
}
generated quantities {
vector[Nnodes] z = rep_vector(0, Nnodes);
int y_prior[Nobservations]; // Prior predictive samples
int y_posterior[Nobservations]; // Posterior predictive samples
vector[Nnodes] z_prior = rep_vector(0, Nnodes);
vector[Nnodes] z_posterior = rep_vector(0, Nnodes);
real<lower=0> sigma_prior;
real mu_root_prior;
real<lower=0> sigma_root_prior;
real log_lik[Nobservations];
vector[nthres] Intercept_sample;
ordered[nthres] Intercept_prior;
for (i in 1:Ntrees) {
int idx = roots[i];
z[idx] = mu_root + z_normal[idx] * sigma_root;
}
for (i in 1:Nedges) {
int j = Nedges - i + 1;
int mother_node = edges[j, 1];
int daughter_node = edges[j, 2];
real local_mu = z[mother_node];
real local_sigma = sigma * sqrt(edge_lengths[j]);
z[daughter_node] = local_mu + z_normal[daughter_node] * local_sigma;
}
for (k in 1:nthres) {
Intercept_sample[k] = normal_rng(0, 5);
}
Intercept_prior = sort_asc(Intercept_sample);
sigma_prior = exponential_rng(1);
mu_root_prior = normal_rng(0, 1);
sigma_root_prior = exponential_rng(1);
for (i in 1:Ntrees) {
z_prior[roots[i]] = normal_rng(mu_root_prior, sigma_root_prior);
}
for (i in 1:Ntrees) {
int idx = roots[i];
z_posterior[idx] = mu_root + z_normal[idx] * sigma_root;
}
for (i in 1:Nedges) {
int j = Nedges - i + 1;
int mother_node = edges[j, 1];
int daughter_node = edges[j, 2];
real local_mu = z_posterior[mother_node];
real local_sigma = sigma * sqrt(edge_lengths[j]);
z_prior[daughter_node] = normal_rng(
z_prior[mother_node],
sigma_prior * sqrt(edge_lengths[j])
);
z_posterior[daughter_node] = local_mu + z_normal[daughter_node] * local_sigma;
}
for (n in 1:Nobservations) {
y_prior[n] = ordered_logistic_rng(z_prior[observations[n]], Intercept_prior);
y_posterior[n] = ordered_logistic_rng(z[observations[n]], Intercept);
log_lik[n] = ordered_logistic_lpmf(Y[n] | z[observations[n]], Intercept);
}
}
"
<- stan_model(model_code=brown_code) model_brownian
<- rstan::sampling(
fit_brownian
model_brownian,data = phylo_data,
chains = 4,
iter=2000
)
# fit_brownian <- readRDS("models/fit_brownian.rds")
print(fit_brownian, pars=c("mu_root", "sigma_root", "sigma", "Intercept", "lp__"))
Inference for Stan model: anon_model.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.
mean se_mean sd 2.5% 25% 50% 75%
mu_root -1.63 0.01 0.31 -2.28 -1.83 -1.61 -1.41
sigma_root 2.04 0.01 0.29 1.54 1.84 2.02 2.22
sigma 1.92 0.01 0.28 1.42 1.73 1.91 2.10
Intercept[1] -0.69 0.00 0.16 -1.00 -0.79 -0.69 -0.58
Intercept[2] 1.30 0.00 0.17 0.99 1.18 1.30 1.41
Intercept[3] 2.71 0.00 0.20 2.32 2.57 2.71 2.85
Intercept[4] 4.06 0.01 0.26 3.57 3.89 4.06 4.23
lp__ -4310.40 1.42 41.00 -4390.75 -4337.74 -4310.48 -4282.73
97.5% n_eff Rhat
mu_root -1.05 1294 1.00
sigma_root 2.67 930 1.00
sigma 2.52 706 1.01
Intercept[1] -0.39 2955 1.00
Intercept[2] 1.63 4194 1.00
Intercept[3] 3.10 2606 1.00
Intercept[4] 4.57 2260 1.00
lp__ -4228.27 835 1.01
Samples were drawn using NUTS(diag_e) at Sat Mar 16 20:11:56 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
<- extract_log_lik(
log_lik
fit_brownian,parameter_name = "log_lik",
merge_chains = FALSE
)<- relative_eff(exp(log_lik))
reff <- loo(log_lik, r_eff=reff)) (loo_brownian
Warning message:
“Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
”
Computed from 4000 by 1077 log-likelihood matrix.
Estimate SE
elpd_loo -1259.9 25.6
p_loo 213.5 7.1
looic 2519.9 51.1
------
MCSE of elpd_loo is NA.
MCSE and ESS estimates assume MCMC draws (r_eff in [0.4, 1.9]).
Pareto k diagnostic values:
Count Pct. Min. ESS
(-Inf, 0.7] (good) 1070 99.4% 211
(0.7, 1] (bad) 6 0.6% <NA>
(1, Inf) (very bad) 1 0.1% <NA>
See help('pareto-k-diagnostic') for details.
plot(loo_brownian)
<- extract(fit_brownian) generated_quantities
prior predictive check
<- tibble(
prior_pc tv = apply(generated_quantities$y_prior, 1, total_variance),
lv = apply(
$y_prior, 1, function(x) lineage_wise_variance(x, variable)
generated_quantities
),cv = apply(
$y_prior, 1, function(x) crossfamily_variance(x, variable)
generated_quantities
),
)
predictive_plot(prior_pc)
Warning message in geom_point(aes(x = 1, y = empirical$tv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$lv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$cv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
posterior predictive check
<- tibble(
posterior_pc tv = apply(generated_quantities$y_posterior, 1, total_variance),
lv = apply(
$y_posterior, 1, function(x) lineage_wise_variance(x, variable)
generated_quantities
),cv = apply(
$y_posterior, 1, function(x) crossfamily_variance(x, variable)
generated_quantities
),
)
predictive_plot(posterior_pc)
Warning message in geom_point(aes(x = 1, y = empirical$tv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$lv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$cv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Mixed model
Next I will fit a model with two random effect, one phylogenetic and one uncorrelated. The variable \(z\) is as in the Brownian motion model. Additionally, we have a random effect variable \(\epsilon\) which is drawn from a normal distribution with mean 0 and standard deviation \(\sigma_\epsilon\).
\[ \begin{aligned} \sigma_\epsilon^2 &\sim \text{Inverse\_Gamma}(2, 0.01)\\ \epsilon &\sim \mathcal N(0, \sigma_\epsilon) Y_i=1 &\sim \mathrm{Bernoulli}(\mathrm{logit}^{-1}(C_1-z_i-\epsilon_i))\\ Y_i=k &\sim \mathrm{Bernoulli}(\mathrm{logit}^{-1}(C_k-z_i-\epsilon_i) - \mathrm{logit}^{-1}(C_{k-1}-z_i))\\ Y_K=K &\sim \mathrm{Bernoulli}(1-\mathrm{logit}^{-1}(C_{K-1}-z_i-\epsilon_i))\\ \end{aligned} \]
<- "
mixed_code data {
int<lower=1> Ntrees;
int<lower=0> Nnodes;
int<lower=1> Nobservations;
int<lower=1> Nedges;
int<lower=1,upper=Nnodes> observations[Nobservations]; // indices of the observed tips
int<lower=1, upper=Nnodes> roots[Ntrees];
int<lower=1, upper=Nnodes> edges[Nedges, 2];
vector[Nedges] edge_lengths;
int<lower=2> nthres; // number of thresholds
int Y[Nobservations]; // response variable
}
parameters {
vector[Nnodes] z_normal;
real<lower=0> sigma;
real mu_root;
real<lower=0> sigma_root;
ordered[nthres] Intercept; // temporary thresholds for centered predictors
real<lower=0> sigma_error;
vector[Nobservations] z_error;
}
model {
vector[Nnodes] z = rep_vector(0, Nnodes);
target += exponential_lpdf(sigma | 1);
target += normal_lpdf(mu_root | 0, 1);
target += exponential_lpdf(sigma_root | 1);
target += normal_lpdf(z_normal | 0, 1);
target += normal_lpdf(Intercept | 0, 5);
target += inv_gamma_lpdf(square(sigma_error) | 2, 0.1);
target += normal_lpdf(z_error | 0, sigma_error);
for (i in 1:Ntrees) {
int idx = roots[i];
z[idx] = mu_root + z_normal[idx] * sigma_root;
}
for (i in 1:Nedges) {
int j = Nedges - i + 1;
int mother_node = edges[j, 1];
int daughter_node = edges[j, 2];
real local_mu = z[mother_node];
real local_sigma = sigma * sqrt(edge_lengths[j]);
z[daughter_node] = local_mu + z_normal[daughter_node] * local_sigma;
}
for (n in 1:Nobservations) {
target += ordered_logistic_lpmf(Y[n] | z[observations[n]] + z_error[n], Intercept);
}
}
generated quantities {
vector[Nnodes] z = rep_vector(0, Nnodes);
int y_prior[Nobservations]; // Prior predictive samples
int y_posterior[Nobservations]; // Posterior predictive samples
vector[Nnodes] z_prior = rep_vector(0, Nnodes);
vector[Nnodes] z_posterior = rep_vector(0, Nnodes);
real<lower=0> sigma_prior;
real mu_root_prior;
real<lower=0> sigma_root_prior;
real log_lik[Nobservations];
vector[nthres] Intercept_sample;
ordered[nthres] Intercept_prior;
vector[Nobservations] z_error_prior;
real sigma_error_sq = 1 / gamma_rng(2, 10);
real sigma_error_prior = sqrt(sigma_error_sq);
for (i in 1:Ntrees) {
int idx = roots[i];
z[idx] = mu_root + z_normal[idx] * sigma_root;
}
for (n in 1:Nobservations) {
z_error_prior[n] = normal_rng(0, sigma_error_prior);
}
for (i in 1:Nedges) {
int j = Nedges - i + 1;
int mother_node = edges[j, 1];
int daughter_node = edges[j, 2];
real local_mu = z[mother_node];
real local_sigma = sigma * sqrt(edge_lengths[j]);
z[daughter_node] = local_mu + z_normal[daughter_node] * local_sigma;
}
for (k in 1:nthres) {
Intercept_sample[k] = normal_rng(0, 5);
}
Intercept_prior = sort_asc(Intercept_sample);
sigma_prior = exponential_rng(1);
mu_root_prior = normal_rng(0, 1);
sigma_root_prior = exponential_rng(1);
for (i in 1:Ntrees) {
z_prior[roots[i]] = normal_rng(mu_root_prior, sigma_root_prior);
}
for (i in 1:Ntrees) {
int idx = roots[i];
z_posterior[idx] = mu_root + z_normal[idx] * sigma_root;
}
for (i in 1:Nedges) {
int j = Nedges - i + 1;
int mother_node = edges[j, 1];
int daughter_node = edges[j, 2];
real local_mu = z_posterior[mother_node];
real local_sigma = sigma * sqrt(edge_lengths[j]);
z_prior[daughter_node] = normal_rng(
z_prior[mother_node],
sigma_prior * sqrt(edge_lengths[j])
);
z_posterior[daughter_node] = local_mu + z_normal[daughter_node] * local_sigma;
}
for (n in 1:Nobservations) {
y_prior[n] = ordered_logistic_rng(z_prior[observations[n]] + z_error_prior[n], Intercept_prior);
y_posterior[n] = ordered_logistic_rng(z[observations[n]] + z_error[n], Intercept);
log_lik[n] = ordered_logistic_lpmf(Y[n] | z[observations[n]] + z_error[n], Intercept);
}
}
"
<- stan_model(model_code = mixed_code) model_mixed
<- readRDS("models/fit_mixed1.rds") fit_mixed1
# fit_mixed1 <- rstan::sampling(
# model_mixed,
# data = phylo_data,
# chains = 4,
# iter=40000,
# thin=20,
# control=list(max_treedepth=15)
# )
# saveRDS(fit_mixed1, "models/fit_mixed1.rds")
print(fit_mixed1, pars = c("mu_root", "sigma_root", "sigma", "Intercept", "sigma_error", "lp__"))
Inference for Stan model: anon_model.
4 chains, each with iter=40000; warmup=20000; thin=20;
post-warmup draws per chain=1000, total post-warmup draws=4000.
mean se_mean sd 2.5% 25% 50% 75%
mu_root -1.47 0.00 0.28 -2.03 -1.66 -1.47 -1.28
sigma_root 1.92 0.00 0.27 1.45 1.74 1.91 2.10
sigma 1.91 0.00 0.27 1.40 1.72 1.89 2.08
Intercept[1] -0.66 0.00 0.16 -0.97 -0.77 -0.66 -0.56
Intercept[2] 1.33 0.00 0.17 1.01 1.22 1.33 1.44
Intercept[3] 2.74 0.00 0.20 2.35 2.60 2.73 2.88
Intercept[4] 4.09 0.00 0.26 3.61 3.92 4.09 4.26
sigma_error 0.22 0.01 0.08 0.12 0.17 0.20 0.25
lp__ -4160.39 26.35 349.73 -4964.29 -4362.84 -4125.08 -3912.54
97.5% n_eff Rhat
mu_root -0.93 3604 1.00
sigma_root 2.47 4009 1.00
sigma 2.46 3875 1.00
Intercept[1] -0.36 3984 1.00
Intercept[2] 1.65 4022 1.00
Intercept[3] 3.15 3780 1.00
Intercept[4] 4.61 3712 1.00
sigma_error 0.44 155 1.03
lp__ -3573.75 176 1.03
Samples were drawn using NUTS(diag_e) at Sat Mar 16 18:08:22 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
<- extract_log_lik(
log_lik
fit_mixed1,parameter_name = "log_lik",
merge_chains = FALSE
)<- relative_eff(exp(log_lik))
reff <- loo(log_lik, r_eff=reff)) (loo_mixed
Warning message:
“Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
”
Computed from 4000 by 1077 log-likelihood matrix.
Estimate SE
elpd_loo -1261.0 25.4
p_loo 218.3 7.0
looic 2522.0 50.7
------
MCSE of elpd_loo is NA.
MCSE and ESS estimates assume MCMC draws (r_eff in [0.5, 1.1]).
Pareto k diagnostic values:
Count Pct. Min. ESS
(-Inf, 0.7] (good) 1071 99.4% 220
(0.7, 1] (bad) 6 0.6% <NA>
(1, Inf) (very bad) 0 0.0% <NA>
See help('pareto-k-diagnostic') for details.
plot(loo_mixed)
<- extract(fit_mixed1) generated_quantities
prior predictive check
<- tibble(
prior_pc tv = apply(generated_quantities$y_prior, 1, total_variance),
lv = apply(
$y_prior, 1, function(x) lineage_wise_variance(x, variable)
generated_quantities
),cv = apply(
$y_prior, 1, function(x) crossfamily_variance(x, variable)
generated_quantities
),
)
predictive_plot(prior_pc)
Warning message in geom_point(aes(x = 1, y = empirical$tv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$lv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$cv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
posterior predictive check
<- tibble(
posterior_pc tv = apply(generated_quantities$y_posterior, 1, total_variance),
lv = apply(
$y_posterior, 1, function(x) lineage_wise_variance(x, variable)
generated_quantities
),cv = apply(
$y_posterior, 1, function(x) crossfamily_variance(x, variable)
generated_quantities
),
)
predictive_plot(posterior_pc)
Warning message in geom_point(aes(x = 1, y = empirical$tv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$lv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$cv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Let us summarize the model comparisons:
This is very decisive evidence in favor of the CTMC model.
model comparison
tibble(model = c("ordinal", "brownian", "mixed"),
looic = c(
$estimates[3,1],
loo_ordinal$estimates[3,1],
loo_brownian$estimates[3,1]
loo_mixed
)%>%
) mutate(delta_looic = round(looic-min(looic), 3)) %>%
arrange(delta_looic)
model | looic | delta_looic |
---|---|---|
<chr> | <dbl> | <dbl> |
brownian | 2519.892 | 0.000 |
mixed | 2521.968 | 2.075 |
ordinal | 2858.004 | 338.111 |
We can conclude that phylogenetic control massively improves the fit to the data.
Variable 2: hierarchy_within
<- "hierarchy_within" variable
<- d %>%
Y select(all_of(variable)) %>%
%>%
na.omit
pull<- match(Y, sort(unique(Y))) Y
<- list()
ordinal_data $Ndata <- length(Y)
ordinal_data$nthres <- max(Y) -1
ordinal_data$Y <- Y ordinal_data
<- rstan::sampling(
fit_ordinal
model_ordinal,data = ordinal_data,
chains = 4,
iter = 2000,
)
print(fit_ordinal, pars=c("Intercept", "lp__"))
Inference for Stan model: anon_model.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.
mean se_mean sd 2.5% 25% 50% 75% 97.5%
Intercept[1] -0.86 0.00 0.07 -0.99 -0.90 -0.86 -0.81 -0.73
Intercept[2] 1.86 0.00 0.09 1.69 1.80 1.86 1.92 2.05
lp__ -1043.07 0.02 1.01 -1045.68 -1043.46 -1042.75 -1042.35 -1042.08
n_eff Rhat
Intercept[1] 2072 1
Intercept[2] 3137 1
lp__ 1806 1
Samples were drawn using NUTS(diag_e) at Sat Mar 16 20:15:33 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
<- loo(fit_ordinal)) (loo_ordinal
Computed from 4000 by 1089 log-likelihood matrix.
Estimate SE
elpd_loo -1039.9 16.7
p_loo 2.0 0.1
looic 2079.9 33.3
------
MCSE of elpd_loo is 0.0.
MCSE and ESS estimates assume MCMC draws (r_eff in [0.4, 0.8]).
All Pareto k estimates are good (k < 0.7).
See help('pareto-k-diagnostic') for details.
# Extract the generated quantities
<- extract(fit_ordinal) generated_quantities
<- ordinal_data$Y
Y <- tibble(
empirical tv = total_variance(Y),
lv = lineage_wise_variance(Y, variable),
cv = crossfamily_variance(Y, variable)
)
<- tibble(
prior_pc tv = apply(generated_quantities$y_prior, 1, total_variance),
lv = apply(
$y_prior, 1, function(x) lineage_wise_variance(x, variable)
generated_quantities
),cv = apply(
$y_prior, 1, function(x) crossfamily_variance(x, variable)
generated_quantities
),
)predictive_plot(prior_pc)
Warning message in geom_point(aes(x = 1, y = empirical$tv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$lv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$cv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
<- tibble(
posterior_pc tv = apply(generated_quantities$y_posterior, 1, total_variance),
lv = apply(
$y_posterior, 1, function(x) lineage_wise_variance(x, variable)
generated_quantities
),cv = apply(
$y_posterior, 1, function(x) crossfamily_variance(x, variable)
generated_quantities
),
)predictive_plot(posterior_pc)
Warning message in geom_point(aes(x = 1, y = empirical$tv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$lv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$cv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
<- which(!is.na(d[[variable]]))
defined <- which(is.na(d[[variable]])) undefined
<- c()
po for (i in get_postorder(roots, edges)) {
if ((i %in% edges[,2]) && !(i %in% undefined)) {
<- c(po, i)
po
}
}<- match(po, edges[,2]) edge_po
= list()
phylo_data $Ntrees <- length(trees)
phylo_data$Nnodes <- nrow(nodes)
phylo_data$Y <- ordinal_data$Y
phylo_data$Nobservations <- length(phylo_data$Y)
phylo_data$Nedges <- length(edge_po)
phylo_data$observations <- match(pull(d[defined, "glottocode"]), nodes$label)
phylo_data$roots <- roots
phylo_data$edges <- edges[edge_po,]
phylo_data$edge_lengths <- edge_lengths[edge_po]
phylo_data$nthres <- length(unique(phylo_data$Y)) - 1 phylo_data
<- rstan::sampling(
fit_brownian
model_brownian,data = phylo_data,
chains = 4,
iter = 2000
)
print(fit_brownian, pars=c("mu_root", "sigma_root", "sigma", "Intercept", "lp__"))
Inference for Stan model: anon_model.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.
mean se_mean sd 2.5% 25% 50% 75%
mu_root 0.40 0.00 0.23 -0.04 0.24 0.40 0.55
sigma_root 1.19 0.01 0.22 0.78 1.04 1.18 1.33
sigma 1.98 0.01 0.29 1.45 1.78 1.96 2.16
Intercept[1] -0.62 0.00 0.15 -0.94 -0.73 -0.62 -0.52
Intercept[2] 3.18 0.01 0.23 2.75 3.01 3.17 3.33
lp__ -3990.72 1.48 40.81 -4067.19 -4018.82 -3991.51 -3962.58
97.5% n_eff Rhat
mu_root 0.85 2357 1.00
sigma_root 1.66 1021 1.00
sigma 2.57 642 1.01
Intercept[1] -0.33 3886 1.00
Intercept[2] 3.64 1675 1.00
lp__ -3910.57 757 1.01
Samples were drawn using NUTS(diag_e) at Sat Mar 16 20:18:05 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
<- loo(fit_brownian)) (loo_brownian
Warning message:
“Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
”
Computed from 4000 by 1089 log-likelihood matrix.
Estimate SE
elpd_loo -936.6 18.4
p_loo 205.7 6.8
looic 1873.2 36.8
------
MCSE of elpd_loo is NA.
MCSE and ESS estimates assume MCMC draws (r_eff in [0.3, 2.1]).
Pareto k diagnostic values:
Count Pct. Min. ESS
(-Inf, 0.7] (good) 1083 99.4% 121
(0.7, 1] (bad) 6 0.6% <NA>
(1, Inf) (very bad) 0 0.0% <NA>
See help('pareto-k-diagnostic') for details.
# Extract the generated quantities
<- extract(fit_brownian) generated_quantities
<- ordinal_data$Y
Y <- tibble(
empirical tv = total_variance(Y),
lv = lineage_wise_variance(Y, variable),
cv = crossfamily_variance(Y, variable)
)
<- tibble(
prior_pc tv = apply(generated_quantities$y_prior, 1, total_variance),
lv = apply(
$y_prior, 1, function(x) lineage_wise_variance(x, variable)
generated_quantities
),cv = apply(
$y_prior, 1, function(x) crossfamily_variance(x, variable)
generated_quantities
),
)predictive_plot(prior_pc)
Warning message in geom_point(aes(x = 1, y = empirical$tv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$lv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$cv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
<- tibble(
posterior_pc tv = apply(generated_quantities$y_posterior, 1, total_variance),
lv = apply(
$y_posterior, 1, function(x) lineage_wise_variance(x, variable)
generated_quantities
),cv = apply(
$y_posterior, 1, function(x) crossfamily_variance(x, variable)
generated_quantities
),
)predictive_plot(posterior_pc)
Warning message in geom_point(aes(x = 1, y = empirical$tv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$lv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$cv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
<- readRDS("models/fit_mixed2.rds") fit_mixed2
# fit_mixed2 <- rstan::sampling(
# model_mixed,
# data = phylo_data,
# chains = 4,
# iter=40000,
# thin=20,
# control=list(max_treedepth=15)
# )
# saveRDS(fit_mixed2, "models/fit_mixed2.rds")
print(fit_mixed2, pars=c("mu_root", "sigma_root", "sigma", "Intercept", "sigma_error", "lp__"))
Inference for Stan model: anon_model.
4 chains, each with iter=40000; warmup=20000; thin=20;
post-warmup draws per chain=1000, total post-warmup draws=4000.
mean se_mean sd 2.5% 25% 50% 75%
mu_root 0.38 0.00 0.22 -0.06 0.24 0.38 0.53
sigma_root 1.16 0.00 0.21 0.78 1.01 1.15 1.30
sigma 1.96 0.00 0.29 1.43 1.76 1.95 2.15
Intercept[1] -0.64 0.00 0.16 -0.94 -0.74 -0.64 -0.53
Intercept[2] 3.18 0.00 0.24 2.72 3.01 3.17 3.34
sigma_error 0.23 0.01 0.09 0.12 0.17 0.21 0.27
lp__ -3886.28 31.52 366.29 -4692.49 -4107.62 -3857.59 -3625.73
97.5% n_eff Rhat
mu_root 0.82 3760 1.00
sigma_root 1.61 3863 1.00
sigma 2.58 4002 1.00
Intercept[1] -0.33 3678 1.00
Intercept[2] 3.67 3596 1.00
sigma_error 0.46 109 1.04
lp__ -3260.73 135 1.03
Samples were drawn using NUTS(diag_e) at Sat Mar 16 18:49:55 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
<- loo(fit_mixed2)) (loo_mixed
Warning message:
“Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
”
Computed from 4000 by 1089 log-likelihood matrix.
Estimate SE
elpd_loo -937.8 18.3
p_loo 212.6 6.9
looic 1875.5 36.6
------
MCSE of elpd_loo is NA.
MCSE and ESS estimates assume MCMC draws (r_eff in [0.7, 1.1]).
Pareto k diagnostic values:
Count Pct. Min. ESS
(-Inf, 0.7] (good) 1085 99.6% 345
(0.7, 1] (bad) 3 0.3% <NA>
(1, Inf) (very bad) 1 0.1% <NA>
See help('pareto-k-diagnostic') for details.
# Extract the generated quantities
<- extract(fit_mixed2) generated_quantities
<- ordinal_data$Y
Y <- tibble(
empirical tv = total_variance(Y),
lv = lineage_wise_variance(Y, variable),
cv = crossfamily_variance(Y, variable)
)
<- tibble(
prior_pc tv = apply(generated_quantities$y_prior, 1, total_variance),
lv = apply(
$y_prior, 1, function(x) lineage_wise_variance(x, variable)
generated_quantities
),cv = apply(
$y_prior, 1, function(x) crossfamily_variance(x, variable)
generated_quantities
),
)predictive_plot(prior_pc)
Warning message in geom_point(aes(x = 1, y = empirical$tv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$lv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$cv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
<- tibble(
posterior_pc tv = apply(generated_quantities$y_posterior, 1, total_variance),
lv = apply(
$y_posterior, 1, function(x) lineage_wise_variance(x, variable)
generated_quantities
),cv = apply(
$y_posterior, 1, function(x) crossfamily_variance(x, variable)
generated_quantities
),
)predictive_plot(posterior_pc)
Warning message in geom_point(aes(x = 1, y = empirical$tv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$lv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$cv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
model comparison
tibble(model = c("ordinal", "brownian", "mixed"),
looic = c(
$estimates[3,1],
loo_ordinal$estimates[3,1],
loo_brownian$estimates[3,1]
loo_mixed
)%>%
) mutate(delta_looic = round(looic-min(looic), 3)) %>%
arrange(delta_looic)
model | looic | delta_looic |
---|---|---|
<chr> | <dbl> | <dbl> |
brownian | 1873.185 | 0.000 |
mixed | 1875.527 | 2.342 |
ordinal | 2079.891 | 206.705 |
Variable 3: domestic_organisation
<- "domestic_organisation" variable
<- d %>%
Y select(all_of(variable)) %>%
%>%
na.omit
pull<- match(Y, sort(unique(Y))) Y
<- list()
ordinal_data $Ndata <- length(Y)
ordinal_data$nthres <- max(Y) -1
ordinal_data$Y <- Y ordinal_data
<- rstan::sampling(
fit_ordinal
model_ordinal,data = ordinal_data,
chains = 4,
iter = 2000,
)
print(fit_ordinal, pars=c("Intercept", "lp__"))
Inference for Stan model: anon_model.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.
mean se_mean sd 2.5% 25% 50% 75% 97.5%
Intercept[1] -0.93 0.00 0.06 -1.06 -0.97 -0.93 -0.89 -0.80
Intercept[2] 0.05 0.00 0.06 -0.06 0.01 0.05 0.09 0.17
Intercept[3] 1.45 0.00 0.07 1.31 1.40 1.45 1.50 1.60
lp__ -1631.03 0.03 1.27 -1634.30 -1631.56 -1630.69 -1630.12 -1629.64
n_eff Rhat
Intercept[1] 2517 1
Intercept[2] 3701 1
Intercept[3] 4405 1
lp__ 1480 1
Samples were drawn using NUTS(diag_e) at Sat Mar 16 20:21:53 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
<- loo(fit_ordinal)) (loo_ordinal
Computed from 4000 by 1183 log-likelihood matrix.
Estimate SE
elpd_loo -1625.2 5.9
p_loo 3.0 0.0
looic 3250.4 11.8
------
MCSE of elpd_loo is 0.0.
MCSE and ESS estimates assume MCMC draws (r_eff in [0.6, 1.1]).
All Pareto k estimates are good (k < 0.7).
See help('pareto-k-diagnostic') for details.
# Extract the generated quantities
<- extract(fit_ordinal) generated_quantities
<- ordinal_data$Y
Y <- tibble(
empirical tv = total_variance(Y),
lv = lineage_wise_variance(Y, variable),
cv = crossfamily_variance(Y, variable)
)
<- tibble(
prior_pc tv = apply(generated_quantities$y_prior, 1, total_variance),
lv = apply(
$y_prior, 1, function(x) lineage_wise_variance(x, variable)
generated_quantities
),cv = apply(
$y_prior, 1, function(x) crossfamily_variance(x, variable)
generated_quantities
),
)predictive_plot(prior_pc)
Warning message in geom_point(aes(x = 1, y = empirical$tv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$lv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$cv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
<- tibble(
posterior_pc tv = apply(generated_quantities$y_posterior, 1, total_variance),
lv = apply(
$y_posterior, 1, function(x) lineage_wise_variance(x, variable)
generated_quantities
),cv = apply(
$y_posterior, 1, function(x) crossfamily_variance(x, variable)
generated_quantities
),
)predictive_plot(posterior_pc)
Warning message in geom_point(aes(x = 1, y = empirical$tv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$lv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$cv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
<- which(!is.na(d[[variable]]))
defined <- which(is.na(d[[variable]])) undefined
<- c()
po for (i in get_postorder(roots, edges)) {
if ((i %in% edges[,2]) & !(i %in% undefined)) {
<- c(po, i)
po
}
}<- match(po, edges[,2]) edge_po
= list()
phylo_data $Ntrees <- length(trees)
phylo_data$Nnodes <- nrow(nodes)
phylo_data$Y <- ordinal_data$Y
phylo_data$Nobservations <- length(phylo_data$Y)
phylo_data$Nedges <- length(edge_po)
phylo_data$observations <- match(pull(d[defined, "glottocode"]), nodes$label)
phylo_data$roots <- roots
phylo_data$edges <- edges[edge_po,]
phylo_data$edge_lengths <- edge_lengths[edge_po]
phylo_data$nthres <- length(unique(phylo_data$Y)) - 1 phylo_data
<- rstan::sampling(
fit_brownian
model_brownian,data = phylo_data,
chains = 4,
iter=2000
)
print(fit_brownian, pars=c("mu_root", "sigma_root", "sigma", "Intercept", "lp__"))
Inference for Stan model: anon_model.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.
mean se_mean sd 2.5% 25% 50% 75%
mu_root -0.19 0.01 0.31 -0.80 -0.41 -0.19 0.01
sigma_root 1.00 0.01 0.18 0.68 0.87 0.99 1.13
sigma 1.53 0.01 0.22 1.11 1.37 1.53 1.68
Intercept[1] -1.25 0.01 0.28 -1.83 -1.44 -1.25 -1.06
Intercept[2] -0.01 0.01 0.28 -0.56 -0.20 -0.01 0.18
Intercept[3] 1.77 0.01 0.30 1.20 1.57 1.77 1.97
lp__ -4597.03 1.43 42.38 -4682.29 -4625.63 -4597.17 -4568.26
97.5% n_eff Rhat
mu_root 0.41 2858 1
sigma_root 1.39 1066 1
sigma 1.98 813 1
Intercept[1] -0.70 3160 1
Intercept[2] 0.55 2935 1
Intercept[3] 2.36 2409 1
lp__ -4515.18 878 1
Samples were drawn using NUTS(diag_e) at Sat Mar 16 20:24:15 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
<- loo(fit_brownian)) (loo_brownian
Warning message:
“Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
”
Computed from 4000 by 1183 log-likelihood matrix.
Estimate SE
elpd_loo -1540.1 13.9
p_loo 213.0 5.1
looic 3080.3 27.7
------
MCSE of elpd_loo is NA.
MCSE and ESS estimates assume MCMC draws (r_eff in [0.4, 1.9]).
Pareto k diagnostic values:
Count Pct. Min. ESS
(-Inf, 0.7] (good) 1178 99.6% 264
(0.7, 1] (bad) 5 0.4% <NA>
(1, Inf) (very bad) 0 0.0% <NA>
See help('pareto-k-diagnostic') for details.
# Extract the generated quantities
<- extract(fit_brownian) generated_quantities
<- ordinal_data$Y
Y <- tibble(
empirical tv = total_variance(Y),
lv = lineage_wise_variance(Y, variable),
cv = crossfamily_variance(Y, variable)
)
<- tibble(
prior_pc tv = apply(generated_quantities$y_prior, 1, total_variance),
lv = apply(
$y_prior, 1, function(x) lineage_wise_variance(x, variable)
generated_quantities
),cv = apply(
$y_prior, 1, function(x) crossfamily_variance(x, variable)
generated_quantities
),
)predictive_plot(prior_pc)
Warning message in geom_point(aes(x = 1, y = empirical$tv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$lv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$cv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
<- tibble(
posterior_pc tv = apply(generated_quantities$y_posterior, 1, total_variance),
lv = apply(
$y_posterior, 1, function(x) lineage_wise_variance(x, variable)
generated_quantities
),cv = apply(
$y_posterior, 1, function(x) crossfamily_variance(x, variable)
generated_quantities
),
)predictive_plot(posterior_pc)
Warning message in geom_point(aes(x = 1, y = empirical$tv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$lv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$cv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
<- readRDS("models/fit_mixed3.rds") fit_mixed3
# fit_mixed3 <- rstan::sampling(
# model_mixed,
# data = phylo_data,
# chains = 4,
# iter = 40000,
# thin = 20,
# control = list(max_treedepth = 15)
# )
# saveRDS(fit_mixed3, "models/fit_mixed3.rds")
print(
fit_mixed3,pars = c("mu_root", "sigma_root", "sigma", "Intercept", "sigma_error", "lp__")
)
Inference for Stan model: anon_model.
4 chains, each with iter=40000; warmup=20000; thin=20;
post-warmup draws per chain=1000, total post-warmup draws=4000.
mean se_mean sd 2.5% 25% 50% 75%
mu_root -0.17 0.00 0.31 -0.78 -0.38 -0.18 0.04
sigma_root 0.99 0.00 0.19 0.66 0.86 0.98 1.11
sigma 1.54 0.00 0.23 1.11 1.37 1.53 1.69
Intercept[1] -1.25 0.00 0.29 -1.80 -1.44 -1.25 -1.05
Intercept[2] 0.01 0.00 0.29 -0.56 -0.19 0.01 0.20
Intercept[3] 1.80 0.01 0.31 1.22 1.59 1.80 2.01
sigma_error 0.24 0.01 0.09 0.13 0.18 0.22 0.27
lp__ -4495.62 25.99 377.70 -5338.93 -4720.97 -4470.54 -4230.67
97.5% n_eff Rhat
mu_root 0.46 3899 1.00
sigma_root 1.38 3222 1.00
sigma 2.03 3691 1.00
Intercept[1] -0.68 3348 1.00
Intercept[2] 0.58 3338 1.00
Intercept[3] 2.40 3038 1.00
sigma_error 0.46 197 1.02
lp__ -3843.50 211 1.01
Samples were drawn using NUTS(diag_e) at Sun Mar 10 11:53:58 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
<- loo(fit_mixed3)) (loo_mixed
Computed from 4000 by 1183 log-likelihood matrix.
Estimate SE
elpd_loo -1540.2 13.8
p_loo 226.2 5.3
looic 3080.3 27.5
------
MCSE of elpd_loo is 0.4.
MCSE and ESS estimates assume MCMC draws (r_eff in [0.6, 1.1]).
All Pareto k estimates are good (k < 0.7).
See help('pareto-k-diagnostic') for details.
# Extract the generated quantities
<- extract(fit_mixed3) generated_quantities
<- ordinal_data$Y
Y <- tibble(
empirical tv = total_variance(Y),
lv = lineage_wise_variance(Y, variable),
cv = crossfamily_variance(Y, variable)
)
<- tibble(
prior_pc tv = apply(generated_quantities$y_prior, 1, total_variance),
lv = apply(
$y_prior, 1, function(x) lineage_wise_variance(x, variable)
generated_quantities
),cv = apply(
$y_prior, 1, function(x) crossfamily_variance(x, variable)
generated_quantities
),
)predictive_plot(prior_pc)
Warning message in geom_point(aes(x = 1, y = empirical$tv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$lv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$cv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
<- tibble(
posterior_pc tv = apply(generated_quantities$y_posterior, 1, total_variance),
lv = apply(
$y_posterior, 1, function(x) lineage_wise_variance(x, variable)
generated_quantities
),cv = apply(
$y_posterior, 1, function(x) crossfamily_variance(x, variable)
generated_quantities
),
)predictive_plot(posterior_pc)
Warning message in geom_point(aes(x = 1, y = empirical$tv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$lv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$cv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
model comparison
tibble(model = c("ordinal", "brownian", "mixed"),
looic = c(
$estimates[3,1],
loo_ordinal$estimates[3,1],
loo_brownian$estimates[3,1]
loo_mixed
)%>%
) mutate(delta_looic = round(looic-min(looic), 3)) %>%
arrange(delta_looic)
model | looic | delta_looic |
---|---|---|
<chr> | <dbl> | <dbl> |
brownian | 3080.276 | 0.000 |
mixed | 3080.314 | 0.038 |
ordinal | 3250.418 | 170.143 |
Variable 4: agricultureLevel
<- "agricultureLevel" variable
<- d %>%
Y select(all_of(variable)) %>%
%>%
na.omit
pull<- match(Y, sort(unique(Y))) Y
<- list()
ordinal_data $Ndata <- length(Y)
ordinal_data$nthres <- max(Y) -1
ordinal_data$Y <- Y ordinal_data
<- rstan::sampling(
fit_ordinal
model_ordinal,data = ordinal_data,
chains = 4,
iter = 2000,
)
print(fit_ordinal, pars=c("Intercept", "lp__"))
Inference for Stan model: anon_model.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.
mean se_mean sd 2.5% 25% 50% 75% 97.5%
Intercept[1] -1.53 0.00 0.07 -1.68 -1.57 -1.53 -1.48 -1.39
Intercept[2] -1.32 0.00 0.07 -1.46 -1.37 -1.32 -1.28 -1.19
Intercept[3] -1.19 0.00 0.07 -1.32 -1.23 -1.18 -1.14 -1.06
Intercept[4] -1.02 0.00 0.06 -1.15 -1.07 -1.02 -0.98 -0.90
Intercept[5] -0.67 0.00 0.06 -0.79 -0.71 -0.67 -0.63 -0.56
Intercept[6] 0.03 0.00 0.06 -0.08 -0.01 0.03 0.07 0.15
Intercept[7] 1.08 0.00 0.07 0.95 1.04 1.08 1.13 1.21
Intercept[8] 2.35 0.00 0.10 2.15 2.28 2.35 2.42 2.55
Intercept[9] 4.11 0.00 0.22 3.70 3.96 4.10 4.26 4.57
lp__ -2448.97 0.05 2.07 -2453.62 -2450.17 -2448.61 -2447.47 -2445.90
n_eff Rhat
Intercept[1] 3364 1
Intercept[2] 3978 1
Intercept[3] 4222 1
Intercept[4] 4554 1
Intercept[5] 5240 1
Intercept[6] 5007 1
Intercept[7] 4919 1
Intercept[8] 5653 1
Intercept[9] 5456 1
lp__ 1955 1
Samples were drawn using NUTS(diag_e) at Sat Mar 16 20:27:19 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
<- loo(fit_ordinal)) (loo_ordinal
Computed from 4000 by 1209 log-likelihood matrix.
Estimate SE
elpd_loo -2424.1 23.3
p_loo 8.9 0.3
looic 4848.2 46.7
------
MCSE of elpd_loo is 0.0.
MCSE and ESS estimates assume MCMC draws (r_eff in [0.8, 1.6]).
All Pareto k estimates are good (k < 0.7).
See help('pareto-k-diagnostic') for details.
# Extract the generated quantities
<- extract(fit_ordinal) generated_quantities
<- ordinal_data$Y
Y <- tibble(
empirical tv = total_variance(Y),
lv = lineage_wise_variance(Y, variable),
cv = crossfamily_variance(Y, variable)
)
<- tibble(
prior_pc tv = apply(generated_quantities$y_prior, 1, total_variance),
lv = apply(
$y_prior, 1, function(x) lineage_wise_variance(x, variable)
generated_quantities
),cv = apply(
$y_prior, 1, function(x) crossfamily_variance(x, variable)
generated_quantities
),
)predictive_plot(prior_pc)
Warning message in geom_point(aes(x = 1, y = empirical$tv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$lv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$cv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
<- tibble(
posterior_pc tv = apply(generated_quantities$y_posterior, 1, total_variance),
lv = apply(
$y_posterior, 1, function(x) lineage_wise_variance(x, variable)
generated_quantities
),cv = apply(
$y_posterior, 1, function(x) crossfamily_variance(x, variable)
generated_quantities
),
)predictive_plot(posterior_pc)
Warning message in geom_point(aes(x = 1, y = empirical$tv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$lv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$cv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
<- which(!is.na(d[[variable]]))
defined <- which(is.na(d[[variable]])) undefined
<- c()
po for (i in get_postorder(roots, edges)) {
if ((i %in% edges[,2]) & !(i %in% undefined)) {
<- c(po, i)
po
}
}<- match(po, edges[,2]) edge_po
= list()
phylo_data $Ntrees <- length(trees)
phylo_data$Nnodes <- nrow(nodes)
phylo_data$Y <- ordinal_data$Y
phylo_data$Nobservations <- length(phylo_data$Y)
phylo_data$Nedges <- length(edge_po)
phylo_data$observations <- match(pull(d[defined, "glottocode"]), nodes$label)
phylo_data$roots <- roots
phylo_data$edges <- edges[edge_po,]
phylo_data$edge_lengths <- edge_lengths[edge_po]
phylo_data$nthres <- length(unique(phylo_data$Y)) - 1 phylo_data
<- rstan::sampling(
fit_brownian
model_brownian,data = phylo_data,
chains = 4,
iter = 2000
)
print(fit_brownian, pars=c("mu_root", "sigma_root", "sigma", "Intercept", "lp__"))
Inference for Stan model: anon_model.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.
mean se_mean sd 2.5% 25% 50% 75%
mu_root 0.34 0.03 1.41 -2.49 -0.60 0.36 1.32
sigma_root 3.14 0.01 0.31 2.56 2.93 3.13 3.33
sigma 1.81 0.01 0.24 1.36 1.65 1.80 1.96
Intercept[1] -1.24 0.02 1.39 -3.97 -2.16 -1.21 -0.30
Intercept[2] -0.70 0.02 1.39 -3.46 -1.64 -0.67 0.24
Intercept[3] -0.36 0.02 1.39 -3.10 -1.30 -0.33 0.58
Intercept[4] 0.01 0.02 1.39 -2.75 -0.92 0.04 0.96
Intercept[5] 0.78 0.02 1.39 -1.97 -0.14 0.81 1.72
Intercept[6] 2.15 0.02 1.39 -0.61 1.21 2.18 3.08
Intercept[7] 3.87 0.02 1.39 1.11 2.92 3.89 4.81
Intercept[8] 5.65 0.02 1.39 2.86 4.71 5.67 6.59
Intercept[9] 7.74 0.02 1.42 4.93 6.76 7.76 8.70
lp__ -5002.57 1.79 46.21 -5092.02 -5033.70 -5002.50 -4971.48
97.5% n_eff Rhat
mu_root 3.03 3153 1.00
sigma_root 3.78 795 1.01
sigma 2.30 650 1.01
Intercept[1] 1.43 3360 1.00
Intercept[2] 1.97 3396 1.00
Intercept[3] 2.30 3420 1.00
Intercept[4] 2.71 3451 1.00
Intercept[5] 3.46 3469 1.00
Intercept[6] 4.81 3519 1.00
Intercept[7] 6.52 3534 1.00
Intercept[8] 8.31 3530 1.00
Intercept[9] 10.44 3577 1.00
lp__ -4911.87 665 1.01
Samples were drawn using NUTS(diag_e) at Sat Mar 16 20:34:08 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
<- loo(fit_brownian)) (loo_brownian
Warning message:
“Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
”
Computed from 4000 by 1209 log-likelihood matrix.
Estimate SE
elpd_loo -2009.9 34.6
p_loo 322.1 9.7
looic 4019.9 69.3
------
MCSE of elpd_loo is NA.
MCSE and ESS estimates assume MCMC draws (r_eff in [0.3, 1.8]).
Pareto k diagnostic values:
Count Pct. Min. ESS
(-Inf, 0.7] (good) 1135 93.9% 99
(0.7, 1] (bad) 72 6.0% <NA>
(1, Inf) (very bad) 2 0.2% <NA>
See help('pareto-k-diagnostic') for details.
# Extract the generated quantities
<- extract(fit_brownian) generated_quantities
<- ordinal_data$Y
Y <- tibble(
empirical tv = total_variance(Y),
lv = lineage_wise_variance(Y, variable),
cv = crossfamily_variance(Y, variable)
)
<- tibble(
prior_pc tv = apply(generated_quantities$y_prior, 1, total_variance),
lv = apply(
$y_prior, 1, function(x) lineage_wise_variance(x, variable)
generated_quantities
),cv = apply(
$y_prior, 1, function(x) crossfamily_variance(x, variable)
generated_quantities
),
)predictive_plot(prior_pc)
Warning message in geom_point(aes(x = 1, y = empirical$tv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$lv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$cv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
<- tibble(
posterior_pc tv = apply(generated_quantities$y_posterior, 1, total_variance),
lv = apply(
$y_posterior, 1, function(x) lineage_wise_variance(x, variable)
generated_quantities
),cv = apply(
$y_posterior, 1, function(x) crossfamily_variance(x, variable)
generated_quantities
),
)predictive_plot(posterior_pc)
Warning message in geom_point(aes(x = 1, y = empirical$tv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$lv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$cv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
<- readRDS("models/fit_mixed4.rds") fit_mixed4
# fit_mixed4 <- rstan::sampling(
# model_mixed,
# data = phylo_data,
# chains = 4,
# iter = 40000,
# thin = 20,
# control = list(max_treedepth=15)
# )
# saveRDS(fit_mixed4, "models/fit_mixed4.rds")
print(fit_mixed4, pars=c("mu_root", "sigma_root", "sigma", "Intercept", "sigma_error", "lp__"))
Inference for Stan model: anon_model.
4 chains, each with iter=40000; warmup=20000; thin=20;
post-warmup draws per chain=1000, total post-warmup draws=4000.
mean se_mean sd 2.5% 25% 50% 75%
mu_root 0.13 0.01 0.84 -1.50 -0.44 0.14 0.71
sigma_root 3.07 0.01 0.29 2.52 2.86 3.05 3.25
sigma 1.79 0.00 0.24 1.35 1.63 1.79 1.95
Intercept[1] -1.45 0.01 0.86 -3.10 -2.05 -1.44 -0.84
Intercept[2] -0.91 0.01 0.86 -2.55 -1.51 -0.90 -0.31
Intercept[3] -0.58 0.01 0.86 -2.20 -1.17 -0.57 0.03
Intercept[4] -0.20 0.01 0.86 -1.86 -0.79 -0.19 0.40
Intercept[5] 0.57 0.01 0.86 -1.07 -0.02 0.58 1.16
Intercept[6] 1.94 0.01 0.86 0.31 1.35 1.95 2.53
Intercept[7] 3.67 0.01 0.86 2.02 3.08 3.68 4.25
Intercept[8] 5.45 0.01 0.87 3.79 4.85 5.46 6.04
Intercept[9] 7.55 0.02 0.91 5.78 6.94 7.54 8.19
sigma_error 0.23 0.00 0.08 0.12 0.17 0.21 0.26
lp__ -4864.79 23.60 369.52 -5609.00 -5105.58 -4839.37 -4604.52
97.5% n_eff Rhat
mu_root 1.71 3743 1.00
sigma_root 3.65 3157 1.00
sigma 2.29 4095 1.00
Intercept[1] 0.19 3849 1.00
Intercept[2] 0.72 3841 1.00
Intercept[3] 1.06 3821 1.00
Intercept[4] 1.44 3799 1.00
Intercept[5] 2.21 3780 1.00
Intercept[6] 3.59 3744 1.00
Intercept[7] 5.30 3695 1.00
Intercept[8] 7.13 3599 1.00
Intercept[9] 9.32 3558 1.00
sigma_error 0.41 230 1.01
lp__ -4200.97 245 1.01
Samples were drawn using NUTS(diag_e) at Sun Mar 10 14:07:50 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
<- loo(fit_mixed4)) (loo_mixed
Warning message:
“Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
”
Computed from 4000 by 1209 log-likelihood matrix.
Estimate SE
elpd_loo -2009.6 34.6
p_loo 328.1 9.7
looic 4019.2 69.1
------
MCSE of elpd_loo is NA.
MCSE and ESS estimates assume MCMC draws (r_eff in [0.8, 1.1]).
Pareto k diagnostic values:
Count Pct. Min. ESS
(-Inf, 0.7] (good) 1144 94.6% 207
(0.7, 1] (bad) 65 5.4% <NA>
(1, Inf) (very bad) 0 0.0% <NA>
See help('pareto-k-diagnostic') for details.
# Extract the generated quantities
<- extract(fit_mixed4) generated_quantities
<- ordinal_data$Y
Y <- tibble(
empirical tv = total_variance(Y),
lv = lineage_wise_variance(Y, variable),
cv = crossfamily_variance(Y, variable)
)
<- tibble(
prior_pc tv = apply(generated_quantities$y_prior, 1, total_variance),
lv = apply(
$y_prior, 1, function(x) lineage_wise_variance(x, variable)
generated_quantities
),cv = apply(
$y_prior, 1, function(x) crossfamily_variance(x, variable)
generated_quantities
),
)predictive_plot(prior_pc)
Warning message in geom_point(aes(x = 1, y = empirical$tv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$lv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$cv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
<- tibble(
posterior_pc tv = apply(generated_quantities$y_posterior, 1, total_variance),
lv = apply(
$y_posterior, 1, function(x) lineage_wise_variance(x, variable)
generated_quantities
),cv = apply(
$y_posterior, 1, function(x) crossfamily_variance(x, variable)
generated_quantities
),
)predictive_plot(posterior_pc)
Warning message in geom_point(aes(x = 1, y = empirical$tv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$lv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$cv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
model comparison
tibble(model = c("ordinal", "brownian", "mixed"),
looic = c(
$estimates[3,1],
loo_ordinal$estimates[3,1],
loo_brownian$estimates[3,1]
loo_mixed
)%>%
) mutate(delta_looic = round(looic-min(looic), 3)) %>%
arrange(delta_looic)
model | looic | delta_looic |
---|---|---|
<chr> | <dbl> | <dbl> |
mixed | 4019.184 | 0.000 |
brownian | 4019.874 | 0.690 |
ordinal | 4848.207 | 829.023 |
Variable 5: settlement_strategy
<- "settlement_strategy" variable
<- d %>%
Y select(all_of(variable)) %>%
%>%
na.omit
pull<- match(Y, sort(unique(Y))) Y
<- list()
ordinal_data $Ndata <- length(Y)
ordinal_data$nthres <- max(Y) -1
ordinal_data$Y <- Y ordinal_data
<- rstan::sampling(
fit_ordinal
model_ordinal,data=ordinal_data,
chains=4,
iter=2000,
)
print(fit_ordinal, pars=c("Intercept", "lp__"))
Inference for Stan model: anon_model.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.
mean se_mean sd 2.5% 25% 50% 75% 97.5%
Intercept[1] -1.42 0.00 0.08 -1.57 -1.47 -1.42 -1.37 -1.27
Intercept[2] -0.88 0.00 0.07 -1.01 -0.93 -0.88 -0.84 -0.75
Intercept[3] 0.06 0.00 0.06 -0.06 0.02 0.06 0.10 0.18
lp__ -1371.87 0.03 1.23 -1375.05 -1372.48 -1371.55 -1370.97 -1370.44
n_eff Rhat
Intercept[1] 2523 1
Intercept[2] 3364 1
Intercept[3] 5206 1
lp__ 1882 1
Samples were drawn using NUTS(diag_e) at Sat Mar 16 20:37:54 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
<- loo(fit_ordinal)) (loo_ordinal
Computed from 4000 by 1107 log-likelihood matrix.
Estimate SE
elpd_loo -1365.1 18.0
p_loo 3.1 0.1
looic 2730.2 36.0
------
MCSE of elpd_loo is 0.0.
MCSE and ESS estimates assume MCMC draws (r_eff in [0.6, 1.3]).
All Pareto k estimates are good (k < 0.7).
See help('pareto-k-diagnostic') for details.
# Extract the generated quantities
<- extract(fit_ordinal) generated_quantities
<- ordinal_data$Y
Y <- tibble(
empirical tv = total_variance(Y),
lv = lineage_wise_variance(Y, variable),
cv = crossfamily_variance(Y, variable)
)
<- tibble(
prior_pc tv = apply(generated_quantities$y_prior, 1, total_variance),
lv = apply(
$y_prior, 1, function(x) lineage_wise_variance(x, variable)
generated_quantities
),cv = apply(
$y_prior, 1, function(x) crossfamily_variance(x, variable)
generated_quantities
),
)predictive_plot(prior_pc)
Warning message in geom_point(aes(x = 1, y = empirical$tv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$lv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$cv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
<- tibble(
posterior_pc tv = apply(generated_quantities$y_posterior, 1, total_variance),
lv = apply(
$y_posterior, 1, function(x) lineage_wise_variance(x, variable)
generated_quantities
),cv = apply(
$y_posterior, 1, function(x) crossfamily_variance(x, variable)
generated_quantities
),
)predictive_plot(posterior_pc)
Warning message in geom_point(aes(x = 1, y = empirical$tv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$lv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$cv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
<- which(!is.na(d[[variable]]))
defined <- which(is.na(d[[variable]])) undefined
<- c()
po for (i in get_postorder(roots, edges)) {
if ((i %in% edges[,2]) & !(i %in% undefined)) {
<- c(po, i)
po
}
}<- match(po, edges[,2]) edge_po
= list()
phylo_data $Ntrees <- length(trees)
phylo_data$Nnodes <- nrow(nodes)
phylo_data$Y <- ordinal_data$Y
phylo_data$Nobservations <- length(phylo_data$Y)
phylo_data$Nedges <- length(edge_po)
phylo_data$observations <- match(pull(d[defined, "glottocode"]), nodes$label)
phylo_data$roots <- roots
phylo_data$edges <- edges[edge_po,]
phylo_data$edge_lengths <- edge_lengths[edge_po]
phylo_data$nthres <- length(unique(phylo_data$Y)) - 1 phylo_data
<- rstan::sampling(
fit_brownian
model_brownian,data = phylo_data,
chains = 4,
iter = 2000
)
print(fit_brownian, pars=c("mu_root", "sigma_root", "sigma", "Intercept", "lp__"))
Inference for Stan model: anon_model.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.
mean se_mean sd 2.5% 25% 50% 75%
mu_root 0.37 0.01 0.25 -0.13 0.20 0.37 0.54
sigma_root 1.83 0.01 0.28 1.34 1.63 1.81 2.01
sigma 1.37 0.02 0.36 0.71 1.14 1.37 1.60
Intercept[1] -1.17 0.00 0.17 -1.52 -1.28 -1.16 -1.05
Intercept[2] -0.38 0.00 0.16 -0.70 -0.49 -0.38 -0.27
Intercept[3] 0.96 0.00 0.17 0.64 0.85 0.97 1.08
lp__ -4297.83 1.89 44.43 -4383.08 -4328.19 -4298.55 -4268.17
97.5% n_eff Rhat
mu_root 0.88 2151 1.00
sigma_root 2.45 718 1.01
sigma 2.12 415 1.01
Intercept[1] -0.84 2137 1.00
Intercept[2] -0.07 4349 1.00
Intercept[3] 1.29 4159 1.00
lp__ -4210.12 550 1.01
Samples were drawn using NUTS(diag_e) at Sat Mar 16 20:39:49 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
<- loo(fit_brownian)) (loo_brownian
Warning message:
“Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
”
Computed from 4000 by 1107 log-likelihood matrix.
Estimate SE
elpd_loo -1231.3 21.8
p_loo 186.0 6.1
looic 2462.7 43.6
------
MCSE of elpd_loo is NA.
MCSE and ESS estimates assume MCMC draws (r_eff in [0.3, 2.2]).
Pareto k diagnostic values:
Count Pct. Min. ESS
(-Inf, 0.7] (good) 1100 99.4% 175
(0.7, 1] (bad) 7 0.6% <NA>
(1, Inf) (very bad) 0 0.0% <NA>
See help('pareto-k-diagnostic') for details.
# Extract the generated quantities
<- extract(fit_brownian) generated_quantities
<- ordinal_data$Y
Y <- tibble(
empirical tv = total_variance(Y),
lv = lineage_wise_variance(Y, variable),
cv = crossfamily_variance(Y, variable)
)
<- tibble(
prior_pc tv = apply(generated_quantities$y_prior, 1, total_variance),
lv = apply(
$y_prior, 1, function(x) lineage_wise_variance(x, variable)
generated_quantities
),cv = apply(
$y_prior, 1, function(x) crossfamily_variance(x, variable)
generated_quantities
),
)predictive_plot(prior_pc)
Warning message in geom_point(aes(x = 1, y = empirical$tv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$lv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$cv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
<- tibble(
posterior_pc tv = apply(generated_quantities$y_posterior, 1, total_variance),
lv = apply(
$y_posterior, 1, function(x) lineage_wise_variance(x, variable)
generated_quantities
),cv = apply(
$y_posterior, 1, function(x) crossfamily_variance(x, variable)
generated_quantities
),
)predictive_plot(posterior_pc)
Warning message in geom_point(aes(x = 1, y = empirical$tv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$lv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$cv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
<- readRDS("models/fit_mixed5.rds") fit_mixed5
# fit_mixed5 <- rstan::sampling(
# model_mixed,
# data = phylo_data,
# chains = 4,
# iter = 40000,
# thin = 20,
# control = list(max_treedepth = 15)
# )
# saveRDS(fit_mixed5, "models/fit_mixed5.rds")
print(fit_mixed5, pars=c("mu_root", "sigma_root", "sigma", "Intercept", "sigma_error", "lp__"))
Inference for Stan model: anon_model.
4 chains, each with iter=40000; warmup=20000; thin=20;
post-warmup draws per chain=1000, total post-warmup draws=4000.
mean se_mean sd 2.5% 25% 50% 75%
mu_root 0.35 0.00 0.24 -0.12 0.18 0.35 0.51
sigma_root 1.77 0.00 0.26 1.29 1.59 1.76 1.94
sigma 1.34 0.01 0.36 0.67 1.10 1.33 1.58
Intercept[1] -1.19 0.00 0.17 -1.53 -1.31 -1.18 -1.07
Intercept[2] -0.39 0.00 0.16 -0.71 -0.50 -0.39 -0.28
Intercept[3] 0.96 0.00 0.17 0.64 0.85 0.96 1.07
sigma_error 0.25 0.01 0.10 0.13 0.18 0.22 0.29
lp__ -4246.96 30.90 395.49 -5120.90 -4492.16 -4205.32 -3959.52
97.5% n_eff Rhat
mu_root 0.83 4040 1.00
sigma_root 2.35 3992 1.00
sigma 2.11 3667 1.00
Intercept[1] -0.86 4141 1.00
Intercept[2] -0.07 4317 1.00
Intercept[3] 1.29 4342 1.00
sigma_error 0.51 167 1.01
lp__ -3585.81 164 1.01
Samples were drawn using NUTS(diag_e) at Sun Mar 10 18:04:25 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
<- loo(fit_mixed5)) (loo_mixed
Warning message:
“Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
”
Computed from 4000 by 1107 log-likelihood matrix.
Estimate SE
elpd_loo -1231.4 21.7
p_loo 193.9 6.0
looic 2462.8 43.4
------
MCSE of elpd_loo is NA.
MCSE and ESS estimates assume MCMC draws (r_eff in [0.5, 1.1]).
Pareto k diagnostic values:
Count Pct. Min. ESS
(-Inf, 0.7] (good) 1101 99.5% 320
(0.7, 1] (bad) 6 0.5% <NA>
(1, Inf) (very bad) 0 0.0% <NA>
See help('pareto-k-diagnostic') for details.
# Extract the generated quantities
<- extract(fit_mixed5) generated_quantities
<- ordinal_data$Y
Y <- tibble(
empirical tv = total_variance(Y),
lv = lineage_wise_variance(Y, variable),
cv = crossfamily_variance(Y, variable)
)
<- tibble(
prior_pc tv = apply(generated_quantities$y_prior, 1, total_variance),
lv = apply(
$y_prior, 1, function(x) lineage_wise_variance(x, variable)
generated_quantities
),cv = apply(
$y_prior, 1, function(x) crossfamily_variance(x, variable)
generated_quantities
),
)predictive_plot(prior_pc)
Warning message in geom_point(aes(x = 1, y = empirical$tv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$lv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$cv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
<- tibble(
posterior_pc tv = apply(generated_quantities$y_posterior, 1, total_variance),
lv = apply(
$y_posterior, 1, function(x) lineage_wise_variance(x, variable)
generated_quantities
),cv = apply(
$y_posterior, 1, function(x) crossfamily_variance(x, variable)
generated_quantities
),
)predictive_plot(posterior_pc)
Warning message in geom_point(aes(x = 1, y = empirical$tv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$lv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$cv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
model comparison
tibble(model = c("ordinal", "brownian", "mixed"),
looic = c(
$estimates[3,1],
loo_ordinal$estimates[3,1],
loo_brownian$estimates[3,1]
loo_mixed
)%>%
) mutate(delta_looic = round(looic-min(looic), 3)) %>%
arrange(delta_looic)
model | looic | delta_looic |
---|---|---|
<chr> | <dbl> | <dbl> |
brownian | 2462.659 | 0.000 |
mixed | 2462.809 | 0.150 |
ordinal | 2730.171 | 267.512 |
Variable 6: exogamy
<- "exogamy" variable
<- d %>%
Y select(all_of(variable)) %>%
%>%
na.omit
pull<- match(Y, sort(unique(Y))) Y
Since there are only two values for this variable, we need a logistic rather than an ordinal model.
<- list()
binary_data $Ndata <- length(Y)
binary_data$Y <- Y-1 binary_data
<- "
binary_code data {
int<lower=1> Ndata; // total number of observations
int<lower=0, upper=1> Y[Ndata]; // binary response variable
}
parameters {
real Intercept; // intercept for logistic regression
}
model {
target += normal_lpdf(Intercept | 0, 2);
for (n in 1:Ndata) {
target += bernoulli_logit_lpmf(Y[n] | Intercept);
}
}
generated quantities {
int y_prior[Ndata];
int y_posterior[Ndata];
vector[Ndata] log_lik;
real Intercept_prior = normal_rng(0, 2); // sample from the prior
for (n in 1:Ndata) {
y_prior[n] = bernoulli_logit_rng(Intercept_prior); // simulate prior predictive data
y_posterior[n] = bernoulli_logit_rng(Intercept); // simulate posterior predictive data
log_lik[n] = bernoulli_logit_lpmf(Y[n] | Intercept); // log likelihood for each observation
}
}
"
<- stan_model(model_code=binary_code) model_binary
<- rstan::sampling(
fit_binary
model_binary,data=binary_data,
chains=4,
iter=2000,
)
print(fit_binary, pars=c("Intercept", "lp__"))
Inference for Stan model: anon_model.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff
Intercept -0.66 0.00 0.07 -0.79 -0.70 -0.66 -0.61 -0.53 1646
lp__ -666.79 0.02 0.70 -668.79 -666.98 -666.53 -666.33 -666.28 1601
Rhat
Intercept 1
lp__ 1
Samples were drawn using NUTS(diag_e) at Sat Mar 16 20:43:43 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
<- loo(fit_binary)) (loo_binary
Computed from 4000 by 1036 log-likelihood matrix.
Estimate SE
elpd_loo -665.6 10.1
p_loo 1.0 0.0
looic 1331.3 20.2
------
MCSE of elpd_loo is 0.0.
MCSE and ESS estimates assume MCMC draws (r_eff in [0.4, 0.4]).
All Pareto k estimates are good (k < 0.7).
See help('pareto-k-diagnostic') for details.
# Extract the generated quantities
<- extract(fit_binary) generated_quantities
<- binary_data$Y
Y <- tibble(
empirical tv = total_variance(Y),
lv = lineage_wise_variance(Y, variable),
cv = crossfamily_variance(Y, variable)
)
<- tibble(
prior_pc tv = apply(generated_quantities$y_prior, 1, total_variance),
lv = apply(
$y_prior, 1, function(x) lineage_wise_variance(x, variable)
generated_quantities
),cv = apply(
$y_prior, 1, function(x) crossfamily_variance(x, variable)
generated_quantities
),
)predictive_plot(prior_pc)
Warning message in geom_point(aes(x = 1, y = empirical$tv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$lv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$cv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
<- tibble(
posterior_pc tv = apply(generated_quantities$y_posterior, 1, total_variance),
lv = apply(
$y_posterior, 1, function(x) lineage_wise_variance(x, variable)
generated_quantities
),cv = apply(
$y_posterior, 1, function(x) crossfamily_variance(x, variable)
generated_quantities
),
)predictive_plot(posterior_pc)
Warning message in geom_point(aes(x = 1, y = empirical$tv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$lv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$cv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
<- which(!is.na(d[[variable]]))
defined <- which(is.na(d[[variable]])) undefined
<- c()
po for (i in get_postorder(roots, edges)) {
if ((i %in% edges[,2]) & !(i %in% undefined)) {
<- c(po, i)
po
}
}<- match(po, edges[,2]) edge_po
= list()
phylo_data $Ntrees <- length(trees)
phylo_data$Nnodes <- nrow(nodes)
phylo_data$Y <- binary_data$Y
phylo_data$Nobservations <- length(phylo_data$Y)
phylo_data$Nedges <- length(edge_po)
phylo_data$observations <- match(pull(d[defined, "glottocode"]), nodes$label)
phylo_data$roots <- roots
phylo_data$edges <- edges[edge_po,]
phylo_data$edge_lengths <- edge_lengths[edge_po]
phylo_data$nthres <- length(unique(phylo_data$Y)) - 1 phylo_data
The phylogenetic ordinal regression models also have to be adjusted to a binary outcome variable.
<- "
brownian_binary_code data {
int<lower=1> Ntrees;
int<lower=0> Nnodes;
int<lower=1> Nobservations;
int<lower=1> Nedges;
int<lower=1,upper=Nnodes> observations[Nobservations]; // indices of the observed tips
int<lower=1, upper=Nnodes> roots[Ntrees];
int<lower=1, upper=Nnodes> edges[Nedges, 2];
vector[Nedges] edge_lengths;
int<lower=0, upper=1> Y[Nobservations]; // response variable
}
parameters {
vector[Nnodes] z_normal;
real<lower=0> sigma;
real mu_root;
real<lower=0> sigma_root;
}
model {
vector[Nnodes] z = rep_vector(0, Nnodes);
target += exponential_lpdf(sigma | 1);
target += normal_lpdf(mu_root | 0, 1);
target += exponential_lpdf(sigma_root | 1);
target += normal_lpdf(z_normal | 0, 1);
for (i in 1:Ntrees) {
int idx = roots[i];
z[idx] = mu_root + z_normal[idx] * sigma_root;
}
for (i in 1:Nedges) {
int j = Nedges - i + 1;
int mother_node = edges[j, 1];
int daughter_node = edges[j, 2];
real local_mu = z[mother_node];
real local_sigma = sigma * sqrt(edge_lengths[j]);
z[daughter_node] = local_mu + z_normal[daughter_node] * local_sigma;
}
for (n in 1:Nobservations) {
target += bernoulli_logit_lpmf(Y[n] | z[observations[n]]);
}
}
generated quantities {
int y_prior[Nobservations]; // Prior predictive samples
int y_posterior[Nobservations]; // Posterior predictive samples
vector[Nnodes] z_prior = rep_vector(0, Nnodes);
vector[Nnodes] z_posterior = rep_vector(0, Nnodes);
real<lower=0> sigma_prior;
real mu_root_prior;
real<lower=0> sigma_root_prior;
real log_lik[Nobservations];
for (i in 1:Ntrees) {
int idx = roots[i];
z_posterior[idx] = mu_root + z_normal[idx] * sigma_root;
}
for (i in 1:Nedges) {
int j = Nedges - i + 1;
int mother_node = edges[j, 1];
int daughter_node = edges[j, 2];
real local_mu = z_posterior[mother_node];
real local_sigma = sigma * sqrt(edge_lengths[j]);
z_posterior[daughter_node] = local_mu + z_normal[daughter_node] * local_sigma;
}
sigma_prior = exponential_rng(1);
mu_root_prior = normal_rng(0, 1);
sigma_root_prior = exponential_rng(1);
for (i in 1:Ntrees) {
z_prior[roots[i]] = normal_rng(mu_root_prior, sigma_root_prior);
}
for (i in 1:Nedges) {
int j = Nedges - i + 1;
int mother_node = edges[j, 1];
int daughter_node = edges[j, 2];
real local_mu = z_posterior[mother_node];
real local_sigma = sigma * sqrt(edge_lengths[j]);
z_prior[daughter_node] = normal_rng(
z_prior[mother_node],
sigma_prior * sqrt(edge_lengths[j])
);
z_posterior[daughter_node] = local_mu + z_normal[daughter_node] * local_sigma;
}
for (n in 1:Nobservations) {
y_prior[n] = bernoulli_logit_rng(z_prior[observations[n]]);
y_posterior[n] = bernoulli_logit_rng(z_posterior[observations[n]]);
log_lik[n] = bernoulli_logit_lpmf(Y[n] | z_posterior[observations[n]]);
}
}
"
<- stan_model(model_code=brownian_binary_code) model_brownian_binary
<- rstan::sampling(
fit_brownian
model_brownian_binary,data = phylo_data,
chains = 4,
iter = 2000
)
print(fit_brownian, pars=c("mu_root", "sigma_root", "sigma", "lp__"))
Inference for Stan model: anon_model.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.
mean se_mean sd 2.5% 25% 50% 75% 97.5%
mu_root -1.05 0.01 0.22 -1.54 -1.19 -1.04 -0.90 -0.66
sigma_root 1.39 0.02 0.44 0.61 1.09 1.35 1.66 2.36
sigma 3.16 0.03 0.71 2.00 2.64 3.07 3.59 4.75
lp__ -3638.22 1.93 45.74 -3724.97 -3669.21 -3639.19 -3607.50 -3546.27
n_eff Rhat
mu_root 1405 1.00
sigma_root 573 1.01
sigma 491 1.00
lp__ 559 1.01
Samples were drawn using NUTS(diag_e) at Sat Mar 16 20:45:39 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
<- loo(fit_brownian)) (loo_brownian
Warning message:
“Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
”
Computed from 4000 by 1036 log-likelihood matrix.
Estimate SE
elpd_loo -597.7 13.1
p_loo 208.0 6.8
looic 1195.4 26.3
------
MCSE of elpd_loo is NA.
MCSE and ESS estimates assume MCMC draws (r_eff in [0.5, 1.9]).
Pareto k diagnostic values:
Count Pct. Min. ESS
(-Inf, 0.7] (good) 917 88.5% 391
(0.7, 1] (bad) 41 4.0% <NA>
(1, Inf) (very bad) 78 7.5% <NA>
See help('pareto-k-diagnostic') for details.
# Extract the generated quantities
<- extract(fit_brownian) generated_quantities
<- binary_data$Y
Y <- tibble(
empirical tv = total_variance(Y),
lv = lineage_wise_variance(Y, variable),
cv = crossfamily_variance(Y, variable)
)
<- tibble(
prior_pc tv = apply(generated_quantities$y_prior, 1, total_variance),
lv = apply(
$y_prior, 1, function(x) lineage_wise_variance(x, variable)
generated_quantities
),cv = apply(
$y_prior, 1, function(x) crossfamily_variance(x, variable)
generated_quantities
),
)predictive_plot(prior_pc)
Warning message in geom_point(aes(x = 1, y = empirical$tv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$lv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$cv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
<- tibble(
posterior_pc tv = apply(generated_quantities$y_posterior, 1, total_variance),
lv = apply(
$y_posterior, 1, function(x) lineage_wise_variance(x, variable)
generated_quantities
),cv = apply(
$y_posterior, 1, function(x) crossfamily_variance(x, variable)
generated_quantities
),
)predictive_plot(posterior_pc)
Warning message in geom_point(aes(x = 1, y = empirical$tv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$lv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$cv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
<- "
mixed_binary_code data {
int<lower=1> Ntrees;
int<lower=0> Nnodes;
int<lower=1> Nobservations;
int<lower=1> Nedges;
int<lower=1,upper=Nnodes> observations[Nobservations]; // indices of the observed tips
int<lower=1, upper=Nnodes> roots[Ntrees];
int<lower=1, upper=Nnodes> edges[Nedges, 2];
vector[Nedges] edge_lengths;
int<lower=0, upper=1> Y[Nobservations]; // response variable
}
parameters {
vector[Nnodes] z_normal;
real mu;
real<lower=0> sigma;
real mu_root;
real<lower=0> sigma_root;
real<lower=0> sigma_error;
vector[Nobservations] z_error;
}
model {
vector[Nnodes] z = rep_vector(0, Nnodes);
target += normal_lpdf(mu | 0, 10);
target += exponential_lpdf(sigma | 1);
target += normal_lpdf(mu_root | 0, 1);
target += exponential_lpdf(sigma_root | 1);
target += normal_lpdf(z_normal | 0, 1);
target += inv_gamma_lpdf(square(sigma_error) | 2, 0.1);
target += normal_lpdf(z_error | 0, sigma_error);
for (i in 1:Ntrees) {
int idx = roots[i];
z[idx] = mu_root + z_normal[idx] * sigma_root;
}
for (i in 1:Nedges) {
int j = Nedges - i + 1;
int mother_node = edges[j, 1];
int daughter_node = edges[j, 2];
real local_mu = z[mother_node];
real local_sigma = sigma * sqrt(edge_lengths[j]);
z[daughter_node] = local_mu + z_normal[daughter_node] * local_sigma;
}
for (n in 1:Nobservations) {
target += bernoulli_logit_lpmf(Y[n] | z[observations[n]] + z_error[n]);
}
}
generated quantities {
int y_prior[Nobservations]; // Prior predictive samples
int y_posterior[Nobservations]; // Posterior predictive samples
vector[Nnodes] z_prior = rep_vector(0, Nnodes);
vector[Nnodes] z_posterior = rep_vector(0, Nnodes);
real<lower=0> sigma_prior;
real mu_root_prior;
real<lower=0> sigma_root_prior;
real log_lik[Nobservations];
vector[Nobservations] z_error_prior;
real sigma_error_sq = 1 / gamma_rng(2, 10);
real sigma_error_prior = sqrt(sigma_error_sq);
for (n in 1:Nobservations) {
z_error_prior[n] = normal_rng(0, sigma_error_prior);
}
mu_root_prior = normal_rng(0, 1);
sigma_root_prior = exponential_rng(1);
sigma_prior = exponential_rng(1);
for (i in 1:Ntrees) {
z_prior[roots[i]] = normal_rng(mu_root_prior, sigma_root_prior);
z_posterior[roots[i]] = mu_root + z_normal[roots[i]] * sigma_root;
}
for (i in 1:Nedges) {
int j = Nedges - i + 1;
int mother_node = edges[j, 1];
int daughter_node = edges[j, 2];
real local_mu = z_posterior[mother_node];
real local_sigma = sigma * sqrt(edge_lengths[j]);
z_prior[daughter_node] = normal_rng(
z_prior[mother_node],
sigma_prior * sqrt(edge_lengths[j])
);
z_posterior[daughter_node] = local_mu + z_normal[daughter_node] * local_sigma;
}
for (n in 1:Nobservations) {
y_prior[n] = bernoulli_logit_rng(z_prior[observations[n]] + z_error_prior[n]);
y_posterior[n] = bernoulli_logit_rng(z_posterior[observations[n]] + z_error[n]);
log_lik[n] = bernoulli_logit_lpmf(Y[n] | z_posterior[observations[n]] + z_error[n]);
}
}
"
<- stan_model(model_code=mixed_binary_code) model_mixed_binary
<- readRDS("models/fit_mixed6.rds") fit_mixed6
# fit_mixed6 <- rstan::sampling(
# model_mixed_binary,
# data = phylo_data,
# chains = 4,
# iter=40000,
# thin=20,
# control=list(max_treedepth=15)
# )
# saveRDS(fit_mixed6, "models/fit_mixed6.rds")
print(fit_mixed6, pars=c("mu_root", "sigma_root", "sigma", "sigma_error", "lp__"))
Inference for Stan model: anon_model.
4 chains, each with iter=40000; warmup=20000; thin=20;
post-warmup draws per chain=1000, total post-warmup draws=4000.
mean se_mean sd 2.5% 25% 50% 75%
mu_root -1.08 0.00 0.23 -1.58 -1.21 -1.06 -0.92
sigma_root 1.41 0.01 0.45 0.61 1.10 1.39 1.69
sigma 3.19 0.01 0.71 2.01 2.69 3.10 3.60
sigma_error 0.24 0.01 0.09 0.12 0.17 0.22 0.28
lp__ -3550.26 26.78 359.14 -4338.65 -3768.59 -3515.73 -3298.47
97.5% n_eff Rhat
mu_root -0.67 3673 1.00
sigma_root 2.37 3922 1.00
sigma 4.79 3991 1.00
sigma_error 0.47 176 1.02
lp__ -2933.81 180 1.02
Samples were drawn using NUTS(diag_e) at Sun Mar 10 21:12:03 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
<- loo(fit_mixed6)) (loo_mixed
Warning message:
“Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
”
Computed from 4000 by 1036 log-likelihood matrix.
Estimate SE
elpd_loo -598.3 13.1
p_loo 214.4 6.8
looic 1196.7 26.2
------
MCSE of elpd_loo is NA.
MCSE and ESS estimates assume MCMC draws (r_eff in [0.7, 1.1]).
Pareto k diagnostic values:
Count Pct. Min. ESS
(-Inf, 0.7] (good) 993 95.8% 189
(0.7, 1] (bad) 42 4.1% <NA>
(1, Inf) (very bad) 1 0.1% <NA>
See help('pareto-k-diagnostic') for details.
# Extract the generated quantities
<- extract(fit_mixed6) generated_quantities
<- binary_data$Y
Y <- tibble(
empirical tv = total_variance(Y),
lv = lineage_wise_variance(Y, variable),
cv = crossfamily_variance(Y, variable)
)
<- tibble(
prior_pc tv = apply(generated_quantities$y_prior, 1, total_variance),
lv = apply(
$y_prior, 1, function(x) lineage_wise_variance(x, variable)
generated_quantities
),cv = apply(
$y_prior, 1, function(x) crossfamily_variance(x, variable)
generated_quantities
),
)predictive_plot(prior_pc)
Warning message in geom_point(aes(x = 1, y = empirical$tv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$lv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$cv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
<- tibble(
posterior_pc tv = apply(generated_quantities$y_posterior, 1, total_variance),
lv = apply(
$y_posterior, 1, function(x) lineage_wise_variance(x, variable)
generated_quantities
),cv = apply(
$y_posterior, 1, function(x) crossfamily_variance(x, variable)
generated_quantities
),
)predictive_plot(posterior_pc)
Warning message in geom_point(aes(x = 1, y = empirical$tv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$lv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$cv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
model comparison
tibble(model = c("binary", "brownian", "mixed"),
looic = c(
$estimates[3,1],
loo_binary$estimates[3,1],
loo_brownian$estimates[3,1]
loo_mixed
)%>%
) mutate(delta_looic = round(looic-min(looic), 3)) %>%
arrange(delta_looic)
model | looic | delta_looic |
---|---|---|
<chr> | <dbl> | <dbl> |
brownian | 1195.383 | 0.000 |
mixed | 1196.675 | 1.292 |
binary | 1331.269 | 135.885 |
Variable 7: crop_type
<- "crop_type" variable
<- d %>%
Y select(all_of(variable)) %>%
%>%
na.omit
pull<- match(Y, sort(unique(Y))) Y
Since there is no inherent order in the crop types, we need a logistic rather than an ordinal model.
<- list()
cardinal_data $Ndata <- length(Y)
cardinal_data$Nclasses <- max(Y)
cardinal_data$Y <- Y cardinal_data
<- "
cardinal_code data {
int<lower=1> Ndata;
int<lower=1> Nclasses;
int<lower=0, upper=Nclasses> Y[Ndata];
}
parameters {
vector[Nclasses] mu;
}
model {
mu ~ normal(0, 2);
for (n in 1:Ndata) {
Y[n] ~ categorical(softmax(mu));
}
}
generated quantities {
int y_prior[Ndata];
int y_posterior[Ndata];
vector[Ndata] log_lik;
vector[Nclasses] mu_prior;
for (k in 1:Nclasses) {
mu_prior[k ]= normal_rng(0, 2);
}
for (n in 1:Ndata) {
y_prior[n] = categorical_rng(softmax(mu_prior));
y_posterior[n] = categorical_rng(softmax(mu));
log_lik[n] = categorical_lpmf(Y[n] | softmax(mu));
}
}
"
<- stan_model(model_code = cardinal_code) model_cardinal
<- rstan::sampling(
fit_cardinal
model_cardinal,data = cardinal_data,
chains = 4,
iter = 2000,
)
print(fit_cardinal, pars = c("mu", "lp__"))
Inference for Stan model: anon_model.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff
mu[1] 1.27 0.04 0.83 -0.32 0.72 1.28 1.82 2.87 538
mu[2] -3.18 0.04 1.00 -5.24 -3.83 -3.14 -2.54 -1.28 598
mu[3] -2.42 0.04 0.92 -4.27 -3.03 -2.41 -1.79 -0.67 581
mu[4] 0.27 0.04 0.84 -1.33 -0.28 0.29 0.83 1.86 540
mu[5] 1.39 0.04 0.83 -0.20 0.83 1.40 1.93 2.99 537
mu[6] 2.32 0.04 0.83 0.74 1.78 2.34 2.87 3.92 536
lp__ -1324.64 0.05 1.83 -1329.11 -1325.62 -1324.28 -1323.31 -1322.15 1146
Rhat
mu[1] 1.00
mu[2] 1.01
mu[3] 1.00
mu[4] 1.00
mu[5] 1.01
mu[6] 1.01
lp__ 1.00
Samples were drawn using NUTS(diag_e) at Sat Mar 16 20:51:24 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
<- loo(fit_cardinal)) (loo_cardinal
Computed from 4000 by 1103 log-likelihood matrix.
Estimate SE
elpd_loo -1323.5 24.2
p_loo 4.9 0.7
looic 2647.0 48.3
------
MCSE of elpd_loo is 0.0.
MCSE and ESS estimates assume MCMC draws (r_eff in [0.6, 1.1]).
All Pareto k estimates are good (k < 0.7).
See help('pareto-k-diagnostic') for details.
# Extract the generated quantities
<- extract(fit_cardinal) generated_quantities
<- cardinal_data$Y
Y <- tibble(
empirical tv = total_variance(Y),
lv = lineage_wise_variance(Y, variable),
cv = crossfamily_variance(Y, variable)
)
<- tibble(
prior_pc tv = apply(generated_quantities$y_prior, 1, total_variance),
lv = apply(
$y_prior, 1, function(x) lineage_wise_variance(x, variable)
generated_quantities
),cv = apply(
$y_prior, 1, function(x) crossfamily_variance(x, variable)
generated_quantities
),
)predictive_plot(prior_pc)
Warning message in geom_point(aes(x = 1, y = empirical$tv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$lv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$cv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
<- tibble(
posterior_pc tv = apply(generated_quantities$y_posterior, 1, total_variance),
lv = apply(
$y_posterior, 1, function(x) lineage_wise_variance(x, variable)
generated_quantities
),cv = apply(
$y_posterior, 1, function(x) crossfamily_variance(x, variable)
generated_quantities
),
)predictive_plot(posterior_pc)
Warning message in geom_point(aes(x = 1, y = empirical$tv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$lv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$cv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
<- which(!is.na(d[[variable]]))
defined <- which(is.na(d[[variable]])) undefined
<- c()
po for (i in get_postorder(roots, edges)) {
if ((i %in% edges[, 2]) && !(i %in% undefined)) {
<- c(po, i)
po
}
}<- match(po, edges[, 2]) edge_po
<- list()
phylo_data $Ntrees <- length(trees)
phylo_data$Nnodes <- nrow(nodes)
phylo_data$Y <- Y
phylo_data$Nobservations <- length(phylo_data$Y)
phylo_data$Nedges <- length(edge_po)
phylo_data$observations <- match(pull(d[defined, "glottocode"]), nodes$label)
phylo_data$roots <- roots
phylo_data$edges <- edges[edge_po,]
phylo_data$edge_lengths <- edge_lengths[edge_po]
phylo_data$Nclasses <- length(unique(phylo_data$Y)) phylo_data
<- "
cardinal_brown_code data {
int<lower=1> Ntrees;
int<lower=0> Nnodes;
int<lower=1> Nobservations;
int<lower=1> Nedges;
int<lower=1> Nclasses;
int<lower=1,upper=Nnodes> observations[Nobservations];
int<lower=1, upper=Nnodes> roots[Ntrees];
int<lower=1, upper=Nnodes> edges[Nedges, 2];
vector[Nedges] edge_lengths;
int<lower=0, upper=Nclasses> Y[Nobservations];
}
parameters {
matrix[Nnodes, Nclasses] z;
vector[Nclasses] mu_root;
row_vector<lower=0>[Nclasses] sigma;
vector<lower=0>[Nclasses] sigma_root;
}
model {
mu_root ~ normal(0, 1);
sigma ~ exponential(1);
sigma_root ~ exponential(1);
for (i in 1:Ntrees) {
int idx = roots[i];
z[idx] ~ normal(mu_root, sigma_root);
}
for (i in 1:Nedges) {
int j = Nedges - i + 1;
int mother_node = edges[j, 1];
int daughter_node = edges[j, 2];
row_vector[Nclasses] local_mu = z[mother_node];
row_vector[Nclasses] local_sigma = sigma .* sqrt(edge_lengths[j]);
for (k in 1:Nclasses) {
z[daughter_node, k] ~ normal(local_mu[k], local_sigma[k]);
}
}
for (n in 1:Nobservations) {
Y[n] ~ categorical(softmax(to_vector(z[observations[n]])));
}
}
generated quantities {
int y_prior[Nobservations];
int y_posterior[Nobservations];
matrix[Nnodes, Nclasses] z_prior = rep_matrix(0, Nnodes, Nclasses);
row_vector[Nclasses] sigma_prior;
vector[Nclasses] mu_root_prior;
vector[Nclasses] sigma_root_prior;
vector[Nobservations] log_lik;
for (k in 1:Nclasses) {
sigma_prior[k] = exponential_rng(1);
mu_root_prior[k] = normal_rng(0, 1);
sigma_root_prior[k] = exponential_rng(1);
}
for (i in 1:Ntrees) {
int idx = roots[i];
for (k in 1:Nclasses) {
z_prior[idx, k] = normal_rng(mu_root_prior[k], sigma_root_prior[k]);
}
}
for (i in 1:Nedges) {
int j = Nedges - i + 1;
int mother_node = edges[j, 1];
int daughter_node = edges[j, 2];
row_vector[Nclasses] local_mu = z_prior[mother_node];
row_vector[Nclasses] local_sigma = sigma_prior .* sqrt(edge_lengths[j]);
for (k in 1:Nclasses) {
z_prior[daughter_node, k] = normal_rng(local_mu[k], local_sigma[k]);
}
}
for (n in 1:Nobservations) {
y_prior[n] = categorical_rng(softmax(to_vector(z_prior[observations[n]])));
y_posterior[n] = categorical_rng(softmax(to_vector(z[observations[n]])));
log_lik[n] = categorical_lpmf(Y[n] | softmax(to_vector(z[observations[n]])));
}
}
"
<- stan_model(model_code = cardinal_brown_code) model_cardinal_brownian
<- rstan::sampling(
fit_brownian_cardinal
model_cardinal_brownian,data = phylo_data,
chains = 4,
iter = 2000
)saveRDS(fit_brownian, "models/fit_cardinal_brownian.rds")
Warning message:
“There were 4000 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
https://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded”
Warning message:
“There were 3 chains where the estimated Bayesian Fraction of Missing Information was low. See
https://mc-stan.org/misc/warnings.html#bfmi-low”
Warning message:
“Examine the pairs() plot to diagnose sampling problems
”
Warning message:
“The largest R-hat is NA, indicating chains have not mixed.
Running the chains for more iterations may help. See
https://mc-stan.org/misc/warnings.html#r-hat”
Warning message:
“Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
Running the chains for more iterations may help. See
https://mc-stan.org/misc/warnings.html#bulk-ess”
Warning message:
“Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
Running the chains for more iterations may help. See
https://mc-stan.org/misc/warnings.html#tail-ess”
#fit_brownian_cardinal <- readRDS("models/fit_cardinal_brownian.rds")
print(fit_brownian_cardinal, pars = c("sigma", "mu_root", "sigma_root", "lp__"))
Inference for Stan model: anon_model.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.
mean se_mean sd 2.5% 25% 50% 75%
sigma[1] 4.38 0.89 1.27 2.22 3.65 4.80 5.37
sigma[2] 2.57 0.36 0.54 1.55 2.13 2.76 2.92
sigma[3] 2.28 0.40 0.58 1.63 1.83 1.99 2.71
sigma[4] 4.48 0.76 1.10 2.70 3.85 4.69 5.14
sigma[5] 3.60 0.52 0.76 2.33 2.90 3.53 4.38
sigma[6] 6.17 0.96 1.38 3.87 5.45 6.43 7.17
mu_root[1] 0.98 0.10 0.50 -0.04 0.65 0.93 1.33
mu_root[2] -2.48 0.14 0.24 -2.91 -2.67 -2.45 -2.27
mu_root[3] -2.06 0.40 0.59 -2.89 -2.68 -1.88 -1.58
mu_root[4] -0.69 0.43 0.62 -1.67 -0.95 -0.66 -0.31
mu_root[5] 1.27 0.88 1.28 -0.39 0.16 1.25 2.14
mu_root[6] 1.98 0.28 0.53 1.06 1.64 1.92 2.22
sigma_root[1] 6.95 0.92 1.42 4.75 5.83 6.64 8.04
sigma_root[2] 1.12 0.35 0.50 0.63 0.72 0.89 1.45
sigma_root[3] 1.10 0.32 0.47 0.51 0.70 0.96 1.46
sigma_root[4] 0.58 0.23 0.33 0.16 0.25 0.57 0.86
sigma_root[5] 3.12 0.66 0.97 1.78 2.18 3.03 4.06
sigma_root[6] 4.73 1.04 1.52 2.03 3.85 5.40 5.82
lp__ -5952.34 1675.64 2395.12 -8004.89 -7721.90 -7077.39 -5174.37
97.5% n_eff Rhat
sigma[1] 5.94 2 7.75
sigma[2] 3.44 2 4.41
sigma[3] 3.49 2 6.33
sigma[4] 6.29 2 6.06
sigma[5] 4.63 2 4.99
sigma[6] 8.23 2 7.68
mu_root[1] 1.95 24 1.24
mu_root[2] -2.13 3 2.08
mu_root[3] -1.26 2 4.55
mu_root[4] 0.29 2 7.28
mu_root[5] 3.38 2 4.74
mu_root[6] 3.19 3 1.98
sigma_root[1] 9.48 2 4.32
sigma_root[2] 2.10 2 6.42
sigma_root[3] 2.01 2 5.31
sigma_root[4] 1.13 2 5.34
sigma_root[5] 4.56 2 4.74
sigma_root[6] 6.33 2 5.92
lp__ -1227.60 2 11.19
Samples were drawn using NUTS(diag_e) at Sat Mar 16 22:35:53 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
<- loo(fit_brownian)) (loo_brownian
Warning message:
“Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
”
Computed from 4000 by 1036 log-likelihood matrix.
Estimate SE
elpd_loo -597.7 13.1
p_loo 208.0 6.8
looic 1195.4 26.3
------
MCSE of elpd_loo is NA.
MCSE and ESS estimates assume MCMC draws (r_eff in [0.5, 1.9]).
Pareto k diagnostic values:
Count Pct. Min. ESS
(-Inf, 0.7] (good) 917 88.5% 391
(0.7, 1] (bad) 41 4.0% <NA>
(1, Inf) (very bad) 78 7.5% <NA>
See help('pareto-k-diagnostic') for details.
# Extract the generated quantities
<- extract(fit_brownian_cardinal) generated_quantities
<- cardinal_data$Y
Y <- tibble(
empirical tv = total_variance(Y),
lv = lineage_wise_variance(Y, variable),
cv = crossfamily_variance(Y, variable)
)
<- tibble(
prior_pc tv = apply(generated_quantities$y_prior, 1, total_variance),
lv = apply(
$y_prior, 1, function(x) lineage_wise_variance(x, variable)
generated_quantities
),cv = apply(
$y_prior, 1, function(x) crossfamily_variance(x, variable)
generated_quantities
),
)predictive_plot(prior_pc)
Warning message in geom_point(aes(x = 1, y = empirical$tv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$lv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$cv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
<- tibble(
posterior_pc tv = apply(generated_quantities$y_posterior, 1, total_variance),
lv = apply(
$y_posterior, 1, function(x) lineage_wise_variance(x, variable)
generated_quantities
),cv = apply(
$y_posterior, 1, function(x) crossfamily_variance(x, variable)
generated_quantities
),
)predictive_plot(posterior_pc)
Warning message in geom_point(aes(x = 1, y = empirical$tv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$lv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$cv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
<- "
cardinal_mixed_code data {
int<lower=1> Ntrees;
int<lower=0> Nnodes;
int<lower=1> Nobservations;
int<lower=1> Nedges;
int<lower=1> Nclasses;
int<lower=1,upper=Nnodes> observations[Nobservations];
int<lower=1, upper=Nnodes> roots[Ntrees];
int<lower=1, upper=Nnodes> edges[Nedges, 2];
vector[Nedges] edge_lengths;
int<lower=0, upper=Nclasses> Y[Nobservations];
}
parameters {
matrix[Nnodes, Nclasses] z;
vector[Nclasses] mu_root;
row_vector<lower=0>[Nclasses] sigma;
vector<lower=0>[Nclasses] sigma_root;
matrix[Nobservations, Nclasses] z_error;
vector<lower=0>[Nclasses] sigma_error;
}
model {
mu_root ~ normal(0, 1);
sigma ~ exponential(1);
sigma_root ~ exponential(1);
sigma_error ~ exponential(1);
for (i in 1:Nobservations) {
z_error[i] ~ normal(0, sigma_error);
}
for (i in 1:Ntrees) {
int idx = roots[i];
z[idx] ~ normal(mu_root, sigma_root);
}
for (i in 1:Nedges) {
int j = Nedges - i + 1;
int mother_node = edges[j, 1];
int daughter_node = edges[j, 2];
row_vector[Nclasses] local_mu = z[mother_node];
row_vector[Nclasses] local_sigma = sigma .* sqrt(edge_lengths[j]);
for (k in 1:Nclasses) {
z[daughter_node, k] ~ normal(local_mu[k], local_sigma[k]);
}
}
for (n in 1:Nobservations) {
Y[n] ~ categorical(softmax(to_vector(
z[observations[n]] + z_error[n]
)));
}
}
generated quantities {
int y_prior[Nobservations];
int y_posterior[Nobservations];
matrix[Nnodes, Nclasses] z_prior = rep_matrix(0, Nnodes, Nclasses);
row_vector[Nclasses] sigma_prior;
vector[Nclasses] mu_root_prior;
vector[Nclasses] sigma_root_prior;
matrix[Nobservations, Nclasses] z_error_prior;
vector<lower=0>[Nclasses] sigma_error_prior;
vector[Nobservations] log_lik;
for (k in 1:Nclasses) {
sigma_prior[k] = exponential_rng(1);
mu_root_prior[k] = normal_rng(0, 1);
sigma_root_prior[k] = exponential_rng(1);
sigma_error_prior[k] = exponential_rng(1);
for (i in 1:Nobservations) {
z_error_prior[i, k] = normal_rng(0, sigma_error_prior[k]);
}
}
for (i in 1:Ntrees) {
int idx = roots[i];
for (k in 1:Nclasses) {
z_prior[idx, k] = normal_rng(mu_root_prior[k], sigma_root_prior[k]);
}
}
for (i in 1:Nedges) {
int j = Nedges - i + 1;
int mother_node = edges[j, 1];
int daughter_node = edges[j, 2];
row_vector[Nclasses] local_mu = z_prior[mother_node];
row_vector[Nclasses] local_sigma = sigma_prior .* sqrt(edge_lengths[j]);
for (k in 1:Nclasses) {
z_prior[daughter_node, k] = normal_rng(local_mu[k], local_sigma[k]);
}
}
for (n in 1:Nobservations) {
y_prior[n] = categorical_rng(softmax(to_vector(
z_prior[observations[n]] + z_error_prior[n]
)));
y_posterior[n] = categorical_rng(softmax(to_vector(
z[observations[n]] + z_error[n]
)));
log_lik[n] = categorical_lpmf(Y[n] | softmax(to_vector(
z[observations[n]] + z_error[n]
)));
}
}
"
<- stan_model(model_code = cardinal_mixed_code) model_mixed7
<- readRDS("models/fit_mixed7.rds") fit_mixed7
# fit_mixed7 <- rstan::sampling(
# model_mixed7,
# data = phylo_data,
# chains = 4,
# iter = 2000,
# thin = 1,
# control = list(max_treedepth=15)
# )
#saveRDS(fit_mixed7, "models/fit_mixed7.rds")
print(fit_mixed7, pars=c("mu_root", "sigma_root", "sigma", "sigma_error", "lp__"))
Inference for Stan model: anon_model.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.
mean se_mean sd 2.5% 25% 50% 75%
mu_root[1] 1.53 0.39 0.79 -0.09 0.98 1.59 2.07
mu_root[2] -2.36 0.33 0.48 -3.08 -2.75 -2.48 -1.89
mu_root[3] -2.22 0.44 0.63 -3.03 -2.75 -2.39 -1.75
mu_root[4] -0.31 0.69 0.98 -1.84 -1.01 -0.19 0.37
mu_root[5] 1.64 0.30 0.54 0.65 1.21 1.66 2.06
mu_root[6] 2.22 0.40 0.75 0.57 1.78 2.35 2.75
sigma_root[1] 7.72 1.30 1.91 5.31 6.30 7.11 8.43
sigma_root[2] 0.97 0.41 0.59 0.23 0.44 0.90 1.40
sigma_root[3] 1.22 0.43 0.62 0.44 0.63 1.12 1.69
sigma_root[4] 0.73 0.32 0.49 0.25 0.38 0.57 0.81
sigma_root[5] 3.86 0.54 0.81 2.58 3.09 3.92 4.62
sigma_root[6] 6.17 0.28 0.49 5.10 5.83 6.24 6.55
sigma[1] 5.39 0.66 0.96 4.21 4.52 5.40 5.96
sigma[2] 3.39 0.62 0.88 2.22 2.75 3.22 3.91
sigma[3] 2.89 0.45 0.65 2.13 2.24 2.80 3.53
sigma[4] 5.09 0.97 1.39 3.36 4.05 4.69 6.14
sigma[5] 3.82 1.05 1.49 1.76 2.85 3.61 4.62
sigma[6] 5.60 1.05 1.51 3.33 4.56 5.62 6.42
sigma_error[1] 0.68 0.13 0.19 0.35 0.52 0.72 0.86
sigma_error[2] 0.75 0.25 0.36 0.29 0.45 0.70 1.10
sigma_error[3] 0.62 0.12 0.18 0.29 0.54 0.70 0.73
sigma_error[4] 1.04 0.22 0.31 0.65 0.83 0.97 1.15
sigma_error[5] 1.03 0.04 0.10 0.88 0.96 1.00 1.08
sigma_error[6] 1.62 0.42 0.60 0.88 1.01 1.63 2.21
lp__ -9806.20 1631.45 2331.40 -12416.44 -11840.28 -10335.07 -8565.76
97.5% n_eff Rhat
mu_root[1] 2.97 4 1.57
mu_root[2] -1.68 2 6.05
mu_root[3] -1.18 2 6.60
mu_root[4] 1.03 2 9.25
mu_root[5] 2.63 3 2.01
mu_root[6] 3.47 3 1.91
sigma_root[1] 11.63 2 4.75
sigma_root[2] 2.00 2 6.72
sigma_root[3] 2.28 2 7.86
sigma_root[4] 1.97 2 5.86
sigma_root[5] 5.08 2 3.48
sigma_root[6] 6.95 3 2.02
sigma[1] 7.21 2 5.83
sigma[2] 4.93 2 9.52
sigma[3] 3.89 2 6.08
sigma[4] 7.61 2 7.43
sigma[5] 6.33 2 10.43
sigma[6] 7.99 2 8.17
sigma_error[1] 0.92 2 8.14
sigma_error[2] 1.31 2 10.38
sigma_error[3] 0.83 2 7.90
sigma_error[4] 1.62 2 9.82
sigma_error[5] 1.27 7 2.21
sigma_error[6] 2.38 2 10.29
lp__ -5819.33 2 10.72
Samples were drawn using NUTS(diag_e) at Sat Mar 16 07:19:14 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
<- loo(fit_mixed7)) (loo_mixed
Warning message:
“Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
”
Computed from 4000 by 1103 log-likelihood matrix.
Estimate SE
elpd_loo -546.6 24.2
p_loo 374.9 18.7
looic 1093.1 48.5
------
MCSE of elpd_loo is NA.
MCSE and ESS estimates assume MCMC draws (r_eff in [0.0, 1.0]).
Pareto k diagnostic values:
Count Pct. Min. ESS
(-Inf, 0.7] (good) 302 27.4% 0
(0.7, 1] (bad) 365 33.1% <NA>
(1, Inf) (very bad) 436 39.5% <NA>
See help('pareto-k-diagnostic') for details.
# Extract the generated quantities
<- extract(fit_mixed7) generated_quantities
<- cardinal_data$Y
Y <- tibble(
empirical tv = total_variance(Y),
lv = lineage_wise_variance(Y, variable),
cv = crossfamily_variance(Y, variable)
)
<- tibble(
prior_pc tv = apply(generated_quantities$y_prior, 1, total_variance),
lv = apply(
$y_prior, 1, function(x) lineage_wise_variance(x, variable)
generated_quantities
),cv = apply(
$y_prior, 1, function(x) crossfamily_variance(x, variable)
generated_quantities
),
)predictive_plot(prior_pc)
Warning message in geom_point(aes(x = 1, y = empirical$tv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$lv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$cv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
<- tibble(
posterior_pc tv = apply(generated_quantities$y_posterior, 1, total_variance),
lv = apply(
$y_posterior, 1, function(x) lineage_wise_variance(x, variable)
generated_quantities
),cv = apply(
$y_posterior, 1, function(x) crossfamily_variance(x, variable)
generated_quantities
),
)predictive_plot(posterior_pc)
Warning message in geom_point(aes(x = 1, y = empirical$tv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$lv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$cv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
model comparison
tibble(model = c("cardinal", "brownian", "mixed"),
looic = c(
$estimates[3,1],
loo_cardinal$estimates[3,1],
loo_brownian$estimates[3,1]
loo_mixed
)%>%
) mutate(delta_looic = round(looic-min(looic), 3)) %>%
arrange(delta_looic)
model | looic | delta_looic |
---|---|---|
<chr> | <dbl> | <dbl> |
mixed | 1093.123 | 0.000 |
brownian | 1195.383 | 102.261 |
cardinal | 2646.981 | 1553.858 |
wrapping up
For all seven variables, the mixed model is clearly better than the non-phylogenetic logistic model, and it is about as good as the Brownian motion model. It is thus justified to work with for downstream tasks.
<- list(
z_error extract(fit_mixed1, pars="z_error")$z_error,
extract(fit_mixed2, pars="z_error")$z_error,
extract(fit_mixed3, pars="z_error")$z_error,
extract(fit_mixed4, pars="z_error")$z_error,
extract(fit_mixed5, pars="z_error")$z_error,
extract(fit_mixed6, pars="z_error")$z_error,
extract(fit_mixed7, pars="z_error")$z_error
)
<- colnames(d)[3:9] variables
# for each i in 1:6, create a new column in d and fill those rows where d[[variables[i]]] is NA with NA, and the other rows with the posterior mean of z_error[[i]]
for (i in 1:6) {
paste0(variables[i], "_residuals")]] <- NA
d[[paste0(variables[i], "_residuals")]][!is.na(d[[variables[i]]])] <- apply(z_error[[i]], 2, mean)
d[[ }
<- apply(extract(fit_mixed7, pars="z_error")$z_error, c(2,3), mean) z_error_7
<- paste(variables[7], 1:6, 'residuals', sep="_") z7_names
for (i in 1:6) {
<- NA
d[[z7_names[i]]] !is.na(d[[variables[7]]])] <- z_error_7[,i]
d[[z7_names[i]]][ }
write_csv(d, "../data/data_with_latent_variables.csv")