Finding the best marginal model for each variable

Before I investigate the correlations between the variables, I want to check which phylogenetic model provides the best fit of the individual variables.

All variables are ordinal.

I will test two models for each variable:

ordinal regression
Poisson regression

Both will be done in a non-phylogenetic and a phylogenetic setting.

For all models, Bayesian inference will be used, implemented in Stan.

preparation

Im my experience, R is the best environment for working with Stan.

I start with loading relevant packages.

library(repr)
options(repr.plot.width=12, repr.plot.height=10)

library(tidyverse)
library(gridExtra)
library(patchwork)
library(stringr)
library(reshape2)
library(rstan)
library(bridgesampling)
library(loo)
library(ape)
library(bayesplot)
library(shinystan)
library(coda)
library(codetools)
library(brms)
library(abind)

options(mc.cores = parallel::detectCores())
rstan_options("auto_write" = TRUE)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.0     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Attaching package: ‘gridExtra’


The following object is masked from ‘package:dplyr’:

    combine



Attaching package: ‘reshape2’


The following object is masked from ‘package:tidyr’:

    smiths


Loading required package: StanHeaders


rstan version 2.32.6 (Stan version 2.32.2)


For execution on a local, multicore CPU with excess RAM we recommend calling
options(mc.cores = parallel::detectCores()).
To avoid recompilation of unchanged Stan programs, we recommend calling
rstan_options(auto_write = TRUE)
For within-chain threading using `reduce_sum()` or `map_rect()` Stan functions,
change `threads_per_chain` option:
rstan_options(threads_per_chain = 1)



Attaching package: ‘rstan’


The following object is masked from ‘package:tidyr’:

    extract


This is loo version 2.7.0

- Online documentation and vignettes at mc-stan.org/loo

- As of v2.0.0 loo defaults to 1 core but we recommend using as many as possible. Use the 'cores' argument or set options(mc.cores = NUM_CORES) for an entire session. 


Attaching package: ‘loo’


The following object is masked from ‘package:rstan’:

    loo



Attaching package: ‘ape’


The following object is masked from ‘package:dplyr’:

    where


This is bayesplot version 1.11.1

- Online documentation and vignettes at mc-stan.org/bayesplot

- bayesplot theme set to bayesplot::theme_default()

   * Does _not_ affect other ggplot2 plots

   * See ?bayesplot_theme_set for details on theme setting

Loading required package: shiny


This is shinystan version 2.6.0



Attaching package: ‘coda’


The following object is masked from ‘package:rstan’:

    traceplot


Loading required package: Rcpp

Loading 'brms' package (version 2.20.4). Useful instructions
can be found by typing help('brms'). A more detailed introduction
to the package is available through vignette('brms_overview').


Attaching package: ‘brms’


The following object is masked from ‘package:bayesplot’:

    rhat


The following object is masked from ‘package:bridgesampling’:

    bf


The following object is masked from ‘package:rstan’:

    loo


The following object is masked from ‘package:stats’:

    ar

karminrot <- rgb(206/255, 143/255, 137/255)
gold <- rgb(217/255, 205/255, 177/255)
anthrazit <- rgb(143/255, 143/255, 149/255)

Next, the data are loaded into a tibble called d.

d <- read_csv("../data/data_with_families.csv") %>%
    distinct(glottocode, .keep_all = TRUE)
taxa <- d$glottocode
head(d)

Rows: 1291 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): soc_id, glottocode, Family
dbl (7): political_complexity, hierarchy_within, domestic_organisation, agri...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

A tibble: 6 × 10
soc_id	glottocode	political_complexity	hierarchy_within	domestic_organisation	agricultureLevel	settlement_strategy	exogamy	crop_type	Family
<chr>	<chr>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<chr>
Aa1	juho1239	1	3	3	0	1	1	1	Kxa
Aa2	okie1245	1	2	1	0	1	1	1	Nilotic
Aa3	nama1265	2	2	1	0	1	1	1	Khoe_Kwadi
Aa4	dama1270	NA	NA	2	0	NA	1	NA	Khoe_Kwadi
Aa5	bila1255	1	2	1	0	1	1	1	Atlantic_Congo
Aa6	sand1273	2	2	1	5	3	1	6	Sandawe

In the next step, the maximum clade credibility trees are loaded as phylo objects.

mcc_trees <- list.files("../data/mcc_trees", pattern = ".tre", full.names = TRUE)

trees <- purrr::map(mcc_trees, read.tree)

# get a vector of tip labels of all trees in trees

tips <- purrr::map(trees, function(x) x$tip.label) %>% unlist %>% unique

A brief sanity check to ensure that all tips of the trees correspond to a Glottocode in d.

if (length(setdiff(tips, taxa)) == 0 ) {
    print("All tips in trees are in taxa")
} else {
    print("Not all tips in trees are in taxa")
}

[1] "All tips in trees are in taxa"

All Glottocodes in d that do not occur as a tip are isolates (for the purpose of this study).

For each isolate I create a list object with the fields "tip.lable" and "Nnode", to make them compatible with the phylo objects for language families. These list objects are added to the list trees.

isolates <- setdiff(taxa, tips)

# convert isolates to one-node trees and add to trees
for (i in isolates) {
    tr <- list()
    tr[["tip.label"]] <- i
    tr[["Nnode"]] <- 0
    trees[[length(trees) + 1]] <- tr
}

Just to make sure, I test whether the tips of this forest are exactly the Glottocodes in the data.

tips <- c()
for (tr in trees) {
    tips <- c(tips, tr$tip.label)
}

if (all(sort(tips) == sort(taxa))) {
    print("taxa and tips are identical")
} else {
    print("taxa and tips are not identical")
}

[1] "taxa and tips are identical"

Now I re-order the rows in d to make sure they are in the same order as tips.

d <- d[match(tips, d$glottocode), ]

The trees in trees form a forest, i.e., a disconnected directed acyclic graph. I want to construct a representation of this graph which I can pass to Stan, consisting of

the number of nodes N,
a Nx2-matrix edges, and
a length-N vector of edge lengths, edge_lengths.
a vector of root nodes
a vector of tip nodes

First I create an Nx4 matrix where each row represents one node. The columns are

node number
tree number
tree-internal node number
label

For internal nodes, the label is the empty string.

node_matrix <- c()
node_number <- 1
for (tn in 1:length(trees)) {
    tr <- trees[[tn]]
    tr_nodes <- length(tr$tip.label) + tr$Nnode
    for (i in 1:tr_nodes) {
        tl <- ""
        if (i <= length(tr$tip.label)) {
            tl <- tr$tip.label[i]
        }
        node_matrix <- rbind(node_matrix, c(node_number, tn, i, tl))
        node_number <- node_number+1
    }
}

I convert the matrix to a tibble.

nodes <- tibble(
    node_number = as.numeric(node_matrix[,1]),
    tree_number = as.numeric(node_matrix[,2]),
    internal_number = as.numeric(node_matrix[,3]),
    label = node_matrix[,4]
)
head(nodes)
tail(nodes)

A tibble: 6 × 4
node_number	tree_number	internal_number	label
<dbl>	<dbl>	<dbl>	<chr>
1	1	1	kron1241
2	1	2	tuli1249
3	1	3
4	2	1	zarm1239
5	2	2	koyr1242
6	2	3

A tibble: 6 × 4
node_number	tree_number	internal_number	label
<dbl>	<dbl>	<dbl>	<chr>
2231	179	1	qawa1238
2232	180	1	cham1315
2233	181	1	nort2971
2234	182	1	trum1247
2235	183	1	sout2994
2236	184	1	guat1253

Next I define a function that maps a tree number and an internal node number to the global node number.

get_global_id <- function(tree_number, local_id) which((nodes$tree_number == tree_number) & (nodes$internal_number == local_id))

Now I create a matrix of edges and, concomittantly, a vector of edge lengths.

edges <- c()
edge_lengths <- c()
for (tn in 1:length(trees)) {
    tr <- trees[[tn]]
    if (length(tr$edge) > 0 ) {
        for (en in 1:nrow(tr$edge)) {
            mother <- tr$edge[en, 1]
            daughter <- tr$edge[en, 2]
            mother_id <- get_global_id(tn, mother)
            daughter_id <- get_global_id(tn, daughter)
            el <- tr$edge.length[en]
            edges <- rbind(edges, c(mother_id, daughter_id))
            edge_lengths <- c(edge_lengths, el)
        }
    }
}

Finally I identify the root nodes and tip nodes.

roots <- c(sort(setdiff(edges[,1], edges[,2])), match(isolates, nodes$label))
tips <- which(nodes$label != "")

Next I define a function computing the post-order.

get_postorder <- function(roots, edges) {
    input <- roots
    output <- c()
    while (length(input) > 0) {
        pivot <- input[1]
        daughters <- setdiff(edges[edges[,1] == pivot, 2], output)
        if (length(daughters) == 0) {
            input <- input[-1]
            output <- c(output, pivot)
        } else {
            input <- c(daughters, input)
        }
    }
    return(output)
}

orrdinal regression

For orrdinal regression, the model specifications are as follows. \(K\) is the number of possible values.

\[ \begin{aligned} C_k &\sim \mathcal N(0, 5)\\ Y_i=1 &\sim \mathrm{Bernoulli}(\mathrm{logit}^{-1}(C_1))\\ Y_i=k &\sim \mathrm{Bernoulli}(\mathrm{logit}^{-1}(C_k) - \mathrm{logit}^{-1}(C_{k-1}))\\ Y_K=K &\sim \mathrm{Bernoulli}(1-\mathrm{logit}^{-1}(C_{K-1}))\\ \end{aligned} \]

Variable 1: Political complexity

variable <- "political_complexity"

Y <- d %>%
    select(all_of(variable)) %>%
    na.omit %>%
    pull
Y <- match(Y, sort(unique(Y)))

ordinal_data <- list()
ordinal_data$Ndata <- length(Y)
ordinal_data$nthres <- max(Y) -1
ordinal_data$Y <- Y

I generated Stan code for ordinal regression from brms and then did some manual adjustments.

ordinal_code <- "
data {
  int<lower=1> Ndata;  // total number of observations
  int Y[Ndata];  // response variable
  int<lower=2> nthres;  // number of thresholds
}
parameters {
  ordered[nthres] Intercept;  // temporary thresholds for centered predictors
}

model {
  for (n in 1:Ndata) {
    target += ordered_logistic_lpmf(Y[n] | 0, Intercept);
  }
  target += normal_lpdf(Intercept | 0, 5);
}

generated quantities {
  int y_prior[Ndata];
  int y_posterior[Ndata];
  vector[nthres] Intercept_sample;
  ordered[nthres] Intercept_prior;
  vector[Ndata] log_lik;
  for (k in 1:nthres) {
    Intercept_sample[k] = normal_rng(0, 5);
  }
  Intercept_prior = sort_asc(Intercept_sample);
  for (n in 1:Ndata) {
    y_prior[n] = ordered_logistic_rng(0, Intercept_prior);
    y_posterior[n] = ordered_logistic_rng(0, Intercept);
    log_lik[n] = ordered_logistic_lpmf(Y[n] | 0, Intercept);
  }
}

"

model_ordinal <- stan_model(model_code=ordinal_code)

fit_ordinal <- rstan::sampling(
    model_ordinal,
    data=ordinal_data,
    chains=4,
    iter=2000,
)

print(fit_ordinal, pars=c("Intercept", "lp__"))

Inference for Stan model: anon_model.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

                 mean se_mean   sd     2.5%      25%      50%      75%    97.5%
Intercept[1]    -0.25    0.00 0.06    -0.37    -0.29    -0.24    -0.20    -0.12
Intercept[2]     1.06    0.00 0.07     0.93     1.01     1.06     1.11     1.20
Intercept[3]     2.06    0.00 0.10     1.87     2.00     2.06     2.13     2.26
Intercept[4]     3.17    0.00 0.16     2.88     3.07     3.17     3.28     3.50
lp__         -1437.01    0.03 1.46 -1440.74 -1437.73 -1436.66 -1435.94 -1435.20
             n_eff Rhat
Intercept[1]  3788    1
Intercept[2]  5320    1
Intercept[3]  5574    1
Intercept[4]  4919    1
lp__          1775    1

Samples were drawn using NUTS(diag_e) at Sat Mar 16 20:09:19 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

(loo_ordinal <- loo(fit_ordinal))


Computed from 4000 by 1077 log-likelihood matrix.

         Estimate   SE
elpd_loo  -1429.0 21.8
p_loo         4.1  0.2
looic      2858.0 43.6
------
MCSE of elpd_loo is 0.0.
MCSE and ESS estimates assume MCMC draws (r_eff in [0.8, 1.2]).

All Pareto k estimates are good (k < 0.7).
See help('pareto-k-diagnostic') for details.

# Extract the generated quantities
generated_quantities <- extract(fit_ordinal)

Predictive checks

Following Jäger & Wahle (2021), I define three statistics to be used in prior and posterior predictive checks:

total variance (tv)
mean lineage-wise variance (lv)
cross-family variance (cv)

Here are the corresponding R functions.

total_variance <- function(Y) var(Y)

lineage_wise_variance <- function(Y, variable=variable) {
    out <- d %>% 
        filter(!is.na(d[[variable]])) %>%
        mutate(Y = Y) %>%
        group_by(Family) %>%
        summarize(variance = var(Y, na.rm = TRUE)) %>%
        summarize(mean_variance = mean(variance, na.rm = TRUE)) %>%
        pull
    return(out)
}

crossfamily_variance <- function(Y, variable=variable) {
    out <- d %>% 
        filter(!is.na(d[[variable]])) %>%
        mutate(Y = Y) %>%
        group_by(Family) %>%
        summarize(centroids = mean(Y, na.rm = TRUE)) %>%
        pull(centroids) %>%
        var
    return(out)
}

First we get the empirical values.

Y <- ordinal_data$Y
empirical <- tibble(
    tv = total_variance(Y),
    lv = lineage_wise_variance(Y, variable),
    cv = crossfamily_variance(Y, variable)
)

Thanks to the generated quantities block in the Stan programm, fitting the model simultaneously produced prior and posterior predictive samples.

prior_pc <- tibble(
    tv = apply(generated_quantities$y_prior, 1, total_variance),
    lv = apply(
        generated_quantities$y_prior, 1, function(x) lineage_wise_variance(x, variable)
    ),
    cv = apply(
        generated_quantities$y_prior, 1, function(x) crossfamily_variance(x, variable)
    ),
)

predictive_plot <- function(prior_pc) {
    p1 <- prior_pc %>%
      mutate(
          x_jittered = 1 + rnorm(n(), mean = 0, sd = 0.01),
          y_jittered = tv + rnorm(n(), mean = 0, sd = sd(prior_pc$tv)/10)
      ) %>%
      ggplot(aes(x = x_jittered, y = tv)) +
      geom_boxplot(aes(x = factor(1), y = tv), width = 0.5, fill=karminrot) +
      geom_point(alpha = 0.5, size = 0.1, position = position_identity()) +
      theme_grey(base_size = 16) +
      theme(
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_text(size = 18),
        plot.title = element_text(size = 20, hjust = 0.5)
      ) +
      labs(title = "total variance", y = "") +
      geom_point(aes(x=1, y=empirical$tv, color="empirical"), size=5, fill='red') +
      guides(color = "none")

    p2 <- prior_pc %>%
      mutate(
          x_jittered = 1 + rnorm(n(), mean = 0, sd = 0.01),
          y_jittered = lv + rnorm(n(), mean = 0, sd = sd(prior_pc$lv)/10)
      ) %>%
      ggplot(aes(x = x_jittered, y = lv)) +
      geom_boxplot(aes(x = factor(1), y = lv), width = 0.5, fill=gold) +
      geom_point(alpha = 0.5, size = 0.1, position = position_identity()) +
      theme_grey(base_size = 16) +
      theme(
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_text(size = 18),
        plot.title = element_text(size = 20, hjust = 0.5)
      ) +
      labs(title = "lineage-wise variance", y = "") +
      geom_point(aes(x=1, y=empirical$lv, color="empirical"), size=5, fill='red') +
      guides(color = "none")

    p3 <- prior_pc %>%
      mutate(
          x_jittered = 1 + rnorm(n(), mean = 0, sd = 0.01),
          y_jittered = cv + rnorm(n(), mean = 0, sd = sd(prior_pc$cv)/10)
      ) %>%
      ggplot(aes(x = x_jittered, y = cv)) +
      geom_boxplot(aes(x = factor(1), y = cv), width = 0.5, fill=anthrazit) +
      geom_point(alpha = 0.5, size = 0.1, position = position_identity()) +
      theme_grey(base_size = 16) +
      theme(
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_text(size = 18),
        plot.title = element_text(size = 20, hjust = 0.5)
      ) +
      labs(title = "cross-lineage variance", y = "") +
      geom_point(aes(x=1, y=empirical$cv, color="empirical"), size=5, fill='red') +
      labs(color = "") +
      guides(color = "none")

    return(grid.arrange(p1, p2, p3, ncol=3))
}

predictive_plot(prior_pc)

Warning message in geom_point(aes(x = 1, y = empirical$tv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$lv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$cv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”

We see that the prior cover the empirical values quite well. Here is what happens after the model is fitted:

posterior_pc <- tibble(
    tv = apply(generated_quantities$y_posterior, 1, total_variance),
    lv = apply(
        generated_quantities$y_posterior, 1, function(x) lineage_wise_variance(x, variable)
    ),
    cv = apply(
        generated_quantities$y_posterior, 1, function(x) crossfamily_variance(x, variable)
    ),
)
predictive_plot(posterior_pc)

Warning message in geom_point(aes(x = 1, y = empirical$tv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$lv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$cv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”

Something is seriously wrong with this model, obviously.

Setting up a phylogenetic model

Brownian motion

Next I will consider the phylogenetic Brownian motion model, combined with a logistic component. Here the basic assumptions are

there is a continuous latent variable \(z\) which has a value at each node
at the root nodes, \(z\) is drawn from a distribution like \(\mathcal N(0, 1)\)
once the value of \(z\) at a mother node is fixed, the values at the daughter nodes is drawn from \(\mathcal N(z_i, \sigma t_j)\), where \(z_i\) is the value at the root and \(t_j\) is the branch length to the daughter
the observed values \(y_i\) at tip \(i\) are distributed as

\[ \begin{aligned} Y_i=1 &\sim \mathrm{Bernoulli}(\mathrm{logit}^{-1}(C_1-z_i))\\ Y_i=k &\sim \mathrm{Bernoulli}(\mathrm{logit}^{-1}(C_k-z_i) - \mathrm{logit}^{-1}(C_{k-1}-z_i))\\ Y_K=K &\sim \mathrm{Bernoulli}(1-\mathrm{logit}^{-1}(C_{K-1}-z_i))\\ \end{aligned} \]

The parameter \(\sigma\) can be interpreted as the rate of evolution.

defined <- which(!is.na(d[[variable]]))
undefined <- which(is.na(d[[variable]]))

po <- c()
for (i in get_postorder(roots, edges)) {
    if ((i %in% edges[,2]) & !(i %in% undefined)) {
        po <- c(po, i)
    }
}
edge_po <- match(po, edges[,2])

phylo_data = list()
phylo_data$Ntrees <- length(trees)
phylo_data$Nnodes <- nrow(nodes)
phylo_data$Y <- ordinal_data$Y
phylo_data$Nobservations <- length(phylo_data$Y)
phylo_data$Nedges <- length(edge_po)
phylo_data$observations <- match(pull(d[defined, "glottocode"]), nodes$label)
phylo_data$roots <- roots
phylo_data$edges <- edges[edge_po,]
phylo_data$edge_lengths <- edge_lengths[edge_po]
phylo_data$nthres <- length(unique(phylo_data$Y)) - 1

brown_code <- "
data {
    int<lower=1> Ntrees;
    int<lower=0> Nnodes;
    int<lower=1> Nobservations;
    int<lower=1> Nedges;
    int<lower=1,upper=Nnodes> observations[Nobservations]; // indices of the observed tips
    int<lower=1, upper=Nnodes> roots[Ntrees];
    int<lower=1, upper=Nnodes> edges[Nedges, 2];
    vector[Nedges] edge_lengths;
    int<lower=2> nthres;  // number of thresholds
    int Y[Nobservations];  // response variable
}

parameters {
    vector[Nnodes] z_normal;
    real<lower=0> sigma;
    real mu_root;
    real<lower=0> sigma_root;
    ordered[nthres] Intercept;  // temporary thresholds for centered predictors
}

model {
    vector[Nnodes] z = rep_vector(0, Nnodes);
    target += exponential_lpdf(sigma | 1);
    target += normal_lpdf(z_normal | 0, 1);
    target += normal_lpdf(Intercept | 0, 5);
    
    for (i in 1:Ntrees) {
        int idx = roots[i];
        z[idx] = mu_root + z_normal[idx] * sigma_root;
    }
    
    for (i in 1:Nedges) {
        int j = Nedges - i + 1;
        int mother_node = edges[j, 1];
        int daughter_node = edges[j, 2];
        real local_mu = z[mother_node];
        real local_sigma = sigma * sqrt(edge_lengths[j]);
        z[daughter_node] = local_mu + z_normal[daughter_node] * local_sigma;
    }

    for (n in 1:Nobservations) {
        target += ordered_logistic_lpmf(Y[n] | z[observations[n]], Intercept);
    }    
}


generated quantities {
    vector[Nnodes] z = rep_vector(0, Nnodes);
    int y_prior[Nobservations]; // Prior predictive samples
    int y_posterior[Nobservations]; // Posterior predictive samples
    vector[Nnodes] z_prior = rep_vector(0, Nnodes);
    vector[Nnodes] z_posterior = rep_vector(0, Nnodes);
    real<lower=0> sigma_prior;
    real mu_root_prior;
    real<lower=0> sigma_root_prior;
    real log_lik[Nobservations];
    vector[nthres] Intercept_sample;
    ordered[nthres] Intercept_prior;

    for (i in 1:Ntrees) {
        int idx = roots[i];
        z[idx] = mu_root + z_normal[idx] * sigma_root;
    }
    
    for (i in 1:Nedges) {
        int j = Nedges - i + 1;
        int mother_node = edges[j, 1];
        int daughter_node = edges[j, 2];
        real local_mu = z[mother_node];
        real local_sigma = sigma * sqrt(edge_lengths[j]);
        z[daughter_node] = local_mu + z_normal[daughter_node] * local_sigma;
    }

    for (k in 1:nthres) {
        Intercept_sample[k] = normal_rng(0, 5);
    }
    Intercept_prior = sort_asc(Intercept_sample);

    sigma_prior = exponential_rng(1);
    mu_root_prior = normal_rng(0, 1);
    sigma_root_prior = exponential_rng(1);

    for (i in 1:Ntrees) {
        z_prior[roots[i]] = normal_rng(mu_root_prior, sigma_root_prior);
    }

    for (i in 1:Ntrees) {
        int idx = roots[i];
        z_posterior[idx] = mu_root + z_normal[idx] * sigma_root;
    }
    
    for (i in 1:Nedges) {
        int j = Nedges - i + 1;
        int mother_node = edges[j, 1];
        int daughter_node = edges[j, 2];
        real local_mu = z_posterior[mother_node];
        real local_sigma = sigma * sqrt(edge_lengths[j]);
        z_prior[daughter_node] = normal_rng(
            z_prior[mother_node],
            sigma_prior * sqrt(edge_lengths[j])
        );

        z_posterior[daughter_node] = local_mu + z_normal[daughter_node] * local_sigma;
    }
    
    for (n in 1:Nobservations) {
        y_prior[n] = ordered_logistic_rng(z_prior[observations[n]], Intercept_prior);
        y_posterior[n] = ordered_logistic_rng(z[observations[n]], Intercept);
        log_lik[n] = ordered_logistic_lpmf(Y[n] | z[observations[n]], Intercept);
    }
}
"

model_brownian <- stan_model(model_code=brown_code)

fit_brownian <- rstan::sampling(
   model_brownian,
   data = phylo_data,
   chains = 4,
   iter=2000
)

# fit_brownian <- readRDS("models/fit_brownian.rds")

print(fit_brownian, pars=c("mu_root", "sigma_root", "sigma", "Intercept", "lp__"))

Inference for Stan model: anon_model.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

                 mean se_mean    sd     2.5%      25%      50%      75%
mu_root         -1.63    0.01  0.31    -2.28    -1.83    -1.61    -1.41
sigma_root       2.04    0.01  0.29     1.54     1.84     2.02     2.22
sigma            1.92    0.01  0.28     1.42     1.73     1.91     2.10
Intercept[1]    -0.69    0.00  0.16    -1.00    -0.79    -0.69    -0.58
Intercept[2]     1.30    0.00  0.17     0.99     1.18     1.30     1.41
Intercept[3]     2.71    0.00  0.20     2.32     2.57     2.71     2.85
Intercept[4]     4.06    0.01  0.26     3.57     3.89     4.06     4.23
lp__         -4310.40    1.42 41.00 -4390.75 -4337.74 -4310.48 -4282.73
                97.5% n_eff Rhat
mu_root         -1.05  1294 1.00
sigma_root       2.67   930 1.00
sigma            2.52   706 1.01
Intercept[1]    -0.39  2955 1.00
Intercept[2]     1.63  4194 1.00
Intercept[3]     3.10  2606 1.00
Intercept[4]     4.57  2260 1.00
lp__         -4228.27   835 1.01

Samples were drawn using NUTS(diag_e) at Sat Mar 16 20:11:56 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

log_lik <- extract_log_lik(
    fit_brownian,
    parameter_name = "log_lik",
    merge_chains = FALSE
)
reff <- relative_eff(exp(log_lik))
(loo_brownian <- loo(log_lik, r_eff=reff))

Warning message:
“Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
”


Computed from 4000 by 1077 log-likelihood matrix.

         Estimate   SE
elpd_loo  -1259.9 25.6
p_loo       213.5  7.1
looic      2519.9 51.1
------
MCSE of elpd_loo is NA.
MCSE and ESS estimates assume MCMC draws (r_eff in [0.4, 1.9]).

Pareto k diagnostic values:
                         Count Pct.    Min. ESS
(-Inf, 0.7]   (good)     1070  99.4%   211     
   (0.7, 1]   (bad)         6   0.6%   <NA>    
   (1, Inf)   (very bad)    1   0.1%   <NA>    
See help('pareto-k-diagnostic') for details.

plot(loo_brownian)

generated_quantities <- extract(fit_brownian)

prior predictive check

prior_pc <- tibble(
    tv = apply(generated_quantities$y_prior, 1, total_variance),
    lv = apply(
        generated_quantities$y_prior, 1, function(x) lineage_wise_variance(x, variable)
    ),
    cv = apply(
        generated_quantities$y_prior, 1, function(x) crossfamily_variance(x, variable)
    ),
)

predictive_plot(prior_pc)

Warning message in geom_point(aes(x = 1, y = empirical$tv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$lv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$cv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”

posterior predictive check

posterior_pc <- tibble(
    tv = apply(generated_quantities$y_posterior, 1, total_variance),
    lv = apply(
        generated_quantities$y_posterior, 1, function(x) lineage_wise_variance(x, variable)
    ),
    cv = apply(
        generated_quantities$y_posterior, 1, function(x) crossfamily_variance(x, variable)
    ),
)


predictive_plot(posterior_pc)

Warning message in geom_point(aes(x = 1, y = empirical$tv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$lv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$cv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”

Mixed model

Next I will fit a model with two random effect, one phylogenetic and one uncorrelated. The variable \(z\) is as in the Brownian motion model. Additionally, we have a random effect variable \(\epsilon\) which is drawn from a normal distribution with mean 0 and standard deviation \(\sigma_\epsilon\).

\[ \begin{aligned} \sigma_\epsilon^2 &\sim \text{Inverse\_Gamma}(2, 0.01)\\ \epsilon &\sim \mathcal N(0, \sigma_\epsilon) Y_i=1 &\sim \mathrm{Bernoulli}(\mathrm{logit}^{-1}(C_1-z_i-\epsilon_i))\\ Y_i=k &\sim \mathrm{Bernoulli}(\mathrm{logit}^{-1}(C_k-z_i-\epsilon_i) - \mathrm{logit}^{-1}(C_{k-1}-z_i))\\ Y_K=K &\sim \mathrm{Bernoulli}(1-\mathrm{logit}^{-1}(C_{K-1}-z_i-\epsilon_i))\\ \end{aligned} \]

mixed_code <- "
data {
    int<lower=1> Ntrees;
    int<lower=0> Nnodes;
    int<lower=1> Nobservations;
    int<lower=1> Nedges;
    int<lower=1,upper=Nnodes> observations[Nobservations]; // indices of the observed tips
    int<lower=1, upper=Nnodes> roots[Ntrees];
    int<lower=1, upper=Nnodes> edges[Nedges, 2];
    vector[Nedges] edge_lengths;
    int<lower=2> nthres;  // number of thresholds
    int Y[Nobservations];  // response variable
}

parameters {
    vector[Nnodes] z_normal;
    real<lower=0> sigma;
    real mu_root;
    real<lower=0> sigma_root;
    ordered[nthres] Intercept;  // temporary thresholds for centered predictors
    real<lower=0> sigma_error;
    vector[Nobservations] z_error;
}

model {
    vector[Nnodes] z = rep_vector(0, Nnodes);
    target += exponential_lpdf(sigma | 1);
    target += normal_lpdf(mu_root | 0, 1);
    target += exponential_lpdf(sigma_root | 1);
    target += normal_lpdf(z_normal | 0, 1);
    target += normal_lpdf(Intercept | 0, 5);
    target += inv_gamma_lpdf(square(sigma_error) | 2, 0.1);

    target += normal_lpdf(z_error | 0, sigma_error);

    for (i in 1:Ntrees) {
        int idx = roots[i];
        z[idx] = mu_root + z_normal[idx] * sigma_root;
    }
    
    for (i in 1:Nedges) {
        int j = Nedges - i + 1;
        int mother_node = edges[j, 1];
        int daughter_node = edges[j, 2];
        real local_mu = z[mother_node];
        real local_sigma = sigma * sqrt(edge_lengths[j]);
        z[daughter_node] = local_mu + z_normal[daughter_node] * local_sigma;
    }

    for (n in 1:Nobservations) {
        target += ordered_logistic_lpmf(Y[n] | z[observations[n]] + z_error[n], Intercept);
    }
    
}

generated quantities {
    vector[Nnodes] z = rep_vector(0, Nnodes);
    int y_prior[Nobservations]; // Prior predictive samples
    int y_posterior[Nobservations]; // Posterior predictive samples
    vector[Nnodes] z_prior = rep_vector(0, Nnodes);
    vector[Nnodes] z_posterior = rep_vector(0, Nnodes);
    real<lower=0> sigma_prior;
    real mu_root_prior;
    real<lower=0> sigma_root_prior;
    real log_lik[Nobservations];
    vector[nthres] Intercept_sample;
    ordered[nthres] Intercept_prior;
    vector[Nobservations] z_error_prior;


    real sigma_error_sq = 1 / gamma_rng(2, 10);
    real sigma_error_prior = sqrt(sigma_error_sq);

    for (i in 1:Ntrees) {
        int idx = roots[i];
        z[idx] = mu_root + z_normal[idx] * sigma_root;
    }

    for (n in 1:Nobservations) {
        z_error_prior[n] = normal_rng(0, sigma_error_prior);
    }
    
    for (i in 1:Nedges) {
        int j = Nedges - i + 1;
        int mother_node = edges[j, 1];
        int daughter_node = edges[j, 2];
        real local_mu = z[mother_node];
        real local_sigma = sigma * sqrt(edge_lengths[j]);
        z[daughter_node] = local_mu + z_normal[daughter_node] * local_sigma;
    }

    for (k in 1:nthres) {
        Intercept_sample[k] = normal_rng(0, 5);
    }
    Intercept_prior = sort_asc(Intercept_sample);

    sigma_prior = exponential_rng(1);
    mu_root_prior = normal_rng(0, 1);
    sigma_root_prior = exponential_rng(1);

    for (i in 1:Ntrees) {
        z_prior[roots[i]] = normal_rng(mu_root_prior, sigma_root_prior);
    }

    for (i in 1:Ntrees) {
        int idx = roots[i];
        z_posterior[idx] = mu_root + z_normal[idx] * sigma_root;
    }
    
    for (i in 1:Nedges) {
        int j = Nedges - i + 1;
        int mother_node = edges[j, 1];
        int daughter_node = edges[j, 2];
        real local_mu = z_posterior[mother_node];
        real local_sigma = sigma * sqrt(edge_lengths[j]);
        z_prior[daughter_node] = normal_rng(
            z_prior[mother_node],
            sigma_prior * sqrt(edge_lengths[j])
        );

        z_posterior[daughter_node] = local_mu + z_normal[daughter_node] * local_sigma;
    }
    
    for (n in 1:Nobservations) {
        y_prior[n] = ordered_logistic_rng(z_prior[observations[n]] + z_error_prior[n], Intercept_prior);
        y_posterior[n] = ordered_logistic_rng(z[observations[n]] + z_error[n], Intercept);
        log_lik[n] = ordered_logistic_lpmf(Y[n] | z[observations[n]] + z_error[n], Intercept);
    }
}

"

model_mixed <- stan_model(model_code = mixed_code)

fit_mixed1 <- readRDS("models/fit_mixed1.rds")

# fit_mixed1 <- rstan::sampling(
#    model_mixed,
#    data = phylo_data,
#    chains = 4,
#    iter=40000,
#    thin=20,
#    control=list(max_treedepth=15)
# )

# saveRDS(fit_mixed1, "models/fit_mixed1.rds")

print(fit_mixed1, pars = c("mu_root", "sigma_root", "sigma", "Intercept", "sigma_error", "lp__"))

Inference for Stan model: anon_model.
4 chains, each with iter=40000; warmup=20000; thin=20; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

                 mean se_mean     sd     2.5%      25%      50%      75%
mu_root         -1.47    0.00   0.28    -2.03    -1.66    -1.47    -1.28
sigma_root       1.92    0.00   0.27     1.45     1.74     1.91     2.10
sigma            1.91    0.00   0.27     1.40     1.72     1.89     2.08
Intercept[1]    -0.66    0.00   0.16    -0.97    -0.77    -0.66    -0.56
Intercept[2]     1.33    0.00   0.17     1.01     1.22     1.33     1.44
Intercept[3]     2.74    0.00   0.20     2.35     2.60     2.73     2.88
Intercept[4]     4.09    0.00   0.26     3.61     3.92     4.09     4.26
sigma_error      0.22    0.01   0.08     0.12     0.17     0.20     0.25
lp__         -4160.39   26.35 349.73 -4964.29 -4362.84 -4125.08 -3912.54
                97.5% n_eff Rhat
mu_root         -0.93  3604 1.00
sigma_root       2.47  4009 1.00
sigma            2.46  3875 1.00
Intercept[1]    -0.36  3984 1.00
Intercept[2]     1.65  4022 1.00
Intercept[3]     3.15  3780 1.00
Intercept[4]     4.61  3712 1.00
sigma_error      0.44   155 1.03
lp__         -3573.75   176 1.03

Samples were drawn using NUTS(diag_e) at Sat Mar 16 18:08:22 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

log_lik <- extract_log_lik(
    fit_mixed1,
    parameter_name = "log_lik",
    merge_chains = FALSE
)
reff <- relative_eff(exp(log_lik))
(loo_mixed <- loo(log_lik, r_eff=reff))

Warning message:
“Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
”


Computed from 4000 by 1077 log-likelihood matrix.

         Estimate   SE
elpd_loo  -1261.0 25.4
p_loo       218.3  7.0
looic      2522.0 50.7
------
MCSE of elpd_loo is NA.
MCSE and ESS estimates assume MCMC draws (r_eff in [0.5, 1.1]).

Pareto k diagnostic values:
                         Count Pct.    Min. ESS
(-Inf, 0.7]   (good)     1071  99.4%   220     
   (0.7, 1]   (bad)         6   0.6%   <NA>    
   (1, Inf)   (very bad)    0   0.0%   <NA>    
See help('pareto-k-diagnostic') for details.

plot(loo_mixed)

generated_quantities <- extract(fit_mixed1)

prior predictive check

prior_pc <- tibble(
    tv = apply(generated_quantities$y_prior, 1, total_variance),
    lv = apply(
        generated_quantities$y_prior, 1, function(x) lineage_wise_variance(x, variable)
    ),
    cv = apply(
        generated_quantities$y_prior, 1, function(x) crossfamily_variance(x, variable)
    ),
)

predictive_plot(prior_pc)

Warning message in geom_point(aes(x = 1, y = empirical$tv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$lv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$cv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”

posterior predictive check

posterior_pc <- tibble(
    tv = apply(generated_quantities$y_posterior, 1, total_variance),
    lv = apply(
        generated_quantities$y_posterior, 1, function(x) lineage_wise_variance(x, variable)
    ),
    cv = apply(
        generated_quantities$y_posterior, 1, function(x) crossfamily_variance(x, variable)
    ),
)

predictive_plot(posterior_pc)

Warning message in geom_point(aes(x = 1, y = empirical$tv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$lv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$cv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”

Let us summarize the model comparisons:

This is very decisive evidence in favor of the CTMC model.

model comparison

tibble(model = c("ordinal", "brownian", "mixed"),
       looic = c(
            loo_ordinal$estimates[3,1],
            loo_brownian$estimates[3,1],
            loo_mixed$estimates[3,1]
       )
) %>%
mutate(delta_looic = round(looic-min(looic), 3)) %>%
arrange(delta_looic)

A tibble: 3 × 3
model	looic	delta_looic
<chr>	<dbl>	<dbl>
brownian	2519.892	0.000
mixed	2521.968	2.075
ordinal	2858.004	338.111

We can conclude that phylogenetic control massively improves the fit to the data.

Variable 2: hierarchy_within

variable <- "hierarchy_within"

Y <- d %>%
    select(all_of(variable)) %>%
    na.omit %>%
    pull
Y <- match(Y, sort(unique(Y)))

ordinal_data <- list()
ordinal_data$Ndata <- length(Y)
ordinal_data$nthres <- max(Y) -1
ordinal_data$Y <- Y

fit_ordinal <- rstan::sampling(
    model_ordinal,
    data = ordinal_data,
    chains = 4,
    iter = 2000,
)

print(fit_ordinal, pars=c("Intercept", "lp__"))

Inference for Stan model: anon_model.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

                 mean se_mean   sd     2.5%      25%      50%      75%    97.5%
Intercept[1]    -0.86    0.00 0.07    -0.99    -0.90    -0.86    -0.81    -0.73
Intercept[2]     1.86    0.00 0.09     1.69     1.80     1.86     1.92     2.05
lp__         -1043.07    0.02 1.01 -1045.68 -1043.46 -1042.75 -1042.35 -1042.08
             n_eff Rhat
Intercept[1]  2072    1
Intercept[2]  3137    1
lp__          1806    1

Samples were drawn using NUTS(diag_e) at Sat Mar 16 20:15:33 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

(loo_ordinal <- loo(fit_ordinal))


Computed from 4000 by 1089 log-likelihood matrix.

         Estimate   SE
elpd_loo  -1039.9 16.7
p_loo         2.0  0.1
looic      2079.9 33.3
------
MCSE of elpd_loo is 0.0.
MCSE and ESS estimates assume MCMC draws (r_eff in [0.4, 0.8]).

All Pareto k estimates are good (k < 0.7).
See help('pareto-k-diagnostic') for details.

# Extract the generated quantities
generated_quantities <- extract(fit_ordinal)

Y <- ordinal_data$Y
empirical <- tibble(
    tv = total_variance(Y),
    lv = lineage_wise_variance(Y, variable),
    cv = crossfamily_variance(Y, variable)
)

prior_pc <- tibble(
    tv = apply(generated_quantities$y_prior, 1, total_variance),
    lv = apply(
        generated_quantities$y_prior, 1, function(x) lineage_wise_variance(x, variable)
    ),
    cv = apply(
        generated_quantities$y_prior, 1, function(x) crossfamily_variance(x, variable)
    ),
)
predictive_plot(prior_pc)

Warning message in geom_point(aes(x = 1, y = empirical$tv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$lv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$cv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”

posterior_pc <- tibble(
    tv = apply(generated_quantities$y_posterior, 1, total_variance),
    lv = apply(
        generated_quantities$y_posterior, 1, function(x) lineage_wise_variance(x, variable)
    ),
    cv = apply(
        generated_quantities$y_posterior, 1, function(x) crossfamily_variance(x, variable)
    ),
)
predictive_plot(posterior_pc)

Warning message in geom_point(aes(x = 1, y = empirical$tv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$lv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$cv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”

defined <- which(!is.na(d[[variable]]))
undefined <- which(is.na(d[[variable]]))

po <- c()
for (i in get_postorder(roots, edges)) {
    if ((i %in% edges[,2]) && !(i %in% undefined)) {
        po <- c(po, i)
    }
}
edge_po <- match(po, edges[,2])

phylo_data = list()
phylo_data$Ntrees <- length(trees)
phylo_data$Nnodes <- nrow(nodes)
phylo_data$Y <- ordinal_data$Y
phylo_data$Nobservations <- length(phylo_data$Y)
phylo_data$Nedges <- length(edge_po)
phylo_data$observations <- match(pull(d[defined, "glottocode"]), nodes$label)
phylo_data$roots <- roots
phylo_data$edges <- edges[edge_po,]
phylo_data$edge_lengths <- edge_lengths[edge_po]
phylo_data$nthres <- length(unique(phylo_data$Y)) - 1

fit_brownian <- rstan::sampling(
   model_brownian,
   data = phylo_data,
   chains = 4,
   iter = 2000
)

print(fit_brownian, pars=c("mu_root", "sigma_root", "sigma", "Intercept", "lp__"))

Inference for Stan model: anon_model.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

                 mean se_mean    sd     2.5%      25%      50%      75%
mu_root          0.40    0.00  0.23    -0.04     0.24     0.40     0.55
sigma_root       1.19    0.01  0.22     0.78     1.04     1.18     1.33
sigma            1.98    0.01  0.29     1.45     1.78     1.96     2.16
Intercept[1]    -0.62    0.00  0.15    -0.94    -0.73    -0.62    -0.52
Intercept[2]     3.18    0.01  0.23     2.75     3.01     3.17     3.33
lp__         -3990.72    1.48 40.81 -4067.19 -4018.82 -3991.51 -3962.58
                97.5% n_eff Rhat
mu_root          0.85  2357 1.00
sigma_root       1.66  1021 1.00
sigma            2.57   642 1.01
Intercept[1]    -0.33  3886 1.00
Intercept[2]     3.64  1675 1.00
lp__         -3910.57   757 1.01

Samples were drawn using NUTS(diag_e) at Sat Mar 16 20:18:05 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

(loo_brownian <- loo(fit_brownian))

Warning message:
“Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
”


Computed from 4000 by 1089 log-likelihood matrix.

         Estimate   SE
elpd_loo   -936.6 18.4
p_loo       205.7  6.8
looic      1873.2 36.8
------
MCSE of elpd_loo is NA.
MCSE and ESS estimates assume MCMC draws (r_eff in [0.3, 2.1]).

Pareto k diagnostic values:
                         Count Pct.    Min. ESS
(-Inf, 0.7]   (good)     1083  99.4%   121     
   (0.7, 1]   (bad)         6   0.6%   <NA>    
   (1, Inf)   (very bad)    0   0.0%   <NA>    
See help('pareto-k-diagnostic') for details.

# Extract the generated quantities
generated_quantities <- extract(fit_brownian)

Y <- ordinal_data$Y
empirical <- tibble(
    tv = total_variance(Y),
    lv = lineage_wise_variance(Y, variable),
    cv = crossfamily_variance(Y, variable)
)

prior_pc <- tibble(
    tv = apply(generated_quantities$y_prior, 1, total_variance),
    lv = apply(
        generated_quantities$y_prior, 1, function(x) lineage_wise_variance(x, variable)
    ),
    cv = apply(
        generated_quantities$y_prior, 1, function(x) crossfamily_variance(x, variable)
    ),
)
predictive_plot(prior_pc)

Warning message in geom_point(aes(x = 1, y = empirical$tv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$lv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$cv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”

posterior_pc <- tibble(
  tv = apply(generated_quantities$y_posterior, 1, total_variance),
  lv = apply(
    generated_quantities$y_posterior, 1, function(x) lineage_wise_variance(x, variable)
  ),
  cv = apply(
    generated_quantities$y_posterior, 1, function(x) crossfamily_variance(x, variable)
  ),
)
predictive_plot(posterior_pc)

Warning message in geom_point(aes(x = 1, y = empirical$tv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$lv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$cv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”

fit_mixed2 <- readRDS("models/fit_mixed2.rds")

# fit_mixed2 <- rstan::sampling(
#    model_mixed,
#    data = phylo_data,
#    chains = 4,
#    iter=40000,
#    thin=20,
#    control=list(max_treedepth=15)
# )

# saveRDS(fit_mixed2, "models/fit_mixed2.rds")

print(fit_mixed2, pars=c("mu_root", "sigma_root", "sigma", "Intercept", "sigma_error", "lp__"))

Inference for Stan model: anon_model.
4 chains, each with iter=40000; warmup=20000; thin=20; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

                 mean se_mean     sd     2.5%      25%      50%      75%
mu_root          0.38    0.00   0.22    -0.06     0.24     0.38     0.53
sigma_root       1.16    0.00   0.21     0.78     1.01     1.15     1.30
sigma            1.96    0.00   0.29     1.43     1.76     1.95     2.15
Intercept[1]    -0.64    0.00   0.16    -0.94    -0.74    -0.64    -0.53
Intercept[2]     3.18    0.00   0.24     2.72     3.01     3.17     3.34
sigma_error      0.23    0.01   0.09     0.12     0.17     0.21     0.27
lp__         -3886.28   31.52 366.29 -4692.49 -4107.62 -3857.59 -3625.73
                97.5% n_eff Rhat
mu_root          0.82  3760 1.00
sigma_root       1.61  3863 1.00
sigma            2.58  4002 1.00
Intercept[1]    -0.33  3678 1.00
Intercept[2]     3.67  3596 1.00
sigma_error      0.46   109 1.04
lp__         -3260.73   135 1.03

Samples were drawn using NUTS(diag_e) at Sat Mar 16 18:49:55 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

(loo_mixed <- loo(fit_mixed2))

Warning message:
“Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
”


Computed from 4000 by 1089 log-likelihood matrix.

         Estimate   SE
elpd_loo   -937.8 18.3
p_loo       212.6  6.9
looic      1875.5 36.6
------
MCSE of elpd_loo is NA.
MCSE and ESS estimates assume MCMC draws (r_eff in [0.7, 1.1]).

Pareto k diagnostic values:
                         Count Pct.    Min. ESS
(-Inf, 0.7]   (good)     1085  99.6%   345     
   (0.7, 1]   (bad)         3   0.3%   <NA>    
   (1, Inf)   (very bad)    1   0.1%   <NA>    
See help('pareto-k-diagnostic') for details.

# Extract the generated quantities
generated_quantities <- extract(fit_mixed2)

Y <- ordinal_data$Y
empirical <- tibble(
    tv = total_variance(Y),
    lv = lineage_wise_variance(Y, variable),
    cv = crossfamily_variance(Y, variable)
)

prior_pc <- tibble(
    tv = apply(generated_quantities$y_prior, 1, total_variance),
    lv = apply(
        generated_quantities$y_prior, 1, function(x) lineage_wise_variance(x, variable)
    ),
    cv = apply(
        generated_quantities$y_prior, 1, function(x) crossfamily_variance(x, variable)
    ),
)
predictive_plot(prior_pc)

Warning message in geom_point(aes(x = 1, y = empirical$tv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$lv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$cv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”

posterior_pc <- tibble(
    tv = apply(generated_quantities$y_posterior, 1, total_variance),
    lv = apply(
        generated_quantities$y_posterior, 1, function(x) lineage_wise_variance(x, variable)
    ),
    cv = apply(
        generated_quantities$y_posterior, 1, function(x) crossfamily_variance(x, variable)
    ),
)
predictive_plot(posterior_pc)

Warning message in geom_point(aes(x = 1, y = empirical$tv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$lv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$cv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”

model comparison

tibble(model = c("ordinal", "brownian", "mixed"),
       looic = c(
            loo_ordinal$estimates[3,1],
            loo_brownian$estimates[3,1],
            loo_mixed$estimates[3,1]
       )
) %>%
mutate(delta_looic = round(looic-min(looic), 3)) %>%
arrange(delta_looic)

A tibble: 3 × 3
model	looic	delta_looic
<chr>	<dbl>	<dbl>
brownian	1873.185	0.000
mixed	1875.527	2.342
ordinal	2079.891	206.705

Variable 3: domestic_organisation

variable <- "domestic_organisation"

Y <- d %>%
    select(all_of(variable)) %>%
    na.omit %>%
    pull
Y <- match(Y, sort(unique(Y)))

ordinal_data <- list()
ordinal_data$Ndata <- length(Y)
ordinal_data$nthres <- max(Y) -1
ordinal_data$Y <- Y

fit_ordinal <- rstan::sampling(
    model_ordinal,
    data = ordinal_data,
    chains = 4,
    iter = 2000,
)

print(fit_ordinal, pars=c("Intercept", "lp__"))

Inference for Stan model: anon_model.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

                 mean se_mean   sd     2.5%      25%      50%      75%    97.5%
Intercept[1]    -0.93    0.00 0.06    -1.06    -0.97    -0.93    -0.89    -0.80
Intercept[2]     0.05    0.00 0.06    -0.06     0.01     0.05     0.09     0.17
Intercept[3]     1.45    0.00 0.07     1.31     1.40     1.45     1.50     1.60
lp__         -1631.03    0.03 1.27 -1634.30 -1631.56 -1630.69 -1630.12 -1629.64
             n_eff Rhat
Intercept[1]  2517    1
Intercept[2]  3701    1
Intercept[3]  4405    1
lp__          1480    1

Samples were drawn using NUTS(diag_e) at Sat Mar 16 20:21:53 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

(loo_ordinal <- loo(fit_ordinal))


Computed from 4000 by 1183 log-likelihood matrix.

         Estimate   SE
elpd_loo  -1625.2  5.9
p_loo         3.0  0.0
looic      3250.4 11.8
------
MCSE of elpd_loo is 0.0.
MCSE and ESS estimates assume MCMC draws (r_eff in [0.6, 1.1]).

All Pareto k estimates are good (k < 0.7).
See help('pareto-k-diagnostic') for details.

# Extract the generated quantities
generated_quantities <- extract(fit_ordinal)

Y <- ordinal_data$Y
empirical <- tibble(
    tv = total_variance(Y),
    lv = lineage_wise_variance(Y, variable),
    cv = crossfamily_variance(Y, variable)
)

prior_pc <- tibble(
    tv = apply(generated_quantities$y_prior, 1, total_variance),
    lv = apply(
        generated_quantities$y_prior, 1, function(x) lineage_wise_variance(x, variable)
    ),
    cv = apply(
        generated_quantities$y_prior, 1, function(x) crossfamily_variance(x, variable)
    ),
)
predictive_plot(prior_pc)

Warning message in geom_point(aes(x = 1, y = empirical$tv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$lv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$cv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”

posterior_pc <- tibble(
    tv = apply(generated_quantities$y_posterior, 1, total_variance),
    lv = apply(
        generated_quantities$y_posterior, 1, function(x) lineage_wise_variance(x, variable)
    ),
    cv = apply(
        generated_quantities$y_posterior, 1, function(x) crossfamily_variance(x, variable)
    ),
)
predictive_plot(posterior_pc)

Warning message in geom_point(aes(x = 1, y = empirical$tv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$lv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$cv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”

defined <- which(!is.na(d[[variable]]))
undefined <- which(is.na(d[[variable]]))

po <- c()
for (i in get_postorder(roots, edges)) {
    if ((i %in% edges[,2]) & !(i %in% undefined)) {
        po <- c(po, i)
    }
}
edge_po <- match(po, edges[,2])

phylo_data = list()
phylo_data$Ntrees <- length(trees)
phylo_data$Nnodes <- nrow(nodes)
phylo_data$Y <- ordinal_data$Y
phylo_data$Nobservations <- length(phylo_data$Y)
phylo_data$Nedges <- length(edge_po)
phylo_data$observations <- match(pull(d[defined, "glottocode"]), nodes$label)
phylo_data$roots <- roots
phylo_data$edges <- edges[edge_po,]
phylo_data$edge_lengths <- edge_lengths[edge_po]
phylo_data$nthres <- length(unique(phylo_data$Y)) - 1

fit_brownian <- rstan::sampling(
   model_brownian,
   data = phylo_data,
   chains = 4,
   iter=2000
)

print(fit_brownian, pars=c("mu_root", "sigma_root", "sigma", "Intercept", "lp__"))

Inference for Stan model: anon_model.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

                 mean se_mean    sd     2.5%      25%      50%      75%
mu_root         -0.19    0.01  0.31    -0.80    -0.41    -0.19     0.01
sigma_root       1.00    0.01  0.18     0.68     0.87     0.99     1.13
sigma            1.53    0.01  0.22     1.11     1.37     1.53     1.68
Intercept[1]    -1.25    0.01  0.28    -1.83    -1.44    -1.25    -1.06
Intercept[2]    -0.01    0.01  0.28    -0.56    -0.20    -0.01     0.18
Intercept[3]     1.77    0.01  0.30     1.20     1.57     1.77     1.97
lp__         -4597.03    1.43 42.38 -4682.29 -4625.63 -4597.17 -4568.26
                97.5% n_eff Rhat
mu_root          0.41  2858    1
sigma_root       1.39  1066    1
sigma            1.98   813    1
Intercept[1]    -0.70  3160    1
Intercept[2]     0.55  2935    1
Intercept[3]     2.36  2409    1
lp__         -4515.18   878    1

Samples were drawn using NUTS(diag_e) at Sat Mar 16 20:24:15 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

(loo_brownian <- loo(fit_brownian))

Warning message:
“Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
”


Computed from 4000 by 1183 log-likelihood matrix.

         Estimate   SE
elpd_loo  -1540.1 13.9
p_loo       213.0  5.1
looic      3080.3 27.7
------
MCSE of elpd_loo is NA.
MCSE and ESS estimates assume MCMC draws (r_eff in [0.4, 1.9]).

Pareto k diagnostic values:
                         Count Pct.    Min. ESS
(-Inf, 0.7]   (good)     1178  99.6%   264     
   (0.7, 1]   (bad)         5   0.4%   <NA>    
   (1, Inf)   (very bad)    0   0.0%   <NA>    
See help('pareto-k-diagnostic') for details.

# Extract the generated quantities
generated_quantities <- extract(fit_brownian)

Y <- ordinal_data$Y
empirical <- tibble(
    tv = total_variance(Y),
    lv = lineage_wise_variance(Y, variable),
    cv = crossfamily_variance(Y, variable)
)

prior_pc <- tibble(
    tv = apply(generated_quantities$y_prior, 1, total_variance),
    lv = apply(
        generated_quantities$y_prior, 1, function(x) lineage_wise_variance(x, variable)
    ),
    cv = apply(
        generated_quantities$y_prior, 1, function(x) crossfamily_variance(x, variable)
    ),
)
predictive_plot(prior_pc)

Warning message in geom_point(aes(x = 1, y = empirical$tv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$lv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$cv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”

posterior_pc <- tibble(
    tv = apply(generated_quantities$y_posterior, 1, total_variance),
    lv = apply(
        generated_quantities$y_posterior, 1, function(x) lineage_wise_variance(x, variable)
    ),
    cv = apply(
        generated_quantities$y_posterior, 1, function(x) crossfamily_variance(x, variable)
    ),
)
predictive_plot(posterior_pc)

Warning message in geom_point(aes(x = 1, y = empirical$tv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$lv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$cv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”

fit_mixed3 <- readRDS("models/fit_mixed3.rds")

# fit_mixed3 <- rstan::sampling(
#    model_mixed,
#    data = phylo_data,
#    chains = 4,
#    iter = 40000,
#    thin = 20,
#    control = list(max_treedepth = 15)
# )

# saveRDS(fit_mixed3, "models/fit_mixed3.rds")

print(
    fit_mixed3,
    pars = c("mu_root", "sigma_root", "sigma", "Intercept", "sigma_error", "lp__")
)

Inference for Stan model: anon_model.
4 chains, each with iter=40000; warmup=20000; thin=20; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

                 mean se_mean     sd     2.5%      25%      50%      75%
mu_root         -0.17    0.00   0.31    -0.78    -0.38    -0.18     0.04
sigma_root       0.99    0.00   0.19     0.66     0.86     0.98     1.11
sigma            1.54    0.00   0.23     1.11     1.37     1.53     1.69
Intercept[1]    -1.25    0.00   0.29    -1.80    -1.44    -1.25    -1.05
Intercept[2]     0.01    0.00   0.29    -0.56    -0.19     0.01     0.20
Intercept[3]     1.80    0.01   0.31     1.22     1.59     1.80     2.01
sigma_error      0.24    0.01   0.09     0.13     0.18     0.22     0.27
lp__         -4495.62   25.99 377.70 -5338.93 -4720.97 -4470.54 -4230.67
                97.5% n_eff Rhat
mu_root          0.46  3899 1.00
sigma_root       1.38  3222 1.00
sigma            2.03  3691 1.00
Intercept[1]    -0.68  3348 1.00
Intercept[2]     0.58  3338 1.00
Intercept[3]     2.40  3038 1.00
sigma_error      0.46   197 1.02
lp__         -3843.50   211 1.01

Samples were drawn using NUTS(diag_e) at Sun Mar 10 11:53:58 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

(loo_mixed <- loo(fit_mixed3))


Computed from 4000 by 1183 log-likelihood matrix.

         Estimate   SE
elpd_loo  -1540.2 13.8
p_loo       226.2  5.3
looic      3080.3 27.5
------
MCSE of elpd_loo is 0.4.
MCSE and ESS estimates assume MCMC draws (r_eff in [0.6, 1.1]).

All Pareto k estimates are good (k < 0.7).
See help('pareto-k-diagnostic') for details.

# Extract the generated quantities
generated_quantities <- extract(fit_mixed3)

Y <- ordinal_data$Y
empirical <- tibble(
    tv = total_variance(Y),
    lv = lineage_wise_variance(Y, variable),
    cv = crossfamily_variance(Y, variable)
)

prior_pc <- tibble(
    tv = apply(generated_quantities$y_prior, 1, total_variance),
    lv = apply(
        generated_quantities$y_prior, 1, function(x) lineage_wise_variance(x, variable)
    ),
    cv = apply(
        generated_quantities$y_prior, 1, function(x) crossfamily_variance(x, variable)
    ),
)
predictive_plot(prior_pc)

Warning message in geom_point(aes(x = 1, y = empirical$tv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$lv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$cv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”

posterior_pc <- tibble(
    tv = apply(generated_quantities$y_posterior, 1, total_variance),
    lv = apply(
        generated_quantities$y_posterior, 1, function(x) lineage_wise_variance(x, variable)
    ),
    cv = apply(
        generated_quantities$y_posterior, 1, function(x) crossfamily_variance(x, variable)
    ),
)
predictive_plot(posterior_pc)

Warning message in geom_point(aes(x = 1, y = empirical$tv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$lv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$cv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”

model comparison

tibble(model = c("ordinal", "brownian", "mixed"),
       looic = c(
            loo_ordinal$estimates[3,1],
            loo_brownian$estimates[3,1],
            loo_mixed$estimates[3,1]
       )
) %>%
mutate(delta_looic = round(looic-min(looic), 3)) %>%
arrange(delta_looic)

A tibble: 3 × 3
model	looic	delta_looic
<chr>	<dbl>	<dbl>
brownian	3080.276	0.000
mixed	3080.314	0.038
ordinal	3250.418	170.143

Variable 4: agricultureLevel

variable <- "agricultureLevel"

Y <- d %>%
    select(all_of(variable)) %>%
    na.omit %>%
    pull
Y <- match(Y, sort(unique(Y)))

ordinal_data <- list()
ordinal_data$Ndata <- length(Y)
ordinal_data$nthres <- max(Y) -1
ordinal_data$Y <- Y

fit_ordinal <- rstan::sampling(
    model_ordinal,
    data = ordinal_data,
    chains = 4,
    iter = 2000,
)

print(fit_ordinal, pars=c("Intercept", "lp__"))

Inference for Stan model: anon_model.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

                 mean se_mean   sd     2.5%      25%      50%      75%    97.5%
Intercept[1]    -1.53    0.00 0.07    -1.68    -1.57    -1.53    -1.48    -1.39
Intercept[2]    -1.32    0.00 0.07    -1.46    -1.37    -1.32    -1.28    -1.19
Intercept[3]    -1.19    0.00 0.07    -1.32    -1.23    -1.18    -1.14    -1.06
Intercept[4]    -1.02    0.00 0.06    -1.15    -1.07    -1.02    -0.98    -0.90
Intercept[5]    -0.67    0.00 0.06    -0.79    -0.71    -0.67    -0.63    -0.56
Intercept[6]     0.03    0.00 0.06    -0.08    -0.01     0.03     0.07     0.15
Intercept[7]     1.08    0.00 0.07     0.95     1.04     1.08     1.13     1.21
Intercept[8]     2.35    0.00 0.10     2.15     2.28     2.35     2.42     2.55
Intercept[9]     4.11    0.00 0.22     3.70     3.96     4.10     4.26     4.57
lp__         -2448.97    0.05 2.07 -2453.62 -2450.17 -2448.61 -2447.47 -2445.90
             n_eff Rhat
Intercept[1]  3364    1
Intercept[2]  3978    1
Intercept[3]  4222    1
Intercept[4]  4554    1
Intercept[5]  5240    1
Intercept[6]  5007    1
Intercept[7]  4919    1
Intercept[8]  5653    1
Intercept[9]  5456    1
lp__          1955    1

Samples were drawn using NUTS(diag_e) at Sat Mar 16 20:27:19 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

(loo_ordinal <- loo(fit_ordinal))


Computed from 4000 by 1209 log-likelihood matrix.

         Estimate   SE
elpd_loo  -2424.1 23.3
p_loo         8.9  0.3
looic      4848.2 46.7
------
MCSE of elpd_loo is 0.0.
MCSE and ESS estimates assume MCMC draws (r_eff in [0.8, 1.6]).

All Pareto k estimates are good (k < 0.7).
See help('pareto-k-diagnostic') for details.

# Extract the generated quantities
generated_quantities <- extract(fit_ordinal)

Y <- ordinal_data$Y
empirical <- tibble(
    tv = total_variance(Y),
    lv = lineage_wise_variance(Y, variable),
    cv = crossfamily_variance(Y, variable)
)

prior_pc <- tibble(
    tv = apply(generated_quantities$y_prior, 1, total_variance),
    lv = apply(
        generated_quantities$y_prior, 1, function(x) lineage_wise_variance(x, variable)
    ),
    cv = apply(
        generated_quantities$y_prior, 1, function(x) crossfamily_variance(x, variable)
    ),
)
predictive_plot(prior_pc)

Warning message in geom_point(aes(x = 1, y = empirical$tv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$lv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$cv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”

posterior_pc <- tibble(
    tv = apply(generated_quantities$y_posterior, 1, total_variance),
    lv = apply(
        generated_quantities$y_posterior, 1, function(x) lineage_wise_variance(x, variable)
    ),
    cv = apply(
        generated_quantities$y_posterior, 1, function(x) crossfamily_variance(x, variable)
    ),
)
predictive_plot(posterior_pc)

Warning message in geom_point(aes(x = 1, y = empirical$tv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$lv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$cv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”

defined <- which(!is.na(d[[variable]]))
undefined <- which(is.na(d[[variable]]))

po <- c()
for (i in get_postorder(roots, edges)) {
    if ((i %in% edges[,2]) & !(i %in% undefined)) {
        po <- c(po, i)
    }
}
edge_po <- match(po, edges[,2])

phylo_data = list()
phylo_data$Ntrees <- length(trees)
phylo_data$Nnodes <- nrow(nodes)
phylo_data$Y <- ordinal_data$Y
phylo_data$Nobservations <- length(phylo_data$Y)
phylo_data$Nedges <- length(edge_po)
phylo_data$observations <- match(pull(d[defined, "glottocode"]), nodes$label)
phylo_data$roots <- roots
phylo_data$edges <- edges[edge_po,]
phylo_data$edge_lengths <- edge_lengths[edge_po]
phylo_data$nthres <- length(unique(phylo_data$Y)) - 1

fit_brownian <- rstan::sampling(
   model_brownian,
   data = phylo_data,
   chains = 4,
   iter = 2000
)

print(fit_brownian, pars=c("mu_root", "sigma_root", "sigma", "Intercept", "lp__"))

Inference for Stan model: anon_model.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

                 mean se_mean    sd     2.5%      25%      50%      75%
mu_root          0.34    0.03  1.41    -2.49    -0.60     0.36     1.32
sigma_root       3.14    0.01  0.31     2.56     2.93     3.13     3.33
sigma            1.81    0.01  0.24     1.36     1.65     1.80     1.96
Intercept[1]    -1.24    0.02  1.39    -3.97    -2.16    -1.21    -0.30
Intercept[2]    -0.70    0.02  1.39    -3.46    -1.64    -0.67     0.24
Intercept[3]    -0.36    0.02  1.39    -3.10    -1.30    -0.33     0.58
Intercept[4]     0.01    0.02  1.39    -2.75    -0.92     0.04     0.96
Intercept[5]     0.78    0.02  1.39    -1.97    -0.14     0.81     1.72
Intercept[6]     2.15    0.02  1.39    -0.61     1.21     2.18     3.08
Intercept[7]     3.87    0.02  1.39     1.11     2.92     3.89     4.81
Intercept[8]     5.65    0.02  1.39     2.86     4.71     5.67     6.59
Intercept[9]     7.74    0.02  1.42     4.93     6.76     7.76     8.70
lp__         -5002.57    1.79 46.21 -5092.02 -5033.70 -5002.50 -4971.48
                97.5% n_eff Rhat
mu_root          3.03  3153 1.00
sigma_root       3.78   795 1.01
sigma            2.30   650 1.01
Intercept[1]     1.43  3360 1.00
Intercept[2]     1.97  3396 1.00
Intercept[3]     2.30  3420 1.00
Intercept[4]     2.71  3451 1.00
Intercept[5]     3.46  3469 1.00
Intercept[6]     4.81  3519 1.00
Intercept[7]     6.52  3534 1.00
Intercept[8]     8.31  3530 1.00
Intercept[9]    10.44  3577 1.00
lp__         -4911.87   665 1.01

Samples were drawn using NUTS(diag_e) at Sat Mar 16 20:34:08 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

(loo_brownian <- loo(fit_brownian))

Warning message:
“Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
”


Computed from 4000 by 1209 log-likelihood matrix.

         Estimate   SE
elpd_loo  -2009.9 34.6
p_loo       322.1  9.7
looic      4019.9 69.3
------
MCSE of elpd_loo is NA.
MCSE and ESS estimates assume MCMC draws (r_eff in [0.3, 1.8]).

Pareto k diagnostic values:
                         Count Pct.    Min. ESS
(-Inf, 0.7]   (good)     1135  93.9%   99      
   (0.7, 1]   (bad)        72   6.0%   <NA>    
   (1, Inf)   (very bad)    2   0.2%   <NA>    
See help('pareto-k-diagnostic') for details.

# Extract the generated quantities
generated_quantities <- extract(fit_brownian)

Y <- ordinal_data$Y
empirical <- tibble(
    tv = total_variance(Y),
    lv = lineage_wise_variance(Y, variable),
    cv = crossfamily_variance(Y, variable)
)

prior_pc <- tibble(
    tv = apply(generated_quantities$y_prior, 1, total_variance),
    lv = apply(
        generated_quantities$y_prior, 1, function(x) lineage_wise_variance(x, variable)
    ),
    cv = apply(
        generated_quantities$y_prior, 1, function(x) crossfamily_variance(x, variable)
    ),
)
predictive_plot(prior_pc)

Warning message in geom_point(aes(x = 1, y = empirical$tv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$lv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$cv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”

posterior_pc <- tibble(
    tv = apply(generated_quantities$y_posterior, 1, total_variance),
    lv = apply(
        generated_quantities$y_posterior, 1, function(x) lineage_wise_variance(x, variable)
    ),
    cv = apply(
        generated_quantities$y_posterior, 1, function(x) crossfamily_variance(x, variable)
    ),
)
predictive_plot(posterior_pc)

Warning message in geom_point(aes(x = 1, y = empirical$tv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$lv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$cv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”

fit_mixed4 <- readRDS("models/fit_mixed4.rds")

# fit_mixed4 <- rstan::sampling(
#    model_mixed,
#    data = phylo_data,
#    chains = 4,
#    iter = 40000,
#    thin = 20,
#    control = list(max_treedepth=15)
# )

# saveRDS(fit_mixed4, "models/fit_mixed4.rds")

print(fit_mixed4, pars=c("mu_root", "sigma_root", "sigma", "Intercept", "sigma_error", "lp__"))

Inference for Stan model: anon_model.
4 chains, each with iter=40000; warmup=20000; thin=20; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

                 mean se_mean     sd     2.5%      25%      50%      75%
mu_root          0.13    0.01   0.84    -1.50    -0.44     0.14     0.71
sigma_root       3.07    0.01   0.29     2.52     2.86     3.05     3.25
sigma            1.79    0.00   0.24     1.35     1.63     1.79     1.95
Intercept[1]    -1.45    0.01   0.86    -3.10    -2.05    -1.44    -0.84
Intercept[2]    -0.91    0.01   0.86    -2.55    -1.51    -0.90    -0.31
Intercept[3]    -0.58    0.01   0.86    -2.20    -1.17    -0.57     0.03
Intercept[4]    -0.20    0.01   0.86    -1.86    -0.79    -0.19     0.40
Intercept[5]     0.57    0.01   0.86    -1.07    -0.02     0.58     1.16
Intercept[6]     1.94    0.01   0.86     0.31     1.35     1.95     2.53
Intercept[7]     3.67    0.01   0.86     2.02     3.08     3.68     4.25
Intercept[8]     5.45    0.01   0.87     3.79     4.85     5.46     6.04
Intercept[9]     7.55    0.02   0.91     5.78     6.94     7.54     8.19
sigma_error      0.23    0.00   0.08     0.12     0.17     0.21     0.26
lp__         -4864.79   23.60 369.52 -5609.00 -5105.58 -4839.37 -4604.52
                97.5% n_eff Rhat
mu_root          1.71  3743 1.00
sigma_root       3.65  3157 1.00
sigma            2.29  4095 1.00
Intercept[1]     0.19  3849 1.00
Intercept[2]     0.72  3841 1.00
Intercept[3]     1.06  3821 1.00
Intercept[4]     1.44  3799 1.00
Intercept[5]     2.21  3780 1.00
Intercept[6]     3.59  3744 1.00
Intercept[7]     5.30  3695 1.00
Intercept[8]     7.13  3599 1.00
Intercept[9]     9.32  3558 1.00
sigma_error      0.41   230 1.01
lp__         -4200.97   245 1.01

Samples were drawn using NUTS(diag_e) at Sun Mar 10 14:07:50 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

(loo_mixed <- loo(fit_mixed4))

Warning message:
“Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
”


Computed from 4000 by 1209 log-likelihood matrix.

         Estimate   SE
elpd_loo  -2009.6 34.6
p_loo       328.1  9.7
looic      4019.2 69.1
------
MCSE of elpd_loo is NA.
MCSE and ESS estimates assume MCMC draws (r_eff in [0.8, 1.1]).

Pareto k diagnostic values:
                         Count Pct.    Min. ESS
(-Inf, 0.7]   (good)     1144  94.6%   207     
   (0.7, 1]   (bad)        65   5.4%   <NA>    
   (1, Inf)   (very bad)    0   0.0%   <NA>    
See help('pareto-k-diagnostic') for details.

# Extract the generated quantities
generated_quantities <- extract(fit_mixed4)

Y <- ordinal_data$Y
empirical <- tibble(
    tv = total_variance(Y),
    lv = lineage_wise_variance(Y, variable),
    cv = crossfamily_variance(Y, variable)
)

prior_pc <- tibble(
    tv = apply(generated_quantities$y_prior, 1, total_variance),
    lv = apply(
        generated_quantities$y_prior, 1, function(x) lineage_wise_variance(x, variable)
    ),
    cv = apply(
        generated_quantities$y_prior, 1, function(x) crossfamily_variance(x, variable)
    ),
)
predictive_plot(prior_pc)

Warning message in geom_point(aes(x = 1, y = empirical$tv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$lv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$cv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”

posterior_pc <- tibble(
    tv = apply(generated_quantities$y_posterior, 1, total_variance),
    lv = apply(
        generated_quantities$y_posterior, 1, function(x) lineage_wise_variance(x, variable)
    ),
    cv = apply(
        generated_quantities$y_posterior, 1, function(x) crossfamily_variance(x, variable)
    ),
)
predictive_plot(posterior_pc)

Warning message in geom_point(aes(x = 1, y = empirical$tv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$lv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$cv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”

model comparison

tibble(model = c("ordinal", "brownian", "mixed"),
       looic = c(
            loo_ordinal$estimates[3,1],
            loo_brownian$estimates[3,1],
            loo_mixed$estimates[3,1]
       )
) %>%
mutate(delta_looic = round(looic-min(looic), 3)) %>%
arrange(delta_looic)

A tibble: 3 × 3
model	looic	delta_looic
<chr>	<dbl>	<dbl>
mixed	4019.184	0.000
brownian	4019.874	0.690
ordinal	4848.207	829.023

Variable 5: settlement_strategy

variable <- "settlement_strategy"

Y <- d %>%
    select(all_of(variable)) %>%
    na.omit %>%
    pull
Y <- match(Y, sort(unique(Y)))

ordinal_data <- list()
ordinal_data$Ndata <- length(Y)
ordinal_data$nthres <- max(Y) -1
ordinal_data$Y <- Y

fit_ordinal <- rstan::sampling(
    model_ordinal,
    data=ordinal_data,
    chains=4,
    iter=2000,
)

print(fit_ordinal, pars=c("Intercept", "lp__"))

Inference for Stan model: anon_model.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

                 mean se_mean   sd     2.5%      25%      50%      75%    97.5%
Intercept[1]    -1.42    0.00 0.08    -1.57    -1.47    -1.42    -1.37    -1.27
Intercept[2]    -0.88    0.00 0.07    -1.01    -0.93    -0.88    -0.84    -0.75
Intercept[3]     0.06    0.00 0.06    -0.06     0.02     0.06     0.10     0.18
lp__         -1371.87    0.03 1.23 -1375.05 -1372.48 -1371.55 -1370.97 -1370.44
             n_eff Rhat
Intercept[1]  2523    1
Intercept[2]  3364    1
Intercept[3]  5206    1
lp__          1882    1

Samples were drawn using NUTS(diag_e) at Sat Mar 16 20:37:54 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

(loo_ordinal <- loo(fit_ordinal))


Computed from 4000 by 1107 log-likelihood matrix.

         Estimate   SE
elpd_loo  -1365.1 18.0
p_loo         3.1  0.1
looic      2730.2 36.0
------
MCSE of elpd_loo is 0.0.
MCSE and ESS estimates assume MCMC draws (r_eff in [0.6, 1.3]).

All Pareto k estimates are good (k < 0.7).
See help('pareto-k-diagnostic') for details.

# Extract the generated quantities
generated_quantities <- extract(fit_ordinal)

Y <- ordinal_data$Y
empirical <- tibble(
    tv = total_variance(Y),
    lv = lineage_wise_variance(Y, variable),
    cv = crossfamily_variance(Y, variable)
)

prior_pc <- tibble(
    tv = apply(generated_quantities$y_prior, 1, total_variance),
    lv = apply(
        generated_quantities$y_prior, 1, function(x) lineage_wise_variance(x, variable)
    ),
    cv = apply(
        generated_quantities$y_prior, 1, function(x) crossfamily_variance(x, variable)
    ),
)
predictive_plot(prior_pc)

Warning message in geom_point(aes(x = 1, y = empirical$tv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$lv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$cv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”

posterior_pc <- tibble(
    tv = apply(generated_quantities$y_posterior, 1, total_variance),
    lv = apply(
        generated_quantities$y_posterior, 1, function(x) lineage_wise_variance(x, variable)
    ),
    cv = apply(
        generated_quantities$y_posterior, 1, function(x) crossfamily_variance(x, variable)
    ),
)
predictive_plot(posterior_pc)

Warning message in geom_point(aes(x = 1, y = empirical$tv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$lv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$cv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”

defined <- which(!is.na(d[[variable]]))
undefined <- which(is.na(d[[variable]]))

po <- c()
for (i in get_postorder(roots, edges)) {
    if ((i %in% edges[,2]) & !(i %in% undefined)) {
        po <- c(po, i)
    }
}
edge_po <- match(po, edges[,2])

phylo_data = list()
phylo_data$Ntrees <- length(trees)
phylo_data$Nnodes <- nrow(nodes)
phylo_data$Y <- ordinal_data$Y
phylo_data$Nobservations <- length(phylo_data$Y)
phylo_data$Nedges <- length(edge_po)
phylo_data$observations <- match(pull(d[defined, "glottocode"]), nodes$label)
phylo_data$roots <- roots
phylo_data$edges <- edges[edge_po,]
phylo_data$edge_lengths <- edge_lengths[edge_po]
phylo_data$nthres <- length(unique(phylo_data$Y)) - 1

fit_brownian <- rstan::sampling(
   model_brownian,
   data = phylo_data,
   chains = 4,
   iter = 2000
)

print(fit_brownian, pars=c("mu_root", "sigma_root", "sigma", "Intercept", "lp__"))

Inference for Stan model: anon_model.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

                 mean se_mean    sd     2.5%      25%      50%      75%
mu_root          0.37    0.01  0.25    -0.13     0.20     0.37     0.54
sigma_root       1.83    0.01  0.28     1.34     1.63     1.81     2.01
sigma            1.37    0.02  0.36     0.71     1.14     1.37     1.60
Intercept[1]    -1.17    0.00  0.17    -1.52    -1.28    -1.16    -1.05
Intercept[2]    -0.38    0.00  0.16    -0.70    -0.49    -0.38    -0.27
Intercept[3]     0.96    0.00  0.17     0.64     0.85     0.97     1.08
lp__         -4297.83    1.89 44.43 -4383.08 -4328.19 -4298.55 -4268.17
                97.5% n_eff Rhat
mu_root          0.88  2151 1.00
sigma_root       2.45   718 1.01
sigma            2.12   415 1.01
Intercept[1]    -0.84  2137 1.00
Intercept[2]    -0.07  4349 1.00
Intercept[3]     1.29  4159 1.00
lp__         -4210.12   550 1.01

Samples were drawn using NUTS(diag_e) at Sat Mar 16 20:39:49 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

(loo_brownian <- loo(fit_brownian))

Warning message:
“Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
”


Computed from 4000 by 1107 log-likelihood matrix.

         Estimate   SE
elpd_loo  -1231.3 21.8
p_loo       186.0  6.1
looic      2462.7 43.6
------
MCSE of elpd_loo is NA.
MCSE and ESS estimates assume MCMC draws (r_eff in [0.3, 2.2]).

Pareto k diagnostic values:
                         Count Pct.    Min. ESS
(-Inf, 0.7]   (good)     1100  99.4%   175     
   (0.7, 1]   (bad)         7   0.6%   <NA>    
   (1, Inf)   (very bad)    0   0.0%   <NA>    
See help('pareto-k-diagnostic') for details.

# Extract the generated quantities
generated_quantities <- extract(fit_brownian)

Y <- ordinal_data$Y
empirical <- tibble(
    tv = total_variance(Y),
    lv = lineage_wise_variance(Y, variable),
    cv = crossfamily_variance(Y, variable)
)

prior_pc <- tibble(
    tv = apply(generated_quantities$y_prior, 1, total_variance),
    lv = apply(
        generated_quantities$y_prior, 1, function(x) lineage_wise_variance(x, variable)
    ),
    cv = apply(
        generated_quantities$y_prior, 1, function(x) crossfamily_variance(x, variable)
    ),
)
predictive_plot(prior_pc)

Warning message in geom_point(aes(x = 1, y = empirical$tv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$lv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$cv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”

posterior_pc <- tibble(
    tv = apply(generated_quantities$y_posterior, 1, total_variance),
    lv = apply(
        generated_quantities$y_posterior, 1, function(x) lineage_wise_variance(x, variable)
    ),
    cv = apply(
        generated_quantities$y_posterior, 1, function(x) crossfamily_variance(x, variable)
    ),
)
predictive_plot(posterior_pc)

Warning message in geom_point(aes(x = 1, y = empirical$tv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$lv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$cv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”

fit_mixed5 <- readRDS("models/fit_mixed5.rds")

# fit_mixed5 <- rstan::sampling(
#    model_mixed,
#    data = phylo_data,
#    chains = 4,
#    iter = 40000,
#    thin = 20,
#    control = list(max_treedepth = 15)
# )

# saveRDS(fit_mixed5, "models/fit_mixed5.rds")

print(fit_mixed5, pars=c("mu_root", "sigma_root", "sigma", "Intercept", "sigma_error", "lp__"))

Inference for Stan model: anon_model.
4 chains, each with iter=40000; warmup=20000; thin=20; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

                 mean se_mean     sd     2.5%      25%      50%      75%
mu_root          0.35    0.00   0.24    -0.12     0.18     0.35     0.51
sigma_root       1.77    0.00   0.26     1.29     1.59     1.76     1.94
sigma            1.34    0.01   0.36     0.67     1.10     1.33     1.58
Intercept[1]    -1.19    0.00   0.17    -1.53    -1.31    -1.18    -1.07
Intercept[2]    -0.39    0.00   0.16    -0.71    -0.50    -0.39    -0.28
Intercept[3]     0.96    0.00   0.17     0.64     0.85     0.96     1.07
sigma_error      0.25    0.01   0.10     0.13     0.18     0.22     0.29
lp__         -4246.96   30.90 395.49 -5120.90 -4492.16 -4205.32 -3959.52
                97.5% n_eff Rhat
mu_root          0.83  4040 1.00
sigma_root       2.35  3992 1.00
sigma            2.11  3667 1.00
Intercept[1]    -0.86  4141 1.00
Intercept[2]    -0.07  4317 1.00
Intercept[3]     1.29  4342 1.00
sigma_error      0.51   167 1.01
lp__         -3585.81   164 1.01

Samples were drawn using NUTS(diag_e) at Sun Mar 10 18:04:25 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

(loo_mixed <- loo(fit_mixed5))

Warning message:
“Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
”


Computed from 4000 by 1107 log-likelihood matrix.

         Estimate   SE
elpd_loo  -1231.4 21.7
p_loo       193.9  6.0
looic      2462.8 43.4
------
MCSE of elpd_loo is NA.
MCSE and ESS estimates assume MCMC draws (r_eff in [0.5, 1.1]).

Pareto k diagnostic values:
                         Count Pct.    Min. ESS
(-Inf, 0.7]   (good)     1101  99.5%   320     
   (0.7, 1]   (bad)         6   0.5%   <NA>    
   (1, Inf)   (very bad)    0   0.0%   <NA>    
See help('pareto-k-diagnostic') for details.

# Extract the generated quantities
generated_quantities <- extract(fit_mixed5)

Y <- ordinal_data$Y
empirical <- tibble(
    tv = total_variance(Y),
    lv = lineage_wise_variance(Y, variable),
    cv = crossfamily_variance(Y, variable)
)

prior_pc <- tibble(
    tv = apply(generated_quantities$y_prior, 1, total_variance),
    lv = apply(
        generated_quantities$y_prior, 1, function(x) lineage_wise_variance(x, variable)
    ),
    cv = apply(
        generated_quantities$y_prior, 1, function(x) crossfamily_variance(x, variable)
    ),
)
predictive_plot(prior_pc)

Warning message in geom_point(aes(x = 1, y = empirical$tv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$lv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$cv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”

posterior_pc <- tibble(
    tv = apply(generated_quantities$y_posterior, 1, total_variance),
    lv = apply(
        generated_quantities$y_posterior, 1, function(x) lineage_wise_variance(x, variable)
    ),
    cv = apply(
        generated_quantities$y_posterior, 1, function(x) crossfamily_variance(x, variable)
    ),
)
predictive_plot(posterior_pc)

Warning message in geom_point(aes(x = 1, y = empirical$tv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$lv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$cv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”

model comparison

tibble(model = c("ordinal", "brownian", "mixed"),
       looic = c(
            loo_ordinal$estimates[3,1],
            loo_brownian$estimates[3,1],
            loo_mixed$estimates[3,1]
       )
) %>%
mutate(delta_looic = round(looic-min(looic), 3)) %>%
arrange(delta_looic)

A tibble: 3 × 3
model	looic	delta_looic
<chr>	<dbl>	<dbl>
brownian	2462.659	0.000
mixed	2462.809	0.150
ordinal	2730.171	267.512

Variable 6: exogamy

variable <- "exogamy"

Y <- d %>%
    select(all_of(variable)) %>%
    na.omit %>%
    pull
Y <- match(Y, sort(unique(Y)))

Since there are only two values for this variable, we need a logistic rather than an ordinal model.

binary_data <- list()
binary_data$Ndata <- length(Y)
binary_data$Y <- Y-1

binary_code <- "
data {
  int<lower=1> Ndata;  // total number of observations
  int<lower=0, upper=1> Y[Ndata];  // binary response variable
}
parameters {
  real Intercept;  // intercept for logistic regression
}

model {
  target += normal_lpdf(Intercept | 0, 2);
  for (n in 1:Ndata) {
    target += bernoulli_logit_lpmf(Y[n] | Intercept);
  }
}

generated quantities {
  int y_prior[Ndata];
  int y_posterior[Ndata];
  vector[Ndata] log_lik;
  real Intercept_prior = normal_rng(0, 2); // sample from the prior
  
  for (n in 1:Ndata) {
    y_prior[n] = bernoulli_logit_rng(Intercept_prior); // simulate prior predictive data
    y_posterior[n] = bernoulli_logit_rng(Intercept); // simulate posterior predictive data
    log_lik[n] = bernoulli_logit_lpmf(Y[n] | Intercept); // log likelihood for each observation
  }
}

"

model_binary <- stan_model(model_code=binary_code)

fit_binary <- rstan::sampling(
    model_binary,
    data=binary_data,
    chains=4,
    iter=2000,
)

print(fit_binary, pars=c("Intercept", "lp__"))

Inference for Stan model: anon_model.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

             mean se_mean   sd    2.5%     25%     50%     75%   97.5% n_eff
Intercept   -0.66    0.00 0.07   -0.79   -0.70   -0.66   -0.61   -0.53  1646
lp__      -666.79    0.02 0.70 -668.79 -666.98 -666.53 -666.33 -666.28  1601
          Rhat
Intercept    1
lp__         1

Samples were drawn using NUTS(diag_e) at Sat Mar 16 20:43:43 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

(loo_binary <- loo(fit_binary))


Computed from 4000 by 1036 log-likelihood matrix.

         Estimate   SE
elpd_loo   -665.6 10.1
p_loo         1.0  0.0
looic      1331.3 20.2
------
MCSE of elpd_loo is 0.0.
MCSE and ESS estimates assume MCMC draws (r_eff in [0.4, 0.4]).

All Pareto k estimates are good (k < 0.7).
See help('pareto-k-diagnostic') for details.

# Extract the generated quantities
generated_quantities <- extract(fit_binary)

Y <- binary_data$Y
empirical <- tibble(
    tv = total_variance(Y),
    lv = lineage_wise_variance(Y, variable),
    cv = crossfamily_variance(Y, variable)
)

prior_pc <- tibble(
    tv = apply(generated_quantities$y_prior, 1, total_variance),
    lv = apply(
        generated_quantities$y_prior, 1, function(x) lineage_wise_variance(x, variable)
    ),
    cv = apply(
        generated_quantities$y_prior, 1, function(x) crossfamily_variance(x, variable)
    ),
)
predictive_plot(prior_pc)

Warning message in geom_point(aes(x = 1, y = empirical$tv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$lv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$cv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”

posterior_pc <- tibble(
    tv = apply(generated_quantities$y_posterior, 1, total_variance),
    lv = apply(
        generated_quantities$y_posterior, 1, function(x) lineage_wise_variance(x, variable)
    ),
    cv = apply(
        generated_quantities$y_posterior, 1, function(x) crossfamily_variance(x, variable)
    ),
)
predictive_plot(posterior_pc)

Warning message in geom_point(aes(x = 1, y = empirical$tv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$lv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$cv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”

defined <- which(!is.na(d[[variable]]))
undefined <- which(is.na(d[[variable]]))

po <- c()
for (i in get_postorder(roots, edges)) {
    if ((i %in% edges[,2]) & !(i %in% undefined)) {
        po <- c(po, i)
    }
}
edge_po <- match(po, edges[,2])

phylo_data = list()
phylo_data$Ntrees <- length(trees)
phylo_data$Nnodes <- nrow(nodes)
phylo_data$Y <- binary_data$Y
phylo_data$Nobservations <- length(phylo_data$Y)
phylo_data$Nedges <- length(edge_po)
phylo_data$observations <- match(pull(d[defined, "glottocode"]), nodes$label)
phylo_data$roots <- roots
phylo_data$edges <- edges[edge_po,]
phylo_data$edge_lengths <- edge_lengths[edge_po]
phylo_data$nthres <- length(unique(phylo_data$Y)) - 1

The phylogenetic ordinal regression models also have to be adjusted to a binary outcome variable.

brownian_binary_code <- "
data {
    int<lower=1> Ntrees;
    int<lower=0> Nnodes;
    int<lower=1> Nobservations;
    int<lower=1> Nedges;
    int<lower=1,upper=Nnodes> observations[Nobservations]; // indices of the observed tips
    int<lower=1, upper=Nnodes> roots[Ntrees];
    int<lower=1, upper=Nnodes> edges[Nedges, 2];
    vector[Nedges] edge_lengths;
    int<lower=0, upper=1> Y[Nobservations];  // response variable
}

parameters {
    vector[Nnodes] z_normal;
    real<lower=0> sigma;
    real mu_root;
    real<lower=0> sigma_root;
}

model {
    vector[Nnodes] z = rep_vector(0, Nnodes);
    target += exponential_lpdf(sigma | 1);
    target += normal_lpdf(mu_root | 0, 1);
    target += exponential_lpdf(sigma_root | 1);
    target += normal_lpdf(z_normal | 0, 1);
    
    for (i in 1:Ntrees) {
        int idx = roots[i];
        z[idx] = mu_root + z_normal[idx] * sigma_root;
    }
    
    for (i in 1:Nedges) {
        int j = Nedges - i + 1;
        int mother_node = edges[j, 1];
        int daughter_node = edges[j, 2];
        real local_mu = z[mother_node];
        real local_sigma = sigma * sqrt(edge_lengths[j]);
        z[daughter_node] = local_mu + z_normal[daughter_node] * local_sigma;
    }

    for (n in 1:Nobservations) {
        target += bernoulli_logit_lpmf(Y[n] | z[observations[n]]);
    }    
}


generated quantities {
    int y_prior[Nobservations]; // Prior predictive samples
    int y_posterior[Nobservations]; // Posterior predictive samples
    vector[Nnodes] z_prior = rep_vector(0, Nnodes);
    vector[Nnodes] z_posterior = rep_vector(0, Nnodes);
    real<lower=0> sigma_prior;
    real mu_root_prior;
    real<lower=0> sigma_root_prior;
    real log_lik[Nobservations];

    for (i in 1:Ntrees) {
        int idx = roots[i];
        z_posterior[idx] = mu_root + z_normal[idx] * sigma_root;
    }
    
    for (i in 1:Nedges) {
        int j = Nedges - i + 1;
        int mother_node = edges[j, 1];
        int daughter_node = edges[j, 2];
        real local_mu = z_posterior[mother_node];
        real local_sigma = sigma * sqrt(edge_lengths[j]);
        z_posterior[daughter_node] = local_mu + z_normal[daughter_node] * local_sigma;
    }


    sigma_prior = exponential_rng(1);
    mu_root_prior = normal_rng(0, 1);
    sigma_root_prior = exponential_rng(1);

    for (i in 1:Ntrees) {
        z_prior[roots[i]] = normal_rng(mu_root_prior, sigma_root_prior);
    }

    
    for (i in 1:Nedges) {
        int j = Nedges - i + 1;
        int mother_node = edges[j, 1];
        int daughter_node = edges[j, 2];
        real local_mu = z_posterior[mother_node];
        real local_sigma = sigma * sqrt(edge_lengths[j]);
        z_prior[daughter_node] = normal_rng(
            z_prior[mother_node],
            sigma_prior * sqrt(edge_lengths[j])
        );
        z_posterior[daughter_node] = local_mu + z_normal[daughter_node] * local_sigma;
    }
    
    for (n in 1:Nobservations) {
        y_prior[n] = bernoulli_logit_rng(z_prior[observations[n]]); 
        y_posterior[n] = bernoulli_logit_rng(z_posterior[observations[n]]); 
        log_lik[n] = bernoulli_logit_lpmf(Y[n] | z_posterior[observations[n]]);
    }
}
"

model_brownian_binary <- stan_model(model_code=brownian_binary_code)

fit_brownian <- rstan::sampling(
   model_brownian_binary,
   data = phylo_data,
   chains = 4,
   iter = 2000
)

print(fit_brownian, pars=c("mu_root", "sigma_root", "sigma", "lp__"))

Inference for Stan model: anon_model.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

               mean se_mean    sd     2.5%      25%      50%      75%    97.5%
mu_root       -1.05    0.01  0.22    -1.54    -1.19    -1.04    -0.90    -0.66
sigma_root     1.39    0.02  0.44     0.61     1.09     1.35     1.66     2.36
sigma          3.16    0.03  0.71     2.00     2.64     3.07     3.59     4.75
lp__       -3638.22    1.93 45.74 -3724.97 -3669.21 -3639.19 -3607.50 -3546.27
           n_eff Rhat
mu_root     1405 1.00
sigma_root   573 1.01
sigma        491 1.00
lp__         559 1.01

Samples were drawn using NUTS(diag_e) at Sat Mar 16 20:45:39 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

(loo_brownian <- loo(fit_brownian))

Warning message:
“Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
”


Computed from 4000 by 1036 log-likelihood matrix.

         Estimate   SE
elpd_loo   -597.7 13.1
p_loo       208.0  6.8
looic      1195.4 26.3
------
MCSE of elpd_loo is NA.
MCSE and ESS estimates assume MCMC draws (r_eff in [0.5, 1.9]).

Pareto k diagnostic values:
                         Count Pct.    Min. ESS
(-Inf, 0.7]   (good)     917   88.5%   391     
   (0.7, 1]   (bad)       41    4.0%   <NA>    
   (1, Inf)   (very bad)  78    7.5%   <NA>    
See help('pareto-k-diagnostic') for details.

# Extract the generated quantities
generated_quantities <- extract(fit_brownian)

Y <- binary_data$Y
empirical <- tibble(
    tv = total_variance(Y),
    lv = lineage_wise_variance(Y, variable),
    cv = crossfamily_variance(Y, variable)
)

prior_pc <- tibble(
    tv = apply(generated_quantities$y_prior, 1, total_variance),
    lv = apply(
        generated_quantities$y_prior, 1, function(x) lineage_wise_variance(x, variable)
    ),
    cv = apply(
        generated_quantities$y_prior, 1, function(x) crossfamily_variance(x, variable)
    ),
)
predictive_plot(prior_pc)

Warning message in geom_point(aes(x = 1, y = empirical$tv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$lv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$cv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”

posterior_pc <- tibble(
    tv = apply(generated_quantities$y_posterior, 1, total_variance),
    lv = apply(
        generated_quantities$y_posterior, 1, function(x) lineage_wise_variance(x, variable)
    ),
    cv = apply(
        generated_quantities$y_posterior, 1, function(x) crossfamily_variance(x, variable)
    ),
)
predictive_plot(posterior_pc)

Warning message in geom_point(aes(x = 1, y = empirical$tv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$lv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$cv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”

mixed_binary_code <- "
data {
    int<lower=1> Ntrees;
    int<lower=0> Nnodes;
    int<lower=1> Nobservations;
    int<lower=1> Nedges;
    int<lower=1,upper=Nnodes> observations[Nobservations]; // indices of the observed tips
    int<lower=1, upper=Nnodes> roots[Ntrees];
    int<lower=1, upper=Nnodes> edges[Nedges, 2];
    vector[Nedges] edge_lengths;
    int<lower=0, upper=1> Y[Nobservations];  // response variable
}

parameters {
    vector[Nnodes] z_normal;
    real mu;
    real<lower=0> sigma;
    real mu_root;
    real<lower=0> sigma_root;
    real<lower=0> sigma_error;
    vector[Nobservations] z_error;
}

model {
    vector[Nnodes] z = rep_vector(0, Nnodes);
    target += normal_lpdf(mu | 0, 10);
    target += exponential_lpdf(sigma | 1);
    target += normal_lpdf(mu_root | 0, 1);
    target += exponential_lpdf(sigma_root | 1);
    target += normal_lpdf(z_normal | 0, 1);
    target += inv_gamma_lpdf(square(sigma_error) | 2, 0.1);

    target += normal_lpdf(z_error | 0, sigma_error);

    for (i in 1:Ntrees) {
        int idx = roots[i];
        z[idx] = mu_root + z_normal[idx] * sigma_root;
    }
    
    for (i in 1:Nedges) {
        int j = Nedges - i + 1;
        int mother_node = edges[j, 1];
        int daughter_node = edges[j, 2];
        real local_mu = z[mother_node];
        real local_sigma = sigma * sqrt(edge_lengths[j]);
        z[daughter_node] = local_mu + z_normal[daughter_node] * local_sigma;
    }

    for (n in 1:Nobservations) {
        target += bernoulli_logit_lpmf(Y[n] | z[observations[n]] + z_error[n]);
    }
    
}

generated quantities {
    int y_prior[Nobservations]; // Prior predictive samples
    int y_posterior[Nobservations]; // Posterior predictive samples
    vector[Nnodes] z_prior = rep_vector(0, Nnodes);
    vector[Nnodes] z_posterior = rep_vector(0, Nnodes);
    real<lower=0> sigma_prior;
    real mu_root_prior;
    real<lower=0> sigma_root_prior;
    real log_lik[Nobservations];
    vector[Nobservations] z_error_prior;


    real sigma_error_sq = 1 / gamma_rng(2, 10);
    real sigma_error_prior = sqrt(sigma_error_sq);
    for (n in 1:Nobservations) {
        z_error_prior[n] = normal_rng(0, sigma_error_prior);
    }


    mu_root_prior = normal_rng(0, 1);
    sigma_root_prior = exponential_rng(1);
    sigma_prior = exponential_rng(1);

    for (i in 1:Ntrees) {
        z_prior[roots[i]] = normal_rng(mu_root_prior, sigma_root_prior);
        z_posterior[roots[i]] = mu_root + z_normal[roots[i]] * sigma_root;
    }

    
    for (i in 1:Nedges) {
        int j = Nedges - i + 1;
        int mother_node = edges[j, 1];
        int daughter_node = edges[j, 2];
        real local_mu = z_posterior[mother_node];
        real local_sigma = sigma * sqrt(edge_lengths[j]);
        z_prior[daughter_node] = normal_rng(
            z_prior[mother_node],
            sigma_prior * sqrt(edge_lengths[j])
        );
        z_posterior[daughter_node] = local_mu + z_normal[daughter_node] * local_sigma;
    }
    
    for (n in 1:Nobservations) {
        y_prior[n] = bernoulli_logit_rng(z_prior[observations[n]] + z_error_prior[n]); 
        y_posterior[n] = bernoulli_logit_rng(z_posterior[observations[n]] + z_error[n]); 
        log_lik[n] = bernoulli_logit_lpmf(Y[n] | z_posterior[observations[n]] + z_error[n]);
    }
}

"

model_mixed_binary <- stan_model(model_code=mixed_binary_code)

fit_mixed6 <- readRDS("models/fit_mixed6.rds")

# fit_mixed6 <- rstan::sampling(
#    model_mixed_binary,
#    data = phylo_data,
#    chains = 4,
#    iter=40000,
#    thin=20,
#    control=list(max_treedepth=15)
# )

# saveRDS(fit_mixed6, "models/fit_mixed6.rds")

print(fit_mixed6, pars=c("mu_root", "sigma_root", "sigma", "sigma_error", "lp__"))

Inference for Stan model: anon_model.
4 chains, each with iter=40000; warmup=20000; thin=20; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

                mean se_mean     sd     2.5%      25%      50%      75%
mu_root        -1.08    0.00   0.23    -1.58    -1.21    -1.06    -0.92
sigma_root      1.41    0.01   0.45     0.61     1.10     1.39     1.69
sigma           3.19    0.01   0.71     2.01     2.69     3.10     3.60
sigma_error     0.24    0.01   0.09     0.12     0.17     0.22     0.28
lp__        -3550.26   26.78 359.14 -4338.65 -3768.59 -3515.73 -3298.47
               97.5% n_eff Rhat
mu_root        -0.67  3673 1.00
sigma_root      2.37  3922 1.00
sigma           4.79  3991 1.00
sigma_error     0.47   176 1.02
lp__        -2933.81   180 1.02

Samples were drawn using NUTS(diag_e) at Sun Mar 10 21:12:03 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

(loo_mixed <- loo(fit_mixed6))

Warning message:
“Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
”


Computed from 4000 by 1036 log-likelihood matrix.

         Estimate   SE
elpd_loo   -598.3 13.1
p_loo       214.4  6.8
looic      1196.7 26.2
------
MCSE of elpd_loo is NA.
MCSE and ESS estimates assume MCMC draws (r_eff in [0.7, 1.1]).

Pareto k diagnostic values:
                         Count Pct.    Min. ESS
(-Inf, 0.7]   (good)     993   95.8%   189     
   (0.7, 1]   (bad)       42    4.1%   <NA>    
   (1, Inf)   (very bad)   1    0.1%   <NA>    
See help('pareto-k-diagnostic') for details.

# Extract the generated quantities
generated_quantities <- extract(fit_mixed6)

Y <- binary_data$Y
empirical <- tibble(
    tv = total_variance(Y),
    lv = lineage_wise_variance(Y, variable),
    cv = crossfamily_variance(Y, variable)
)

prior_pc <- tibble(
    tv = apply(generated_quantities$y_prior, 1, total_variance),
    lv = apply(
        generated_quantities$y_prior, 1, function(x) lineage_wise_variance(x, variable)
    ),
    cv = apply(
        generated_quantities$y_prior, 1, function(x) crossfamily_variance(x, variable)
    ),
)
predictive_plot(prior_pc)

Warning message in geom_point(aes(x = 1, y = empirical$tv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$lv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$cv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”

posterior_pc <- tibble(
    tv = apply(generated_quantities$y_posterior, 1, total_variance),
    lv = apply(
        generated_quantities$y_posterior, 1, function(x) lineage_wise_variance(x, variable)
    ),
    cv = apply(
        generated_quantities$y_posterior, 1, function(x) crossfamily_variance(x, variable)
    ),
)
predictive_plot(posterior_pc)

Warning message in geom_point(aes(x = 1, y = empirical$tv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$lv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$cv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”

model comparison

tibble(model = c("binary", "brownian", "mixed"),
       looic = c(
            loo_binary$estimates[3,1],
            loo_brownian$estimates[3,1],
            loo_mixed$estimates[3,1]
       )
) %>%
mutate(delta_looic = round(looic-min(looic), 3)) %>%
arrange(delta_looic)

A tibble: 3 × 3
model	looic	delta_looic
<chr>	<dbl>	<dbl>
brownian	1195.383	0.000
mixed	1196.675	1.292
binary	1331.269	135.885

Variable 7: crop_type

variable <- "crop_type"

Y <- d %>%
    select(all_of(variable)) %>%
    na.omit %>%
    pull
Y <- match(Y, sort(unique(Y)))

Since there is no inherent order in the crop types, we need a logistic rather than an ordinal model.

cardinal_data <- list()
cardinal_data$Ndata <- length(Y)
cardinal_data$Nclasses <- max(Y)
cardinal_data$Y <- Y

cardinal_code <- "
data {
  int<lower=1> Ndata;
  int<lower=1> Nclasses;
  int<lower=0, upper=Nclasses> Y[Ndata];
}
parameters {
  vector[Nclasses] mu;
}

model {
    mu ~ normal(0, 2);
    for (n in 1:Ndata) {
        Y[n] ~ categorical(softmax(mu));
    }
}

generated quantities {
  int y_prior[Ndata];
  int y_posterior[Ndata];
  vector[Ndata] log_lik;
  vector[Nclasses] mu_prior;
  for (k in 1:Nclasses) {
      mu_prior[k ]= normal_rng(0, 2);
  }
    
  
  for (n in 1:Ndata) {
    y_prior[n] = categorical_rng(softmax(mu_prior));
    y_posterior[n] = categorical_rng(softmax(mu));
    log_lik[n] = categorical_lpmf(Y[n] | softmax(mu));
  }
}

"

model_cardinal <- stan_model(model_code = cardinal_code)

fit_cardinal <- rstan::sampling(
    model_cardinal,
    data = cardinal_data,
    chains = 4,
    iter = 2000,
)

print(fit_cardinal, pars = c("mu", "lp__"))

Inference for Stan model: anon_model.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

          mean se_mean   sd     2.5%      25%      50%      75%    97.5% n_eff
mu[1]     1.27    0.04 0.83    -0.32     0.72     1.28     1.82     2.87   538
mu[2]    -3.18    0.04 1.00    -5.24    -3.83    -3.14    -2.54    -1.28   598
mu[3]    -2.42    0.04 0.92    -4.27    -3.03    -2.41    -1.79    -0.67   581
mu[4]     0.27    0.04 0.84    -1.33    -0.28     0.29     0.83     1.86   540
mu[5]     1.39    0.04 0.83    -0.20     0.83     1.40     1.93     2.99   537
mu[6]     2.32    0.04 0.83     0.74     1.78     2.34     2.87     3.92   536
lp__  -1324.64    0.05 1.83 -1329.11 -1325.62 -1324.28 -1323.31 -1322.15  1146
      Rhat
mu[1] 1.00
mu[2] 1.01
mu[3] 1.00
mu[4] 1.00
mu[5] 1.01
mu[6] 1.01
lp__  1.00

Samples were drawn using NUTS(diag_e) at Sat Mar 16 20:51:24 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

(loo_cardinal <- loo(fit_cardinal))


Computed from 4000 by 1103 log-likelihood matrix.

         Estimate   SE
elpd_loo  -1323.5 24.2
p_loo         4.9  0.7
looic      2647.0 48.3
------
MCSE of elpd_loo is 0.0.
MCSE and ESS estimates assume MCMC draws (r_eff in [0.6, 1.1]).

All Pareto k estimates are good (k < 0.7).
See help('pareto-k-diagnostic') for details.

# Extract the generated quantities
generated_quantities <- extract(fit_cardinal)

Y <- cardinal_data$Y
empirical <- tibble(
    tv = total_variance(Y),
    lv = lineage_wise_variance(Y, variable),
    cv = crossfamily_variance(Y, variable)
)

prior_pc <- tibble(
    tv = apply(generated_quantities$y_prior, 1, total_variance),
    lv = apply(
        generated_quantities$y_prior, 1, function(x) lineage_wise_variance(x, variable)
    ),
    cv = apply(
        generated_quantities$y_prior, 1, function(x) crossfamily_variance(x, variable)
    ),
)
predictive_plot(prior_pc)

Warning message in geom_point(aes(x = 1, y = empirical$tv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$lv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$cv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”

posterior_pc <- tibble(
    tv = apply(generated_quantities$y_posterior, 1, total_variance),
    lv = apply(
        generated_quantities$y_posterior, 1, function(x) lineage_wise_variance(x, variable)
    ),
    cv = apply(
        generated_quantities$y_posterior, 1, function(x) crossfamily_variance(x, variable)
    ),
)
predictive_plot(posterior_pc)

Warning message in geom_point(aes(x = 1, y = empirical$tv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$lv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$cv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”

defined <- which(!is.na(d[[variable]]))
undefined <- which(is.na(d[[variable]]))

po <- c()
for (i in get_postorder(roots, edges)) {
  if ((i %in% edges[, 2]) && !(i %in% undefined)) {
    po <- c(po, i)
  }
}
edge_po <- match(po, edges[, 2])

phylo_data <- list()
phylo_data$Ntrees <- length(trees)
phylo_data$Nnodes <- nrow(nodes)
phylo_data$Y <- Y
phylo_data$Nobservations <- length(phylo_data$Y)
phylo_data$Nedges <- length(edge_po)
phylo_data$observations <- match(pull(d[defined, "glottocode"]), nodes$label)
phylo_data$roots <- roots
phylo_data$edges <- edges[edge_po,]
phylo_data$edge_lengths <- edge_lengths[edge_po]
phylo_data$Nclasses <- length(unique(phylo_data$Y))

cardinal_brown_code <- "
data {
    int<lower=1> Ntrees;
    int<lower=0> Nnodes;
    int<lower=1> Nobservations;
    int<lower=1> Nedges;
    int<lower=1> Nclasses;
    int<lower=1,upper=Nnodes> observations[Nobservations];
    int<lower=1, upper=Nnodes> roots[Ntrees];
    int<lower=1, upper=Nnodes> edges[Nedges, 2];
    vector[Nedges] edge_lengths;
    int<lower=0, upper=Nclasses> Y[Nobservations];
}

parameters {
    matrix[Nnodes, Nclasses] z;
    vector[Nclasses] mu_root;
    row_vector<lower=0>[Nclasses] sigma;
    vector<lower=0>[Nclasses] sigma_root;
}

model {
    mu_root ~ normal(0, 1);
    sigma ~ exponential(1);
    sigma_root ~ exponential(1);
   
    for (i in 1:Ntrees) {
        int idx = roots[i];
        z[idx] ~ normal(mu_root, sigma_root);
    }
    
    for (i in 1:Nedges) {
        int j = Nedges - i + 1;
        int mother_node = edges[j, 1];
        int daughter_node = edges[j, 2];
        row_vector[Nclasses] local_mu = z[mother_node];
        row_vector[Nclasses] local_sigma = sigma .* sqrt(edge_lengths[j]);
        for (k in 1:Nclasses) {
            z[daughter_node, k] ~ normal(local_mu[k], local_sigma[k]);
        }
    }

    for (n in 1:Nobservations) {
        Y[n] ~ categorical(softmax(to_vector(z[observations[n]])));
    }
}


generated quantities {
    int y_prior[Nobservations];
    int y_posterior[Nobservations];
    matrix[Nnodes, Nclasses] z_prior = rep_matrix(0, Nnodes, Nclasses);
    row_vector[Nclasses] sigma_prior;
    vector[Nclasses] mu_root_prior;
    vector[Nclasses] sigma_root_prior;
    vector[Nobservations] log_lik;

    for (k in 1:Nclasses) {
        sigma_prior[k] = exponential_rng(1);
        mu_root_prior[k] = normal_rng(0, 1);
        sigma_root_prior[k] = exponential_rng(1);
    }

    for (i in 1:Ntrees) {
        int idx = roots[i];
        for (k in 1:Nclasses) {
            z_prior[idx, k] = normal_rng(mu_root_prior[k], sigma_root_prior[k]);
        }
    }
    
    for (i in 1:Nedges) {
        int j = Nedges - i + 1;
        int mother_node = edges[j, 1];
        int daughter_node = edges[j, 2];
        row_vector[Nclasses] local_mu = z_prior[mother_node];
        row_vector[Nclasses] local_sigma = sigma_prior .* sqrt(edge_lengths[j]);
        for (k in 1:Nclasses) {
            z_prior[daughter_node, k] = normal_rng(local_mu[k], local_sigma[k]);
        }
    }

    
    for (n in 1:Nobservations) {
        y_prior[n] = categorical_rng(softmax(to_vector(z_prior[observations[n]])));
        y_posterior[n] = categorical_rng(softmax(to_vector(z[observations[n]])));
        log_lik[n] = categorical_lpmf(Y[n] | softmax(to_vector(z[observations[n]])));
    }
}

"

model_cardinal_brownian <- stan_model(model_code = cardinal_brown_code)

fit_brownian_cardinal <- rstan::sampling(
  model_cardinal_brownian,
  data = phylo_data,
  chains = 4,
  iter = 2000
)
saveRDS(fit_brownian, "models/fit_cardinal_brownian.rds")

Warning message:
“There were 4000 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
https://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded”
Warning message:
“There were 3 chains where the estimated Bayesian Fraction of Missing Information was low. See
https://mc-stan.org/misc/warnings.html#bfmi-low”
Warning message:
“Examine the pairs() plot to diagnose sampling problems
”
Warning message:
“The largest R-hat is NA, indicating chains have not mixed.
Running the chains for more iterations may help. See
https://mc-stan.org/misc/warnings.html#r-hat”
Warning message:
“Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
Running the chains for more iterations may help. See
https://mc-stan.org/misc/warnings.html#bulk-ess”
Warning message:
“Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
Running the chains for more iterations may help. See
https://mc-stan.org/misc/warnings.html#tail-ess”

#fit_brownian_cardinal <- readRDS("models/fit_cardinal_brownian.rds")

print(fit_brownian_cardinal, pars = c("sigma", "mu_root", "sigma_root", "lp__"))

Inference for Stan model: anon_model.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

                  mean se_mean      sd     2.5%      25%      50%      75%
sigma[1]          4.38    0.89    1.27     2.22     3.65     4.80     5.37
sigma[2]          2.57    0.36    0.54     1.55     2.13     2.76     2.92
sigma[3]          2.28    0.40    0.58     1.63     1.83     1.99     2.71
sigma[4]          4.48    0.76    1.10     2.70     3.85     4.69     5.14
sigma[5]          3.60    0.52    0.76     2.33     2.90     3.53     4.38
sigma[6]          6.17    0.96    1.38     3.87     5.45     6.43     7.17
mu_root[1]        0.98    0.10    0.50    -0.04     0.65     0.93     1.33
mu_root[2]       -2.48    0.14    0.24    -2.91    -2.67    -2.45    -2.27
mu_root[3]       -2.06    0.40    0.59    -2.89    -2.68    -1.88    -1.58
mu_root[4]       -0.69    0.43    0.62    -1.67    -0.95    -0.66    -0.31
mu_root[5]        1.27    0.88    1.28    -0.39     0.16     1.25     2.14
mu_root[6]        1.98    0.28    0.53     1.06     1.64     1.92     2.22
sigma_root[1]     6.95    0.92    1.42     4.75     5.83     6.64     8.04
sigma_root[2]     1.12    0.35    0.50     0.63     0.72     0.89     1.45
sigma_root[3]     1.10    0.32    0.47     0.51     0.70     0.96     1.46
sigma_root[4]     0.58    0.23    0.33     0.16     0.25     0.57     0.86
sigma_root[5]     3.12    0.66    0.97     1.78     2.18     3.03     4.06
sigma_root[6]     4.73    1.04    1.52     2.03     3.85     5.40     5.82
lp__          -5952.34 1675.64 2395.12 -8004.89 -7721.90 -7077.39 -5174.37
                 97.5% n_eff  Rhat
sigma[1]          5.94     2  7.75
sigma[2]          3.44     2  4.41
sigma[3]          3.49     2  6.33
sigma[4]          6.29     2  6.06
sigma[5]          4.63     2  4.99
sigma[6]          8.23     2  7.68
mu_root[1]        1.95    24  1.24
mu_root[2]       -2.13     3  2.08
mu_root[3]       -1.26     2  4.55
mu_root[4]        0.29     2  7.28
mu_root[5]        3.38     2  4.74
mu_root[6]        3.19     3  1.98
sigma_root[1]     9.48     2  4.32
sigma_root[2]     2.10     2  6.42
sigma_root[3]     2.01     2  5.31
sigma_root[4]     1.13     2  5.34
sigma_root[5]     4.56     2  4.74
sigma_root[6]     6.33     2  5.92
lp__          -1227.60     2 11.19

Samples were drawn using NUTS(diag_e) at Sat Mar 16 22:35:53 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

(loo_brownian <- loo(fit_brownian))

Warning message:
“Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
”


Computed from 4000 by 1036 log-likelihood matrix.

         Estimate   SE
elpd_loo   -597.7 13.1
p_loo       208.0  6.8
looic      1195.4 26.3
------
MCSE of elpd_loo is NA.
MCSE and ESS estimates assume MCMC draws (r_eff in [0.5, 1.9]).

Pareto k diagnostic values:
                         Count Pct.    Min. ESS
(-Inf, 0.7]   (good)     917   88.5%   391     
   (0.7, 1]   (bad)       41    4.0%   <NA>    
   (1, Inf)   (very bad)  78    7.5%   <NA>    
See help('pareto-k-diagnostic') for details.

# Extract the generated quantities
generated_quantities <- extract(fit_brownian_cardinal)

Y <- cardinal_data$Y
empirical <- tibble(
    tv = total_variance(Y),
    lv = lineage_wise_variance(Y, variable),
    cv = crossfamily_variance(Y, variable)
)

prior_pc <- tibble(
    tv = apply(generated_quantities$y_prior, 1, total_variance),
    lv = apply(
        generated_quantities$y_prior, 1, function(x) lineage_wise_variance(x, variable)
    ),
    cv = apply(
        generated_quantities$y_prior, 1, function(x) crossfamily_variance(x, variable)
    ),
)
predictive_plot(prior_pc)

Warning message in geom_point(aes(x = 1, y = empirical$tv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$lv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$cv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”

posterior_pc <- tibble(
    tv = apply(generated_quantities$y_posterior, 1, total_variance),
    lv = apply(
        generated_quantities$y_posterior, 1, function(x) lineage_wise_variance(x, variable)
    ),
    cv = apply(
        generated_quantities$y_posterior, 1, function(x) crossfamily_variance(x, variable)
    ),
)
predictive_plot(posterior_pc)

Warning message in geom_point(aes(x = 1, y = empirical$tv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$lv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$cv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”

cardinal_mixed_code <- "
data {
    int<lower=1> Ntrees;
    int<lower=0> Nnodes;
    int<lower=1> Nobservations;
    int<lower=1> Nedges;
    int<lower=1> Nclasses;
    int<lower=1,upper=Nnodes> observations[Nobservations];
    int<lower=1, upper=Nnodes> roots[Ntrees];
    int<lower=1, upper=Nnodes> edges[Nedges, 2];
    vector[Nedges] edge_lengths;
    int<lower=0, upper=Nclasses> Y[Nobservations];
}

parameters {
    matrix[Nnodes, Nclasses] z;
    vector[Nclasses] mu_root;
    row_vector<lower=0>[Nclasses] sigma;
    vector<lower=0>[Nclasses] sigma_root;
    matrix[Nobservations, Nclasses] z_error;
    vector<lower=0>[Nclasses] sigma_error;

}

model {
    mu_root ~ normal(0, 1);
    sigma ~ exponential(1);
    sigma_root ~ exponential(1);
    sigma_error ~ exponential(1);
    for (i in 1:Nobservations) {
        z_error[i] ~ normal(0, sigma_error);
    }
   
    for (i in 1:Ntrees) {
        int idx = roots[i];
        z[idx] ~ normal(mu_root, sigma_root);
    }
    
    for (i in 1:Nedges) {
        int j = Nedges - i + 1;
        int mother_node = edges[j, 1];
        int daughter_node = edges[j, 2];
        row_vector[Nclasses] local_mu = z[mother_node];
        row_vector[Nclasses] local_sigma = sigma .* sqrt(edge_lengths[j]);
        for (k in 1:Nclasses) {
            z[daughter_node, k] ~ normal(local_mu[k], local_sigma[k]);
        }
    }

    for (n in 1:Nobservations) {
        Y[n] ~ categorical(softmax(to_vector(
            z[observations[n]] + z_error[n]
            )));
    }
}


generated quantities {
    int y_prior[Nobservations];
    int y_posterior[Nobservations];
    matrix[Nnodes, Nclasses] z_prior = rep_matrix(0, Nnodes, Nclasses);
    row_vector[Nclasses] sigma_prior;
    vector[Nclasses] mu_root_prior;
    vector[Nclasses] sigma_root_prior;
    matrix[Nobservations, Nclasses] z_error_prior;
    vector<lower=0>[Nclasses] sigma_error_prior;
    vector[Nobservations] log_lik;

    for (k in 1:Nclasses) {
        sigma_prior[k] = exponential_rng(1);
        mu_root_prior[k] = normal_rng(0, 1);
        sigma_root_prior[k] = exponential_rng(1);
        sigma_error_prior[k] = exponential_rng(1);
        for (i in 1:Nobservations) {
            z_error_prior[i, k] = normal_rng(0, sigma_error_prior[k]);
        }
    }

    for (i in 1:Ntrees) {
        int idx = roots[i];
        for (k in 1:Nclasses) {
            z_prior[idx, k] = normal_rng(mu_root_prior[k], sigma_root_prior[k]);
        }
    }
    
    for (i in 1:Nedges) {
        int j = Nedges - i + 1;
        int mother_node = edges[j, 1];
        int daughter_node = edges[j, 2];
        row_vector[Nclasses] local_mu = z_prior[mother_node];
        row_vector[Nclasses] local_sigma = sigma_prior .* sqrt(edge_lengths[j]);
        for (k in 1:Nclasses) {
            z_prior[daughter_node, k] = normal_rng(local_mu[k], local_sigma[k]);
        }
    }

    
    for (n in 1:Nobservations) {
        y_prior[n] = categorical_rng(softmax(to_vector(
            z_prior[observations[n]] + z_error_prior[n]
            )));
        y_posterior[n] = categorical_rng(softmax(to_vector(
            z[observations[n]] + z_error[n]
            )));
        log_lik[n] = categorical_lpmf(Y[n] | softmax(to_vector(
            z[observations[n]] + z_error[n]
            )));
    }
}


"

model_mixed7 <- stan_model(model_code = cardinal_mixed_code)

fit_mixed7 <- readRDS("models/fit_mixed7.rds")

# fit_mixed7 <- rstan::sampling(
#    model_mixed7,
#    data = phylo_data,
#    chains = 4,
#    iter = 2000,
#    thin = 1,
#    control = list(max_treedepth=15)
# )

#saveRDS(fit_mixed7, "models/fit_mixed7.rds")

print(fit_mixed7, pars=c("mu_root", "sigma_root", "sigma", "sigma_error", "lp__"))

Inference for Stan model: anon_model.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

                   mean se_mean      sd      2.5%       25%       50%      75%
mu_root[1]         1.53    0.39    0.79     -0.09      0.98      1.59     2.07
mu_root[2]        -2.36    0.33    0.48     -3.08     -2.75     -2.48    -1.89
mu_root[3]        -2.22    0.44    0.63     -3.03     -2.75     -2.39    -1.75
mu_root[4]        -0.31    0.69    0.98     -1.84     -1.01     -0.19     0.37
mu_root[5]         1.64    0.30    0.54      0.65      1.21      1.66     2.06
mu_root[6]         2.22    0.40    0.75      0.57      1.78      2.35     2.75
sigma_root[1]      7.72    1.30    1.91      5.31      6.30      7.11     8.43
sigma_root[2]      0.97    0.41    0.59      0.23      0.44      0.90     1.40
sigma_root[3]      1.22    0.43    0.62      0.44      0.63      1.12     1.69
sigma_root[4]      0.73    0.32    0.49      0.25      0.38      0.57     0.81
sigma_root[5]      3.86    0.54    0.81      2.58      3.09      3.92     4.62
sigma_root[6]      6.17    0.28    0.49      5.10      5.83      6.24     6.55
sigma[1]           5.39    0.66    0.96      4.21      4.52      5.40     5.96
sigma[2]           3.39    0.62    0.88      2.22      2.75      3.22     3.91
sigma[3]           2.89    0.45    0.65      2.13      2.24      2.80     3.53
sigma[4]           5.09    0.97    1.39      3.36      4.05      4.69     6.14
sigma[5]           3.82    1.05    1.49      1.76      2.85      3.61     4.62
sigma[6]           5.60    1.05    1.51      3.33      4.56      5.62     6.42
sigma_error[1]     0.68    0.13    0.19      0.35      0.52      0.72     0.86
sigma_error[2]     0.75    0.25    0.36      0.29      0.45      0.70     1.10
sigma_error[3]     0.62    0.12    0.18      0.29      0.54      0.70     0.73
sigma_error[4]     1.04    0.22    0.31      0.65      0.83      0.97     1.15
sigma_error[5]     1.03    0.04    0.10      0.88      0.96      1.00     1.08
sigma_error[6]     1.62    0.42    0.60      0.88      1.01      1.63     2.21
lp__           -9806.20 1631.45 2331.40 -12416.44 -11840.28 -10335.07 -8565.76
                  97.5% n_eff  Rhat
mu_root[1]         2.97     4  1.57
mu_root[2]        -1.68     2  6.05
mu_root[3]        -1.18     2  6.60
mu_root[4]         1.03     2  9.25
mu_root[5]         2.63     3  2.01
mu_root[6]         3.47     3  1.91
sigma_root[1]     11.63     2  4.75
sigma_root[2]      2.00     2  6.72
sigma_root[3]      2.28     2  7.86
sigma_root[4]      1.97     2  5.86
sigma_root[5]      5.08     2  3.48
sigma_root[6]      6.95     3  2.02
sigma[1]           7.21     2  5.83
sigma[2]           4.93     2  9.52
sigma[3]           3.89     2  6.08
sigma[4]           7.61     2  7.43
sigma[5]           6.33     2 10.43
sigma[6]           7.99     2  8.17
sigma_error[1]     0.92     2  8.14
sigma_error[2]     1.31     2 10.38
sigma_error[3]     0.83     2  7.90
sigma_error[4]     1.62     2  9.82
sigma_error[5]     1.27     7  2.21
sigma_error[6]     2.38     2 10.29
lp__           -5819.33     2 10.72

Samples were drawn using NUTS(diag_e) at Sat Mar 16 07:19:14 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

(loo_mixed <- loo(fit_mixed7))

Warning message:
“Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
”


Computed from 4000 by 1103 log-likelihood matrix.

         Estimate   SE
elpd_loo   -546.6 24.2
p_loo       374.9 18.7
looic      1093.1 48.5
------
MCSE of elpd_loo is NA.
MCSE and ESS estimates assume MCMC draws (r_eff in [0.0, 1.0]).

Pareto k diagnostic values:
                         Count Pct.    Min. ESS
(-Inf, 0.7]   (good)     302   27.4%   0       
   (0.7, 1]   (bad)      365   33.1%   <NA>    
   (1, Inf)   (very bad) 436   39.5%   <NA>    
See help('pareto-k-diagnostic') for details.

# Extract the generated quantities
generated_quantities <- extract(fit_mixed7)

Y <- cardinal_data$Y
empirical <- tibble(
    tv = total_variance(Y),
    lv = lineage_wise_variance(Y, variable),
    cv = crossfamily_variance(Y, variable)
)

prior_pc <- tibble(
    tv = apply(generated_quantities$y_prior, 1, total_variance),
    lv = apply(
        generated_quantities$y_prior, 1, function(x) lineage_wise_variance(x, variable)
    ),
    cv = apply(
        generated_quantities$y_prior, 1, function(x) crossfamily_variance(x, variable)
    ),
)
predictive_plot(prior_pc)

Warning message in geom_point(aes(x = 1, y = empirical$tv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$lv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$cv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”

posterior_pc <- tibble(
    tv = apply(generated_quantities$y_posterior, 1, total_variance),
    lv = apply(
        generated_quantities$y_posterior, 1, function(x) lineage_wise_variance(x, variable)
    ),
    cv = apply(
        generated_quantities$y_posterior, 1, function(x) crossfamily_variance(x, variable)
    ),
)
predictive_plot(posterior_pc)

Warning message in geom_point(aes(x = 1, y = empirical$tv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$lv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”
Warning message in geom_point(aes(x = 1, y = empirical$cv, color = "empirical"), :
“All aesthetics have length 1, but the data has 4000 rows.
ℹ Did you mean to use `annotate()`?”

model comparison

tibble(model = c("cardinal", "brownian", "mixed"),
       looic = c(
            loo_cardinal$estimates[3,1],
            loo_brownian$estimates[3,1],
            loo_mixed$estimates[3,1]
       )
) %>%
mutate(delta_looic = round(looic-min(looic), 3)) %>%
arrange(delta_looic)

A tibble: 3 × 3
model	looic	delta_looic
<chr>	<dbl>	<dbl>
mixed	1093.123	0.000
brownian	1195.383	102.261
cardinal	2646.981	1553.858

wrapping up

For all seven variables, the mixed model is clearly better than the non-phylogenetic logistic model, and it is about as good as the Brownian motion model. It is thus justified to work with for downstream tasks.

z_error <- list(
    extract(fit_mixed1, pars="z_error")$z_error,
    extract(fit_mixed2, pars="z_error")$z_error,
    extract(fit_mixed3, pars="z_error")$z_error,
    extract(fit_mixed4, pars="z_error")$z_error,
    extract(fit_mixed5, pars="z_error")$z_error,
    extract(fit_mixed6, pars="z_error")$z_error,
    extract(fit_mixed7, pars="z_error")$z_error
)

variables <- colnames(d)[3:9]

# for each i in 1:6, create a new column in d and fill those rows where d[[variables[i]]] is NA with NA, and the other rows with the posterior mean of z_error[[i]]

for (i in 1:6) {
    d[[paste0(variables[i], "_residuals")]] <- NA
    d[[paste0(variables[i], "_residuals")]][!is.na(d[[variables[i]]])] <- apply(z_error[[i]], 2, mean) 
}

z_error_7 <- apply(extract(fit_mixed7, pars="z_error")$z_error, c(2,3), mean)

z7_names <- paste(variables[7], 1:6, 'residuals', sep="_")

for (i in 1:6) {
    d[[z7_names[i]]] <- NA
    d[[z7_names[i]]][!is.na(d[[variables[7]]])] <- z_error_7[,i]
}

write_csv(d, "../data/data_with_latent_variables.csv")