Finding the best marginal model for each variable

Before investigating the correlations between the variables, I want to check which phylogenetic model provides the best fit for each individual variable.

All variables are ordinal.

I will test three models for each variable:

  • an ordinal regression without any phylogenetic component,
  • a phylogenetic Brownian motion model, and
  • a mixed model combining a phylogenetic and an uncorrelated random effect.

For all models, Bayesian inference will be used, implemented in Stan.

Preparation

In my experience, R is the best environment for working with Stan.

I start with loading relevant packages.

library(repr)
options(repr.plot.width=12, repr.plot.height=10)
library(tidyverse)
library(gridExtra)
library(patchwork)
library(stringr)
library(reshape2)
library(rstan)
library(bridgesampling)
library(loo)
library(ape)
library(bayesplot)
library(shinystan)
library(coda)
library(codetools)
library(brms)
library(abind)

options(mc.cores = parallel::detectCores())
rstan_options("auto_write" = TRUE)
karminrot <- rgb(206/255, 143/255, 137/255)
gold <- rgb(217/255, 205/255, 177/255)
anthrazit <- rgb(143/255, 143/255, 149/255)

Next, the data are loaded into a tibble called d.

d <- read_csv("../data/data_with_families.csv") %>%
    distinct(glottocode, .keep_all = TRUE)
taxa <- d$glottocode
head(d)
Rows: 1291 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): soc_id, glottocode, Family
dbl (7): political_complexity, hierarchy_within, domestic_organisation, agri...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
A tibble: 6 × 10
soc_id glottocode political_complexity hierarchy_within domestic_organisation agricultureLevel settlement_strategy exogamy crop_type Family
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
Aa1 juho1239 1 3 3 0 1 1 1 Kxa
Aa2 okie1245 1 2 1 0 1 1 1 Nilotic
Aa3 nama1265 2 2 1 0 1 1 1 Khoe_Kwadi
Aa4 dama1270 NA NA 2 0 NA 1 NA Khoe_Kwadi
Aa5 bila1255 1 2 1 0 1 1 1 Atlantic_Congo
Aa6 sand1273 2 2 1 5 3 1 6 Sandawe

In the next step, the maximum clade credibility trees are loaded as phylo objects.

mcc_trees <- list.files("../data/mcc_trees", pattern = ".tre", full.names = TRUE)

trees <- purrr::map(mcc_trees, read.tree)

# get a vector of tip labels of all trees in trees

tips <- purrr::map(trees, function(x) x$tip.label) %>% unlist %>% unique

A brief sanity check to ensure that all tips of the trees correspond to a Glottocode in d.

if (length(setdiff(tips, taxa)) == 0 ) {
    print("All tips in trees are in taxa")
} else {
    print("Not all tips in trees are in taxa")
}
[1] "All tips in trees are in taxa"

All Glottocodes in d that do not occur as a tip are isolates (for the purpose of this study).

For each isolate, I create a list object with the fields "tip.label" and "Nnode" to make it compatible with the phylo objects for language families. These list objects are added to the list trees.

isolates <- setdiff(taxa, tips)

# convert isolates to one-node trees and add to trees
for (i in isolates) {
    tr <- list()
    tr[["tip.label"]] <- i
    tr[["Nnode"]] <- 0
    trees[[length(trees) + 1]] <- tr
}

Just to make sure, I test whether the tips of this forest are exactly the Glottocodes in the data.

tips <- c()
for (tr in trees) {
    tips <- c(tips, tr$tip.label)
}

if (all(sort(tips) == sort(taxa))) {
    print("taxa and tips are identical")
} else {
    print("taxa and tips are not identical")
}
[1] "taxa and tips are identical"

Now I re-order the rows in d to make sure they are in the same order as tips.

d <- d[match(tips, d$glottocode), ]
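
The match() idiom used here reorders the rows of a data frame so that they line up with an external key vector. A minimal sketch on a hypothetical toy data frame:

```r
# match(keys, df$id) returns, for each key, its row position in df,
# so indexing with the result reorders the rows into key order.
df <- data.frame(id = c("b", "a", "c"), x = 1:3)
keys <- c("a", "b", "c")
df_reordered <- df[match(keys, df$id), ]
df_reordered$id  # "a" "b" "c"
```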

The trees in trees form a forest, i.e., a disconnected directed acyclic graph. I want to construct a representation of this graph that I can pass to Stan, consisting of

  • the number of nodes N,
  • an Ex2 matrix edges with one row per edge (E being the number of edges),
  • a length-E vector of edge lengths, edge_lengths,
  • a vector of root nodes, and
  • a vector of tip nodes.

First I create an Nx4 matrix where each row represents one node. The columns are

  • node number
  • tree number
  • tree-internal node number
  • label

For internal nodes, the label is the empty string.

node_matrix <- c()
node_number <- 1
for (tn in 1:length(trees)) {
    tr <- trees[[tn]]
    tr_nodes <- length(tr$tip.label) + tr$Nnode
    for (i in 1:tr_nodes) {
        tl <- ""
        if (i <= length(tr$tip.label)) {
            tl <- tr$tip.label[i]
        }
        node_matrix <- rbind(node_matrix, c(node_number, tn, i, tl))
        node_number <- node_number+1
    }
}

I convert the matrix to a tibble.

nodes <- tibble(
    node_number = as.numeric(node_matrix[,1]),
    tree_number = as.numeric(node_matrix[,2]),
    internal_number = as.numeric(node_matrix[,3]),
    label = node_matrix[,4]
)
head(nodes)
tail(nodes)
A tibble: 6 × 4
node_number tree_number internal_number label
<dbl> <dbl> <dbl> <chr>
1 1 1 kron1241
2 1 2 tuli1249
3 1 3
4 2 1 zarm1239
5 2 2 koyr1242
6 2 3
A tibble: 6 × 4
node_number tree_number internal_number label
<dbl> <dbl> <dbl> <chr>
2231 179 1 qawa1238
2232 180 1 cham1315
2233 181 1 nort2971
2234 182 1 trum1247
2235 183 1 sout2994
2236 184 1 guat1253

Next I define a function that maps a tree number and an internal node number to the global node number.

get_global_id <- function(tree_number, local_id) which((nodes$tree_number == tree_number) & (nodes$internal_number == local_id))

Now I create a matrix of edges and, concomitantly, a vector of edge lengths.

edges <- c()
edge_lengths <- c()
for (tn in 1:length(trees)) {
    tr <- trees[[tn]]
    if (length(tr$edge) > 0 ) {
        for (en in 1:nrow(tr$edge)) {
            mother <- tr$edge[en, 1]
            daughter <- tr$edge[en, 2]
            mother_id <- get_global_id(tn, mother)
            daughter_id <- get_global_id(tn, daughter)
            el <- tr$edge.length[en]
            edges <- rbind(edges, c(mother_id, daughter_id))
            edge_lengths <- c(edge_lengths, el)
        }
    }
}

Finally I identify the root nodes and tip nodes.

roots <- c(sort(setdiff(edges[,1], edges[,2])), match(isolates, nodes$label))
tips <- which(nodes$label != "")
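
The setdiff() trick above rests on the fact that the root of each tree is the one node appearing as a mother but never as a daughter. A sketch on a hypothetical toy edge matrix:

```r
# Toy forest (hypothetical): two trees with roots 3 and 6.
# Column 1 holds mother nodes, column 2 daughter nodes.
toy_edges <- rbind(c(3, 1), c(3, 2), c(6, 4), c(6, 5))
sort(setdiff(toy_edges[, 1], toy_edges[, 2]))  # 3 6
```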

Next I define a function computing the post-order.

get_postorder <- function(roots, edges) {
    input <- roots
    output <- c()
    while (length(input) > 0) {
        pivot <- input[1]
        daughters <- setdiff(edges[edges[,1] == pivot, 2], output)
        if (length(daughters) == 0) {
            input <- input[-1]
            output <- c(output, pivot)
        } else {
            input <- c(daughters, input)
        }
    }
    return(output)
}
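
To see what the traversal guarantees, here is the same function applied to a hypothetical two-edge tree (the definition is repeated so the snippet is self-contained): a mother is emitted only after all of its daughters.

```r
# Self-contained copy of get_postorder, applied to a toy tree
# in which node 3 is the root with daughters 1 and 2.
get_postorder <- function(roots, edges) {
    input <- roots
    output <- c()
    while (length(input) > 0) {
        pivot <- input[1]
        daughters <- setdiff(edges[edges[, 1] == pivot, 2], output)
        if (length(daughters) == 0) {
            input <- input[-1]
            output <- c(output, pivot)
        } else {
            input <- c(daughters, input)
        }
    }
    return(output)
}
toy_edges <- rbind(c(3, 1), c(3, 2))
get_postorder(3, toy_edges)  # 1 2 3: the root comes last
```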

Ordinal regression

For ordinal regression, the model specifications are as follows. \(K\) is the number of possible values.

\[ \begin{aligned} C_k &\sim \mathcal N(0, 5)\\ P(Y_i=1) &= \mathrm{logit}^{-1}(C_1)\\ P(Y_i=k) &= \mathrm{logit}^{-1}(C_k) - \mathrm{logit}^{-1}(C_{k-1}) \quad (1 < k < K)\\ P(Y_i=K) &= 1-\mathrm{logit}^{-1}(C_{K-1}) \end{aligned} \]
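
This cumulative scheme is easy to sketch in R: the probability of category \(k\) is the difference of two inverse-logit terms, with plogis() playing the role of \(\mathrm{logit}^{-1}\). The threshold values below are made up for illustration; the helper name is hypothetical.

```r
# Sketch of the ordered-logit category probabilities for thresholds C.
# Implicitly C_0 = -Inf and C_K = Inf, so the probabilities sum to 1.
ordered_logit_probs <- function(C, z = 0) {
    cum <- c(plogis(C - z), 1)  # cumulative probabilities P(Y <= k)
    diff(c(0, cum))             # category probabilities P(Y = k)
}
p <- ordered_logit_probs(c(-0.25, 1.06, 2.07, 3.18))
sum(p)  # 1
```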

Variable 1: Political complexity

variable <- "political_complexity"
Y <- d %>%
    select(all_of(variable)) %>%
    na.omit %>%
    pull
Y <- match(Y, sort(unique(Y)))
ordinal_data <- list()
ordinal_data$Ndata <- length(Y)
ordinal_data$nthres <- max(Y) - 1
ordinal_data$Y <- Y
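
The match(Y, sort(unique(Y))) line recodes the observed levels as consecutive integers 1, …, K, which is what ordered_logistic in Stan expects. A quick illustration on made-up values:

```r
# Each value is replaced by its rank among the sorted unique values,
# turning arbitrary ordinal codes into consecutive integers 1..K.
Y_raw <- c(2, 5, 2, 9)
Y_recoded <- match(Y_raw, sort(unique(Y_raw)))
Y_recoded  # 1 2 1 3
```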

I generated Stan code for ordinal regression with brms and then made some manual adjustments.

ordinal_code <- "
data {
  int<lower=1> Ndata;  // total number of observations
  int Y[Ndata];  // response variable
  int<lower=2> nthres;  // number of thresholds
}
parameters {
  ordered[nthres] Intercept;  // temporary thresholds for centered predictors
}

model {
  for (n in 1:Ndata) {
    target += ordered_logistic_lpmf(Y[n] | 0, Intercept);
  }
  target += normal_lpdf(Intercept | 0, 5);
}

generated quantities {
  int y_prior[Ndata];
  int y_posterior[Ndata];
  vector[nthres] Intercept_sample;
  ordered[nthres] Intercept_prior;
  vector[Ndata] log_lik;
  for (k in 1:nthres) {
    Intercept_sample[k] = normal_rng(0, 5);
  }
  Intercept_prior = sort_asc(Intercept_sample);
  for (n in 1:Ndata) {
    y_prior[n] = ordered_logistic_rng(0, Intercept_prior);
    y_posterior[n] = ordered_logistic_rng(0, Intercept);
    log_lik[n] = ordered_logistic_lpmf(Y[n] | 0, Intercept);
  }
}

"
model_ordinal <- stan_model(model_code=ordinal_code)
fit_ordinal <- rstan::sampling(
    model_ordinal,
    data=ordinal_data,
    chains=4,
    iter=2000,
)
print(fit_ordinal, pars=c("Intercept", "lp__"))
Inference for Stan model: anon_model.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

                 mean se_mean   sd     2.5%      25%      50%      75%    97.5%
Intercept[1]    -0.25    0.00 0.06    -0.37    -0.29    -0.25    -0.20    -0.12
Intercept[2]     1.06    0.00 0.07     0.93     1.02     1.06     1.11     1.20
Intercept[3]     2.07    0.00 0.09     1.88     2.00     2.06     2.13     2.25
Intercept[4]     3.18    0.00 0.15     2.89     3.07     3.18     3.28     3.49
lp__         -1436.95    0.03 1.39 -1440.41 -1437.64 -1436.68 -1435.94 -1435.19
             n_eff Rhat
Intercept[1]  3469    1
Intercept[2]  5482    1
Intercept[3]  4885    1
Intercept[4]  4118    1
lp__          2333    1

Samples were drawn using NUTS(diag_e) at Sun Feb 11 15:59:38 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).
(loo_ordinal <- loo(fit_ordinal))

Computed from 4000 by 1077 log-likelihood matrix

         Estimate   SE
elpd_loo  -1428.9 21.9
p_loo         4.0  0.2
looic      2857.8 43.7
------
Monte Carlo SE of elpd_loo is 0.0.

All Pareto k estimates are good (k < 0.5).
See help('pareto-k-diagnostic') for details.
# Extract the generated quantities
generated_quantities <- extract(fit_ordinal)

Predictive checks

Following Jäger & Wahle (2021), I define three statistics to be used in prior and posterior predictive checks:

  • total variance (tv)
  • mean lineage-wise variance (lv)
  • cross-family variance (cv)

Here are the corresponding R functions.

total_variance <- function(Y) var(Y)
lineage_wise_variance <- function(Y, variable) {
    out <- d %>% 
        filter(!is.na(d[[variable]])) %>%
        mutate(Y = Y) %>%
        group_by(Family) %>%
        summarize(variance = var(Y, na.rm = TRUE)) %>%
        summarize(mean_variance = mean(variance, na.rm = TRUE)) %>%
        pull
    return(out)
}
crossfamily_variance <- function(Y, variable) {
    out <- d %>% 
        filter(!is.na(d[[variable]])) %>%
        mutate(Y = Y) %>%
        group_by(Family) %>%
        summarize(centroids = mean(Y, na.rm = TRUE)) %>%
        pull(centroids) %>%
        var
    return(out)
}
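
To make the three statistics concrete, here is a self-contained toy version in base R that does not depend on the global tibble d; the family labels and values are made up for illustration.

```r
# Hypothetical toy data: two families with two societies each.
toy_fam <- c("A", "A", "B", "B")
toy_Y <- c(1, 3, 2, 2)

tv <- var(toy_Y)                         # total variance: 2/3
lv <- mean(tapply(toy_Y, toy_fam, var))  # mean within-family variance: (2 + 0)/2 = 1
cv <- var(tapply(toy_Y, toy_fam, mean))  # variance of family centroids: var(c(2, 2)) = 0
c(tv = tv, lv = lv, cv = cv)
```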

First we get the empirical values.

Y <- ordinal_data$Y
empirical <- tibble(
    tv = total_variance(Y),
    lv = lineage_wise_variance(Y, variable),
    cv = crossfamily_variance(Y, variable)
)

Thanks to the generated quantities block in the Stan program, fitting the model simultaneously produced prior and posterior predictive samples.

prior_pc <- tibble(
    tv = apply(generated_quantities$y_prior, 1, total_variance),
    lv = apply(
        generated_quantities$y_prior, 1, function(x) lineage_wise_variance(x, variable)
    ),
    cv = apply(
        generated_quantities$y_prior, 1, function(x) crossfamily_variance(x, variable)
    ),
)
predictive_plot <- function(prior_pc) {
    p1 <- prior_pc %>%
      mutate(
          x_jittered = 1 + rnorm(n(), mean = 0, sd = 0.01),
          y_jittered = tv + rnorm(n(), mean = 0, sd = sd(prior_pc$tv)/10)
      ) %>%
      ggplot(aes(x = x_jittered, y = tv)) +
      geom_boxplot(aes(x = factor(1), y = tv), width = 0.5, fill=karminrot) +
      geom_point(alpha = 0.5, size = 0.1, position = position_identity()) +
      theme_grey(base_size = 16) +
      theme(
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_text(size = 18),
        plot.title = element_text(size = 20, hjust = 0.5)
      ) +
      labs(title = "total variance", y = "") +
      geom_point(aes(x=1, y=empirical$tv, color="empirical"), size=5, fill='red') +
      guides(color = "none")

    p2 <- prior_pc %>%
      mutate(
          x_jittered = 1 + rnorm(n(), mean = 0, sd = 0.01),
          y_jittered = lv + rnorm(n(), mean = 0, sd = sd(prior_pc$lv)/10)
      ) %>%
      ggplot(aes(x = x_jittered, y = lv)) +
      geom_boxplot(aes(x = factor(1), y = lv), width = 0.5, fill=gold) +
      geom_point(alpha = 0.5, size = 0.1, position = position_identity()) +
      theme_grey(base_size = 16) +
      theme(
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_text(size = 18),
        plot.title = element_text(size = 20, hjust = 0.5)
      ) +
      labs(title = "lineage-wise variance", y = "") +
      geom_point(aes(x=1, y=empirical$lv, color="empirical"), size=5, fill='red') +
      guides(color = "none")

    p3 <- prior_pc %>%
      mutate(
          x_jittered = 1 + rnorm(n(), mean = 0, sd = 0.01),
          y_jittered = cv + rnorm(n(), mean = 0, sd = sd(prior_pc$cv)/10)
      ) %>%
      ggplot(aes(x = x_jittered, y = cv)) +
      geom_boxplot(aes(x = factor(1), y = cv), width = 0.5, fill=anthrazit) +
      geom_point(alpha = 0.5, size = 0.1, position = position_identity()) +
      theme_grey(base_size = 16) +
      theme(
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_text(size = 18),
        plot.title = element_text(size = 20, hjust = 0.5)
      ) +
      labs(title = "cross-lineage variance", y = "") +
      geom_point(aes(x=1, y=empirical$cv, color="empirical"), size=5, fill='red') +
      labs(color = "") +
      guides(color = "none")

    return(grid.arrange(p1, p2, p3, ncol=3))
}
predictive_plot(prior_pc)

We see that the prior predictive distributions cover the empirical values quite well. Here is what happens after the model is fitted:

posterior_pc <- tibble(
    tv = apply(generated_quantities$y_posterior, 1, total_variance),
    lv = apply(
        generated_quantities$y_posterior, 1, function(x) lineage_wise_variance(x, variable)
    ),
    cv = apply(
        generated_quantities$y_posterior, 1, function(x) crossfamily_variance(x, variable)
    ),
)
predictive_plot(posterior_pc)

Obviously, something is seriously wrong with this model.

Setting up a phylogenetic model

Brownian motion

Next I will consider the phylogenetic Brownian motion model, combined with a logistic component. Here the basic assumptions are:

  • there is a continuous latent variable \(z\) which has a value at each node,
  • at the root nodes, \(z\) is drawn from a distribution such as \(\mathcal N(0, 1)\),
  • once the value \(z_i\) at a mother node \(i\) is fixed, the value at each daughter node is drawn from \(\mathcal N(z_i, \sigma \sqrt{t_j})\), where \(t_j\) is the length of the branch leading to the daughter, and
  • the observed values \(y_i\) at the tips are distributed as

\[ \begin{aligned} P(Y_i=1) &= \mathrm{logit}^{-1}(C_1-z_i)\\ P(Y_i=k) &= \mathrm{logit}^{-1}(C_k-z_i) - \mathrm{logit}^{-1}(C_{k-1}-z_i) \quad (1 < k < K)\\ P(Y_i=K) &= 1-\mathrm{logit}^{-1}(C_{K-1}-z_i) \end{aligned} \]

The parameter \(\sigma\) can be interpreted as the rate of evolution.
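
The generative process can be sketched by simulating the latent variable down a hypothetical three-node tree (one root, two tips), using the same standard-deviation scaling \(\sigma\sqrt{t}\) as in the model; all numeric values are made up.

```r
set.seed(1)

# Toy tree: node 1 is the root, nodes 2 and 3 are its daughters.
toy_edges        <- rbind(c(1, 2), c(1, 3))
toy_edge_lengths <- c(0.5, 2.0)
sigma      <- 1.5  # rate of evolution (hypothetical value)
sigma_root <- 1.0

z <- numeric(3)
z[1] <- rnorm(1, mean = 0, sd = sigma_root)  # draw the root value
for (e in 1:nrow(toy_edges)) {
    mother   <- toy_edges[e, 1]
    daughter <- toy_edges[e, 2]
    # daughter ~ N(z_mother, sigma * sqrt(branch length))
    z[daughter] <- rnorm(1, mean = z[mother],
                         sd = sigma * sqrt(toy_edge_lengths[e]))
}
z
```

The longer branch (length 2.0) allows larger deviations from the mother value than the shorter one, which is exactly how branch lengths modulate the rate of change.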

defined <- which(!is.na(d[[variable]]))
undefined <- which(is.na(d[[variable]]))
# map the d rows with missing data to their global node numbers, so that
# the membership test below compares node numbers rather than row indices
undefined_nodes <- match(pull(d[undefined, "glottocode"]), nodes$label)
po <- c()
for (i in get_postorder(roots, edges)) {
    if ((i %in% edges[,2]) & !(i %in% undefined_nodes)) {
        po <- c(po, i)
    }
}
edge_po <- match(po, edges[,2])
phylo_data = list()
phylo_data$Ntrees <- length(trees)
phylo_data$Nnodes <- nrow(nodes)
phylo_data$Y <- ordinal_data$Y
phylo_data$Nobservations <- length(phylo_data$Y)
phylo_data$Nedges <- length(edge_po)
phylo_data$observations <- match(pull(d[defined, "glottocode"]), nodes$label)
phylo_data$roots <- roots
phylo_data$edges <- edges[edge_po,]
phylo_data$edge_lengths <- edge_lengths[edge_po]
phylo_data$nthres <- length(unique(phylo_data$Y)) - 1
brown_code <- "
data {
    int<lower=1> Ntrees;
    int<lower=0> Nnodes;
    int<lower=1> Nobservations;
    int<lower=1> Nedges;
    int<lower=1,upper=Nnodes> observations[Nobservations]; // indices of the observed tips
    int<lower=1, upper=Nnodes> roots[Ntrees];
    int<lower=1, upper=Nnodes> edges[Nedges, 2];
    vector[Nedges] edge_lengths;
    int<lower=2> nthres;  // number of thresholds
    int Y[Nobservations];  // response variable
}

parameters {
    vector[Nnodes] z_normal;
    real mu;
    real<lower=0> sigma;
    real<lower=0> sigma_root;
    ordered[nthres] Intercept;  // temporary thresholds for centered predictors
}

model {
    vector[Nnodes] z = rep_vector(0, Nnodes);
    target += normal_lpdf(mu | 0, 1);
    target += exponential_lpdf(sigma | 1);
    target += exponential_lpdf(sigma_root | 1);
    target += normal_lpdf(z_normal | 0, 1);
    target += normal_lpdf(Intercept | 0, 5);
    
    for (i in 1:Ntrees) {
        int idx = roots[i];
        z[idx] = mu + z_normal[idx] * sigma_root;
    }
    
    for (i in 1:Nedges) {
        int j = Nedges - i + 1;
        int mother_node = edges[j, 1];
        int daughter_node = edges[j, 2];
        real local_mu = z[mother_node];
        real local_sigma = sigma * sqrt(edge_lengths[j]);
        z[daughter_node] = local_mu + z_normal[daughter_node] * local_sigma;
    }

    for (n in 1:Nobservations) {
        target += ordered_logistic_lpmf(Y[n] | z[observations[n]], Intercept);
    }    
}


generated quantities {
    vector[Nnodes] z = rep_vector(0, Nnodes);
    int y_prior[Nobservations]; // Prior predictive samples
    int y_posterior[Nobservations]; // Posterior predictive samples
    vector[Nnodes] z_prior = rep_vector(0, Nnodes);
    vector[Nnodes] z_posterior = rep_vector(0, Nnodes);
    real mu_prior;
    real<lower=0> sigma_prior;
    real<lower=0> sigma_root_prior;
    real log_lik[Nobservations];
    vector[nthres] Intercept_sample;
    ordered[nthres] Intercept_prior;

    for (i in 1:Ntrees) {
        int idx = roots[i];
        z[idx] = mu + z_normal[idx] * sigma_root;
    }
    
    for (i in 1:Nedges) {
        int j = Nedges - i + 1;
        int mother_node = edges[j, 1];
        int daughter_node = edges[j, 2];
        real local_mu = z[mother_node];
        real local_sigma = sigma * sqrt(edge_lengths[j]);
        z[daughter_node] = local_mu + z_normal[daughter_node] * local_sigma;
    }

    for (k in 1:nthres) {
        Intercept_sample[k] = normal_rng(0, 5);
    }
    Intercept_prior = sort_asc(Intercept_sample);

    mu_prior = normal_rng(0, 1); // matches the normal(0, 1) prior on mu in the model block
    sigma_prior = exponential_rng(1);
    sigma_root_prior = exponential_rng(1);

    for (i in 1:Ntrees) {
        z_prior[roots[i]] = normal_rng(mu_prior, sigma_root_prior);
    }

    for (i in 1:Ntrees) {
        int idx = roots[i];
        z_posterior[idx] = mu + z_normal[idx] * sigma_root;
    }
    
    for (i in 1:Nedges) {
        int j = Nedges - i + 1;
        int mother_node = edges[j, 1];
        int daughter_node = edges[j, 2];
        real local_mu = z_posterior[mother_node];
        real local_sigma = sigma * sqrt(edge_lengths[j]);
        z_prior[daughter_node] = normal_rng(
            z_prior[mother_node],
            sigma_prior * sqrt(edge_lengths[j])
        );

        z_posterior[daughter_node] = local_mu + z_normal[daughter_node] * local_sigma;
    }
    
    for (n in 1:Nobservations) {
        y_prior[n] = ordered_logistic_rng(z_prior[observations[n]], Intercept_prior);
        y_posterior[n] = ordered_logistic_rng(z[observations[n]], Intercept);
        log_lik[n] = ordered_logistic_lpmf(Y[n] | z[observations[n]], Intercept);
    }
}
"
model_brownian <- stan_model(model_code=brown_code)
fit_brownian <- rstan::sampling(
   model_brownian,
   data = phylo_data,
   chains = 4,
   iter=2000
)
print(fit_brownian, pars=c("mu", "sigma", "sigma_root", "Intercept", "lp__"))
Inference for Stan model: anon_model.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

                 mean se_mean    sd     2.5%      25%      50%      75%
mu              -1.46    0.01  0.28    -2.02    -1.64    -1.45    -1.26
sigma            1.90    0.01  0.27     1.38     1.72     1.89     2.07
sigma_root       1.90    0.01  0.27     1.42     1.72     1.90     2.07
Intercept[1]    -0.65    0.00  0.15    -0.95    -0.75    -0.65    -0.55
Intercept[2]     1.32    0.00  0.16     1.01     1.21     1.32     1.43
Intercept[3]     2.72    0.00  0.20     2.33     2.58     2.72     2.85
Intercept[4]     4.07    0.00  0.25     3.60     3.89     4.06     4.24
lp__         -4319.19    1.47 41.59 -4404.31 -4345.90 -4317.98 -4290.17
                97.5% n_eff Rhat
mu              -0.94  1581 1.00
sigma            2.43   743 1.01
sigma_root       2.47  1203 1.00
Intercept[1]    -0.34  3592 1.00
Intercept[2]     1.65  5223 1.00
Intercept[3]     3.13  3012 1.00
Intercept[4]     4.58  2774 1.00
lp__         -4239.83   800 1.00

Samples were drawn using NUTS(diag_e) at Sun Feb 11 16:04:29 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).
log_lik <- extract_log_lik(
    fit_brownian,
    parameter_name = "log_lik",
    merge_chains = FALSE
)
reff <- relative_eff(exp(log_lik))
(loo_brownian <- loo(log_lik, r_eff=reff))
Warning message:
“Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
”

Computed from 4000 by 1077 log-likelihood matrix

         Estimate   SE
elpd_loo  -1261.6 25.4
p_loo       209.2  6.9
looic      2523.2 50.8
------
Monte Carlo SE of elpd_loo is NA.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     963   89.4%   350       
 (0.5, 0.7]   (ok)       104    9.7%   174       
   (0.7, 1]   (bad)       10    0.9%   72        
   (1, Inf)   (very bad)   0    0.0%   <NA>      
See help('pareto-k-diagnostic') for details.
generated_quantities <- extract(fit_brownian)

Prior predictive check

prior_pc <- tibble(
    tv = apply(generated_quantities$y_prior, 1, total_variance),
    lv = apply(
        generated_quantities$y_prior, 1, function(x) lineage_wise_variance(x, variable)
    ),
    cv = apply(
        generated_quantities$y_prior, 1, function(x) crossfamily_variance(x, variable)
    ),
)

predictive_plot(prior_pc)

Posterior predictive check

posterior_pc <- tibble(
    tv = apply(generated_quantities$y_posterior, 1, total_variance),
    lv = apply(
        generated_quantities$y_posterior, 1, function(x) lineage_wise_variance(x, variable)
    ),
    cv = apply(
        generated_quantities$y_posterior, 1, function(x) crossfamily_variance(x, variable)
    ),
)


predictive_plot(posterior_pc)

Mixed model

Next I will fit a model with two random effects, one phylogenetic and one uncorrelated. The variable \(z\) is as in the Brownian motion model. Additionally, there is a random effect \(\epsilon\), drawn from a normal distribution with mean 0 and standard deviation \(\sigma_\epsilon\).

\[ \begin{aligned} \sigma_\epsilon^2 &\sim \text{Inverse-Gamma}(2, 0.1)\\ \epsilon_i &\sim \mathcal N(0, \sigma_\epsilon)\\ P(Y_i=1) &= \mathrm{logit}^{-1}(C_1-z_i-\epsilon_i)\\ P(Y_i=k) &= \mathrm{logit}^{-1}(C_k-z_i-\epsilon_i) - \mathrm{logit}^{-1}(C_{k-1}-z_i-\epsilon_i) \quad (1 < k < K)\\ P(Y_i=K) &= 1-\mathrm{logit}^{-1}(C_{K-1}-z_i-\epsilon_i) \end{aligned} \]
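
A note on the inverse-gamma prior: base R has no inverse-gamma functions, but if \(X \sim \text{Gamma}(\text{shape}=a, \text{rate}=b)\), then \(1/X \sim \text{Inverse-Gamma}(a, b)\), so both density and draws follow from the gamma functions. The helper names below are illustrative, not from any package.

```r
# Density via the change of variables y = 1/x: f_Y(y) = f_X(1/y) / y^2.
dinvgamma <- function(y, a, b) dgamma(1 / y, shape = a, rate = b) / y^2
# Draws: the reciprocal of a Gamma(a, rate = b) draw is InvGamma(a, b).
rinvgamma <- function(n, a, b) 1 / rgamma(n, shape = a, rate = b)

# Check against the closed form b^a / Gamma(a) * y^(-a-1) * exp(-b/y)
# at the prior used in the model, InvGamma(2, 0.1):
y <- 0.5; a <- 2; b <- 0.1
closed_form <- b^a / gamma(a) * y^(-a - 1) * exp(-b / y)
all.equal(dinvgamma(y, a, b), closed_form)  # TRUE
```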

mixed_code <- "
data {
    int<lower=1> Ntrees;
    int<lower=0> Nnodes;
    int<lower=1> Nobservations;
    int<lower=1> Nedges;
    int<lower=1,upper=Nnodes> observations[Nobservations]; // indices of the observed tips
    int<lower=1, upper=Nnodes> roots[Ntrees];
    int<lower=1, upper=Nnodes> edges[Nedges, 2];
    vector[Nedges] edge_lengths;
    int<lower=2> nthres;  // number of thresholds
    int Y[Nobservations];  // response variable
}

parameters {
    vector[Nnodes] z_normal;
    real mu;
    real<lower=0> sigma;
    real<lower=0> sigma_root;
    ordered[nthres] Intercept;  // temporary thresholds for centered predictors
    real<lower=0> sigma_error;
    vector[Nobservations] z_error;
}

model {
    vector[Nnodes] z = rep_vector(0, Nnodes);
    target += normal_lpdf(mu | 0, 10);
    target += exponential_lpdf(sigma | 1);
    target += exponential_lpdf(sigma_root | 1);
    target += normal_lpdf(z_normal | 0, 1);
    target += normal_lpdf(Intercept | 0, 5);
    target += inv_gamma_lpdf(square(sigma_error) | 2, 0.1);

    target += normal_lpdf(z_error | 0, sigma_error);

    for (i in 1:Ntrees) {
        int idx = roots[i];
        z[idx] = mu + z_normal[idx] * sigma_root;
    }
    
    for (i in 1:Nedges) {
        int j = Nedges - i + 1;
        int mother_node = edges[j, 1];
        int daughter_node = edges[j, 2];
        real local_mu = z[mother_node];
        real local_sigma = sigma * sqrt(edge_lengths[j]);
        z[daughter_node] = local_mu + z_normal[daughter_node] * local_sigma;
    }

    for (n in 1:Nobservations) {
        target += ordered_logistic_lpmf(Y[n] | z[observations[n]] + z_error[n], Intercept);
    }
    
}

generated quantities {
    vector[Nnodes] z = rep_vector(0, Nnodes);
    int y_prior[Nobservations]; // Prior predictive samples
    int y_posterior[Nobservations]; // Posterior predictive samples
    vector[Nnodes] z_prior = rep_vector(0, Nnodes);
    vector[Nnodes] z_posterior = rep_vector(0, Nnodes);
    real mu_prior;
    real<lower=0> sigma_prior;
    real<lower=0> sigma_root_prior;
    real log_lik[Nobservations];
    vector[nthres] Intercept_sample;
    ordered[nthres] Intercept_prior;
    vector[Nobservations] z_error_prior;


    real sigma_error_sq = inv_gamma_rng(2, 0.1); // matches the inv_gamma prior on sigma_error^2 in the model block
    real sigma_error_prior = sqrt(sigma_error_sq);

    for (i in 1:Ntrees) {
        int idx = roots[i];
        z[idx] = mu + z_normal[idx] * sigma_root;
    }

    for (n in 1:Nobservations) {
        z_error_prior[n] = normal_rng(0, sigma_error_prior);
    }
    
    for (i in 1:Nedges) {
        int j = Nedges - i + 1;
        int mother_node = edges[j, 1];
        int daughter_node = edges[j, 2];
        real local_mu = z[mother_node];
        real local_sigma = sigma * sqrt(edge_lengths[j]);
        z[daughter_node] = local_mu + z_normal[daughter_node] * local_sigma;
    }

    for (k in 1:nthres) {
        Intercept_sample[k] = normal_rng(0, 5);
    }
    Intercept_prior = sort_asc(Intercept_sample);

    mu_prior = normal_rng(0, 10);
    sigma_prior = exponential_rng(1);
    sigma_root_prior = exponential_rng(1);

    for (i in 1:Ntrees) {
        z_prior[roots[i]] = normal_rng(mu_prior, sigma_root_prior);
    }

    for (i in 1:Ntrees) {
        int idx = roots[i];
        z_posterior[idx] = mu + z_normal[idx] * sigma_root;
    }
    
    for (i in 1:Nedges) {
        int j = Nedges - i + 1;
        int mother_node = edges[j, 1];
        int daughter_node = edges[j, 2];
        real local_mu = z_posterior[mother_node];
        real local_sigma = sigma * sqrt(edge_lengths[j]);
        z_prior[daughter_node] = normal_rng(
            z_prior[mother_node],
            sigma_prior * sqrt(edge_lengths[j])
        );

        z_posterior[daughter_node] = local_mu + z_normal[daughter_node] * local_sigma;
    }
    
    for (n in 1:Nobservations) {
        y_prior[n] = ordered_logistic_rng(z_prior[observations[n]] + z_error_prior[n], Intercept_prior);
        y_posterior[n] = ordered_logistic_rng(z[observations[n]] + z_error[n], Intercept);
        log_lik[n] = ordered_logistic_lpmf(Y[n] | z[observations[n]] + z_error[n], Intercept);
    }
}

"
model_mixed <- stan_model(model_code=mixed_code)
## fit_mixed1 <- readRDS("models/fit_mixed1.rds")
fit_mixed1 <- rstan::sampling(
   model_mixed,
   data = phylo_data,
   chains = 4,
   iter=40000,
   thin=20,
   control=list(max_treedepth=15)
)
Warning message:
“There were 4 chains where the estimated Bayesian Fraction of Missing Information was low. See
https://mc-stan.org/misc/warnings.html#bfmi-low”
Warning message:
“Examine the pairs() plot to diagnose sampling problems
”
Warning message:
“Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
Running the chains for more iterations may help. See
https://mc-stan.org/misc/warnings.html#bulk-ess”
Warning message:
“Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
Running the chains for more iterations may help. See
https://mc-stan.org/misc/warnings.html#tail-ess”
saveRDS(fit_mixed1, "models/fit_mixed1.rds")
print(fit_mixed1, pars=c("mu", "sigma", "sigma_root", "Intercept", "sigma_error", "lp__"))
Inference for Stan model: anon_model.
4 chains, each with iter=40000; warmup=20000; thin=20; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

                 mean se_mean     sd     2.5%      25%      50%      75%
mu              -1.61    0.01   0.31    -2.25    -1.81    -1.60    -1.39
sigma            1.93    0.00   0.28     1.42     1.73     1.93     2.12
sigma_root       1.98    0.00   0.28     1.48     1.78     1.96     2.16
Intercept[1]    -0.70    0.00   0.16    -1.01    -0.80    -0.70    -0.59
Intercept[2]     1.31    0.00   0.17     0.99     1.19     1.31     1.42
Intercept[3]     2.72    0.00   0.21     2.34     2.58     2.72     2.86
Intercept[4]     4.08    0.00   0.26     3.59     3.90     4.08     4.25
sigma_error      0.22    0.01   0.09     0.13     0.17     0.20     0.25
lp__         -4158.44   22.95 340.75 -4931.29 -4346.85 -4120.79 -3926.49
                97.5% n_eff Rhat
mu              -1.03  3705 1.00
sigma            2.51  3930 1.00
sigma_root       2.59  3823 1.00
Intercept[1]    -0.38  3815 1.00
Intercept[2]     1.65  3938 1.00
Intercept[3]     3.13  3395 1.00
Intercept[4]     4.59  3134 1.00
sigma_error      0.43   188 1.01
lp__         -3608.28   220 1.01

Samples were drawn using NUTS(diag_e) at Sun Feb 11 17:10:02 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).
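
The `Intercept` parameters are the cutpoints of the ordered-logistic likelihood, so the fitted values can be turned into implied category probabilities via differences of logistic CDFs. A minimal base-R sketch (the cutpoints below are rounded posterior means from the table above, used purely for illustration):

```r
# Category probabilities under Stan's ordered_logistic(eta, c),
# with implicit c_0 = -Inf and c_K = +Inf.
ordered_logistic_probs <- function(eta, cutpoints) {
    plogis(eta - c(-Inf, cutpoints)) - plogis(eta - c(cutpoints, Inf))
}

# Rounded posterior means of Intercept for this variable, at eta = 0:
p <- ordered_logistic_probs(eta = 0, cutpoints = c(-0.70, 1.31, 2.72, 4.08))
round(p, 3)  # probabilities of the 5 ordinal categories; sums to 1
```

The differences telescope, so the probabilities always sum to one regardless of `eta`.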
log_lik <- extract_log_lik(
    fit_mixed1,
    parameter_name = "log_lik",
    merge_chains = FALSE
)
reff <- relative_eff(exp(log_lik))
(loo_mixed <- loo(log_lik, r_eff=reff))
Warning message:
“Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
”

Computed from 4000 by 1077 log-likelihood matrix

         Estimate   SE
elpd_loo  -1260.7 25.4
p_loo       222.6  7.2
looic      2521.3 50.9
------
Monte Carlo SE of elpd_loo is NA.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     952   88.4%   423       
 (0.5, 0.7]   (ok)       117   10.9%   239       
   (0.7, 1]   (bad)        8    0.7%   49        
   (1, Inf)   (very bad)   0    0.0%   <NA>      
See help('pareto-k-diagnostic') for details.
generated_quantities <- extract(fit_mixed1)

prior predictive check

prior_pc <- tibble(
    tv = apply(generated_quantities$y_prior, 1, total_variance),
    lv = apply(
        generated_quantities$y_prior, 1, function(x) lineage_wise_variance(x, variable)
    ),
    cv = apply(
        generated_quantities$y_prior, 1, function(x) crossfamily_variance(x, variable)
    ),
)

predictive_plot(prior_pc)

posterior predictive check

posterior_pc <- tibble(
    tv = apply(generated_quantities$y_posterior, 1, total_variance),
    lv = apply(
        generated_quantities$y_posterior, 1, function(x) lineage_wise_variance(x, variable)
    ),
    cv = apply(
        generated_quantities$y_posterior, 1, function(x) crossfamily_variance(x, variable)
    ),
)

predictive_plot(posterior_pc)
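
Each row of `y_posterior` is one posterior predictive replication of the full data vector, so `apply(..., 1, f)` produces one value of the test statistic per draw; the tibbles above collect these into reference distributions for the predictive plots. A self-contained toy of the pattern:

```r
# Toy: 20 posterior predictive draws (rows) of 10 ordinal observations each.
set.seed(42)
yrep <- matrix(sample(1:5, 20 * 10, replace = TRUE), nrow = 20)

# One test statistic (here: the sample variance) per posterior draw.
tv <- apply(yrep, 1, var)
length(tv)  # 20: a reference distribution with one value per draw
```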

Let us summarize the model comparisons:

The mixed and Brownian models are virtually tied, and both decisively outperform the purely ordinal model.

model comparison

tibble(model = c("ordinal", "brownian", "mixed"),
       looic = c(
            loo_ordinal$estimates[3,1],
            loo_brownian$estimates[3,1],
            loo_mixed$estimates[3,1]
       )
) %>%
mutate(delta_looic = round(looic-min(looic), 3)) %>%
arrange(delta_looic)
A tibble: 3 × 3
model looic delta_looic
<chr> <dbl> <dbl>
mixed 2521.317 0.000
brownian 2523.228 1.911
ordinal 2857.780 336.463

We can conclude that phylogenetic control massively improves the fit to the data.
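
The ΔLOOIC column can also be read as relative support: an Akaike-style transformation turns the differences into rough model weights. This ignores the standard errors of the elpd differences (which `loo::loo_compare()` would report), so treat it only as an informal summary. A sketch with the values from the table:

```r
looic <- c(mixed = 2521.3, brownian = 2523.2, ordinal = 2857.8)
delta <- looic - min(looic)

# Pseudo-BMA-style weights (no SE adjustment, purely illustrative):
w <- exp(-0.5 * delta) / sum(exp(-0.5 * delta))
round(w, 3)
```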

Variable 2: hierarchy_within

variable <- "hierarchy_within"
Y <- d %>%
    select(all_of(variable)) %>%
    na.omit %>%
    pull
Y <- match(Y, sort(unique(Y)))
ordinal_data <- list()
ordinal_data$Ndata <- length(Y)
ordinal_data$nthres <- max(Y) -1
ordinal_data$Y <- Y
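
`match(Y, sort(unique(Y)))` maps the raw ordinal codes, which may contain gaps, onto consecutive integers 1..K, as the ordered-logistic likelihood requires. A toy illustration:

```r
# Raw codes with gaps (1, 3, 7) become consecutive ranks (1, 2, 3):
raw <- c(3, 1, 7, 3, 7)
recoded <- match(raw, sort(unique(raw)))
recoded  # 2 1 3 2 3
```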
fit_ordinal <- rstan::sampling(
    model_ordinal,
    data=ordinal_data,
    chains=4,
    iter=2000
)
print(fit_ordinal, pars=c("Intercept", "lp__"))
Inference for Stan model: anon_model.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

                 mean se_mean   sd     2.5%     25%      50%      75%    97.5%
Intercept[1]    -0.86    0.00 0.07    -0.99    -0.9    -0.85    -0.81    -0.72
Intercept[2]     1.86    0.00 0.09     1.69     1.8     1.86     1.92     2.04
lp__         -1043.10    0.03 1.04 -1045.95 -1043.5 -1042.78 -1042.36 -1042.08
             n_eff Rhat
Intercept[1]  1730    1
Intercept[2]  3359    1
lp__          1529    1

Samples were drawn using NUTS(diag_e) at Sun Feb 11 17:15:35 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).
(loo_ordinal <- loo(fit_ordinal))

Computed from 4000 by 1089 log-likelihood matrix

         Estimate   SE
elpd_loo  -1040.0 16.7
p_loo         2.1  0.1
looic      2080.0 33.3
------
Monte Carlo SE of elpd_loo is 0.0.

All Pareto k estimates are good (k < 0.5).
See help('pareto-k-diagnostic') for details.
# Extract the generated quantities
generated_quantities <- extract(fit_ordinal)
Y <- ordinal_data$Y
empirical <- tibble(
    tv = total_variance(Y),
    lv = lineage_wise_variance(Y, variable),
    cv = crossfamily_variance(Y, variable)
)
prior_pc <- tibble(
    tv = apply(generated_quantities$y_prior, 1, total_variance),
    lv = apply(
        generated_quantities$y_prior, 1, function(x) lineage_wise_variance(x, variable)
    ),
    cv = apply(
        generated_quantities$y_prior, 1, function(x) crossfamily_variance(x, variable)
    ),
)
predictive_plot(prior_pc)

posterior_pc <- tibble(
    tv = apply(generated_quantities$y_posterior, 1, total_variance),
    lv = apply(
        generated_quantities$y_posterior, 1, function(x) lineage_wise_variance(x, variable)
    ),
    cv = apply(
        generated_quantities$y_posterior, 1, function(x) crossfamily_variance(x, variable)
    ),
)
predictive_plot(posterior_pc)

defined <- which(!is.na(d[[variable]]))
undefined <- which(is.na(d[[variable]]))
po <- c()
for (i in get_postorder(roots, edges)) {
    if ((i %in% edges[,2]) && !(i %in% undefined)) {
        po <- c(po, i)
    }
}
edge_po <- match(po, edges[,2])
phylo_data <- list()
phylo_data$Ntrees <- length(trees)
phylo_data$Nnodes <- nrow(nodes)
phylo_data$Y <- ordinal_data$Y
phylo_data$Nobservations <- length(phylo_data$Y)
phylo_data$Nedges <- length(edge_po)
phylo_data$observations <- match(pull(d[defined, "glottocode"]), nodes$label)
phylo_data$roots <- roots
phylo_data$edges <- edges[edge_po,]
phylo_data$edge_lengths <- edge_lengths[edge_po]
phylo_data$nthres <- length(unique(phylo_data$Y)) - 1
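
In the Brownian model, the latent trait of a daughter node is drawn from a normal distribution centred on its mother, with standard deviation `sigma * sqrt(edge_length)` (mirroring the `normal_rng(z_prior[mother_node], sigma_prior * sqrt(edge_lengths[j]))` call shown earlier). A toy simulation confirming the variance scaling:

```r
set.seed(1)
sigma <- 2
t_edge <- 0.5
z_mother <- 0

# Daughter values diffuse away from the mother; variance grows linearly
# with edge length: Var(z_daughter) = sigma^2 * t_edge.
z_daughter <- rnorm(1e5, mean = z_mother, sd = sigma * sqrt(t_edge))
var(z_daughter)  # close to sigma^2 * t_edge = 2
```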
fit_brownian <- rstan::sampling(
   model_brownian,
   data = phylo_data,
   chains = 4,
   iter=2000
)
print(fit_brownian, pars=c("mu", "sigma", "sigma_root", "Intercept", "lp__"))
Inference for Stan model: anon_model.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

                 mean se_mean    sd     2.5%      25%      50%      75%
mu               0.38    0.00  0.22    -0.05     0.24     0.38     0.53
sigma            1.95    0.01  0.29     1.42     1.74     1.94     2.13
sigma_root       1.14    0.01  0.21     0.75     1.00     1.13     1.28
Intercept[1]    -0.63    0.00  0.15    -0.93    -0.73    -0.63    -0.53
Intercept[2]     3.15    0.01  0.23     2.69     2.99     3.14     3.30
lp__         -3996.47    1.49 41.60 -4077.98 -4024.97 -3997.11 -3969.64
                97.5% n_eff Rhat
mu               0.82  2841    1
sigma            2.57   755    1
sigma_root       1.57   875    1
Intercept[1]    -0.33  3752    1
Intercept[2]     3.61  1494    1
lp__         -3911.79   781    1

Samples were drawn using NUTS(diag_e) at Sun Feb 11 17:20:01 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).
(loo_brownian <- loo(fit_brownian))
Warning message:
“Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
”

Computed from 4000 by 1089 log-likelihood matrix

         Estimate   SE
elpd_loo   -937.2 18.3
p_loo       201.7  6.7
looic      1874.4 36.6
------
Monte Carlo SE of elpd_loo is NA.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     1023  93.9%   848       
 (0.5, 0.7]   (ok)         63   5.8%   297       
   (0.7, 1]   (bad)         3   0.3%   92        
   (1, Inf)   (very bad)    0   0.0%   <NA>      
See help('pareto-k-diagnostic') for details.
# Extract the generated quantities
generated_quantities <- extract(fit_brownian)
Y <- ordinal_data$Y
empirical <- tibble(
    tv = total_variance(Y),
    lv = lineage_wise_variance(Y, variable),
    cv = crossfamily_variance(Y, variable)
)
prior_pc <- tibble(
    tv = apply(generated_quantities$y_prior, 1, total_variance),
    lv = apply(
        generated_quantities$y_prior, 1, function(x) lineage_wise_variance(x, variable)
    ),
    cv = apply(
        generated_quantities$y_prior, 1, function(x) crossfamily_variance(x, variable)
    ),
)
predictive_plot(prior_pc)

posterior_pc <- tibble(
  tv = apply(generated_quantities$y_posterior, 1, total_variance),
  lv = apply(
    generated_quantities$y_posterior, 1, function(x) lineage_wise_variance(x, variable)
  ),
  cv = apply(
    generated_quantities$y_posterior, 1, function(x) crossfamily_variance(x, variable)
  ),
)
predictive_plot(posterior_pc)

# fit_mixed2 <- readRDS("models/fit_mixed2.rds")
fit_mixed2 <- rstan::sampling(
   model_mixed,
   data = phylo_data,
   chains = 4,
   iter=40000,
   thin=20,
   control=list(max_treedepth=15)
)
Warning message:
“There were 4 chains where the estimated Bayesian Fraction of Missing Information was low. See
https://mc-stan.org/misc/warnings.html#bfmi-low”
Warning message:
“Examine the pairs() plot to diagnose sampling problems
”
Warning message:
“Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
Running the chains for more iterations may help. See
https://mc-stan.org/misc/warnings.html#bulk-ess”
Warning message:
“Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
Running the chains for more iterations may help. See
https://mc-stan.org/misc/warnings.html#tail-ess”
saveRDS(fit_mixed2, "models/fit_mixed2.rds")
print(fit_mixed2, pars=c("mu", "sigma", "sigma_root", "Intercept", "sigma_error", "lp__"))
Inference for Stan model: anon_model.
4 chains, each with iter=40000; warmup=20000; thin=20; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

                 mean se_mean     sd     2.5%      25%      50%      75%
mu               0.41    0.00   0.23    -0.03     0.25     0.41     0.56
sigma            1.96    0.00   0.28     1.45     1.77     1.96     2.14
sigma_root       1.16    0.00   0.21     0.77     1.01     1.15     1.30
Intercept[1]    -0.63    0.00   0.15    -0.93    -0.74    -0.63    -0.53
Intercept[2]     3.18    0.00   0.23     2.75     3.02     3.18     3.33
sigma_error      0.23    0.01   0.08     0.13     0.18     0.22     0.27
lp__         -3900.68   25.61 344.83 -4637.51 -4108.79 -3876.92 -3654.71
                97.5% n_eff Rhat
mu               0.85  4120 1.00
sigma            2.55  3891 1.00
sigma_root       1.60  4123 1.00
Intercept[1]    -0.33  4113 1.00
Intercept[2]     3.67  4076 1.00
sigma_error      0.44   181 1.03
lp__         -3311.09   181 1.03

Samples were drawn using NUTS(diag_e) at Sun Feb 11 18:23:05 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).
(loo_mixed <- loo(fit_mixed2))
Warning message:
“Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
”

Computed from 4000 by 1089 log-likelihood matrix

         Estimate   SE
elpd_loo   -936.6 18.3
p_loo       211.7  6.9
looic      1873.3 36.6
------
Monte Carlo SE of elpd_loo is NA.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     1027  94.3%   383       
 (0.5, 0.7]   (ok)         59   5.4%   220       
   (0.7, 1]   (bad)         3   0.3%   194       
   (1, Inf)   (very bad)    0   0.0%   <NA>      
See help('pareto-k-diagnostic') for details.
# Extract the generated quantities
generated_quantities <- extract(fit_mixed2)
Y <- ordinal_data$Y
empirical <- tibble(
    tv = total_variance(Y),
    lv = lineage_wise_variance(Y, variable),
    cv = crossfamily_variance(Y, variable)
)
prior_pc <- tibble(
    tv = apply(generated_quantities$y_prior, 1, total_variance),
    lv = apply(
        generated_quantities$y_prior, 1, function(x) lineage_wise_variance(x, variable)
    ),
    cv = apply(
        generated_quantities$y_prior, 1, function(x) crossfamily_variance(x, variable)
    ),
)
predictive_plot(prior_pc)

posterior_pc <- tibble(
    tv = apply(generated_quantities$y_posterior, 1, total_variance),
    lv = apply(
        generated_quantities$y_posterior, 1, function(x) lineage_wise_variance(x, variable)
    ),
    cv = apply(
        generated_quantities$y_posterior, 1, function(x) crossfamily_variance(x, variable)
    ),
)
predictive_plot(posterior_pc)

model comparison

tibble(model = c("ordinal", "brownian", "mixed"),
       looic = c(
            loo_ordinal$estimates[3,1],
            loo_brownian$estimates[3,1],
            loo_mixed$estimates[3,1]
       )
) %>%
mutate(delta_looic = round(looic-min(looic), 3)) %>%
arrange(delta_looic)
A tibble: 3 × 3
model looic delta_looic
<chr> <dbl> <dbl>
mixed 1873.273 0.000
brownian 1874.419 1.147
ordinal 2080.010 206.737

Again, the two phylogenetic models are essentially tied, and both clearly outperform the non-phylogenetic ordinal model.

Variable 3: domestic_organisation

variable <- "domestic_organisation"
Y <- d %>%
    select(all_of(variable)) %>%
    na.omit %>%
    pull
Y <- match(Y, sort(unique(Y)))
ordinal_data <- list()
ordinal_data$Ndata <- length(Y)
ordinal_data$nthres <- max(Y) -1
ordinal_data$Y <- Y
fit_ordinal <- rstan::sampling(
    model_ordinal,
    data=ordinal_data,
    chains=4,
    iter=2000
)
print(fit_ordinal, pars=c("Intercept", "lp__"))
Inference for Stan model: anon_model.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

                 mean se_mean   sd     2.5%      25%      50%      75%    97.5%
Intercept[1]    -0.93    0.00 0.06    -1.05    -0.97    -0.93    -0.88    -0.80
Intercept[2]     0.05    0.00 0.06    -0.06     0.01     0.05     0.09     0.17
Intercept[3]     1.45    0.00 0.07     1.31     1.40     1.45     1.50     1.60
lp__         -1631.06    0.03 1.23 -1634.26 -1631.63 -1630.73 -1630.16 -1629.66
             n_eff Rhat
Intercept[1]  2747    1
Intercept[2]  3929    1
Intercept[3]  3925    1
lp__          1634    1

Samples were drawn using NUTS(diag_e) at Sun Feb 11 18:28:43 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).
(loo_ordinal <- loo(fit_ordinal))

Computed from 4000 by 1183 log-likelihood matrix

         Estimate   SE
elpd_loo  -1625.3  5.9
p_loo         3.1  0.0
looic      3250.5 11.8
------
Monte Carlo SE of elpd_loo is 0.0.

All Pareto k estimates are good (k < 0.5).
See help('pareto-k-diagnostic') for details.
# Extract the generated quantities
generated_quantities <- extract(fit_ordinal)
Y <- ordinal_data$Y
empirical <- tibble(
    tv = total_variance(Y),
    lv = lineage_wise_variance(Y, variable),
    cv = crossfamily_variance(Y, variable)
)
prior_pc <- tibble(
    tv = apply(generated_quantities$y_prior, 1, total_variance),
    lv = apply(
        generated_quantities$y_prior, 1, function(x) lineage_wise_variance(x, variable)
    ),
    cv = apply(
        generated_quantities$y_prior, 1, function(x) crossfamily_variance(x, variable)
    ),
)
predictive_plot(prior_pc)

posterior_pc <- tibble(
    tv = apply(generated_quantities$y_posterior, 1, total_variance),
    lv = apply(
        generated_quantities$y_posterior, 1, function(x) lineage_wise_variance(x, variable)
    ),
    cv = apply(
        generated_quantities$y_posterior, 1, function(x) crossfamily_variance(x, variable)
    ),
)
predictive_plot(posterior_pc)

defined <- which(!is.na(d[[variable]]))
undefined <- which(is.na(d[[variable]]))
po <- c()
for (i in get_postorder(roots, edges)) {
    if ((i %in% edges[,2]) && !(i %in% undefined)) {
        po <- c(po, i)
    }
}
edge_po <- match(po, edges[,2])
phylo_data <- list()
phylo_data$Ntrees <- length(trees)
phylo_data$Nnodes <- nrow(nodes)
phylo_data$Y <- ordinal_data$Y
phylo_data$Nobservations <- length(phylo_data$Y)
phylo_data$Nedges <- length(edge_po)
phylo_data$observations <- match(pull(d[defined, "glottocode"]), nodes$label)
phylo_data$roots <- roots
phylo_data$edges <- edges[edge_po,]
phylo_data$edge_lengths <- edge_lengths[edge_po]
phylo_data$nthres <- length(unique(phylo_data$Y)) - 1
fit_brownian <- rstan::sampling(
   model_brownian,
   data = phylo_data,
   chains = 4,
   iter=2000
)
print(fit_brownian, pars=c("mu", "sigma", "sigma_root", "Intercept", "lp__"))
Inference for Stan model: anon_model.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

                 mean se_mean    sd     2.5%      25%      50%      75%
mu              -0.17    0.01  0.30    -0.74    -0.37    -0.17     0.03
sigma            1.54    0.01  0.22     1.12     1.38     1.53     1.69
sigma_root       0.97    0.01  0.18     0.65     0.84     0.96     1.09
Intercept[1]    -1.23    0.01  0.28    -1.78    -1.41    -1.23    -1.04
Intercept[2]     0.01    0.01  0.28    -0.54    -0.17     0.01     0.20
Intercept[3]     1.79    0.01  0.29     1.20     1.60     1.79     1.98
lp__         -4598.51    1.46 42.43 -4680.89 -4626.97 -4599.23 -4569.67
                97.5% n_eff Rhat
mu               0.40  2039 1.00
sigma            2.01   649 1.01
sigma_root       1.37   787 1.01
Intercept[1]    -0.68  2109 1.00
Intercept[2]     0.53  2203 1.00
Intercept[3]     2.36  1908 1.00
lp__         -4513.59   844 1.01

Samples were drawn using NUTS(diag_e) at Sun Feb 11 18:33:54 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).
(loo_brownian <- loo(fit_brownian))
Warning message:
“Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
”

Computed from 4000 by 1183 log-likelihood matrix

         Estimate   SE
elpd_loo  -1539.0 13.8
p_loo       211.4  5.0
looic      3077.9 27.5
------
Monte Carlo SE of elpd_loo is NA.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     1156  97.7%   671       
 (0.5, 0.7]   (ok)         25   2.1%   253       
   (0.7, 1]   (bad)         2   0.2%   233       
   (1, Inf)   (very bad)    0   0.0%   <NA>      
See help('pareto-k-diagnostic') for details.
# Extract the generated quantities
generated_quantities <- extract(fit_brownian)
Y <- ordinal_data$Y
empirical <- tibble(
    tv = total_variance(Y),
    lv = lineage_wise_variance(Y, variable),
    cv = crossfamily_variance(Y, variable)
)
prior_pc <- tibble(
    tv = apply(generated_quantities$y_prior, 1, total_variance),
    lv = apply(
        generated_quantities$y_prior, 1, function(x) lineage_wise_variance(x, variable)
    ),
    cv = apply(
        generated_quantities$y_prior, 1, function(x) crossfamily_variance(x, variable)
    ),
)
predictive_plot(prior_pc)

posterior_pc <- tibble(
    tv = apply(generated_quantities$y_posterior, 1, total_variance),
    lv = apply(
        generated_quantities$y_posterior, 1, function(x) lineage_wise_variance(x, variable)
    ),
    cv = apply(
        generated_quantities$y_posterior, 1, function(x) crossfamily_variance(x, variable)
    ),
)
predictive_plot(posterior_pc)

#fit_mixed3 <- readRDS("models/fit_mixed3.rds")
fit_mixed3 <- rstan::sampling(
   model_mixed,
   data = phylo_data,
   chains = 4,
   iter = 40000,
   thin = 20,
   control=list(max_treedepth=15)
)
Warning message:
“There were 4 chains where the estimated Bayesian Fraction of Missing Information was low. See
https://mc-stan.org/misc/warnings.html#bfmi-low”
Warning message:
“Examine the pairs() plot to diagnose sampling problems
”
Warning message:
“Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
Running the chains for more iterations may help. See
https://mc-stan.org/misc/warnings.html#bulk-ess”
Warning message:
“Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
Running the chains for more iterations may help. See
https://mc-stan.org/misc/warnings.html#tail-ess”
saveRDS(fit_mixed3, "models/fit_mixed3.rds")
print(fit_mixed3, pars=c("mu", "sigma", "sigma_root", "Intercept", "sigma_error", "lp__"))
Inference for Stan model: anon_model.
4 chains, each with iter=40000; warmup=20000; thin=20; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

                 mean se_mean     sd     2.5%      25%      50%      75%
mu              -0.19    0.01   0.32    -0.83    -0.40    -0.20     0.02
sigma            1.54    0.00   0.23     1.12     1.38     1.53     1.68
sigma_root       0.99    0.00   0.18     0.66     0.86     0.98     1.11
Intercept[1]    -1.26    0.00   0.29    -1.85    -1.45    -1.26    -1.07
Intercept[2]     0.00    0.00   0.29    -0.59    -0.19    -0.01     0.19
Intercept[3]     1.80    0.01   0.31     1.20     1.59     1.79     2.01
sigma_error      0.26    0.01   0.11     0.13     0.18     0.23     0.30
lp__         -4572.68   34.85 442.38 -5545.25 -4847.57 -4534.39 -4243.60
                97.5% n_eff Rhat
mu               0.43  3847 1.00
sigma            2.01  3835 1.00
sigma_root       1.37  3165 1.00
Intercept[1]    -0.70  3648 1.00
Intercept[2]     0.56  3828 1.00
Intercept[3]     2.41  3630 1.00
sigma_error      0.55   151 1.03
lp__         -3833.28   161 1.03

Samples were drawn using NUTS(diag_e) at Sun Feb 11 20:09:05 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).
(loo_mixed <- loo(fit_mixed3))
Warning message:
“Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
”

Computed from 4000 by 1183 log-likelihood matrix

         Estimate   SE
elpd_loo  -1539.3 13.8
p_loo       228.6  5.3
looic      3078.6 27.5
------
Monte Carlo SE of elpd_loo is NA.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     1151  97.3%   293       
 (0.5, 0.7]   (ok)         30   2.5%   229       
   (0.7, 1]   (bad)         2   0.2%   205       
   (1, Inf)   (very bad)    0   0.0%   <NA>      
See help('pareto-k-diagnostic') for details.
# Extract the generated quantities
generated_quantities <- extract(fit_mixed3)
Y <- ordinal_data$Y
empirical <- tibble(
    tv = total_variance(Y),
    lv = lineage_wise_variance(Y, variable),
    cv = crossfamily_variance(Y, variable)
)
prior_pc <- tibble(
    tv = apply(generated_quantities$y_prior, 1, total_variance),
    lv = apply(
        generated_quantities$y_prior, 1, function(x) lineage_wise_variance(x, variable)
    ),
    cv = apply(
        generated_quantities$y_prior, 1, function(x) crossfamily_variance(x, variable)
    ),
)
predictive_plot(prior_pc)

posterior_pc <- tibble(
    tv = apply(generated_quantities$y_posterior, 1, total_variance),
    lv = apply(
        generated_quantities$y_posterior, 1, function(x) lineage_wise_variance(x, variable)
    ),
    cv = apply(
        generated_quantities$y_posterior, 1, function(x) crossfamily_variance(x, variable)
    ),
)
predictive_plot(posterior_pc)

model comparison

tibble(model = c("ordinal", "brownian", "mixed"),
       looic = c(
            loo_ordinal$estimates[3,1],
            loo_brownian$estimates[3,1],
            loo_mixed$estimates[3,1]
       )
) %>%
mutate(delta_looic = round(looic-min(looic), 3)) %>%
arrange(delta_looic)
A tibble: 3 × 3
model looic delta_looic
<chr> <dbl> <dbl>
brownian 3077.910 0.000
mixed 3078.557 0.647
ordinal 3250.511 172.601

Here the Brownian model comes out nominally best, but its difference from the mixed model is negligible; the phylogenetic models again clearly dominate the ordinal baseline.

Variable 4: agricultureLevel

variable <- "agricultureLevel"
Y <- d %>%
    select(all_of(variable)) %>%
    na.omit %>%
    pull
Y <- match(Y, sort(unique(Y)))
ordinal_data <- list()
ordinal_data$Ndata <- length(Y)
ordinal_data$nthres <- max(Y) -1
ordinal_data$Y <- Y
fit_ordinal <- rstan::sampling(
    model_ordinal,
    data=ordinal_data,
    chains=4,
    iter=2000
)
print(fit_ordinal, pars=c("Intercept", "lp__"))
Inference for Stan model: anon_model.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

                 mean se_mean   sd     2.5%      25%      50%      75%    97.5%
Intercept[1]    -1.53    0.00 0.08    -1.68    -1.58    -1.53    -1.48    -1.38
Intercept[2]    -1.32    0.00 0.07    -1.47    -1.37    -1.32    -1.28    -1.19
Intercept[3]    -1.19    0.00 0.07    -1.32    -1.23    -1.18    -1.14    -1.05
Intercept[4]    -1.02    0.00 0.07    -1.15    -1.07    -1.02    -0.98    -0.89
Intercept[5]    -0.67    0.00 0.06    -0.79    -0.72    -0.67    -0.63    -0.55
Intercept[6]     0.03    0.00 0.06    -0.08    -0.01     0.03     0.07     0.14
Intercept[7]     1.08    0.00 0.07     0.95     1.04     1.08     1.12     1.21
Intercept[8]     2.35    0.00 0.10     2.15     2.28     2.35     2.42     2.55
Intercept[9]     4.12    0.00 0.23     3.70     3.96     4.11     4.27     4.59
lp__         -2448.95    0.05 2.12 -2453.95 -2450.09 -2448.57 -2447.47 -2445.82
             n_eff Rhat
Intercept[1]  3772    1
Intercept[2]  4469    1
Intercept[3]  4537    1
Intercept[4]  4666    1
Intercept[5]  5386    1
Intercept[6]  4908    1
Intercept[7]  4597    1
Intercept[8]  5763    1
Intercept[9]  6636    1
lp__          1880    1

Samples were drawn using NUTS(diag_e) at Sun Feb 11 20:15:19 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).
(loo_ordinal <- loo(fit_ordinal))

Computed from 4000 by 1209 log-likelihood matrix

         Estimate   SE
elpd_loo  -2424.1 23.3
p_loo         8.9  0.3
looic      4848.1 46.6
------
Monte Carlo SE of elpd_loo is 0.0.

All Pareto k estimates are good (k < 0.5).
See help('pareto-k-diagnostic') for details.
# Extract the generated quantities
generated_quantities <- extract(fit_ordinal)
Y <- ordinal_data$Y
empirical <- tibble(
    tv = total_variance(Y),
    lv = lineage_wise_variance(Y, variable),
    cv = crossfamily_variance(Y, variable)
)
prior_pc <- tibble(
    tv = apply(generated_quantities$y_prior, 1, total_variance),
    lv = apply(
        generated_quantities$y_prior, 1, function(x) lineage_wise_variance(x, variable)
    ),
    cv = apply(
        generated_quantities$y_prior, 1, function(x) crossfamily_variance(x, variable)
    ),
)
predictive_plot(prior_pc)

posterior_pc <- tibble(
    tv = apply(generated_quantities$y_posterior, 1, total_variance),
    lv = apply(
        generated_quantities$y_posterior, 1, function(x) lineage_wise_variance(x, variable)
    ),
    cv = apply(
        generated_quantities$y_posterior, 1, function(x) crossfamily_variance(x, variable)
    ),
)
predictive_plot(posterior_pc)

defined <- which(!is.na(d[[variable]]))
undefined <- which(is.na(d[[variable]]))
po <- c()
for (i in get_postorder(roots, edges)) {
    if ((i %in% edges[,2]) && !(i %in% undefined)) {
        po <- c(po, i)
    }
}
edge_po <- match(po, edges[,2])
phylo_data <- list()
phylo_data$Ntrees <- length(trees)
phylo_data$Nnodes <- nrow(nodes)
phylo_data$Y <- ordinal_data$Y
phylo_data$Nobservations <- length(phylo_data$Y)
phylo_data$Nedges <- length(edge_po)
phylo_data$observations <- match(pull(d[defined, "glottocode"]), nodes$label)
phylo_data$roots <- roots
phylo_data$edges <- edges[edge_po,]
phylo_data$edge_lengths <- edge_lengths[edge_po]
phylo_data$nthres <- length(unique(phylo_data$Y)) - 1
fit_brownian <- rstan::sampling(
   model_brownian,
   data = phylo_data,
   chains = 4,
   iter=2000
)
print(fit_brownian, pars=c("mu", "sigma", "sigma_root", "Intercept", "lp__"))
Inference for Stan model: anon_model.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

                 mean se_mean    sd     2.5%      25%      50%      75%
mu               0.13    0.01  0.81    -1.46    -0.39     0.13     0.66
sigma            1.77    0.01  0.24     1.32     1.61     1.76     1.92
sigma_root       3.03    0.01  0.29     2.51     2.83     3.02     3.21
Intercept[1]    -1.42    0.01  0.83    -3.09    -1.96    -1.42    -0.88
Intercept[2]    -0.89    0.01  0.82    -2.55    -1.42    -0.89    -0.34
Intercept[3]    -0.56    0.01  0.83    -2.21    -1.09    -0.56    -0.01
Intercept[4]    -0.18    0.01  0.83    -1.84    -0.73    -0.19     0.36
Intercept[5]     0.58    0.01  0.82    -1.10     0.03     0.57     1.12
Intercept[6]     1.93    0.01  0.82     0.28     1.39     1.94     2.49
Intercept[7]     3.64    0.01  0.84     1.95     3.10     3.65     4.19
Intercept[8]     5.41    0.01  0.85     3.71     4.85     5.42     5.97
Intercept[9]     7.49    0.01  0.89     5.71     6.91     7.50     8.08
lp__         -5016.46    1.70 45.46 -5106.60 -5046.41 -5016.48 -4985.85
                97.5% n_eff Rhat
mu               1.71  5781 1.00
sigma            2.27   725 1.00
sigma_root       3.65  1206 1.00
Intercept[1]     0.16  5330 1.00
Intercept[2]     0.69  5316 1.00
Intercept[3]     1.04  5296 1.00
Intercept[4]     1.41  5341 1.00
Intercept[5]     2.14  5302 1.00
Intercept[6]     3.52  5211 1.00
Intercept[7]     5.26  4847 1.00
Intercept[8]     7.07  4545 1.00
Intercept[9]     9.23  4381 1.00
lp__         -4927.68   713 1.01

Samples were drawn using NUTS(diag_e) at Sun Feb 11 20:24:00 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).
(loo_brownian <- loo(fit_brownian))
Warning message:
“Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
”

Computed from 4000 by 1209 log-likelihood matrix

         Estimate   SE
elpd_loo  -2009.1 34.5
p_loo       313.3  9.4
looic      4018.2 69.0
------
Monte Carlo SE of elpd_loo is NA.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     1045  86.4%   357       
 (0.5, 0.7]   (ok)        112   9.3%   116       
   (0.7, 1]   (bad)        50   4.1%   54        
   (1, Inf)   (very bad)    2   0.2%   42        
See help('pareto-k-diagnostic') for details.
# Extract the generated quantities
generated_quantities <- extract(fit_brownian)
Y <- ordinal_data$Y
empirical <- tibble(
    tv = total_variance(Y),
    lv = lineage_wise_variance(Y, variable),
    cv = crossfamily_variance(Y, variable)
)
prior_pc <- tibble(
    tv = apply(generated_quantities$y_prior, 1, total_variance),
    lv = apply(
        generated_quantities$y_prior, 1, function(x) lineage_wise_variance(x, variable)
    ),
    cv = apply(
        generated_quantities$y_prior, 1, function(x) crossfamily_variance(x, variable)
    ),
)
predictive_plot(prior_pc)

posterior_pc <- tibble(
    tv = apply(generated_quantities$y_posterior, 1, total_variance),
    lv = apply(
        generated_quantities$y_posterior, 1, function(x) lineage_wise_variance(x, variable)
    ),
    cv = apply(
        generated_quantities$y_posterior, 1, function(x) crossfamily_variance(x, variable)
    ),
)
predictive_plot(posterior_pc)

#fit_mixed4 <- readRDS("models/fit_mixed4.rds")
fit_mixed4 <- rstan::sampling(
   model_mixed,
   data = phylo_data,
   chains = 4,
   iter=40000,
   thin=20,
   control=list(max_treedepth=15)
)
Warning message:
“There were 4 chains where the estimated Bayesian Fraction of Missing Information was low. See
https://mc-stan.org/misc/warnings.html#bfmi-low”
Warning message:
“Examine the pairs() plot to diagnose sampling problems
”
Warning message:
“Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
Running the chains for more iterations may help. See
https://mc-stan.org/misc/warnings.html#bulk-ess”
Warning message:
“Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
Running the chains for more iterations may help. See
https://mc-stan.org/misc/warnings.html#tail-ess”
saveRDS(fit_mixed4, "models/fit_mixed4.rds")
print(fit_mixed4, pars=c("mu", "sigma", "sigma_root", "Intercept", "sigma_error", "lp__"))
Inference for Stan model: anon_model.
4 chains, each with iter=40000; warmup=20000; thin=20; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

                 mean se_mean     sd     2.5%      25%      50%      75%
mu               0.33    0.02   1.41    -2.49    -0.59     0.32     1.28
sigma            1.79    0.00   0.24     1.34     1.62     1.78     1.95
sigma_root       3.05    0.00   0.30     2.53     2.85     3.03     3.25
Intercept[1]    -1.24    0.02   1.39    -4.06    -2.16    -1.23    -0.32
Intercept[2]    -0.71    0.02   1.39    -3.51    -1.63    -0.69     0.21
Intercept[3]    -0.37    0.02   1.39    -3.17    -1.31    -0.35     0.56
Intercept[4]     0.01    0.02   1.39    -2.77    -0.92     0.04     0.94
Intercept[5]     0.77    0.02   1.39    -2.04    -0.17     0.80     1.71
Intercept[6]     2.14    0.02   1.39    -0.66     1.22     2.16     3.08
Intercept[7]     3.87    0.02   1.40     1.08     2.92     3.89     4.82
Intercept[8]     5.65    0.02   1.41     2.84     4.70     5.66     6.61
Intercept[9]     7.74    0.02   1.42     4.91     6.78     7.76     8.72
sigma_error      0.23    0.01   0.09     0.12     0.17     0.21     0.26
lp__         -4852.29   29.82 398.07 -5800.10 -5077.76 -4811.82 -4572.81
                97.5% n_eff Rhat
mu               3.05  4003 1.00
sigma            2.29  3995 1.00
sigma_root       3.65  4056 1.00
Intercept[1]     1.42  4045 1.00
Intercept[2]     2.00  4008 1.00
Intercept[3]     2.29  4007 1.00
Intercept[4]     2.67  4014 1.00
Intercept[5]     3.45  4012 1.00
Intercept[6]     4.82  3999 1.00
Intercept[7]     6.56  3928 1.00
Intercept[8]     8.38  3927 1.00
Intercept[9]    10.44  3921 1.00
sigma_error      0.47   151 1.02
lp__         -4200.04   178 1.01

Samples were drawn using NUTS(diag_e) at Sun Feb 11 23:14:35 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).
(loo_mixed <- loo(fit_mixed4))
Warning message:
“Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
”

Computed from 4000 by 1209 log-likelihood matrix

         Estimate   SE
elpd_loo  -2009.8 34.5
p_loo       327.9  9.7
looic      4019.5 69.0
------
Monte Carlo SE of elpd_loo is NA.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     1036  85.7%   379       
 (0.5, 0.7]   (ok)        105   8.7%   149       
   (0.7, 1]   (bad)        68   5.6%   42        
   (1, Inf)   (very bad)    0   0.0%   <NA>      
See help('pareto-k-diagnostic') for details.
# Extract the generated quantities
generated_quantities <- extract(fit_mixed4)
Y <- ordinal_data$Y
empirical <- tibble(
    tv = total_variance(Y),
    lv = lineage_wise_variance(Y, variable),
    cv = crossfamily_variance(Y, variable)
)
prior_pc <- tibble(
    tv = apply(generated_quantities$y_prior, 1, total_variance),
    lv = apply(
        generated_quantities$y_prior, 1, function(x) lineage_wise_variance(x, variable)
    ),
    cv = apply(
        generated_quantities$y_prior, 1, function(x) crossfamily_variance(x, variable)
    )
)
predictive_plot(prior_pc)

posterior_pc <- tibble(
    tv = apply(generated_quantities$y_posterior, 1, total_variance),
    lv = apply(
        generated_quantities$y_posterior, 1, function(x) lineage_wise_variance(x, variable)
    ),
    cv = apply(
        generated_quantities$y_posterior, 1, function(x) crossfamily_variance(x, variable)
    )
)
predictive_plot(posterior_pc)

model comparison

tibble(model = c("ordinal", "brownian", "mixed"),
       looic = c(
            loo_ordinal$estimates[3,1],
            loo_brownian$estimates[3,1],
            loo_mixed$estimates[3,1]
       )
) %>%
mutate(delta_looic = round(looic-min(looic), 3)) %>%
arrange(delta_looic)
A tibble: 3 × 3
  model       looic  delta_looic
  <chr>       <dbl>        <dbl>
  brownian 4018.195         0.00
  mixed    4019.505         1.31
  ordinal  4848.144       829.95
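For reference, looic is −2·elpd_loo, and the Δlooic column subtracts the smallest looic from each entry, so the best model always scores 0. A small Python sketch of that bookkeeping, using the estimates printed above:

```python
# looic values taken from the loo outputs printed above
looic = {"ordinal": 4848.144, "brownian": 4018.195, "mixed": 4019.505}
best = min(looic.values())
delta = {m: round(v - best, 3) for m, v in looic.items()}
# rank models from best (smallest looic) to worst
for m in sorted(delta, key=delta.get):
    print(m, looic[m], delta[m])
```

By the usual rule of thumb, a gap of about 1.3 looic units, as between brownian and mixed here, is negligible; only the ordinal model is clearly worse.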

Variable 5: settlement_strategy

variable <- "settlement_strategy"
Y <- d %>%
    select(all_of(variable)) %>%
    na.omit %>%
    pull
Y <- match(Y, sort(unique(Y)))
ordinal_data <- list()
ordinal_data$Ndata <- length(Y)
ordinal_data$nthres <- max(Y) - 1
ordinal_data$Y <- Y
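`match(Y, sort(unique(Y)))` re-codes whatever ordinal levels occur in the data as consecutive integers 1..K, which is the encoding the ordinal Stan model expects. The same dense-rank idea, sketched in Python with made-up values:

```python
def dense_rank(y):
    """Map values to consecutive integers 1..K, preserving their order."""
    levels = sorted(set(y))
    index = {v: i + 1 for i, v in enumerate(levels)}
    return [index[v] for v in y]

print(dense_rank([10, 30, 10, 40]))  # → [1, 2, 1, 3]
```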
fit_ordinal <- rstan::sampling(
    model_ordinal,
    data=ordinal_data,
    chains=4,
    iter=2000
)
print(fit_ordinal, pars=c("Intercept", "lp__"))
Inference for Stan model: anon_model.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

                 mean se_mean   sd     2.5%      25%      50%      75%    97.5%
Intercept[1]    -1.42    0.00 0.07    -1.56    -1.47    -1.42    -1.37    -1.28
Intercept[2]    -0.88    0.00 0.06    -1.01    -0.93    -0.88    -0.84    -0.76
Intercept[3]     0.06    0.00 0.06    -0.06     0.02     0.06     0.10     0.18
lp__         -1371.80    0.03 1.21 -1374.83 -1372.33 -1371.48 -1370.92 -1370.45
             n_eff Rhat
Intercept[1]  2225    1
Intercept[2]  3110    1
Intercept[3]  5618    1
lp__          1803    1

Samples were drawn using NUTS(diag_e) at Sun Feb 11 23:20:48 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).
(loo_ordinal <- loo(fit_ordinal))

Computed from 4000 by 1107 log-likelihood matrix

         Estimate   SE
elpd_loo  -1364.9 18.0
p_loo         2.9  0.1
looic      2729.9 36.0
------
Monte Carlo SE of elpd_loo is 0.0.

All Pareto k estimates are good (k < 0.5).
See help('pareto-k-diagnostic') for details.
# Extract the generated quantities
generated_quantities <- extract(fit_ordinal)
Y <- ordinal_data$Y
empirical <- tibble(
    tv = total_variance(Y),
    lv = lineage_wise_variance(Y, variable),
    cv = crossfamily_variance(Y, variable)
)
prior_pc <- tibble(
    tv = apply(generated_quantities$y_prior, 1, total_variance),
    lv = apply(
        generated_quantities$y_prior, 1, function(x) lineage_wise_variance(x, variable)
    ),
    cv = apply(
        generated_quantities$y_prior, 1, function(x) crossfamily_variance(x, variable)
    )
)
predictive_plot(prior_pc)

posterior_pc <- tibble(
    tv = apply(generated_quantities$y_posterior, 1, total_variance),
    lv = apply(
        generated_quantities$y_posterior, 1, function(x) lineage_wise_variance(x, variable)
    ),
    cv = apply(
        generated_quantities$y_posterior, 1, function(x) crossfamily_variance(x, variable)
    )
)
predictive_plot(posterior_pc)

defined <- which(!is.na(d[[variable]]))
undefined <- which(is.na(d[[variable]]))
po <- c()
for (i in get_postorder(roots, edges)) {
    if ((i %in% edges[,2]) & !(i %in% undefined)) {
        po <- c(po, i)
    }
}
edge_po <- match(po, edges[,2])
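The loop above walks a precomputed postorder and keeps only nodes that actually appear as a daughter in the edge list, so the Stan models can sweep the edges with every mother assigned before her daughters. For intuition, a self-contained Python sketch of a postorder traversal on a toy rooted tree (the function name `postorder` and the tree are illustrative, not the notebook's `get_postorder`):

```python
def postorder(root, children):
    """Nodes of a rooted tree in postorder: every child before its parent."""
    order = []
    def visit(node):
        for child in children.get(node, []):
            visit(child)
        order.append(node)
    visit(root)
    return order

# toy tree: 1 is the root with daughters 2 and 3; 3 has daughters 4 and 5
print(postorder(1, {1: [2, 3], 3: [4, 5]}))  # → [2, 4, 5, 3, 1]
```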
phylo_data <- list()
phylo_data$Ntrees <- length(trees)
phylo_data$Nnodes <- nrow(nodes)
phylo_data$Y <- ordinal_data$Y
phylo_data$Nobservations <- length(phylo_data$Y)
phylo_data$Nedges <- length(edge_po)
phylo_data$observations <- match(pull(d[defined, "glottocode"]), nodes$label)
phylo_data$roots <- roots
phylo_data$edges <- edges[edge_po,]
phylo_data$edge_lengths <- edge_lengths[edge_po]
phylo_data$nthres <- length(unique(phylo_data$Y)) - 1
fit_brownian <- rstan::sampling(
   model_brownian,
   data = phylo_data,
   chains = 4,
   iter=2000
)
print(fit_brownian, pars=c("mu", "sigma", "sigma_root", "Intercept", "lp__"))
Inference for Stan model: anon_model.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

                 mean se_mean    sd     2.5%      25%      50%      75%
mu               0.34    0.00  0.24    -0.14     0.19     0.35     0.50
sigma            1.35    0.02  0.36     0.69     1.10     1.33     1.58
sigma_root       1.75    0.01  0.26     1.29     1.57     1.74     1.92
Intercept[1]    -1.17    0.00  0.17    -1.51    -1.28    -1.17    -1.06
Intercept[2]    -0.39    0.00  0.16    -0.71    -0.49    -0.39    -0.29
Intercept[3]     0.95    0.00  0.16     0.63     0.84     0.94     1.05
lp__         -4306.59    1.70 42.79 -4391.52 -4335.71 -4306.72 -4277.15
                97.5% n_eff Rhat
mu               0.82  2607 1.00
sigma            2.08   492 1.01
sigma_root       2.31  1159 1.00
Intercept[1]    -0.85  3356 1.00
Intercept[2]    -0.08  5316 1.00
Intercept[3]     1.27  5119 1.00
lp__         -4224.04   637 1.00

Samples were drawn using NUTS(diag_e) at Sun Feb 11 23:25:39 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).
(loo_brownian <- loo(fit_brownian))
Warning message:
“Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
”

Computed from 4000 by 1107 log-likelihood matrix

         Estimate   SE
elpd_loo  -1232.0 21.8
p_loo       182.0  5.9
looic      2464.0 43.5
------
Monte Carlo SE of elpd_loo is NA.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     1017  91.9%   562       
 (0.5, 0.7]   (ok)         85   7.7%   288       
   (0.7, 1]   (bad)         5   0.5%   228       
   (1, Inf)   (very bad)    0   0.0%   <NA>      
See help('pareto-k-diagnostic') for details.
# Extract the generated quantities
generated_quantities <- extract(fit_brownian)
Y <- ordinal_data$Y
empirical <- tibble(
    tv = total_variance(Y),
    lv = lineage_wise_variance(Y, variable),
    cv = crossfamily_variance(Y, variable)
)
prior_pc <- tibble(
    tv = apply(generated_quantities$y_prior, 1, total_variance),
    lv = apply(
        generated_quantities$y_prior, 1, function(x) lineage_wise_variance(x, variable)
    ),
    cv = apply(
        generated_quantities$y_prior, 1, function(x) crossfamily_variance(x, variable)
    )
)
predictive_plot(prior_pc)

posterior_pc <- tibble(
    tv = apply(generated_quantities$y_posterior, 1, total_variance),
    lv = apply(
        generated_quantities$y_posterior, 1, function(x) lineage_wise_variance(x, variable)
    ),
    cv = apply(
        generated_quantities$y_posterior, 1, function(x) crossfamily_variance(x, variable)
    )
)
predictive_plot(posterior_pc)

# fit_mixed5 <- readRDS("models/fit_mixed5.rds")
fit_mixed5 <- rstan::sampling(
   model_mixed,
   data = phylo_data,
   chains = 4,
   iter=40000,
   thin=20,
   control=list(max_treedepth=15)
)
Warning message:
“There were 1 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 15. See
https://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded”
Warning message:
“There were 4 chains where the estimated Bayesian Fraction of Missing Information was low. See
https://mc-stan.org/misc/warnings.html#bfmi-low”
Warning message:
“Examine the pairs() plot to diagnose sampling problems
”
Warning message:
“Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
Running the chains for more iterations may help. See
https://mc-stan.org/misc/warnings.html#bulk-ess”
Warning message:
“Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
Running the chains for more iterations may help. See
https://mc-stan.org/misc/warnings.html#tail-ess”
saveRDS(fit_mixed5, "models/fit_mixed5.rds")
print(fit_mixed5, pars=c("mu", "sigma", "sigma_root", "Intercept", "sigma_error", "lp__"))
Inference for Stan model: anon_model.
4 chains, each with iter=40000; warmup=20000; thin=20; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

                 mean se_mean     sd     2.5%      25%      50%      75%
mu               0.36    0.00   0.24    -0.12     0.19     0.36     0.53
sigma            1.36    0.01   0.37     0.67     1.11     1.35     1.60
sigma_root       1.77    0.00   0.27     1.29     1.58     1.75     1.94
Intercept[1]    -1.18    0.00   0.17    -1.53    -1.29    -1.18    -1.06
Intercept[2]    -0.39    0.00   0.16    -0.71    -0.50    -0.39    -0.27
Intercept[3]     0.97    0.00   0.17     0.64     0.85     0.97     1.08
sigma_error      0.24    0.01   0.10     0.13     0.17     0.21     0.28
lp__         -4216.27   28.05 389.25 -5089.27 -4451.84 -4168.79 -3933.01
                97.5% n_eff Rhat
mu               0.84  3964 1.00
sigma            2.11  3787 1.00
sigma_root       2.37  3910 1.00
Intercept[1]    -0.86  3959 1.00
Intercept[2]    -0.08  4015 1.00
Intercept[3]     1.30  3593 1.00
sigma_error      0.50   173 1.02
lp__         -3587.28   193 1.01

Samples were drawn using NUTS(diag_e) at Mon Feb 12 00:51:07 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).
(loo_mixed <- loo(fit_mixed5))
Warning message:
“Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
”

Computed from 4000 by 1107 log-likelihood matrix

         Estimate   SE
elpd_loo  -1231.1 21.7
p_loo       194.1  5.9
looic      2462.1 43.4
------
Monte Carlo SE of elpd_loo is NA.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     1025  92.6%   376       
 (0.5, 0.7]   (ok)         79   7.1%   215       
   (0.7, 1]   (bad)         3   0.3%   167       
   (1, Inf)   (very bad)    0   0.0%   <NA>      
See help('pareto-k-diagnostic') for details.
# Extract the generated quantities
generated_quantities <- extract(fit_mixed5)
Y <- ordinal_data$Y
empirical <- tibble(
    tv = total_variance(Y),
    lv = lineage_wise_variance(Y, variable),
    cv = crossfamily_variance(Y, variable)
)
prior_pc <- tibble(
    tv = apply(generated_quantities$y_prior, 1, total_variance),
    lv = apply(
        generated_quantities$y_prior, 1, function(x) lineage_wise_variance(x, variable)
    ),
    cv = apply(
        generated_quantities$y_prior, 1, function(x) crossfamily_variance(x, variable)
    )
)
predictive_plot(prior_pc)

posterior_pc <- tibble(
    tv = apply(generated_quantities$y_posterior, 1, total_variance),
    lv = apply(
        generated_quantities$y_posterior, 1, function(x) lineage_wise_variance(x, variable)
    ),
    cv = apply(
        generated_quantities$y_posterior, 1, function(x) crossfamily_variance(x, variable)
    )
)
predictive_plot(posterior_pc)

model comparison

tibble(model = c("ordinal", "brownian", "mixed"),
       looic = c(
            loo_ordinal$estimates[3,1],
            loo_brownian$estimates[3,1],
            loo_mixed$estimates[3,1]
       )
) %>%
mutate(delta_looic = round(looic-min(looic), 3)) %>%
arrange(delta_looic)
A tibble: 3 × 3
  model       looic  delta_looic
  <chr>       <dbl>        <dbl>
  mixed    2462.106        0.000
  brownian 2464.050        1.944
  ordinal  2729.880      267.774

Variable 6: exogamy

variable <- "exogamy"
Y <- d %>%
    select(all_of(variable)) %>%
    na.omit %>%
    pull
Y <- match(Y, sort(unique(Y)))

Since there are only two values for this variable, we need a logistic rather than an ordinal model.
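This is a genuine special case rather than a different model family: with a single threshold t, an ordered-logit likelihood collapses to a Bernoulli-logit with linear predictor eta − t. A quick numerical check in Python (function and values are illustrative):

```python
import math

def ordered_logit_probs(eta, thresholds):
    """Category probabilities of an ordered-logit model with K-1 thresholds."""
    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))
    cum = [sigmoid(eta - t) for t in thresholds]  # P(Y > k) at each threshold
    probs, prev = [], 1.0
    for c in cum:
        probs.append(prev - c)
        prev = c
    probs.append(prev)
    return probs

# with one threshold t, P(Y = 2) is exactly a Bernoulli-logit with predictor eta - t
eta, t = 0.4, 0.66
p_ordinal = ordered_logit_probs(eta, [t])[1]
p_logit = 1.0 / (1.0 + math.exp(-(eta - t)))
print(abs(p_ordinal - p_logit) < 1e-12)  # → True
```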

binary_data <- list()
binary_data$Ndata <- length(Y)
binary_data$Y <- Y-1
binary_code <- "
data {
  int<lower=1> Ndata;  // total number of observations
  int<lower=0, upper=1> Y[Ndata];  // binary response variable
}
parameters {
  real Intercept;  // intercept for logistic regression
}

model {
  target += normal_lpdf(Intercept | 0, 2);
  for (n in 1:Ndata) {
    target += bernoulli_logit_lpmf(Y[n] | Intercept);
  }
}

generated quantities {
  int y_prior[Ndata];
  int y_posterior[Ndata];
  vector[Ndata] log_lik;
  real Intercept_prior = normal_rng(0, 2); // sample from the prior
  
  for (n in 1:Ndata) {
    y_prior[n] = bernoulli_logit_rng(Intercept_prior); // simulate prior predictive data
    y_posterior[n] = bernoulli_logit_rng(Intercept); // simulate posterior predictive data
    log_lik[n] = bernoulli_logit_lpmf(Y[n] | Intercept); // log likelihood for each observation
  }
}

"
model_binary <- stan_model(model_code=binary_code)
fit_binary <- rstan::sampling(
    model_binary,
    data=binary_data,
    chains=4,
    iter=2000
)
print(fit_binary, pars=c("Intercept", "lp__"))
Inference for Stan model: anon_model.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

             mean se_mean   sd    2.5%     25%     50%     75%   97.5% n_eff
Intercept   -0.66    0.00 0.06   -0.78   -0.70   -0.66   -0.62   -0.53  1447
lp__      -666.75    0.02 0.69 -668.83 -666.88 -666.50 -666.33 -666.28  1764
          Rhat
Intercept    1
lp__         1

Samples were drawn using NUTS(diag_e) at Mon Feb 12 00:58:07 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).
(loo_binary <- loo(fit_binary))

Computed from 4000 by 1036 log-likelihood matrix

         Estimate   SE
elpd_loo   -665.6 10.1
p_loo         0.9  0.0
looic      1331.1 20.1
------
Monte Carlo SE of elpd_loo is 0.0.

All Pareto k estimates are good (k < 0.5).
See help('pareto-k-diagnostic') for details.
# Extract the generated quantities
generated_quantities <- extract(fit_binary)
Y <- binary_data$Y
empirical <- tibble(
    tv = total_variance(Y),
    lv = lineage_wise_variance(Y, variable),
    cv = crossfamily_variance(Y, variable)
)
prior_pc <- tibble(
    tv = apply(generated_quantities$y_prior, 1, total_variance),
    lv = apply(
        generated_quantities$y_prior, 1, function(x) lineage_wise_variance(x, variable)
    ),
    cv = apply(
        generated_quantities$y_prior, 1, function(x) crossfamily_variance(x, variable)
    )
)
predictive_plot(prior_pc)

posterior_pc <- tibble(
    tv = apply(generated_quantities$y_posterior, 1, total_variance),
    lv = apply(
        generated_quantities$y_posterior, 1, function(x) lineage_wise_variance(x, variable)
    ),
    cv = apply(
        generated_quantities$y_posterior, 1, function(x) crossfamily_variance(x, variable)
    )
)
predictive_plot(posterior_pc)

defined <- which(!is.na(d[[variable]]))
undefined <- which(is.na(d[[variable]]))
po <- c()
for (i in get_postorder(roots, edges)) {
    if ((i %in% edges[,2]) & !(i %in% undefined)) {
        po <- c(po, i)
    }
}
edge_po <- match(po, edges[,2])
phylo_data <- list()
phylo_data$Ntrees <- length(trees)
phylo_data$Nnodes <- nrow(nodes)
phylo_data$Y <- binary_data$Y
phylo_data$Nobservations <- length(phylo_data$Y)
phylo_data$Nedges <- length(edge_po)
phylo_data$observations <- match(pull(d[defined, "glottocode"]), nodes$label)
phylo_data$roots <- roots
phylo_data$edges <- edges[edge_po,]
phylo_data$edge_lengths <- edge_lengths[edge_po]
phylo_data$nthres <- length(unique(phylo_data$Y)) - 1

The phylogenetic ordinal regression models likewise have to be adapted to a binary outcome variable.

brownian_binary_code <- "
data {
    int<lower=1> Ntrees;
    int<lower=0> Nnodes;
    int<lower=1> Nobservations;
    int<lower=1> Nedges;
    int<lower=1,upper=Nnodes> observations[Nobservations]; // indices of the observed tips
    int<lower=1, upper=Nnodes> roots[Ntrees];
    int<lower=1, upper=Nnodes> edges[Nedges, 2];
    vector[Nedges] edge_lengths;
    int<lower=0, upper=1> Y[Nobservations];  // response variable
}

parameters {
    vector[Nnodes] z_normal;
    real mu;
    real<lower=0> sigma;
    real<lower=0> sigma_root;
}

model {
    vector[Nnodes] z = rep_vector(0, Nnodes);
    target += normal_lpdf(mu | 0, 1);
    target += exponential_lpdf(sigma | 1);
    target += exponential_lpdf(sigma_root | 1); // matches sigma_root_prior = exponential_rng(1) below
    target += normal_lpdf(z_normal | 0, 1);
    
    for (i in 1:Ntrees) {
        int idx = roots[i];
        z[idx] = mu + z_normal[idx] * sigma_root;
    }
    
    for (i in 1:Nedges) {
        int j = Nedges - i + 1;
        int mother_node = edges[j, 1];
        int daughter_node = edges[j, 2];
        real local_mu = z[mother_node];
        real local_sigma = sigma * sqrt(edge_lengths[j]);
        z[daughter_node] = local_mu + z_normal[daughter_node] * local_sigma;
    }

    for (n in 1:Nobservations) {
        target += bernoulli_logit_lpmf(Y[n] | z[observations[n]]);
    }    
}


generated quantities {
    vector[Nnodes] z = rep_vector(0, Nnodes);
    int y_prior[Nobservations]; // Prior predictive samples
    int y_posterior[Nobservations]; // Posterior predictive samples
    vector[Nnodes] z_prior = rep_vector(0, Nnodes);
    vector[Nnodes] z_posterior = rep_vector(0, Nnodes);
    real mu_prior;
    real<lower=0> sigma_prior;
    real<lower=0> sigma_root_prior;
    real log_lik[Nobservations];

    for (i in 1:Ntrees) {
        int idx = roots[i];
        z[idx] = mu + z_normal[idx] * sigma_root;
    }
    
    for (i in 1:Nedges) {
        int j = Nedges - i + 1;
        int mother_node = edges[j, 1];
        int daughter_node = edges[j, 2];
        real local_mu = z[mother_node];
        real local_sigma = sigma * sqrt(edge_lengths[j]);
        z[daughter_node] = local_mu + z_normal[daughter_node] * local_sigma;
    }


    mu_prior = normal_rng(0, 1); // match the normal(0, 1) prior on mu in the model block
    sigma_prior = exponential_rng(1);
    sigma_root_prior = exponential_rng(1);

    for (i in 1:Ntrees) {
        z_prior[roots[i]] = normal_rng(mu_prior, sigma_root_prior);
    }

    for (i in 1:Ntrees) {
        int idx = roots[i];
        z_posterior[idx] = mu + z_normal[idx] * sigma_root;
    }
    
    for (i in 1:Nedges) {
        int j = Nedges - i + 1;
        int mother_node = edges[j, 1];
        int daughter_node = edges[j, 2];
        real local_mu = z_posterior[mother_node];
        real local_sigma = sigma * sqrt(edge_lengths[j]);
        z_prior[daughter_node] = normal_rng(
            z_prior[mother_node],
            sigma_prior * sqrt(edge_lengths[j])
        );

        z_posterior[daughter_node] = local_mu + z_normal[daughter_node] * local_sigma;
    }
    
    for (n in 1:Nobservations) {
        y_prior[n] = bernoulli_logit_rng(z_prior[observations[n]]); 
        y_posterior[n] = bernoulli_logit_rng(z[observations[n]]); 
        log_lik[n] = bernoulli_logit_lpmf(Y[n] | z[observations[n]]);
    }
}
"
model_brownian_binary <- stan_model(model_code=brownian_binary_code)
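The edge sweep in this model is a non-centered Brownian motion on the tree: each daughter's latent value is its mother's value plus a standard-normal draw scaled by sigma·sqrt(edge length). The same recursion, simulated standalone in Python on a toy tree (names and values are illustrative, not the notebook's data structures):

```python
import random

def simulate_bm(root, root_value, edges, edge_lengths, sigma, rng):
    """Brownian motion on a tree: z[daughter] = z[mother] + eps*sigma*sqrt(len).

    `edges` must be ordered so every mother is assigned before her daughters,
    which is what the postorder bookkeeping guarantees in the Stan models."""
    z = {root: root_value}
    for (mother, daughter), length in zip(edges, edge_lengths):
        eps = rng.gauss(0.0, 1.0)  # plays the role of z_normal in the Stan code
        z[daughter] = z[mother] + eps * sigma * length ** 0.5
    return z

# toy rooted tree: node 1 is the root; edges run mother -> daughter
edges = [(1, 2), (1, 3), (3, 4)]
z = simulate_bm(1, 0.0, edges, [1.0, 0.5, 2.0], sigma=1.0, rng=random.Random(1))
print(sorted(z))  # → [1, 2, 3, 4]  (every node received a latent value)
```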
fit_brownian <- rstan::sampling(
   model_brownian_binary,
   data = phylo_data,
   chains = 4,
   iter=2000
)
print(fit_brownian, pars=c("mu", "sigma", "sigma_root", "lp__"))
Inference for Stan model: anon_model.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

               mean se_mean    sd     2.5%      25%      50%      75%    97.5%
mu            -1.10    0.01  0.24    -1.63    -1.25    -1.08    -0.93    -0.66
sigma          3.19    0.04  0.73     2.03     2.69     3.09     3.60     4.84
sigma_root     1.58    0.03  0.51     0.70     1.23     1.53     1.90     2.70
lp__       -3631.70    2.25 47.67 -3720.89 -3664.01 -3633.21 -3600.92 -3532.82
           n_eff Rhat
mu          1091 1.00
sigma        401 1.01
sigma_root   402 1.01
lp__         449 1.01

Samples were drawn using NUTS(diag_e) at Mon Feb 12 01:03:28 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).
(loo_brownian <- loo(fit_brownian))
Warning message:
“Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
”

Computed from 4000 by 1036 log-likelihood matrix

         Estimate   SE
elpd_loo   -597.9 13.3
p_loo       213.8  6.9
looic      1195.8 26.5
------
Monte Carlo SE of elpd_loo is NA.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     614   59.3%   427       
 (0.5, 0.7]   (ok)       299   28.9%   194       
   (0.7, 1]   (bad)       44    4.2%   86        
   (1, Inf)   (very bad)  79    7.6%   145       
See help('pareto-k-diagnostic') for details.
# Extract the generated quantities
generated_quantities <- extract(fit_brownian)
Y <- binary_data$Y
empirical <- tibble(
    tv = total_variance(Y),
    lv = lineage_wise_variance(Y, variable),
    cv = crossfamily_variance(Y, variable)
)
prior_pc <- tibble(
    tv = apply(generated_quantities$y_prior, 1, total_variance),
    lv = apply(
        generated_quantities$y_prior, 1, function(x) lineage_wise_variance(x, variable)
    ),
    cv = apply(
        generated_quantities$y_prior, 1, function(x) crossfamily_variance(x, variable)
    )
)
predictive_plot(prior_pc)

posterior_pc <- tibble(
    tv = apply(generated_quantities$y_posterior, 1, total_variance),
    lv = apply(
        generated_quantities$y_posterior, 1, function(x) lineage_wise_variance(x, variable)
    ),
    cv = apply(
        generated_quantities$y_posterior, 1, function(x) crossfamily_variance(x, variable)
    )
)
predictive_plot(posterior_pc)

mixed_binary_code <- "
data {
    int<lower=1> Ntrees;
    int<lower=0> Nnodes;
    int<lower=1> Nobservations;
    int<lower=1> Nedges;
    int<lower=1,upper=Nnodes> observations[Nobservations]; // indices of the observed tips
    int<lower=1, upper=Nnodes> roots[Ntrees];
    int<lower=1, upper=Nnodes> edges[Nedges, 2];
    vector[Nedges] edge_lengths;
    int<lower=0, upper=1> Y[Nobservations];  // response variable
}

parameters {
    vector[Nnodes] z_normal;
    real mu;
    real<lower=0> sigma;
    real<lower=0> sigma_root;
    real<lower=0> sigma_error;
    vector[Nobservations] z_error;
}

model {
    vector[Nnodes] z = rep_vector(0, Nnodes);
    target += normal_lpdf(mu | 0, 10);
    target += exponential_lpdf(sigma | 1);
    target += exponential_lpdf(sigma_root | 1); // matches sigma_root_prior = exponential_rng(1) below
    target += normal_lpdf(z_normal | 0, 1);
    target += inv_gamma_lpdf(square(sigma_error) | 2, 0.1);

    target += normal_lpdf(z_error | 0, sigma_error);

    for (i in 1:Ntrees) {
        int idx = roots[i];
        z[idx] = mu + z_normal[idx] * sigma_root;
    }
    
    for (i in 1:Nedges) {
        int j = Nedges - i + 1;
        int mother_node = edges[j, 1];
        int daughter_node = edges[j, 2];
        real local_mu = z[mother_node];
        real local_sigma = sigma * sqrt(edge_lengths[j]);
        z[daughter_node] = local_mu + z_normal[daughter_node] * local_sigma;
    }

    for (n in 1:Nobservations) {
        target += bernoulli_logit_lpmf(Y[n] | z[observations[n]] + z_error[n]);
    }
    
}

generated quantities {
    vector[Nnodes] z = rep_vector(0, Nnodes);
    int y_prior[Nobservations]; // Prior predictive samples
    int y_posterior[Nobservations]; // Posterior predictive samples
    vector[Nnodes] z_prior = rep_vector(0, Nnodes);
    vector[Nnodes] z_posterior = rep_vector(0, Nnodes);
    real mu_prior;
    real<lower=0> sigma_prior;
    real<lower=0> sigma_root_prior;
    real log_lik[Nobservations];
    vector[Nobservations] z_error_prior;


    real sigma_error_sq = 1 / gamma_rng(2, 0.1); // 1/Gamma(a, rate b) ~ inv_gamma(a, b): matches the inv_gamma(2, 0.1) prior
    real sigma_error_prior = sqrt(sigma_error_sq);

    for (i in 1:Ntrees) {
        int idx = roots[i];
        z[idx] = mu + z_normal[idx] * sigma_root;
    }

    for (n in 1:Nobservations) {
        z_error_prior[n] = normal_rng(0, sigma_error_prior);
    }
    
    for (i in 1:Nedges) {
        int j = Nedges - i + 1;
        int mother_node = edges[j, 1];
        int daughter_node = edges[j, 2];
        real local_mu = z[mother_node];
        real local_sigma = sigma * sqrt(edge_lengths[j]);
        z[daughter_node] = local_mu + z_normal[daughter_node] * local_sigma;
    }

    mu_prior = normal_rng(0, 10);
    sigma_prior = exponential_rng(1);
    sigma_root_prior = exponential_rng(1);

    for (i in 1:Ntrees) {
        z_prior[roots[i]] = normal_rng(mu_prior, sigma_root_prior);
    }

    for (i in 1:Ntrees) {
        int idx = roots[i];
        z_posterior[idx] = mu + z_normal[idx] * sigma_root;
    }
    
    for (i in 1:Nedges) {
        int j = Nedges - i + 1;
        int mother_node = edges[j, 1];
        int daughter_node = edges[j, 2];
        real local_mu = z_posterior[mother_node];
        real local_sigma = sigma * sqrt(edge_lengths[j]);
        z_prior[daughter_node] = normal_rng(
            z_prior[mother_node],
            sigma_prior * sqrt(edge_lengths[j])
        );

        z_posterior[daughter_node] = local_mu + z_normal[daughter_node] * local_sigma;
    }
    
    for (n in 1:Nobservations) {
        y_prior[n] = bernoulli_logit_rng(z_prior[observations[n]] + z_error_prior[n]); 
        y_posterior[n] = bernoulli_logit_rng(z[observations[n]] + z_error[n]); 
        log_lik[n] = bernoulli_logit_lpmf(Y[n] | z[observations[n]] + z_error[n]);
    }
}

"
model_mixed_binary <- stan_model(model_code=mixed_binary_code)
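The prior draw for sigma_error in the generated quantities relies on the identity that the reciprocal of a Gamma(α, rate β) variate is InverseGamma(α, β), whose mean is β/(α − 1) for α > 1. A quick stdlib Monte Carlo check in Python (α = 3 here, rather than the model's α = 2, so the mean estimator has finite variance):

```python
import random

def inv_gamma_sample(alpha, beta, rng):
    """Draw X ~ InverseGamma(alpha, beta) via X = 1/Y, Y ~ Gamma(alpha, rate=beta).

    random.gammavariate takes (shape, scale); a Gamma rate of beta
    corresponds to a scale of 1/beta."""
    return 1.0 / rng.gammavariate(alpha, 1.0 / beta)

rng = random.Random(42)
alpha, beta = 3.0, 0.5   # alpha > 2 keeps the variance of the mean estimator finite
n = 200_000
mean = sum(inv_gamma_sample(alpha, beta, rng) for _ in range(n)) / n
print(abs(mean - beta / (alpha - 1)) < 0.01)  # InvGamma mean is beta/(alpha-1)
```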
#fit_mixed6 <- readRDS("models/fit_mixed6.rds")
fit_mixed6 <- rstan::sampling(
   model_mixed_binary,
   data = phylo_data,
   chains = 4,
   iter=40000,
   thin=20,
   control=list(max_treedepth=15)
)
Warning message:
“There were 4 chains where the estimated Bayesian Fraction of Missing Information was low. See
https://mc-stan.org/misc/warnings.html#bfmi-low”
Warning message:
“Examine the pairs() plot to diagnose sampling problems
”
Warning message:
“Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
Running the chains for more iterations may help. See
https://mc-stan.org/misc/warnings.html#bulk-ess”
Warning message:
“Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
Running the chains for more iterations may help. See
https://mc-stan.org/misc/warnings.html#tail-ess”
saveRDS(fit_mixed6, "models/fit_mixed6.rds")
print(fit_mixed6, pars=c("mu", "sigma", "sigma_root", "sigma_error", "lp__"))
Inference for Stan model: anon_model.
4 chains, each with iter=40000; warmup=20000; thin=20; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

                mean se_mean     sd     2.5%      25%      50%      75%
mu             -1.21    0.00   0.28    -1.85    -1.38    -1.18    -1.02
sigma           3.38    0.01   0.79     2.12     2.83     3.28     3.83
sigma_root      1.73    0.01   0.56     0.80     1.34     1.67     2.04
sigma_error     0.24    0.01   0.10     0.13     0.18     0.21     0.27
lp__        -3525.01   24.65 353.58 -4337.38 -3721.87 -3482.58 -3284.97
               97.5% n_eff Rhat
mu             -0.73  3485 1.00
sigma           5.22  3537 1.00
sigma_root      2.99  4074 1.00
sigma_error     0.49   195 1.01
lp__        -2940.62   206 1.01

Samples were drawn using NUTS(diag_e) at Mon Feb 12 01:52:53 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).
(loo_mixed <- loo(fit_mixed6))
Warning message:
“Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
”

Computed from 4000 by 1036 log-likelihood matrix

         Estimate   SE
elpd_loo   -597.0 13.5
p_loo       228.1  7.4
looic      1193.9 27.1
------
Monte Carlo SE of elpd_loo is NA.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     647   62.5%   493       
 (0.5, 0.7]   (ok)       324   31.3%   211       
   (0.7, 1]   (bad)       64    6.2%   43        
   (1, Inf)   (very bad)   1    0.1%   39        
See help('pareto-k-diagnostic') for details.
# Extract the generated quantities
generated_quantities <- extract(fit_mixed6)
Y <- binary_data$Y
empirical <- tibble(
    tv = total_variance(Y),
    lv = lineage_wise_variance(Y, variable),
    cv = crossfamily_variance(Y, variable)
)
prior_pc <- tibble(
    tv = apply(generated_quantities$y_prior, 1, total_variance),
    lv = apply(
        generated_quantities$y_prior, 1, function(x) lineage_wise_variance(x, variable)
    ),
    cv = apply(
        generated_quantities$y_prior, 1, function(x) crossfamily_variance(x, variable)
    ),
)
predictive_plot(prior_pc)

posterior_pc <- tibble(
    tv = apply(generated_quantities$y_posterior, 1, total_variance),
    lv = apply(
        generated_quantities$y_posterior, 1, function(x) lineage_wise_variance(x, variable)
    ),
    cv = apply(
        generated_quantities$y_posterior, 1, function(x) crossfamily_variance(x, variable)
    ),
)
predictive_plot(posterior_pc)

model comparison

tibble(model = c("binary", "brownian", "mixed"),
       looic = c(
            loo_binary$estimates[3,1],
            loo_brownian$estimates[3,1],
            loo_mixed$estimates[3,1]
       )
) %>%
mutate(delta_looic = round(looic-min(looic), 3)) %>%
arrange(delta_looic)
A tibble: 3 × 3
model       looic  delta_looic
<chr>       <dbl>        <dbl>
mixed    1193.911        0.000
brownian 1195.849        1.937
binary   1331.121      137.210
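A delta-LOOIC by itself carries no uncertainty estimate. `loo::loo_compare()` reports the elpd difference together with its standard error, using the loo objects already computed above:

```r
# compare the three fits on the elpd scale; loo_compare() orders the
# models by elpd and reports se_diff for each pairwise difference
loo_compare(list(binary   = loo_binary,
                 brownian = loo_brownian,
                 mixed    = loo_mixed))
```

The looic gap between mixed and brownian is under 2, so the corresponding elpd difference is likely to fall well within its standard error.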

Variable 7: crop_type

variable <- "crop_type"
Y <- d %>%
    select(all_of(variable)) %>%
    na.omit %>%
    pull
Y <- match(Y, sort(unique(Y)))
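The `match(Y, sort(unique(Y)))` idiom recodes arbitrary ordinal codes into consecutive integers 1..K, which the Stan model expects. A toy illustration:

```r
# observed levels 3, 5, 9 are mapped to consecutive codes 1, 2, 3
y_raw <- c(3, 5, 3, 9)
match(y_raw, sort(unique(y_raw)))  # 1 2 1 3
```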
ordinal_data <- list()
ordinal_data$Ndata <- length(Y)
ordinal_data$nthres <- max(Y) - 1
ordinal_data$Y <- Y
fit_ordinal <- rstan::sampling(
    model_ordinal,
    data = ordinal_data,
    chains = 4,
    iter = 2000
)
print(fit_ordinal, pars=c("Intercept", "lp__"))
Inference for Stan model: anon_model.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

                 mean se_mean   sd     2.5%      25%      50%      75%    97.5%
Intercept[1]    -1.50    0.00 0.08    -1.66    -1.55    -1.49    -1.44    -1.35
Intercept[2]    -1.48    0.00 0.08    -1.64    -1.53    -1.48    -1.43    -1.33
Intercept[3]    -1.44    0.00 0.08    -1.60    -1.50    -1.44    -1.39    -1.29
Intercept[4]    -1.04    0.00 0.07    -1.18    -1.09    -1.04    -1.00    -0.91
Intercept[5]    -0.12    0.00 0.06    -0.24    -0.16    -0.12    -0.08    -0.01
lp__         -1342.38    0.04 1.60 -1346.35 -1343.20 -1342.06 -1341.21 -1340.28
             n_eff Rhat
Intercept[1]  3286    1
Intercept[2]  3260    1
Intercept[3]  3411    1
Intercept[4]  4344    1
Intercept[5]  5025    1
lp__          1827    1

Samples were drawn using NUTS(diag_e) at Mon Feb 12 01:59:13 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).
(loo_ordinal <- loo(fit_ordinal))

Computed from 4000 by 1103 log-likelihood matrix

         Estimate   SE
elpd_loo  -1323.4 24.0
p_loo         4.7  0.7
looic      2646.8 48.1
------
Monte Carlo SE of elpd_loo is 0.0.

All Pareto k estimates are good (k < 0.5).
See help('pareto-k-diagnostic') for details.
# Extract the generated quantities
generated_quantities <- extract(fit_ordinal)
Y <- ordinal_data$Y
empirical <- tibble(
    tv = total_variance(Y),
    lv = lineage_wise_variance(Y, variable),
    cv = crossfamily_variance(Y, variable)
)
prior_pc <- tibble(
    tv = apply(generated_quantities$y_prior, 1, total_variance),
    lv = apply(
        generated_quantities$y_prior, 1, function(x) lineage_wise_variance(x, variable)
    ),
    cv = apply(
        generated_quantities$y_prior, 1, function(x) crossfamily_variance(x, variable)
    ),
)
predictive_plot(prior_pc)

posterior_pc <- tibble(
    tv = apply(generated_quantities$y_posterior, 1, total_variance),
    lv = apply(
        generated_quantities$y_posterior, 1, function(x) lineage_wise_variance(x, variable)
    ),
    cv = apply(
        generated_quantities$y_posterior, 1, function(x) crossfamily_variance(x, variable)
    ),
)
predictive_plot(posterior_pc)

defined <- which(!is.na(d[[variable]]))
undefined <- which(is.na(d[[variable]]))
# collect, in post-order, the non-root nodes (those appearing as a child
# in the edge list) for which the variable is observed
po <- c()
for (i in get_postorder(roots, edges)) {
    if ((i %in% edges[, 2]) & !(i %in% undefined)) {
        po <- c(po, i)
    }
}
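The filtering loop above can equivalently be written without growing a vector inside a loop:

```r
# vectorized equivalent: keep non-root nodes (those that appear as a
# child in the edge list) with observed data, preserving post-order
po_all <- get_postorder(roots, edges)
po <- po_all[po_all %in% edges[, 2] & !(po_all %in% undefined)]
```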
edge_po <- match(po, edges[,2])
phylo_data <- list()
phylo_data$Ntrees <- length(trees)
phylo_data$Nnodes <- nrow(nodes)
phylo_data$Y <- ordinal_data$Y
phylo_data$Nobservations <- length(phylo_data$Y)
phylo_data$Nedges <- length(edge_po)
phylo_data$observations <- match(pull(d[defined, "glottocode"]), nodes$label)
phylo_data$roots <- roots
phylo_data$edges <- edges[edge_po,]
phylo_data$edge_lengths <- edge_lengths[edge_po]
phylo_data$nthres <- length(unique(phylo_data$Y)) - 1
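Before sampling, a few cheap consistency checks on the assembled data list can catch indexing mistakes early. A sketch, assuming the structures built above:

```r
# light sanity checks on the Stan data list before a long run
stopifnot(
    length(phylo_data$Y) == length(phylo_data$observations),
    !anyNA(phylo_data$observations),   # every glottocode matched a node
    phylo_data$nthres == max(phylo_data$Y) - 1  # codes are consecutive 1..K
)
```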
fit_brownian <- rstan::sampling(
   model_brownian,
   data = phylo_data,
   chains = 4,
   iter=2000
)
print(fit_brownian, pars=c("mu", "sigma", "sigma_root", "Intercept", "lp__"))
Inference for Stan model: anon_model.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

                 mean se_mean    sd     2.5%      25%      50%      75%
mu              -0.93    0.01  0.33    -1.58    -1.15    -0.93    -0.72
sigma            1.71    0.01  0.26     1.25     1.53     1.69     1.87
sigma_root       3.27    0.01  0.41     2.54     2.99     3.25     3.53
Intercept[1]    -2.43    0.00  0.20    -2.81    -2.56    -2.42    -2.29
Intercept[2]    -2.39    0.00  0.20    -2.78    -2.52    -2.39    -2.26
Intercept[3]    -2.32    0.00  0.20    -2.70    -2.45    -2.32    -2.18
Intercept[4]    -1.56    0.00  0.18    -1.93    -1.69    -1.56    -1.43
Intercept[5]     0.08    0.00  0.17    -0.25    -0.03     0.08     0.19
lp__         -4111.76    1.22 39.06 -4186.64 -4138.22 -4112.04 -4084.78
                97.5% n_eff Rhat
mu              -0.29  1167 1.00
sigma            2.26   969 1.00
sigma_root       4.17   936 1.01
Intercept[1]    -2.05  2563 1.00
Intercept[2]    -2.01  2596 1.00
Intercept[3]    -1.95  2708 1.00
Intercept[4]    -1.21  3863 1.00
Intercept[5]     0.40  5636 1.00
lp__         -4034.93  1028 1.00

Samples were drawn using NUTS(diag_e) at Mon Feb 12 02:04:48 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).
(loo_brownian <- loo(fit_brownian))
Warning message:
“Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
”

Computed from 4000 by 1103 log-likelihood matrix

         Estimate   SE
elpd_loo  -1045.2 31.4
p_loo       200.5  8.3
looic      2090.4 62.8
------
Monte Carlo SE of elpd_loo is NA.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     939   85.1%   303       
 (0.5, 0.7]   (ok)        98    8.9%   135       
   (0.7, 1]   (bad)       64    5.8%   48        
   (1, Inf)   (very bad)   2    0.2%   54        
See help('pareto-k-diagnostic') for details.
# Extract the generated quantities
generated_quantities <- extract(fit_brownian)
Y <- ordinal_data$Y
empirical <- tibble(
    tv = total_variance(Y),
    lv = lineage_wise_variance(Y, variable),
    cv = crossfamily_variance(Y, variable)
)
prior_pc <- tibble(
    tv = apply(generated_quantities$y_prior, 1, total_variance),
    lv = apply(
        generated_quantities$y_prior, 1, function(x) lineage_wise_variance(x, variable)
    ),
    cv = apply(
        generated_quantities$y_prior, 1, function(x) crossfamily_variance(x, variable)
    ),
)
predictive_plot(prior_pc)

posterior_pc <- tibble(
    tv = apply(generated_quantities$y_posterior, 1, total_variance),
    lv = apply(
        generated_quantities$y_posterior, 1, function(x) lineage_wise_variance(x, variable)
    ),
    cv = apply(
        generated_quantities$y_posterior, 1, function(x) crossfamily_variance(x, variable)
    ),
)
predictive_plot(posterior_pc)

#fit_mixed7 <- readRDS("models/fit_mixed7.rds")
fit_mixed7 <- rstan::sampling(
   model_mixed,
   data = phylo_data,
   chains = 4,
   iter=40000,
   thin=20,
   control=list(max_treedepth=15)
)
Warning message:
“There were 4 chains where the estimated Bayesian Fraction of Missing Information was low. See
https://mc-stan.org/misc/warnings.html#bfmi-low”
Warning message:
“Examine the pairs() plot to diagnose sampling problems
”
Warning message:
“Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
Running the chains for more iterations may help. See
https://mc-stan.org/misc/warnings.html#bulk-ess”
Warning message:
“Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
Running the chains for more iterations may help. See
https://mc-stan.org/misc/warnings.html#tail-ess”
saveRDS(fit_mixed7, "models/fit_mixed7.rds")
print(fit_mixed7, pars=c("mu", "sigma", "sigma_root", "Intercept", "sigma_error", "lp__"))
Inference for Stan model: anon_model.
4 chains, each with iter=40000; warmup=20000; thin=20; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

                 mean se_mean     sd     2.5%      25%      50%      75%
mu              -1.06    0.01   0.35    -1.77    -1.30    -1.05    -0.82
sigma            1.75    0.00   0.27     1.28     1.56     1.74     1.92
sigma_root       3.33    0.01   0.41     2.60     3.03     3.30     3.59
Intercept[1]    -2.47    0.00   0.21    -2.90    -2.61    -2.47    -2.33
Intercept[2]    -2.44    0.00   0.21    -2.86    -2.57    -2.43    -2.29
Intercept[3]    -2.36    0.00   0.21    -2.79    -2.50    -2.36    -2.22
Intercept[4]    -1.60    0.00   0.19    -1.98    -1.72    -1.59    -1.47
Intercept[5]     0.06    0.00   0.17    -0.29    -0.05     0.07     0.18
sigma_error      0.23    0.01   0.09     0.12     0.17     0.21     0.26
lp__         -3960.97   29.75 373.59 -4782.49 -4175.64 -3922.39 -3694.76
                97.5% n_eff Rhat
mu              -0.38  4000 1.00
sigma            2.33  4011 1.00
sigma_root       4.23  4011 1.00
Intercept[1]    -2.08  3931 1.00
Intercept[2]    -2.04  3968 1.00
Intercept[3]    -1.98  3961 1.00
Intercept[4]    -1.24  3993 1.00
Intercept[5]     0.39  3929 1.00
sigma_error      0.45   144 1.03
lp__         -3329.02   158 1.03

Samples were drawn using NUTS(diag_e) at Mon Feb 12 03:24:58 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).
(loo_mixed <- loo(fit_mixed7))
Warning message:
“Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
”

Computed from 4000 by 1103 log-likelihood matrix

         Estimate   SE
elpd_loo  -1043.6 31.3
p_loo       210.0  8.3
looic      2087.1 62.7
------
Monte Carlo SE of elpd_loo is NA.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     933   84.6%   366       
 (0.5, 0.7]   (ok)        94    8.5%   216       
   (0.7, 1]   (bad)       73    6.6%   41        
   (1, Inf)   (very bad)   3    0.3%   27        
See help('pareto-k-diagnostic') for details.
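Roughly 7% of observations have Pareto k above 0.7, so elpd_loo for the mixed model should be read with caution. One option, assuming an rstan version (>= 2.21) whose `loo` method supports the `moment_match` argument, is moment-matched PSIS-LOO, which adapts the importance-sampling proposal for the flagged observations without a full refit:

```r
# re-estimate LOO with moment matching for the high-k observations
# (can be slow when many observations are flagged)
loo_mixed_mm <- rstan::loo(fit_mixed7, moment_match = TRUE)
print(loo_mixed_mm)
```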
# Extract the generated quantities
generated_quantities <- extract(fit_mixed7)
Y <- ordinal_data$Y
empirical <- tibble(
    tv = total_variance(Y),
    lv = lineage_wise_variance(Y, variable),
    cv = crossfamily_variance(Y, variable)
)
prior_pc <- tibble(
    tv = apply(generated_quantities$y_prior, 1, total_variance),
    lv = apply(
        generated_quantities$y_prior, 1, function(x) lineage_wise_variance(x, variable)
    ),
    cv = apply(
        generated_quantities$y_prior, 1, function(x) crossfamily_variance(x, variable)
    ),
)
predictive_plot(prior_pc)

posterior_pc <- tibble(
    tv = apply(generated_quantities$y_posterior, 1, total_variance),
    lv = apply(
        generated_quantities$y_posterior, 1, function(x) lineage_wise_variance(x, variable)
    ),
    cv = apply(
        generated_quantities$y_posterior, 1, function(x) crossfamily_variance(x, variable)
    ),
)
predictive_plot(posterior_pc)

model comparison

tibble(model = c("ordinal", "brownian", "mixed"),
       looic = c(
            loo_ordinal$estimates[3,1],
            loo_brownian$estimates[3,1],
            loo_mixed$estimates[3,1]
       )
) %>%
mutate(delta_looic = round(looic-min(looic), 3)) %>%
arrange(delta_looic)
A tibble: 3 × 3
model       looic  delta_looic
<chr>       <dbl>        <dbl>
mixed    2087.136        0.000
brownian 2090.430        3.295
ordinal  2646.801      559.666

wrapping up

For all seven variables, the mixed model clearly outperforms the non-phylogenetic baseline (binary or ordinal, depending on the variable) and is about as good as, or slightly better than, the pure Brownian-motion model. It is therefore justified to use the mixed model for the downstream tasks.

# # create directory "models"
# dir.create("models")
z_error <- list(
    extract(fit_mixed1, pars="z_error")$z_error,
    extract(fit_mixed2, pars="z_error")$z_error,
    extract(fit_mixed3, pars="z_error")$z_error,
    extract(fit_mixed4, pars="z_error")$z_error,
    extract(fit_mixed5, pars="z_error")$z_error,
    extract(fit_mixed6, pars="z_error")$z_error,
    extract(fit_mixed7, pars="z_error")$z_error
)
z <- list(
    extract(fit_mixed1, pars="z")$z,
    extract(fit_mixed2, pars="z")$z,
    extract(fit_mixed3, pars="z")$z,
    extract(fit_mixed4, pars="z")$z,
    extract(fit_mixed5, pars="z")$z,
    extract(fit_mixed6, pars="z")$z,
    extract(fit_mixed7, pars="z")$z
)
variables <- colnames(d)[3:9]
# For each of the seven variables, add a residuals column to d:
# NA where the variable itself is missing, otherwise the posterior
# mean of the corresponding z_error draws

for (i in 1:7) {
    d[[paste0(variables[i], "_residuals")]] <- NA
    d[[paste0(variables[i], "_residuals")]][!is.na(d[[variables[i]]])] <- apply(z_error[[i]], 2, mean) 
}
tip_indices <- match(d$glottocode, nodes$label)
for (i in 1:7) {
    d[[paste0(variables[i], "_z")]] <- apply(z[[i]], 2, mean)[tip_indices]
}
write_csv(d, "../data/data_with_latent_variables.csv")
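Since the point of exporting these residuals is the correlation analysis announced at the top, a quick preview can be computed directly from the new columns (a sketch; `pairwise.complete.obs` handles the per-variable missingness):

```r
# pairwise correlations of the posterior-mean residuals across variables
d %>%
    select(ends_with("_residuals")) %>%
    cor(use = "pairwise.complete.obs") %>%
    round(2)
```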