Causal inference

Now that the distributions of the latent variables have been estimated, I’ll try my hand at causal inference, using the PC algorithm.

cd(@__DIR__)
using Pkg
Pkg.activate(".")
Pkg.instantiate()
  Activating project at `~/projects/research/natalia_causality/code`
using DataFrames
using CSV
using StatsBase
using StatsPlots
using Pipe
using CausalInference
using Turing
using Turing: Variational
using Flux
using TikzGraphs
using GraphRecipes
using GLM
using Distributions
using LinearAlgebra
using MCMCChains
using Base.Threads
using BenchmarkTools
using Random

d = CSV.File("../data/data_Oct2023_new.csv", missingstring="NA") |> DataFrame |> dropmissing

d.Glottocode[d.Glottocode .== "kose1239"] .= "awiy1238"
filter(x -> x ∉ ["sout3004", "sout3270"], d.Glottocode)
insertcols!(d, :case => ["no case", "case"][1 .+ d.Case])
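The recoding on the last line exploits Julia’s vectorized indexing: adding 1 to the 0/1 `Case` column turns it into a vector of indices into the label array. A minimal illustration of the idiom:

```julia
# Indexing a label vector with 1 .+ codes maps 0/1 codes to labels
labels = ["no case", "case"]
codes = [0, 1, 0]
println(labels[1 .+ codes])  # ["no case", "case", "no case"]
```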

d
693×11 DataFrame (668 rows omitted)

Row | ISO | Glottocode | Language | Family | Area | N_Speakers | Case | AP_Entropy | VerbFinal | VerbMiddle | case
    | String3 | String15 | String | String31 | String15 | Int64 | Int64 | Float64 | Float64 | Float64 | String
1 | aai | arif1239 | Arifama-Miniafia | Austronesian | Oceania | 3470 | 0 | 0.751032 | 0.817204 | 0.150538 | no case
2 | aau | abau1245 | Abau | Sepik | Oceania | 7500 | 1 | 0.615254 | 0.956522 | 0.0217391 | case
3 | abt | ambu1247 | Ambulas | Ndu | Oceania | 33000 | 1 | 0.847862 | 0.666667 | 0.294118 | case
4 | aby | anem1248 | Aneme Wake | Yareban | Oceania | 650 | 1 | 0.842658 | 0.65625 | 0.229167 | case
5 | acd | giky1238 | Gikyode | Atlantic-Congo | Africa | 10400 | 0 | 0.591673 | 0.190476 | 0.738095 | no case
6 | ace | achi1257 | Acehnese | Austronesian | Eurasia | 2840000 | 0 | 0.75617 | 0.128713 | 0.752475 | no case
7 | acn | acha1249 | Longchuan Achang | Sino-Tibetan | Eurasia | 27700 | 1 | 0.837306 | 0.883562 | 0.0958904 | case
8 | acr | achi1256 | Achi | Mayan | North America | 124000 | 0 | 0.721928 | 0.2 | 0.685714 | no case
9 | acu | achu1248 | Achuar-Shiwiar | Chicham | South America | 3520 | 1 | 0.637387 | 0.774194 | 0.16129 | case
10 | ade | adel1244 | Adele | Atlantic-Congo | Africa | 26400 | 0 | 0.558629 | 0.118012 | 0.763975 | no case
11 | adj | adio1239 | Adioukrou | Atlantic-Congo | Africa | 159000 | 0 | 0.799759 | 0.257143 | 0.585714 | no case
12 | aey | amel1241 | Amele | Nuclear Trans-New-Guinea | Oceania | 5300 | 0 | 0.510093 | 0.793814 | 0.14433 | no case
13 | afr | afri1274 | Afrikaans | Indo-European | Africa | 17160000 | 0 | 0.424134 | 0.740979 | 0.199742 | no case
⋮
682 | zpi | sant1451 | Santa María Quiegolani Zapotec | Otomanguean | North America | 2000 | 0 | 0.642938 | 0.0545455 | 0.2 | no case
683 | zpl | lach1249 | Lachixío Zapotec | Otomanguean | North America | 6500 | 0 | 0.451623 | 0.0810811 | 0.472973 | no case
684 | zpm | mixt1426 | Mixtepec Zapotec | Otomanguean | North America | 7000 | 0 | 0.591673 | 0.0952381 | 0.428571 | no case
685 | zpo | amat1238 | Amatlán Zapotec | Otomanguean | North America | 10000 | 0 | 0.759276 | 0.0487805 | 0.536585 | no case
686 | zpq | zoog1238 | Zoogocho Zapotec | Otomanguean | North America | 1000 | 0 | 0.867282 | 0.2 | 0.488889 | no case
687 | zpu | yala1267 | Yalálag Zapotec | Otomanguean | North America | 3500 | 0 | 0.693127 | 0.116279 | 0.372093 | no case
688 | zpv | chic1274 | Chichicapan Zapotec | Otomanguean | North America | 2720 | 0 | 0.911342 | 0.122449 | 0.408163 | no case
689 | zpz | texm1235 | Texmelucan Zapotec | Otomanguean | North America | 4630 | 0 | 0.90117 | 0.121951 | 0.487805 | no case
690 | zsr | sout3004 | Southern Rincon Zapotec | Otomanguean | North America | 12000 | 0 | 0.996134 | 0.0487805 | 0.341463 | no case
691 | ztq | quio1241 | Quioquitani-Quieri Zapotec | Otomanguean | North America | 4000 | 0 | 0.77935 | 0.0989011 | 0.604396 | no case
692 | zty | yate1242 | Yatee Zapotec | Otomanguean | North America | 5000 | 0 | 0.884115 | 0.186047 | 0.418605 | no case
693 | zul | zulu1248 | Zulu | Atlantic-Congo | Africa | 27300000 | 0 | 0.503258 | 0.0666667 | 0.777778 | no case

ap_entropy = CSV.read("../data/ap_entropy_ou.csv", DataFrame)
case = CSV.read("../data/case_ou.csv", DataFrame)
v1 = CSV.read("../data/v1_ou.csv", DataFrame)
vm = CSV.read("../data/vm_ou.csv", DataFrame)
vf = CSV.read("../data/vf_ou.csv", DataFrame);
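Before aggregating, it is worth checking that the five posterior-sample tables line up with the language data. A quick sanity check, under the assumption that each `*_ou.csv` file has one row per language and one column per posterior draw:

```julia
# All five sample tables should have the same shape,
# with as many rows as there are languages in d
@assert nrow(ap_entropy) == nrow(d)
@assert all(tbl -> size(tbl) == size(ap_entropy), (case, v1, vm, vf))
```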

Since I have a posterior distribution over the diachronic changes, there are various ways to proceed.

The simplest way is to work with the posterior means:

d_mean = DataFrame(
    ap_entropy = mapslices(mean, Array(ap_entropy), dims=2) |> vec,
    case = mapslices(mean, Array(case), dims=2) |> vec,
    v1 = mapslices(mean, Array(v1), dims=2) |> vec,
    vm = mapslices(mean, Array(vm), dims=2) |> vec,
    vf = mapslices(mean, Array(vf), dims=2) |> vec
)

d_mean[1:10,:]
10×5 DataFrame

Row | ap_entropy | case | v1 | vm | vf
    | Float64 | Float64 | Float64 | Float64 | Float64
1 | -0.0561222 | 0.146859 | 0.312116 | -0.502353 | 0.534095
2 | -0.691529 | 0.190721 | -0.442719 | 0.441969 | -0.227042
3 | -0.417357 | 0.726538 | -0.145272 | -0.0389503 | 0.276306
4 | 0.00392528 | 0.0203605 | 0.585015 | 0.443387 | -1.29164
5 | 0.00558466 | 0.0380693 | -0.496684 | -0.500937 | 1.32527
6 | 0.265089 | 0.688952 | 0.650022 | -0.442836 | 0.268705
7 | 0.060017 | 0.0344144 | 0.156385 | 0.0529664 | -0.305342
8 | 0.154984 | 0.00466574 | -0.355771 | 0.0205768 | 0.357639
9 | 0.904509 | 0.719974 | -0.514243 | 0.247453 | 0.411839
10 | 0.232673 | 0.0342125 | 0.00857667 | -0.202634 | 0.314142

corrplot(Array(d_mean), label = names(d_mean))
plot_pc_graph_tikz(pcalg(d_mean, 0.05, gausscitest), string.(names(d_mean)))
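The plot shows the estimated CPDAG, but the edges can also be inspected directly. A sketch, assuming (as in CausalInference.jl) that `pcalg` returns a Graphs.jl digraph whose vertex order matches the columns of `d_mean`:

```julia
using Graphs  # for edges, src, dst

# Print each estimated edge by variable name instead of vertex index
g = pcalg(d_mean, 0.05, gausscitest)
vars = names(d_mean)
for e in edges(g)
    println(vars[src(e)], " → ", vars[dst(e)])
end
```

Note that in this representation an undirected CPDAG edge typically shows up as a pair of arrows in both directions.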

How sensitive is the outcome to the significance level?

plot_pc_graph_tikz(pcalg(d_mean, 0.01, gausscitest), string.(names(d_mean)))

plot_pc_graph_tikz(pcalg(d_mean, 0.001, gausscitest), string.(names(d_mean)))

The results are partially contradictory; the most likely explanation is that there is simply not enough data to support a clear inference.
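One way to probe this instability is to propagate the posterior uncertainty instead of collapsing it to means: run the PC algorithm on individual posterior draws and tally how often each edge is recovered. A sketch, under the assumption that the columns of the `*_ou` tables index posterior draws:

```julia
using Graphs  # for edges, src, dst

# Run PC on `ndraws` posterior draws and return the fraction of
# draws in which each (src, dst) edge appears in the estimated graph
function edge_frequencies(; ndraws = 100, alpha = 0.05)
    counts = Dict{Tuple{Int,Int},Int}()
    for j in 1:ndraws
        dj = DataFrame(ap_entropy = ap_entropy[:, j], case = case[:, j],
                       v1 = v1[:, j], vm = vm[:, j], vf = vf[:, j])
        g = pcalg(dj, alpha, gausscitest)
        for e in edges(g)
            k = (src(e), dst(e))
            counts[k] = get(counts, k, 0) + 1
        end
    end
    Dict(k => v / ndraws for (k, v) in counts)
end

edge_frequencies()
```

Edges recovered in nearly all draws can be trusted; edges whose frequency hovers near the middle are exactly the ones that flip when the significance level changes.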