Constructing MrBayes scripts for phylogenetic inference

Overview

In my paper "Global-scale phylogenetic linguistic inference from lexical resources", I describe a method for extracting character matrices from ASJP data. The code and data obtained by applying this workflow to version 19 of ASJP are stored at https://osf.io/a97sz/.

This script extracts the character vectors for a predefined set of glottocodes from the data in the OSF repository and creates MrBayes scripts, one for each language family represented in that set of glottocodes.

Workflow

Preamble: activating the local environment and loading packages

Note that I load PyCall and import the Python package ete3, which is very convenient for manipulating phylogenies.

using Pkg
Pkg.activate(".")
Pkg.instantiate()
  Activating project at `~/projects/research/natalia_causality/code`
using CSV
using DataFrames
using JSON
using HTTP
using Pipe
using ProgressMeter
using PyCall
ete3 = pyimport("ete3")
using Random

Loading and preparing data

d = CSV.File("../data/data_with_counts.csv") |> DataFrame
d[shuffle(1:end)[1:10], :]
10×13 DataFrame
Row  ISO      Glottocode  Language         Family             Area           N_Speakers  Case     AP_Entropy  VerbFinal  VerbMiddle  v1     vm     vf
     String3  String15    String           String31           String15       Int64       String3  Float64     Float64    Float64     Int64  Int64  Int64
1    tab      taba1259    Tabasaran        Nakh-Daghestanian  Eurasia        87200       1        0.681656    0.831325   0.156627    1      13     69
2    mta      cota1241    Cotabato Manobo  Austronesian       Eurasia        30000       0        0.598347    0.0545455  0.345455    33     19     3
3    ukr      ukra1253    Ukrainian        Indo-European      Eurasia        30800000    1        0.724605    0.258389   0.52349     65     156    77
4    tat      tata1255    Tatar            Turkic             Eurasia        4070000     1        0.564914    0.884956   0.0973451   2      11     100
5    usp      uspa1245    Uspanteco        Mayan              North America  5130        0        0.811278    0.210526   0.618421    13     47     16
6    sgb      maga1263    Mag-Anchi Ayta   Austronesian       Eurasia        4200        NA       0.680077    0.08       0.38        27     19     4
7    kwj      kwan1278    Kwanga           Sepik              Oceania        10000       NA       0.958712    0.690476   0.25        5      21     58
8    gag      gaga1249    Gagauz           Turkic             Eurasia        115000      1        0.549253    0.319218   0.635179    14     195    98
9    lit      lith1251    Lithuanian       Indo-European      Eurasia        2400000     1        0.449282    0.261224   0.722449    4      177    64
10   ksd      kuan1248    Kuanua           Austronesian       Oceania        148000      0        0.283398    0.0633803  0.922535    2      131    9

This dataset contains the glottocode kose1239, which has been retired in the current version (4.8) of Glottolog; see https://glottolog.org/resource/languoid/id/kose1239. The closest usable counterpart is awiy1238 (https://glottolog.org/resource/languoid/id/awiy1238).

glottocode_corrected = Dict{String, String}(zip(d.Glottocode, d.Glottocode))
glottocode_corrected["kose1239"] = "awiy1238"

insertcols!(d, 2, :glottocode_corrected => [glottocode_corrected[x] for x in d.Glottocode]);
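The same correction can also be expressed with a sparse replacement table and a default lookup, which may be easier to extend if further codes turn out to be retired. This is a minimal sketch in plain Julia; the list `codes` is a made-up stand-in for `d.Glottocode`, and only the kose1239 to awiy1238 replacement is taken from the text above.

```julia
# Hypothetical toy list of glottocodes standing in for d.Glottocode.
codes = ["taba1259", "kose1239", "ukra1253"]

# Sparse table holding only the codes that need to be replaced.
retired = Dict("kose1239" => "awiy1238")

# `get` returns the replacement if one exists and the original code otherwise.
corrected = [get(retired, c, c) for c in codes]
```

With this variant the identity mapping for all other glottocodes does not have to be stored explicitly.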

Getting the ASJP data

There are two types of characters in the OSF repo, cognate class characters and soundclass-concept characters. The two character matrices are downloaded and loaded in turn.

file_id = "h4a6z"
url = "https://api.osf.io/v2/files/$(file_id)/"
response = HTTP.get(url)
data = JSON.parse(String(response.body))

download_url = data["data"]["links"]["download"]


world_cc_ = DataFrame(
    hcat(
        split.(
            split(read(download(download_url), String), "\n")[2:end]
        )...
    ) |> permutedims, :auto)

rename!(world_cc_, :x1 => :longname, :x2 => :characters)


world_cc = @pipe world_cc_.characters |>
                 mapslices(x -> split.(x, ""), _, dims=1) |>
                 hcat(_...) |>
                 permutedims |>
                 DataFrame(_, :auto) |>
                 insertcols!(_, 1, :longname => world_cc_.longname)
7432×45705 DataFrame
45605 columns and 7407 rows omitted
Row longname x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13 x14 x15 x16 x17 x18 x19 x20 x21 x22 x23 x24 x25 x26 x27 x28 x29 x30 x31 x32 x33 x34 x35 x36 x37 x38 x39 x40 x41 x42 x43 x44 x45 x46 x47 x48 x49 x50 x51 x52 x53 x54 x55 x56 x57 x58 x59 x60 x61 x62 x63 x64 x65 x66 x67 x68 x69 x70 x71 x72 x73 x74 x75 x76 x77 x78 x79 x80 x81 x82 x83 x84 x85 x86 x87 x88 x89 x90 x91 x92 x93 x94 x95 x96 x97 x98 x99
SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin…
1 AA.DIZOID.NAO 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 AuA.KHASIAN.KHASI 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 AuA.KHASIAN.KHASI_2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 AuA.KHASIAN.LYNGNGAM 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5 AuA.KHASIAN.PNAR_JOWAI 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
6 AuA.KHASIAN.WAR_JAINTIA 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
7 Gun.GUNWINYGIC.BUAN 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
8 Hok.YUMAN.YAVAPAI 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
9 Iwa.IWAIDJAN.AMURDAK 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
10 Iwa.IWAIDJAN.IWAIDJA 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
11 LSR.GRASS.ABU 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
12 NC.KWA.AJAGBE 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
13 NC.NORTHERN_ATLANTIC.WOLOF_8 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
7421 NC.BANTOID.NYANJA_NYASA - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
7422 NC.KAINJI.KUKI - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
7423 NC.KAINJI.REGI - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
7424 NC.KAINJI.ROGO - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
7425 NC.KAINJI.SHAMA - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
7426 NC.KWA.AKPAFU - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
7427 NC.PLATOID.BEROM_F - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
7428 AA.BERBER.CHAOUI - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
7429 Man.WESTERN_MANDE.SEEKU - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
7430 An.CELEBIC.TOLAKI - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
7431 AA.WEST_CHADIC.DERA - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
7432 NC.KAINJI.SEGEMUK - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
file_id = "3em9h"
url = "https://api.osf.io/v2/files/$(file_id)/"
response = HTTP.get(url)
data = JSON.parse(String(response.body))

download_url = data["data"]["links"]["download"]


world_sc_ = DataFrame(
    hcat(
        split.(
            split(read(download(download_url), String), "\n")[2:end]
        )...
    ) |> permutedims, :auto)


rename!(world_sc_, :x1 => :longname, :x2 => :characters)


world_sc = @pipe world_sc_.characters |>
                 mapslices(x -> split.(x, ""), _, dims=1) |>
                 hcat(_...) |>
                 permutedims |>
                 DataFrame(_, :auto) |>
                 insertcols!(_, 1, :longname => world_sc_.longname)
7432×1641 DataFrame
1541 columns and 7407 rows omitted
Row longname x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13 x14 x15 x16 x17 x18 x19 x20 x21 x22 x23 x24 x25 x26 x27 x28 x29 x30 x31 x32 x33 x34 x35 x36 x37 x38 x39 x40 x41 x42 x43 x44 x45 x46 x47 x48 x49 x50 x51 x52 x53 x54 x55 x56 x57 x58 x59 x60 x61 x62 x63 x64 x65 x66 x67 x68 x69 x70 x71 x72 x73 x74 x75 x76 x77 x78 x79 x80 x81 x82 x83 x84 x85 x86 x87 x88 x89 x90 x91 x92 x93 x94 x95 x96 x97 x98 x99
SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin…
1 AA.DIZOID.NAO 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 AuA.KHASIAN.KHASI 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
3 AuA.KHASIAN.KHASI_2 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 0 0 0 0 0
4 AuA.KHASIAN.LYNGNGAM 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0
5 AuA.KHASIAN.PNAR_JOWAI 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0 0
6 AuA.KHASIAN.WAR_JAINTIA 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 0 0 0 1 0
7 Gun.GUNWINYGIC.BUAN 0 1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
8 Hok.YUMAN.YAVAPAI 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0
9 Iwa.IWAIDJAN.AMURDAK 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 1 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
10 Iwa.IWAIDJAN.IWAIDJA 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 1 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
11 LSR.GRASS.ABU 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
12 NC.KWA.AJAGBE 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
13 NC.NORTHERN_ATLANTIC.WOLOF_8 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 - - - - - - - - - - - - - - - - -
7421 NC.BANTOID.NYANJA_NYASA - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
7422 NC.KAINJI.KUKI - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
7423 NC.KAINJI.REGI - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
7424 NC.KAINJI.ROGO - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
7425 NC.KAINJI.SHAMA - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
7426 NC.KWA.AKPAFU - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
7427 NC.PLATOID.BEROM_F - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
7428 AA.BERBER.CHAOUI - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
7429 Man.WESTERN_MANDE.SEEKU - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
7430 An.CELEBIC.TOLAKI - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
7431 AA.WEST_CHADIC.DERA - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
7432 NC.KAINJI.SEGEMUK - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
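The parsing step used for both matrices can be illustrated on a tiny in-memory example. The sketch below uses only Base Julia and assumes, as the code above does, that the file has one header line followed by one line per doculect with a name and a character string separated by whitespace. The string `raw`, including its header line, is invented for illustration, and the cells are stored as `Char` rather than as the one-character substrings used above.

```julia
# Toy stand-in for a downloaded file; header and rows are made up.
raw = "2 4\nAA.DIZOID.NAO 10-0\nAuA.KHASIAN.KHASI 01-1\n"

# Drop the header line and any empty lines (e.g. from a trailing newline).
lines = filter(!isempty, split(raw, "\n")[2:end])

# Each remaining line holds a name and a character string.
fields = split.(lines)
longnames = String.(first.(fields))

# One row per doculect, one column per character state.
matrix = permutedims(reduce(hcat, [collect(f[2]) for f in fields]))
```

Filtering out empty lines is a defensive addition of this sketch; it keeps `hcat` from failing on a zero-length column if the file ends with a newline.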

Next, I fetch and prepare the metadata for the ASJP doculects.

file_id = "w4jnf"
url = "https://api.osf.io/v2/files/$(file_id)/"
response = HTTP.get(url)
data = JSON.parse(String(response.body))

download_url = data["data"]["links"]["download"]


asjp_languages = @pipe CSV.read(
                           download(download_url),
                           missingstring="",
                           DataFrame) |>
                       dropmissing(_, :classification_wals) |>
                       dropmissing(_, :Glottocode) |>
                       filter(row -> row.recently_extinct == 0, _) |>
                       filter(row -> row.long_extinct == 0, _) |>
                       select(_, [:Name, :Glottocode, :classification_wals]) |>
                       DataFrames.transform(_, [:classification_wals, :Name] => ByRow((x, y) -> string(x, ".", y)) => :longname) |>
                       select(_, Not(:classification_wals)) |>
                       DataFrames.transform(_, :longname => ByRow(x -> replace(x, "-" => "_")) => :longname) |>
                       dropmissing
8953×3 DataFrame
8928 rows omitted
Row Name Glottocode longname
String String15 String
1 A51_BAFIA_MAJA lefa1242 NC.BANTOID.A51_BAFIA_MAJA
2 A51_BAFIA_TUMI_TINGON lefa1242 NC.BANTOID.A51_BAFIA_TUMI_TINGON
3 A51_BAFIA_ZAKAAN lefa1242 NC.BANTOID.A51_BAFIA_ZAKAAN
4 A53_BAFIA_RIKPA bafi1243 NC.BANTOID.A53_BAFIA_RIKPA
5 A54_BAFIA_NJANTI tibe1274 NC.BANTOID.A54_BAFIA_NJANTI
6 A60_GUNU nugu1242 NC.BANTOID.A60_GUNU
7 A60_MMAALA mmaa1238 NC.BANTOID.A60_MMAALA
8 A61_NGORO_ASOM tuki1240 NC.BANTOID.A61_NGORO_ASOM
9 A62_KALONGE yang1293 NC.BANTOID.A62_KALONGE
10 A72a_EWONDO ewon1239 NC.BANTOID.A72a_EWONDO
11 AASAX aasa1238 AA.SOUTHERN_CUSHITIC.AASAX
12 ABAGA abag1245 TNG.EASTERN_HIGHLANDS.ABAGA
13 ABANYOM aban1242 NC.BANTOID.ABANYOM
8942 ZOOMBO_3 koon1244 NC.BANTOID.ZOOMBO_3
8943 ZOOMBO_4 koon1244 NC.BANTOID.ZOOMBO_4
8944 ZOQUE_FRANCISCO_LEON fran1266 MZ.MIXE_ZOQUE.ZOQUE_FRANCISCO_LEON
8945 ZOQUE_RAYON rayo1235 MZ.MIXE_ZOQUE.ZOQUE_RAYON
8946 ZUGUNUK_KALASHA kala1372 IE.INDIC.ZUGUNUK_KALASHA
8947 ZULGO zulg1242 AA.BIU_MANDARA.ZULGO
8948 ZULU zulu1248 NC.BANTOID.ZULU
8949 ZULU_2 zulu1248 NC.BANTOID.ZULU_2
8950 ZULU_NKANDLA zulu1248 NC.BANTOID.ZULU_NKANDLA
8951 ZUMBUN zumb1240 AA.WEST_CHADIC.ZUMBUN
8952 ZUNI zuni1245 Zun.ZUNI.ZUNI
8953 ZWAY zayy1238 AA.SEMITIC.ZWAY

I use my own naming convention for ASJP doculects: [WALS family name].[WALS genus name].[doculect name]. These long names must be matched with glottocodes.

longname2glottocode = Dict{String, String}(
    zip(asjp_languages.longname, asjp_languages.Glottocode)
)

glottocode2longname = Dict{String, String}(
    zip(asjp_languages.Glottocode, asjp_languages.longname)
)

for l in d.Glottocode
    if l ∉ keys(glottocode2longname)
        longname2glottocode[l] = l
        glottocode2longname[l] = l
    end
end
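The naming convention and the two lookup tables can be tried out on a small example. This is a sketch in plain Julia; `make_longname` is a hypothetical helper that mirrors the `string(x, ".", y)` and `replace(x, "-" => "_")` steps above, and the hyphenated doculect name is invented. The two Zulu entries are taken from the metadata table shown earlier.

```julia
# Hypothetical helper: join classification and doculect name with a dot
# and replace hyphens with underscores, as in the pipeline above.
make_longname(classification, name) =
    replace(string(classification, ".", name), "-" => "_")

# Invented input to show the hyphen replacement.
example = make_longname("IE.GERMANIC", "LOW-GERMAN")

# Two doculects sharing one glottocode (from the ASJP metadata above).
doculects = ["NC.BANTOID.ZULU", "NC.BANTOID.ZULU_2"]
codes = ["zulu1248", "zulu1248"]

# Long names are unique, so this direction loses nothing.
l2g = Dict(zip(doculects, codes))

# In the reverse direction later pairs overwrite earlier ones,
# so only one doculect per glottocode survives.
g2l = Dict(zip(codes, doculects))
```

Because the reverse dictionary keeps a single arbitrary doculect per glottocode, it is only safe for membership checks such as the loop above, not for choosing a representative doculect.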

Next, I restrict the character matrices to the doculects for which I have a glottocode.

filter!(row -> row.longname ∈ asjp_languages.longname, world_cc)
filter!(row -> row.longname ∈ asjp_languages.longname, world_sc)
7034×1641 DataFrame
1541 columns and 7009 rows omitted
Row longname x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13 x14 x15 x16 x17 x18 x19 x20 x21 x22 x23 x24 x25 x26 x27 x28 x29 x30 x31 x32 x33 x34 x35 x36 x37 x38 x39 x40 x41 x42 x43 x44 x45 x46 x47 x48 x49 x50 x51 x52 x53 x54 x55 x56 x57 x58 x59 x60 x61 x62 x63 x64 x65 x66 x67 x68 x69 x70 x71 x72 x73 x74 x75 x76 x77 x78 x79 x80 x81 x82 x83 x84 x85 x86 x87 x88 x89 x90 x91 x92 x93 x94 x95 x96 x97 x98 x99
SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin…
1 AA.DIZOID.NAO 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 AuA.KHASIAN.KHASI 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
3 AuA.KHASIAN.KHASI_2 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 0 0 0 0 0
4 AuA.KHASIAN.LYNGNGAM 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0
5 AuA.KHASIAN.PNAR_JOWAI 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0 0
6 AuA.KHASIAN.WAR_JAINTIA 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 0 0 0 1 0
7 Gun.GUNWINYGIC.BUAN 0 1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
8 Hok.YUMAN.YAVAPAI 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0
9 Iwa.IWAIDJAN.AMURDAK 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 1 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
10 Iwa.IWAIDJAN.IWAIDJA 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 1 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
11 LSR.GRASS.ABU 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
12 NC.KWA.AJAGBE 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
13 NC.NORTHERN_ATLANTIC.WOLOF_8 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 - - - - - - - - - - - - - - - - -
7023 NC.BANTOID.NYANJA_NYASA - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
7024 NC.KAINJI.KUKI - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
7025 NC.KAINJI.REGI - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
7026 NC.KAINJI.ROGO - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
7027 NC.KAINJI.SHAMA - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
7028 NC.KWA.AKPAFU - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
7029 NC.PLATOID.BEROM_F - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
7030 AA.BERBER.CHAOUI - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
7031 Man.WESTERN_MANDE.SEEKU - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
7032 An.CELEBIC.TOLAKI - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
7033 AA.WEST_CHADIC.DERA - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
7034 NC.KAINJI.SEGEMUK - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

ASJP sometimes contains several doculects for the same glottocode. Therefore I now compute the number of missing entries for each doculect. For each glottocode, I select the ASJP doculect with fewest missing entries as representative.

insertcols!(
    world_cc,
    1,
    :Glottocode => [longname2glottocode[x] for x in world_cc.longname]
)

insertcols!(
    world_sc,
    1,
    :Glottocode => [longname2glottocode[x] for x in world_sc.longname]
)



best_languages = @pipe world_sc |> 
    DataFrame(
        longname = _.longname,
        Glottocode = _.Glottocode,
        nGaps = map(x -> sum(Array(x) .== "-"), eachrow(_))
    ) |> 
    sort(_, :nGaps) |>
    unique(_, :Glottocode).longname 
4261-element Vector{SubString{String}}:
 "AuA.KHASIAN.KHASI"
 "Hok.YUMAN.YAVAPAI"
 "Iwa.IWAIDJAN.IWAIDJA"
 "ST.BODIC.BUNAN"
 "ST.BODIC.EASTERN_BALTI"
 "ST.BODIC.GHACHOK"
 "ST.BODIC.HELAMBU_SHERPA"
 "ST.BODIC.KAGATE"
 "ST.BODIC.LHASA_TIBETAN"
 "ST.BODIC.LOWA"
 "ST.BODIC.MANANGE"
 "ST.BODIC.PATTANI"
 "ST.BODIC.PURIK"
 ⋮
 "Hok.YUMAN.MARICOPA"
 "NC.BANTOID.FANG"
 "NC.BANTOID.NJEN"
 "TNG.BINANDEREAN.GAINA"
 "NDe.ATHAPASKAN.HAN"
 "TNG.BINANDEREAN.OROKAIVA_SOSE"
 "ESu.NILOTIC.SOGOO"
 "NC.BANTOID.KOSHIN"
 "CSu.BONGO_BAGIRMI.GULA_SARA"
 "An.GREATER_CENTRAL_PHILIPPINE.MANDAYAN_ISLAM_PISO"
 "An.OCEANIC.PENRHYN"
 "AA.BIU_MANDARA.VEMGO_MABAS_2"
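The selection rule (sort by number of gaps, then keep the first doculect per glottocode) can be checked on a toy example. The sketch below uses plain Julia instead of DataFrames; all names, glottocodes, and character strings in it are invented.

```julia
# Hypothetical doculects as (longname, glottocode, character string).
doculects = [
    ("X.Y.A",   "aaaa1234", "10--"),
    ("X.Y.A_2", "aaaa1234", "101-"),
    ("X.Z.B",   "bbbb1234", "0---"),
]

# Count gap symbols per doculect and sort in ascending order.
ranked = sort(doculects, by = t -> count(==('-'), t[3]))

# Keep the first, i.e. least gappy, doculect for each glottocode.
best = Dict{String, String}()
for (longname, glottocode, _) in ranked
    get!(best, glottocode, longname)  # only inserts if the key is new
end
```

Here `get!` plays the role of `unique(_, :Glottocode)` in the pipeline above: once a glottocode has a representative, later and gappier doculects are ignored.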

The character matrices are now restricted to the doculects that represent a glottocode.

filter!(row -> row.longname ∈ best_languages, world_cc)
filter!(row -> row.longname ∈ best_languages, world_sc)

select!(world_cc, Not(:longname))
select!(world_sc, Not(:longname))
4261×1641 DataFrame
1541 columns and 4236 rows omitted
Row Glottocode x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13 x14 x15 x16 x17 x18 x19 x20 x21 x22 x23 x24 x25 x26 x27 x28 x29 x30 x31 x32 x33 x34 x35 x36 x37 x38 x39 x40 x41 x42 x43 x44 x45 x46 x47 x48 x49 x50 x51 x52 x53 x54 x55 x56 x57 x58 x59 x60 x61 x62 x63 x64 x65 x66 x67 x68 x69 x70 x71 x72 x73 x74 x75 x76 x77 x78 x79 x80 x81 x82 x83 x84 x85 x86 x87 x88 x89 x90 x91 x92 x93 x94 x95 x96 x97 x98 x99
String SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin…
1 nayi1243 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 khas1269 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
3 lyng1241 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0
4 pnar1238 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0 0
5 warj1242 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 0 0 0 1 0
6 ngal1292 0 1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
7 hava1248 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0
8 amar1271 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 1 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
9 iwai1244 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 1 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
10 abuu1241 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
11 ajab1235 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
12 nucl1347 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 - - - - - - - - - - - - - - - - -
13 amah1246 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0
4250 rapa1244 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
4251 toro1253 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0
4252 amas1236 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1
4253 lagw1237 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0
4254 vemg1240 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
4255 rogo1238 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4256 sham1278 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4257 siwu1238 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4258 tach1249 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
4259 seek1238 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4260 dera1248 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4261 east2403 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

For the glottocodes in the target set for which there are no ASJP data, a character vector consisting of missing entries is constructed. Then, the character matrices are restricted to the glottocodes from the target set.

for l in setdiff(d.glottocode_corrected, world_sc.Glottocode)
    nl_cc = repeat(["-"], size(world_cc, 2))
    nl_cc[1] = l
    push!(world_cc, nl_cc)
    nl_sc = repeat(["-"], size(world_sc, 2))
    nl_sc[1] = l
    push!(world_sc, nl_sc)
end

filter!(row -> row.Glottocode ∈ d.glottocode_corrected, world_cc)
filter!(row -> row.Glottocode ∈ d.glottocode_corrected, world_sc)
827×1641 DataFrame
1541 columns and 802 rows omitted
Row Glottocode x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13 x14 x15 x16 x17 x18 x19 x20 x21 x22 x23 x24 x25 x26 x27 x28 x29 x30 x31 x32 x33 x34 x35 x36 x37 x38 x39 x40 x41 x42 x43 x44 x45 x46 x47 x48 x49 x50 x51 x52 x53 x54 x55 x56 x57 x58 x59 x60 x61 x62 x63 x64 x65 x66 x67 x68 x69 x70 x71 x72 x73 x74 x75 x76 x77 x78 x79 x80 x81 x82 x83 x84 x85 x86 x87 x88 x89 x90 x91 x92 x93 x94 x95 x96 x97 x98 x99
String SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin…
1 nucl1347 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 - - - - - - - - - - - - - - - - -
2 akha1245 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0
3 lahu1253 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 acha1249 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
5 nucl1310 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
6 kach1280 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0
7 arab1268 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
8 naas1242 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
9 lith1251 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 1 0 1 0 1 0 1 1 0 0 0 1 1 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
10 wels1247 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
11 mode1248 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
12 guja1252 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 0 1 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
13 iron1242 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
816 east2398 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
817 diux1235 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
818 magd1235 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
819 nucl1454 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
820 ozol1235 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
821 rinc1236 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
822 sant1450 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
823 taba1268 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
824 coat1244 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
825 zouu1235 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
826 chic1274 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
827 yate1242 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

In the cognate class character matrix, all columns containing no 1 are removed.

select!(world_cc, Not([
    i for i in axes(world_cc, 2)[2:end] if sum(world_cc[:, i] .== "1") == 0
]))
827×10190 DataFrame
10090 columns and 802 rows omitted
Row Glottocode x1 x3 x4 x5 x7 x8 x10 x13 x22 x23 x25 x32 x35 x36 x39 x41 x44 x45 x48 x49 x56 x57 x59 x60 x63 x66 x68 x69 x72 x73 x79 x80 x84 x90 x92 x93 x97 x101 x107 x108 x112 x118 x120 x121 x123 x124 x128 x130 x131 x135 x137 x140 x147 x149 x151 x152 x156 x158 x163 x165 x166 x167 x170 x171 x172 x174 x177 x178 x179 x181 x189 x190 x191 x196 x199 x201 x203 x207 x210 x212 x215 x219 x220 x222 x248 x254 x257 x258 x262 x265 x266 x267 x268 x269 x271 x272 x284 x290 x301
String SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin…
1 nucl1347 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 akha1245 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 lahu1253 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 acha1249 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5 nucl1310 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
6 kach1280 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
7 arab1268 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
8 naas1242 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
9 lith1251 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
10 wels1247 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
11 mode1248 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
12 guja1252 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
13 iron1242 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
816 east2398 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
817 diux1235 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
818 magd1235 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
819 nucl1454 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
820 ozol1235 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
821 rinc1236 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
822 sant1450 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
823 taba1268 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
824 coat1244 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
825 zouu1235 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
826 chic1274 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
827 yate1242 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Then I normalize the family names so that they don’t contain spaces, apostrophes or hyphens, because some software might not like those special characters.

d[:,:Family] = replace.(d[:,:Family], " " => "_", "-" => "_", "'" => "")
827-element Vector{String}:
 "Austronesian"
 "Angan"
 "Sepik"
 "Ndu"
 "Yareban"
 "Atlantic_Congo"
 "Austronesian"
 "Creole"
 "Sino_Tibetan"
 "Mayan"
 "Chicham"
 "Atlantic_Congo"
 "Nilotic"
 ⋮
 "Otomanguean"
 "Otomanguean"
 "Otomanguean"
 "Otomanguean"
 "Otomanguean"
 "Otomanguean"
 "Otomanguean"
 "Otomanguean"
 "Otomanguean"
 "Otomanguean"
 "Otomanguean"
 "Atlantic_Congo"

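To make the effect of this call concrete, here is a small sketch that applies the same replacement pairs to three Glottolog family names. The helper name `normalize_family` is hypothetical and not part of the workflow; the multi-pair form of `replace` on strings needs Julia 1.7 or later.

```julia
# Hypothetical helper wrapping the replacement pairs used above.
normalize_family(s) = replace(s, " " => "_", "-" => "_", "'" => "")

# Print each raw name next to its normalized form.
for name in ["Atlantic-Congo", "East Bird's Head", "Nuclear Trans New Guinea"]
    println(rpad(name, 26), " => ", normalize_family(name))
end
```

The normalized forms are the ones that show up in the list of families further down, for instance `East_Birds_Head`.
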
Here is a helper function that takes a DataFrame object representing a character matrix and constructs the content of a Nexus file representing that matrix.

If a family contains only two taxa, a dummy taxon is added which has all characters missing. This is required because MrBayes only works with datasets containing at least 3 taxa.

# create character matrices

function df2nexus(cm)
    pad = maximum(length.(cm.Glottocode)) + 5
    ntaxa = size(cm, 1) == 2 ? 3 : size(cm, 1)
    nex = """#Nexus
    BEGIN DATA;
    DIMENSIONS ntax=$ntaxa nchar = $(size(cm, 2)-1);
    FORMAT DATATYPE=Restriction GAP=? MISSING=- interleave=no;
    MATRIX

    """
    for i in axes(cm, 1)
        nex *= rpad(cm.Glottocode[i], pad) * join(Array(cm[i, 2:end])) * "\n"
    end
    if nrow(cm) == 2
        nex *= rpad("dummy", pad) * repeat("?", size(cm, 2)-1) * "\n"
    end
    nex *= ";\nEND;"
    nex
end
df2nexus (generic function with 1 method)
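
As an illustration of the layout this produces, the following sketch rebuilds the same kind of text from plain vectors, so it runs without DataFrames. The name `toy_nexus` and the two glottocode-like taxon names are made up for the example; it is a stand-in for `df2nexus` on a two-taxon family, not a replacement for it.

```julia
# Base-only stand-in for df2nexus: `taxa` holds the taxon names,
# `rows` holds one character string per taxon (all of equal length).
function toy_nexus(taxa::Vector{String}, rows::Vector{String})
    pad = maximum(length.(taxa)) + 5
    nchar = length(first(rows))
    # A two-taxon family is padded with a dummy taxon, so ntax becomes 3.
    ntax = length(taxa) == 2 ? 3 : length(taxa)
    lines = ["#Nexus", "BEGIN DATA;",
             "DIMENSIONS ntax=$ntax nchar=$nchar;",
             "FORMAT DATATYPE=Restriction GAP=? MISSING=- interleave=no;",
             "MATRIX"]
    for (t, r) in zip(taxa, rows)
        push!(lines, rpad(t, pad) * r)
    end
    length(taxa) == 2 && push!(lines, rpad("dummy", pad) * repeat("?", nchar))
    push!(lines, ";", "END;")  # standard NEXUS block terminator
    join(lines, "\n")
end

print(toy_nexus(["aaaa1234", "bbbb1234"], ["0110-", "01-01"]))
```

With two taxa, the printed matrix has three rows: the two real taxa and a `dummy` row consisting of `?` only.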

Next, the two character matrices are concatenated:

char_mtx = innerjoin(
        world_sc,
        world_cc,
        on=:Glottocode => :Glottocode,
        makeunique=true,
    )
827×11830 DataFrame
11730 columns and 802 rows omitted
Row Glottocode x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13 x14 x15 x16 x17 x18 x19 x20 x21 x22 x23 x24 x25 x26 x27 x28 x29 x30 x31 x32 x33 x34 x35 x36 x37 x38 x39 x40 x41 x42 x43 x44 x45 x46 x47 x48 x49 x50 x51 x52 x53 x54 x55 x56 x57 x58 x59 x60 x61 x62 x63 x64 x65 x66 x67 x68 x69 x70 x71 x72 x73 x74 x75 x76 x77 x78 x79 x80 x81 x82 x83 x84 x85 x86 x87 x88 x89 x90 x91 x92 x93 x94 x95 x96 x97 x98 x99
String SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin…
1 nucl1347 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 - - - - - - - - - - - - - - - - -
2 akha1245 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0
3 lahu1253 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 acha1249 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
5 nucl1310 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
6 kach1280 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0
7 arab1268 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
8 naas1242 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
9 lith1251 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 1 0 1 0 1 0 1 1 0 0 0 1 1 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
10 wels1247 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
11 mode1248 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
12 guja1252 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 0 1 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
13 iron1242 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
816 east2398 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
817 diux1235 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
818 magd1235 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
819 nucl1454 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
820 ozol1235 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
821 rinc1236 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
822 sant1450 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
823 taba1268 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
824 coat1244 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
825 zouu1235 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
826 chic1274 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
827 yate1242 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Next, the families are tabulated by the number of target languages they contain, and only families with more than one language are kept.

lineages = @pipe d |> 
    groupby(_, :Family) |> 
    combine(_, nrow) |> 
    sort(_,:nrow)

families = lineages.Family[lineages.nrow .> 1]
53-element PooledArrays.PooledVector{String31, UInt32, Vector{UInt32}}:
 "Border"
 "Kru"
 "Boran"
 "Paezan"
 "Huitotoan"
 "Teberan"
 "Eastern_Trans_Fly"
 "Pama_Nyungan"
 "Eskimo_Aleut"
 "North_Halmahera"
 "East_Birds_Head"
 "Angan"
 "Ndu"
 ⋮
 "Arawakan"
 "Creole"
 "Mayan"
 "Sino_Tibetan"
 "Quechuan"
 "Uto_Aztecan"
 "Afro_Asiatic"
 "Indo_European"
 "Otomanguean"
 "Nuclear_Trans_New_Guinea"
 "Atlantic_Congo"
 "Austronesian"

Now I fetch the Glottolog classification as a vector of newick strings from the Glottolog website.

glottologF = "../data/tree_glottolog_newick.txt"

isfile(glottologF) || download(
    "https://cdstar.eva.mpg.de//bitstreams/EAEA0-B701-6328-C3E3-0/tree_glottolog_newick.txt",
    glottologF
)
true
raw = readlines(glottologF);

First, some clean-up is needed to make the newick strings digestible by ete3. Then, the newick tree for each family is read in as an ete3 tree object.

trees = []

for ln in raw
    ln = strip(ln)
    ln = replace(ln, r"\'[A-ZÄÖÜ][^[]*\[" => "[")
    ln = replace(ln, r"\][^']*\'" => "]")
    ln = replace(ln, r"\[|\]" => "")
    ln = replace(ln, "''" => "")
    ln = replace(ln, ":1" => "")
    push!(
        trees,
        ete3.Tree(ln, format=1)
    )
end
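
To see what the five substitutions do, here is a minimal sketch that applies them to a single made-up line. It assumes quoted labels of the form `'Name [glottocode][iso]-l-':1`, which is what the regular expressions above are written for. The function name `clean_newick` and the glottocodes are invented for illustration, and the real file may contain label shapes that this sample does not cover.

```julia
# Hypothetical wrapper around the substitutions from the loop above.
function clean_newick(ln)
    ln = strip(ln)
    ln = replace(ln, r"\'[A-ZÄÖÜ][^[]*\[" => "[")  # drop the name in front of the glottocode
    ln = replace(ln, r"\][^']*\'" => "]")          # drop everything from the glottocode to the closing quote
    ln = replace(ln, r"\[|\]" => "")               # remove the remaining brackets
    ln = replace(ln, "''" => "")                   # remove any leftover pairs of quotes
    ln = replace(ln, ":1" => "")                   # strip the ":1" branch lengths
    return ln
end

# Made-up sample line; the glottocodes are not real.
sample = "('Lang A [aaaa1234][aaa]-l-':1,'Lang B [bbbb1234][bbb]-l-':1)'Family X [ffff1234]':1;"
println(clean_newick(sample))
```

For the sample line this prints `(aaaa1234,bbbb1234)ffff1234;`, a plain newick string whose node labels are bare glottocodes.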

Next, the Glottolog trees for the individual families are combined into a Glottolog world tree.

This Glottolog tree contains internal nodes representing glottocodes. They may have daughter nodes representing dialects. To make sure that each glottocode is a leaf, I create another leaf daughter for each named internal node and shift the name of the internal node to that new leaf.

glot = ete3.Tree()

for t in trees
    glot.add_child(t)
end

nonLeaves = [nd.name for nd in glot.traverse()
             if (nd.name != "") & !nd.is_leaf()
]

@showprogress for nm in nonLeaves
    nd = (glot & nm)
    nd.name = ""
    nd.add_child(name=nm)
end
Progress: 100%|█████████████████████████████████████████| Time: 0:01:21
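
The same transformation can be sketched without ete3 on a tiny hand-built tree. The `Node` type and the function names below are hypothetical and only serve to show the idea; the glottocode-like labels are made up.

```julia
# Minimal tree type for illustration only (not the ete3 data structure).
mutable struct Node
    name::String
    children::Vector{Node}
end
Node(name) = Node(name, Node[])

isleaf(nd::Node) = isempty(nd.children)

# For every named internal node, add a leaf daughter carrying its name
# and blank the name of the internal node itself.
function push_names_to_leaves!(nd::Node)
    for ch in nd.children
        push_names_to_leaves!(ch)
    end
    if !isleaf(nd) && nd.name != ""
        push!(nd.children, Node(nd.name))
        nd.name = ""
    end
    return nd
end

# Collect the leaf names from left to right.
leafnames(nd::Node) = isleaf(nd) ? [nd.name] : reduce(vcat, leafnames.(nd.children))

# "lang1234" is an internal node with two dialects below it.
lang = Node("lang1234", [Node("dial1111"), Node("dial2222")])
root = Node("", [lang, Node("lang5678")])
push_names_to_leaves!(root)
println(leafnames(root))
```

After the call, `lang1234` appears among the leaves next to its two dialects, so pruning the tree to a set of glottocodes can treat every glottocode as a leaf.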

Next I create a dictionary with the families from the target set as keys. For each target family, I prune the Glottolog world tree to the glottocodes from that family and store it as a newick string in the dictionary.

function family_to_tree(fm; d=d, glot=glot)
    fm_taxa = d.glottocode_corrected[d.Family.==fm]
    glot_fm = glot.copy()
    glot_fm.prune(fm_taxa)
    glot_fm.write(format=9)
end


glot_tree_dict = Dict()

for fm in families
    glot_tree_dict[fm] = family_to_tree(fm)
end

Finally the MrBayes files are created.

For each family, there are three nexus files:

  1. the file containing the character matrix, and
  2. two MrBayes scripts.

Using two MrBayes scripts is a hack because

  • I want at least 100,000 iterations per chain, and
  • I want to apply the early stop rule which terminates a chain when the topologies have sufficiently converged.

If I use the early stop rule from the outset, sampling for very small families will stop right away because there is no (or little) phylogenetic uncertainty. It is still advisable to do some sampling though, to get a good estimate for the branch lengths.

Not using the stop rule is not a good option either, because then I have to fix a sufficiently large number of iterations. For large families, this is in the tens of millions, which would be a waste of resources for smaller families.

As a compromise, I use the first script to run 10,000,000 iterations without the stop rule, and the second script to continue up to 1,000,000,000 iterations with the stop rule.

For the latter, I prepare the following analysis:

  • characters are partitioned into the cc and sc characters (see above)
  • relaxed clock
  • gamma-distributed rate variation
  • equilibrium probabilities and the clock rates are estimated for the two partitions separately
  • for the cc characters, ascertainment bias correction is conducted because all-0 columns are not included
  • the Glottolog tree is used as constraint tree
  • if the dataset contains a dummy taxon (because there are only two real taxa), the dummy taxon is treated as outgroup,
  • if the family contains \(\leq 100\) taxa, the mcmc is run with four runs and four chains, otherwise with two runs and two chains,
  • the maximum is 1,000,000,000 iterations, with an early stop if the average standard deviation of split frequencies is \(\leq 0.01\)
function create_mb_script(
    fn,
    char_mtx,
    clades,
    fm_cc,
    fm_glottocodes,
    n_iterations,
    n_runs,
    n_chains,
    append,
    stoprule
)
    mb = """#Nexus
        Begin MrBayes;
            execute $fn.nex;
            charset sc = 1-1640;
            charset cc = 1641-$(size(char_mtx, 2)-1);
            partition dtype = 2:sc, cc;
            set partition = dtype;
            prset applyto=(all) brlenspr = clock:uniform;
            prset applyto=(all) clockvarpr = igr;
            lset applyto=(all) rates=gamma;
            unlink Statefreq=(all) shape=(all) igrvar=(all) rate=(all);
            prset applyto=(all) ratepr=Dirichlet(1, 1);
            prset applyto=(2) clockratepr=exp(1.0); [for partition 2]
            lset applyto=(1) coding=all;
            lset applyto=(2) coding=noabsencesites;
        """
    if length(clades) > 1
        for (i, cl) in enumerate(clades)
            cn = join(cl, " ")
            mb *= "    constraint c$i = "
            mb *= "$cn;\n"
        end
        mb *= "    prset topologypr = constraints("
        mb *= join(["c$i" for i in 1:length(clades)], ",") * ");\n"
    end
    if length(fm_glottocodes) == 2
        mb *= "constraint c1 = $(join(fm_cc.Glottocode, " "));\n"
        mb *= "prset topologypr = constraints(c1);\n"
    end
    if length(fm_glottocodes) > 100
        mb *= "    set beagleprecision=double;\n"
    end
    mb *= """    prset brlenspr = clock:uniform;
        prset clockvarpr = igr;
        mcmcp stoprule=$stoprule stopval=0.01 filename=output/$fn samplefreq=1000;
        mcmc ngen=$n_iterations nchains=$n_chains nruns=$n_runs append=$append;
        sumt;
        sump;
        q;
    end;
    """
    return mb
end
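
To make the constraint step easier to follow, here is a minimal stand-alone sketch of how a list of clades turns into MrBayes constraint lines. The function name `constraint_block` and the taxon labels are hypothetical; the sketch mirrors only the branch of `create_mb_script` that handles more than one clade and leaves out the indentation.

```julia
# Hypothetical stand-alone version of the constraint step in create_mb_script.
# `clades` is a vector of vectors of taxon labels, one inner vector per clade.
function constraint_block(clades)
    # One "constraint" line per clade, named c1, c2, ...
    lines = ["constraint c$i = $(join(cl, " "));" for (i, cl) in enumerate(clades)]
    # A single prior statement that enforces all constraints at once
    names = join(["c$i" for i in eachindex(clades)], ",")
    push!(lines, "prset topologypr = constraints($names);")
    return join(lines, "\n")
end

# Made-up taxon labels, for illustration only
clades = [["aaaa1111", "bbbb2222"], ["aaaa1111", "bbbb2222", "cccc3333"]]
println(constraint_block(clades))
# constraint c1 = aaaa1111 bbbb2222;
# constraint c2 = aaaa1111 bbbb2222 cccc3333;
# prset topologypr = constraints(c1,c2);
```
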


mkpath("mrbayes/output")
@showprogress for (i, fm) in enumerate(families)
    fm_glottocodes = d.glottocode_corrected[d.Family.==fm]
    fn = lpad(i, 3, "0")*"_"*fm
    fm_cc = @pipe world_cc |>
        filter(row -> row.Glottocode ∈ fm_glottocodes, _)

    fm_characters = names(fm_cc)[2:end]

    informative = map(x -> sum(string.(fm_cc[:,x]) .== "1") .> 0, fm_characters)

    fm_cc = select(
        fm_cc, 
        vcat(["Glottocode"], fm_characters[informative])
    )

    fm_sc = @pipe world_sc |>
        filter(row -> row.Glottocode ∈ fm_glottocodes, _)
    fm_characters = [x for x in names(fm_sc) if x != "Glottocode"]

    fm_sc = select(
        fm_sc, 
        vcat(["Glottocode"], fm_characters)
    )
        
    char_mtx = innerjoin(
        fm_sc,
        fm_cc,
        on=:Glottocode => :Glottocode,
        makeunique=true,
    )


    fm_glot = ete3.Tree(glot_tree_dict[fm], format=1)

    internal_nodes = [
        nd for nd in fm_glot.traverse()
        if nd.is_leaf() == false && nd.is_root() == false
    ]
    clades = [x.get_leaf_names() for x in internal_nodes]

    n_iterations_head = 10_000_000
    n_iterations_tail = 1_000_000_000
    n_chains = length(fm_glottocodes) > 100 ? 4 : 2
    n_runs = length(fm_glottocodes) > 100 ? 4 : 2

    mb_head = create_mb_script(
        fn,
        char_mtx,
        clades,
        fm_cc,
        fm_glottocodes,
        n_iterations_head,
        n_runs,
        n_chains,
        "no",
        "no"
    )

    mb_tail = create_mb_script(
        fn,
        char_mtx,
        clades,
        fm_cc,
        fm_glottocodes,
        n_iterations_tail,
        n_runs,
        n_chains,
        "yes",
        "yes"
    )
    write("mrbayes/$(fn)_head.mb.nex", mb_head)
    write("mrbayes/$(fn)_tail.mb.nex", mb_tail)
    write("mrbayes/$fn.nex", df2nexus(char_mtx))
end
Progress: 100%|█████████████████████████████████████████| Time: 0:00:03

This completes data preparation for MrBayes.
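
Before launching the runs, it may be worth confirming that every family has all three files (head script, tail script and Nexus data). The following is a minimal sketch under that assumption; the helper name `missing_inputs` and the demonstration prefixes are hypothetical, and the demonstration uses a temporary directory rather than the real `mrbayes` folder.

```julia
# Hypothetical check: return the file prefixes for which at least one of
# the three expected files is missing in `dir`.
function missing_inputs(dir, prefixes)
    suffixes = ["_head.mb.nex", "_tail.mb.nex", ".nex"]
    return [p for p in prefixes
            if !all(isfile(joinpath(dir, p * s)) for s in suffixes)]
end

# Self-contained demonstration with made-up prefixes
demo = mktempdir()
for s in ["_head.mb.nex", "_tail.mb.nex", ".nex"]
    write(joinpath(demo, "001_FamilyA" * s), "")  # complete set of files
end
write(joinpath(demo, "002_FamilyB.nex"), "")      # scripts are missing

println(missing_inputs(demo, ["001_FamilyA", "002_FamilyB"]))
# ["002_FamilyB"]
```
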

In the next step, all MrBayes scripts must be run, ideally with as much parallelization as possible. I used the following shell script on a powerful server for this:

#!/bin/bash

# Script to run multiple instances of mb-mpi command using parallel processing

cd mrbayes
max_jobs=25

run_with_limit() {
  # count only jobs that are still running, not finished ones awaiting cleanup
  while [ "$(jobs -r | wc -l)" -ge "$max_jobs" ]; do
    sleep 1
  done
  mpirun -np "$1" mb-mpi "$2" &
}

# Main loop to run mb-mpi commands in parallel
for file in *_head.mb.nex; do
  if [[ "$file" == "052_Atlantic_Congo_head.mb.nex" || "$file" == "053_Austronesian_head.mb.nex" ]]; then
    run_with_limit 16 "$file"
  else
    run_with_limit 4 "$file"
  fi
done

wait

for file in *_tail.mb.nex; do
  if [[ "$file" == "052_Atlantic_Congo_tail.mb.nex" || "$file" == "053_Austronesian_tail.mb.nex" ]]; then
    run_with_limit 16 "$file"
  else
    run_with_limit 4 "$file"
  fi
done

# Block until the remaining tail runs have finished
wait

echo "All jobs finished."
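
The process counts passed to `mpirun` appear to follow from the run and chain settings chosen when the scripts were generated: the MPI version of MrBayes distributes chains over processes, so `nruns × nchains` is the largest number of processes that can be put to use. A minimal sketch of that arithmetic, assuming the 100-taxon threshold from the generating code; the helper name `mpi_processes` is hypothetical, and that the two families singled out in the shell script are the ones above the threshold is an inference, not something stated in the source.

```julia
# Hypothetical helper: number of MPI processes that matches the
# run/chain settings used when the scripts were generated.
function mpi_processes(n_taxa)
    n_runs = n_taxa > 100 ? 4 : 2
    n_chains = n_taxa > 100 ? 4 : 2
    return n_runs * n_chains  # one process per chain
end

println(mpi_processes(150))  # 16
println(mpi_processes(12))   # 4
```
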