Constructing MrBayes scripts for phylogenetic inference

In my paper "Global-scale phylogenetic linguistic inference from lexical resources", I describe a method to extract character matrices from ASJP data. The code and data from applying this workflow to version 19 of ASJP are stored at https://osf.io/a97sz/.

This script extracts the character vectors for a predefined set of glottocodes from the data in the OSF repository and creates MrBayes scripts, one for each language family represented among those glottocodes.

Workflow

Preamble: activating the local environment and loading packages

Note that I load PyCall and import the Python package ete3, which is very convenient for manipulating phylogenies.

using Pkg
Pkg.activate(".")
Pkg.instantiate()
  Activating project at `~/projects/research/ordinal_decorrelation/ordinal_decorrelation_website`
using CSV
using DataFrames
using JSON
using HTTP
using Pipe
using ProgressMeter
using PyCall
ete3 = pyimport("ete3")
PyObject <module 'ete3' from '/home/gjaeger/miniconda3/envs/jupyter6/lib/python3.8/site-packages/ete3/__init__.py'>

Loading and preparing data

d = CSV.File("../data/EA_vars_glotto.csv") |> DataFrame
1291×9 DataFrame
1266 rows omitted
Row   soc_id   glottocode  political_complexity  hierarchy_within  domestic_organisation  agricultureLevel  settlement_strategy  exogamy  crop_type
      String7  String15    String3               String3           String3                String3           String3              String3  String3
1     Aa1      juho1239    1                     3                 3                      0                 1                    1        1
2     Aa2      okie1245    1                     2                 1                      0                 1                    1        1
3     Aa3      nama1265    2                     2                 1                      0                 1                    1        1
4     Aa4      dama1270    NA                    NA                2                      0                 NA                   1        NA
5     Aa5      bila1255    1                     2                 1                      0                 1                    1        1
6     Aa6      sand1273    2                     2                 1                      5                 3                    1        6
7     Aa7      naro1249    1                     2                 1                      0                 1                    1        1
8     Aa8      xamm1241    1                     2                 1                      0                 1                    NA       1
9     Aa9      hadz1240    1                     3                 3                      0                 1                    0        1
10    Ab1      here1253    1                     3                 3                      0                 1                    1        1
11    Ab10     mpon1252    3                     3                 3                      5                 3                    1        6
12    Ab11     xesi1238    4                     3                 2                      5                 3                    0        5
13    Ab12     zulu1248    4                     3                 3                      5                 3                    1        6
⋮
1280  Si9      awet1244    NA                    NA                NA                     4                 NA                   NA       NA
1281  Sj1      kara1500    1                     3                 3                      2                 2                    0        5
1282  Sj10     mbya1239    1                     2                 1                      5                 4                    0        5
1283  Sj11     xava1240    1                     4                 3                      2                 1                    0        6
1284  Sj2      xere1240    1                     3                 1                      4                 4                    0        5
1285  Sj3      xokl1240    1                     2                 2                      0                 1                    0        1
1286  Sj4      cane1242    1                     4                 3                      4                 1                    0        5
1287  Sj5      kren1239    1                     3                 4                      0                 1                    1        1
1288  Sj6      temb1276    1                     3                 4                      5                 4                    0        5
1289  Sj7      apin1244    1                     3                 4                      5                 4                    0        5
1290  Sj8      tupi1273    2                     3                 4                      4                 4                    0        5
1291  Sj9      kaya1330    1                     3                 1                      6                 1                    0        5

Next I fetch metadata from Glottolog (the CLDF release v4.8, archived on Zenodo) to assign a family to each doculect.

glottolog_cldf_zip = "../data/glottolog_cldf.zip"

isfile(glottolog_cldf_zip) || begin
    download(
        "https://zenodo.org/records/8131091/files/glottolog/glottolog-cldf-v4.8.zip?download=1",
        glottolog_cldf_zip
    )
    run(`unzip $glottolog_cldf_zip -d ../data/`)
end
true
pth = "../data/glottolog-glottolog-cldf-59a612c/cldf/"

glottolog_languages = CSV.File(joinpath(pth, "languages.csv")) |> DataFrame


function glottocode_2_family(g, glottolog_languages)
    # Find the row with the matching Glottocode
    row = findfirst(==(g), glottolog_languages.Glottocode)

    # If the Glottocode is not found, return a placeholder string
    if isnothing(row)
        return "Glottocode not found"
    end

    # Extract the family code
    family_code = glottolog_languages.Family_ID[row]

    # Check if the family code is missing
    if ismissing(family_code)
        return glottolog_languages.Name[row]
    end

    # Find the row for the family code
    family_row = findfirst(==(family_code), glottolog_languages.Glottocode)

    # If the family Glottocode is not found, return the name for the original Glottocode
    if isnothing(family_row)
        return glottolog_languages.Name[row]
    end

    # Return the name associated with the family Glottocode
    return glottolog_languages.Name[family_row]
end


insertcols!(d, :Family => [glottocode_2_family(g, glottolog_languages) for g in d.glottocode])
1291×10 DataFrame
1266 rows omitted
Row   soc_id   glottocode  political_complexity  hierarchy_within  domestic_organisation  agricultureLevel  settlement_strategy  exogamy  crop_type  Family
      String7  String15    String3               String3           String3                String3           String3              String3  String3    String
1     Aa1      juho1239    1                     3                 3                      0                 1                    1        1          Kxa
2     Aa2      okie1245    1                     2                 1                      0                 1                    1        1          Nilotic
3     Aa3      nama1265    2                     2                 1                      0                 1                    1        1          Khoe-Kwadi
4     Aa4      dama1270    NA                    NA                2                      0                 NA                   1        NA         Khoe-Kwadi
5     Aa5      bila1255    1                     2                 1                      0                 1                    1        1          Atlantic-Congo
6     Aa6      sand1273    2                     2                 1                      5                 3                    1        6          Sandawe
7     Aa7      naro1249    1                     2                 1                      0                 1                    1        1          Khoe-Kwadi
8     Aa8      xamm1241    1                     2                 1                      0                 1                    NA       1          Tuu
9     Aa9      hadz1240    1                     3                 3                      0                 1                    0        1          Hadza
10    Ab1      here1253    1                     3                 3                      0                 1                    1        1          Atlantic-Congo
11    Ab10     mpon1252    3                     3                 3                      5                 3                    1        6          Atlantic-Congo
12    Ab11     xesi1238    4                     3                 2                      5                 3                    0        5          Atlantic-Congo
13    Ab12     zulu1248    4                     3                 3                      5                 3                    1        6          Atlantic-Congo
⋮
1280  Si9      awet1244    NA                    NA                NA                     4                 NA                   NA       NA         Tupian
1281  Sj1      kara1500    1                     3                 3                      2                 2                    0        5          Nuclear-Macro-Je
1282  Sj10     mbya1239    1                     2                 1                      5                 4                    0        5          Tupian
1283  Sj11     xava1240    1                     4                 3                      2                 1                    0        6          Nuclear-Macro-Je
1284  Sj2      xere1240    1                     3                 1                      4                 4                    0        5          Nuclear-Macro-Je
1285  Sj3      xokl1240    1                     2                 2                      0                 1                    0        1          Nuclear-Macro-Je
1286  Sj4      cane1242    1                     4                 3                      4                 1                    0        5          Nuclear-Macro-Je
1287  Sj5      kren1239    1                     3                 4                      0                 1                    1        1          Nuclear-Macro-Je
1288  Sj6      temb1276    1                     3                 4                      5                 4                    0        5          Tupian
1289  Sj7      apin1244    1                     3                 4                      5                 4                    0        5          Nuclear-Macro-Je
1290  Sj8      tupi1273    2                     3                 4                      4                 4                    0        5          Tupian
1291  Sj9      kaya1330    1                     3                 1                      6                 1                    0        5          Nuclear-Macro-Je
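
The family lookup above scans the Glottocode column once per language. As a minimal sketch of the same fallback logic, a dictionary from glottocode to row index avoids the repeated scans. The toy table below is illustrative only (it is not the Glottolog table), and the helper name `family_name` is hypothetical.

```julia
# Toy stand-in for the Glottolog languages table (illustrative rows only).
toy = (
    Glottocode = ["zulu1248", "atla1278", "hadz1240"],
    Family_ID  = ["atla1278", missing, missing],
    Name       = ["Zulu", "Atlantic-Congo", "Hadza"],
)

# Map each glottocode to its row index once, instead of scanning per lookup.
row_of = Dict(g => i for (i, g) in enumerate(toy.Glottocode))

# Hypothetical helper mirroring the fallback logic of glottocode_2_family.
function family_name(g, tbl, row_of)
    row = get(row_of, g, nothing)
    isnothing(row) && return "Glottocode not found"
    fam = tbl.Family_ID[row]
    # No family code: the entry is its own top-level unit, so use its name.
    ismissing(fam) && return tbl.Name[row]
    fam_row = get(row_of, fam, nothing)
    return isnothing(fam_row) ? tbl.Name[row] : tbl.Name[fam_row]
end

family_name("zulu1248", toy, row_of)   # "Atlantic-Congo"
family_name("hadz1240", toy, row_of)   # "Hadza"
```

With roughly 1,300 glottocodes to resolve, either variant is fast enough; the dictionary mainly pays off when the lookup is repeated on larger tables.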

Getting the ASJP data

The OSF repo contains two types of characters: cognate class characters and soundclass-concept characters. The two character matrices are downloaded and loaded in turn.

file_id = "h4a6z"
url = "https://api.osf.io/v2/files/$(file_id)/"
response = HTTP.get(url)
data = JSON.parse(String(response.body))

download_url = data["data"]["links"]["download"]


world_cc_ = DataFrame(
    hcat(
        split.(
            split(read(download(download_url), String), "\n")[2:end]
        )...
    ) |> permutedims, :auto)

rename!(world_cc_, :x1 => :longname, :x2 => :characters)


world_cc = @pipe world_cc_.characters |>
                 mapslices(x -> split.(x, ""), _, dims=1) |>
                 hcat(_...) |>
                 permutedims |>
                 DataFrame(_, :auto) |>
                 insertcols!(_, 1, :longname => world_cc_.longname)
7432×45705 DataFrame
45605 columns and 7407 rows omitted (only x1 to x5 are shown here)
Row   longname                      x1         x2         x3         x4         x5         …
      SubStrin…                     SubStrin…  SubStrin…  SubStrin…  SubStrin…  SubStrin…  …
1     AA.DIZOID.NAO                 1          0          0          0          0          …
2     AuA.KHASIAN.KHASI             1          0          0          0          0          …
3     AuA.KHASIAN.KHASI_2           1          0          0          0          0          …
4     AuA.KHASIAN.LYNGNGAM          1          0          0          0          0          …
5     AuA.KHASIAN.PNAR_JOWAI        1          0          0          0          0          …
6     AuA.KHASIAN.WAR_JAINTIA       1          0          0          0          0          …
7     Gun.GUNWINYGIC.BUAN           1          0          0          0          0          …
8     Hok.YUMAN.YAVAPAI             1          0          0          0          0          …
9     Iwa.IWAIDJAN.AMURDAK          1          0          0          0          0          …
10    Iwa.IWAIDJAN.IWAIDJA          1          0          0          0          0          …
11    LSR.GRASS.ABU                 1          0          0          0          0          …
12    NC.KWA.AJAGBE                 1          0          0          0          0          …
13    NC.NORTHERN_ATLANTIC.WOLOF_8  1          0          0          0          0          …
⋮
7421  NC.BANTOID.NYANJA_NYASA       -          -          -          -          -          …
7422  NC.KAINJI.KUKI                -          -          -          -          -          …
7423  NC.KAINJI.REGI                -          -          -          -          -          …
7424  NC.KAINJI.ROGO                -          -          -          -          -          …
7425  NC.KAINJI.SHAMA               -          -          -          -          -          …
7426  NC.KWA.AKPAFU                 -          -          -          -          -          …
7427  NC.PLATOID.BEROM_F            -          -          -          -          -          …
7428  AA.BERBER.CHAOUI              -          -          -          -          -          …
7429  Man.WESTERN_MANDE.SEEKU       -          -          -          -          -          …
7430  An.CELEBIC.TOLAKI             -          -          -          -          -          …
7431  AA.WEST_CHADIC.DERA           -          -          -          -          -          …
7432  NC.KAINJI.SEGEMUK             -          -          -          -          -          …
file_id = "3em9h"
url = "https://api.osf.io/v2/files/$(file_id)/"
response = HTTP.get(url)
data = JSON.parse(String(response.body))

download_url = data["data"]["links"]["download"]


world_sc_ = DataFrame(
    hcat(
        split.(
            split(read(download(download_url), String), "\n")[2:end]
        )...
    ) |> permutedims, :auto)


rename!(world_sc_, :x1 => :longname, :x2 => :characters)


world_sc = @pipe world_sc_.characters |>
                 mapslices(x -> split.(x, ""), _, dims=1) |>
                 hcat(_...) |>
                 permutedims |>
                 DataFrame(_, :auto) |>
                 insertcols!(_, 1, :longname => world_sc_.longname)
7432×1641 DataFrame
1541 columns and 7407 rows omitted (only x1 to x5 are shown here)
Row   longname                      x1         x2         x3         x4         x5         …
      SubStrin…                     SubStrin…  SubStrin…  SubStrin…  SubStrin…  SubStrin…  …
1     AA.DIZOID.NAO                 0          0          0          0          0          …
2     AuA.KHASIAN.KHASI             0          0          0          0          0          …
3     AuA.KHASIAN.KHASI_2           0          0          0          0          0          …
4     AuA.KHASIAN.LYNGNGAM          0          1          0          0          0          …
5     AuA.KHASIAN.PNAR_JOWAI        0          0          0          0          0          …
6     AuA.KHASIAN.WAR_JAINTIA       0          0          0          0          0          …
7     Gun.GUNWINYGIC.BUAN           0          1          0          0          0          …
8     Hok.YUMAN.YAVAPAI             0          0          0          1          0          …
9     Iwa.IWAIDJAN.AMURDAK          0          0          0          0          0          …
10    Iwa.IWAIDJAN.IWAIDJA          0          0          0          0          0          …
11    LSR.GRASS.ABU                 0          0          0          0          0          …
12    NC.KWA.AJAGBE                 0          0          0          0          0          …
13    NC.NORTHERN_ATLANTIC.WOLOF_8  0          0          0          0          0          …
⋮
7421  NC.BANTOID.NYANJA_NYASA       -          -          -          -          -          …
7422  NC.KAINJI.KUKI                -          -          -          -          -          …
7423  NC.KAINJI.REGI                -          -          -          -          -          …
7424  NC.KAINJI.ROGO                -          -          -          -          -          …
7425  NC.KAINJI.SHAMA               -          -          -          -          -          …
7426  NC.KWA.AKPAFU                 -          -          -          -          -          …
7427  NC.PLATOID.BEROM_F            -          -          -          -          -          …
7428  AA.BERBER.CHAOUI              -          -          -          -          -          …
7429  Man.WESTERN_MANDE.SEEKU       -          -          -          -          -          …
7430  An.CELEBIC.TOLAKI             -          -          -          -          -          …
7431  AA.WEST_CHADIC.DERA           -          -          -          -          -          …
7432  NC.KAINJI.SEGEMUK             -          -          -          -          -          …
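
The parsing steps used for both matrices can be sketched on a toy input. This is a minimal illustration, assuming the file layout implied by the code above: one header line, then one whitespace-separated "longname characters" pair per line. The toy names and the header content are made up, and the sketch produces a matrix of `Char` values rather than the `SubString` columns shown in the outputs.

```julia
# Toy file content in the assumed layout: a header line, then one
# "longname characters" pair per line.
raw = "3 4\nLangA 10-1\nLangB 0110\nLangC --01"

# Drop the header and skip empty lines (e.g. from a trailing newline).
lines = filter(!isempty, split(raw, "\n")[2:end])

# Split each line on whitespace into its name and its character string.
fields = split.(lines)
longnames = first.(fields)
characters = last.(fields)

# One row per doculect, one column per character; '-' marks a missing entry.
matrix = permutedims(hcat(collect.(characters)...))

size(matrix)     # (3, 4)
matrix[1, :]     # ['1', '0', '-', '1']
```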

Next I fetch and prepare the metadata for the ASJP doculects.

file_id = "w4jnf"
url = "https://api.osf.io/v2/files/$(file_id)/"
response = HTTP.get(url)
data = JSON.parse(String(response.body))

download_url = data["data"]["links"]["download"]


asjp_languages = @pipe CSV.read(
                           download(download_url),
                           missingstring="",
                           DataFrame) |>
                       dropmissing(_, :classification_wals) |>
                       dropmissing(_, :Glottocode) |>
                       filter(row -> row.recently_extinct == 0, _) |>
                       filter(row -> row.long_extinct == 0, _) |>
                       select(_, [:Name, :Glottocode, :Family, :classification_wals]) |>
                       DataFrames.transform(_, [:classification_wals, :Name] => ByRow((x, y) -> string(x, ".", y)) => :longname) |>
                       select(_, Not(:classification_wals)) |>
                       DataFrames.transform(_, :longname => ByRow(x -> replace(x, "-" => "_")) => :longname) |>
                       dropmissing
8953×4 DataFrame
8928 rows omitted
Row   Name                   Glottocode  Family                    longname
      String                 String15    String31                  String
1     A51_BAFIA_MAJA         lefa1242    Atlantic-Congo            NC.BANTOID.A51_BAFIA_MAJA
2     A51_BAFIA_TUMI_TINGON  lefa1242    Atlantic-Congo            NC.BANTOID.A51_BAFIA_TUMI_TINGON
3     A51_BAFIA_ZAKAAN       lefa1242    Atlantic-Congo            NC.BANTOID.A51_BAFIA_ZAKAAN
4     A53_BAFIA_RIKPA        bafi1243    Atlantic-Congo            NC.BANTOID.A53_BAFIA_RIKPA
5     A54_BAFIA_NJANTI       tibe1274    Atlantic-Congo            NC.BANTOID.A54_BAFIA_NJANTI
6     A60_GUNU               nugu1242    Atlantic-Congo            NC.BANTOID.A60_GUNU
7     A60_MMAALA             mmaa1238    Atlantic-Congo            NC.BANTOID.A60_MMAALA
8     A61_NGORO_ASOM         tuki1240    Atlantic-Congo            NC.BANTOID.A61_NGORO_ASOM
9     A62_KALONGE            yang1293    Atlantic-Congo            NC.BANTOID.A62_KALONGE
10    A72a_EWONDO            ewon1239    Atlantic-Congo            NC.BANTOID.A72a_EWONDO
11    AASAX                  aasa1238    Afro-Asiatic              AA.SOUTHERN_CUSHITIC.AASAX
12    ABAGA                  abag1245    Nuclear Trans New Guinea  TNG.EASTERN_HIGHLANDS.ABAGA
13    ABANYOM                aban1242    Atlantic-Congo            NC.BANTOID.ABANYOM
⋮
8942  ZOOMBO_3               koon1244    Atlantic-Congo            NC.BANTOID.ZOOMBO_3
8943  ZOOMBO_4               koon1244    Atlantic-Congo            NC.BANTOID.ZOOMBO_4
8944  ZOQUE_FRANCISCO_LEON   fran1266    Mixe-Zoque                MZ.MIXE_ZOQUE.ZOQUE_FRANCISCO_LEON
8945  ZOQUE_RAYON            rayo1235    Mixe-Zoque                MZ.MIXE_ZOQUE.ZOQUE_RAYON
8946  ZUGUNUK_KALASHA        kala1372    Indo-European             IE.INDIC.ZUGUNUK_KALASHA
8947  ZULGO                  zulg1242    Afro-Asiatic              AA.BIU_MANDARA.ZULGO
8948  ZULU                   zulu1248    Atlantic-Congo            NC.BANTOID.ZULU
8949  ZULU_2                 zulu1248    Atlantic-Congo            NC.BANTOID.ZULU_2
8950  ZULU_NKANDLA           zulu1248    Atlantic-Congo            NC.BANTOID.ZULU_NKANDLA
8951  ZUMBUN                 zumb1240    Afro-Asiatic              AA.WEST_CHADIC.ZUMBUN
8952  ZUNI                   zuni1245    Zuni                      Zun.ZUNI.ZUNI
8953  ZWAY                   zayy1238    Afro-Asiatic              AA.SEMITIC.ZWAY
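
The `longname` column in the pipeline above is built in two steps: the WALS classification and the doculect name are joined with a dot, and hyphens are then replaced by underscores. A minimal sketch of that construction follows. The helper name `make_longname` is hypothetical, the classification value "NC.BANTOID" is inferred from the output above, and the second input is made up to show the hyphen replacement.

```julia
# Hypothetical helper: join the WALS classification and the doculect name
# with a dot, then replace hyphens by underscores in the whole string.
make_longname(classification, name) =
    replace(string(classification, ".", name), "-" => "_")

make_longname("NC.BANTOID", "A60_GUNU")     # "NC.BANTOID.A60_GUNU"
make_longname("Fam.GENUS", "SOME-NAME")     # "Fam.GENUS.SOME_NAME"
```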

I developed my own naming convention for ASJP doculects: [WALS family name].[WALS genus name].[doculect name]. These long names must be matched with glottocodes.

longname2glottocode = Dict{String, String}(
    zip(asjp_languages.longname, asjp_languages.Glottocode)
)

glottocode2longname = Dict{String, String}(
    zip(asjp_languages.Glottocode, asjp_languages.longname)
)

glottocode2family = Dict{String, String}(
    zip(asjp_languages.Glottocode, asjp_languages.Family)
)



for l in d.glottocode
    if l ∉ keys(glottocode2longname)
        longname2glottocode[l] = l
        glottocode2longname[l] = l
    end
end
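
Two properties of these dictionaries are worth keeping in mind: when several doculects share a glottocode, later pairs overwrite earlier ones in the glottocode-to-longname map, and glottocodes without an ASJP doculect are mapped onto themselves. The following sketch shows both on toy data. The three doculects are taken from the table above; that "hadz1240" lacks an ASJP entry is only assumed here for illustration.

```julia
# Toy metadata: two doculects share the glottocode "zulu1248".
longnames   = ["NC.BANTOID.ZULU", "NC.BANTOID.ZULU_2", "Zun.ZUNI.ZUNI"]
glottocodes = ["zulu1248", "zulu1248", "zuni1245"]

longname2glottocode = Dict{String,String}(zip(longnames, glottocodes))
glottocode2longname = Dict{String,String}(zip(glottocodes, longnames))

# Later pairs overwrite earlier ones, so the reverse map keeps one longname
# per glottocode (the last one listed).
glottocode2longname["zulu1248"]    # "NC.BANTOID.ZULU_2"

# Glottocodes without an ASJP doculect fall back to mapping onto themselves.
for g in ["zulu1248", "hadz1240"]
    if g ∉ keys(glottocode2longname)
        longname2glottocode[g] = g
        glottocode2longname[g] = g
    end
end

glottocode2longname["hadz1240"]    # "hadz1240"
```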

Next, I restrict the character matrices to the doculects for which I have a glottocode.

filter!(row -> row.longname ∈ asjp_languages.longname, world_cc)
filter!(row -> row.longname ∈ asjp_languages.longname, world_sc)
7034×1641 DataFrame
1541 columns and 7009 rows omitted (only x1 to x5 are shown here)
Row   longname                      x1         x2         x3         x4         x5         …
      SubStrin…                     SubStrin…  SubStrin…  SubStrin…  SubStrin…  SubStrin…  …
1     AA.DIZOID.NAO                 0          0          0          0          0          …
2     AuA.KHASIAN.KHASI             0          0          0          0          0          …
3     AuA.KHASIAN.KHASI_2           0          0          0          0          0          …
4     AuA.KHASIAN.LYNGNGAM          0          1          0          0          0          …
5     AuA.KHASIAN.PNAR_JOWAI        0          0          0          0          0          …
6     AuA.KHASIAN.WAR_JAINTIA       0          0          0          0          0          …
7     Gun.GUNWINYGIC.BUAN           0          1          0          0          0          …
8     Hok.YUMAN.YAVAPAI             0          0          0          1          0          …
9     Iwa.IWAIDJAN.AMURDAK          0          0          0          0          0          …
10    Iwa.IWAIDJAN.IWAIDJA          0          0          0          0          0          …
11    LSR.GRASS.ABU                 0          0          0          0          0          …
12    NC.KWA.AJAGBE                 0          0          0          0          0          …
13    NC.NORTHERN_ATLANTIC.WOLOF_8  0          0          0          0          0          …
⋮
7023  NC.BANTOID.NYANJA_NYASA       -          -          -          -          -          …
7024  NC.KAINJI.KUKI                -          -          -          -          -          …
7025  NC.KAINJI.REGI                -          -          -          -          -          …
7026  NC.KAINJI.ROGO                -          -          -          -          -          …
7027  NC.KAINJI.SHAMA               -          -          -          -          -          …
7028  NC.KWA.AKPAFU                 -          -          -          -          -          …
7029  NC.PLATOID.BEROM_F            -          -          -          -          -          …
7030  AA.BERBER.CHAOUI              -          -          -          -          -          …
7031  Man.WESTERN_MANDE.SEEKU       -          -          -          -          -          …
7032  An.CELEBIC.TOLAKI             -          -          -          -          -          …
7033  AA.WEST_CHADIC.DERA           -          -          -          -          -          …
7034  NC.KAINJI.SEGEMUK             -          -          -          -          -          …
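
The restriction above tests each longname for membership in a vector, which is a linear scan per row. As a possible refinement (not part of the original workflow), the names can be collected into a `Set` first, which gives the same result with hash lookups. The names below are toy values.

```julia
# Toy data: names in the character matrix and names that have a glottocode.
matrix_names = ["LangA", "LangB", "LangC", "LangD"]
known_names  = ["LangD", "LangB"]

# Build the set once; each membership test is then a hash lookup.
known = Set(known_names)

# The order of the filtered vector follows matrix_names.
kept = filter(in(known), matrix_names)    # ["LangB", "LangD"]
```

In the notebook, the same idea would amount to building `Set(asjp_languages.longname)` once and using it on the right-hand side of the membership test inside `filter!`.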

ASJP sometimes contains several doculects for the same glottocode. I therefore compute the number of missing entries for each doculect in the soundclass-concept matrix and, for each glottocode, select the ASJP doculect with the fewest missing entries as its representative.

insertcols!(
    world_cc,
    1,
    :Glottocode => [longname2glottocode[x] for x in world_cc.longname]
)

insertcols!(
    world_sc,
    1,
    :Glottocode => [longname2glottocode[x] for x in world_sc.longname]
)



best_languages = @pipe world_sc |> 
    DataFrame(
        longname = _.longname,
        Glottocode = _.Glottocode,
        nGaps = map(x -> sum(Array(x) .== "-"), eachrow(_))
    ) |> 
    sort(_, :nGaps) |>
    unique(_, :Glottocode).longname 
4261-element Vector{SubString{String}}:
 "AuA.KHASIAN.KHASI"
 "Hok.YUMAN.YAVAPAI"
 "Iwa.IWAIDJAN.IWAIDJA"
 "ST.BODIC.BUNAN"
 "ST.BODIC.EASTERN_BALTI"
 "ST.BODIC.GHACHOK"
 "ST.BODIC.HELAMBU_SHERPA"
 "ST.BODIC.KAGATE"
 "ST.BODIC.LHASA_TIBETAN"
 "ST.BODIC.LOWA"
 "ST.BODIC.MANANGE"
 "ST.BODIC.PATTANI"
 "ST.BODIC.PURIK"
 ⋮
 "Hok.YUMAN.MARICOPA"
 "NC.BANTOID.FANG"
 "NC.BANTOID.NJEN"
 "TNG.BINANDEREAN.GAINA"
 "NDe.ATHAPASKAN.HAN"
 "TNG.BINANDEREAN.OROKAIVA_SOSE"
 "ESu.NILOTIC.SOGOO"
 "NC.BANTOID.KOSHIN"
 "CSu.BONGO_BAGIRMI.GULA_SARA"
 "An.GREATER_CENTRAL_PHILIPPINE.MANDAYAN_ISLAM_PISO"
 "An.OCEANIC.PENRHYN"
 "AA.BIU_MANDARA.VEMGO_MABAS_2"

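The selection rule can be illustrated on toy data. The following is a minimal sketch in plain Base Julia, not the DataFrames pipeline used above; the doculect names, glottocodes and the helper `pick_representatives` are hypothetical.

```julia
# Hypothetical toy data: (doculect name, glottocode, character vector).
doculects = [
    ("DOC_A1", "aaaa1111", ["0", "-", "-", "1"]),
    ("DOC_A2", "aaaa1111", ["0", "1", "-", "1"]),
    ("DOC_B1", "bbbb2222", ["-", "-", "0", "0"]),
]

# Count the missing entries ("-") in a character vector.
count_gaps(chars) = count(==("-"), chars)

# For each glottocode, keep the doculect with the fewest gaps.
# Ties go to the doculect that comes first in the input.
function pick_representatives(doculects)
    best = Dict{String,Tuple{String,Int}}()
    for (name, glottocode, chars) in doculects
        gaps = count_gaps(chars)
        if !haskey(best, glottocode) || gaps < best[glottocode][2]
            best[glottocode] = (name, gaps)
        end
    end
    Dict(g => v[1] for (g, v) in best)
end

representatives = pick_representatives(doculects)
```

Here `DOC_A2` wins over `DOC_A1` for `aaaa1111` because it has one gap instead of two.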
The character matrices are now restricted to the doculects representing a glottocode.

filter!(row -> row.longname ∈ best_languages, world_cc)
filter!(row -> row.longname ∈ best_languages, world_sc)

select!(world_cc, Not(:longname))
select!(world_sc, Not(:longname))
4261×1641 DataFrame
1541 columns and 4236 rows omitted
Row Glottocode x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13 x14 x15 x16 x17 x18 x19 x20 x21 x22 x23 x24 x25 x26 x27 x28 x29 x30 x31 x32 x33 x34 x35 x36 x37 x38 x39 x40 x41 x42 x43 x44 x45 x46 x47 x48 x49 x50 x51 x52 x53 x54 x55 x56 x57 x58 x59 x60 x61 x62 x63 x64 x65 x66 x67 x68 x69 x70 x71 x72 x73 x74 x75 x76 x77 x78 x79 x80 x81 x82 x83 x84 x85 x86 x87 x88 x89 x90 x91 x92 x93 x94 x95 x96 x97 x98 x99
String SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin…
1 nayi1243 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 khas1269 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
3 lyng1241 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0
4 pnar1238 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0 0
5 warj1242 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 0 0 0 1 0
6 ngal1292 0 1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
7 hava1248 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0
8 amar1271 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 1 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
9 iwai1244 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 1 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
10 abuu1241 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
11 ajab1235 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
12 nucl1347 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 - - - - - - - - - - - - - - - - -
13 amah1246 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0
4250 rapa1244 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
4251 toro1253 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0
4252 amas1236 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1
4253 lagw1237 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0
4254 vemg1240 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
4255 rogo1238 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4256 sham1278 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4257 siwu1238 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4258 tach1249 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
4259 seek1238 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4260 dera1248 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4261 east2403 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
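A side note on the membership test used in this kind of filter: checking `∈` against a `Vector` scans the vector, whereas a `Set` offers hashed lookup. The sketch below shows the variant with a `Set` on toy data; the names `all_names` and `best_set` are hypothetical and not part of the workflow above.

```julia
# Hypothetical toy data.
all_names = ["DOC_A1", "DOC_A2", "DOC_B1"]
best = ["DOC_A2", "DOC_B1"]

# Convert once, then test membership against the set.
best_set = Set(best)
kept = filter(name -> name ∈ best_set, all_names)
```

For a few thousand doculects the difference is small, so this is an optional refinement rather than a necessary change.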

For the glottocodes in the target set for which there are no ASJP data, a character vector consisting of missing entries is constructed. Then, the character matrices are restricted to the glottocodes from the target set.

for l in setdiff(d.glottocode, world_sc.Glottocode)
    nl_cc = repeat(["-"], size(world_cc, 2))
    nl_cc[1] = l
    push!(world_cc, nl_cc)
    nl_sc = repeat(["-"], size(world_sc, 2))
    nl_sc[1] = l
    push!(world_sc, nl_sc)
end

filter!(row -> row.Glottocode ∈ d.glottocode, world_cc)
filter!(row -> row.Glottocode ∈ d.glottocode, world_sc)
1210×1641 DataFrame
1541 columns and 1185 rows omitted
Row Glottocode x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13 x14 x15 x16 x17 x18 x19 x20 x21 x22 x23 x24 x25 x26 x27 x28 x29 x30 x31 x32 x33 x34 x35 x36 x37 x38 x39 x40 x41 x42 x43 x44 x45 x46 x47 x48 x49 x50 x51 x52 x53 x54 x55 x56 x57 x58 x59 x60 x61 x62 x63 x64 x65 x66 x67 x68 x69 x70 x71 x72 x73 x74 x75 x76 x77 x78 x79 x80 x81 x82 x83 x84 x85 x86 x87 x88 x89 x90 x91 x92 x93 x94 x95 x96 x97 x98 x99
String SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin…
1 khas1269 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
2 nucl1347 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 - - - - - - - - - - - - - - - - -
3 amah1246 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0
4 sher1255 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5 akha1245 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0
6 sich1238 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
7 nucl1310 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
8 kach1280 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0
9 karb1241 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 - - - - - - - - - - - - - - - - -
10 loth1237 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
11 lepc1244 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
12 west2418 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
13 bori1243 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0
1199 piar1243 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
1200 chib1270 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
1201 yine1238 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
1202 uruu1244 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
1203 onaa1245 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
1204 tehu1242 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
1205 abip1241 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
1206 trum1247 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
1207 umot1240 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
1208 awet1244 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
1209 tupi1273 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
1210 kaya1330 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
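The two steps, padding with all-missing rows and restricting to the target set, can be sketched on a toy matrix. This is an illustration in Base Julia with a `Dict` standing in for the DataFrame; all glottocodes below are made up.

```julia
# Hypothetical toy matrix: glottocode => character vector.
matrix = Dict(
    "aaaa1111" => ["0", "1", "-"],
    "bbbb2222" => ["1", "0", "0"],
    "cccc3333" => ["0", "0", "1"],
)
targets = ["aaaa1111", "cccc3333", "dddd4444"]
nchar = length(first(values(matrix)))

# Add an all-missing row for every target glottocode without data.
for g in setdiff(targets, keys(matrix))
    matrix[g] = fill("-", nchar)
end

# Keep only the target glottocodes.
filter!(kv -> kv.first ∈ targets, matrix)
```

After these steps every target glottocode has exactly one row, and `dddd4444` consists of missing entries only.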

In the cognate class character matrix, all columns containing no 1 are removed.

select!(world_cc, Not([
    i for i in axes(world_cc, 2)[2:end] if sum(world_cc[:, i] .== "1") == 0
]))
1210×12777 DataFrame
12677 columns and 1185 rows omitted
Row Glottocode x1 x2 x4 x5 x9 x10 x11 x13 x14 x20 x22 x24 x29 x32 x35 x44 x45 x47 x48 x49 x52 x54 x55 x56 x57 x59 x60 x63 x64 x65 x66 x68 x69 x71 x79 x80 x82 x84 x85 x90 x97 x101 x105 x107 x108 x109 x110 x113 x118 x121 x124 x127 x129 x134 x137 x140 x143 x144 x147 x148 x149 x150 x153 x154 x156 x157 x158 x159 x161 x163 x165 x166 x167 x168 x170 x174 x175 x178 x179 x181 x182 x183 x187 x190 x195 x196 x197 x201 x202 x203 x205 x208 x212 x217 x218 x220 x221 x222 x223
String SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin…
1 khas1269 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 nucl1347 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 amah1246 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 sher1255 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5 akha1245 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
6 sich1238 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
7 nucl1310 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
8 kach1280 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
9 karb1241 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
10 loth1237 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
11 lepc1244 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
12 west2418 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
13 bori1243 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1199 piar1243 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
1200 chib1270 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
1201 yine1238 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
1202 uruu1244 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
1203 onaa1245 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
1204 tehu1242 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
1205 abip1241 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
1206 trum1247 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
1207 umot1240 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
1208 awet1244 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
1209 tupi1273 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
1210 kaya1330 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
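A column without any 1 corresponds to a cognate class that none of the remaining languages belongs to. The following sketch shows the same column filter on a small `Matrix` of strings; the data are made up and the variable names are hypothetical.

```julia
# Hypothetical toy character matrix (rows = languages, columns = characters).
chars = ["1" "0" "0";
         "0" "0" "-";
         "-" "0" "1"]

# Indices of the columns in which at least one language has a "1".
informative = [j for j in axes(chars, 2) if any(==("1"), chars[:, j])]

# Drop all other columns.
reduced = chars[:, informative]
```

The second column contains only 0s and is dropped, so `reduced` has two columns.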

Now I add the Glottolog family names to d. In the family names, spaces and hyphens are replaced by underscores and apostrophes are removed.

d[:,:Family] = replace.(d[:,:Family], " " => "_", "-" => "_", "'" => "")
1291-element Vector{String}:
 "Kxa"
 "Nilotic"
 "Khoe_Kwadi"
 "Khoe_Kwadi"
 "Atlantic_Congo"
 "Sandawe"
 "Khoe_Kwadi"
 "Tuu"
 "Hadza"
 "Atlantic_Congo"
 "Atlantic_Congo"
 "Atlantic_Congo"
 "Atlantic_Congo"
 ⋮
 "Tupian"
 "Nuclear_Macro_Je"
 "Tupian"
 "Nuclear_Macro_Je"
 "Nuclear_Macro_Je"
 "Nuclear_Macro_Je"
 "Nuclear_Macro_Je"
 "Nuclear_Macro_Je"
 "Tupian"
 "Nuclear_Macro_Je"
 "Tupian"
 "Nuclear_Macro_Je"

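The effect of the replacement on a single name can be checked in isolation. This is a small sketch; `normalize_family` is a hypothetical helper, and passing several pairs to `replace` assumes Julia 1.7 or later.

```julia
# Replace spaces and hyphens by underscores, drop apostrophes.
normalize_family(name) = replace(name, " " => "_", "-" => "_", "'" => "")

examples = ["Afro-Asiatic", "Nuclear-Macro-Je", "Khoe-Kwadi"]
normalized = normalize_family.(examples)
```

The results match the names in the output above, for instance `Afro_Asiatic` and `Nuclear_Macro_Je`.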
Here is a helper function that takes a DataFrame object representing a character matrix and constructs the content of a Nexus file representing that matrix.

If a family contains only two taxa, a dummy taxon is added which has all characters missing. This is required because MrBayes only works with datasets containing at least 3 taxa.

# create character matrices

function df2nexus(cm)
    pad = maximum(length.(cm.Glottocode)) + 5
    ntaxa = size(cm, 1) == 2 ? 3 : size(cm, 1)
    nex = """#Nexus
    BEGIN DATA;
    DIMENSIONS ntax=$ntaxa nchar = $(size(cm, 2)-1);
    FORMAT DATATYPE=Restriction GAP=? MISSING=- interleave=no;
    MATRIX

    """
    for i in axes(cm, 1)
        nex *= rpad(cm.Glottocode[i], pad) * join(Array(cm[i, 2:end])) * "\n"
    end
    if nrow(cm) == 2
        nex *= rpad("dummy", pad) * repeat("?", size(cm, 2)-1) * "\n"
    end
    nex *= ";\nEND"
    nex
end
df2nexus (generic function with 1 method)
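To see the layout that such a function produces, here is a sketch of the same logic in Base Julia, taking plain vectors instead of a DataFrame. The function `rows2nexus` and the two toy taxa are hypothetical; the header lines are copied from `df2nexus`.

```julia
# Build Nexus text from taxon names and their character vectors.
function rows2nexus(names, rows)
    pad = maximum(length.(names)) + 5
    nchar = length(first(rows))
    # With exactly two taxa, a dummy taxon is appended below.
    ntax = length(names) == 2 ? 3 : length(names)
    lines = [
        "#Nexus",
        "BEGIN DATA;",
        "DIMENSIONS ntax=$ntax nchar = $nchar;",
        "FORMAT DATATYPE=Restriction GAP=? MISSING=- interleave=no;",
        "MATRIX",
        "",
    ]
    for (name, row) in zip(names, rows)
        push!(lines, rpad(name, pad) * join(row))
    end
    if length(names) == 2
        push!(lines, rpad("dummy", pad) * repeat("?", nchar))
    end
    join(lines, "\n") * "\n;\nEND"
end

nex = rows2nexus(["aaaa1111", "bbbb2222"], [["0", "1", "-"], ["1", "-", "0"]])
print(nex)
```

With two input taxa, the header reports `ntax=3` and the matrix ends with a `dummy` row of question marks.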

Next, the two character matrices are concatenated.

char_mtx = innerjoin(
        world_sc,
        world_cc,
        on=:Glottocode => :Glottocode,
        makeunique=true,
    )
1210×14417 DataFrame
14317 columns and 1185 rows omitted
Row Glottocode x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13 x14 x15 x16 x17 x18 x19 x20 x21 x22 x23 x24 x25 x26 x27 x28 x29 x30 x31 x32 x33 x34 x35 x36 x37 x38 x39 x40 x41 x42 x43 x44 x45 x46 x47 x48 x49 x50 x51 x52 x53 x54 x55 x56 x57 x58 x59 x60 x61 x62 x63 x64 x65 x66 x67 x68 x69 x70 x71 x72 x73 x74 x75 x76 x77 x78 x79 x80 x81 x82 x83 x84 x85 x86 x87 x88 x89 x90 x91 x92 x93 x94 x95 x96 x97 x98 x99
String SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin…
1 khas1269 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
2 nucl1347 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 - - - - - - - - - - - - - - - - -
3 amah1246 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0
4 sher1255 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5 akha1245 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0
6 sich1238 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
7 nucl1310 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
8 kach1280 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0
9 karb1241 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 - - - - - - - - - - - - - - - - -
10 loth1237 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
11 lepc1244 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
12 west2418 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
13 bori1243 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0
1199 piar1243 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
1200 chib1270 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
1201 yine1238 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
1202 uruu1244 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
1203 onaa1245 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
1204 tehu1242 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
1205 abip1241 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
1206 trum1247 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
1207 umot1240 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
1208 awet1244 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
1209 tupi1273 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
1210 kaya1330 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
lineages = @pipe d |> 
    unique(_, :glottocode) |>
    groupby(_, :Family) |> 
    combine(_, nrow) |> 
    sort(_,:nrow)

families = lineages.Family[lineages.nrow .> 1]
73-element Vector{String}:
 "Kadugli_Krongo"
 "Songhay"
 "Basque"
 "Abkhaz_Adyge"
 "Hmong_Mien"
 "Tai_Kadai"
 "Ndu"
 "Koiarian"
 "Greater_Kwerba"
 "Chinookan"
 "Palaihnihan"
 "Maiduan"
 "Yuki_Wappo"
 ⋮
 "Uralic"
 "Salishan"
 "Mande"
 "Athabaskan_Eyak_Tlingit"
 "Uto_Aztecan"
 "Sino_Tibetan"
 "Algic"
 "Nilotic"
 "Indo_European"
 "Afro_Asiatic"
 "Austronesian"
 "Atlantic_Congo"

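The grouping step amounts to counting glottocodes per family and keeping the families with more than one member. A sketch in Base Julia, with a made-up mapping `family_of` in place of the DataFrame:

```julia
# Hypothetical toy mapping: glottocode => family.
family_of = Dict(
    "aaaa1111" => "Fam_A",
    "aaaa2222" => "Fam_A",
    "aaaa3333" => "Fam_A",
    "bbbb1111" => "Fam_B",
    "bbbb2222" => "Fam_B",
    "cccc1111" => "Fam_C",
)

# Count the glottocodes per family.
counts = Dict{String,Int}()
for fam in values(family_of)
    counts[fam] = get(counts, fam, 0) + 1
end

# Keep families with more than one member, smallest family first.
families_toy = sort([f for (f, n) in counts if n > 1], by = f -> counts[f])
```

`Fam_C` has a single member and is dropped, which mirrors the condition `lineages.nrow .> 1` above.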
Now I fetch the Glottolog classification as a vector of newick strings from the Glottolog website.

glottologF = "../data/tree_glottolog_newick.txt"

isfile(glottologF) || download(
    "https://cdstar.eva.mpg.de//bitstreams/EAEA0-B701-6328-C3E3-0/tree_glottolog_newick.txt",
    glottologF
)
true
raw = readlines(glottologF);

First, some clean-up is needed to make the newick strings digestible by ete3. Then the newick tree for each family is read in as an ete3 tree object.

trees = []

for ln in raw
    ln = strip(ln)
    ln = replace(ln, r"\'[A-ZÄÖÜ][^[]*\[" => "[")
    ln = replace(ln, r"\][^']*\'" => "]")
    ln = replace(ln, r"\[|\]" => "")
    ln = replace(ln, "''" => "")
    ln = replace(ln, ":1" => "")
    push!(
        trees,
        ete3.Tree(ln, format=1)
    )
end
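The effect of the five replacements is easiest to see on a single string. The sketch below wraps them in a hypothetical helper `clean_newick` and applies it to a made-up sample line; the sample only assumes that node labels look like `'Name [glottocode][iso]-l-'` followed by `:1`.

```julia
# Apply the same replacements as in the loop above to one line.
function clean_newick(ln)
    ln = strip(ln)
    ln = replace(ln, r"\'[A-ZÄÖÜ][^[]*\[" => "[")  # drop the name before the glottocode
    ln = replace(ln, r"\][^']*\'" => "]")          # drop everything after the glottocode
    ln = replace(ln, r"\[|\]" => "")               # remove the brackets
    ln = replace(ln, "''" => "")
    ln = replace(ln, ":1" => "")                   # remove the branch lengths
    ln
end

# Hypothetical sample line, made up for illustration.
sample = "('Bolgo [bolg1250][bvo]-l-':1,'Bon Gula [bong1284][glc]-l-':1)'Bua [buaa1245]':1;"
cleaned = clean_newick(sample)
```

Only the glottocodes and the bracketing survive, which is what `ete3.Tree(ln, format=1)` then parses.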

Next, the Glottolog trees for the individual families are combined into a Glottolog world tree.

This Glottolog tree contains internal nodes representing glottocodes. They may have daughter nodes representing dialects. To make sure that each glottocode is a leaf, I create another leaf daughter for each named internal node and shift the name of the internal node to that new leaf.

glot = ete3.Tree()

for t in trees
    glot.add_child(t)
end

nonLeaves = [nd.name for nd in glot.traverse()
             if (nd.name != "") & !nd.is_leaf()
]

@showprogress for nm in nonLeaves
    nd = (glot & nm)
    nd.name = ""
    nd.add_child(name=nm)
end
96%|███████████████████████████████████████▎ |  ETA: 0:00:03Progress:  96%|███████████████████████████████████████▎ |  ETA: 0:00:03Progress:  96%|███████████████████████████████████████▎ |  ETA: 0:00:03Progress:  96%|███████████████████████████████████████▍ |  ETA: 0:00:03Progress:  96%|███████████████████████████████████████▍ |  ETA: 0:00:03Progress:  96%|███████████████████████████████████████▍ |  ETA: 0:00:03Progress:  96%|███████████████████████████████████████▍ |  ETA: 0:00:02Progress:  96%|███████████████████████████████████████▌ |  ETA: 0:00:02Progress:  96%|███████████████████████████████████████▌ |  ETA: 0:00:02Progress:  96%|███████████████████████████████████████▌ |  ETA: 0:00:02Progress:  97%|███████████████████████████████████████▋ |  ETA: 0:00:02Progress:  97%|███████████████████████████████████████▋ |  ETA: 0:00:02Progress:  97%|███████████████████████████████████████▋ |  ETA: 0:00:02Progress:  97%|███████████████████████████████████████▊ |  ETA: 0:00:02Progress:  97%|███████████████████████████████████████▊ |  ETA: 0:00:02Progress:  97%|███████████████████████████████████████▊ |  ETA: 0:00:02Progress:  97%|███████████████████████████████████████▉ |  ETA: 0:00:02Progress:  97%|███████████████████████████████████████▉ |  ETA: 0:00:02Progress:  97%|███████████████████████████████████████▉ |  ETA: 0:00:02Progress:  97%|███████████████████████████████████████▉ |  ETA: 0:00:02Progress:  97%|████████████████████████████████████████ |  ETA: 0:00:02Progress:  98%|████████████████████████████████████████ |  ETA: 0:00:02Progress:  98%|████████████████████████████████████████ |  ETA: 0:00:02Progress:  98%|████████████████████████████████████████▏|  ETA: 0:00:01Progress:  98%|████████████████████████████████████████▏|  ETA: 0:00:01Progress:  98%|████████████████████████████████████████▏|  ETA: 0:00:01Progress:  98%|████████████████████████████████████████▏|  ETA: 0:00:01Progress:  98%|████████████████████████████████████████▎|  ETA: 0:00:01Progress:  
98%|████████████████████████████████████████▎|  ETA: 0:00:01Progress:  98%|████████████████████████████████████████▎|  ETA: 0:00:01Progress:  98%|████████████████████████████████████████▍|  ETA: 0:00:01Progress:  98%|████████████████████████████████████████▍|  ETA: 0:00:01Progress:  99%|████████████████████████████████████████▍|  ETA: 0:00:01Progress:  99%|████████████████████████████████████████▌|  ETA: 0:00:01Progress:  99%|████████████████████████████████████████▌|  ETA: 0:00:01Progress:  99%|████████████████████████████████████████▌|  ETA: 0:00:01Progress:  99%|████████████████████████████████████████▋|  ETA: 0:00:01Progress:  99%|████████████████████████████████████████▋|  ETA: 0:00:01Progress:  99%|████████████████████████████████████████▋|  ETA: 0:00:01Progress:  99%|████████████████████████████████████████▋|  ETA: 0:00:01Progress:  99%|████████████████████████████████████████▊|  ETA: 0:00:00Progress:  99%|████████████████████████████████████████▊|  ETA: 0:00:00Progress: 100%|████████████████████████████████████████▊|  ETA: 0:00:00Progress: 100%|████████████████████████████████████████▉|  ETA: 0:00:00Progress: 100%|████████████████████████████████████████▉|  ETA: 0:00:00Progress: 100%|████████████████████████████████████████▉|  ETA: 0:00:00Progress: 100%|█████████████████████████████████████████|  ETA: 0:00:00Progress: 100%|█████████████████████████████████████████|  ETA: 0:00:00Progress: 100%|█████████████████████████████████████████| Time: 0:01:06

Next, I create a dictionary with the families from the target set as keys. For each target family, I prune the Glottolog world tree to the glottocodes from that family and store the result as a Newick string in the dictionary.

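function family_to_tree(fm; d=d, glot=glot)
    # glottocodes of the target languages belonging to family `fm`
    fm_taxa = d.glottocode[d.Family.==fm]
    # prune a copy, so that the world tree itself stays intact
    glot_fm = glot.copy()
    glot_fm.prune(fm_taxa)
    # format=9: Newick string with leaf names only
    return glot_fm.write(format=9)
end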


glot_tree_dict = Dict()

for fm in families
    @info fm
    glot_tree_dict[fm] = family_to_tree(fm)
end
[ Info: Kadugli_Krongo
[ Info: Songhay
[ Info: Basque
[ Info: Abkhaz_Adyge
[ Info: Hmong_Mien
[ Info: Tai_Kadai
[ Info: Ndu
[ Info: Koiarian
[ Info: Greater_Kwerba
[ Info: Chinookan
[ Info: Palaihnihan
[ Info: Maiduan
[ Info: Yuki_Wappo
[ Info: Miwok_Costanoan
[ Info: Mixe_Zoque
[ Info: Tucanoan
[ Info: Chonan
[ Info: Matacoan
[ Info: Bororoan
[ Info: Khoe_Kwadi
[ Info: Nubian
[ Info: Surmic
[ Info: South_Omotic
[ Info: Kartvelian
[ Info: Chukotko_Kamchatkan
[ Info: Wintuan
[ Info: Pomoan
[ Info: Yokutsan
[ Info: Iroquoian
[ Info: Pano_Tacanan
[ Info: Guaicuruan
[ Info: Saharan
[ Info: Japonic
[ Info: Sahaptian
[ Info: Caddoan
[ Info: Muskogean
[ Info: Yanomamic
[ Info: Kru
[ Info: Heibanic
[ Info: Wakashan
[ Info: Keresan
[ Info: Otomanguean
[ Info: Chibchan
[ Info: Ta_Ne_Omotic
[ Info: Mongolic_Khitan
[ Info: Tungusic
[ Info: Pama_Nyungan
[ Info: Kiowa_Tanoan
[ Info: Nuclear_Macro_Je
[ Info: Cochimi_Yuman
[ Info: Mayan
[ Info: Nuclear_Trans_New_Guinea
[ Info: Tupian
[ Info: Dravidian
[ Info: Siouan
[ Info: Turkic
[ Info: Cariban
[ Info: Central_Sudanic
[ Info: Arawakan
[ Info: Eskimo_Aleut
[ Info: Austroasiatic
[ Info: Uralic
[ Info: Salishan
[ Info: Mande
[ Info: Athabaskan_Eyak_Tlingit
[ Info: Uto_Aztecan
[ Info: Sino_Tibetan
[ Info: Algic
[ Info: Nilotic
[ Info: Indo_European
[ Info: Afro_Asiatic
[ Info: Austronesian
[ Info: Atlantic_Congo

Finally, the MrBayes files are created.

For each family, there are three nexus files:

  1. the file containing the character matrix, and
  2. two MrBayes scripts.

Using two MrBayes scripts is a hack because

  • I want at least 100,000 iterations per chain, and
  • I want to apply the early stop rule which terminates a chain when the topologies have sufficiently converged.

If I use the early stop rule from the outset, sampling for very small families will stop right away because there is no (or little) phylogenetic uncertainty. It is still advisable to do some sampling though, to get a good estimate for the branch lengths.
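
To see why the stop rule fires immediately for very small families, here is a minimal sketch of the convergence diagnostic. It assumes that the diagnostic is the mean, over splits, of the standard deviation of each split's frequency across runs. The function name `asdsf` and the toy frequencies are made up for illustration, and MrBayes' own computation may differ in details such as which splits are included.

```julia
using Statistics

# Rows = splits (bipartitions), columns = independent runs.
# Each entry is the fraction of sampled trees in that run containing the split.
function asdsf(split_freqs::AbstractMatrix{<:Real})
    # standard deviation across runs for every split, then the average over splits
    return mean(std(split_freqs; dims=2))
end

# A tiny family: the only split is present in every sample of both runs.
tiny = [1.0 1.0]

# A larger family early in the run: the two runs still disagree on some splits.
large = [0.90 0.60;
         0.20 0.50;
         1.00 1.00]

stopval = 0.05  # value of stopval in the script template

println(asdsf(tiny) <= stopval)   # true: sampling would stop at the first check
println(asdsf(large) <= stopval)  # false: sampling continues
```

With no topological uncertainty, all runs report the same split frequencies, so the diagnostic is zero from the first check onwards.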

Not using the stop rule is not a good option either, because then I have to fix a sufficiently large number of iterations. For large families, this is in the tens of millions, which would be a waste of resources for smaller families.

As a compromise, I use the first script to run 10,000,000 iterations without the stop rule, and the second script to continue the same analysis up to 1,000,000,000 iterations with the stop rule.
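
The two-stage scheme amounts to two parameter sets that differ only in the iteration limit and in the `append` and `stoprule` switches. The sketch below spells this out. The helper name `stage_settings` is hypothetical and not part of the notebook's code; the values are the ones used further down.

```julia
# Hypothetical helper: MCMC settings for the two stages of one family.
function stage_settings(stage::Symbol)
    if stage == :head
        # fixed-length first stage, no convergence check
        return (ngen=10_000_000, append="no", stoprule="no")
    elseif stage == :tail
        # continue the same chains, stop early once the runs agree
        return (ngen=1_000_000_000, append="yes", stoprule="yes")
    else
        throw(ArgumentError("stage must be :head or :tail"))
    end
end

for stage in (:head, :tail)
    s = stage_settings(stage)
    println("$stage: ngen=$(s.ngen) append=$(s.append) stoprule=$(s.stoprule)")
end
```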

For the latter, I prepare the following analysis:

  • characters are partitioned into the cc and sc characters (see above)
  • relaxed clock
  • gamma-distributed rate variation
  • equilibrium probabilities and the clock rates are estimated for the two partitions separately
  • for the cc characters, ascertainment bias correction is conducted because all-0 columns are not included
  • the Glottolog tree is used as constraint tree
  • if the dataset contains a dummy taxon (because there are only two real taxa), the dummy taxon is treated as outgroup,
  • if the family contains \(> 100\) taxa, the MCMC is run with four runs and four chains, otherwise with two runs and two chains,
  • the maximum is 1,000,000,000 iterations, with an early stop if the average standard deviation of split frequencies is \(\leq 0.05\)
function create_mb_script(
    fn,
    char_mtx,
    clades,
    fm_cc,
    fm_glottocodes,
    n_iterations,
    n_runs,
    n_chains,
    append,
    stoprule
)
    mb = """#Nexus
        Begin MrBayes;
            execute $fn.nex;
            charset sc = 1-1640;
            charset cc = 1641-$(size(char_mtx, 2)-1);
            partition dtype = 2:sc, cc;
            set partition = dtype;
            prset applyto=(all) brlenspr = clock:uniform;
            prset applyto=(all) clockvarpr = igr;
            lset applyto=(all) rates=gamma;
            unlink Statefreq=(all) shape=(all) igrvar=(all) rate=(all);
            prset applyto=(all) ratepr=Dirichlet(1, 1);
            prset applyto=(2) clockratepr=exp(1.0); [for partition 2]
            lset applyto=(1) coding=all;
            lset applyto=(2) coding=noabsencesites;
        """
    if length(clades) > 1
        for (i, cl) in enumerate(clades)
            cn = join(cl, " ")
            mb *= "    constraint c$i = "
            mb *= "$cn;\n"
        end
        mb *= "    prset topologypr = constraints("
        mb *= join(["c$i" for i in 1:length(clades)], ",") * ");\n"
    end
    if length(fm_glottocodes) == 2
        mb *= "constraint c1 = $(join(fm_cc.Glottocode, " "));\n"
        mb *= "prset topologypr = constraints(c1);\n"
    end
    if length(fm_glottocodes) > 100
        mb *= "    set beagleprecision=double;\n"
    end
    mb *= """    prset brlenspr = clock:uniform;
        prset clockvarpr = igr;
        mcmcp stoprule=$stoprule stopval=0.05 filename=output/$fn samplefreq=1000;
        mcmc ngen=$n_iterations nchains=$n_chains nruns=$n_runs append=$append;
        sumt;
        sump;
        q;
    end;
    """
    return mb
end
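
To make the constraint block concrete, the following sketch isolates that part of the script generation and applies it to a made-up family with four taxa. The helper name `constraint_block` and the taxon labels are invented for illustration; the layout of the generated commands follows the code above.

```julia
# Hypothetical helper: turn a list of clades (each a vector of taxon labels)
# into MrBayes constraint commands plus the matching topology prior.
function constraint_block(clades::Vector{Vector{String}})
    lines = String[]
    for (i, cl) in enumerate(clades)
        push!(lines, "constraint c$i = $(join(cl, " "));")
    end
    labels = join(["c$i" for i in eachindex(clades)], ",")
    push!(lines, "prset topologypr = constraints($labels);")
    return join(lines, "\n")
end

# Made-up example: a family whose classification is ((a, b), (c, d)).
clades = [["taxon_a", "taxon_b"], ["taxon_c", "taxon_d"]]
println(constraint_block(clades))
```

Each internal node of the pruned Glottolog tree contributes one such constraint, so the sampled trees can only resolve what the classification leaves open.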


mkpath("mrbayes/output")
@showprogress for (i, fm) in enumerate(families)
    fm_glottocodes = d.glottocode[d.Family.==fm]
    fn = lpad(i, 3, "0")*"_"*fm
    fm_cc = @pipe world_cc |>
        filter(row -> row.Glottocode ∈ fm_glottocodes, _)

    fm_characters = names(fm_cc)[2:end]

    informative = map(x -> sum(string.(fm_cc[:,x]) .== "1") .> 0, fm_characters)

    fm_cc = select(
        fm_cc, 
        vcat(["Glottocode"], fm_characters[informative])
    )

    fm_sc = @pipe world_sc |>
        filter(row -> row.Glottocode ∈ fm_glottocodes, _)
    fm_characters = [x for x in names(fm_sc) if x != "Glottocode"]

    fm_sc = select(
        fm_sc, 
        vcat(["Glottocode"], fm_characters)
    )
        
    char_mtx = innerjoin(
        fm_sc,
        fm_cc,
        on=:Glottocode => :Glottocode,
        makeunique=true,
    )


    fm_glot = ete3.Tree(glot_tree_dict[fm], format=1)

    internal_nodes = [
        nd for nd in fm_glot.traverse()
        if !nd.is_leaf() && !nd.is_root()
    ]
    clades = [x.get_leaf_names() for x in internal_nodes]

    n_iterations_head = 10_000_000
    n_iterations_tail = 1_000_000_000
    n_chains = length(fm_glottocodes) > 100 ? 4 : 2
    n_runs = length(fm_glottocodes) > 100 ? 4 : 2

    mb_head = create_mb_script(
        fn,
        char_mtx,
        clades,
        fm_cc,
        fm_glottocodes,
        n_iterations_head,
        n_runs,
        n_chains,
        "no",
        "no"
    )

    mb_tail = create_mb_script(
        fn,
        char_mtx,
        clades,
        fm_cc,
        fm_glottocodes,
        n_iterations_tail,
        n_runs,
        n_chains,
        "yes",
        "yes"
    )
    write("mrbayes/$(fn)_head.mb.nex", mb_head)
    write("mrbayes/$(fn)_tail.mb.nex", mb_tail)
    write("mrbayes/$fn.nex", df2nexus(char_mtx))
end
Progress: 100%|█████████████████████████████████████████| Time: 0:00:03
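
The filter on the cc characters in the loop above keeps a column only if at least one language of the family has the value 1. Here is a minimal sketch of that test on made-up data, using plain vectors instead of a DataFrame. The character names, the values and the use of "-" for missing entries are assumptions for illustration.

```julia
# Made-up cc characters for a three-language family; "-" marks a missing entry.
columns = Dict(
    "char_1" => ["0", "1", "-"],
    "char_2" => ["0", "0", "0"],
    "char_3" => ["-", "-", "1"],
)

# A column is kept if the state "1" is attested at least once.
is_informative(col) = any(==("1"), col)

kept = sort([name for (name, col) in columns if is_informative(col)])
println(kept)  # ["char_1", "char_3"]
```

Because columns without any 1 are dropped in this way, the cc partition is analysed with `coding=noabsencesites`.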

This completes data preparation for MrBayes.

In the next step, all MrBayes scripts must be run, ideally with as much parallelization as possible. I used the following shell script on a powerful server for this:

#!/bin/bash

# Script to run multiple instances of mb-mpi command using parallel processing

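cd mrbayes || exit 1
max_jobs=25

# Block until fewer than $max_jobs background jobs are running, then start
# mb-mpi with $1 MPI processes on script $2.
run_with_limit() {
  while [ "$(jobs -rp | wc -l)" -ge "$max_jobs" ]; do
    sleep 1
  done
  mpirun -np "$1" mb-mpi "$2" &
}

# Atlantic_Congo and Austronesian get 16 processes, all other families 4.
# Files are matched on the family name, because the numeric prefix depends
# on the order of the families.
procs_for() {
  case "$1" in
    *_Atlantic_Congo_*|*_Austronesian_*) echo 16 ;;
    *) echo 4 ;;
  esac
}

# First stage: fixed number of iterations, no stop rule
for file in *_head.mb.nex; do
  run_with_limit "$(procs_for "$file")" "$file"
done

# The tail scripts append to the head runs, so all head runs must have finished
wait

# Second stage: continue with the stop rule
for file in *_tail.mb.nex; do
  run_with_limit "$(procs_for "$file")" "$file"
done

wait
echo "All jobs finished."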
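
The process counts passed to `mpirun` appear to correspond to one process per chain, that is, the number of runs times the number of chains: 4 × 4 = 16 for families with more than 100 taxa and 2 × 2 = 4 for all others. As a sketch of how this number could be derived from a generated script rather than from its file name, the hypothetical helper `mpi_procs` below reads both settings from the script text. It assumes the `mcmc` line has the layout written by `create_mb_script`.

```julia
# Hypothetical helper: read nchains and nruns from the text of a MrBayes script
# and return their product, i.e. the total number of chains.
function mpi_procs(script::AbstractString)
    m = match(r"nchains=(\d+)\s+nruns=(\d+)", script)
    m === nothing && error("no 'nchains=... nruns=...' setting found")
    return parse(Int, m[1]) * parse(Int, m[2])
end

# The mcmc line as it would be written for a family with more than 100 taxa
example = "mcmc ngen=10000000 nchains=4 nruns=4 append=no;"
println(mpi_procs(example))  # 16
```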