Gerhard Jäger, Tübingen University

Computational phylogenetics and language (pre)history: Achievements, challenges and prospects

C-Leste, June 14, 2024

phylogenetic linguistics

  • main goal: infer phylogenetic trees from lexical data

input

  • important: cognate classification

example data: iecor from Lexibank

21×4 DataFrame
Row Glottolog_Name Concepticon_Gloss Segments Cognateset_ID
String? String String String
1 Kamviri SAND ts y 108
2 Ashkun SAND ʃ ʊː ɽ̃ ɨ 108
3 Irish WET fʲ l̥ʲ ʊ x 1080
4 Irish YEAR bʲ lʲ i ə nʲ 1088
5 Latvian MAN v iː r s 1094
6 Lithuanian MAN ʋʲ iː r ɐ s 1094
7 Irish MAN fʲ æ ɾˠ 1094
8 Northwest Pashayi MAN ʋ iˑ ɾ 1094
9 Bengali BONE h a ɽ 11
10 Hindi BONE ɦ ə ɖː i 11
11 Marathi BONE h a ɖ 11
12 Maithili BONE h ə ɖː i 11
13 Urdu BONE h ə ɖː i 11
14 Kashmiri BONE ɨ ɽ i dʒ 11
15 Magahi BONE h ə ɖː i 11
16 Bhojpuri BONE ɦ ə ɖː i 11
17 Gawar-Bati BONE h ɜ ɖ oˑ k i 11
18 Assamese BONE h ɑ ɾ 11
19 Gawri BONE h ɔ ɖ̥ 11
20 Bulgarian ASH p ɛ p ɘ lˠ 1105
21 Belarusian ASH p ɔ pʲ ɛ lˠ 1105

intermediate step

convert cognate classification into binary character matrix
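
The conversion can be sketched as follows. This is a minimal illustration, not the code behind the talk: each cognate class becomes one column, a language gets 1 if it has a word in that class and 0 otherwise. Coding a concept without any entry as "?" (missing data) is an assumption, and the record for "LangX" with class "9999" is hypothetical, added only so that a 0 appears in the toy output.

```python
from collections import defaultdict

# (language, concept, cognate set ID) triples from the iecor sample,
# plus one hypothetical record (LangX) to create a second class for MAN
records = [
    ("Latvian", "MAN", "1094"),
    ("Lithuanian", "MAN", "1094"),
    ("Irish", "MAN", "1094"),
    ("Irish", "WET", "1080"),
    ("Bengali", "BONE", "11"),
    ("Hindi", "BONE", "11"),
    ("LangX", "MAN", "9999"),  # hypothetical, not in the data
]

def binary_character_matrix(records):
    """Return (languages, characters, matrix) with entries '1', '0' or '?'."""
    languages = sorted({lang for lang, _, _ in records})
    # one character (column) per cognate class, stored with its concept
    characters = sorted({(concept, cog) for _, concept, cog in records})
    attested = defaultdict(set)  # language -> concepts with at least one entry
    has_class = set()            # (language, concept, cognate class) triples
    for lang, concept, cog in records:
        attested[lang].add(concept)
        has_class.add((lang, concept, cog))
    matrix = {}
    for lang in languages:
        row = []
        for concept, cog in characters:
            if concept not in attested[lang]:
                row.append("?")  # assumption: no entry for the concept = missing data
            elif (lang, concept, cog) in has_class:
                row.append("1")
            else:
                row.append("0")
        matrix[lang] = "".join(row)
    return languages, characters, matrix

languages, characters, matrix = binary_character_matrix(records)
for lang in languages:
    print(f"{lang:12s} {matrix[lang]}")
```

The columns are ordered by concept and cognate set ID, here BONE/11, MAN/1094, MAN/9999, WET/1080.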

stochastic model: continuous time Markov chain
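
The slide does not spell out the model. A common choice for binary characters is a two-state chain in which a cognate class is gained at some rate and lost at some rate. Under that assumption the transition probabilities over a branch of length t have a closed form, sketched below. The parameter names `gain` and `loss` and the numeric values are illustrative only.

```python
import math

def transition_probabilities(gain, loss, t):
    """2x2 matrix P(t) of a two-state CTMC.

    P[i][j] is the probability of being in state j after time t
    when starting in state i (0 = class absent, 1 = class present).
    """
    total = gain + loss
    decay = math.exp(-total * t)
    pi0, pi1 = loss / total, gain / total  # stationary distribution
    return [
        [pi0 + pi1 * decay, pi1 - pi1 * decay],
        [pi0 - pi0 * decay, pi1 + pi0 * decay],
    ]

P = transition_probabilities(gain=0.1, loss=0.3, t=2.0)
for row in P:
    print([round(p, 4) for p in row])
```

For t = 0 the matrix is the identity, and for long branches both rows approach the stationary distribution, so the state at the end of the branch no longer depends on the starting state.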

output

(source: Kolipakam et al. 2018, https://doi.org/10.1098/rsos.171504)

applications

  • control for common ancestry in statistical models (Jäger & Wahle 2021, …)
  • estimate time depth and geographic location of ancestral populations (Bouckaert et al. 2012)
  • reconstruct properties of ancestral populations (Cathcart et al. 2021, Carling & Cathcart 2021a,b, …)
  • statistically identify patterns of language change (Blasi et al. 2019)

from word lists to trees

  1. perform cognate classification (manual or automatic)
  2. construct binary character matrix
  3. let the computer search for the tree(s) that best explain(s) the distribution of 0s and 1s in the character matrix

manual cognate annotation

phylogenetic signal below cognacy

  • sound change and morphological change contain relevant phylogenetic information

44×4 DataFrame
19 rows omitted
Row Glottolog_Name Concepticon_Gloss Segments Cognateset_ID
String? String String String
1 Eastern Armenian TOOTH ɑ t ɑ m 328
2 Bengali TOOTH d̪ ã t̪ 328
3 Catalan TOOTH d e n 328
4 Danish TOOTH t a nˀ 328
5 Dutch TOOTH t ɑ n t 328
6 Faroese TOOTH tʰ ɔ nː 328
7 French TOOTH d ɑ̃ 328
8 German TOOTH tsʰ aː n 328
9 Modern Greek TOOTH ð o̞ n d i 328
10 Hindi TOOTH d̪ ɑ̃ t̪ 328
11 Icelandic TOOTH tʰ œ nː 328
12 Italian TOOTH d ɛ n t e 328
13 Kashmiri TOOTH d̪ ɑ n d̪ 328
33 Romanian TOOTH d i n t e 328
34 Continental Southern Italian TOOTH r ɛ n t ə 328
35 Kumzari TOOTH d n aː n 328
36 Central Alemannic TOOTH ts a ŋ 328
37 Bakhtiari TOOTH d e n d u n 328
38 Khowar TOOTH d̪ ɔ n̪ 328
39 Gawar-Bati TOOTH d̪ ɜ n̪ t̪ 328
40 Assamese TOOTH d ã t 328
41 Gawri TOOTH d̪ ɔ̰ n̪̥ 328
42 Northwest Pashayi TOOTH d̪ aː n̪ d̪ ə 328
43 Kamviri TOOTH d u t 328
44 Ashkun TOOTH d ʊ̃ t 328

This talk

  • compile a fairly large collection of datasets with
    • manual cognate annotation
    • phonetic transcriptions
  • perform phylogenetic inference with manual cognate classification
  • compare this to four methods to infer trees directly from phonetic transcriptions:
    1. Neighbour-Joining from PMI distances
    2. automatic cognate detection with LexStat \(\Rightarrow\) ML phylogenetic inference
    3. multiple sequence alignment with T-Coffee \(\Rightarrow\) binarization \(\Rightarrow\) ML phylogenetic inference
    4. neural string autoencoder \(\Rightarrow\) ML phylogenetic inference
  • compare quality of results by comparison with Glottolog tree
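
The result tables report this comparison as GQD. Assuming this stands for the generalized quartet distance, it can be sketched as the share of quartets of languages that are resolved in the gold-standard tree but not resolved the same way in the inferred tree. The code below is a brute-force illustration over all quartets, with trees written as nested tuples of leaf names. The function names and the toy trees are hypothetical. Counting a quartet that is unresolved in the inferred tree as different is also an assumption.

```python
from itertools import combinations

def clades(tree):
    """Return (leaf set of the tree, list of leaf sets of all internal nodes)."""
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves, found = frozenset(), []
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found += child_clades
    found.append(leaves)
    return leaves, found

def quartet_topology(quartet, clade_list):
    """Resolved split of a quartet as a set of two pairs, or None if unresolved."""
    q = frozenset(quartet)
    for clade in clade_list:
        inside = clade & q
        if len(inside) == 2:  # this clade separates two leaves from the other two
            return frozenset([inside, q - inside])
    return None

def gqd(gold, inferred):
    """Share of quartets resolved in `gold` that `inferred` resolves differently."""
    gold_leaves, gold_clades = clades(gold)
    inferred_leaves, inferred_clades = clades(inferred)
    shared = sorted(gold_leaves & inferred_leaves)
    resolved = different = 0
    for quartet in combinations(shared, 4):
        gold_split = quartet_topology(quartet, gold_clades)
        if gold_split is None:
            continue  # unresolved in the gold standard: not counted
        resolved += 1
        if quartet_topology(quartet, inferred_clades) != gold_split:
            different += 1
    return different / resolved if resolved else 0.0

gold_tree = ((("a", "b"), "c"), ("d", "e"))      # toy gold standard
inferred_tree = ((("a", "c"), "b"), ("d", "e"))  # toy inferred tree
print(gqd(gold_tree, inferred_tree))
```

Skipping quartets that are unresolved in the gold standard matters for a reference such as Glottolog, which contains many multifurcating nodes: the inferred tree is not penalized for resolving what the reference leaves open.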

Data

  • all Lexibank datasets I could get hold of
16×5 DataFrame
Row db n_entries n_taxa n_concepts n_cognate_classes
String Int64 Int64 Int64 Int64
1 abvdoceanic 67156 376 191 9016
2 grollemundbantu 37252 328 100 3809
3 global 23479 215 100 7388
4 bowernpny 19772 72 338 14164
5 gravinachadic 14438 41 717 10797
6 iecor 10874 63 170 2759
7 mixtecansubgrouping 9676 43 204 710
8 chaconcolumbian 8690 63 125 2429
9 sagartst 8416 34 250 4201
10 walworthpolynesian 6848 28 210 1379
11 seabor 6461 21 250 1802
12 savelyevturkic 6306 24 253 845
13 gerarditupi 6043 30 242 965
14 peirosaustroasiatic 4868 55 100 1423
15 listcognatebenchmark 4386 23 469 1620
16 sidwellbahnaric 4386 19 200 1036

16×5 DataFrame
Row db n_entries n_taxa n_concepts n_cognate_classes
String Int64 Int64 Int64 Int64
1 oskolskayatungusic 3861 11 254 939
2 crossandean 3681 24 150 756
3 dunnaslian 3641 17 146 889
4 saenkoromance 3500 15 110 377
5 robinsonap 3351 11 318 2155
6 utoaztecan 2880 22 121 924
7 dravlex 2111 20 100 778
8 mcelhanonhuon 1816 12 140 929
9 sidwellvietic 1791 10 116 509
10 constenlachibchan 1761 18 110 1040
11 galuciotupi 1753 18 100 654
12 starostinhmongmien 1522 12 102 280
13 felekesemitic 1242 10 138 411
14 chacontukanoan 1209 13 128 136
15 ratcliffearabic 916 10 100 100
16 starostinkaren 171 10 18 39

Phylogenetic inference from manual cognate classification

  • inference with RAxML-NG (Kozlov, Darriba, Flouri, Morel & Stamatakis 2019, "RAxML-NG: A fast, scalable, and user-friendly tool for maximum likelihood phylogenetic inference", Bioinformatics 35(21), 4453-4455, doi:10.1093/bioinformatics/btz305)

  • running example: listcognatebenchmark
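
raxml-ng reads alignments in PHYLIP or FASTA format, so the binary character matrix has to be written to such a file first. The helper below is a hypothetical sketch of that step, with made-up matrix values. The resulting file could then be passed to a call along the lines of `raxml-ng --msa matrix.phy --model BIN+G --prefix run1`. The talk does not state which model or options were actually used, so that command line is an assumption.

```python
def to_phylip(matrix):
    """Serialize {taxon: string over '0', '1', '?'} as relaxed PHYLIP text."""
    widths = {len(row) for row in matrix.values()}
    if len(widths) != 1:
        raise ValueError("all rows must have the same number of characters")
    lines = [f"{len(matrix)} {widths.pop()}"]  # header: number of taxa, number of characters
    for taxon, row in matrix.items():
        # taxon names must not contain whitespace in PHYLIP files
        lines.append(f"{taxon.replace(' ', '_')} {row}")
    return "\n".join(lines) + "\n"

# made-up toy values, for illustration only
toy = {"German": "0110", "Dutch": "0111", "Northwest Pashayi": "1?00"}
phylip_text = to_phylip(toy)
print(phylip_text, end="")
```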

PMI distances
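
Once pairwise distances between languages are available, Neighbour-Joining builds a tree by repeatedly merging the pair of nodes that minimizes the Q criterion. The sketch below shows only this tree-building step and returns the topology without branch lengths. Computing PMI distances from aligned word pairs is not shown, and the distance values are a small textbook-style toy matrix, not PMI distances from the talk.

```python
def neighbour_joining(names, dist):
    """Neighbour-Joining topology as nested tuples.

    names: list of taxon names
    dist:  dict mapping frozenset({x, y}) to the distance between x and y
    """
    nodes = list(names)
    d = dict(dist)
    while len(nodes) > 2:
        n = len(nodes)
        # row sums of the current distance matrix
        r = {x: sum(d[frozenset((x, y))] for y in nodes if y != x) for x in nodes}
        # pair with minimal Q(x, y) = (n - 2) * d(x, y) - r(x) - r(y)
        a, b = min(
            ((x, y) for i, x in enumerate(nodes) for y in nodes[i + 1:]),
            key=lambda p: (n - 2) * d[frozenset(p)] - r[p[0]] - r[p[1]],
        )
        new = (a, b)
        rest = [z for z in nodes if z != a and z != b]
        for z in rest:
            # distance from the new internal node to every remaining node
            d[frozenset((new, z))] = 0.5 * (
                d[frozenset((a, z))] + d[frozenset((b, z))] - d[frozenset((a, b))]
            )
        nodes = rest + [new]
    return tuple(nodes)

# toy distances between five taxa (illustrative values only)
pairs = {
    ("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
    ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
    ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3,
}
toy_dist = {frozenset(k): float(v) for k, v in pairs.items()}
nj_tree = neighbour_joining(["a", "b", "c", "d", "e"], toy_dist)
print(nj_tree)
```

For the toy matrix, a and b are joined first, and d and e end up as sisters.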

LexStat

Multiple Sequence Alignment

Example: heavy

    GreekMod    ---v---a--ris
PortugueseST    p-3z---a--8u-
     Italian    p-es---an-te-
     Catalan    f-3S---u--k--
      French    ---l---u--r--
     Swedish    ---s-v-o--r--
      Danish    ---d-h-o--N--
       Dutch    ---z-w-a--r--
      German    ---S-vea-----
     English    ---hEvi------
       Czech    T-ES------ki-
     Russian    tyaZ---o--l3y
      Polish    C-ES------ki-
    BretonST    ---p---onEr--
     Marathi    d--z---3--d--
   Mazhelong    ---c---o-----
       Dashi    ---c---o-----
       Dashi    ---t---o-----
    Gongxing    ---j---o-----
    Gongxing    ---d---o-----
    Gongxing    ---Z---o--N--
    Gongxing    ---c---o--N--

  • gaps are replaced by 0 and non-gaps by 1
  • if a language has multiple entries, the column-wise maximum is used
  • the binarized MSAs of the different concepts are concatenated horizontally

    GreekMod    0001000100111
PortugueseST    1011000100110
     Italian    1011000110110
     Catalan    1011000100100
      French    0001000100100
     Swedish    0001010100100
      Danish    0001010100100
       Dutch    0001010100100
      German    0001011100000
     English    0001111000000
       Czech    1011000000110
     Russian    1111000100111
      Polish    1011000000110
    BretonST    0001000111100
     Marathi    1001000100100
   Mazhelong    0001000100000
       Dashi    0001000100000
    Gongxing    0001000100100
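
The binarization can be sketched in a few lines. The function names are hypothetical. The aligned strings are taken from the "heavy" example, and the second matrix in the concatenation step is made up. Filling in "?" for a language that lacks a concept is an assumption.

```python
def binarize(alignment):
    """alignment: list of (language, aligned string); returns {language: '0101...'}."""
    rows = {}
    for lang, aligned in alignment:
        bits = [0 if ch == "-" else 1 for ch in aligned]  # gap -> 0, segment -> 1
        if lang in rows:
            # several entries for one language: column-wise maximum
            bits = [max(old, new) for old, new in zip(rows[lang], bits)]
        rows[lang] = bits
    return {lang: "".join(map(str, bits)) for lang, bits in rows.items()}

def concatenate(matrices):
    """Join the per-concept matrices horizontally, one row per language."""
    languages = sorted(set().union(*(m.keys() for m in matrices)))
    out = {lang: "" for lang in languages}
    for m in matrices:
        width = len(next(iter(m.values())))
        for lang in languages:
            # assumption: a language without this concept gets missing data
            out[lang] += m.get(lang, "?" * width)
    return out

heavy = binarize([
    ("German", "---S-vea-----"),
    ("English", "---hEvi------"),
    ("Dashi", "---c---o-----"),
    ("Dashi", "---t---o-----"),
    ("Gongxing", "---j---o-----"),
    ("Gongxing", "---d---o-----"),
    ("Gongxing", "---Z---o--N--"),
    ("Gongxing", "---c---o--N--"),
])
other = {"German": "10"}  # made-up second concept, for illustration only
combined = concatenate([heavy, other])
print(heavy["Gongxing"], combined["English"])
```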

String autoencoder

  • two neural networks
    • encoder
    • decoder
  • input for encoder \(=\) output for decoder: ASJP string
  • output for encoder \(=\) input for decoder: binary vector (length 100)

  • combination to character matrices as for MSAs

ds gqd_nj gqd_msa gqd_cc gqd_lexstat gqd_gru
global 0.035 0.028 0.067 0.055 0.067
listcognatebenchmark 0.021 0.05 0.042 0.063 0.039
seabor 0.101 0.17 0.339 0.234 0.105
chaconcolumbian 0.027 0.044 0.228 0.149 0.061
crossandean 0.047 0.078 0.083 0.111 0.064
constenlachibchan 0.298 0.465 0.246 0.365 0.469
utoaztecan 0.112 0.158 0.039 0.114 0.332
gravinachadic 0.475 0.462 0.42 0.48 0.498
grollemundbantu 0.235 0.231 0.203 0.225 0.314
iecor 0.032 0.104 0.007 0.174 0.029
sagartst 0.103 0.164 0.051 0.273 0.138
bowernpny 0.065 0.062 0.128 0.145 0.092
galuciotupi 0.0 0.141 0.084 0.058 0.19
peirosaustroasiatic 0.177 0.078 0.102 0.194 0.065
gerarditupi 0.371 0.304 0.233 0.335 0.231
dravlex 0.281 0.368 0.328 0.435 0.333
robinsonap 0.196 0.336 0.15 0.187 0.168
dunnaslian 0.0 0.147 0.115 0.06 0.0
sidwellvietic 0.231 0.14 0.204 0.269 0.102
abvdoceanic 0.18 0.183 0.025 0.159 0.347
mixtecansubgrouping 0.199 0.212 0.206 0.519 0.227
mcelhanonhuon 0.334 0.353 0.488 0.411 0.353
felekesemitic 0.0 0.0 0.0 0.0 0.0
chacontukanoan 0.421 0.516 0.386 0.443 0.475
savelyevturkic 0.326 0.323 0.385 0.392 0.483
starostinhmongmien 0.204 0.172 0.281 0.277 0.313
starostinkaren 0.413 0.563 0.557 0.497 0.587
saenkoromance 0.25 0.387 0.19 0.219 0.026
sidwellbahnaric 0.06 0.125 0.055 0.09 0.079
oskolskayatungusic 0.323 0.323 0.323 0.379 0.323
walworthpolynesian 0.252 0.289 0.174 0.28 0.24
ratcliffearabic 0.723 0.569 0.354 0.569 0.569

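Difference to gold standard

  • GQD of each method minus GQD of the manual cognate classification (gqd_cc), summarized over the 32 datasets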

4×2 DataFrame
Row method mean GQD
String Float64
1 gqd_nj -6.25e-6
2 gqd_gru 0.0258531
3 gqd_msa 0.0328375
4 gqd_lexstat 0.0521063

4×2 DataFrame
Row method median GQD
String Float64
1 gqd_nj 0.0026
2 gqd_gru 0.02005
3 gqd_msa 0.0297
4 gqd_lexstat 0.0364
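
The summary rows can be checked against the per-dataset scores. The sketch below does this for Neighbour-Joining only, on the assumption that each difference is the method's GQD minus the GQD of the manual cognate classification. The pairs are the rounded gqd_nj and gqd_cc values from the per-dataset table.

```python
from statistics import mean, median

# dataset -> (gqd_nj, gqd_cc), copied from the per-dataset table
scores = {
    "global": (0.035, 0.067), "listcognatebenchmark": (0.021, 0.042),
    "seabor": (0.101, 0.339), "chaconcolumbian": (0.027, 0.228),
    "crossandean": (0.047, 0.083), "constenlachibchan": (0.298, 0.246),
    "utoaztecan": (0.112, 0.039), "gravinachadic": (0.475, 0.42),
    "grollemundbantu": (0.235, 0.203), "iecor": (0.032, 0.007),
    "sagartst": (0.103, 0.051), "bowernpny": (0.065, 0.128),
    "galuciotupi": (0.0, 0.084), "peirosaustroasiatic": (0.177, 0.102),
    "gerarditupi": (0.371, 0.233), "dravlex": (0.281, 0.328),
    "robinsonap": (0.196, 0.15), "dunnaslian": (0.0, 0.115),
    "sidwellvietic": (0.231, 0.204), "abvdoceanic": (0.18, 0.025),
    "mixtecansubgrouping": (0.199, 0.206), "mcelhanonhuon": (0.334, 0.488),
    "felekesemitic": (0.0, 0.0), "chacontukanoan": (0.421, 0.386),
    "savelyevturkic": (0.326, 0.385), "starostinhmongmien": (0.204, 0.281),
    "starostinkaren": (0.413, 0.557), "saenkoromance": (0.25, 0.19),
    "sidwellbahnaric": (0.06, 0.055), "oskolskayatungusic": (0.323, 0.323),
    "walworthpolynesian": (0.252, 0.174), "ratcliffearabic": (0.723, 0.354),
}

# negative values mean Neighbour-Joining is closer to Glottolog than the gold standard
differences = [nj - cc for nj, cc in scores.values()]
mean_diff = mean(differences)
median_diff = median(differences)
print(f"mean {mean_diff:.6f}, median {median_diff:.4f}")
```

With the three-decimal values this gives a mean of about -0.00003 and a median of 0.0025, close to the -6.25e-6 and 0.0026 in the tables, which were presumably computed from unrounded scores.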

A selection of languages from many families

  • 215 languages
  • 100 concepts
1×6 DataFrame
Row ds gqd_nj gqd_msa gqd_cc gqd_lexstat gqd_gru
String31 Float64 Float64 Float64 Float64 Float64
1 global 0.035 0.0279 0.0667 0.0548 0.0674

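Here, Neighbour-Joining and MSA shine: both come closer to Glottolog (GQD 0.035 and 0.028) than the manual cognate classification (0.067)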

Bayesian inference

median posterior GQD

dataset cc lexstat gru msa
bowernpny 0.121 0.150 0.167 0.115
chaconcolumbian 0.023 0.035 0.061 0.047
chacontukanoan 0.487 0.466 0.394 0.498
constenlachibchan 0.184 0.381 0.432 0.473
crossandean 0.045 0.044 0.059 0.056
dravlex 0.287 0.246 0.329 0.326
dunnaslian 0.052 0.131 0.030 0.099
felekesemitic 0.000 0.000 0.217 0.000
galuciotupi 0.127 0.138 0.290 0.151
gerarditupi 0.286 0.435 0.376 0.420
global 0.022 0.073 0.179 0.027
iecor 0.022 0.041 0.027 0.061
listcognatebenchmark 0.031 0.038 0.031 0.038
mcelhanonhuon 0.309 0.234 0.169 0.334
mixtecansubgrouping 0.140 0.485 0.143 0.261
oskolskayatungusic 0.323 0.472 0.379 0.391
robinsonap 0.168 0.168 0.103 0.336
saenkoromance 0.292 0.266 0.167 0.323
sidwellbahnaric 0.067 0.073 0.100 0.152
sidwellvietic 0.296 0.328 0.328 0.167
starostinhmongmien 0.222 0.273 0.255 0.172
starostinkaren 0.605 0.539 0.563 0.605
utoaztecan 0.111 0.176 0.263 0.183

Conclusion

  • direct feature extraction from phonetic transcriptions is competitive with manual or automatic cognate clustering
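  • Neighbour-Joining on PMI distances works surprisingly well: averaged over all datasets, its GQD is on par with that of manual cognate classification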

Things to do

  • better quality measures for automatic phylogenies beyond comparison to Glottolog
  • fine-grained evaluation
    • identifying shared innovations for each clade in automatically inferred trees
    • simulation studies