Gerhard Jäger, Tübingen University

Computational phylogenetics and language (pre)history: Achievements, challenges and prospects

C-Leste, June 14, 2024

phylogenetic linguistics

main goal: infer phylogenetic trees from lexical data

input

important: cognate classification

example data: iecor from lexibank:

21×4 DataFrame

Row	Glottolog_Name	Concepticon_Gloss	Segments	Cognateset_ID
	String?	String	String	String
1	Kamviri	SAND	ts y	108
2	Ashkun	SAND	ʃ ʊː ɽ̃ ɨ	108
3	Irish	WET	fʲ l̥ʲ ʊ x	1080
4	Irish	YEAR	bʲ lʲ i ə nʲ	1088
5	Latvian	MAN	v iː r s	1094
6	Lithuanian	MAN	ʋʲ iː r ɐ s	1094
7	Irish	MAN	fʲ æ ɾˠ	1094
8	Northwest Pashayi	MAN	ʋ iˑ ɾ	1094
9	Bengali	BONE	h a ɽ	11
10	Hindi	BONE	ɦ ə ɖː i	11
11	Marathi	BONE	h a ɖ	11
12	Maithili	BONE	h ə ɖː i	11
13	Urdu	BONE	h ə ɖː i	11
14	Kashmiri	BONE	ɨ ɽ i dʒ	11
15	Magahi	BONE	h ə ɖː i	11
16	Bhojpuri	BONE	ɦ ə ɖː i	11
17	Gawar-Bati	BONE	h ɜ ɖ oˑ k i	11
18	Assamese	BONE	h ɑ ɾ	11
19	Gawri	BONE	h ɔ ɖ̥	11
20	Bulgarian	ASH	p ɛ p ɘ lˠ	1105
21	Belarusian	ASH	p ɔ pʲ ɛ lˠ	1105

phylogenetic linguistics

intermediate step

convert cognate classification into binary character matrix
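A minimal sketch of this conversion, assuming one common encoding: each (concept, cognate class) pair becomes one binary character; a language gets 1 if it has a word in that class, 0 if it has a word for the concept in a different class, and ? if no word is recorded for the concept. The function name `binarize_cognates`, the toy rows and the ?-coding of missing data are illustrative assumptions, not taken from the talk.

```python
def binarize_cognates(rows):
    """Turn (language, concept, cognate_class) rows into a binary character matrix."""
    languages = sorted({lang for lang, _, _ in rows})
    # One character (matrix column) per (concept, cognate class) pair.
    characters = sorted({(concept, cc) for _, concept, cc in rows})
    attested = set(rows)
    concepts_by_language = {lang: set() for lang in languages}
    for lang, concept, _ in rows:
        concepts_by_language[lang].add(concept)

    matrix = {}
    for lang in languages:
        cells = []
        for concept, cc in characters:
            if concept not in concepts_by_language[lang]:
                cells.append("?")  # assumption: no word recorded -> missing data
            elif (lang, concept, cc) in attested:
                cells.append("1")
            else:
                cells.append("0")
        matrix[lang] = "".join(cells)
    return characters, matrix


# Hypothetical toy data (not from iecor): two cognate classes for HAND, one for FOOT.
rows = [
    ("LangA", "HAND", "h1"),
    ("LangB", "HAND", "h1"),
    ("LangC", "HAND", "h2"),
    ("LangA", "FOOT", "f1"),
    ("LangC", "FOOT", "f1"),
]
characters, matrix = binarize_cognates(rows)
for lang, cells in matrix.items():
    print(lang, cells)
```

Each row of the resulting matrix is one language and each column one cognate class, which is the input format the tree search operates on.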

phylogenetic linguistics

stochastic model: continuous time Markov chain
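As an illustration, here is a sketch of the simplest such model for binary characters: a two-state chain in which a cognate class is gained at rate `gain` (0 to 1) and lost at rate `loss` (1 to 0). The rate names and the function are assumptions made for this example; the talk does not state which parameterization was used.

```python
import math


def transition_probabilities(gain, loss, t):
    """Return the 2x2 matrix P(t) of a two-state continuous-time Markov chain.

    P[i][j] is the probability of being in state j after time t when starting
    in state i. Closed form of exp(Q t) for Q = [[-gain, gain], [loss, -loss]].
    """
    total = gain + loss
    pi0, pi1 = loss / total, gain / total  # stationary distribution
    decay = math.exp(-total * t)
    return [
        [pi0 + pi1 * decay, pi1 - pi1 * decay],
        [pi0 - pi0 * decay, pi1 + pi0 * decay],
    ]


short_branch = transition_probabilities(gain=0.5, loss=1.0, t=0.1)
long_branch = transition_probabilities(gain=0.5, loss=1.0, t=50.0)
print(short_branch)
print(long_branch)
```

On a short branch the matrix stays close to the identity (little change), and on a very long branch every row approaches the stationary distribution, so the starting state no longer matters.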

phylogenetic linguistics

main goal: infer phylogenetic trees from lexical data

output

(source: Kolipakam et al. 2018, https://doi.org/10.1098/rsos.171504)

applications

control for common ancestry in statistical models (Jäger & Wahle 2021, …)
estimate time depth and geographic location of ancestral populations (Bouckaert et al. 2012)
reconstruct properties of ancestral populations (Cathcart et al. 2021, Carling & Cathcart 2021a,b, …)
statistically identify patterns of language change (Blasi et al. 2019)
…

from word lists to trees

perform cognate classification (manual or automatic)
construct binary character matrix
let the computer search for the tree(s) that best explain the distribution of 0s and 1s in the character matrix
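What "best explains" can mean in a likelihood setting is sketched below: the probability of the character matrix given a tree is computed with Felsenstein's pruning algorithm, and a search would compare this score across candidate trees. This is a simplified illustration under stated assumptions (two-state model with hypothetical `gain` and `loss` rates, fixed branch lengths, no missing data, no rate heterogeneity or ascertainment correction), not the procedure of any particular software. The tree encoding (a leaf is a name, an internal node a list of `(child, branch_length)` pairs) and the toy data are made up for the example.

```python
import math


def transition(gain, loss, t):
    """Two-state CTMC transition matrix for a branch of length t."""
    total = gain + loss
    pi0, pi1 = loss / total, gain / total
    decay = math.exp(-total * t)
    return [[pi0 + pi1 * decay, pi1 - pi1 * decay],
            [pi0 - pi0 * decay, pi1 + pi0 * decay]]


def conditional(node, states, gain, loss):
    """Likelihood of the observed leaf states below `node`, per state of `node`."""
    if isinstance(node, str):  # leaf: observed state has likelihood 1
        return [1.0 if states[node] == s else 0.0 for s in (0, 1)]
    result = [1.0, 1.0]
    for child, length in node:
        p = transition(gain, loss, length)
        below = conditional(child, states, gain, loss)
        for s in (0, 1):
            result[s] *= p[s][0] * below[0] + p[s][1] * below[1]
    return result


def log_likelihood(tree, matrix, gain, loss):
    """Sum of per-character log-likelihoods, root state drawn from equilibrium."""
    pi = (loss / (gain + loss), gain / (gain + loss))
    n_chars = len(next(iter(matrix.values())))
    total = 0.0
    for j in range(n_chars):
        states = {lang: row[j] for lang, row in matrix.items()}
        l0, l1 = conditional(tree, states, gain, loss)
        total += math.log(pi[0] * l0 + pi[1] * l1)
    return total


# Hypothetical toy matrix: A and B agree everywhere, C differs in two characters.
matrix = {"A": [1, 1, 0, 0], "B": [1, 1, 0, 0], "C": [0, 1, 1, 0]}
tree_ab = [([("A", 0.1), ("B", 0.1)], 0.4), ("C", 0.5)]  # groups A with B
tree_ac = [([("A", 0.1), ("C", 0.1)], 0.4), ("B", 0.5)]  # groups A with C
ll_ab = log_likelihood(tree_ab, matrix, gain=0.5, loss=1.0)
ll_ac = log_likelihood(tree_ac, matrix, gain=0.5, loss=1.0)
print(ll_ab, ll_ac)
```

The tree that groups the two languages with identical rows receives the higher log-likelihood. A real search additionally optimizes branch lengths and rates and explores many topologies.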

manual cognate annotation

phylogenetic signal below cognacy

sound change and morphological change contain relevant phylogenetic information

44×4 DataFrame

19 rows omitted

Row	Glottolog_Name	Concepticon_Gloss	Segments	Cognateset_ID
	String?	String	String	String
1	Eastern Armenian	TOOTH	ɑ t ɑ m	328
2	Bengali	TOOTH	d̪ ã t̪	328
3	Catalan	TOOTH	d e n	328
4	Danish	TOOTH	t a nˀ	328
5	Dutch	TOOTH	t ɑ n t	328
6	Faroese	TOOTH	tʰ ɔ nː	328
7	French	TOOTH	d ɑ̃	328
8	German	TOOTH	tsʰ aː n	328
9	Modern Greek	TOOTH	ð o̞ n d i	328
10	Hindi	TOOTH	d̪ ɑ̃ t̪	328
11	Icelandic	TOOTH	tʰ œ nː	328
12	Italian	TOOTH	d ɛ n t e	328
13	Kashmiri	TOOTH	d̪ ɑ n d̪	328
⋮	⋮	⋮	⋮	⋮
33	Romanian	TOOTH	d i n t e	328
34	Continental Southern Italian	TOOTH	r ɛ n t ə	328
35	Kumzari	TOOTH	d n aː n	328
36	Central Alemannic	TOOTH	ts a ŋ	328
37	Bakhtiari	TOOTH	d e n d u n	328
38	Khowar	TOOTH	d̪ ɔ n̪	328
39	Gawar-Bati	TOOTH	d̪ ɜ n̪ t̪	328
40	Assamese	TOOTH	d ã t	328
41	Gawri	TOOTH	d̪ ɔ̰ n̪̥	328
42	Northwest Pashayi	TOOTH	d̪ aː n̪ d̪ ə	328
43	Kamviri	TOOTH	d u t	328
44	Ashkun	TOOTH	d ʊ̃ t	328

This talk

define largish collection of datasets with
- manual cognate annotation
- phonetic transcriptions
perform phylogenetic inference with manual cognate classification
compare this to four methods to infer trees directly from phonetic transcriptions:
1. Neighbour-Joining from PMI distances
2. automatic cognate detection with LexStat \(\Rightarrow\) ML phylogenetic inference
3. multiple sequence alignment with T-Coffee \(\Rightarrow\) binarization \(\Rightarrow\) ML phylogenetic inference
4. neural string autoencoder \(\Rightarrow\) ML phylogenetic inference
assess the quality of each result by its generalized quartet distance (GQD) to the Glottolog tree
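The result tables in this talk report GQD values. Assuming this stands for the generalized quartet distance, the measure can be sketched as follows: among all four-language subsets that the reference tree resolves into two pairs, count the share that the inferred tree resolves differently. Trees are encoded here as nested tuples of leaf names, and the taxa A to E are hypothetical; counting a quartet that the inferred tree leaves unresolved as a mismatch is a choice made for this sketch.

```python
from itertools import combinations


def clades(tree):
    """Return (leaves below `tree`, leaf sets of all internal nodes in it)."""
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves, found = frozenset(), []
    for child in tree:
        child_leaves, child_found = clades(child)
        leaves |= child_leaves
        found.extend(child_found)
    found.append(leaves)
    return leaves, found


def quartet_topology(quartet, clade_sets):
    """Return the pairing {{a, b}, {c, d}} the tree induces, or None if unresolved."""
    quartet = frozenset(quartet)
    for clade in clade_sets:
        inside = quartet & clade
        if len(inside) == 2:  # the clade separates two quartet members from the rest
            return frozenset([inside, quartet - inside])
    return None


def gqd(gold, inferred):
    """Share of quartets resolved in `gold` that `inferred` resolves differently."""
    leaves, gold_clades = clades(gold)
    _, inferred_clades = clades(inferred)
    resolved = different = 0
    for quartet in combinations(sorted(leaves), 4):
        expected = quartet_topology(quartet, gold_clades)
        if expected is None:  # polytomy in the gold tree: quartet is ignored
            continue
        resolved += 1
        if quartet_topology(quartet, inferred_clades) != expected:
            different += 1
    return different / resolved if resolved else 0.0


# Hypothetical example: the gold tree only commits to the clade (A, B).
gold = (("A", "B"), "C", "D", "E")
inferred = ((("A", "C"), "B"), ("D", "E"))
print(gqd(gold, inferred))
```

Because quartets that the reference tree leaves unresolved are ignored, an inferred binary tree is not penalized for being more detailed than a reference classification with polytomies.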

Data

all Lexibank datasets I could get hold of

16×5 DataFrame

Row	db	n_entries	n_taxa	n_concepts	n_cognate_classes
	String	Int64	Int64	Int64	Int64
1	abvdoceanic	67156	376	191	9016
2	grollemundbantu	37252	328	100	3809
3	global	23479	215	100	7388
4	bowernpny	19772	72	338	14164
5	gravinachadic	14438	41	717	10797
6	iecor	10874	63	170	2759
7	mixtecansubgrouping	9676	43	204	710
8	chaconcolumbian	8690	63	125	2429
9	sagartst	8416	34	250	4201
10	walworthpolynesian	6848	28	210	1379
11	seabor	6461	21	250	1802
12	savelyevturkic	6306	24	253	845
13	gerarditupi	6043	30	242	965
14	peirosaustroasiatic	4868	55	100	1423
15	listcognatebenchmark	4386	23	469	1620
16	sidwellbahnaric	4386	19	200	1036

16×5 DataFrame

Row	db	n_entries	n_taxa	n_concepts	n_cognate_classes
	String	Int64	Int64	Int64	Int64
1	oskolskayatungusic	3861	11	254	939
2	crossandean	3681	24	150	756
3	dunnaslian	3641	17	146	889
4	saenkoromance	3500	15	110	377
5	robinsonap	3351	11	318	2155
6	utoaztecan	2880	22	121	924
7	dravlex	2111	20	100	778
8	mcelhanonhuon	1816	12	140	929
9	sidwellvietic	1791	10	116	509
10	constenlachibchan	1761	18	110	1040
11	galuciotupi	1753	18	100	654
12	starostinhmongmien	1522	12	102	280
13	felekesemitic	1242	10	138	411
14	chacontukanoan	1209	13	128	136
15	ratcliffearabic	916	10	100	100
16	starostinkaren	171	10	18	39

Phylogenetic inference from manual cognate classification

inference with RAxML-NG (Kozlov, A. M., Darriba, D., Flouri, T., Morel, B. & Stamatakis, A. 2019. RAxML-NG: A fast, scalable, and user-friendly tool for maximum likelihood phylogenetic inference. Bioinformatics 35(21), 4453-4455. doi:10.1093/bioinformatics/btz305)

running example: listcognatebenchmark

PMI distances

LexStat

Multiple Sequence Alignment

inspired by "Automated Cognate Detection as a Supervised Link Prediction Task with Cognate Transformer" (Akavarapu & Bhattacharya, EACL 2024)
MSA of all words for a concept, disregarding cognacy
conducted via the T-Coffee algorithm

Example: heavy

    GreekMod    ---v---a--ris
PortugueseST    p-3z---a--8u-
     Italian    p-es---an-te-
     Catalan    f-3S---u--k--
      French    ---l---u--r--
     Swedish    ---s-v-o--r--
      Danish    ---d-h-o--N--
       Dutch    ---z-w-a--r--
      German    ---S-vea-----
     English    ---hEvi------
       Czech    T-ES------ki-
     Russian    tyaZ---o--l3y
      Polish    C-ES------ki-
    BretonST    ---p---onEr--
     Marathi    d--z---3--d--
   Mazhelong    ---c---o-----
       Dashi    ---c---o-----
       Dashi    ---t---o-----
    Gongxing    ---j---o-----
    Gongxing    ---d---o-----
    Gongxing    ---Z---o--N--
    Gongxing    ---c---o--N--

gaps are replaced by 0 and non-gaps by 1
when a language has several entries for a concept, the column-wise maximum is taken
MSAs for different concepts are horizontally concatenated

    GreekMod    0001000100111
PortugueseST    1011000100110
     Italian    1011000110110
     Catalan    1011000100100
      French    0001000100100
     Swedish    0001010100100
      Danish    0001010100100
       Dutch    0001010100100
      German    0001011100000
     English    0001111000000
       Czech    1011000000110
     Russian    1111000100111
      Polish    1011000000110
    BretonST    0001000111100
     Marathi    1001000100100
   Mazhelong    0001000100000
       Dashi    0001000100000
    Gongxing    0001000100100
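The three binarization steps can be sketched as below, using a few rows of the "heavy" alignment shown above. The function names are made up for this example, the second concept is a hypothetical toy alignment, and filling a language that lacks a concept with ? is an assumption (the slides do not say how such gaps are coded).

```python
def binarize_alignment(aligned_words):
    """Map (language, aligned string) pairs for one concept to 0/1 strings."""
    rows = {}
    for language, aligned in aligned_words:
        bits = [0 if symbol == "-" else 1 for symbol in aligned]  # gap -> 0, else 1
        if language in rows:
            # Several entries for one language: keep the column-wise maximum.
            rows[language] = [max(a, b) for a, b in zip(rows[language], bits)]
        else:
            rows[language] = bits
    return {language: "".join(map(str, bits)) for language, bits in rows.items()}


def concatenate(blocks, languages, missing="?"):
    """Horizontally concatenate per-concept blocks into one character matrix."""
    result = {language: "" for language in languages}
    for block in blocks:
        width = len(next(iter(block.values())))
        for language in languages:
            # Assumption: a language without this concept gets missing-data symbols.
            result[language] += block.get(language, missing * width)
    return result


heavy = [
    ("German", "---S-vea-----"),
    ("English", "---hEvi------"),
    ("Dashi", "---c---o-----"),
    ("Dashi", "---t---o-----"),
    ("Gongxing", "---j---o-----"),
    ("Gongxing", "---d---o-----"),
    ("Gongxing", "---Z---o--N--"),
    ("Gongxing", "---c---o--N--"),
]
toy_concept = [("German", "a-b"), ("Dashi", "-cd")]  # hypothetical second concept

heavy_bits = binarize_alignment(heavy)
character_matrix = concatenate(
    [heavy_bits, binarize_alignment(toy_concept)],
    languages=["German", "English", "Dashi", "Gongxing"],
)
for language, row in character_matrix.items():
    print(language, row)
```

For the "heavy" block this reproduces the corresponding rows of the binary matrix above, including the merged rows for Dashi and Gongxing.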

String autoencoder

two neural networks
- encoder
- decoder
input for encoder \(=\) output for decoder: ASJP string
output for encoder \(=\) input for decoder: binary vector (length 100)

the binary vectors are combined into character matrices in the same way as for MSAs

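Results: GQD to the Glottolog tree

columns: nj = Neighbour-Joining from PMI distances, msa = multiple sequence alignment, cc = manual cognate classes, lexstat = LexStat, gru = string autoencoder

ds	gqd_nj	gqd_msa	gqd_cc	gqd_lexstat	gqd_gru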
global	0.035	0.028	0.067	0.055	0.067
listcognatebenchmark	0.021	0.05	0.042	0.063	0.039
seabor	0.101	0.17	0.339	0.234	0.105
chaconcolumbian	0.027	0.044	0.228	0.149	0.061
crossandean	0.047	0.078	0.083	0.111	0.064
constenlachibchan	0.298	0.465	0.246	0.365	0.469
utoaztecan	0.112	0.158	0.039	0.114	0.332
gravinachadic	0.475	0.462	0.42	0.48	0.498
grollemundbantu	0.235	0.231	0.203	0.225	0.314
iecor	0.032	0.104	0.007	0.174	0.029
sagartst	0.103	0.164	0.051	0.273	0.138
bowernpny	0.065	0.062	0.128	0.145	0.092
galuciotupi	0.0	0.141	0.084	0.058	0.19
peirosaustroasiatic	0.177	0.078	0.102	0.194	0.065
gerarditupi	0.371	0.304	0.233	0.335	0.231
dravlex	0.281	0.368	0.328	0.435	0.333
robinsonap	0.196	0.336	0.15	0.187	0.168
dunnaslian	0.0	0.147	0.115	0.06	0.0
sidwellvietic	0.231	0.14	0.204	0.269	0.102
abvdoceanic	0.18	0.183	0.025	0.159	0.347
mixtecansubgrouping	0.199	0.212	0.206	0.519	0.227
mcelhanonhuon	0.334	0.353	0.488	0.411	0.353
felekesemitic	0.0	0.0	0.0	0.0	0.0
chacontukanoan	0.421	0.516	0.386	0.443	0.475
savelyevturkic	0.326	0.323	0.385	0.392	0.483
starostinhmongmien	0.204	0.172	0.281	0.277	0.313
starostinkaren	0.413	0.563	0.557	0.497	0.587
saenkoromance	0.25	0.387	0.19	0.219	0.026
sidwellbahnaric	0.06	0.125	0.055	0.09	0.079
oskolskayatungusic	0.323	0.323	0.323	0.379	0.323
walworthpolynesian	0.252	0.289	0.174	0.28	0.24
ratcliffearabic	0.723	0.569	0.354	0.569	0.569

Difference in GQD to the gold standard (trees from manual cognate classification)

4×2 DataFrame

Row	method	mean GQD
	String	Float64
1	gqd_nj	-6.25e-6
2	gqd_gru	0.0258531
3	gqd_msa	0.0328375
4	gqd_lexstat	0.0521063

4×2 DataFrame

Row	method	median GQD
	String	Float64
1	gqd_nj	0.0026
2	gqd_gru	0.02005
3	gqd_msa	0.0297
4	gqd_lexstat	0.0364
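How such summary figures can be obtained is sketched below, assuming they are the per-dataset differences between a method's GQD and the GQD of the manual cognate classification (`cc`), averaged over datasets. Only three rows of the results table are used here, so the printed numbers will not match the full-table values; the helper name is made up for this example.

```python
from statistics import mean, median

# Three rows copied from the per-dataset GQD table (subset for illustration only).
gqd_table = {
    "iecor": {"nj": 0.032, "msa": 0.104, "cc": 0.007, "lexstat": 0.174, "gru": 0.029},
    "sagartst": {"nj": 0.103, "msa": 0.164, "cc": 0.051, "lexstat": 0.273, "gru": 0.138},
    "abvdoceanic": {"nj": 0.18, "msa": 0.183, "cc": 0.025, "lexstat": 0.159, "gru": 0.347},
}


def difference_to_gold(table, method, gold="cc"):
    """Per-dataset GQD of `method` minus GQD of the gold-standard column."""
    return [row[method] - row[gold] for row in table.values()]


summary = {}
for method in ("nj", "gru", "msa", "lexstat"):
    differences = difference_to_gold(gqd_table, method)
    summary[method] = (mean(differences), median(differences))

for method, (mean_diff, median_diff) in summary.items():
    print(f"{method}: mean {mean_diff:.4f}, median {median_diff:.4f}")
```

A positive value means the method's trees are further from Glottolog than the trees built from manual cognate classes; a value near zero means it performs on par with them.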

A selection of languages from many families

215 languages
100 concepts

1×6 DataFrame

Row	ds	gqd_nj	gqd_msa	gqd_cc	gqd_lexstat	gqd_gru
	String31	Float64	Float64	Float64	Float64	Float64
1	global	0.035	0.0279	0.0667	0.0548	0.0674

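Here, Neighbour-Joining and MSA shine: both come closer to Glottolog than the manual cognate classification (GQD 0.035 and 0.028 vs. 0.067)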

Bayesian inference

median posterior GQD

dataset	cc	lexstat	gru	msa
bowernpny	0.121	0.150	0.167	0.115
chaconcolumbian	0.023	0.035	0.061	0.047
chacontukanoan	0.487	0.466	0.394	0.498
constenlachibchan	0.184	0.381	0.432	0.473
crossandean	0.045	0.044	0.059	0.056
dravlex	0.287	0.246	0.329	0.326
dunnaslian	0.052	0.131	0.030	0.099
felekesemitic	0.000	0.000	0.217	0.000
galuciotupi	0.127	0.138	0.290	0.151
gerarditupi	0.286	0.435	0.376	0.420
global	0.022	0.073	0.179	0.027
iecor	0.022	0.041	0.027	0.061
listcognatebenchmark	0.031	0.038	0.031	0.038
mcelhanonhuon	0.309	0.234	0.169	0.334
mixtecansubgrouping	0.140	0.485	0.143	0.261
oskolskayatungusic	0.323	0.472	0.379	0.391
robinsonap	0.168	0.168	0.103	0.336
saenkoromance	0.292	0.266	0.167	0.323
sidwellbahnaric	0.067	0.073	0.100	0.152
sidwellvietic	0.296	0.328	0.328	0.167
starostinhmongmien	0.222	0.273	0.255	0.172
starostinkaren	0.605	0.539	0.563	0.605
utoaztecan	0.111	0.176	0.263	0.183

Conclusion

direct feature extraction from phonetic transcriptions is competitive with manual or automatic cognate clustering
Neighbour-Joining works surprisingly well

Things to do

better quality measures for automatic phylogenies beyond comparison to Glottolog
fine-grained evaluation
- identifying shared innovations for each clade in automatically inferred trees
- simulation studies