Building a world tree of languages from ASJP v. 20

In my paper "Global-scale phylogenetic linguistic inference from lexical resources", which appeared in Nature Scientific Data in 2018, I presented a workflow for extracting character matrices from the Automated Similarity Judgment Project (ASJP) database (https://asjp.clld.org/), using the then-current version 17.

These character matrices can be used to perform phylogenetic inference on any subset of the doculects present in ASJP. Using all doculects, I also computed and published a world tree in the mentioned paper.
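To make the notion of a character matrix concrete, here is a minimal sketch of one common encoding: binary presence/absence characters, one per cognate class, with `-` for missing data. The toy cognate judgments, the class labels, and the function name `character_matrix` are all hypothetical; this is not the code from the paper, and the matrices used there may be laid out differently.

```julia
# Hypothetical cognate judgments: (doculect, concept) => cognate class label.
# A missing pair means the doculect has no entry for that concept.
cognates = Dict(
    ("ENGLISH", "hand")  => "hand_1",
    ("GERMAN",  "hand")  => "hand_1",
    ("FRENCH",  "hand")  => "hand_2",
    ("ENGLISH", "water") => "water_1",
    ("GERMAN",  "water") => "water_1",
)

# Build one binary character per cognate class.
# Returns the sorted class labels (the columns) and a Dict that maps
# each doculect to its row, written as a string over '0', '1' and '-'.
function character_matrix(cognates)
    doculects = sort!(unique([d for (d, _) in keys(cognates)]))
    # Remember which concept each cognate class belongs to
    concept_of = Dict(cls => concept for ((_, concept), cls) in cognates)
    classes = sort!(collect(keys(concept_of)))

    function state(d, cls)
        key = (d, concept_of[cls])
        haskey(cognates, key) || return '-'        # concept not attested
        return cognates[key] == cls ? '1' : '0'    # class present / absent
    end

    rows = Dict(d => join(state(d, cls) for cls in classes) for d in doculects)
    return classes, rows
end

classes, rows = character_matrix(cognates)
for d in sort!(collect(keys(rows)))
    println(rpad(d, 10), rows[d])
end
```

Restricting the analysis to a subset of doculects then amounts to keeping only the corresponding rows before handing the matrix to a phylogenetic inference program.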

Since then, I have run the same workflow on subsequent versions of ASJP and published the code and the results on OSF:

ASJP v. 18: https://osf.io/sdca4/
ASJP v. 19: https://osf.io/a97sz/

ASJP version 20 has now been out for more than a year, so it is time to update the world tree as well.

One crucial intermediate step in my workflow is automatic cognate detection. So far, I have used a variant of the method from Jäger, List & Sofroniev (2017), "Using support vector machines and state-of-the-art algorithms for phonetic alignment to identify cognates in multi-lingual wordlists", for this task. That method is built around a support vector machine.

In view of the dramatic progress in machine learning during the past six years, I decided to use a more modern method for automatic cognate detection. Also, since the release of Lexibank, more and better training data are available.

On this site, I will publish the source code for the entire workflow, from downloading ASJP and Lexibank to the final world tree. When the project is finished, I will also upload everything to OSF.

The entire project is written in Julia, which is a pretty cool programming language. It feels like a hybrid of Python and R, but with sufficient diligence, it can be made to be as fast as C.

Let us start with setting up the infrastructure.

infrastructure.jl

In the next step I construct the Glottolog tree based on the ASJP metadata.

constructGlottologTree.jl
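As a rough illustration of this step, a classification tree can be assembled by inserting each doculect's lineage as a path from the root and then serializing the result as a Newick string. The sketch below is not the content of constructGlottologTree.jl: the `lineages` dictionary, its three toy entries, the root label `World`, and the names `Node`, `insert_path!`, `newick` and `glottolog_tree` are all assumptions made for the example. In the real workflow the lineages would be read from the ASJP metadata.

```julia
# Hypothetical input: doculect name => Glottolog lineage, top-level family first.
lineages = Dict(
    "GERMAN"  => ["Indo-European", "Germanic", "West Germanic"],
    "ENGLISH" => ["Indo-European", "Germanic", "West Germanic"],
    "FRENCH"  => ["Indo-European", "Italic"],
)

# A tree node: a label plus its children, indexed by label
struct Node
    name::String
    children::Dict{String,Node}
end
Node(name) = Node(name, Dict{String,Node}())

# Walk down from the root along `path`, creating missing nodes on the way.
# Characters that are unsafe in unquoted Newick labels become underscores.
function insert_path!(root::Node, path::Vector{String})
    node = root
    for label in path
        key = replace(label, r"[^\w-]" => "_")
        node = get!(() -> Node(key), node.children, key)
    end
    return root
end

# Serialize a node and its descendants as Newick with internal node labels
function newick(node::Node)
    isempty(node.children) && return node.name
    kids = [newick(node.children[k]) for k in sort!(collect(keys(node.children)))]
    return "(" * join(kids, ",") * ")" * node.name
end

# Build the full tree; each doculect becomes a leaf below its lineage
function glottolog_tree(lineages)
    root = Node("World")
    for name in sort!(collect(keys(lineages)))
        insert_path!(root, vcat(lineages[name], name))
    end
    return newick(root) * ";"
end

println(glottolog_tree(lineages))
```

Children are sorted by label so that the output is deterministic. Unary nodes such as `West_Germanic` are kept as they are here; whether to collapse them depends on what the downstream software expects.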