Building a world tree of languages from ASJP v. 20
In my paper "Global-scale phylogenetic linguistic inference from lexical resources", which appeared in Scientific Data in 2018, I present a workflow to extract character matrices from the Automated Similarity Judgment Project data (https://asjp.clld.org/), using the then-current version 17.
These character matrices can be used to perform phylogenetic inference on any subset of the doculects present in ASJP. Using all doculects, I also computed and published a world tree in the mentioned paper.
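To make the notion of a character matrix concrete, here is a minimal toy sketch in Julia. It is not the code from the paper's workflow: the doculects, concepts, cognate class labels, and the helper `character_matrix` are all invented for illustration. It only shows the general idea of turning cognate judgments into a binary presence/absence matrix with missing entries.

```julia
# Toy cognate judgments: (doculect, concept, cognate class).
# All values are invented for illustration; they are not ASJP data.
judgments = [
    ("German",  "hand",  1),
    ("English", "hand",  1),
    ("French",  "hand",  2),
    ("German",  "water", 3),
    ("English", "water", 3),
    # French "water" is deliberately left out to produce a missing entry
]

doculects = sort(unique(first.(judgments)))
# One character (column) per (concept, cognate class) pair
classes = sort(unique([(c, k) for (_, c, k) in judgments]))

attested = Set((d, c) for (d, c, _) in judgments)  # concepts with data, per doculect
present = Set(judgments)                            # observed (doculect, concept, class) triples

# Hypothetical helper: rows are doculects, columns are cognate classes.
# 1 = doculect has a word in this class, 0 = doculect has a word for the
# concept but in a different class, missing = no data for the concept.
function character_matrix(doculects, classes, attested, present)
    M = Matrix{Union{Missing,Int}}(missing, length(doculects), length(classes))
    for (i, d) in enumerate(doculects), (j, (c, k)) in enumerate(classes)
        if (d, c) in attested
            M[i, j] = (d, c, k) in present ? 1 : 0
        end
    end
    return M
end

M = character_matrix(doculects, classes, attested, present)
```

Restricting the analysis to a subset of doculects then amounts to selecting the corresponding rows of `M`.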
Since then, I have run the same workflow on subsequent versions of ASJP and published the code and the results on OSF:
- ASJP v. 18: https://osf.io/sdca4/
- ASJP v. 19: https://osf.io/a97sz/
ASJP version 20 has now been out for more than a year, so it is time to update the world tree as well.
One crucial intermediate step in my workflow is automatic cognate detection. So far, I have used a variant of the method from Jäger, List & Sofroniev (2017), "Using support vector machines and state-of-the-art algorithms for phonetic alignment to identify cognates in multi-lingual wordlists", for this task. This method is based on a Support Vector Machine.
In view of the dramatic progress in machine learning during the past six years, I decided to use a more modern method for automatic cognate detection. Also, since the release of Lexibank, more and better training data are available.
On this site, I will publish the source code for the entire workflow, from downloading ASJP and Lexibank to the final world tree. When the project is finished, I will also upload everything to OSF.
The entire project is written in Julia, which is a pretty cool programming language. It feels like a hybrid of Python and R, but with sufficient diligence, it can be made to be as fast as C.
Let us start with setting up the infrastructure.
In the next step I construct the Glottolog tree based on the ASJP metadata.