1 The main idea

Quite a few useful crosslinguistic databases are publicly available, among them the ones used below: Glottolog, WALS, APiCS, Universal Dependencies, and UniMorph.


Extracting and manipulating those datasets is not always straightforward, especially in R!

  • important exceptions:

I would like lingtypR to:

  • make useful datasets readily available in R
  • provide functions that make common tasks easier and faster

2 Installation

For now, the package is still in development and lives on GitLab. It can be installed with the devtools package.
This page will be updated along with more information and examples.


3 add_glottolog(), add_wals(), add_apics()

Functions that add information from a database (Glottolog, WALS, or APiCS) to a typological sample.

  • data: a dataframe of a language sample (requires a column named language or glottocode)
  • by: the column used to match the sample against the database, either by = "language" (default) or by = "glottocode" (add_wals() also accepts by = "wals_code")
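Conceptually, these functions behave like a left join of the sample onto a database table. A minimal base R sketch of that idea, assuming a toy lookup table (toy_db is a hypothetical stand-in for the Glottolog data, not the package's internals):

```r
# Toy sample: one column named "language", as the functions require.
sample_df <- data.frame(
  language = c("Adang", "Lak", "Limbum"),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for a database table (not real package output).
toy_db <- data.frame(
  language   = c("Adang", "Lak", "Limbum", "Peruvian Sign Language"),
  glottocode = c("adan1251", "lakk1252", "limb1268", "peru1235"),
  stringsAsFactors = FALSE
)

# Left join: keep every row of the sample, add the matching database columns.
enriched <- merge(sample_df, toy_db, by = "language", all.x = TRUE)
print(enriched)
```

Languages in the database that are not part of the sample are dropped, and sample languages without a match would get NA in the added columns.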

4 make_map()

A function to plot a map of the sample.

  • data: a dataframe with the following columns
    • language
    • latitude
    • longitude
    • (macroarea)
  • macroarea: option to color languages by macroarea (defaults to FALSE)
  • label: option to plot language labels instead of points (defaults to FALSE)
  • repel: option to repel the labels for better readability (defaults to FALSE)
  • feature: option to color languages by another feature, given as a column name (e.g. some_other_column)
  • legend: option to plot a legend (defaults to FALSE)

Let’s plot a simple map of a dummy dataset, with colors for a continuous feature feature_B.
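A dummy dataset with the required columns could be built as follows. This is only a sketch: all values are invented, and the base R scatterplot at the end is a rough stand-in for the map, not a call to make_map().

```r
# Dummy sample with the columns make_map() expects, plus a continuous
# feature column. All values are invented for illustration.
dummy <- data.frame(
  language  = c("lang_A", "lang_B", "lang_C", "lang_D"),
  latitude  = c(-8.2, 42.1, 6.1, 12.9),
  longitude = c(124.4, 47.1, 10.7, -84.5),
  feature_B = c(0.1, 0.4, 0.7, 1.0),
  stringsAsFactors = FALSE
)

# Check that the required columns are present before plotting.
required <- c("language", "latitude", "longitude")
missing_cols <- setdiff(required, names(dummy))
stopifnot(length(missing_cols) == 0)

# Bin feature_B into five grey shades (darker = higher value).
shades <- grey(seq(0.8, 0.2, length.out = 5))
dummy$colour <- shades[cut(dummy$feature_B, breaks = 5, labels = FALSE)]

# Plain scatterplot of the coordinates, colored by the binned feature.
plot(dummy$longitude, dummy$latitude, col = dummy$colour, pch = 19,
     xlab = "longitude", ylab = "latitude")
```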


Let’s plot the WALS dataset and add colors by macroarea.


5 make_treedata() and make_tree()

Plots the phylogenetic structure of one or more languages, based on the classification in Glottolog.

There are 2 functions:

  • make_treedata():
    • data: a vector of glottocodes
    • returns a dataframe of languages that can be converted into a graph
  • make_tree():
    • data: dataframe with 2 columns (to, from)
    • returns a phylogenetic tree plot
    • this function also works on any other data provided in this format (e.g. if the Glottolog structure is not suitable for some reason)
## [1] "maip1246" "bani1254" "yavi1245" "tain1254" "yawa1261"
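The two-column (to, from) format that make_tree() accepts can also be assembled by hand. A sketch with hypothetical node names, assuming that from holds the parent node and to holds the child node (the source does not spell out the direction):

```r
# Hypothetical lineages, each running from the root down to a language.
lineages <- list(
  c("family_X", "branch_1", "lang_A"),
  c("family_X", "branch_1", "lang_B"),
  c("family_X", "branch_2", "lang_C")
)

# Turn each lineage into parent -> child pairs.
# Assumption: "from" is the parent node, "to" is the child node.
edges <- do.call(rbind, lapply(lineages, function(path) {
  data.frame(to = path[-1], from = path[-length(path)],
             stringsAsFactors = FALSE)
}))

# Shared ancestors produce duplicate edges; keep each edge once.
edges <- unique(edges)
rownames(edges) <- NULL
print(edges)
```

Each non-root node appears exactly once in the to column, which is what makes the edge list a tree.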


6 get_languages() and get_codes()

These are two helper functions:

  • get_languages(): takes a vector of glottocodes and returns a vector of language names (replace = FALSE by default)
  • get_codes(): takes a vector of language names and returns a vector of glottocodes (replace = FALSE by default)
## [1] "Adang"                   "Peruvian Sign Language" 
## [3] "Ndyuka-Trio Pidgin"      "Lak"                    
## [5] "Limbum"                  "Ulwa"                   
## [7] "Ulwa (Papua New Guinea)"
## [1] "adan1251" "lakk1252" "limb1268" "ndyu1241" "peru1235" "ulwa1239" "yaul1241"
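The underlying idea is a two-way lookup between names and codes. A base R sketch with a three-entry toy table; to_codes() and to_languages() are hypothetical helpers written for this illustration, not the package's functions:

```r
# Toy lookup table; the real functions draw on Glottolog.
codes <- c(Adang = "adan1251", Lak = "lakk1252", Limbum = "limb1268")

# Names -> glottocodes: index the named vector by name.
to_codes <- function(languages) unname(codes[languages])

# Glottocodes -> names: invert the lookup first.
names_by_code <- setNames(names(codes), codes)
to_languages <- function(glottocodes) unname(names_by_code[glottocodes])

to_codes(c("Lak", "Adang"))
to_languages(c("limb1268", "unknown0000"))  # unmatched codes give NA
```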

7 get_wals(), get_apics()

These functions return the values for a given combination of languages and features in WALS or APiCS as a dataframe.

  • language: vector of language names (defaults to NULL)
  • glottocode: vector of glottocodes (defaults to NULL)
  • feature: vector of WALS or APiCS feature codes (defaults to NULL)
  • by: by = "language" or by = "glottocode"
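The selection logic can be pictured as filtering a long table with one row per language and feature. A sketch on invented data, assuming that an argument left at NULL means "no restriction" (an interpretation of the defaults above, not confirmed behavior); pick() and toy_wals are hypothetical:

```r
# Hypothetical long-format table: one row per language-feature pair.
# Feature codes and values are placeholders, not real WALS entries.
toy_wals <- data.frame(
  language = c("lang_A", "lang_A", "lang_B", "lang_B"),
  feature  = c("F1", "F2", "F1", "F2"),
  value    = c("x", "y", "x", "z"),
  stringsAsFactors = FALSE
)

# Assumption: NULL means "do not filter on this argument".
pick <- function(data, language = NULL, feature = NULL) {
  keep <- rep(TRUE, nrow(data))
  if (!is.null(language)) keep <- keep & data$language %in% language
  if (!is.null(feature))  keep <- keep & data$feature %in% feature
  data[keep, , drop = FALSE]
}

pick(toy_wals, language = "lang_B", feature = "F2")
```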

8 get_ud()

Easy import of Universal Dependencies (UD) treebanks, which usually come in a nested folder structure and require some data wrangling before everything is available in a useful format.

In order for it to work, the UD data needs to be stored locally on the computer.

  • path: the path to the directory containing the dataset
  • language: the 2-3 letter abbreviation used in the .conllu files
  • sentence: if sentence = FALSE, the sentence column is dropped
## [1] "/home/laura/Documents/projects/corpora/ud-2.8//UD_Akkadian-PISANDUB/akk_pisandub-ud-test.conllu"
## [1] "/home/laura/Documents/projects/corpora/ud-2.8//UD_Akkadian-RIAO/akk_riao-ud-test.conllu"
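Finding the files in the nested folder structure is the first wrangling step. A base R sketch of that step, assuming file names start with the language abbreviation followed by an underscore (as in the Akkadian file names above); find_conllu() is a hypothetical helper, not part of lingtypR:

```r
# Build a throwaway directory that mimics the nested UD layout.
root <- file.path(tempdir(), "ud-demo")
dir.create(file.path(root, "UD_Akkadian-RIAO"),
           recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(root, "UD_German-GSD"),
           recursive = TRUE, showWarnings = FALSE)
file.create(file.path(root, "UD_Akkadian-RIAO", "akk_riao-ud-test.conllu"))
file.create(file.path(root, "UD_German-GSD", "de_gsd-ud-test.conllu"))

# Collect all .conllu files whose name starts with the language code.
find_conllu <- function(path, language) {
  pattern <- paste0("^", language, "_.*\\.conllu$")
  list.files(path, pattern = pattern, recursive = TRUE, full.names = TRUE)
}

basename(find_conllu(root, "akk"))
```

The pattern is matched against the file name only, so the treebank folder names do not matter.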

9 get_heads()

Imagine that we want to analyse the syntactic environments in which adverbs occur in German.


We therefore want to extract the heads of all adverbs in the dataset, and we also want to be able to link each adverb to its head element for analysis.


get_heads() merges a selection of tokens with their heads into a dataframe.

  • data: dataframe in the UD treebank format
  • feature: a feature (<some_column>) used to select tokens
  • value: a feature value in <some_column> used to select tokens
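Linking tokens to their heads amounts to joining the table with itself. A base R sketch on a three-token toy sentence; the column names follow udpipe-style output and are an assumption about the data layout, not taken from lingtypR:

```r
# Toy treebank rows. Column names (sentence_id, token_id, head_token_id,
# upos) are assumed, udpipe-style; head_token_id 0 marks the root.
tokens <- data.frame(
  sentence_id   = c(1, 1, 1),
  token_id      = c(1, 2, 3),
  token         = c("Sie", "singt", "oft"),
  upos          = c("PRON", "VERB", "ADV"),
  head_token_id = c(2, 0, 2),
  stringsAsFactors = FALSE
)

# Select tokens by feature column and value, here upos == "ADV".
selected <- tokens[tokens[["upos"]] == "ADV", ]

# Rename the token columns so they describe the head after the join.
heads <- tokens[, c("sentence_id", "token_id", "token", "upos")]
names(heads) <- c("sentence_id", "head_token_id", "head_token", "head_upos")

# Join each selected token to the row its head_token_id points to.
merged <- merge(selected, heads, by = c("sentence_id", "head_token_id"))
print(merged)
```

Joining on sentence_id as well as the head id matters, because token ids restart in every sentence.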

10 Import UniMorph

UniMorph is a database of inflectional paradigms of 142 languages, containing full paradigms for single lexemes.



lingtypR loads the UniMorph dataset in a dataframe format:


11 … to be continued

  • add more functions to work with UD (get_args())
  • make available other (CLDF) datasets for easy use in R
  • add an R wrapper for epitran (transliterates orthographies to IPA)?