There are quite a few useful crosslinguistic databases that are publicly available.
The extraction and manipulation of those datasets is not always straightforward, especially not in R!
Useful R packages in this area include:

- `multicastR`
- `udpipe` to import treebanks in the .conllu format
- `lingtypology` for dynamic maps and searching a number of databases

I would like `lingtypR` to:
For now, the package is still in development and lives on GitLab. It can be installed using the `devtools` package.
This page will be updated with more information and examples.
```r
## install.packages("devtools")
## devtools::install_git("git@gitlab.com:laurabecker/lingtypr.git")
## or
## install.packages("remotes")
## remotes::install_git("https://gitlab.com/laurabecker/lingtypr.git")
library(lingtypR)
```
## `add_glottolog()`, `add_wals()`, `add_apics()`

A function to add information from a database to a typological sample.

- `data`: a dataframe of a language sample (requires a column `language` or `glottocode`)
- `by`: adds the database info by `by = "language"` (default) or `by = "glottocode"` (and `by = "wals_code"` for `add_wals()`)
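As a minimal sketch of how this might be used (the sample dataframe and its language names are invented for illustration):

```r
library(lingtypR)

# A toy sample with a `language` column (hypothetical data)
sample <- data.frame(language = c("German", "Turkish", "Basque"))

# Add Glottolog information, joining by language name (the default)
sample <- add_glottolog(sample)

# Add WALS information, joining by glottocode instead
sample <- add_wals(sample, by = "glottocode")
```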
## `make_map()`

A function to plot a map of the sample.

- `data`: a dataframe with the following columns: `language`, `latitude`, `longitude` (and `macroarea`)
- `macroarea`: option to color languages by macroarea (defaults to `FALSE`)
- `label`: option to plot language labels instead of points (defaults to `FALSE`)
- `repel`: option to repel the labels for better readability (defaults to `FALSE`)
- `feature`: option to specify another feature (`some_other_column`) to color languages
- `legend`: option to plot a legend (defaults to `FALSE`)

Let's plot a simple map of a dummy dataset, with colors for a continuous feature `feature_B`.
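A minimal sketch of such a dummy dataset (the language names, coordinates, and `feature_B` values are invented; whether `feature` takes the column name as a string is an assumption):

```r
library(lingtypR)

# Invented dummy data for illustration
dummy <- data.frame(
  language  = c("Lang A", "Lang B", "Lang C", "Lang D"),
  latitude  = c(12.5, -3.2, 48.1, 60.0),
  longitude = c(24.0, 142.7, 8.5, -110.3),
  feature_B = c(0.1, 0.4, 0.7, 1.0)
)

# Color the points by the continuous feature `feature_B`
make_map(data = dummy, feature = "feature_B", legend = TRUE)
```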
Let's plot the WALS dataset and add colors by macroarea.

```r
library(dplyr)

wals_langs <- wals %>%
  distinct(language, .keep_all = TRUE) %>%
  filter(!is.na(macroarea))

make_map(data = wals_langs, macroarea = TRUE, legend = TRUE)
```
## `make_treedata()` and `make_tree()`

These functions plot the phylogenetic structure of (a set of) languages based on the classification in Glottolog.
There are two functions:

- `make_treedata()`: `data`: a vector of glottocodes
- `make_tree()`: `data`: a dataframe with 2 columns (`to`, `from`)

```r
## [1] "maip1246" "bani1254" "yavi1245" "tain1254" "yawa1261"
```
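Picking up the glottocodes shown above, the two functions might be combined as follows (a sketch; it assumes that `make_treedata()` returns the two-column `to`/`from` dataframe that `make_tree()` expects):

```r
library(lingtypR)

codes <- c("maip1246", "bani1254", "yavi1245", "tain1254", "yawa1261")

# Build the tree structure from Glottolog ...
treedata <- make_treedata(codes)

# ... and plot it
make_tree(treedata)
```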
## `get_languages()` and `get_codes()`

These are two helper functions:

- `get_languages()`: takes a vector of glottocodes and returns a vector of language names (`replace = FALSE` by default)
- `get_codes()`: takes a vector of language names and returns a vector of glottocodes (`replace = FALSE` by default)

```r
## [1] "Adang"                   "Peruvian Sign Language"
## [3] "Ndyuka-Trio Pidgin"      "Lak"
## [5] "Limbum"                  "Ulwa"
## [7] "Ulwa (Papua New Guinea)"
## [1] "adan1251" "lakk1252" "limb1268" "ndyu1241" "peru1235" "ulwa1239" "yaul1241"
```
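For illustration, output like the above could come from calls such as the following (the input vectors are assumptions inferred from the output shown):

```r
library(lingtypR)

# Glottocodes in, language names out
get_languages(c("adan1251", "peru1235", "ndyu1241", "lakk1252",
                "limb1268", "ulwa1239", "yaul1241"))

# Language names in, glottocodes out
get_codes(c("Adang", "Lak", "Limbum", "Ndyuka-Trio Pidgin",
            "Peruvian Sign Language", "Ulwa", "Ulwa (Papua New Guinea)"))
```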
## `get_wals()`, `get_apics()`

These functions return a language + feature combination in WALS or APiCS as a dataframe.

- `language`: vector of language names (defaults to `NULL`)
- `glottocode`: vector of glottocodes (defaults to `NULL`)
- `feature`: vector of WALS or APiCS feature codes (defaults to `NULL`)
- `by`: `by = "language"` or `by = "glottocode"`
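A hedged sketch: `81A` is the WALS feature code for the order of subject, object and verb, but whether `get_wals()` accepts it in exactly this form is an assumption:

```r
library(lingtypR)

# Word order (WALS feature 81A) for two languages, matched by name
get_wals(language = c("German", "Turkish"), feature = "81A", by = "language")

# The same lookup by glottocode
get_wals(glottocode = c("stan1295", "nucl1301"), feature = "81A",
         by = "glottocode")
```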
## `get_ud()`

Easy import of Universal Dependencies treebanks, which usually come with a nested folder structure and require some data wrangling to import everything in a useful format.
In order for it to work, the UD data needs to be stored locally on the computer.

- `path`: the path to the directory containing the dataset
- `language`: the 2-3 letter abbreviation used in the .conllu files
- `sentence`: if `sentence = FALSE`, the sentence column is dropped

```r
akkadian <- get_ud(language = "akk",
                   path = "/home/laura/Documents/projects/corpora/ud-2.8/",
                   sentence = TRUE)
## [1] "/home/laura/Documents/projects/corpora/ud-2.8//UD_Akkadian-PISANDUB/akk_pisandub-ud-test.conllu"
## [1] "/home/laura/Documents/projects/corpora/ud-2.8//UD_Akkadian-RIAO/akk_riao-ud-test.conllu"
```
## `get_heads()`

Imagine that we want to analyse in which syntactic environments adverbs occur in German.
We therefore want to extract the heads of all adverbs in the dataset, but we also want to be able to link all adverbs to their head elements for analysis.
`get_heads()` merges a selection of tokens with their heads into a dataframe.
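For the adverb scenario above, a call might look like this (a sketch; `german` stands for a dataframe imported with `get_ud()`, and the column name `upos` with the value `ADV` follows the CoNLL-U conventions — whether `get_heads()` expects them in exactly this form is an assumption):

```r
library(lingtypR)

# `german` is assumed to be a UD treebank imported with get_ud()
# Select all adverbs (upos == "ADV") and merge them with their heads
adverbs <- get_heads(data = german, feature = "upos", value = "ADV")
```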
- `data`: dataframe in the UD treebank format
- `feature`: a feature (`<some_column>`) to select tokens
- `value`: a feature value in `<some_column>` to select tokens

## Import UniMorph
UniMorph is a database of inflectional paradigms of 142 languages, containing full paradigms for single lexemes.
`lingtypR` loads the UniMorph dataset in a dataframe format (see `get_args()`).