1 Prepare the data

1.1 Import files

To execute the following code, we need to load three extra packages.

library(tidyverse)
library(ggplot2)
library(ggpubr)

We first create a list of all German text files in the folder. This list is stored in the object de_filelist.

de_filelist <- list.files(path = ".",
                         pattern ="DE.*\\.txt")
length(de_filelist)
## [1] 221

Then, we apply the function that imports the actual data sets to the entire list of German text files. This creates the object de_files, a list of all 221 data frames for German.

de_files <- lapply(de_filelist, function(x) {
  read.delim(x, header = FALSE, sep = "\t", blank.lines.skip = FALSE)
})

length(de_files)
## [1] 221

In order to access an element of a list, we use double square brackets, as in the following command, which outputs the first text in our list of German texts.

de_files[[1]]

The next bit of code does the same for the English texts: it imports all the English text files and creates a list called en_files that contains them as data frames.

en_filelist <- list.files(path = ".",
                         pattern ="US.*\\.txt")
length(en_filelist)
## [1] 92
en_files <- lapply(en_filelist, function(x) {
  read.delim(x, header = FALSE, sep = "\t", blank.lines.skip = FALSE)
})

length(en_files)
## [1] 92
en_files[[1]]

1.2 Arrange data set

We now need to add the filenames as a column to be able to classify the text data frames as written/spoken and formal/informal and to be able to identify the four texts that belong to the same speaker.

This is what the next bit of code does: it goes through the lists of German and English texts, adds the corresponding file name from the file lists as an additional column, and converts each list into one large data frame, one for German and one for English.

We can then add column names to those two data frames, so that we can access the relevant parts of the data set and subset it by modality, speaker, sentence, etc.

for (i in 1:length(de_files)) {
  de_files[[i]] <- cbind(de_files[[i]], de_filelist[i])
}
de_files2 <- do.call("rbind", de_files)
colnames(de_files2) <- c("word_ID", "text", "lemma", "POS", "POS_2", "morphosyntax", "dependency", "syntax", "add_1", "add_2", "file")

for (i in 1:length(en_files)) {
  en_files[[i]] <- cbind(en_files[[i]], en_filelist[i])
}
en_files2 <- do.call("rbind", en_files)
colnames(en_files2) <- c("word_ID", "text", "lemma", "POS", "POS_2", "morphosyntax", "dependency", "syntax", "add_1", "add_2", "file")

The next lines of code add the column language, specifying the language for each line of the two data frames.

de_files2$language <- "German"
en_files2$language <- "English"

With this additional column indicating the language, we can now bind the two data frames together into one large data frame that contains all German and English texts.

all_files <- rbind(de_files2, en_files2)

In addition, we need two more columns that will come in handy when subsetting the data for modality and register:

  • the column register, specifying the register for each line of the two data frames (this is done by checking the file column: if it contains an f followed by two other characters and .txt, the register is formal; otherwise it is informal)
  • the column modality, specifying the modality for each line of the two data frames (done as for the register column, mutatis mutandis, searching for an s followed by one character and .txt)
all_files$register <- ifelse(str_detect(all_files$file, "f.{2}\\.txt"),
                            "formal",
                            "informal")

all_files$modality <- ifelse(str_detect(all_files$file, "s.\\.txt"),
                            "spoken",
                            "written")

We can now take a look at the first and last lines of our data frame to make sure that everything was added correctly.

head(all_files, 10)
tail(all_files, 10)

In order to count words per sentence, we need to exclude the rows that contain punctuation, so that the number of rows per file corresponds to the number of words per text.

all_files  <- all_files[all_files$POS!="PUNCT", ]
all_files <- all_files[all_files$lemma!=",",]

We also want to exclude those rows that contain extra-linguistic material or other languages.

all_files <- all_files[all_files$POS!="SYM", ] 
all_files <- all_files[all_files$POS!="X",] 

1.3 Create sentence ID

To be able to better keep track of the sentences, let’s create an additional column with a sentence ID that gives each sentence an individual number, starting with 1, and incrementing by +1 for each new sentence.

To do so, we define a counter that starts with 1. We will use this counter in a loop later to add +1 for each new sentence.

counter <- 1

We also need to define a column that will contain our sentence ID. We will assign it the value one for all rows for now.

all_files$sentence_id <- 1 

We now need a loop that goes through every row of the data: every time it encounters an empty cell in the POS column, it adds +1 to the counter. In addition, in the loop we set the value of the sentence_id column to the current value of the counter. Since the counter is incremented by +1 every time it sees a sentence boundary, the sentence_id column ends up with a different number for each sentence. We wrap this loop in a helper function and apply it to each text separately, so the sentence IDs start at 1 for every file.

fun_sent_helper <- function(x) {
   for (i in 1:nrow(x)) { 
     if(x[i, "POS"]=="") { 
       counter = counter + 1         
     } 
     x[i, "sentence_id"] <- counter  
   }
   return(x)
}
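
Before applying the helper to the real data, we can check it on a toy data frame (a hypothetical mini example; the expected result in the comment assumes the global counter is still 1):

toy <- data.frame(POS = c("DET", "NOUN", "", "PRON", "VERB"),
                  sentence_id = 1)
fun_sent_helper(toy)$sentence_id
## expected: 1 1 2 2 2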

all_texts <- split(all_files, all_files$file)

all_texts <- lapply(all_texts, fun_sent_helper)

length(all_texts)
## [1] 313
all_files <- do.call(rbind.data.frame, all_texts)

Let’s see if the loop properly generated the sentence IDs in the sentence_id column. Look at the rightmost column (you may need to click on the right arrow to make it visible) of the informal-spoken text: the sentence_id increases with every new sentence.

tail(all_files, 100)

Since we can now keep track of single sentences, we no longer need the otherwise empty rows. The following command deletes those additional rows.

nrow(all_files)
## [1] 52121
all_files <- all_files[all_files$POS!="", ]
nrow(all_files)
## [1] 47349

2 Syntax

2.1 Length of sentences

In order to compare the length of sentences across modalities and registers, we first need to calculate the length of sentences in words. We can then average across sentences per variety and compare those averages.

The first part splits the data into a list of texts, so that each text (file) is its own data frame in the list. This way, we can perform the same operations on every single text.

all_texts <- split(all_files, all_files$file)

The next chunk of code defines a function that calculates the average sentence length per text: the number of words (counted as rows, since the data contains one word per row) divided by the number of sentences.

fun_sent_length <- function(x) {
  length <- nrow(x) / max(x$sentence_id)
  length
}
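
A quick check on a hypothetical mini text (6 words spread over 2 sentences) shows what the function returns:

toy_text <- data.frame(text = c("the", "dog", "barked", "it", "was", "loud"),
                       sentence_id = c(1, 1, 1, 2, 2, 2))
fun_sent_length(toy_text)
## expected: 3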

We now apply this function to our list of texts, add a column name, and add a column with the file name to be able to relate those averages to languages and varieties later.

length_sent <- lapply(all_texts, fun_sent_length) %>%
  unlist() %>%
  as.data.frame()

colnames(length_sent) = c("sent_length")

length_sent$file <- rownames(length_sent)

The next chunk of code adds the register, modality, and language information to the data frame that contains the average sentence lengths.

length_sent$register <- ifelse(str_detect(length_sent$file, "f.{2}\\.txt"),
                            "formal",
                            "informal")

length_sent$modality <- ifelse(str_detect(length_sent$file, "s.\\.txt"),
                              "spoken",
                              "written")

length_sent$language <- ifelse(str_detect(length_sent$file, "DE"),
                              "German",
                              "English")

To make the results interpretable, we can summarize the sentence lengths for language, modality, and register groups. We do that by averaging over the lengths of all sentences that belong to the same combination of language, modality, and register.

summary_length_sent <- length_sent %>% 
  group_by(language, modality, register) %>%
  summarize(average_length_sent = round(mean(sent_length), 1)) %>%
  as.data.frame()

2.2 Summary

The table below shows the average length of sentences (in words) for each of the 8 combinations we are interested in.

summary_length_sent

2.3 Length of texts

We may also want to compare the average length of texts (in number of words) across the 8 varieties.

To do so, the next code chunk adds a column to the data set that counts the number of words per text.

fun_text_length <- function(x) {
  x <- mutate(x, word_id = row_number())
  x
}

text_length <- lapply(all_texts, fun_text_length)
all_files <- as.data.frame(do.call(rbind, text_length))

With this counter, we can calculate the average text length in words across the 8 varieties: the length of each text is the highest word_id in its file, and we average those lengths per variety:

summary_length_texts <- all_files %>%
  group_by(language, modality, register, file) %>%
  summarize(text_length = max(word_id)) %>%
  group_by(language, modality, register) %>%
  summarize(average_length_text = round(mean(text_length), 2)) %>%
  as.data.frame()

2.4 Summary

summary_length_texts

2.5 Clause types

The total number of sentences simply corresponds to the highest value of the sentence_ID column.

Calculating the number of clauses, of specific clause types, and the ratio of main and subordinate clauses requires a number of different steps.

The relevant annotation is mostly contained in the syntax column. The tags used are those of the Universal Dependencies (UD) treebanks, a large, crosslinguistic collection of syntactically annotated corpora (texts).

What we need are the tags that they use for coordination and subordination. You can find the detailed documentation of the tag set as well as explanations here: https://universaldependencies.org/u/overview/complex-syntax.html#subordination

Those are the tags that we are interested in:

tag    explanation and example(s)
cc     conjoined clauses (technically only coordination, although also used for subordination)
       to you and the car and that crazy guy
acl    adnominal clause modifier
       in the end everybody’s ok which is really what matters
advcl  adverbial clause
       the other day you called me when something crazy happened
csubj  clausal subject
       hitting the dog had to break the first car
       to get together these two cars were coming down the street
xcomp  complement clause with obligatory control
       they seemed fine
       I found that really interesting
ccomp  complement clause without control
       I know that they’re fine
       I see a minor accident happen
rel    relative clause pronoun
       check your own damage which is what this driver did

Let’s take a look at some examples:

all_files[all_files$syntax=="cc", ] %>%
  sample_n(10)
all_files[all_files$syntax=="advcl", ] %>%
  sample_n(10)
all_files[all_files$syntax=="csubj", ] %>%
  sample_n(10)
all_files[all_files$syntax=="xcomp", ] %>%
  sample_n(10)
all_files[all_files$syntax=="ccomp", ] %>%
  sample_n(10)

Relative clauses are less consistently annotated. We can combine the counts for the following tags to estimate the number of relative clauses:

  • in some cases, the POS_2 column contains a PRELS tag to signal relative markers
  • the morphosyntax column contains information on the pronoun type; in most cases, relative pronouns are marked as such by PronType=Rel
  • the syntax column uses acl mostly for relative clauses as well
all_files[all_files$POS_2=="PRELS" | all_files$syntax=="acl" | str_detect(all_files$morphosyntax, "PronType=Rel"), ] %>%
  sample_n(20)

2.6 Number of clauses in the texts

The next chunk of code calculates the number of those clause types per text and saves the counts as a data frame.

clause_counts <- lapply(all_texts, function(x){
  n_sent <- max(x$sentence_id)
  n_cc <- nrow(x[x$syntax=="cc", ])
  n_advcl <- nrow(x[x$syntax=="advcl", ])
  n_csubj <- nrow(x[x$syntax=="csubj", ])
  n_comp <- nrow(x[x$syntax=="xcomp", ]) + nrow(x[x$syntax=="ccomp", ])
  n_rel <- nrow(x[x$POS_2=="PRELS" | x$syntax=="acl" | str_detect(x$morphosyntax, "PronType=Rel"), ]) 
  file <- as.character(x$file[1])
  d <- cbind(n_sent, n_cc, n_advcl, n_csubj, n_comp, n_rel, file)
  d
})

clause_counts <- as.data.frame(do.call(rbind, clause_counts))

The head of this data frame looks like this:

head(clause_counts)

Again, to be able to compare the number of different types of clauses across varieties, we need to add language, modality, and register information to the data frame containing the clause counts.

clause_counts$register <- ifelse(str_detect(clause_counts$file, "f.{2}\\.txt"),
                                  "formal",
                              "informal")

clause_counts$modality <- ifelse(str_detect(clause_counts$file, "s.\\.txt"),
                                  "spoken",
                              "written")

clause_counts$language <- ifelse(str_detect(clause_counts$file, "DE"),
                                  "German",
                              "English")

2.7 Number of clause types as ratios per number of sentences

The next chunk of code first converts the counts from characters into numbers. With this, we can calculate the ratios of clause types per number of sentences for each text. Remember, this is important because the texts have different lengths.

To interpret the clause/sentence ratios for the different types of clauses, we average and summarize across the 8 different varieties. This is done in the second part of the code chunk below.

clause_counts[, 1:6] <- lapply(clause_counts[, 1:6], as.character)
clause_counts[, 1:6] <- lapply(clause_counts[, 1:6], as.numeric)

clause_types_summary <- clause_counts %>% 
  group_by(language, modality, register) %>%
  summarize(mean_cc = round(mean(n_cc/n_sent), 2),
            mean_advcl = round(mean(n_advcl/n_sent), 2),
            mean_comp = round(mean((n_comp)/n_sent), 2),
            mean_csubj = round(mean(n_csubj/n_sent), 2),
            mean_rel = round(mean(n_rel/n_sent), 2)            
            ) %>%
  as.data.frame

clause_counts_summary <- clause_counts %>% 
  group_by(language, modality, register) %>%
  summarize(mean_subtotal = round(mean((n_advcl + n_csubj + n_comp + n_rel)/n_sent), 2)
            ) %>%
  as.data.frame

2.8 Summary

The table below summarizes the various clause types in terms of how frequent they are in the 8 varieties we are comparing. Their frequency is expressed relative to the number of sentences.

clause_types_summary

The next table shows the overall number of subordinate clauses in relation to the number of sentences across the 8 varieties.

clause_counts_summary

2.9 Number of sentence-initial CCONJ

There is another issue that comes with counting conjunctions and that we have ignored so far: many sentences start with and, but, or or. Those are included in clause_counts_summary. This means we may have counted more conjoined clauses than there actually are, since such clauses were already counted as new sentences. To exclude conjunctions that occur at the beginning of sentences rather than within them from our clause counts, we can do the following.

We do not keep track of sentence-initial CCONJ only to exclude them from the clause counts: the count also shows how often sentence-initial conjunctions occur across the 8 varieties that we want to compare.

We first split our data frame into a list of data frames such that each data frame corresponds to a single sentence and is a separate element in our list.

sentence_list <- split(all_files, list(all_files$sentence_id,all_files$file), drop=TRUE)

Now, we want to go through the list and check, for each sentence separately, whether its first element is a CCONJ. This is what the next commands do. In addition, they create a data frame that relates each sentence to its file and records whether or not the sentence starts with a CCONJ.

sent_init <- lapply(sentence_list, function(x) {
  isTRUE(x[1, "POS"]=="CCONJ")
})

sent_init <- cbind(sent_init, names(sent_init))

colnames(sent_init) <- c("is_cconj", "file")

sent_init <- as.data.frame(sent_init)

sent_init$file <- str_replace_all(sent_init$file, "[0-9][0-9]?\\.", "")
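
The last line strips a leading sentence number and period from the file names. This is necessary because split() with a list of grouping variables pastes the two values together with a period, so the list names look like "1.DEbi01FT_fsD.txt". We can inspect a few of them like this:

head(names(sentence_list), 3)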

2.10 Summary

Again, to make this interpretable, we group all sentences by their file, and sum up the number of sentences that start with CCONJ.

sent_init_summary <-
  sent_init %>%
  group_by(file) %>%
  summarize(sum(is_cconj==TRUE)) %>%
  as.data.frame()

colnames(sent_init_summary) <- c("file", "sent_init_cconj")

The table below shows the number of sentence-initial CCONJ for all files in the data set. Some numbers are pretty high compared to others, suggesting that, indeed, it is important to look at their distribution across the different text varieties.

sent_init_summary

Again, to examine the number of sentence-initial conjunctions across the 8 varieties, we need to add register, modality, and language information to our summary data set.

sent_init_summary$register <- ifelse(str_detect(sent_init_summary$file, "f.{2}\\.txt"),
                                    "formal",
                                    "informal")

sent_init_summary$modality <- ifelse(str_detect(sent_init_summary$file, "s.\\.txt"),
                                    "spoken",
                                    "written")

sent_init_summary$language <- ifelse(str_detect(sent_init_summary$file, "DE"),
                                    "German",
                                    "English")

The next code chunk summarizes the counts of sentence-initial CCONJ across the 8 varieties.

sent_init_summary[, 2] <- as.character(sent_init_summary[, 2])
sent_init_summary[, 2] <- as.numeric(sent_init_summary[, 2])

sent_init_summary2 <- sent_init_summary %>%
  group_by(language, modality, register) %>%
  summarize(mean_sent_init_conj = round(mean(sent_init_cconj), 2)
            ) %>%
  as.data.frame

sent_init_summary2

2.11 Conjunctions and subjunctions

Let’s look at the types of conjunctions and subjunctions used in the 8 varieties in more detail.

The following code chunk selects the rows that contain conjunctions and subjunctions, i.e. the rows tagged as CCONJ or SCONJ in POS. In addition, the German corpus uses KOUS in the POS_2 column to mark subjunctions, so rows with KOUS are counted as subjunctions rather than conjunctions.

conj <- all_files[all_files$POS=="CCONJ" & all_files$POS_2!="KOUS", ]
nrow(conj)
## [1] 2492
subj <- all_files[all_files$POS=="SCONJ" | all_files$POS_2=="KOUS", ]
nrow(subj)
## [1] 646

We then split both the conj and subj data sets into 8 data sets according to language, modality, and register.

conj <- with(conj, split(conj, list(conj$language, conj$modality, conj$register)))
subj <- with(subj, split(subj, list(subj$language, subj$modality, subj$register)))

Now we can group conjunctions and subjunctions by lemma and rank the lemmas according to their frequency. The next two code chunks do that for the 8 varieties for both conjunctions and subjunctions.

conj_lemmas <- lapply(conj, function(x) {
  x %>%
    group_by(lemma) %>%
    summarize(n_lemmas = n()) %>%
    as.data.frame()
})

subj_lemmas <- lapply(subj, function(x) {
  x %>%
    group_by(lemma) %>%
    summarize(n_lemmas = n()) %>%
    as.data.frame()
})

The only thing left now is to rank the lemmas according to their frequency. This is what the next code chunk does, again for both conjunctions and subjunctions.

conj_lemmas_list <- lapply(conj_lemmas, function(x) {
  helper <- x[order(-x$n_lemmas), ]
  helper$rank <- c(1:nrow(x))
  helper
})

subj_lemmas_list <- lapply(subj_lemmas, function(x) {
  helper <- x[order(-x$n_lemmas), ]
  helper$rank <- c(1:nrow(x))
  helper
})

2.12 Summary

The 8 data frames below rank the conjunctions used in the texts. Additionally, they show the proportion of each lemma.

conj_lemmas_list <- lapply(conj_lemmas_list, function(x) {
  mutate(x,  prop = n_lemmas / sum(n_lemmas))
})

conj_lemmas_list
## $English.spoken.formal
##    lemma n_lemmas rank        prop
## 1    and      201    1 0.893333333
## 2    but       15    2 0.066666667
## 3     or        8    3 0.035555556
## 4 either        1    4 0.004444444
## 
## $German.spoken.formal
##              lemma n_lemmas rank        prop
## 1              und      546    1 0.833587786
## 2             also       56    2 0.085496183
## 3             aber       21    3 0.032061069
## 4             oder       14    4 0.021374046
## 5            genau        4    5 0.006106870
## 6          sondern        3    6 0.004580153
## 7           jedoch        2    7 0.003053435
## 8              als        1    8 0.001526718
## 9             weil        1    9 0.001526718
## 10 beziehungsweise        1   10 0.001526718
## 11              ja        1   11 0.001526718
## 12             nur        1   12 0.001526718
## 13   währenddessen        1   13 0.001526718
## 14            denn        1   14 0.001526718
## 15        entweder        1   15 0.001526718
## 16  beziehungweise        1   16 0.001526718
## 
## $English.written.formal
##   lemma n_lemmas rank      prop
## 1   and      112    1 0.9032258
## 2   but        6    2 0.0483871
## 3    or        6    3 0.0483871
## 
## $German.written.formal
##              lemma n_lemmas rank        prop
## 1              und      285    1 0.907643312
## 2              als       10    2 0.031847134
## 3             oder        4    3 0.012738854
## 4             also        4    4 0.012738854
## 5             aber        2    5 0.006369427
## 6             bzw.        2    6 0.006369427
## 7  beziehungsweise        1    7 0.003184713
## 8          sondern        1    8 0.003184713
## 9            weder        1    9 0.003184713
## 10          jedoch        1   10 0.003184713
## 11           sowie        1   11 0.003184713
## 12            doch        1   12 0.003184713
## 13    Währendessen        1   13 0.003184713
## 
## $English.spoken.informal
##     lemma n_lemmas rank        prop
## 1     and      234    1 0.850909091
## 2     but       29    2 0.105454545
## 3      or       11    3 0.040000000
## 4 because        1    4 0.003636364
## 
## $German.spoken.informal
##             lemma n_lemmas rank        prop
## 1             und      464    1 0.768211921
## 2            also       49    2 0.081125828
## 3            aber       48    3 0.079470199
## 4            oder       28    4 0.046357616
## 5            weil        6    5 0.009933775
## 6         sondern        4    6 0.006622517
## 7             als        3    7 0.004966887
## 8 beziehungsweise        1    8 0.001655629
## 9           weder        1    9 0.001655629
## 
## $English.written.informal
##   lemma n_lemmas rank       prop
## 1   and       82    1 0.88172043
## 2   but       10    2 0.10752688
## 3    or        1    3 0.01075269
## 
## $German.written.informal
##    lemma n_lemmas rank        prop
## 1    und      156    1 0.772277228
## 2   aber       22    2 0.108910891
## 3    als        8    3 0.039603960
## 4   also        6    4 0.029702970
## 5   oder        3    5 0.014851485
## 6   doch        2    6 0.009900990
## 7   Also        2    7 0.009900990
## 8    Und        1    8 0.004950495
## 9   denn        1    9 0.004950495
## 10   Als        1   10 0.004950495

The 8 data frames below rank the subjunctions used in the texts. Additionally, they show the proportion of each lemma.

subj_lemmas_list <- lapply(subj_lemmas_list, function(x) {
  mutate(x,  prop = n_lemmas / sum(n_lemmas))
})

subj_lemmas_list
## $English.spoken.formal
##      lemma n_lemmas rank       prop
## 1     that       21    1 0.37500000
## 2       as       10    2 0.17857143
## 3  because        6    3 0.10714286
## 4       so        5    4 0.08928571
## 5    while        5    5 0.08928571
## 6     when        3    6 0.05357143
## 7    which        2    7 0.03571429
## 8       if        2    8 0.03571429
## 9     like        1    9 0.01785714
## 10    than        1   10 0.01785714
## 
## $German.spoken.formal
##      lemma n_lemmas rank        prop
## 1     dass       45    1 0.288461538
## 2      als       33    2 0.211538462
## 3      wie       15    3 0.096153846
## 4     weil       14    4 0.089743590
## 5       um        7    5 0.044871795
## 6       da        5    6 0.032051282
## 7       ob        5    7 0.032051282
## 8       wo        5    8 0.032051282
## 9   sodass        5    9 0.032051282
## 10  soweit        4   10 0.025641026
## 11   falls        4   11 0.025641026
## 12 nachdem        4   12 0.025641026
## 13    wenn        3   13 0.019230769
## 14   damit        2   14 0.012820513
## 15     was        1   15 0.006410256
## 16  sobald        1   16 0.006410256
## 17   indem        1   17 0.006410256
## 18   bevor        1   18 0.006410256
## 19 während        1   19 0.006410256
## 
## $English.written.formal
##      lemma n_lemmas rank       prop
## 1       as       20    1 0.32786885
## 2    while       12    2 0.19672131
## 3     that       10    3 0.16393443
## 4     when       10    4 0.16393443
## 5       if        3    5 0.04918033
## 6       so        1    6 0.01639344
## 7  because        1    7 0.01639344
## 8     like        1    8 0.01639344
## 9    after        1    9 0.01639344
## 10   until        1   10 0.01639344
## 11  whilst        1   11 0.01639344
## 
## $German.written.formal
##              lemma n_lemmas rank        prop
## 1              als       39    1 0.375000000
## 2             dass       12    2 0.115384615
## 3               da       10    3 0.096153846
## 4              wie        9    4 0.086538462
## 5           sodass        9    5 0.086538462
## 6               um        4    6 0.038461538
## 7          während        3    7 0.028846154
## 8               ob        2    8 0.019230769
## 9           jedoch        2    9 0.019230769
## 10           bevor        2   10 0.019230769
## 11            denn        2   11 0.019230769
## 12            weil        1   12 0.009615385
## 13 beziehungsweise        1   13 0.009615385
## 14             die        1   14 0.009615385
## 15              wo        1   15 0.009615385
## 16          sobald        1   16 0.009615385
## 17           falls        1   17 0.009615385
## 18           sowie        1   18 0.009615385
## 19           indem        1   19 0.009615385
## 20         nachdem        1   20 0.009615385
## 21             als        1   21 0.009615385
## 
## $English.spoken.informal
##     lemma n_lemmas rank       prop
## 1      so       28    1 0.30769231
## 2    that       24    2 0.26373626
## 3 because       16    3 0.17582418
## 4      as       10    4 0.10989011
## 5    when        7    5 0.07692308
## 6    like        2    6 0.02197802
## 7  though        2    7 0.02197802
## 8   cause        1    8 0.01098901
## 9   until        1    9 0.01098901
## 
## $German.spoken.informal
##      lemma n_lemmas rank        prop
## 1     weil       30    1 0.288461538
## 2     dass       23    2 0.221153846
## 3      als       17    3 0.163461538
## 4      wie        7    4 0.067307692
## 5     wenn        6    5 0.057692308
## 6       ob        5    6 0.048076923
## 7   sodass        2    7 0.019230769
## 8    damit        2    8 0.019230769
## 9   obwohl        2    9 0.019230769
## 10      um        1   10 0.009615385
## 11      da        1   11 0.009615385
## 12    also        1   12 0.009615385
## 13  soweit        1   13 0.009615385
## 14      wo        1   14 0.009615385
## 15  sobald        1   15 0.009615385
## 16     bis        1   16 0.009615385
## 17   bevor        1   17 0.009615385
## 18 während        1   18 0.009615385
## 19 nachdem        1   19 0.009615385
## 
## $English.written.informal
##     lemma n_lemmas rank       prop
## 1      so        8    1 0.23529412
## 2      as        7    2 0.20588235
## 3    when        5    3 0.14705882
## 4 because        5    4 0.14705882
## 5    that        3    5 0.08823529
## 6   while        2    6 0.05882353
## 7  though        2    7 0.05882353
## 8      if        1    8 0.02941176
## 9   since        1    9 0.02941176
## 
## $German.written.informal
##      lemma n_lemmas rank  prop
## 1     weil       11    1 0.275
## 2      als        9    2 0.225
## 3     dass        3    3 0.075
## 4   sodass        3    4 0.075
## 5  nachdem        3    5 0.075
## 6      wie        2    6 0.050
## 7       da        2    7 0.050
## 8       wo        2    8 0.050
## 9     wenn        2    9 0.050
## 10      um        1   10 0.025
## 11   damit        1   11 0.025
## 12 während        1   12 0.025

2.13 Conjunctions and subjunctions

The code below does the same for the combination of conjunctions and subjunctions. This way, we can compare the proportions of coordinating and subordinating con/subjunctions.

cc <- all_files[all_files$syntax=="cc", ]
nrow(cc)
## [1] 2781
cc <- with(cc, split(cc, list(cc$language, cc$modality, cc$register)))

cc_lemmas <- lapply(cc, function(x) {
  x %>%
    group_by(lemma) %>%
    summarize(n_lemmas = n()) %>%
    as.data.frame()
})

cc_lemmas_list <- lapply(cc_lemmas, function(x) {
  helper <- x[order(-x$n_lemmas), ]
  helper$rank <- c(1:nrow(x))
  helper
})

cc_lemmas_list <- lapply(cc_lemmas_list, function(x) {
  mutate(x,  prop = n_lemmas / sum(n_lemmas))
})

2.14 Summary

The tables below show the proportion of coordinating and subordinating con/subjunctions.

cc_lemmas_list
## $English.spoken.formal
##   lemma n_lemmas rank       prop
## 1   and      201    1 0.89732143
## 2   but       15    2 0.06696429
## 3    or        8    3 0.03571429
## 
## $German.spoken.formal
##              lemma n_lemmas rank        prop
## 1              und      566    1 0.752659574
## 2             also       42    2 0.055851064
## 3             dass       26    3 0.034574468
## 4             aber       20    4 0.026595745
## 5              als       16    5 0.021276596
## 6             oder       14    6 0.018617021
## 7             weil       10    7 0.013297872
## 8               äh        5    8 0.006648936
## 9               ja        5    9 0.006648936
## 10              ob        5   10 0.006648936
## 11             wie        4   11 0.005319149
## 12              da        4   12 0.005319149
## 13          soweit        4   13 0.005319149
## 14           falls        4   14 0.005319149
## 15          sodass        3   15 0.003989362
## 16         sondern        3   16 0.003989362
## 17         nachdem        3   17 0.003989362
## 18              wo        2   18 0.002659574
## 19            wenn        2   19 0.002659574
## 20           damit        2   20 0.002659574
## 21           genau        1   21 0.001329787
## 22              um        1   22 0.001329787
## 23 beziehungsweise        1   23 0.001329787
## 24             nur        1   24 0.001329787
## 25          sobald        1   25 0.001329787
## 26   währenddessen        1   26 0.001329787
## 27          jedoch        1   27 0.001329787
## 28           indem        1   28 0.001329787
## 29         während        1   29 0.001329787
## 30            denn        1   30 0.001329787
## 31        entweder        1   31 0.001329787
## 32  beziehungweise        1   32 0.001329787
## 
## $English.written.formal
##   lemma n_lemmas rank      prop
## 1   and      112    1 0.9032258
## 2   but        6    2 0.0483871
## 3    or        6    3 0.0483871
## 
## $German.written.formal
##              lemma n_lemmas rank        prop
## 1              und      295    1 0.776315789
## 2              als       31    2 0.081578947
## 3             dass       10    3 0.026315789
## 4               da        8    4 0.021052632
## 5           sodass        7    5 0.018421053
## 6             oder        4    6 0.010526316
## 7           jedoch        3    7 0.007894737
## 8             aber        2    8 0.005263158
## 9             also        2    9 0.005263158
## 10              ob        2   10 0.005263158
## 11           sowie        2   11 0.005263158
## 12            denn        2   12 0.005263158
## 13             wie        1   13 0.002631579
## 14 beziehungsweise        1   14 0.002631579
## 15             die        1   15 0.002631579
## 16            bzw.        1   16 0.002631579
## 17          sobald        1   17 0.002631579
## 18         sondern        1   18 0.002631579
## 19           weder        1   19 0.002631579
## 20           falls        1   20 0.002631579
## 21           indem        1   21 0.002631579
## 22            doch        1   22 0.002631579
## 23         während        1   23 0.002631579
## 24    Währendessen        1   24 0.002631579
## 
## $English.spoken.informal
##     lemma n_lemmas rank        prop
## 1     and      236    1 0.851985560
## 2     but       29    2 0.104693141
## 3      or       11    3 0.039711191
## 4 because        1    4 0.003610108
## 
## $German.spoken.informal
##              lemma n_lemmas rank        prop
## 1              und      499    1 0.715925395
## 2             aber       52    2 0.074605452
## 3             also       37    3 0.053084648
## 4             oder       30    4 0.043041607
## 5             weil       25    5 0.035868006
## 6             dass       14    6 0.020086083
## 7              als       11    7 0.015781923
## 8               ob        4    8 0.005738881
## 9          sondern        4    9 0.005738881
## 10            wenn        4   10 0.005738881
## 11             wie        2   11 0.002869440
## 12              ja        2   12 0.002869440
## 13           damit        2   13 0.002869440
## 14            Hund        1   14 0.001434720
## 15              da        1   15 0.001434720
## 16 beziehungsweise        1   16 0.001434720
## 17          soweit        1   17 0.001434720
## 18          sobald        1   18 0.001434720
## 19          sodass        1   19 0.001434720
## 20           weder        1   20 0.001434720
## 21             bis        1   21 0.001434720
## 22           bevor        1   22 0.001434720
## 23         nachdem        1   23 0.001434720
## 24          obwohl        1   24 0.001434720
## 
## $English.written.informal
##   lemma n_lemmas rank      prop
## 1   and       82    1 0.8723404
## 2   but       11    2 0.1170213
## 3    or        1    3 0.0106383
## 
## $German.written.informal
##      lemma n_lemmas rank        prop
## 1      und      165    1 0.708154506
## 2     aber       23    2 0.098712446
## 3      als       12    3 0.051502146
## 4     weil        7    4 0.030042918
## 5     also        5    5 0.021459227
## 6     dass        3    6 0.012875536
## 7     oder        3    7 0.012875536
## 8   sodass        2    8 0.008583691
## 9     wenn        2    9 0.008583691
## 10    doch        2   10 0.008583691
## 11 nachdem        2   11 0.008583691
## 12    Also        2   12 0.008583691
## 13    Auto        1   13 0.004291845
## 14     wie        1   14 0.004291845
## 15     Und        1   15 0.004291845
## 16    denn        1   16 0.004291845
## 17     Als        1   17 0.004291845

3 Lexicon

Two other measures that have been shown to vary across different types of text are lexical diversity and lexical density.

Lexical diversity

  • a measure of lexical richness
  • can be calculated as the type-token ratio (TTR): the ratio between the number of unique word stems (types) and the total number of words (tokens)

Lexical density

  • a measure of complexity and of how dense the information in a text is
  • can be calculated as the ratio between the number of lexical words and the number of functional / grammatical words
  • or as the ratio between lexical words and the total number of words

Instead of relying on a single measure, we can compare various ratios.
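
As a toy illustration (a hypothetical mini sentence, not taken from the corpus), the two measures can be computed like this:

tokens <- c("the", "dog", "chase", "the", "ball", "and", "the", "dog", "bark")
pos    <- c("DET", "NOUN", "VERB", "DET", "NOUN", "CCONJ", "DET", "NOUN", "VERB")

# lexical diversity: 6 types out of 9 tokens
length(unique(tokens)) / length(tokens)
## expected: 0.67 (rounded)

# lexical density: 5 lexical words (NOUN, VERB, ADJ, ADV) out of 9 tokens
sum(pos %in% c("NOUN", "VERB", "ADJ", "ADV")) / length(tokens)
## expected: 0.56 (rounded)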

The Part of Speech (POS) tags used can be found here: https://universaldependencies.org/u/pos/

The most relevant lexical POSs are:

  • NOUN: noun
  • VERB: verb
  • ADJ: adjective
  • ADV: adverb

The most relevant grammatical POSs are:

  • ADP: adposition
  • PRON: pronoun
  • AUX: auxiliary
  • CCONJ: conjunction
  • DET: determiner

Another relevant POS is:

  • INTJ: interjection

3.1 Lexical POS ratios

The following code chunk calculates the ratio between lexical POS types and the total number of words for all text files separately.

POSlex_overview <- lapply(all_texts, function(x){
  n_NOUN <- nrow(x[x$POS=="NOUN", ]) / nrow(x)
  n_VERB <- nrow(x[x$POS=="VERB", ]) / nrow(x)
  n_ADJ <- nrow(x[x$POS=="ADJ", ]) / nrow(x)
  n_ADV <- nrow(x[x$POS=="ADV", ]) / nrow(x)
  file <- as.character(x$file[1])
  d <- cbind(
    n_NOUN,
    n_VERB,
    n_ADJ,
    n_ADV,
    file)
  d
})

POSlex_overview <- as.data.frame(do.call(rbind, POSlex_overview))
POSlex_overview[, 1:4] <- lapply(POSlex_overview[, 1:4], as.character)
POSlex_overview[, 1:4] <- lapply(POSlex_overview[, 1:4], as.numeric)

To summarize across the 8 varieties, we add columns for language, register, and modality.

POSlex_overview$language <- ifelse(str_detect(POSlex_overview$file, "DE"),
                                  "German",
                                  "English")

POSlex_overview$register <- ifelse(str_detect(POSlex_overview$file, "f.{2}\\.txt"),
                                  "formal",
                                  "informal")

POSlex_overview$modality <- ifelse(str_detect(POSlex_overview$file, "s.\\.txt"),
                                  "spoken",
                                  "written")

3.2 Summary

The code below summarizes the POS ratios for lexical parts of speech across the 8 varieties:

summary_POSlex <- POSlex_overview %>%
  group_by(language, modality, register) %>%
  summarize(
    prop_NOUN = round(mean(n_NOUN), 2),
    prop_VERB = round(mean(n_VERB), 2),
    prop_ADJ = round(mean(n_ADJ), 2),
    prop_ADV = round(mean(n_ADV), 2)
  ) %>%
  as.data.frame()

summary_POSlex

3.3 Grammatical POS ratios

The following code chunk calculates the ratio between grammatical and other POS types and the total number of words for all texts. Again, we then add language, register, and modality columns in order to average across varieties.

POSgram_overview <- lapply(all_texts, function(x){
  n_PRON <- nrow(x[x$POS=="PRON", ]) / nrow(x)
  n_DET <- nrow(x[x$POS=="DET", ]) / nrow(x)
  n_ADP <- nrow(x[x$POS=="ADP", ]) / nrow(x)
  n_AUX <- nrow(x[x$POS=="AUX", ]) / nrow(x)
  n_CCONJ <- nrow(x[x$POS=="CCONJ", ]) / nrow(x)
  n_INTJ <- nrow(x[x$POS=="INTJ", ]) / nrow(x)
  file <- as.character(x$file[1])
  d <- cbind(
    n_PRON,
    n_DET,
    n_ADP,
    n_AUX,
    n_CCONJ,
    n_INTJ,
    file)
  d
})

POSgram_overview <- as.data.frame(do.call(rbind, POSgram_overview))
POSgram_overview[, 1:6] <- lapply(POSgram_overview[, 1:6], as.character)
POSgram_overview[, 1:6] <- lapply(POSgram_overview[, 1:6], as.numeric)

POSgram_overview$language <- ifelse(str_detect(POSgram_overview$file, "DE"),
                                   "German",
                                   "English")

POSgram_overview$register <- ifelse(str_detect(POSgram_overview$file, "f.{2}\\.txt"),
                                   "formal",
                                   "informal")

POSgram_overview$modality <- ifelse(str_detect(POSgram_overview$file, "s.\\.txt"),
                                   "spoken",
                                   "written")

3.4 Summary

The code below summarizes the grammatical POS ratios across the 8 varieties:

summary_POSgram <- POSgram_overview %>%
  group_by(language, modality, register) %>%
  summarize(
    prop_PRON = round(mean(n_PRON), 2),
    prop_DET = round(mean(n_DET), 2),
    prop_ADP = round(mean(n_ADP), 2),
    prop_AUX = round(mean(n_AUX), 2),
    prop_CCONJ = round(mean(n_CCONJ), 2)
  ) %>%
  as.data.frame()

summary_POSgram

The next table shows the ratio between interjections and number of words per text across the 8 varieties.

summary_intj <- POSgram_overview %>%
  group_by(language, modality, register) %>%
  summarize(prop_INTJ = round(mean(n_INTJ), 2)) %>%
  as.data.frame()

summary_intj

3.5 Most frequent words

Another way of comparing the proportion of lexical and other words across texts is the following: for each text, we can rank the lemmas (word stems) according to their frequency. For our purposes, we want to compare the most frequent words in the 8 different varieties rather than in individual texts.

Comparing the 10, 20, … most frequent words across the 4 modality-register combinations in both English and German can also reveal differences across registers and modalities.

To do so, we first need to create a list of 8 data frames, each containing a single text variety. This is what the next code chunk does.

all_varieties <- with(all_files, split(all_files, list(all_files$language, all_files$modality, all_files$register)))

The next step is to count the lemmas, find the 30 most frequent lemmas for each variety, and rank them:

lemma_count_list <- lapply(all_varieties, function(x) {
  x %>% 
    group_by(lemma) %>%
    summarize(counts = n()) %>%
    as.data.frame 
})

lemma_count_list <- lapply(lemma_count_list, function(x) {
  helper <- x[order(-x$counts), ]
  helper2 <- head(helper, 30)
  helper2$rank <- c(1:30)
  helper2
})

3.6 Summary

The next code chunk shows the 30 most frequent lemmas for each of the 8 varieties.

lemma_count_list
## $English.spoken.formal
##       lemma counts rank
## 96      the    436    1
## 25      and    201    2
## 32       be    188    3
## 20        a    158    4
## 37      car    123    5
## 101      to    104    6
## 73       of     91    7
## 18     ball     72    8
## 3        in     51    9
## 43      dog     51   10
## 9       man     45   11
## 109   woman     45   12
## 74       on     42   13
## 92   street     41   14
## 57     have     40   15
## 63       it     38   16
## 98     they     38   17
## 107    with     38   18
## 193    then     38   19
## 59        i     33   20
## 97    there     31   21
## 114      he     31   22
## 55  grocery     30   23
## 68      lot     30   24
## 78  parking     30   25
## 103     two     30   26
## 105    walk     30   27
## 152       I     30   28
## 14      her     29   29
## 95     that     29   30
## 
## $German.spoken.formal
##         lemma counts rank
## 14          d   1241    1
## 3          äh    698    2
## 47        und    566    3
## 42       sein    508    4
## 18        ein    375    5
## 27      haben    336    6
## 8         auf    227    7
## 33        ich    227    8
## 10       Auto    212    9
## 11       Ball    176   10
## 16       dann    171   11
## 37        mit    140   12
## 34         in    127   13
## 32       Hund    120   14
## 68       Frau    118   15
## 43     Straße    108   16
## 36       Mann    103   17
## 66         es     95   18
## 73        ihr     92   19
## 122      auch     91   20
## 94         zu     90   21
## 64         er     88   22
## 91        von     80   23
## 117        ja     78   24
## 5          an     69   25
## 175      also     64   26
## 277 Parkplatz     64   27
## 131     diese     63   28
## 29       halt     57   29
## 110        so     54   30
## 
## $English.written.formal
##        lemma counts rank
## 85       the    452    1
## 30        be    140    2
## 21         a    135    3
## 26       and    112    4
## 35       car    101    5
## 89        to     92    6
## 64        of     84    7
## 19      ball     72    8
## 39       dog     51    9
## 12       man     50   10
## 5         in     45   11
## 97     woman     43   12
## 48   grocery     35   13
## 82    street     35   14
## 60       lot     34   15
## 95      with     34   16
## 16       her     33   17
## 65        on     33   18
## 55        it     31   19
## 27        as     30   20
## 54      into     28   21
## 93      walk     28   22
## 167   driver     28   23
## 32      blue     27   24
## 69   parking     26   25
## 44     first     25   26
## 23  accident     23   27
## 74       she     23   28
## 103       he     23   29
## 91       two     22   30
## 
## $German.written.formal
##           lemma counts rank
## 13            d   1274    1
## 17          ein    310    2
## 43          und    295    3
## 38         sein    241    4
## 7           auf    223    5
## 9          Auto    210    6
## 10         Ball    195    7
## 33          mit    184    8
## 24        haben    151    9
## 30           in    151   10
## 28         Hund    125   11
## 75          ihr    122   12
## 32         Mann    121   13
## 69         Frau    115   14
## 39       Straße    111   15
## 101          zu    101   16
## 29          ich     97   17
## 65           er     83   18
## 238   Parkplatz     76   19
## 12         blau     75   20
## 4            an     65   21
## 172     Einkauf     61   22
## 103         aus     60   23
## 174      fahren     56   24
## 3           als     55   25
## 182      rollen     52   26
## 142 Kinderwagen     49   27
## 26         Hand     48   28
## 127     bremsen     45   29
## 141      hinter     45   30
## 
## $English.spoken.informal
##      lemma counts rank
## 90     the    360    1
## 32     and    236    2
## 36      be    180    3
## 27       a    119    4
## 41     car     99    5
## 71      of     90    6
## 94      to     78    7
## 63      it     72    8
## 6       so     60    9
## 23    ball     55   10
## 45     dog     49   11
## 107     he     49   12
## 152      I     49   13
## 3       in     45   14
## 91   there     44   15
## 73     one     43   16
## 154   like     43   17
## 210   then     43   18
## 92    they     39   19
## 72      on     38   20
## 80     she     38   21
## 59       i     36   22
## 18     her     33   23
## 89    that     33   24
## 190    guy     33   25
## 87  street     32   26
## 133    not     32   27
## 22    just     29   28
## 40     but     29   29
## 100   with     29   30
## 
## $German.spoken.informal
##      lemma counts rank
## 14       d    912    1
## 41     und    499    2
## 37    sein    493    3
## 3       äh    362    4
## 18     ein    357    5
## 23   haben    314    6
## 29     ich    229    7
## 16    dann    216    8
## 108     so    195    9
## 8      auf    185   10
## 11    Ball    169   11
## 10    Auto    153   12
## 28    Hund    112   13
## 33     mit    109   14
## 55      da    104   15
## 38  Straße    100   16
## 25    halt     98   17
## 59      er     97   18
## 64  gerade     92   19
## 30      in     90   20
## 118     ja     86   21
## 123   auch     81   22
## 63    Frau     76   23
## 86      zu     75   24
## 32    Mann     72   25
## 87    aber     70   26
## 163   also     66   27
## 74   nicht     65   28
## 61      es     60   29
## 67     ihr     59   30
## 
## $English.written.informal
##       lemma counts rank
## 83      the    196    1
## 31       be     97    2
## 23        a     96    3
## 27      and     84    4
## 36      car     67    5
## 19     ball     45    6
## 87       to     38    7
## 56       it     34    8
## 40      dog     33    9
## 65       of     31   10
## 3        in     27   11
## 175     see     25   12
## 99       he     22   13
## 93     with     21   14
## 80   street     20   15
## 53        i     19   16
## 168     guy     19   17
## 66       on     18   18
## 106     run     18   19
## 67      one     17   20
## 85     they     17   21
## 91     walk     17   22
## 6        so     16   23
## 18     just     16   24
## 46    first     16   25
## 50  grocery     16   26
## 101     his     16   27
## 55     into     15   28
## 61      lot     15   29
## 139       I     15   30
## 
## $German.written.informal
##         lemma counts rank
## 12          d    524    1
## 15        ein    261    2
## 34       sein    248    3
## 38        und    167    4
## 20      haben    143    5
## 6         auf    122    6
## 8        Auto    115    7
## 9        Ball    109    8
## 25       Hund     86    9
## 26        ich     82   10
## 30        mit     69   11
## 35     Straße     68   12
## 13       dann     66   13
## 63     gerade     64   14
## 58         er     61   15
## 62       Frau     61   16
## 27         in     49   17
## 99         so     49   18
## 29       Mann     48   19
## 82         zu     43   20
## 65        ihr     37   21
## 54         da     36   22
## 61     fallen     33   23
## 83       aber     31   24
## 152    fahren     31   25
## 39     Unfall     30   26
## 44     wollen     30   27
## 214 Parkplatz     30   28
## 28     kommen     28   29
## 96  passieren     28   30

3.7 Lexical diversity (TTR)

The measures established above relate more to lexical density, since they are based on the ratios of certain POS types.

As mentioned before, lexical diversity is another useful measure to compare texts. Lexical diversity can be measured simply by dividing the number of unique lemmas (types) by the number of words (tokens), giving us a ratio of different lemmas per word, i.e. a measure of how diverse the vocabulary of a text is.

However, this measure can only be used to compare texts of similar length, since the number of new types does not grow linearly with the number of tokens: longer texts tend to have lower type-token ratios.

Is this a problem for this study? The texts differ in length, but they all describe the same event after all…

We can assess lexical diversity comparing two measures:

  • ttr_all: type-token-ratio for the entire texts (may be biased by text lengths and not comparable in the strict sense)
  • ttr_100: type-token-ratio for the first 100 words of each text

The following code chunk calculates those two ttr measures for each text file.

ttr_overview_all <- lapply(all_texts, function(x) {
  ttr_helper <- length(unique(x$lemma)) / nrow(x)
}) 

ttr_overview_100 <- lapply(all_texts, function(x) {
  h <- head(x, 100)
  ttr_helper <- length(unique(h$lemma)) / nrow(h)
})

The code below formats the output file and adds language, modality, and register information.

ttr_overview_all <- as.data.frame(do.call(rbind, ttr_overview_all))
ttr_overview_all$file <- rownames(ttr_overview_all)

ttr_overview_100 <- as.data.frame(do.call(rbind, ttr_overview_100))
ttr_overview_100$file <- rownames(ttr_overview_100)

ttr_overview_all$language <- ifelse(str_detect(ttr_overview_all$file, "DE"),
                                   "German",
                                   "English")

ttr_overview_all$register <- ifelse(str_detect(ttr_overview_all$file, "f.{2}\\.txt"),
                                   "formal",
                               "informal")

ttr_overview_all$modality <- ifelse(str_detect(ttr_overview_all$file, "s.\\.txt"),
                                   "spoken",
                               "written")

ttr_overview_100$language <- ifelse(str_detect(ttr_overview_100$file, "DE"),
                                   "German",
                                   "English")

ttr_overview_100$register <- ifelse(str_detect(ttr_overview_100$file, "f.{2}\\.txt"),
                                   "formal",
                                   "informal")

ttr_overview_100$modality <- ifelse(str_detect(ttr_overview_100$file, "s.\\.txt"),
                                   "spoken",
                                   "written")

The next code chunk summarizes the two ttr measures across the 8 varieties.

ttr_summary_all <- ttr_overview_all %>%
  group_by(language, modality, register) %>%
  summarize(ttr = round(mean(V1), 2))

ttr_summary_100 <- ttr_overview_100 %>%
  group_by(language, modality, register) %>%
  summarize(ttr = round(mean(V1), 2))

3.8 Summary

The two tables below summarize the ttr measures across the 8 varieties. While ttr_summary_all considers the entire texts and is less suited for comparison across texts of different lengths, ttr_summary_100 only takes the first 100 words of each text and is thus a more comparable measure.

ttr_summary_all
ttr_summary_100

3.9 1st, 2nd, 3rd person pronouns

Another relevant measure is the proportion of 3rd person pronouns vs. 1st and 2nd person pronouns.

The underlying assumption is that in written and formal texts the addressee is more distant and both the speaker and the addressee are less involved. Thus, we would expect a higher proportion of 1st and 2nd person pronouns in the spoken and informal varieties compared with the written and formal ones.

The code chunk below calculates three ratios:

  • 3rd person pronouns per total number of pronouns
  • 2nd person pronouns per total number of pronouns
  • 1st person pronouns per total number of pronouns
pro_overview <- lapply(all_texts, function(x){
  n_3 <- length(str_subset(x$morphosyntax, "Person=3")) / length(str_subset(x$morphosyntax, "Person="))
  n_2 <- length(str_subset(x$morphosyntax, "Person=2")) / length(str_subset(x$morphosyntax, "Person="))
  n_1 <- length(str_subset(x$morphosyntax, "Person=1")) / length(str_subset(x$morphosyntax, "Person="))  
  file <- as.character(x$file[1])
  d <- cbind(n_3,
            n_2,
            n_1,
            file)
  d
})

pro_overview <- as.data.frame(do.call(rbind, pro_overview))
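
To see what the Person= patterns match, here is a quick illustration with hypothetical UD-style feature strings (not taken from the corpus):

feats <- c("Case=Nom|Number=Sing|Person=3|PronType=Prs",
           "Number=Sing|Person=1|PronType=Prs",
           "Definite=Def|PronType=Art")
str_subset(feats, "Person=")   # keeps the first two strings
str_subset(feats, "Person=3")  # keeps only the first string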

The next code chunk formats those person ratios, adding language, modality, and register columns.

pro_overview[, 1:3] <- lapply(pro_overview[, 1:3], as.character)
pro_overview[, 1:3] <- lapply(pro_overview[, 1:3], as.numeric)

pro_overview$language <- ifelse(str_detect(pro_overview$file, "DE"),
                               "German",
                               "English")

pro_overview$register <- ifelse(str_detect(pro_overview$file, "f.{2}\\.txt"),
                               "formal",
                               "informal")

pro_overview$modality <- ifelse(str_detect(pro_overview$file, "s.\\.txt"),
                               "spoken",
                               "written")

3.10 Summary

The table below shows the ratios of 3rd vs. 1st and 2nd person pronouns across the 8 varieties.

summary_pronoun <- pro_overview %>%
  group_by(language, modality, register) %>%
  summarize(prop_3 = round(mean(n_3, na.rm=TRUE), 2),
            prop_1_2 = round(mean(n_2 + n_1, na.rm=TRUE), 2),
            ) %>%
  as.data.frame()

summary_pronoun

4 Individual differences

Another interesting perspective concerns not only the differences between the register-modality varieties but includes the speakers:

  • How different are the 4 texts of single speakers?
  • If we compared all texts, would the texts cluster together according to speakers, register, or modality?
  • In other words: are individual differences or differences across register and modality more important?

4.1 Prepare the data

The next code chunk merges the various overview data frames that contain the syntactic and lexical measures for each file. We also add a column combining register and modality information, i.e. a variable with the following four values:

  • formal.written
  • formal.spoken
  • informal.written
  • informal.spoken
## syntax
str(clause_counts)
## 'data.frame':    313 obs. of  10 variables:
##  $ n_sent  : num  9 12 22 10 18 17 17 12 45 13 ...
##  $ n_cc    : num  10 9 7 6 12 9 11 6 41 7 ...
##  $ n_advcl : num  0 1 1 1 1 2 1 0 4 1 ...
##  $ n_csubj : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ n_comp  : num  1 2 1 3 1 0 1 0 7 3 ...
##  $ n_rel   : num  0 7 0 3 5 7 6 4 23 14 ...
##  $ file    : Factor w/ 313 levels "DEbi01FT_fsD.txt",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ register: chr  "formal" "formal" "informal" "informal" ...
##  $ modality: chr  "spoken" "written" "spoken" "written" ...
##  $ language: chr  "German" "German" "German" "German" ...
str(length_sent)
## 'data.frame':    313 obs. of  5 variables:
##  $ sent_length: num  13.56 12.25 6.18 8.1 11.17 ...
##  $ file       : chr  "DEbi01FT_fsD.txt" "DEbi01FT_fwD.txt" "DEbi01FT_isD.txt" "DEbi01FT_iwD.txt" ...
##  $ register   : chr  "formal" "formal" "informal" "informal" ...
##  $ modality   : chr  "spoken" "written" "spoken" "written" ...
##  $ language   : chr  "German" "German" "German" "German" ...
str(sent_init_summary)
## 'data.frame':    313 obs. of  5 variables:
##  $ file           : chr  "DEbi01FT_fsD.txt" "DEbi01FT_fwD.txt" "DEbi01FT_isD.txt" "DEbi01FT_iwD.txt" ...
##  $ sent_init_cconj: num  6 5 6 5 9 8 8 3 27 1 ...
##  $ register       : chr  "formal" "formal" "informal" "informal" ...
##  $ modality       : chr  "spoken" "written" "spoken" "written" ...
##  $ language       : chr  "German" "German" "German" "German" ...
## lexicon
str(POSlex_overview)
## 'data.frame':    313 obs. of  8 variables:
##  $ n_NOUN  : num  0.164 0.211 0.14 0.198 0.179 ...
##  $ n_VERB  : num  0.082 0.102 0.0735 0.1358 0.0846 ...
##  $ n_ADJ   : num  0.0246 0.0748 0.0515 0.0123 0.0746 ...
##  $ n_ADV   : num  0.0902 0.0476 0.2353 0.1235 0.0647 ...
##  $ file    : Factor w/ 313 levels "DEbi01FT_fsD.txt",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ language: chr  "German" "German" "German" "German" ...
##  $ register: chr  "formal" "formal" "informal" "informal" ...
##  $ modality: chr  "spoken" "written" "spoken" "written" ...
str(POSgram_overview)
## 'data.frame':    313 obs. of  10 variables:
##  $ n_PRON  : num  0.041 0.0748 0.0956 0.0494 0.0896 ...
##  $ n_DET   : num  0.18 0.17 0.118 0.198 0.169 ...
##  $ n_ADP   : num  0.1066 0.1361 0.0662 0.0741 0.0995 ...
##  $ n_AUX   : num  0.1066 0.0816 0.1029 0.0988 0.0796 ...
##  $ n_CCONJ : num  0.082 0.0612 0.0515 0.0741 0.0597 ...
##  $ n_INTJ  : num  0.0738 0.0068 0.0515 0.0123 0.0846 ...
##  $ file    : Factor w/ 313 levels "DEbi01FT_fsD.txt",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ language: chr  "German" "German" "German" "German" ...
##  $ register: chr  "formal" "formal" "informal" "informal" ...
##  $ modality: chr  "spoken" "written" "spoken" "written" ...
str(pro_overview)
## 'data.frame':    313 obs. of  7 variables:
##  $ n_3     : num  0.737 0.826 0.68 1 0.789 ...
##  $ n_2     : num  0 0.087 0 0 0 ...
##  $ n_1     : num  0.263 0.087 0.32 0 0.211 ...
##  $ file    : Factor w/ 313 levels "DEbi01FT_fsD.txt",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ language: chr  "German" "German" "German" "German" ...
##  $ register: chr  "formal" "formal" "informal" "informal" ...
##  $ modality: chr  "spoken" "written" "spoken" "written" ...
str(ttr_overview_100)
## 'data.frame':    313 obs. of  5 variables:
##  $ V1      : num  0.45 0.51 0.56 0.543 0.56 ...
##  $ file    : chr  "DEbi01FT_fsD.txt" "DEbi01FT_fwD.txt" "DEbi01FT_isD.txt" "DEbi01FT_iwD.txt" ...
##  $ language: chr  "German" "German" "German" "German" ...
##  $ register: chr  "formal" "formal" "informal" "informal" ...
##  $ modality: chr  "spoken" "written" "spoken" "written" ...
## merge all overview dfs
all_overview <- merge(clause_counts, sent_init_summary)
all_overview <- merge(all_overview, length_sent)
all_overview <- merge(all_overview, POSlex_overview)
all_overview <- merge(all_overview, POSgram_overview)
all_overview <- merge(all_overview, pro_overview)
all_overview <- merge(all_overview, ttr_overview_100)
str(all_overview)
## 'data.frame':    313 obs. of  26 variables:
##  $ file           : Factor w/ 313 levels "DEbi01FT_fsD.txt",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ register       : chr  "formal" "formal" "informal" "informal" ...
##  $ modality       : chr  "spoken" "written" "spoken" "written" ...
##  $ language       : chr  "German" "German" "German" "German" ...
##  $ n_sent         : num  9 12 22 10 18 17 17 12 45 13 ...
##  $ n_cc           : num  10 9 7 6 12 9 11 6 41 7 ...
##  $ n_advcl        : num  0 1 1 1 1 2 1 0 4 1 ...
##  $ n_csubj        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ n_comp         : num  1 2 1 3 1 0 1 0 7 3 ...
##  $ n_rel          : num  0 7 0 3 5 7 6 4 23 14 ...
##  $ sent_init_cconj: num  6 5 6 5 9 8 8 3 27 1 ...
##  $ sent_length    : num  13.56 12.25 6.18 8.1 11.17 ...
##  $ n_NOUN         : num  0.164 0.211 0.14 0.198 0.179 ...
##  $ n_VERB         : num  0.082 0.102 0.0735 0.1358 0.0846 ...
##  $ n_ADJ          : num  0.0246 0.0748 0.0515 0.0123 0.0746 ...
##  $ n_ADV          : num  0.0902 0.0476 0.2353 0.1235 0.0647 ...
##  $ n_PRON         : num  0.041 0.0748 0.0956 0.0494 0.0896 ...
##  $ n_DET          : num  0.18 0.17 0.118 0.198 0.169 ...
##  $ n_ADP          : num  0.1066 0.1361 0.0662 0.0741 0.0995 ...
##  $ n_AUX          : num  0.1066 0.0816 0.1029 0.0988 0.0796 ...
##  $ n_CCONJ        : num  0.082 0.0612 0.0515 0.0741 0.0597 ...
##  $ n_INTJ         : num  0.0738 0.0068 0.0515 0.0123 0.0846 ...
##  $ n_3            : num  0.737 0.826 0.68 1 0.789 ...
##  $ n_2            : num  0 0.087 0 0 0 ...
##  $ n_1            : num  0.263 0.087 0.32 0 0.211 ...
##  $ V1             : num  0.45 0.51 0.56 0.543 0.56 ...
## add the variety column
all_overview$type <- paste(all_overview$register, all_overview$modality, sep = ".")
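
To verify that the new column works as intended, we can cross-tabulate it with the language column; each language should contain texts of all four register-modality varieties (a minimal check, not part of the analysis itself).

# cross-tabulate variety by language
table(all_overview$type, all_overview$language)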

Then, it is useful to split the data set into an English and a German subset, since we can only compare the measures for files of the same language.

overview_en <- all_overview[all_overview$language=="English",]
overview_de <- all_overview[all_overview$language=="German",]
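
A quick check (a minimal sketch): the two subsets should together cover all texts in the combined data frame.

# the English and German subsets should add up to the full data set
nrow(overview_en) + nrow(overview_de) == nrow(all_overview)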

4.2 Individual differences in syntax

We are now going to examine how similar the text files are in terms of the syntactic properties analyzed above. The measures are listed again here:

  • n_sent: number of sentences per text
  • sent_length: average number of words per sentence
  • n_cc: number of clause-combining conjunctions
  • n_advcl: number of adverbial clauses
  • n_comp: number of complement clauses
  • n_rel: number of relative clauses
  • n_csubj: number of clausal subjects
  • sent_init_cconj: number of sentence-initial conjunctions

First, we need to extract the columns containing the syntactic information for both languages.

In addition, we need to standardize the values of the different syntactic measures, because the raw numbers vary considerably from one measure to another.

This means that for each column (i.e. each single syntactic measure), all values are converted to z-scores: centered around 0 and scaled to a standard deviation of 1. While we can no longer interpret the values as raw counts, we can now compare them quantitatively across the different syntactic measures.

syntax_en <- overview_en[, 5:12]
syntax_en <- apply(syntax_en, 2, scale)
head(syntax_en)
##          n_sent        n_cc    n_advcl   n_csubj     n_comp       n_rel
## [1,]  0.3936443  0.15619438 -0.5370272 -0.352854  0.4047278  0.01067641
## [2,] -1.0341980 -0.23930699 -0.8497265  2.352360  0.1406501  1.32031621
## [3,] -0.2553749  0.15619438  0.7137703 -0.352854  0.4047278  0.01067641
## [4,] -1.0341980 -0.50297457 -0.8497265 -0.352854 -0.1234276 -0.64414349
## [5,]  0.2638404  0.02436059 -0.2243278  2.352360 -0.3875053 -0.31673354
## [6,] -0.7745903 -0.37114078 -0.8497265 -0.352854 -0.1234276  0.33808636
##      sent_init_cconj sent_length
## [1,]     -0.10697305   0.4596646
## [2,]     -0.46484653   4.4294174
## [3,]      0.07196369   0.2471800
## [4,]     -0.64378328   0.5116053
## [5,]      0.25090043  -0.5568889
## [6,]     -0.64378328   2.0125853
syntax_de <- overview_de[, 5:12]
syntax_de <- apply(syntax_de, 2, scale)
head(syntax_de)
##          n_sent        n_cc    n_advcl     n_csubj      n_comp      n_rel
## [1,] -1.0295501  0.09276751 -0.8505749 -0.09534724 -0.09024014 -0.9160959
## [2,] -0.5980762 -0.04575695  0.3467312 -0.09534724  0.59745194  0.7590818
## [3,]  0.8401701 -0.32280586  0.3467312 -0.09534724 -0.09024014 -0.9160959
## [4,] -0.8857254 -0.46133032  0.3467312 -0.09534724  1.28514402 -0.1981626
## [5,]  0.2648716  0.36981642  0.3467312 -0.09534724 -0.09024014  0.2804596
## [6,]  0.1210470 -0.04575695  1.5440372 -0.09534724 -0.77793222  0.7590818
##      sent_init_cconj sent_length
## [1,]    -0.007134418  1.52024729
## [2,]    -0.204222707  1.03115996
## [3,]    -0.007134418 -1.24210279
## [4,]    -0.204222707 -0.52351337
## [5,]     0.584130451  0.62532154
## [6,]     0.387042162 -0.01006349
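
As a quick check (a minimal sketch), we can confirm that scale() has done its job: after scaling, every column should have a mean of (approximately) 0 and a standard deviation of 1.

# every scaled column should have mean ~0 and standard deviation 1
round(colMeans(syntax_en), 10)
apply(syntax_en, 2, sd)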

4.2.1 Distance matrix

In order to quantify the similarity between the individual texts, we can calculate a so-called distance matrix on the basis of all 8 syntactic measures.

The distance matrix gives us a value (i.e. distance) for each pair of texts. This means that we can now say, in quantitative terms, how similar or distant a pair of texts is based on their syntactic properties.

dist_syntax_en <- dist(syntax_en, method = "euclidean")
dist_syntax_de <- dist(syntax_de, method = "euclidean")
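
The dist object itself is hard to read, but it can be converted to a full matrix to look up the distance between any two texts, e.g. the first three English texts (just an illustration of how to inspect the object):

# convert the dist object to a matrix to inspect individual pairs of texts
as.matrix(dist_syntax_en)[1:3, 1:3]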

4.2.2 MDS

Of course, with 313 texts in total, a matrix with distances for all pairs of texts in the 2 languages is not very useful.

In other words, we cannot interpret the distance matrix as such. Instead, we can use it as the basis to visually represent the distances/similarities between single texts in order to see if the texts cluster according to single speakers or modality/register.

There are many different techniques to calculate and represent such clusters.

We are going to use Multi-dimensional scaling (MDS) here.

Roughly speaking, for our purposes, MDS condenses the complex distance information into two dimensions along which the texts we are comparing differ. We can choose the number of dimensions that MDS should output, and a higher number would also be possible. Two dimensions, however, can easily be represented in a coordinate system, like a map, which makes them particularly useful for visual interpretation.

Thus, we end up with a coordinate system with dimension 1 on the x-axis and dimension 2 on the y-axis: dimension 1 captures the largest portion of the variation in the distances between texts, dimension 2 the second largest.

While the method behind MDS is much more complex than is mentioned here, the interpretation of a 2-dimensional MDS plot is very simple:

Texts that appear closer to each other are more similar; texts that are further apart differ more.

The next code chunk computes the MDS for both the English and the German distance matrix.

mds_syntax_en <- cmdscale(dist_syntax_en, k = 2) %>%
  as.data.frame()
colnames(mds_syntax_en) <- c("dim1", "dim2")
mds_syntax_en$type <- overview_en$type
mds_syntax_en$file <- overview_en$file

mds_syntax_de <- cmdscale(dist_syntax_de, k = 2) %>%
  as.data.frame()
colnames(mds_syntax_de) <- c("dim1", "dim2")
mds_syntax_de$type <- overview_de$type
mds_syntax_de$file <- overview_de$file
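
As an optional check (a minimal sketch; the object name mds_eig_en is only illustrative), cmdscale() can also return its eigenvalues; the GOF component then indicates how much of the structure in the distance matrix the two retained dimensions capture.

# rerun the MDS with eig = TRUE to obtain a goodness-of-fit measure
mds_eig_en <- cmdscale(dist_syntax_en, k = 2, eig = TRUE)
mds_eig_en$GOF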

4.3 Plots

Finally, the plots below show the MDS of the texts for English and German separately. In the plots, the modality/register combinations are shown in different colors.

plot_syntax_en <-  ggscatter(mds_syntax_en, x = "dim1", y = "dim2", 
                            label = mds_syntax_en$file,
                            color = "type",
                            repel = FALSE) + 
  theme_gray() +
  xlab("dimemsion 1") +
  ylab("dimemsion 2") +
  theme(legend.position = "none") + 
  ggtitle("English: Similarity of individual texts (syntax)")

plot_syntax_en

plot_syntax_de <-  ggscatter(mds_syntax_de, x = "dim1", y = "dim2", 
                            label = mds_syntax_de$file,
                            color = "type") +
  theme_gray() +
  xlab("dimemsion 1") +
  ylab("dimemsion 2") +
  theme(legend.position = "none") +
  ggtitle("German: Similarity of individual texts (syntax)")
plot_syntax_de
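
If we want to look at both languages at once, the two plots can also be arranged side by side, for instance with ggarrange() from the already loaded ggpubr package (an optional step):

# place the English and German MDS plots next to each other
ggarrange(plot_syntax_en, plot_syntax_de, ncol = 2)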

4.4 Individual differences in the lexicon

We can now do the same for the lexical measures, i.e. calculate a distance matrix and use MDS to visualize the similarity between texts.

The lexical measures that we used are listed again here:

  • n_NOUN: proportion of nouns
  • n_VERB: proportion of verbs
  • n_ADJ: proportion of adjectives
  • n_ADV: proportion of adverbs
  • n_PRON: proportion of pronouns
  • n_DET: proportion of determiners
  • n_ADP: proportion of adpositions
  • n_AUX: proportion of auxiliaries
  • n_CCONJ: proportion of coordinating conjunctions
  • n_INTJ: proportion of interjections
  • ttr_100: type-token ratio for the first 100 words of each text (stored in the column V1)
  • n_3: proportion of 3rd person pronouns out of all pronouns
  • n_2: proportion of 2nd person pronouns out of all pronouns
  • n_1: proportion of 1st person pronouns out of all pronouns

4.4.1 Prepare the data

lexicon_en <- overview_en[, 13:26]
lexicon_en <- apply(lexicon_en, 2, scale)
head(lexicon_en)
##          n_NOUN     n_VERB      n_ADJ       n_ADV     n_PRON      n_DET
## [1,]  0.1111359 -0.4336216  0.6791604 -0.01046832  0.8847536  0.3500977
## [2,]  0.4743459 -0.2982926  1.1331359 -0.88217187  1.2134097  0.4561988
## [3,] -0.3148356  1.1251699 -0.8184441  0.37095112  2.7278652 -0.6085615
## [4,]  0.8580521 -0.2372738 -0.5168242 -0.09721476  0.4655737  0.3535738
## [5,] -0.2352600 -0.9845336  0.9720414 -0.20925299  0.3519474 -0.4656259
## [6,]  0.6683780 -1.6651148  0.6662840 -0.58785260 -0.2996314  0.6428683
##            n_ADP      n_AUX    n_CCONJ       n_INTJ         n_3        n_2
## [1,]  0.86731957 -0.3377273 -0.2581483 -0.267615058  1.05556731 -0.3726356
## [2,]  0.86267010  0.4709188 -0.1889002 -0.834501252  1.44571047 -0.3726356
## [3,] -0.10161617  0.1363206  0.7873688 -0.834501252 -2.99216796  1.2131215
## [4,] -0.02516109  0.1949957  0.5188312 -0.834501252 -1.09022006 -0.3726356
## [5,] -0.26708419  1.4946337  0.3227420 -0.004711316  0.27528099 -0.3726356
## [6,] -0.43341668  2.2242133 -0.4553403 -0.834501252  0.09321418 -0.3726356
##              n_1         V1
## [1,] -1.06243657  0.2399312
## [2,] -1.51964780 -0.4579833
## [3,]  2.93816169 -0.3416642
## [4,]  1.45222519  1.2722631
## [5,] -0.14801411 -0.6906215
## [6,]  0.06535113 -0.5743024
lexicon_de <- overview_de[, 13:26]
lexicon_de <- apply(lexicon_de, 2, scale)
head(lexicon_de)
##           n_NOUN     n_VERB      n_ADJ      n_ADV      n_PRON      n_DET
## [1,] -0.18942507 -0.7020980 -0.6641138 -0.3237806 -0.94253076  0.5162903
## [2,]  0.57874545 -0.1816690  1.0634017 -1.0500005  0.03248249  0.3340076
## [3,] -0.58584002 -0.9208568  0.2601808  2.1535190  0.63046826 -0.5973358
## [4,]  0.36026254  0.6936369 -1.0851456  0.2445118 -0.70057708  0.8219293
## [5,]  0.05877949 -0.6344336  1.0564192 -0.7588364  0.45658892  0.3177725
## [6,]  1.19332947  0.3934428  0.1989353 -1.1206817 -0.33388951  0.9540789
##           n_ADP        n_AUX     n_CCONJ     n_INTJ         n_3        n_2
## [1,]  0.8025942  0.723808047  0.86166554  0.9771003 -0.76308215 -0.6358419
## [2,]  1.6782085  0.058497067  0.07717710 -0.8403808 -0.02233588  2.4917413
## [3,] -0.3961055  0.627281483 -0.29171482  0.3718890 -1.23488055 -0.6358419
## [4,] -0.1616667  0.515819123  0.56314754 -0.6899469  1.42116968 -0.6358419
## [5,]  0.5931711  0.004292961  0.01957745  1.2703881 -0.32623179 -0.6358419
## [6,]  1.5113905 -0.959950962 -0.12417126 -0.8564350  0.42515084 -0.6358419
##             n_1         V1
## [1,]  0.9570048 -1.6901138
## [2,] -0.5813496 -0.9654577
## [3,]  1.4532739 -0.3615776
## [4,] -1.3405375 -0.5643620
## [5,]  0.4974964 -0.3615776
## [6,] -0.2928582 -0.3615776

4.4.2 Distance matrix

dist_lexicon_en <- dist(lexicon_en, method = "euclidean")
dist_lexicon_de <- dist(lexicon_de, method = "euclidean")

4.4.3 MDS

mds_lexicon_en <- cmdscale(dist_lexicon_en, k = 2) %>%
  as.data.frame()
colnames(mds_lexicon_en) <- c("dim1", "dim2")
mds_lexicon_en$type <- overview_en$type
mds_lexicon_en$file <- overview_en$file

mds_lexicon_de <- cmdscale(dist_lexicon_de, k = 2) %>%
  as.data.frame()
colnames(mds_lexicon_de) <- c("dim1", "dim2")
mds_lexicon_de$type <- overview_de$type
mds_lexicon_de$file <- overview_de$file

4.5 Plots

The plots below show the MDS of the texts for English and German separately. This time, we see the distances/similarities between texts according to their lexical properties.

In the plots, the modality/register combinations are shown in different colors.

plot_lexicon_en <-  ggscatter(mds_lexicon_en, x = "dim1", y = "dim2", 
                             label = mds_lexicon_en$file,
                             color = "type",
                            repel = FALSE) + 
  theme_gray() +
  xlab("dimemsion 1") +
  ylab("dimemsion 2") +
  theme(legend.position = "none") + 
  ggtitle("English: Similarity of individual texts (lexicon)")
plot_lexicon_en

plot_lexicon_de <-  ggscatter(mds_lexicon_de, x = "dim1", y = "dim2", 
                             label = mds_lexicon_de$file,
                             color = "type") +
  theme_gray() +
  xlab("dimemsion 1") +
  ylab("dimemsion 2") +
  theme(legend.position = "none") +
  ggtitle("German: Similarity of individual texts (lexicon)")
plot_lexicon_de